Daily arXiv Papers - 2025-11-18

AI-enhanced summaries of 0 research papers from arXiv

Today’s Research Highlights

AI-enhanced summaries of the latest research papers from arXiv.

Table of Contents

cs.CL

[1] TimeStampEval: A Simple LLM Eval and a Little Fuzzy Matching Trick to Improve Search Accuracy

James McCammon

Main category: cs.CL

TL;DR: TimeStampEval is a benchmark for retrieving precise timestamps from long transcripts using non-verbatim quotes. A two-stage method combining RapidFuzz pre-filtering with LLM verification dramatically improves accuracy while reducing costs by over 90%.

DetailsMotivation: Traditional fuzzy matching fails when aligning official written records with speech-to-text transcripts that are semantically identical but syntactically different. The use case is an automated podcast that assembles Congressional Record clips into AI-hosted narration.

Method: Two-stage approach: RapidFuzz pre-filtering followed by LLM verification on short snippets. Evaluated six modern LLMs with optimized prompt design (query before transcript, compact formatting) and reasoning budgets.

Result: Improved fuzzy match accuracy by up to 50 points while halving latency and reducing cost per correct result by up to 96%. With reasoning budgets, accuracy increased from 37% to 77% for weak setups and above 90% for strong ones. Robust across transcript lengths and domains.

Conclusion: The “Assisted Fuzzy” approach effectively solves the timestamp retrieval problem for non-verbatim quotes, with prompt design being more critical than model choice. The method maintains high performance across various transcript characteristics while significantly reducing computational costs.

Abstract: Traditional fuzzy matching often fails when searching for quotes that are semantically identical but syntactically different across documents-a common issue when aligning official written records with speech-to-text transcripts. We introduce TimeStampEval, a benchmark for retrieving precise millisecond timestamps from long transcripts given non-verbatim quotes. Our simple two-stage method dramatically improves retrieval accuracy while cutting inference costs by over 90%. The motivating use case is an automated long-form podcast that assembles Congressional Record clips into AI-hosted narration. The technical challenge: given a sentence-timestamped transcript and a target quote that may differ due to transcription or editorial drift, return exact start and end boundaries. Standard algorithms handle verbatim text but break under fuzzier variants. Evaluating six modern LLMs on a 2,800-sentence (120k-token) transcript revealed four key findings. (1) Prompt design matters more than model choice: placing the query before the transcript and using compact formatting improved accuracy by 3-20 points while reducing token count by 30-40%. (2) Off-by-one errors form a distinct category, showing models understand the task but misplace boundaries. (3) A modest reasoning budget (600-850 tokens) raises accuracy from 37% to 77% for weak setups and to above 90% for strong ones. (4) Our “Assisted Fuzzy” approach-RapidFuzz pre-filtering followed by LLM verification on short snippets-improves fuzzy match accuracy by up to 50 points while halving latency and reducing cost per correct result by up to 96%. Extended tests on ten transcripts (50k-900k tokens, 1989-2025) confirm robustness to transcript length, vocabulary drift, and domain change, maintaining 95-100% rejection accuracy for absent targets.

[2] MiroThinker: Pushing the Performance Boundaries of Open-Source Research Agents via Model, Context, and Interactive Scaling

MiroMind Team, Song Bai, Lidong Bing, Carson Chen, Guanzheng Chen, Yuntao Chen, Zhe Chen, Ziyi Chen, Jifeng Dai, Xuan Dong, Yue Deng, Yunjie Fu, Junqi Ge, Chenxia Han, Tammy Huang, Zhenhang Huang, Jerry Jiao, Shilei Jiang, Tianyu Jiao, Xiaoqi Jian, Lei Lei, Ruilin Li, Ryan Luo, Tiantong Li, Xiang Lin, Ziyuan Liu, Zhiqi Li, Jie Ni, Qiang Ren, Pax Sun, Shiqian Su, Chenxin Tao, Bin Wang, Hellen Wang, Haonan Wang, James Wang, Jin Wang, Jojo Wang, Letian Wang, Shizun Wang, Weizhi Wang, Zixuan Wang, Jinfan Xu, Sen Xing, Chenyu Yang, Hai Ye, Jiaheng Yu, Yue Yu, Muyan Zhong, Tianchen Zhao, Xizhou Zhu, Yanpeng Zhou, Yifan Zhang, Zhi Zhu

Main category: cs.CL

TL;DR: MiroThinker v1.0 introduces interaction scaling as a third dimension for improving research agents, enabling up to 600 tool calls per task and achieving state-of-the-art performance across multiple benchmarks.

DetailsMotivation: To advance tool-augmented reasoning beyond simply scaling model size or context length, by exploring interaction scaling at the model level as a new dimension for performance improvement.

Method: Uses reinforcement learning to train models for deeper and more frequent agent-environment interactions, leveraging environment feedback and external information acquisition to correct errors and refine trajectories.

Result: The 72B variant achieves 81.9% on GAIA, 37.7% on HLE, 47.1% on BrowseComp, and 55.6% on BrowseComp-ZH, surpassing previous open-source agents and approaching commercial counterparts like GPT-5-high.

Conclusion: Interaction scaling represents a third critical dimension for building next-generation research agents, complementing model capacity and context windows, with performance improving predictably as interaction depth increases.

Abstract: We present MiroThinker v1.0, an open-source research agent designed to advance tool-augmented reasoning and information-seeking capabilities. Unlike previous agents that only scale up model size or context length, MiroThinker explores interaction scaling at the model level, systematically training the model to handle deeper and more frequent agent-environment interactions as a third dimension of performance improvement. Unlike LLM test-time scaling, which operates in isolation and risks degradation with longer reasoning chains, interactive scaling leverages environment feedback and external information acquisition to correct errors and refine trajectories. Through reinforcement learning, the model achieves efficient interaction scaling: with a 256K context window, it can perform up to 600 tool calls per task, enabling sustained multi-turn reasoning and complex real-world research workflows. Across four representative benchmarks-GAIA, HLE, BrowseComp, and BrowseComp-ZH-the 72B variant achieves up to 81.9%, 37.7%, 47.1%, and 55.6% accuracy respectively, surpassing previous open-source agents and approaching commercial counterparts such as GPT-5-high. Our analysis reveals that MiroThinker benefits from interactive scaling consistently: research performance improves predictably as the model engages in deeper and more frequent agent-environment interactions, demonstrating that interaction depth exhibits scaling behaviors analogous to model size and context length. These findings establish interaction scaling as a third critical dimension for building next-generation open research agents, complementing model capacity and context windows.

[3] On the Notion that Language Models Reason

Bertram Højer

Main category: cs.CL

TL;DR: The paper argues that language models are not genuine reasoners but statistical pattern matchers that produce reasoning-like outputs through learned statistical regularities rather than explicit logical mechanisms.

DetailsMotivation: To clarify what 'reasoning' means in language models and demonstrate that current definitions are inconsistent with how LMs actually work - trained as statistical pattern matchers rather than logical reasoners.

Method: Analyzes definitions of reasoning in NLP literature and proposes viewing transformer-based LMs as implementing implicit finite-order Markov kernels that map contexts to token distributions, where reasoning-like outputs emerge from statistical regularities.

Result: Shows that LMs produce reasoning-like behavior through statistical pattern matching rather than explicit reasoning mechanisms, explaining why they lack logical consistency guarantees.

Conclusion: The distinction between statistical pattern matching and genuine reasoning is fundamental for evaluating epistemic uncertainty in LMs, and calls for more precise descriptions of computational processes in NLP research.

Abstract: Language models (LMs) are said to be exhibiting reasoning, but what does this entail? We assess definitions of reasoning and how key papers in the field of natural language processing (NLP) use the notion and argue that the definitions provided are not consistent with how LMs are trained, process information, and generate new tokens. To illustrate this incommensurability we assume the view that transformer-based LMs implement an \textit{implicit} finite-order Markov kernel mapping contexts to conditional token distributions. In this view, reasoning-like outputs correspond to statistical regularities and approximate statistical invariances in the learned kernel rather than the implementation of explicit logical mechanisms. This view is illustrative of the claim that LMs are “statistical pattern matchers”" and not genuine reasoners and provides a perspective that clarifies why reasoning-like outputs arise in LMs without any guarantees of logical consistency. This distinction is fundamental to how epistemic uncertainty is evaluated in LMs. We invite a discussion on the importance of how the computational processes of the systems we build and analyze in NLP research are described.

[4] Scaling Open-Weight Large Language Models for Hydropower Regulatory Information Extraction: A Systematic Analysis

Hong-Jun Yoon, Faisal Ashraf, Thomas A. Ruggles, Debjani Singh

Main category: cs.CL

TL;DR: Analysis of 7 open-weight LLMs (0.6B-70B parameters) for information extraction from hydropower regulatory documents reveals a 14B parameter threshold where validation becomes effective, with consumer-deployable models achieving 64% F1 and larger models reaching 77% F1.

DetailsMotivation: To address critical trade-offs between performance and computational resources in information extraction from regulatory documents using large language models, and provide empirical deployment guidance.

Method: Evaluated seven open-weight models ranging from 0.6B to 70B parameters on hydropower licensing documentation, analyzing performance scaling and validation effectiveness.

Result: Identified a 14B parameter threshold where validation transitions from ineffective (F1 < 0.15) to viable (F1 = 0.64). Consumer-deployable models achieve 64% F1, smaller models plateau at 51%, and large-scale models approach 77% F1 but require enterprise infrastructure. Found systematic hallucination patterns in smaller models.

Conclusion: Established the first comprehensive resource-performance mapping for open-weight information extraction in regulatory contexts, enabling evidence-based model selection. Results provide immediate value for hydropower compliance and insights into parameter scaling effects that generalize across information extraction tasks.

Abstract: Information extraction from regulatory documents using large language models presents critical trade-offs between performance and computational resources. We evaluated seven open-weight models (0.6B-70B parameters) on hydropower licensing documentation to provide empirical deployment guidance. Our analysis identified a pronounced 14B parameter threshold where validation methods transition from ineffective (F1 $<$ 0.15) to viable (F1 = 0.64). Consumer-deployable models achieve 64% F1 through appropriate validation, while smaller models plateau at 51%. Large-scale models approach 77% F1 but require enterprise infrastructure. We identified systematic hallucination patterns where perfect recall indicates extraction failure rather than success in smaller models. Our findings establish the first comprehensive resource-performance mapping for open-weight information extraction in regulatory contexts, enabling evidence-based model selection. These results provide immediate value for hydropower compliance while contributing insights into parameter scaling effects that generalize across information extraction tasks.

[5] Towards Autoformalization of LLM-generated Outputs for Requirement Verification

Mihir Gupte, Ramesh S

Main category: cs.CL

TL;DR: This paper explores using LLM-based autoformalization to verify LLM-generated outputs against natural language requirements, demonstrating potential for consistency checks and logical verification.

DetailsMotivation: There's currently no formal method to verify if LLM-generated structured outputs from natural language are accurate, creating a gap in ensuring fidelity and logical consistency.

Method: Used a simple LLM-based autoformalizer to verify LLM-generated outputs against natural language requirements through two experiments: checking logical equivalence of differently-worded requirements and identifying logical inconsistencies.

Result: The autoformalizer successfully identified logical equivalence between differently-worded requirements and detected logical inconsistencies between NL requirements and LLM-generated outputs.

Conclusion: Autoformalization shows significant potential for ensuring fidelity and logical consistency of LLM-generated outputs, laying foundation for future extensive studies in this application.

Abstract: Autoformalization, the process of translating informal statements into formal logic, has gained renewed interest with the emergence of powerful Large Language Models (LLMs). While LLMs show promise in generating structured outputs from natural language (NL), such as Gherkin Scenarios from NL feature requirements, there’s currently no formal method to verify if these outputs are accurate. This paper takes a preliminary step toward addressing this gap by exploring the use of a simple LLM-based autoformalizer to verify LLM-generated outputs against a small set of natural language requirements. We conducted two distinct experiments. In the first one, the autoformalizer successfully identified that two differently-worded NL requirements were logically equivalent, demonstrating the pipeline’s potential for consistency checks. In the second, the autoformalizer was used to identify a logical inconsistency between a given NL requirement and an LLM-generated output, highlighting its utility as a formal verification tool. Our findings, while limited, suggest that autoformalization holds significant potential for ensuring the fidelity and logical consistency of LLM-generated outputs, laying a crucial foundation for future, more extensive studies into this novel application.

[6] Three Stage Narrative Analysis; Plot-Sentiment Breakdown, Structure Learning and Concept Detection

Taimur Khan, Ramoza Ahsan, Mohib Hameed

Main category: cs.CL

TL;DR: A framework for analyzing movie scripts using sentiment arcs and character context, employing dictionary-based sentiment analysis with a custom lexicon and clustering similar sentiment plots.

DetailsMotivation: Automated narrative analysis is needed due to the large volume of narrative data and the limitations of manual approaches in story understanding.

Method: Uses dictionary-based sentiment analysis with a custom lexicon built from NRC-VAD dataset scores, and applies Wards hierarchical clustering to group similar sentiment plots.

Result: Experimental evaluation shows the analysis is helpful for consumers and readers when selecting narratives or stories.

Conclusion: The proposed framework effectively analyzes movie scripts through sentiment arcs and character context, providing valuable insights for narrative selection.

Abstract: Story understanding and analysis have long been challenging areas within Natural Language Understanding. Automated narrative analysis requires deep computational semantic representations along with syntactic processing. Moreover, the large volume of narrative data demands automated semantic analysis and computational learning rather than manual analytical approaches. In this paper, we propose a framework that analyzes the sentiment arcs of movie scripts and performs extended analysis related to the context of the characters involved. The framework enables the extraction of high-level and low-level concepts conveyed through the narrative. Using dictionary-based sentiment analysis, our approach applies a custom lexicon built with the LabMTsimple storylab module. The custom lexicon is based on the Valence, Arousal, and Dominance scores from the NRC-VAD dataset. Furthermore, the framework advances the analysis by clustering similar sentiment plots using Wards hierarchical clustering technique. Experimental evaluation on a movie dataset shows that the resulting analysis is helpful to consumers and readers when selecting a narrative or story.

[7] Identifying Imaging Follow-Up in Radiology Reports: A Comparative Analysis of Traditional ML and LLM Approaches

Namu Park, Giridhar Kaushik Ramachandran, Kevin Lybarger, Fei Xia, Ozlem Uzuner, Meliha Yetisgen, Martin Gunn

Main category: cs.CL

TL;DR: This paper introduces a radiology report corpus for evaluating LLMs on follow-up adherence detection, showing that optimized GPT models achieve near-human performance while traditional ML models remain strong baselines.

DetailsMotivation: There is a lack of domain-specific datasets to rigorously evaluate LLMs on radiology tasks, particularly for follow-up imaging adherence detection.

Method: Created annotated corpus of 6,393 radiology reports; compared traditional ML classifiers (LR, SVM, Longformer) with fine-tuned Llama3-8B and generative LLMs (GPT-4o, GPT-OSS-20B) using baseline and task-optimized configurations with refined prompts.

Result: GPT-4o (Advanced) achieved best performance (F1=0.832), GPT-OSS-20B (Advanced) close second (F1=0.828), LR and SVM also strong (F1=0.776, 0.775); high inter-annotator agreement (F1=0.846).

Conclusion: While optimized LLMs approach human-level performance, interpretable and resource-efficient traditional models remain valuable baselines for radiology follow-up adherence detection.

Abstract: Large language models (LLMs) have shown considerable promise in clinical natural language processing, yet few domain-specific datasets exist to rigorously evaluate their performance on radiology tasks. In this work, we introduce an annotated corpus of 6,393 radiology reports from 586 patients, each labeled for follow-up imaging status, to support the development and benchmarking of follow-up adherence detection systems. Using this corpus, we systematically compared traditional machine-learning classifiers, including logistic regression (LR), support vector machines (SVM), Longformer, and a fully fine-tuned Llama3-8B-Instruct, with recent generative LLMs. To evaluate generative LLMs, we tested GPT-4o and the open-source GPT-OSS-20B under two configurations: a baseline (Base) and a task-optimized (Advanced) setting that focused inputs on metadata, recommendation sentences, and their surrounding context. A refined prompt for GPT-OSS-20B further improved reasoning accuracy. Performance was assessed using precision, recall, and F1 scores with 95% confidence intervals estimated via non-parametric bootstrapping. Inter-annotator agreement was high (F1 = 0.846). GPT-4o (Advanced) achieved the best performance (F1 = 0.832), followed closely by GPT-OSS-20B (Advanced; F1 = 0.828). LR and SVM also performed strongly (F1 = 0.776 and 0.775), underscoring that while LLMs approach human-level agreement through prompt optimization, interpretable and resource-efficient models remain valuable baselines.

[8] MedPT: A Massive Medical Question Answering Dataset for Brazilian-Portuguese Speakers

Fernanda Bufon Färber, Iago Alves Brito, Julia Soares Dollis, Pedro Schindler Freire Brasil Ribeiro, Rafael Teixeira Sousa, Arlindo Rodrigues Galvão Filho

Main category: cs.CL

TL;DR: MedPT is the first large-scale Brazilian Portuguese medical corpus with 384K patient-doctor Q&A pairs, enabling culturally-aware medical AI development for Portuguese speakers through specialized dataset curation and validation.

DetailsMotivation: Address the gap in medical LLM development for non-English languages, particularly Brazilian Portuguese, where simple translation fails to capture clinical and cultural nuances like endemic diseases.

Method: Created MedPT corpus through multi-stage curation with hybrid quantitative-qualitative analysis, LLM-driven annotation classifying questions into 7 semantic types, and filtering noise while contextually enriching ambiguous queries.

Result: Achieved 94% F1-score on 20-class medical specialty routing task using fine-tuned 1.7B parameter model, with error analysis showing misclassifications reflect genuine clinical ambiguities rather than random errors.

Conclusion: MedPT enables development of equitable, accurate, and culturally-aware medical technologies for Portuguese-speaking populations, proving the dataset’s semantic richness and utility for specialized medical AI applications.

Abstract: While large language models (LLMs) show transformative potential in healthcare, their development remains focused on high-resource languages, creating a critical barrier for others as simple translation fails to capture unique clinical and cultural nuances, such as endemic diseases. To address this, we introduce MedPT, the first large-scale, real-world corpus for Brazilian Portuguese, comprising 384,095 authentic question-answer pairs from patient-doctor interactions. The dataset underwent a meticulous multi-stage curation protocol, using a hybrid quantitative-qualitative analysis to filter noise and contextually enrich thousands of ambiguous queries. We further augmented the corpus via LLM-driven annotation, classifying questions into seven semantic types to capture user intent. Our analysis reveals its thematic breadth (3,200 topics) and unique linguistic properties, like the natural asymmetry in patient-doctor communication. To validate its utility, we benchmark a medical specialty routing task: fine-tuning a 1.7B parameter model achieves an outstanding 94% F1-score on a 20-class setup. Furthermore, our qualitative error analysis shows misclassifications are not random but reflect genuine clinical ambiguities (e.g., between comorbid conditions), proving the dataset’s deep semantic richness. We publicly release MedPT to foster the development of more equitable, accurate, and culturally-aware medical technologies for the Portuguese-speaking world.

[9] ClinStructor: AI-Powered Structuring of Unstructured Clinical Texts

Karthikeyan K, Raghuveer Thirukovalluru, David Carlson

Main category: cs.CL

TL;DR: ClinStructor converts clinical free-text into structured Q&A pairs using LLMs to address bias, poor generalization, and interpretability issues in clinical NLP.

DetailsMotivation: Clinical notes contain valuable but unstructured information that introduces biases, poor generalization across EHR systems, and limited interpretability in predictive modeling.

Method: Leverage large language models to convert clinical free-text into structured, task-specific question-answer pairs before predictive modeling.

Result: Substantially enhances transparency and controllability with only modest performance reduction (2-3% AUC drop) on ICU mortality prediction compared to direct fine-tuning.

Conclusion: ClinStructor provides a foundation for building reliable, interpretable, and generalizable machine learning models in clinical environments.

Abstract: Clinical notes contain valuable, context-rich information, but their unstructured format introduces several challenges, including unintended biases (e.g., gender or racial bias), and poor generalization across clinical settings (e.g., models trained on one EHR system may perform poorly on another due to format differences) and poor interpretability. To address these issues, we present ClinStructor, a pipeline that leverages large language models (LLMs) to convert clinical free-text into structured, task-specific question-answer pairs prior to predictive modeling. Our method substantially enhances transparency and controllability and only leads to a modest reduction in predictive performance (a 2-3% drop in AUC), compared to direct fine-tuning, on the ICU mortality prediction task. ClinStructor lays a strong foundation for building reliable, interpretable, and generalizable machine learning models in clinical environments.

[10] Context-Emotion Aware Therapeutic Dialogue Generation: A Multi-component Reinforcement Learning Approach to Language Models for Mental Health Support

Eric Hua Qing Zhang, Julia Ive

Main category: cs.CL

TL;DR: Fine-tuning GPT-2 with reinforcement learning significantly improves therapeutic dialogue generation, achieving 99.34% emotion accuracy and better alignment with professional therapist responses.

DetailsMotivation: Mental health accessibility challenges increased during COVID-19, creating demand for telehealth solutions. Pre-trained LLMs lack contextual and emotional awareness needed for appropriate therapeutic responses.

Method: Applied supervised fine-tuning and reinforcement learning to GPT-2, restructuring input formats to process context and emotional states simultaneously, using multi-component reward functions aligned with therapist responses and annotated emotions.

Result: Reinforcement learning outperformed baseline GPT-2 across all metrics: BLEU (0.0111), ROUGE-1 (0.1397), ROUGE-2 (0.0213), ROUGE-L (0.1317), METEOR (0.0581), and achieved 99.34% emotion accuracy vs 66.96% for baseline.

Conclusion: Reinforcement learning effectively develops therapeutic dialogue systems that can serve as valuable assistive tools for therapists while maintaining essential human clinical oversight.

Abstract: Mental health illness represents a substantial global socioeconomic burden, with COVID-19 further exacerbating accessibility challenges and driving increased demand for telehealth mental health support. While large language models (LLMs) offer promising solutions through 24/7 availability and non-judgmental interactions, pre-trained models often lack the contextual and emotional awareness necessary for appropriate therapeutic responses. This paper investigated the application of supervised fine-tuning (SFT) and reinforcement learning (RL) techniques to enhance GPT-2’s capacity for therapeutic dialogue generation. The methodology restructured input formats to enable simultaneous processing of contextual information and emotional states alongside user input, employing a multi-component reward function that aligned model outputs with professional therapist responses and annotated emotions. Results demonstrated improvements through reinforcement learning over baseline GPT-2 across multiple evaluation metrics: BLEU (0.0111), ROUGE-1 (0.1397), ROUGE-2 (0.0213), ROUGE-L (0.1317), and METEOR (0.0581). LLM evaluation confirmed high contextual relevance and professionalism, while reinforcement learning achieved 99.34% emotion accuracy compared to 66.96% for baseline GPT-2. These findings demonstrate reinforcement learning’s effectiveness in developing therapeutic dialogue systems that can serve as valuable assistive tools for therapists while maintaining essential human clinical oversight.

[11] Additive Large Language Models for Semi-Structured Text

Karthikeyan K, Raghuveer Thirukovalluru, David Carlson

Main category: cs.CL

TL;DR: CALM is an interpretable framework for clinical text classification that uses additive LLMs to provide faithful explanations by summing component contributions from semi-structured text.

DetailsMotivation: To address the opacity of LLM predictions in clinical settings where understanding which parts of patient records drive risk signals is crucial for practical adoption.

Method: Uses additive LLMs where inputs are semantically meaningful components (e.g., note sections, intake form fields) and predicts outcomes as the sum of each component’s contribution.

Result: Achieves performance comparable to conventional LLM classifiers while providing interpretable component-level risk curves and enabling quality-assurance checks.

Conclusion: CALM improves trust in clinical text classification by making model predictions interpretable at both patient and population levels, revealing clinically meaningful patterns.

Abstract: Large Language Models have advanced clinical text classification, but their opaque predictions remain a critical barrier to practical adoption in research and clinical settings where investigators and physicians need to understand which parts of a patient’s record drive risk signals. To address this challenge, we introduce \textbf{CALM}, short for \textbf{Classification with Additive Large Language Models}, an interpretable framework for semi-structured text where inputs are composed of semantically meaningful components, such as sections of an admission note or question-answer fields from an intake form. CALM predicts outcomes as the additive sum of each component’s contribution, making these contributions part of the forward computation itself and enabling faithful explanations at both the patient and population level. The additive structure also enables clear visualizations, such as component-level risk curves similar to those used in generalized additive models, making the learned relationships easier to inspect and communicate. Although CALM expects semi-structured inputs, many clinical documents already have this form, and similar structure can often be automatically extracted from free-text notes. CALM achieves performance comparable to conventional LLM classifiers while improving trust, supporting quality-assurance checks, and revealing clinically meaningful patterns during model development and auditing.

[12] Toward Conversational Hungarian Speech Recognition: Introducing the BEA-Large and BEA-Dialogue Datasets

Máté Gedeon, Piroska Zsófia Barta, Péter Mihajlik, Tekla Etelka Gráczi, Anna Kohári, Katalin Mády

Main category: cs.CL

TL;DR: Two new Hungarian speech datasets (BEA-Large and BEA-Dialogue) address the lack of spontaneous and conversational corpora, with baseline ASR and diarization results showing persistent challenges in conversational speech.

DetailsMotivation: Hungarian is underrepresented in ASR research due to limited spontaneous and conversational corpora, creating a gap compared to high-resource languages.

Method: Created BEA-Large (255h spontaneous speech) and BEA-Dialogue (85h natural conversations) from unprocessed portions of Hungarian BEA corpus, with detailed metadata and speaker-independent partitions.

Result: Fine-tuned Fast Conformer achieved 14.18% WER on spontaneous speech and 4.8% on repeated speech; diarization error rates between 13.05%-18.26%.

Conclusion: Conversational ASR remains challenging due to disfluencies, overlaps, and informal speech; released datasets provide framework for developing spontaneous speech benchmarks in other languages.

Abstract: The advancement of automatic speech recognition (ASR) has been largely enhanced by extensive datasets in high-resource languages, while languages such as Hungarian remain underrepresented due to limited spontaneous and conversational corpora. To address this gap, we introduce two new datasets – BEA-Large and BEA-Dialogue – constructed from the previously unprocessed portions of the Hungarian speech corpus named BEA. BEA-Large extends BEA-Base with 255 hours of spontaneous speech from 433 speakers, enriched with detailed segment-level metadata. BEA-Dialogue, comprising 85 hours of spontaneous conversations, is a Hungarian speech corpus featuring natural dialogues partitioned into speaker-independent subsets, supporting research in conversational ASR and speaker diarization. We establish reproducible baselines on these datasets using publicly available ASR models, with the fine-tuned Fast Conformer model achieving word error rates as low as 14.18% on spontaneous and 4.8% on repeated speech. Diarization experiments yield diarization error rates between 13.05% and 18.26%, providing reference points for future improvements. The results highlight the persistent difficulty of conversational ASR, particularly due to disfluencies, overlaps, and informal speech patterns. By releasing these datasets and baselines, we aim to advance Hungarian speech technology and offer a methodological framework for developing spontaneous and conversational benchmarks in other languages.

[13] InData: Towards Secure Multi-Step, Tool-Based Data Analysis

Karthikeyan K, Raghuveer Thirukovalluru, Bhuwan Dhingra, David Edwin Carlson

Main category: cs.CL

TL;DR: Proposes InData dataset to evaluate LLMs’ multi-step tool-based reasoning for secure data analysis, showing current models struggle with complex tasks despite good performance on simple ones.

DetailsMotivation: Address security risks of LLMs directly generating and executing code on sensitive databases by restricting them to use predefined secure tools instead.

Method: Introduce Indirect Data Engagement (InData) dataset with data analysis questions at three difficulty levels (Easy, Medium, Hard) to benchmark LLMs’ multi-step tool-based reasoning capabilities.

Result: Benchmarked 15 open-source LLMs showing high accuracy on Easy tasks (97.3%) but significant performance drop on Hard tasks (69.6%), indicating lack of robust multi-step reasoning.

Conclusion: Current LLMs lack strong multi-step tool-based reasoning abilities, and InData enables development and evaluation of improved models for secure data analysis.

Abstract: Large language model agents for data analysis typically generate and execute code directly on databases. However, when applied to sensitive data, this approach poses significant security risks. To address this issue, we propose a security-motivated alternative: restrict LLMs from direct code generation and data access, and require them to interact with data exclusively through a predefined set of secure, verified tools. Although recent tool-use benchmarks exist, they primarily target tool selection and simple execution rather than the compositional, multi-step reasoning needed for complex data analysis. To reduce this gap, we introduce Indirect Data Engagement (InData), a dataset designed to assess LLMs’ multi-step tool-based reasoning ability. InData includes data analysis questions at three difficulty levels–Easy, Medium, and Hard–capturing increasing reasoning complexity. We benchmark 15 open-source LLMs on InData and find that while large models (e.g., gpt-oss-120b) achieve high accuracy on Easy tasks (97.3%), performance drops sharply on Hard tasks (69.6%). These results show that current LLMs still lack robust multi-step tool-based reasoning ability. With InData, we take a step toward enabling the development and evaluation of LLMs with stronger multi-step tool-use capabilities. We will publicly release the dataset and code.

[14] Improving LLM’s Attachment to External Knowledge In Dialogue Generation Tasks Through Entity Anonymization

Hadi Sheikhi, Chenyang Huang, Osmar R. Zaïane

Main category: cs.CL

TL;DR: LLMs struggle with knowledge attachment in KG-DG tasks, so the paper introduces LLM-KAT evaluation and entity anonymization to improve external knowledge utilization.

DetailsMotivation: LLMs often rely on internal knowledge instead of provided knowledge graphs in dialogue generation, leading to detachment from external knowledge sources.

Method: Proposed LLM-KAT evaluation procedure for measuring knowledge attachment, and introduced entity anonymization technique to encourage better utilization of external knowledge.

Result: Experiments on OpenDialKG dataset show improved knowledge attachment in LLMs when using the proposed approach.

Conclusion: The simple entity anonymization technique effectively helps LLMs better leverage external knowledge in knowledge graph-based dialogue generation tasks.

Abstract: Knowledge graph-based dialogue generation (KG-DG) is a challenging task requiring models to effectively incorporate external knowledge into conversational responses. While large language models (LLMs) have achieved impressive results across various NLP tasks, their ability to utilize external knowledge in KG-DG remains under-explored. We observe that LLMs often rely on internal knowledge, leading to detachment from provided knowledge graphs, even when they are given a flawlessly retrieved knowledge graph. First, we introduce LLM-KAT, an evaluation procedure for measuring knowledge attachment in generated responses. Second, we propose a simple yet effective entity anonymization technique to encourage LLMs to better leverage external knowledge. Experiments on the OpenDialKG dataset demonstrate that our approach improves LLMs’ attachment on external knowledge.

[15] On the Entropy Calibration of Language Models

Steven Cao, Gregory Valiant, Percy Liang

Main category: cs.CL

TL;DR: Language models are miscalibrated with entropy increasing over longer generations. Scaling doesn’t fix this issue - larger models accumulate error at similar rates as smaller ones. Truncation improves quality but reduces diversity. The paper proves it’s theoretically possible to reduce entropy while preserving log loss using a black box that predicts future entropy.

DetailsMotivation: To understand if language model miscalibration improves with scale and whether it's theoretically possible to calibrate without tradeoffs between text quality and diversity.

Method: First studied a simplified theoretical setting to characterize scaling behavior of miscalibration with respect to dataset size, then empirically measured miscalibration in language models ranging from 0.5B to 70B parameters.

Result: Found that miscalibration scaling behavior depends on power law exponent of data distribution - for exponents close to 1, scaling exponent is close to 0, meaning miscalibration improves very slowly with scale. Empirical results matched theoretical predictions.

Conclusion: Larger models accumulate error at similar rates as smaller ones, explaining why we use similar truncation levels. However, the paper proves it’s theoretically possible to reduce entropy while preserving log loss using a black box that can predict future entropy of text.

Abstract: We study the problem of entropy calibration, which asks whether a language model’s entropy over generations matches its log loss on human text. Past work found that models are miscalibrated, with entropy per step increasing (and text quality decreasing) as generations grow longer. This error accumulation is a fundamental problem in autoregressive models, and the standard solution is to truncate the distribution, which improves text quality at the cost of diversity. In this paper, we ask: is miscalibration likely to improve with scale, and is it theoretically possible to calibrate without tradeoffs? To build intuition, we first study a simplified theoretical setting to characterize the scaling behavior of miscalibration with respect to dataset size. We find that the scaling behavior depends on the power law exponent of the data distribution – in particular, for a power law exponent close to 1, the scaling exponent is close to 0, meaning that miscalibration improves very slowly with scale. Next, we measure miscalibration empirically in language models ranging from 0.5B to 70B parameters. We find that the observed scaling behavior is similar to what is predicted by the simplified setting: our fitted scaling exponents for text are close to 0, meaning that larger models accumulate error at a similar rate as smaller ones. This scaling (or, lack thereof) provides one explanation for why we sample from larger models with similar amounts of truncation as smaller models, even though the larger models are of higher quality. However, truncation is not a satisfying solution because it comes at the cost of increased log loss. In theory, is it even possible to reduce entropy while preserving log loss? We prove that it is possible, if we assume access to a black box which can fit models to predict the future entropy of text.

[16] A Reasoning Paradigm for Named Entity Recognition

Hui Huang, Yanping Chen, Ruizhang Huang, Chuan Lin, Yongbin Qin

Main category: cs.CL

TL;DR: ReasoningNER introduces a reasoning framework for NER that shifts from implicit pattern matching to explicit reasoning, achieving SOTA performance in zero-shot settings and outperforming GPT-4 by 12.3 F1 points.

DetailsMotivation: Current generative LLMs for NER rely on semantic pattern matching without explicit reasoning mechanisms, leading to suboptimal performance and brittle generalization, especially in zero-shot and low-resource scenarios.

Method: Three-stage framework: 1) Generate NER-oriented Chain of Thought (CoT) datasets with reasoning chains, 2) Tune NER model to generate rationales before final answers, 3) Enhance reasoning process using comprehensive reward signals for verifiable extractions.

Result: Achieves competitive performance overall and SOTA in zero-shot settings, outperforming GPT-4 by 12.3 percentage points on F1 score. Demonstrates impressive cognitive ability and potential for reasoning-oriented information extraction.

Conclusion: The reasoning framework successfully addresses cognitive shortcutting in NER by enabling explicit, verifiable reasoning, showing great promise for advancing research in reasoning-oriented information extraction.

Abstract: Generative LLMs typically improve Named Entity Recognition (NER) performance through instruction tuning. They excel at generating entities by semantic pattern matching but lack an explicit, verifiable reasoning mechanism. This “cognitive shortcutting” leads to suboptimal performance and brittle generalization, especially in zero-shot and lowresource scenarios where reasoning from limited contextual cues is crucial. To address this issue, a reasoning framework is proposed for NER, which shifts the extraction paradigm from implicit pattern matching to explicit reasoning. This framework consists of three stages: Chain of Thought (CoT) generation, CoT tuning, and reasoning enhancement. First, a dataset annotated with NER-oriented CoTs is generated, which contain task-relevant reasoning chains. Then, they are used to tune the NER model to generate coherent rationales before deriving the final answer. Finally, a reasoning enhancement stage is implemented to optimize the reasoning process using a comprehensive reward signal. This stage ensures explicit and verifiable extractions. Experiments show that ReasoningNER demonstrates impressive cognitive ability in the NER task, achieving competitive performance. In zero-shot settings, it achieves state-of-the-art (SOTA) performance, outperforming GPT-4 by 12.3 percentage points on the F1 score. Analytical results also demonstrate its great potential to advance research in reasoningoriented information extraction. Our codes are available at https://github.com/HuiResearch/ReasoningIE.

[17] Critical or Compliant? The Double-Edged Sword of Reasoning in Chain-of-Thought Explanations

Eunkyu Park, Wesley Hanwen Deng, Vasudha Varadarajan, Mingxi Yan, Gunhee Kim, Maarten Sap, Motahhare Eslami

Main category: cs.CL

TL;DR: CoT explanations can both clarify and mislead in moral scenarios, as users often trust outputs that agree with them regardless of reasoning flaws, and confident tones suppress error detection while maintaining reliance.

DetailsMotivation: To study the double-edged role of Chain-of-Thought explanations in fostering both transparency and confirmation bias in multimodal moral scenarios.

Method: Systematically perturbing reasoning chains and manipulating delivery tones in vision language models, then analyzing how reasoning errors impact user trust and error detection ability.

Result: Users equate trust with outcome agreement (sustaining reliance despite flawed reasoning), and confident tones suppress error detection while maintaining reliance (delivery style overrides correctness).

Conclusion: CoT explanations can simultaneously clarify and mislead, highlighting the need for NLP systems to provide explanations that encourage scrutiny and critical thinking rather than blind trust.

Abstract: Explanations are often promoted as tools for transparency, but they can also foster confirmation bias; users may assume reasoning is correct whenever outputs appear acceptable. We study this double-edged role of Chain-of-Thought (CoT) explanations in multimodal moral scenarios by systematically perturbing reasoning chains and manipulating delivery tones. Specifically, we analyze reasoning errors in vision language models (VLMs) and how they impact user trust and the ability to detect errors. Our findings reveal two key effects: (1) users often equate trust with outcome agreement, sustaining reliance even when reasoning is flawed, and (2) the confident tone suppresses error detection while maintaining reliance, showing that delivery styles can override correctness. These results highlight how CoT explanations can simultaneously clarify and mislead, underscoring the need for NLP systems to provide explanations that encourage scrutiny and critical thinking rather than blind trust. All code will be released publicly.

[18] CURE: Cultural Understanding and Reasoning Evaluation - A Framework for “Thick” Culture Alignment Evaluation in LLMs

Truong Vo, Sanmi Koyejo

Main category: cs.CL

TL;DR: This paper introduces a new benchmark for evaluating cultural competence in LLMs using realistic situational contexts and multiple metrics, showing that traditional evaluation methods overestimate cultural competence while their approach provides more stable and interpretable assessments.

DetailsMotivation: Existing evaluations of cultural competence in LLMs are limited, focusing on de-contextualized correctness or forced-choice judgments, which overlook the need for cultural understanding and reasoning required for appropriate responses in culturally diverse environments.

Method: The authors introduce benchmarks with realistic situational contexts requiring culturally grounded reasoning, and use four complementary metrics (Coverage, Specificity, Connotation, and Coherence) alongside standard Exact Match to capture different dimensions of response quality.

Result: Empirical analysis shows that traditional ’thin’ evaluation systematically overestimates cultural competence and produces unstable assessments with high variance, while their ’thick’ evaluation exposes differences in reasoning depth, reduces variance, and provides more stable, interpretable signals.

Conclusion: Thick evaluation with realistic contexts and multiple metrics provides more accurate and stable assessment of cultural competence in LLMs compared to traditional evaluation methods.

Abstract: Large language models (LLMs) are increasingly deployed in culturally diverse environments, yet existing evaluations of cultural competence remain limited. Existing methods focus on de-contextualized correctness or forced-choice judgments, overlooking the need for cultural understanding and reasoning required for appropriate responses. To address this gap, we introduce a set of benchmarks that, instead of directly probing abstract norms or isolated statements, present models with realistic situational contexts that require culturally grounded reasoning. In addition to the standard Exact Match metric, we introduce four complementary metrics (Coverage, Specificity, Connotation, and Coherence) to capture different dimensions of model’s response quality. Empirical analysis across frontier models reveals that thin evaluation systematically overestimates cultural competence and produces unstable assessments with high variance. In contrast, thick evaluation exposes differences in reasoning depth, reduces variance, and provides more stable, interpretable signals of cultural understanding.

[19] Exploring Parameter-Efficient Fine-Tuning and Backtranslation for the WMT 25 General Translation Task

Felipe Fujita, Hideyuki Takada

Main category: cs.CL

TL;DR: Combining backtranslation and fine-tuning on small Japanese corpus significantly improves English→Japanese neural machine translation quality, outperforming either technique alone.

DetailsMotivation: To enhance neural machine translation for low-resource language pairs like English-Japanese using limited training data through synergistic techniques.

Method: First apply backtranslation using synthetic data from monolingual Japanese corpora, then fine-tune on genuine small parallel dataset, and finally combine both by augmenting small dataset with BT examples before fine-tuning.

Result: Baseline COMET=0.460 → BT alone=0.468 → FT alone=0.589 → Combined BT+FT=0.597, showing significant improvement over individual techniques.

Conclusion: The synergistic combination of backtranslation and targeted fine-tuning offers a lightweight yet powerful strategy for improving low-resource language pairs even with limited training data.

Abstract: In this paper, we explore the effectiveness of combining fine-tuning and backtranslation on a small Japanese corpus for neural machine translation. Starting from a baseline English{\textrightarrow}Japanese model (COMET = 0.460), we first apply backtranslation (BT) using synthetic data generated from monolingual Japanese corpora, yielding a modest increase (COMET = 0.468). Next, we fine-tune (FT) the model on a genuine small parallel dataset drawn from diverse Japanese news and literary corpora, achieving a substantial jump to COMET = 0.589 when using Mistral 7B. Finally, we integrate both backtranslation and fine-tuning{ – }first augmenting the small dataset with BT generated examples, then adapting via FT{ – }which further boosts performance to COMET = 0.597. These results demonstrate that, even with limited training data, the synergistic use of backtranslation and targeted fine-tuning on Japanese corpora can significantly enhance translation quality, outperforming each technique in isolation. This approach offers a lightweight yet powerful strategy for improving low-resource language pairs.

[20] LLMLagBench: Identifying Temporal Training Boundaries in Large Language Models

Piotr Pęzik, Konrad Kaczyński, Maria Szymańska, Filip Żarnecki, Zuzanna Deckert, Jakub Kwiatkowski, Wojciech Janowski

Main category: cs.CL

TL;DR: LLMLagBench is a benchmark to identify LLMs’ temporal knowledge boundaries by testing their awareness of recent events, helping detect when models might blend outdated information with current knowledge.

DetailsMotivation: LLMs have fixed training cutoffs, creating knowledge boundaries that can lead to inaccurate responses when models unknowingly use outdated time-sensitive information during reasoning.

Method: Developed LLMLagBench to systematically evaluate LLMs’ knowledge of recent events and identify the earliest probable temporal boundaries of their training data.

Result: Applied the benchmark to evaluate various LLMs with both declared and undeclared training cutoffs, with reliability assessed through manual validation and comparison with publicly available pretraining information.

Conclusion: LLMLagBench provides a systematic approach to identify LLMs’ temporal knowledge limitations, helping address the problem of outdated information being used in reasoning tasks.

Abstract: Large Language Models (LLMs) are pretrained on textual data up to a specific temporal cutoff. This creates a strict knowledge boundary beyond which models cannot provide accurate information without querying external sources. More subtly, when this limitation is unknown or ignored, LLMs may inadvertently blend outdated time-sensitive information with general knowledge during reasoning tasks, potentially compromising response accuracy. We introduce LLMLagBench, an LLM freshness benchmark, as a systematic approach for identifying the earliest probable temporal boundaries of an LLM’s training data by evaluating its knowledge of recent events. We then apply this benchmark to evaluate a large set of LLMs, including models with both explicitly declared and undeclared training cutoffs. The reliability of the benchmark is assessed by manual validation and comparison with publicly released information about LLM pretraining.

[21] PRISM of Opinions: A Persona-Reasoned Multimodal Framework for User-centric Conversational Stance Detection

Bingbing Wang, Zhixin Bai, Zhengda Jin, Zihan Wang, Xintong Song, Jingjie Lin, Sixuan Li, Jing Li, Ruifeng Xu

Main category: cs.CL

TL;DR: U-MStance is the first user-centric multimodal stance detection dataset addressing limitations of pseudo-multimodality and user homogeneity. PRISM model incorporates user personas and multimodal alignment for improved stance detection.

DetailsMotivation: Existing multimodal stance detection suffers from pseudo-multimodality (visual cues only in source posts) and user homogeneity (ignoring personal traits), limiting real-world applicability.

Method: PRISM model derives user personas from historical data, aligns textual/visual cues via Chain-of-Thought reasoning, and uses mutual task reinforcement for joint stance detection and response generation.

Result: Experiments on U-MStance dataset show PRISM achieves significant performance gains over strong baselines in multimodal conversational stance detection.

Conclusion: User-centric and context-grounded multimodal reasoning is essential for realistic stance understanding in social media conversations.

Abstract: The rapid proliferation of multimodal social media content has driven research in Multimodal Conversational Stance Detection (MCSD), which aims to interpret users’ attitudes toward specific targets within complex discussions. However, existing studies remain limited by: 1) pseudo-multimodality, where visual cues appear only in source posts while comments are treated as text-only, misaligning with real-world multimodal interactions; and 2) user homogeneity, where diverse users are treated uniformly, neglecting personal traits that shape stance expression. To address these issues, we introduce U-MStance, the first user-centric MCSD dataset, containing over 40k annotated comments across six real-world targets. We further propose PRISM, a Persona-Reasoned multImodal Stance Model for MCSD. PRISM first derives longitudinal user personas from historical posts and comments to capture individual traits, then aligns textual and visual cues within conversational context via Chain-of-Thought to bridge semantic and pragmatic gaps across modalities. Finally, a mutual task reinforcement mechanism is employed to jointly optimize stance detection and stance-aware response generation for bidirectional knowledge transfer. Experiments on U-MStance demonstrate that PRISM yields significant gains over strong baselines, underscoring the effectiveness of user-centric and context-grounded multimodal reasoning for realistic stance understanding.

[22] AI-Salesman: Towards Reliable Large Language Model Driven Telemarketing

Qingyu Zhang, Chunlei Xin, Xuanang Chen, Yaojie Lu, Hongyu Lin, Xianpei Han, Le Sun, Qing Ye, Qianlong Xie, Xingxing Wang

Main category: cs.CL

TL;DR: AI-Salesman is a novel framework for goal-driven persuasive dialogue that uses dual-stage architecture with Bayesian-supervised reinforcement learning and dynamic outline-guided inference to overcome LLM limitations in sales scenarios.

DetailsMotivation: Goal-driven persuasive dialogue requires sophisticated multi-turn planning and strict factual faithfulness, which current LLMs struggle with due to strategic brittleness and factual hallucination, especially in data-scarce domains like telemarketing.

Method: Constructed TeleSalesCorpus dataset, then proposed AI-Salesman with dual-stage architecture: training stage uses Bayesian-supervised reinforcement learning from noisy dialogues, and inference stage uses Dynamic Outline-Guided Agent (DOGA) with pre-built script library for turn-by-turn strategic guidance.

Result: AI-Salesman significantly outperforms baseline models in both automatic metrics and comprehensive human evaluations, demonstrating effectiveness in complex persuasive scenarios.

Conclusion: The proposed framework successfully addresses challenges in goal-driven persuasive dialogue by combining robust training with dynamic inference guidance, showing promising results for real-world sales applications.

Abstract: Goal-driven persuasive dialogue, exemplified by applications like telemarketing, requires sophisticated multi-turn planning and strict factual faithfulness, which remains a significant challenge for even state-of-the-art Large Language Models (LLMs). A lack of task-specific data often limits previous works, and direct LLM application suffers from strategic brittleness and factual hallucination. In this paper, we first construct and release TeleSalesCorpus, the first real-world-grounded dialogue dataset for this domain. We then propose AI-Salesman, a novel framework featuring a dual-stage architecture. For the training stage, we design a Bayesian-supervised reinforcement learning algorithm that learns robust sales strategies from noisy dialogues. For the inference stage, we introduce the Dynamic Outline-Guided Agent (DOGA), which leverages a pre-built script library to provide dynamic, turn-by-turn strategic guidance. Moreover, we design a comprehensive evaluation framework that combines fine-grained metrics for key sales skills with the LLM-as-a-Judge paradigm. Experimental results demonstrate that our proposed AI-Salesman significantly outperforms baseline models in both automatic metrics and comprehensive human evaluations, showcasing its effectiveness in complex persuasive scenarios.

[23] Seeing is Believing: Rich-Context Hallucination Detection for MLLMs via Backward Visual Grounding

Pinxue Guo, Chongruo Wu, Xinyu Zhou, Lingyi Hong, Zhaoyu Chen, Jinglun Li, Kaixun Jiang, Sen-ching Samson Cheung, Wei Zhang, Wenqiang Zhang

Main category: cs.CL

TL;DR: VBackChecker is a reference-free hallucination detection framework for MLLMs that verifies response consistency with visual inputs using pixel-level Grounding LLM with reasoning and segmentation capabilities, achieving SOTA performance.

DetailsMotivation: MLLMs suffer from hallucinations, making accurate detection crucial for reliability in practical applications. The principle "Seeing is Believing" guides the need to verify consistency between generated responses and visual inputs.

Method: Uses a pixel-level Grounding LLM with reasoning and referring segmentation capabilities. Includes an innovative pipeline for generating instruction-tuning data (R-Instruct) with rich-context descriptions, grounding masks, and hard negative samples.

Result: Outperforms prior complex frameworks and achieves SOTA on R^2-HalBench benchmark, rivaling GPT-4o’s capabilities. Surpasses prior methods in pixel-level grounding with over 10% improvement.

Conclusion: VBackChecker provides an effective reference-free hallucination detection framework with interpretability that handles rich-context scenarios well, demonstrating superior performance compared to existing methods.

Abstract: Multimodal Large Language Models (MLLMs) have unlocked powerful cross-modal capabilities, but still significantly suffer from hallucinations. As such, accurate detection of hallucinations in MLLMs is imperative for ensuring their reliability in practical applications. To this end, guided by the principle of “Seeing is Believing”, we introduce VBackChecker, a novel reference-free hallucination detection framework that verifies the consistency of MLLMgenerated responses with visual inputs, by leveraging a pixellevel Grounding LLM equipped with reasoning and referring segmentation capabilities. This reference-free framework not only effectively handles rich-context scenarios, but also offers interpretability. To facilitate this, an innovative pipeline is accordingly designed for generating instruction-tuning data (R-Instruct), featuring rich-context descriptions, grounding masks, and hard negative samples. We further establish R^2 -HalBench, a new hallucination benchmark for MLLMs, which, unlike previous benchmarks, encompasses real-world, rich-context descriptions from 18 MLLMs with high-quality annotations, spanning diverse object-, attribute, and relationship-level details. VBackChecker outperforms prior complex frameworks and achieves state-of-the-art performance on R^2 -HalBench, even rivaling GPT-4o’s capabilities in hallucination detection. It also surpasses prior methods in the pixel-level grounding task, achieving over a 10% improvement. All codes, data, and models are available at https://github.com/PinxueGuo/VBackChecker.

[24] CriticSearch: Fine-Grained Credit Assignment for Search Agents via a Retrospective Critic

Yaocheng Zhang, Haohuan Huang, Zijun Song, Yuanheng Zhu, Qichao Zhang, Zijie Zhao, Dongbin Zhao

Main category: cs.CL

TL;DR: CriticSearch introduces a fine-grained credit-assignment framework using retrospective critic mechanism to provide dense, turn-level feedback for search agents, overcoming sparse reward issues in reinforcement learning.

DetailsMotivation: Existing search agent pipelines relying on reinforcement learning suffer from sparse outcome rewards, leading to inefficient exploration and unstable training in complex question-answering tasks.

Method: Uses a frozen, asymmetric critique LLM that retrospectively evaluates each turn using privileged information from full trajectory and gold answers, converting assessments into stable, dense rewards to guide policy improvement.

Result: Experimental results across diverse multi-hop reasoning benchmarks show CriticSearch consistently outperforms existing baselines with faster convergence, improved training stability, and higher performance.

Conclusion: CriticSearch effectively addresses sparse reward problems in search agent training through fine-grained credit assignment, demonstrating superior performance and stability in complex reasoning tasks.

Abstract: Tool-Integrated Reasoning (TIR) with search engines enables large language models to iteratively retrieve up-to-date external knowledge, enhancing adaptability and generalization in complex question-answering tasks. However, existing search agent pipelines typically depend on reinforcement learning based optimization, which often suffers from sparse outcome rewards, leading to inefficient exploration and unstable training. We introduce CriticSearch, a fine-grained credit-assignment framework that supplies dense, turn-level feedback via a retrospective critic mechanism. During training, a frozen, asymmetric critique LLM retrospectively evaluates each turn using privileged information from the full trajectory and gold answers, converting these assessments into stable, dense rewards that guide policy improvement. Experimental results across diverse multi-hop reasoning benchmarks demonstrate that CriticSearch consistently outperforms existing baselines, achieving faster convergence, improved training stability, and higher performance.

[25] MME-RAG: Multi-Manager-Expert Retrieval-Augmented Generation for Fine-Grained Entity Recognition in Task-Oriented Dialogues

Liang Xue, Haoyu Liu, Yajun Tian, Xinyu Zhong, Yang Liu

Main category: cs.CL

TL;DR: MME-RAG is a Multi-Manager-Expert Retrieval-Augmented Generation framework that improves fine-grained entity recognition in task-oriented dialogues through hierarchical decomposition and semantic retrieval.

DetailsMotivation: Current LLMs struggle with domain adaptation and retrieval controllability in fine-grained entity recognition for task-oriented dialogues.

Method: Decomposes entity recognition into type-level judgment by lightweight managers and span-level extraction by specialized experts, with KeyInfo retrievers injecting semantically aligned few-shot exemplars during inference.

Result: Outperforms recent baselines in most domains on CrossNER, MIT-Movie, MIT-Restaurant, and a new multi-domain customer-service dataset.

Conclusion: MME-RAG provides a scalable and interpretable solution for adaptive dialogue understanding, with hierarchical decomposition and KeyInfo-guided retrieval being key to robustness and cross-domain generalization.

Abstract: Fine-grained entity recognition is crucial for reasoning and decision-making in task-oriented dialogues, yet current large language models (LLMs) continue to face challenges in domain adaptation and retrieval controllability. We introduce MME-RAG, a Multi-Manager-Expert Retrieval-Augmented Generation framework that decomposes entity recognition into two coordinated stages: type-level judgment by lightweight managers and span-level extraction by specialized experts. Each expert is supported by a KeyInfo retriever that injects semantically aligned, few-shot exemplars during inference, enabling precise and domain-adaptive extraction without additional training. Experiments on CrossNER, MIT-Movie, MIT-Restaurant, and our newly constructed multi-domain customer-service dataset demonstrate that MME-RAG performs better than recent baselines in most domains. Ablation studies further show that both the hierarchical decomposition and KeyInfo-guided retrieval are key drivers of robustness and cross-domain generalization, establishing MME-RAG as a scalable and interpretable solution for adaptive dialogue understanding.

[26] Consistency Is the Key: Detecting Hallucinations in LLM Generated Text By Checking Inconsistencies About Key Facts

Raavi Gupta, Pranav Hari Panicker, Sumit Bhatia, Ganesh Ramakrishnan

Main category: cs.CL

TL;DR: CONFACTCHECK is an efficient hallucination detection method for LLMs that uses internal consistency checks without external knowledge bases, reducing API calls and costs.

DetailsMotivation: LLMs often generate factually incorrect text (hallucinations) which poses risks in critical domains, and existing detection methods require multiple costly API calls when model access is restricted.

Method: Uses internal consistency checks by probing factual statements within generated text and comparing responses across the same LLM and different LLMs, without external knowledge bases.

Result: Achieves higher accuracy than existing baselines while using fewer resources, validated on multiple datasets covering factual text generation and open generation tasks.

Conclusion: CONFACTCHECK provides an efficient and effective solution for hallucination detection in constrained settings where model fine-tuning or external knowledge bases are not available.

Abstract: Large language models (LLMs), despite their remarkable text generation capabilities, often hallucinate and generate text that is factually incorrect and not grounded in real-world knowledge. This poses serious risks in domains like healthcare, finance, and customer support. A typical way to use LLMs is via the APIs provided by LLM vendors where there is no access to model weights or options to fine-tune the model. Existing methods to detect hallucinations in such settings where the model access is restricted or constrained by resources typically require making multiple LLM API calls, increasing latency and API cost. We introduce CONFACTCHECK, an efficient hallucination detection approach that does not leverage any external knowledge base and works on the simple intuition that responses to factual probes within the generated text should be consistent within a single LLM and across different LLMs. Rigorous empirical evaluation on multiple datasets that cover both the generation of factual texts and the open generation shows that CONFACTCHECK can detect hallucinated facts efficiently using fewer resources and achieves higher accuracy scores compared to existing baselines that operate under similar conditions. Our code is available here.

[27] ViConBERT: Context-Gloss Aligned Vietnamese Word Embedding for Polysemous and Sense-Aware Representations

Khang T. Huynh, Dung H. Nguyen, Binh T. Nguyen

Main category: cs.CL

TL;DR: ViConBERT is a Vietnamese contextual embedding framework using contrastive learning and gloss-based distillation, achieving state-of-the-art performance on WSD and contextual similarity tasks.

DetailsMotivation: Vietnamese lacks robust models and evaluation resources for fine-grained semantic understanding compared to high-resource languages like English.

Method: Integrates contrastive learning (SimCLR) and gloss-based distillation to learn Vietnamese contextualized embeddings, and creates ViConWSD - the first large-scale synthetic dataset for Vietnamese semantic evaluation.

Result: Outperforms baselines on WSD (F1=0.87), achieves competitive performance on ViCon (AP=0.88) and ViSim-400 (Spearman’s rho=0.60).

Conclusion: ViConBERT effectively models both discrete senses and graded semantic relations in Vietnamese, providing a strong foundation for Vietnamese semantic understanding.

Abstract: Recent advances in contextualized word embeddings have greatly improved semantic tasks such as Word Sense Disambiguation (WSD) and contextual similarity, but most progress has been limited to high-resource languages like English. Vietnamese, in contrast, still lacks robust models and evaluation resources for fine-grained semantic understanding. In this paper, we present ViConBERT, a novel framework for learning Vietnamese contextualized embeddings that integrates contrastive learning (SimCLR) and gloss-based distillation to better capture word meaning. We also introduce ViConWSD, the first large-scale synthetic dataset for evaluating semantic understanding in Vietnamese, covering both WSD and contextual similarity. Experimental results show that ViConBERT outperforms strong baselines on WSD (F1 = 0.87) and achieves competitive performance on ViCon (AP = 0.88) and ViSim-400 (Spearman’s rho = 0.60), demonstrating its effectiveness in modeling both discrete senses and graded semantic relations. Our code, models, and data are available at https://github.com/tkhangg0910/ViConBERT

[28] Cmprsr: Abstractive Token-Level Question-Agnostic Prompt Compressor

Ivan Zakazov, Alexander Sharipov, Berke Argin, Oussama Gabouj, Kamel Charaf, Alexi Semiz, Lorenzo Drudi, Nicolas Baldwin, Robert West

Main category: cs.CL

TL;DR: Introduces a novel LLM prompt compression paradigm using smaller LLMs to compress inputs for larger ones, with comprehensive benchmarking and development of Cmprsr model that outperforms existing methods.

DetailsMotivation: High costs of using black-box Large Language Models (LLMs) motivate the need for efficient prompt compression to reduce computational expenses.

Method: Comprehensive benchmarking of 25 models, Textgrad-based meta-prompt optimization for gpt-4.1-mini, and post-training Qwen3-4B with SFT and GRPO for dual objectives of compression rate adherence and task performance.

Result: Cmprsr model demonstrates superiority over extractive and vanilla abstractive compression across all compression rates on lengthy inputs and short prompts, with fine control over cost-quality trade-off.

Conclusion: The proposed compression paradigm and Cmprsr model effectively address LLM cost issues while maintaining performance, showing generalizability across input lengths and domains.

Abstract: Motivated by the high costs of using black-box Large Language Models (LLMs), we introduce a novel prompt compression paradigm, under which we use smaller LLMs to compress inputs for the larger ones. We present the first comprehensive LLM-as-a-compressor benchmark spanning 25 open- and closed-source models, which reveals significant disparity in models’ compression ability in terms of (i) preserving semantically important information (ii) following the user-provided compression rate (CR). We further improve the performance of gpt-4.1-mini, the best overall vanilla compressor, with Textgrad-based compression meta-prompt optimization. We also identify the most promising open-source vanilla LLM - Qwen3-4B - and post-train it with a combination of supervised fine-tuning (SFT) and Group Relative Policy Optimization (GRPO), pursuing the dual objective of CR adherence and maximizing the downstream task performance. We call the resulting model Cmprsr and demonstrate its superiority over both extractive and vanilla abstractive compression across the entire range of compression rates on lengthy inputs from MeetingBank and LongBench as well as short prompts from GSM8k. The latter highlights Cmprsr’s generalizability across varying input lengths and domains. Moreover, Cmprsr closely follows the requested compression rate, offering fine control over the cost-quality trade-off.

[29] AugAbEx : Way Forward for Extractive Case Summarization

Purnima Bindal, Vikas Kumar, Sagar Rathore, Vasudha Bhatnagar

Main category: cs.CL

TL;DR: This paper creates extractive gold standard summaries from existing abstractive summaries for legal case summarization, addressing the annotation cost issue and providing enriched datasets for the research community.

DetailsMotivation: Legal document summarization is challenging due to complex language and jargon, and abstractive summaries risk misrepresenting legal nuances. There's a trend toward extractive summarizers, but creating gold standard extractive summaries is expensive.

Method: Engineered a light pipeline that transforms existing abstractive gold standard summaries into corresponding extractive versions, preserving expert opinions from the original summaries.

Result: Augmented seven existing case summarization datasets with extractive summaries and performed extensive evaluation covering structural, lexical, semantic dimensions and domain-level information.

Conclusion: The enriched datasets will be released publicly to advance automatic legal document summarization research, providing valuable resources for the community.

Abstract: Summarization of legal judgments poses a heavy cognitive burden on law practitioners due to the complexity of the language, context-sensitive legal jargon, and the length of the document. Therefore, the automatic summarization of legal documents has attracted serious attention from natural language processing researchers. Since the abstractive summaries of legal documents generated by deep neural methods remain prone to the risk of misrepresenting nuanced legal jargon or overlooking key contextual details, we envisage a rising trend toward the use of extractive case summarizers. Given the high cost of human annotation for gold standard extractive summaries, we engineer a light and transparent pipeline that leverages existing abstractive gold standard summaries to create the corresponding extractive gold standard versions. The approach ensures that the experts` opinions ensconced in the original gold standard abstractive summaries are carried over to the transformed extractive summaries. We aim to augment seven existing case summarization datasets, which include abstractive summaries, by incorporating corresponding extractive summaries and create an enriched data resource for case summarization research community. To ensure the quality of the augmented extractive summaries, we perform an extensive comparative evaluation with the original abstractive gold standard summaries covering structural, lexical, and semantic dimensions. We also compare the domain-level information of the two summaries. We commit to release the augmented datasets in the public domain for use by the research community and believe that the resource will offer opportunities to advance the field of automatic summarization of legal documents.

[30] Do LLMs and Humans Find the Same Questions Difficult? A Case Study on Japanese Quiz Answering

Naoya Sugiura, Kosuke Yamada, Yasuhiro Ogawa, Katsuhiko Toyama, Ryohei Sasano

Main category: cs.CL

TL;DR: LLMs struggle more than humans with quizzes not covered by Wikipedia and those requiring numerical answers.

DetailsMotivation: To investigate whether problems difficult for humans are also difficult for LLMs, specifically in quiz settings.

Method: Collected Japanese quiz data with questions, answers, and human correct response rates; prompted LLMs to answer under various settings and compared performance.

Result: LLMs perform worse than humans on quizzes not covered by Wikipedia entries and those requiring numerical answers.

Conclusion: LLMs have different difficulty patterns than humans, struggling with knowledge gaps and numerical reasoning.

Abstract: LLMs have achieved performance that surpasses humans in many NLP tasks. However, it remains unclear whether problems that are difficult for humans are also difficult for LLMs. This study investigates how the difficulty of quizzes in a buzzer setting differs between LLMs and humans. Specifically, we first collect Japanese quiz data including questions, answers, and correct response rate of humans, then prompted LLMs to answer the quizzes under several settings, and compare their correct answer rate to that of humans from two analytical perspectives. The experimental results showed that, compared to humans, LLMs struggle more with quizzes whose correct answers are not covered by Wikipedia entries, and also have difficulty with questions that require numerical answers.

[31] Don’t Think of the White Bear: Ironic Negation in Transformer Models Under Cognitive Load

Logan Mann, Nayan Saxena, Sarah Tandon, Chenhao Sun, Savar Toteja, Kevin Zhu

Main category: cs.CL

TL;DR: This paper investigates ironic rebound in LLMs - where negation instructions paradoxically increase accessibility of forbidden concepts, similar to human cognition. The study examines how different types of distractors affect rebound strength and whether models distinguish neutral vs negative framings.

DetailsMotivation: To understand how LLMs handle negation instructions and whether they exhibit the same ironic rebound phenomenon observed in human cognition, where suppressing a concept actually makes it more accessible.

Method: Two experiments: (1) varying distractor text types (semantic, syntactic, repetition) after negation to measure rebound strength; (2) testing polarity separation between neutral and negative framings. Also includes circuit tracing analysis to identify neural mechanisms.

Result: Rebound consistently occurs immediately after negation, intensifies with longer or semantic distractors, while repetition supports suppression. Stronger polarity separation correlates with more persistent rebound. Circuit analysis shows middle-layer attention heads amplify forbidden tokens while early layers suppress.

Conclusion: The findings link cognitive predictions of ironic rebound with mechanistic insights into long-context interference in LLMs, and the authors release ReboundBench dataset for future research.

Abstract: Negation instructions such as ‘do not mention $X$’ can paradoxically increase the accessibility of $X$ in human thought, a phenomenon known as ironic rebound. Large language models (LLMs) face the same challenge: suppressing a concept requires internally activating it, which may prime rebound instead of avoidance. We investigated this tension with two experiments. \textbf{(1) Load & content}: after a negation instruction, we vary distractor text (semantic, syntactic, repetition) and measure rebound strength. \textbf{(2) Polarity separation}: We test whether models distinguish neutral from negative framings of the same concept and whether this separation predicts rebound persistence. Results show that rebound consistently arises immediately after negation and intensifies with longer or semantic distractors, while repetition supports suppression. Stronger polarity separation correlates with more persistent rebound. Together, these findings, complemented by a circuit tracing analysis that identifies sparse middle-layer attention heads amplifying forbidden tokens while early layers suppress, link cognitive predictions of ironic rebound with mechanistic insights into long-context interference. To support future work, we release ReboundBench, a dataset of $5,000$ systematically varied negation prompts designed to probe rebound in LLMs.

[32] From Phonemes to Meaning: Evaluating Large Language Models on Tamil

Jeyarajalingam Varsha, Menan Velayuthan, Sumirtha Karunakaran, Rasan Nivethiga, Kengatharaiyer Sarveswaran

Main category: cs.CL

TL;DR: ILAKKANAM is the first Tamil-specific linguistic evaluation benchmark using 820 questions from Sri Lankan school exams, revealing that LLMs perform well on simple Tamil questions but struggle with complex linguistic tasks, with no correlation between overall performance and genuine linguistic understanding.

DetailsMotivation: LLMs' linguistic competence in low-resource, morphologically rich languages like Tamil remains unexplored, and existing multilingual benchmarks using translated English datasets fail to capture linguistic and cultural nuances.

Method: Created ILAKKANAM benchmark with 820 manually curated questions from Sri Lankan school Tamil exams, annotated by linguists across 5 linguistic categories and factual knowledge, covering Grades 1-13. Evaluated both closed-source and open-source LLMs using standardized framework.

Result: Gemini 2.5 achieved highest overall performance; open-source models lagged. All models performed well on lower-grade questions but declined as linguistic complexity increased. No strong correlation found between overall performance and ability to identify linguistic categories.

Conclusion: LLMs’ performance in Tamil appears driven by exposure rather than genuine linguistic understanding, highlighting the need for better linguistic grounding in low-resource languages.

Abstract: Large Language Models (LLMs) have shown strong generalization across tasks in high-resource languages; however, their linguistic competence in low-resource and morphologically rich languages such as Tamil remains largely unexplored. Existing multilingual benchmarks often rely on translated English datasets, failing to capture the linguistic and cultural nuances of the target language. To address this gap, we introduce ILAKKANAM, the first Tamil-specific linguistic evaluation benchmark manually curated using 820 questions from Sri Lankan school-level Tamil subject examination papers. Each question is annotated by trained linguists under five linguistic categories and a factual knowledge category, spanning Grades 1–13 to ensure broad linguistic coverage. We evaluate both closed-source and open-source LLMs using a standardized evaluation framework. Our results show that Gemini 2.5 achieves the highest overall performance, while open-source models lag behind, highlighting the gap in linguistic grounding. Category- and grade-wise analyses reveal that all models perform well on lower-grade questions but show a clear decline as linguistic complexity increases. Further, no strong correlation is observed between a model’s overall performance and its ability to identify linguistic categories, suggesting that performance may be driven by exposure rather than genuine understanding.

[33] Probing Preference Representations: A Multi-Dimensional Evaluation and Analysis Method for Reward Models

Chenglong Wang, Yifu Huo, Yang Gan, Yongyu Mu, Qiaozhi He, Murun Yang, Bei Li, Chunliang Zhang, Tongran Liu, Anxiang Ma, Zhengtao Yu, Jingbo Zhu, Tong Xiao

Main category: cs.CL

TL;DR: This paper introduces MRMBench, a multi-dimensional reward model benchmark with probing tasks to evaluate reward models across different preference dimensions, and proposes inference-time probing to enhance interpretability.

DetailsMotivation: Previous reward model evaluation methods use fixed pairwise ranking tests but fail to provide performance information on individual preference dimensions, limiting understanding of reward model capabilities.

Method: Constructed MRMBench with six probing tasks for different preference dimensions, and introduced inference-time probing to identify dimensions used during reward prediction and assess prediction confidence.

Result: MRMBench strongly correlates with LLM alignment performance, revealing that reward models often struggle with multi-dimensional preferences, and inference-time probing provides reliable confidence assessment.

Conclusion: MRMBench serves as a reliable benchmark for developing advanced reward models, highlighting the need for multi-objective optimization in reward modeling, while inference-time probing improves interpretability and alignment performance.

Abstract: Previous methods evaluate reward models by testing them on a fixed pairwise ranking test set, but they typically do not provide performance information on each preference dimension. In this work, we address the evaluation challenge of reward models by probing preference representations. To confirm the effectiveness of this evaluation method, we construct a Multi-dimensional Reward Model Benchmark (MRMBench), a collection of six probing tasks for different preference dimensions. We design it to favor and encourage reward models that better capture preferences across different dimensions. Furthermore, we introduce an analysis method, inference-time probing, which identifies the dimensions used during the reward prediction and enhances its interpretability. Through extensive experiments, we find that MRMBench strongly correlates with the alignment performance of large language models (LLMs), making it a reliable reference for developing advanced reward models. Our analysis of MRMBench evaluation results reveals that reward models often struggle to capture preferences across multiple dimensions, highlighting the potential of multi-objective optimization in reward modeling. Additionally, our findings show that the proposed inference-time probing method offers a reliable metric for assessing the confidence of reward predictions, which ultimately improves the alignment of LLMs.

[34] Assessing LLMs for Serendipity Discovery in Knowledge Graphs: A Case for Drug Repurposing

Mengying Wang, Chenhui Ma, Ao Jiao, Tuo Liang, Pengjun Lu, Shrinidhi Hegde, Yu Yin, Evren Gurkan-Cavusoglu, Yinghui Wu

Main category: cs.CL

TL;DR: SerenQA framework evaluates LLMs’ ability to find surprising and novel answers in knowledge graph question answering, revealing current models struggle with serendipity despite good retrieval performance.

DetailsMotivation: Existing KGQA systems focus on predictable answers, lacking the capacity to suggest surprising and novel ('serendipitous') insights that could lead to valuable discoveries like drug repurposing.

Method: Proposed SerenQA framework with serendipity metric (relevance, novelty, surprise), expert-annotated benchmark from Clinical Knowledge Graph, and structured pipeline with three subtasks: knowledge retrieval, subgraph reasoning, and serendipity exploration.

Result: State-of-the-art LLMs perform well on retrieval but struggle to identify genuinely surprising and valuable discoveries, showing significant room for improvement in serendipity detection.

Conclusion: The study highlights the gap in LLMs’ ability to uncover unexpected insights and provides a framework for evaluating and improving serendipity-aware KGQA systems.

Abstract: Large Language Models (LLMs) have greatly advanced knowledge graph question answering (KGQA), yet existing systems are typically optimized for returning highly relevant but predictable answers. A missing yet desired capacity is to exploit LLMs to suggest surprise and novel (“serendipitious”) answers. In this paper, we formally define the serendipity-aware KGQA task and propose the SerenQA framework to evaluate LLMs’ ability to uncover unexpected insights in scientific KGQA tasks. SerenQA includes a rigorous serendipity metric based on relevance, novelty, and surprise, along with an expert-annotated benchmark derived from the Clinical Knowledge Graph, focused on drug repurposing. Additionally, it features a structured evaluation pipeline encompassing three subtasks: knowledge retrieval, subgraph reasoning, and serendipity exploration. Our experiments reveal that while state-of-the-art LLMs perform well on retrieval, they still struggle to identify genuinely surprising and valuable discoveries, underscoring a significant room for future improvements. Our curated resources and extended version are released at: https://cwru-db-group.github.io/serenQA.

[35] SGuard-v1: Safety Guardrail for Large Language Models

JoonHo Lee, HyeonMin Cho, Jaewoong Yun, Hyunjae Lee, JunKyu Lee, Juree Seok

Main category: cs.CL

TL;DR: SGuard-v1 is a lightweight safety system for LLMs with two specialized models: ContentFilter for detecting harmful content and JailbreakFilter for screening adversarial prompts, achieving state-of-the-art safety performance.

DetailsMotivation: To provide effective safety guardrails for LLMs in human-AI conversations by detecting harmful content and adversarial attacks while being lightweight and interpretable.

Method: Built on 2B-parameter Granite-3.3-2B-Instruct model with instruction tuning on ~1.4M training instances. Uses two specialized components: ContentFilter trained on MLCommons hazard taxonomy and JailbreakFilter trained with curriculum learning covering 60 attack types.

Result: Achieves state-of-the-art safety performance on public and proprietary benchmarks while remaining lightweight. Provides multi-class safety predictions with binary confidence scores for improved interpretability.

Conclusion: SGuard-v1 offers an effective, lightweight safety solution for LLMs that reduces deployment overhead and improves interpretability, released under Apache-2.0 License to enable further research and deployment.

Abstract: We present SGuard-v1, a lightweight safety guardrail for Large Language Models (LLMs), which comprises two specialized models to detect harmful content and screen adversarial prompts in human-AI conversational settings. The first component, ContentFilter, is trained to identify safety risks in LLM prompts and responses in accordance with the MLCommons hazard taxonomy, a comprehensive framework for trust and safety assessment of AI. The second component, JailbreakFilter, is trained with a carefully designed curriculum over integrated datasets and findings from prior work on adversarial prompting, covering 60 major attack types while mitigating false-unsafe classification. SGuard-v1 is built on the 2B-parameter Granite-3.3-2B-Instruct model that supports 12 languages. We curate approximately 1.4 million training instances from both collected and synthesized data and perform instruction tuning on the base model, distributing the curated data across the two component according to their designated functions. Through extensive evaluation on public and proprietary safety benchmarks, SGuard-v1 achieves state-of-the-art safety performance while remaining lightweight, thereby reducing deployment overhead. SGuard-v1 also improves interpretability for downstream use by providing multi-class safety predictions and their binary confidence scores. We release the SGuard-v1 under the Apache-2.0 License to enable further research and practical deployment in AI safety.

[36] QA-Noun: Representing Nominal Semantics via Natural Language Question-Answer Pairs

Maria Tseytlin, Paul Roit, Omri Abend, Ido Dagan, Ayal Klein

Main category: cs.CL

TL;DR: QA-Noun is a QA-based framework that captures noun-centered semantic relations using 9 question templates, complementing existing QA-SRL for verbal relations to achieve comprehensive fine-grained semantic decomposition.

DetailsMotivation: Existing QA-based semantic approaches effectively model predicate-argument relations but largely ignore noun-centered semantics, creating a gap in comprehensive semantic alignment.

Method: Defines 9 question templates covering explicit syntactical and implicit contextual roles for nouns, producing interpretable QA pairs that integrate with QA-SRL for unified sentence decomposition.

Result: Achieves near-complete coverage of AMR’s noun arguments while surfacing additional contextual relations. Combined with QA-SRL, yields over 130% higher granularity than FactScore and DecompScore.

Conclusion: QA-Noun complements the broader QA-based semantic framework, providing a comprehensive and scalable approach to fine-grained semantic decomposition for cross-text alignment.

Abstract: Decomposing sentences into fine-grained meaning units is increasingly used to model semantic alignment. While QA-based semantic approaches have shown effectiveness for representing predicate-argument relations, they have so far left noun-centered semantics largely unaddressed. We introduce QA-Noun, a QA-based framework for capturing noun-centered semantic relations. QA-Noun defines nine question templates that cover both explicit syntactical and implicit contextual roles for nouns, producing interpretable QA pairs that complement verbal QA-SRL. We release detailed guidelines, a dataset of over 2,000 annotated noun mentions, and a trained model integrated with QA-SRL to yield a unified decomposition of sentence meaning into individual, highly fine-grained, facts. Evaluation shows that QA-Noun achieves near-complete coverage of AMR’s noun arguments while surfacing additional contextually implied relations, and that combining QA-Noun with QA-SRL yields over 130% higher granularity than recent fact-based decomposition methods such as FactScore and DecompScore. QA-Noun thus complements the broader QA-based semantic framework, forming a comprehensive and scalable approach to fine-grained semantic decomposition for cross-text alignment.

[37] TAdaRAG: Task Adaptive Retrieval-Augmented Generation via On-the-Fly Knowledge Graph Construction

Jie Zhang, Bo Tang, Wanzi Shao, Wenqiang Wei, Jihao Zhao, Jianqing Zhu, Zhiyu li, Wen Xi, Zehao Lin, Feiyu Xiong, Yanchao Tan

Main category: cs.CL

TL;DR: TAdaRAG is a novel RAG framework that addresses information loss and irrelevant details in traditional RAG by constructing task-adaptive knowledge graphs on-the-fly using intent-driven routing and extraction mechanisms.

DetailsMotivation: Traditional RAG suffers from information loss due to chunk truncation and irrelevant details from unstructured knowledge retrieval, leading to hallucinations and broken reasoning chains.

Method: Uses intent-driven routing to domain-specific extraction templates, supervised fine-tuning, and reinforcement learning-based implicit extraction for concise, coherent knowledge integration.

Result: Outperforms existing methods on six public benchmarks and real-world NowNewsQA across three backbone models, showing strong generalization across domains and long-text tasks.

Conclusion: TAdaRAG effectively addresses RAG limitations through task-adaptive knowledge graph construction, demonstrating practical effectiveness and strong generalization capabilities.

Abstract: Retrieval-Augmented Generation (RAG) improves large language models by retrieving external knowledge, often truncated into smaller chunks due to the input context window, which leads to information loss, resulting in response hallucinations and broken reasoning chains. Moreover, traditional RAG retrieves unstructured knowledge, introducing irrelevant details that hinder accurate reasoning. To address these issues, we propose TAdaRAG, a novel RAG framework for on-the-fly task-adaptive knowledge graph construction from external sources. Specifically, we design an intent-driven routing mechanism to a domain-specific extraction template, followed by supervised fine-tuning and a reinforcement learning-based implicit extraction mechanism, ensuring concise, coherent, and non-redundant knowledge integration. Evaluations on six public benchmarks and a real-world business benchmark (NowNewsQA) across three backbone models demonstrate that TAdaRAG outperforms existing methods across diverse domains and long-text tasks, highlighting its strong generalization and practical effectiveness.

[38] Mitigating Length Bias in RLHF through a Causal Lens

Hyeonji Kim, Sujeong Oh, Sanghack Lee

Main category: cs.CL

TL;DR: Proposes a causal framework with counterfactual data augmentation to mitigate length bias in RLHF reward models, where models favor longer responses by conflating verbosity with quality.

DetailsMotivation: RLHF-trained reward models exhibit systematic length bias, favoring longer responses by conflating verbosity with quality, which undermines accurate preference learning.

Method: Uses counterfactual data augmentation with two types of response pairs: length-divergent pairs with similar content, and content-divergent pairs of similar length, to train reward models to assess content quality independently of verbosity.

Result: Empirical evaluations show reduced length bias in reward assignment and more concise, content-focused outputs from policy models.

Conclusion: The approach effectively reduces length bias and improves robustness and content sensitivity of reward modeling in RLHF pipelines.

Abstract: Reinforcement learning from human feedback (RLHF) is widely used to align large language models (LLMs) with human preferences. However, RLHF-trained reward models often exhibit length bias – a systematic tendency to favor longer responses by conflating verbosity with quality. We propose a causal framework for analyzing and mitigating length bias in RLHF reward modeling. Central to our approach is a counterfactual data augmentation method that generates response pairs designed to isolate content quality from verbosity. These counterfactual examples are then used to train the reward model, enabling it to assess responses based on content quality independently of verbosity. Specifically, we construct (1) length-divergent pairs with similar content and (2) content-divergent pairs of similar length. Empirical evaluations show that our method reduces length bias in reward assignment and leads to more concise, content-focused outputs from the policy model. These findings demonstrate that the proposed approach effectively reduces length bias and improves the robustness and content sensitivity of reward modeling in RLHF pipelines.

[39] MMWOZ: Building Multimodal Agent for Task-oriented Dialogue

Pu-Hai Yang, Heyan Huang, Heng-Da Xu, Fanshu Sun, Xian-Ling Mao, Chaoxu Mu

Main category: cs.CL

TL;DR: The paper introduces MMWOZ, a multimodal dialogue dataset extending MultiWOZ 2.3 with GUI interactions, and proposes MATE as a baseline model for practical multimodal task-oriented dialogue systems.

DetailsMotivation: To bridge the gap between traditional task-oriented dialogue systems and real-world scenarios where front-end GUIs are prevalent but customized back-end APIs are often unavailable.

Method: 1) Developed a web-style GUI as front-end, 2) Created automated script to convert dialogue states and system actions into GUI operation instructions, 3) Collected web page snapshots with corresponding operation instructions, 4) Proposed MATE multimodal baseline model.

Result: Created MMWOZ dataset with GUI interactions and established baseline performance using the MATE model for multimodal task-oriented dialogue.

Conclusion: The work successfully addresses the practical gap in task-oriented dialogue systems by introducing multimodal capabilities and provides a foundation for developing more realistic dialogue agents that can interact with GUI interfaces.

Abstract: Task-oriented dialogue systems have garnered significant attention due to their conversational ability to accomplish goals, such as booking airline tickets for users. Traditionally, task-oriented dialogue systems are conceptualized as intelligent agents that interact with users using natural language and have access to customized back-end APIs. However, in real-world scenarios, the widespread presence of front-end Graphical User Interfaces (GUIs) and the absence of customized back-end APIs create a significant gap for traditional task-oriented dialogue systems in practical applications. In this paper, to bridge the gap, we collect MMWOZ, a new multimodal dialogue dataset that is extended from MultiWOZ 2.3 dataset. Specifically, we begin by developing a web-style GUI to serve as the front-end. Next, we devise an automated script to convert the dialogue states and system actions from the original dataset into operation instructions for the GUI. Lastly, we collect snapshots of the web pages along with their corresponding operation instructions. In addition, we propose a novel multimodal model called MATE (Multimodal Agent for Task-oriEnted dialogue) as the baseline model for the MMWOZ dataset. Furthermore, we conduct comprehensive experimental analysis using MATE to investigate the construction of a practical multimodal agent for task-oriented dialogue.

[40] Group-Aware Reinforcement Learning for Output Diversity in Large Language Models

Oron Anschel, Alon Shoshan, Adam Botach, Shunit Haviv Hakimi, Asaf Gendler, Emanuel Ben Baruch, Nadav Bhonker, Igor Kviatkovsky, Manoj Aggarwal, Gerard Medioni

Main category: cs.CL

TL;DR: GAPO is a group-aware policy optimization method that addresses mode collapse in LLMs by computing rewards over groups of completions, improving diversity without sacrificing accuracy on benchmarks.

DetailsMotivation: LLMs often suffer from mode collapse, repeatedly generating the same few completions even when many valid answers exist, limiting diversity across tasks.

Method: GAPO extends GRPO by computing rewards over the group as a whole, enabling learning from group-level properties like diversity and coverage. Uses frequency-aware reward function to encourage uniform sampling over valid completions.

Result: GAPO-trained models produce valid and more diverse responses. Improves response diversity without compromising accuracy on standard benchmarks (GSM8K, MATH, HumanEval, MMLU-Pro).

Conclusion: GAPO effectively addresses mode collapse in LLMs by leveraging group-level optimization, enhancing diversity while maintaining performance across various tasks.

Abstract: Large Language Models (LLMs) often suffer from mode collapse, repeatedly generating the same few completions even when many valid answers exist, limiting their diversity across a wide range of tasks. We introduce Group-Aware Policy Optimization (GAPO), a simple extension of the recent and popular Group Relative Policy Optimization (GRPO) that computes rewards over the group as a whole. GAPO enables learning from the group-level properties such as diversity and coverage. We demonstrate GAPO using a frequency-aware reward function that encourages uniform sampling over valid LLM completions, and show that GAPO-trained models produce valid and more diverse model responses. Beyond this setup, GAPO generalizes to open-ended prompts and improves response diversity without compromising accuracy on standard LLM benchmarks (GSM8K, MATH, HumanEval, MMLU-Pro). Our code will be made publicly available.

[41] Uni-MoE-2.0-Omni: Scaling Language-Centric Omnimodal Large Model with Advanced MoE, Training and Data

Yunxin Li, Xinyu Chen, Shenyuan Jiang, Haoyuan Shi, Zhenyu Liu, Xuanyu Zhang, Nanhao Deng, Zhenran Xu, Yicheng Ma, Meishan Zhang, Baotian Hu, Min Zhang

Main category: cs.CL

TL;DR: Uni-MoE 2.0 is a fully open-source omnimodal large model that advances multimodal understanding, reasoning, and generation through dynamic-capacity MoE design, progressive training, and multimodal data matching, achieving SOTA performance across 85 benchmarks.

DetailsMotivation: To advance Lychee's Uni-MoE series in language-centric multimodal capabilities and create a competitive open-source omnimodal model that can handle diverse cross-modal inputs and tasks efficiently.

Method: Built on Qwen2.5-7B architecture with dynamic-capacity MoE design, Omni-Modality 3D RoPE for cross-modality alignment, progressive training with iterative GSPO-DPO reinforcement, and trained on 75B tokens of multimodal data with special speech/image generation tokens.

Result: Achieves SOTA or highly competitive performance on 85 benchmarks, surpassing Qwen2.5-Omni on over 50 of 76 benchmarks, with key strengths in video understanding (+7%), omnimodal understanding (+7%), audiovisual reasoning (+4%), speech processing (4.2% WER reduction), and image processing/generation.

Conclusion: Uni-MoE 2.0 demonstrates that carefully designed MoE architectures with progressive training strategies can achieve state-of-the-art multimodal performance with significantly less training data, making high-quality omnimodal AI more accessible through open-source release.

Abstract: We present Uni-MoE 2.0 from the Lychee family. As a fully open-source omnimodal large model (OLM), it substantially advances Lychee’s Uni-MoE series in language-centric multimodal understanding, reasoning, and generating. Based on the Qwen2.5-7B dense architecture, we build Uni-MoE-2.0-Omni from scratch through three core contributions: dynamic-capacity Mixture-of-Experts (MoE) design, a progressive training strategy enhanced with an iterative reinforcement strategy, and a carefully curated multimodal data matching technique. It is capable of omnimodal understanding, as well as generating images, text, and speech. Architecturally, our new MoE framework balances computational efficiency and capability for 10 cross-modal inputs using shared, routed, and null experts, while our Omni-Modality 3D RoPE ensures spatio-temporal cross-modality alignment in the self-attention layer. For training, following cross-modal pretraining, we use a progressive supervised fine-tuning strategy that activates modality-specific experts and is enhanced by balanced data composition and an iterative GSPO-DPO method to stabilise RL training and improve reasoning. Data-wise, the base model, trained on approximately 75B tokens of open-source multimodal data, is equipped with special speech and image generation tokens, allowing it to learn these generative tasks by conditioning its outputs on linguistic cues. Extensive evaluation across 85 benchmarks demonstrates that our model achieves SOTA or highly competitive performance against leading OLMs, surpassing Qwen2.5-Omni (trained with 1.2T tokens) on over 50 of 76 benchmarks. Key strengths include video understanding (+7% avg. of 8), omnimodallity understanding (+7% avg. of 4), and audiovisual reasoning (+4%). It also advances long-form speech processing (reducing WER by 4.2%) and leads in low-level image processing and controllable generation across 5 metrics.

[42] Knots: A Large-Scale Multi-Agent Enhanced Expert-Annotated Dataset and LLM Prompt Optimization for NOTAM Semantic Parsing

Maoqi Liu, Quan Fang, Yang Yang, Can Zhao, Kaiquan Cai

Main category: cs.CL

TL;DR: Proposes NOTAM semantic parsing to address complex linguistic structures in aviation safety notices, creates Knots dataset with 12,347 expert-annotated NOTAMs, and evaluates various parsing techniques achieving improved aviation text understanding.

DetailsMotivation: NOTAMs contain critical flight safety information but their complex linguistic structures and implicit reasoning make automated parsing challenging. Existing research only handles surface-level tasks without deep semantic understanding.

Method: Proposed NOTAM semantic parsing task emphasizing semantic inference and aviation domain knowledge integration. Constructed Knots dataset with 12,347 expert-annotated NOTAMs using multi-agent collaborative framework. Evaluated prompt-engineering strategies and model-adaptation techniques.

Result: Achieved substantial improvements in aviation text understanding and processing through systematic evaluation of various parsing approaches. Demonstrated effectiveness of the proposed semantic parsing method.

Conclusion: The approach successfully addresses the gap in deep semantic understanding of NOTAMs and provides valuable insights for automated NOTAM analysis systems, with publicly available code for further research.

Abstract: Notice to Air Missions (NOTAMs) serve as a critical channel for disseminating key flight safety information, yet their complex linguistic structures and implicit reasoning pose significant challenges for automated parsing. Existing research mainly focuses on surface-level tasks such as classification and named entity recognition, lacking deep semantic understanding. To address this gap, we propose NOTAM semantic parsing, a task emphasizing semantic inference and the integration of aviation domain knowledge to produce structured, inference-rich outputs. To support this task, we construct Knots (Knowledge and NOTAM Semantics), a high-quality dataset of 12,347 expert-annotated NOTAMs covering 194 Flight Information Regions, enhanced through a multi-agent collaborative framework for comprehensive field discovery. We systematically evaluate a wide range of prompt-engineering strategies and model-adaptation techniques, achieving substantial improvements in aviation text understanding and processing. Our experimental results demonstrate the effectiveness of the proposed approach and offer valuable insights for automated NOTAM analysis systems. Our code is available at: https://github.com/Estrellajer/Knots.

[43] Reason-KE++: Aligning the Process, Not Just the Outcome, for Faithful LLM Knowledge Editing

Yuchen Wu, Liang Ding, Li Shen, Dacheng Tao

Main category: cs.CL

TL;DR: Reason-KE++ addresses LLM faithfulness gaps in multi-hop reasoning by using SFT+RL with stage-aware rewards for process-level supervision, achieving 95.48% SOTA on MQUAKE-CF-3k.

DetailsMotivation: SFT-based methods suffer from "faithfulness gap" where LLMs prioritize format mimicry over sound reasoning, allowing parametric priors to override contextual facts and cause factual hallucinations.

Method: Proposes Reason-KE++ framework combining SFT with RL and stage-aware reward mechanism that provides dense supervision for intermediate reasoning steps like decomposition and sub-answer correctness.

Result: Achieves new SOTA of 95.48% on MQUAKE-CF-3k (+5.28% improvement), demonstrating that naive outcome-only RL collapses reasoning integrity while process-aware alignment boosts performance.

Conclusion: For complex multi-hop reasoning tasks, aligning the reasoning process is essential for building trustworthy LLMs, as process-level faithfulness prevents factual hallucinations and ensures sound reasoning.

Abstract: Aligning Large Language Models (LLMs) to be faithful to new knowledge in complex, multi-hop reasoning tasks is a critical, yet unsolved, challenge. We find that SFT-based methods, e.g., Reason-KE, while state-of-the-art, suffer from a “faithfulness gap”: they optimize for format mimicry rather than sound reasoning. This gap enables the LLM’s powerful parametric priors to override new contextual facts, resulting in critical factual hallucinations (e.g., incorrectly reasoning “Houston” from “NASA” despite an explicit edit). To solve this core LLM alignment problem, we propose Reason-KE++, an SFT+RL framework that instills process-level faithfulness. Its core is a Stage-aware Reward mechanism that provides dense supervision for intermediate reasoning steps (e.g., Decomposition, Sub-answer Correctness). Crucially, we identify that naive outcome-only RL is a deceptive trap for LLM alignment: it collapses reasoning integrity (e.g., 19.00% Hop acc) while superficially boosting final accuracy. Our process-aware framework sets a new SOTA of 95.48% on MQUAKE-CF-3k (+5.28%), demonstrating that for complex tasks, aligning the reasoning process is essential for building trustworthy LLMs.

[44] Improving Direct Persian-English Speech-to-Speech Translation with Discrete Units and Synthetic Parallel Data

Sina Rashidi, Hossein Sameti

Main category: cs.CL

TL;DR: Direct speech-to-speech translation for Persian-English using synthetic data generation and self-supervised learning to overcome data scarcity in low-resource languages.

DetailsMotivation: Direct S2ST systems are attractive for their simplicity and low latency but require large parallel speech data, which is rarely available for low-resource languages like Persian.

Method: Three-component model: conformer-based encoder from self-supervised pre-training, causal transformer decoder with relative attention, and unit-based neural vocoder. Synthetic data generation pipeline using LLM translation and zero-shot TTS.

Result: Achieved 4.6 ASR BLEU improvement on Persian-English CVSS corpus with synthetic data, increasing available parallel speech by 6x.

Conclusion: Combining self-supervised pre-training, discrete speech units, and synthetic parallel data is effective for improving direct S2ST in low-resource language pairs.

Abstract: Direct speech-to-speech translation (S2ST), in which all components are trained jointly, is an attractive alternative to cascaded systems because it offers a simpler pipeline and lower inference latency. However, direct S2ST models require large amounts of parallel speech data in the source and target languages, which are rarely available for low-resource languages such as Persian. This paper presents a direct S2ST system for translating Persian speech into English speech, as well as a pipeline for synthetic parallel Persian-English speech generation. The model comprises three components: (1) a conformer-based encoder, initialized from self-supervised pre-training, maps source speech to high-level acoustic representations; (2) a causal transformer decoder with relative position multi-head attention translates these representations into discrete target speech units; (3) a unit-based neural vocoder generates waveforms from the predicted discrete units. To mitigate the data scarcity problem, we construct a new Persian-English parallel speech corpus by translating Persian speech transcriptions into English using a large language model and then synthesizing the corresponding English speech with a state-of-the-art zero-shot text-to-speech system. The resulting corpus increases the amount of available parallel speech by roughly a factor of six. On the Persian-English portion of the CVSS corpus, the proposed model achieves improvement of 4.6 ASR BLEU with the synthetic data over direct baselines. These results indicate that combining self-supervised pre-training, discrete speech units, and synthetic parallel data is effective for improving direct S2ST in low-resource language pairs such as Persian-English

[45] Evolve the Method, Not the Prompts: Evolutionary Synthesis of Jailbreak Attacks on LLMs

Yunhao Chen, Xin Wang, Juncheng Li, Yixu Wang, Jie Li, Yan Teng, Yingchun Wang, Xingjun Ma

Main category: cs.CL

TL;DR: EvoSynth is an autonomous framework that uses evolutionary synthesis to create novel code-based jailbreak methods for LLMs, achieving 85.5% attack success rate against robust models like Claude-Sonnet-4.5.

DetailsMotivation: Existing automated red teaming frameworks are limited to selecting or refining pre-existing attack strategies, lacking the ability to autonomously invent new attack mechanisms.

Method: Multi-agent system that autonomously engineers, evolves, and executes novel code-based attack algorithms with a code-level self-correction loop for iterative rewriting of attack logic.

Result: Achieved 85.5% Attack Success Rate against Claude-Sonnet-4.5 and generated significantly more diverse attacks than existing methods.

Conclusion: EvoSynth establishes a new state-of-the-art and opens a new direction for evolutionary synthesis of jailbreak methods.

Abstract: Automated red teaming frameworks for Large Language Models (LLMs) have become increasingly sophisticated, yet they share a fundamental limitation: their jailbreak logic is confined to selecting, combining, or refining pre-existing attack strategies. This binds their creativity and leaves them unable to autonomously invent entirely new attack mechanisms. To overcome this gap, we introduce \textbf{EvoSynth}, an autonomous framework that shifts the paradigm from attack planning to the evolutionary synthesis of jailbreak methods. Instead of refining prompts, EvoSynth employs a multi-agent system to autonomously engineer, evolve, and execute novel, code-based attack algorithms. Crucially, it features a code-level self-correction loop, allowing it to iteratively rewrite its own attack logic in response to failure. Through extensive experiments, we demonstrate that EvoSynth not only establishes a new state-of-the-art by achieving an 85.5% Attack Success Rate (ASR) against highly robust models like Claude-Sonnet-4.5, but also generates attacks that are significantly more diverse than those from existing methods. We release our framework to facilitate future research in this new direction of evolutionary synthesis of jailbreak methods. Code is available at: https://github.com/dongdongunique/EvoSynth.

[46] Adaptive Focus Memory for Language Models

Christopher Cruz

Main category: cs.CL

TL;DR: AFM is a dynamic context manager that reduces LLM token usage by 66% while maintaining safety performance, using adaptive fidelity levels for past messages based on relevance, recency, and importance.

DetailsMotivation: Current LLM memory strategies are inefficient - full conversation replay is expensive, while summarization and recency heuristics often lose safety-critical details like user allergies.

Method: AFM assigns each past message one of three fidelity levels (FULL, COMPRESSED, PLACEHOLDER) based on semantic similarity to current query, recency weighting, and importance classification, packing messages chronologically under token budget constraints.

Result: In safety benchmarks with peanut allergy scenarios, AFM retains allergy information across conversations, matches naive replay’s safety performance, and reduces average token usage by 66%.

Conclusion: AFM enables significant cost reduction in LLM inference without sacrificing safety or factual continuity, with modular implementation available for OpenAI-compatible APIs.

Abstract: Large language models (LLMs) are increasingly deployed in multi-turn dialogue settings, but their behavior is still bottlenecked by fixed context windows and naive memory strategies. Replaying the full conversation at every turn is simple but expensive, while static summarization or recency-only heuristics often erase safety-critical user details. We present Adaptive Focus Memory (AFM), a dynamic context manager that assigns each past message one of three fidelity levels – FULL, COMPRESSED, or PLACEHOLDER – based on semantic similarity to the current query, half-life recency weighting, and importance classification. AFM packs messages chronologically under a strict token budget, preferring high fidelity for the most relevant turns while aiming to preserve a cheap trace of the dialogue. In a safety-oriented benchmark involving a user with a severe peanut allergy planning a trip to Thailand, AFM retains the allergy across both short and medium-length conversations, matches the safety performance of naive replay, and cuts average token usage by 66% relative to a replay baseline. We release a modular Python implementation of AFM designed for OpenAI-compatible APIs and offline operation, enabling practitioners to reduce inference cost without sacrificing safety or factual continuity in the evaluated scenario.

[47] On the Brittleness of LLMs: A Journey around Set Membership

Lea Hergert, Gábor Berend, Mario Szegedy, Gyorgy Turan, Márk Jelasity

Main category: cs.CL

TL;DR: LLMs fail on simple set membership queries despite excelling at complex reasoning, revealing fundamental brittleness in their understanding of basic concepts.

DetailsMotivation: To investigate the paradox where LLMs achieve superhuman performance on complex reasoning but fail on simple problems, raising concerns about reliability and interpretability.

Method: Systematic empirical evaluation of set membership queries across prompt phrasing, semantic structure, element ordering, and model choice using large-scale controlled experiments.

Result: LLM performance on elementary set membership tasks is consistently brittle and unpredictable across all dimensions, indicating fragmented and convoluted understanding of basic set concepts.

Conclusion: The simplicity of set membership problems enables comprehensive mapping of LLM failure modes, making this approach a valuable methodology for general LLM evaluation.

Abstract: Large language models (LLMs) achieve superhuman performance on complex reasoning tasks, yet often fail on much simpler problems, raising concerns about their reliability and interpretability. We investigate this paradox through a focused study with two key design features: simplicity, to expose basic failure modes, and scale, to enable comprehensive controlled experiments. We focus on set membership queries – among the most fundamental forms of reasoning – using tasks like Is apple an element of the set \{pear, plum, apple, raspberry\}?''. We conduct a systematic empirical evaluation across prompt phrasing, semantic structure, element ordering, and model choice. Our large-scale analysis reveals that LLM performance on this elementary task is consistently brittle, and unpredictable across all dimensions, suggesting that the models' understanding’’ of the set concept is fragmented and convoluted at best. Our work demonstrates that the large-scale experiments enabled by the simplicity of the problem allow us to map and analyze the failure modes comprehensively, making this approach a valuable methodology for LLM evaluation in general.

[48] Evidence of Phase Transitions in Small Transformer-Based Language Models

Noah Hong, Tao Hong

Main category: cs.CL

TL;DR: Phase transitions in language models occur not just in large models but also in small transformers, detectable directly in linear training space and early in training through vocabulary analysis and statistical measures.

DetailsMotivation: To investigate whether phase transitions are unique to large models or observable in small transformers, detectable in linear training space without log rescaling, and whether they emerge early in training.

Method: Train a small GPT-style transformer on character-level corpus, track vocabulary usage evolution (average word length, correct/incorrect words, vocabulary diversity), and apply Poisson/sub-Poisson statistics to quantify word connections and reorganization.

Result: Reveals distinct transition points during training not visible in standard loss/validation curves but detectable through vocabulary- and statistics-based probes, showing phase transitions occur even in modest models.

Conclusion: Phase-transition reorganizations are a general feature of language model training, observable in small models, detectable in linear space, and occurring early as coherence emerges, providing new insights into nonlinear training dynamics.

Abstract: Phase transitions have been proposed as the origin of emergent abilities in large language models (LLMs), where new capabilities appear abruptly once models surpass critical thresholds of scale. Prior work, such as that of Wei et al., demonstrated these phenomena under model and data scaling, with transitions revealed after applying a log scale to training compute. In this work, we ask three complementary questions: (1) Are phase transitions unique to large models, or can they also be observed in small transformer-based language models? (2) Can such transitions be detected directly in linear training space, rather than only after log rescaling? and (3) Can these transitions emerge at early stages of training? To investigate, we train a small GPT-style transformer on a character-level corpus and analyze the evolution of vocabulary usage throughout training. We track the average word length, the number of correct versus incorrect words, and shifts in vocabulary diversity. Building on these measures, we apply Poisson and sub-Poisson statistics to quantify how words connect and reorganize. This combined analysis reveals a distinct transition point during training. Notably, these transitions are not apparent in standard loss or validation curves, but become visible through our vocabulary- and statistics-based probes. Our findings suggest that phase-transition reorganizations are a general feature of language model training, observable even in modest models, detectable directly in linear training space, and occurring surprisingly early as coherence emerges. This perspective provides new insight into the nonlinear dynamics of language model training and underscores the importance of tailored metrics for uncovering phase transition behaviors

[49] LLM Reinforcement in Context

Thomas Rivasseau

Main category: cs.CL

TL;DR: The paper proposes using interruptions (control sentences added periodically to user input) to strengthen LLM alignment and prevent jailbreaking as input length increases.

DetailsMotivation: Current LLM alignment methods don't scale well with input length, and jailbreak probability increases with longer conversations, creating a gap in scalable alignment solutions.

Method: Adding interruptions (control sentences) to user input approximately every x tokens, which can be generalized to Chain-of-Thought processes to prevent scheming behavior.

Result: The paper proposes this method but does not provide experimental results in the abstract.

Conclusion: Interruptions offer a potential scalable solution to strengthen LLM alignment against jailbreaking as input length increases.

Abstract: Current Large Language Model alignment research mostly focuses on improving model robustness against adversarial attacks and misbehavior by training on examples and prompting. Research has shown that LLM jailbreak probability increases with the size of the user input or conversation length. There is a lack of appropriate research into means of strengthening alignment which also scale with user input length. We propose interruptions as a possible solution to this problem. Interruptions are control sentences added to the user input approximately every x tokens for some arbitrary x. We suggest that this can be generalized to the Chain-of-Thought process to prevent scheming.

[50] Evaluating Autoformalization Robustness via Semantically Similar Paraphrasing

Hayden Moore, Asfahan Shah

Main category: cs.CL

TL;DR: LLMs show performance variability in autoformalization when given semantically similar paraphrased natural language inputs, with minor changes significantly impacting output quality.

DetailsMotivation: To investigate LLM sensitivity to paraphrased natural language inputs in autoformalization, building on findings from text-to-SQL research that showed LLM sensitivity to semantic-preserving paraphrases.

Method: Evaluated LLM robustness by generating paraphrased natural language statements from MiniF2F and Lean 4 ProofNet benchmarks, measuring semantic and compilation validity across two modern LLMs through cross-evaluation.

Result: Found significant performance variability across paraphrased inputs, demonstrating that minor shifts in natural language statements can substantially impact model outputs.

Conclusion: LLMs remain sensitive to paraphrased natural language inputs in autoformalization tasks, highlighting robustness challenges despite their impressive capabilities.

Abstract: Large Language Models (LLMs) have recently emerged as powerful tools for autoformalization. Despite their impressive performance, these models can still struggle to produce grounded and verifiable formalizations. Recent work in text-to-SQL, has revealed that LLMs can be sensitive to paraphrased natural language (NL) inputs, even when high degrees of semantic fidelity are preserved (Safarzadeh, Oroojlooyjadid, and Roth 2025). In this paper, we investigate this claim in the autoformalization domain. Specifically, we evaluate the robustness of LLMs generating formal proofs with semantically similar paraphrased NL statements by measuring semantic and compilation validity. Using the formal benchmarks MiniF2F (Zheng, Han, and Polu 2021) and Lean 4 version of ProofNet (Xin et al. 2024), and two modern LLMs, we generate paraphrased natural language statements and cross-evaluate these statements across both models. The results of this paper reveal performance variability across paraphrased inputs, demonstrating that minor shifts in NL statements can significantly impact model outputs.

[51] BioMedJImpact: A Comprehensive Dataset and LLM Pipeline for AI Engagement and Scientific Impact Analysis of Biomedical Journals

Ruiyu Wang, Yuzhang Xie, Xiao Hu, Carl Yang, Jiaying Lu

Main category: cs.CL

TL;DR: BioMedJImpact is a comprehensive dataset for analyzing journal impact in biomedicine, integrating bibliometrics, collaboration features, and AI engagement indicators extracted via a novel LLM pipeline.

DetailsMotivation: To address the lack of open resources that capture how collaboration structures and AI research jointly shape journal prestige in biomedicine.

Method: Built from 1.74M PubMed Central articles across 2,744 journals, using bibliometric indicators, collaboration features, and a reproducible three-stage LLM pipeline for AI engagement extraction.

Result: Journals with higher collaboration intensity achieve greater citation impact, and AI engagement has become an increasingly strong correlate of journal prestige, especially in quartile rankings.

Conclusion: BioMedJImpact serves as both a comprehensive dataset and validated methodological framework for content-aware scientometric analysis of scientific impact and innovation dynamics in biomedicine.

Abstract: Assessing journal impact is central to scholarly communication, yet existing open resources rarely capture how collaboration structures and artificial intelligence (AI) research jointly shape venue prestige in biomedicine. We present BioMedJImpact, a large-scale, biomedical-oriented dataset designed to advance journal-level analysis of scientific impact and AI engagement. Built from 1.74 million PubMed Central articles across 2,744 journals, BioMedJImpact integrates bibliometric indicators, collaboration features, and LLM-derived semantic indicators for AI engagement. Specifically, the AI engagement feature is extracted through a reproducible three-stage LLM pipeline that we propose. Using this dataset, we analyze how collaboration intensity and AI engagement jointly influence scientific impact across pre- and post-pandemic periods (2016-2019, 2020-2023). Two consistent trends emerge: journals with higher collaboration intensity, particularly those with larger and more diverse author teams, tend to achieve greater citation impact, and AI engagement has become an increasingly strong correlate of journal prestige, especially in quartile rankings. To further validate the three-stage LLM pipeline we proposed for deriving the AI engagement feature, we conduct human evaluation, confirming substantial agreement in AI relevance detection and consistent subfield classification. Together, these contributions demonstrate that BioMedJImpact serves as both a comprehensive dataset capturing the intersection of biomedicine and AI, and a validated methodological framework enabling scalable, content-aware scientometric analysis of scientific impact and innovation dynamics. Code is available at https://github.com/JonathanWry/BioMedJImpact.

[52] From Passive to Persuasive: Steering Emotional Nuance in Human-AI Negotiation

Niranjan Chebrolu, Gerard Christopher Yeo, Kokil Jaidka

Main category: cs.CL

TL;DR: Targeted activation engineering enables LLaMA 3.1-8B to exhibit more human-like emotional nuances without extensive fine-tuning by identifying key intervention points and applying emotional expression vectors.

DetailsMotivation: Current LLMs lack nuanced human-like emotional expression, and existing alignment techniques are either surface-level or require extensive fine-tuning, creating a need for more precise emotional steering methods.

Method: Used attribution patching to identify causally influential components, derived emotional expression vectors from activation differences between contrastive text pairs (positive vs. negative emotional examples), and applied these vectors to conversational prompts.

Result: Steered responses showed increased positive sentiment (joy, trust) and more frequent first-person pronoun usage, indicating greater personal engagement and enhanced emotional characteristics.

Conclusion: The approach provides a precise and interpretable framework for emotional steering in conversational AI, offering new directions for studying and implementing human-like emotional expression in language models.

Abstract: Large Language Models (LLMs) demonstrate increasing conversational fluency, yet instilling them with nuanced, human-like emotional expression remains a significant challenge. Current alignment techniques often address surface-level output or require extensive fine-tuning. This paper demonstrates that targeted activation engineering can steer LLaMA 3.1-8B to exhibit more human-like emotional nuances. We first employ attribution patching to identify causally influential components, to find a key intervention locus by observing activation patterns during diagnostic conversational tasks. We then derive emotional expression vectors from the difference in the activations generated by contrastive text pairs (positive vs. negative examples of target emotions). Applying these vectors to new conversational prompts significantly enhances emotional characteristics: steered responses show increased positive sentiment (e.g., joy, trust) and more frequent first-person pronoun usage, indicative of greater personal engagement. Our findings offer a precise and interpretable framework and new directions for the study of conversational AI.

[53] Quantifying consistency and accuracy of Latent Dirichlet Allocation

Saranzaya Magsarjav, Melissa Humphries, Jonathan Tuke, Lewis Mitchell

Main category: cs.CL

TL;DR: LDA topic models show internal consistency across multiple runs but fail to recover true underlying topics, raising concerns about whether they capture meaningful patterns or just noise.

DetailsMotivation: Probabilistic topic models produce inconsistent results due to their stochastic nature, affecting replicability and reliability, and raising doubts about whether they capture meaningful topics or just noise.

Method: Defined a new stability measure combining accuracy and consistency, used LDA’s generative properties to create corpora with ground truth topics, and ran LDA 50 times on generated corpora to analyze output variability.

Result: LDA can correctly determine the underlying number of topics but is more internally consistent than accurate - multiple reruns return similar topics that don’t match the true underlying topics.

Conclusion: LDA demonstrates internal consistency across runs but fails to recover true topics, suggesting it may capture patterns that are consistent but not necessarily meaningful or accurate representations of the underlying content.

Abstract: Topic modelling in Natural Language Processing uncovers hidden topics in large, unlabelled text datasets. It is widely applied in fields such as information retrieval, content summarisation, and trend analysis across various disciplines. However, probabilistic topic models can produce different results when rerun due to their stochastic nature, leading to inconsistencies in latent topics. Factors like corpus shuffling, rare text removal, and document elimination contribute to these variations. This instability affects replicability, reliability, and interpretation, raising concerns about whether topic models capture meaningful topics or just noise. To address these problems, we defined a new stability measure that incorporates accuracy and consistency and uses the generative properties of LDA to generate a new corpus with ground truth. These generated corpora are run through LDA 50 times to determine the variability in the output. We show that LDA can correctly determine the underlying number of topics in the documents. We also find that LDA is more internally consistent, as the multiple reruns return similar topics; however, these topics are not the true topics.

[54] NeuroLex: A Lightweight Domain Language Model for EEG Report Understanding and Generation

Kang Yin, Hye-Bin Shin

Main category: cs.CL

TL;DR: NeuroLex is a domain-specific language model trained on EEG reports that outperforms general models in EEG text tasks with better accuracy and robustness.

DetailsMotivation: General-purpose language models fail to capture the domain-specific linguistic conventions and diagnostic characteristics of clinical EEG reports.

Method: Used span-corruption pretraining and instruction-style fine-tuning on EEG report polishing, paragraph summarization, and terminology question answering from the Harvard EEG Database.

Result: Achieves lower perplexity, higher extraction/summarization accuracy, better label efficiency, and improved robustness to negation and factual hallucination compared to general models of same scale.

Conclusion: NeuroLex bridges biomedical text modeling and brain-computer interface applications, providing a foundation for interpretable and language-driven neural decoding.

Abstract: Clinical electroencephalogram (EEG) reports encode domain-specific linguistic conventions that general-purpose language models (LMs) fail to capture. We introduce NeuroLex, a lightweight domain-adaptive language model trained purely on EEG report text from the Harvard Electroencephalography Database. Unlike existing biomedical LMs, NeuroLex is tailored to the linguistic and diagnostic characteristics of EEG reporting, enabling it to serve as both an independent textual model and a decoder backbone for multimodal EEG-language systems. Using span-corruption pretraining and instruction-style fine-tuning on report polishing, paragraph summarization, and terminology question answering, NeuroLex learns the syntax and reasoning patterns characteristic of EEG interpretation. Comprehensive evaluations show that it achieves lower perplexity, higher extraction and summarization accuracy, better label efficiency, and improved robustness to negation and factual hallucination compared with general models of the same scale. With an EEG-aware linguistic backbone, NeuroLex bridges biomedical text modeling and brain-computer interface applications, offering a foundation for interpretable and language-driven neural decoding.

[55] From Perception to Reasoning: Deep Thinking Empowers Multimodal Large Language Models

Wenxin Zhu, Andong Chen, Yuchen Song, Kehai Chen, Conghui Zhu, Ziyan Chen, Tiejun Zhao

Main category: cs.CL

TL;DR: This paper provides a systematic review of Multimodal Chain-of-Thought (MCoT), analyzing its background, methods, evaluation, applications, challenges, and future directions for enhancing reasoning in multimodal large language models.

DetailsMotivation: To address challenges in Multimodal Large Language Models (MLLMs) such as opaque reasoning paths and insufficient generalization, by extending Chain-of-Thought reasoning from language models to multimodal domains to improve reasoning transparency and interpretability.

Method: Systematic review approach analyzing MCoT from three aspects: CoT paradigms, post-training stage, and inference stage, while examining underlying mechanisms and organizing existing research systematically.

Result: Comprehensive analysis of MCoT’s theoretical foundations, methodological approaches, evaluation benchmarks, application scenarios, and identification of current challenges in the field.

Conclusion: MCoT shows promise for enhancing multimodal reasoning capabilities but faces challenges that require future research directions to address current limitations and advance the field.

Abstract: With the remarkable success of Multimodal Large Language Models (MLLMs) in perception tasks, enhancing their complex reasoning capabilities has emerged as a critical research focus. Existing models still suffer from challenges such as opaque reasoning paths and insufficient generalization ability. Chain-of-Thought (CoT) reasoning, which has demonstrated significant efficacy in language models by enhancing reasoning transparency and output interpretability, holds promise for improving model reasoning capabilities when extended to the multimodal domain. This paper provides a systematic review centered on “Multimodal Chain-of-Thought” (MCoT). First, it analyzes the background and theoretical motivations for its inception from the perspectives of technical evolution and task demands. Then, it introduces mainstream MCoT methods from three aspects: CoT paradigms, the post-training stage, and the inference stage, while also analyzing their underlying mechanisms. Furthermore, the paper summarizes existing evaluation benchmarks and metrics, and discusses the application scenarios of MCoT. Finally, it analyzes the challenges currently facing MCoT and provides an outlook on its future research directions.

[56] Classification of Hope in Textual Data using Transformer-Based Models

Chukwuebuka Fortunate Ijezue, Tania-Amanda Fredrick Eneye, Maaz Amjad

Main category: cs.CL

TL;DR: Transformer-based approach for hope expression classification using BERT, GPT-2, and DeBERTa architectures, with BERT showing best performance and efficiency.

DetailsMotivation: To develop computational methods for analyzing hope expressions in text, with applications in mental health and social media analysis.

Method: Developed and compared three transformer architectures (BERT, GPT-2, DeBERTa) for binary and multiclass hope classification tasks.

Result: BERT achieved best performance (84.49% binary, 72.03% multiclass accuracy) with lowest computational cost. GPT-2 had lowest accuracy but excelled at sarcasm detection (92.46% recall).

Conclusion: Architectural suitability may outweigh model size for specialized emotion detection tasks, providing a framework for computational hope analysis.

Abstract: This paper presents a transformer-based approach for classifying hope expressions in text. We developed and compared three architectures (BERT, GPT-2, and DeBERTa) for both binary classification (Hope vs. Not Hope) and multiclass categorization (five hope-related categories). Our initial BERT implementation achieved 83.65% binary and 74.87% multiclass accuracy. In the extended comparison, BERT demonstrated superior performance (84.49% binary, 72.03% multiclass accuracy) while requiring significantly fewer computational resources (443s vs. 704s training time) than newer architectures. GPT-2 showed lowest overall accuracy (79.34% binary, 71.29% multiclass), while DeBERTa achieved moderate results (80.70% binary, 71.56% multiclass) but at substantially higher computational cost (947s for multiclass training). Error analysis revealed architecture-specific strengths in detecting nuanced hope expressions, with GPT-2 excelling at sarcasm detection (92.46% recall). This study provides a framework for computational analysis of hope, with applications in mental health and social media analysis, while demonstrating that architectural suitability may outweigh model size for specialized emotion detection tasks.

Desheng Hu, Joachim Baumann, Aleksandra Urman, Elsa Lichtenegger, Robin Forsberg, Aniko Hannak, Christo Wilson

Main category: cs.CL

TL;DR: Google’s AI Overviews and Featured Snippets show concerning inconsistencies and lack medical safeguards in health information, with 33% inconsistency between features and only 11% of AIO and 7% of FS containing medical disclaimers.

DetailsMotivation: To evaluate the quality and consistency of AI-generated health information in Google Search features like AI Overviews and Featured Snippets, which users rely on but have no control over.

Method: Systematic algorithm audit of 1,508 baby care and pregnancy queries using a robust evaluation framework assessing answer consistency, relevance, medical safeguards, source categories, and sentiment alignment.

Result: 33% inconsistency between AIO and FS on same page; high relevance but critically lacking medical safeguards (11% AIO, 7% FS); health/wellness websites dominate sources but FS often link to commercial sources.

Conclusion: Concerning gaps in AI-mediated health information quality demonstrate need for stronger controls; methodology provides transferable framework for auditing AI systems in high-stakes domains impacting user well-being.

Abstract: Google Search increasingly surfaces AI-generated content through features like AI Overviews (AIO) and Featured Snippets (FS), which users frequently rely on despite having no control over their presentation. Through a systematic algorithm audit of 1,508 real baby care and pregnancy-related queries, we evaluate the quality and consistency of these information displays. Our robust evaluation framework assesses multiple quality dimensions, including answer consistency, relevance, presence of medical safeguards, source categories, and sentiment alignment. Our results reveal concerning gaps in information consistency, with information in AIO and FS displayed on the same search result page being inconsistent with each other in 33% of cases. Despite high relevance scores, both features critically lack medical safeguards (present in just 11% of AIO and 7% of FS responses). While health and wellness websites dominate source categories for both, AIO and FS, FS also often link to commercial sources. These findings have important implications for public health information access and demonstrate the need for stronger quality controls in AI-mediated health information. Our methodology provides a transferable framework for auditing AI systems across high-stakes domains where information quality directly impacts user well-being.

[58] Visual Room 2.0: Seeing is Not Understanding for MLLMs

Haokun Li, Yazhou Zhang, Jizhi Ding, Qiuchi Li, Peng Zhang

Main category: cs.CL

TL;DR: The paper introduces Visual Room 2.0, a hierarchical benchmark evaluating MLLMs’ perception-cognition alignment across 17 tasks, finding that MLLMs have stronger perceptual than cognitive abilities and that cognition doesn’t causally depend on perception.

DetailsMotivation: To test whether multi-modal large language models (MLLMs) truly understand what they see, extending Searle's Chinese Room argument to the visual domain - proposing that seeing is not understanding.

Method: Created Visual Room 2.0 benchmark with 350 multi-modal samples and 2,100 questions across three hierarchical levels (low, middle, high) covering perception tasks (attribute recognition to scene understanding) and cognition tasks (textual entailment to causal/social reasoning).

Result: Evaluation of 10 SoTA MLLMs showed: (1) 8.0% higher perceptual than cognitive competence, (2) cognition not causally dependent on perception-based reasoning, (3) cognition scales with model size but perception doesn’t consistently improve with larger variants.

Conclusion: The work operationalizes ‘Seeing ≠ Understanding’ as a testable hypothesis and provides a new paradigm for evaluating MLLMs from perceptual processing to cognitive reasoning.

Abstract: Can multi-modal large language models (MLLMs) truly understand what they can see? Extending Searle’s Chinese Room into the multi-modal domain, this paper proposes the Visual Room argument: MLLMs may describe every visual detail precisely yet fail to comprehend the underlying emotions and intentions, namely seeing is not understanding. Building on this, we introduce \textit{Visual Room} 2.0, a hierarchical benchmark for evaluating perception-cognition alignment of MLLMs. We model human perceptive and cognitive processes across three levels: low, middle, and high, covering 17 representative tasks. The perception component ranges from attribute recognition to scene understanding, while the cognition component extends from textual entailment to causal and social reasoning. The dataset contains 350 multi-modal samples, each with six progressive questions (2,100 in total) spanning perception to cognition. Evaluating 10 state-of-the-art (SoTA) MLLMs, we highlight three key findings: (1) MLLMs exhibit stronger perceptual competence than cognitive ability (8.0%$\uparrow$); (2) cognition appears not causally dependent on perception-based reasoning; and (3) cognition scales with model size, but perception does not consistently improve with larger variants. This work operationalizes Seeing $\ne$ Understanding as a testable hypothesis, offering a new paradigm from perceptual processing to cognitive reasoning in MLLMs. Our dataset is available at https://huggingface.co/datasets/LHK2003/PCBench.

[59] Fine-Tuned LLMs Know They Don’t Know: A Parameter-Efficient Approach to Recovering Honesty

Zeyu Shi, Ziming Wang, Tianyu Chen, Shiqi Gao, Haoyi Zhou, Qingyun Sun, Jianxin Li

Main category: cs.CL

TL;DR: HCNR surgically restores honesty in fine-tuned LLMs by identifying and repairing expression-governing neurons, achieving significant honesty recovery with much less data and faster than baseline methods.

DetailsMotivation: Supervised fine-tuning severely undermines LLM honesty in high-stakes domains, but existing recovery methods are data-intensive and assume deep corruption of knowledge boundary recognition, which may not be accurate.

Method: HCNR identifies honesty-critical neurons that govern expression and restores them to pre-trained state while harmonizing with task-oriented neurons using Hessian-guided compensation.

Result: Recovers 33.25% of compromised honesty across four QA tasks and five LLM families, with at least 2.23x speedup and over 10x less data compared to baselines.

Conclusion: HCNR offers a practical solution for trustworthy LLM deployment by surgically repairing suppressed honesty expression capacity rather than assuming deep corruption.

Abstract: The honesty of Large Language Models (LLMs) is increasingly important for safe deployment in high-stakes domains. However, this crucial trait is severely undermined by supervised fine-tuning (SFT), a common technique for model specialization. Existing recovery methods rely on data-intensive global parameter adjustments, implicitly assuming that SFT deeply corrupts the models’ ability to recognize their knowledge boundaries. However, we observe that fine-tuned LLMs still preserve this ability; what is damaged is their capacity to faithfully express that awareness. Building on this, we propose Honesty-Critical Neurons Restoration (HCNR) to surgically repair this suppressed capacity. HCNR identifies and restores key expression-governing neurons to their pre-trained state while harmonizing them with task-oriented neurons via Hessian-guided compensation. Experiments on four QA tasks and five LLM families demonstrate that HCNR effectively recovers 33.25% of the compromised honesty while achieving at least 2.23x speedup with over 10x less data compared to baseline methods, offering a practical solution for trustworthy LLM deployment.

[60] AA-Omniscience: Evaluating Cross-Domain Knowledge Reliability in Large Language Models

Declan Jackson, William Keating, George Cameron, Micah Hill-Smith

Main category: cs.CL

TL;DR: AA-Omniscience benchmark evaluates language models on factual recall and knowledge calibration across 6,000 questions from 42 topics in 6 domains, using an Omniscience Index metric that penalizes hallucinations and rewards appropriate abstention.

DetailsMotivation: Existing evaluations focus on general capabilities but reliable model use requires factual accuracy and recognition of knowledge gaps across domains.

Method: Created benchmark with 6,000 questions from authoritative sources covering 42 topics in 6 economically relevant domains. Measures Omniscience Index (-100 to 100) that jointly evaluates factual recall while penalizing hallucinations and rewarding abstention when uncertain.

Result: Claude 4.1 Opus achieved highest score (4.8), one of only three models scoring above zero. Performance varies by domain with different labs leading across the six domains, revealing persistent factuality and calibration weaknesses in frontier models.

Conclusion: Models should be chosen based on specific domain requirements rather than general performance for knowledge-intensive tasks, due to significant performance variability across domains.

Abstract: Existing language model evaluations primarily measure general capabilities, yet reliable use of these models across a range of domains demands factual accuracy and recognition of knowledge gaps. We introduce AA-Omniscience, a benchmark designed to measure both factual recall and knowledge calibration across 6,000 questions. Questions are derived from authoritative academic and industry sources, and cover 42 economically relevant topics within six different domains. The evaluation measures a model’s Omniscience Index, a bounded metric (-100 to 100) measuring factual recall that jointly penalizes hallucinations and rewards abstention when uncertain, with 0 equating to a model that answers questions correctly as much as it does incorrectly. Among evaluated models, Claude 4.1 Opus attains the highest score (4.8), making it one of only three models to score above zero. These results reveal persistent factuality and calibration weaknesses across frontier models. Performance also varies by domain, with the models from three different research labs leading across the six domains. This performance variability suggests models should be chosen according to the demands of the use case rather than general performance for tasks where knowledge is important.

[61] How Good is BLI as an Alignment Measure: A Study in Word Embedding Paradigm

Kasun Wickramasinghe, Nisansa de Silva

Main category: cs.CL

TL;DR: This paper critically examines whether multilingual embeddings are universally superior to aligned monolingual embeddings, using Bilingual Lexicon Induction (BLI) as the evaluation metric. It identifies limitations of BLI and proposes novel stem-based approaches and vocabulary pruning techniques for more accurate alignment assessment.

DetailsMotivation: To investigate if multilingual embeddings truly outperform aligned monolingual embeddings in all aspects, questioning whether the higher computational cost of multilingual models is always justified and exploring potential compromises between the two approaches.

Method: Evaluates traditional embedding alignment techniques, novel multilingual models, and combined alignment techniques on BLI tasks across high-resource and low-resource languages. Proposes stem-based BLI approach and vocabulary pruning technique to address limitations of conventional word-based BLI.

Result: Found that BLI doesn’t always measure true alignment accurately. Combined embedding alignment techniques generally perform better, while multilingual embeddings excel mainly in low-resource language scenarios. Language family relationships impact performance.

Conclusion: Multilingual embeddings aren’t universally superior - their advantage is context-dependent, particularly strong for low-resource languages. Proposed novel evaluation methods provide more accurate assessment of embedding space alignment.

Abstract: Sans a dwindling number of monolingual embedding studies originating predominantly from the low-resource domains, it is evident that multilingual embedding has become the de facto choice due to its adaptability to the usage of code-mixed languages, granting the ability to process multilingual documents in a language-agnostic manner, as well as removing the difficult task of aligning monolingual embeddings. But is this victory complete? Are the multilingual models better than aligned monolingual models in every aspect? Can the higher computational cost of multilingual models always be justified? Or is there a compromise between the two extremes? Bilingual Lexicon Induction is one of the most widely used metrics in terms of evaluating the degree of alignment between two embedding spaces. In this study, we explore the strengths and limitations of BLI as a measure to evaluate the degree of alignment of two embedding spaces. Further, we evaluate how well traditional embedding alignment techniques, novel multilingual models, and combined alignment techniques perform BLI tasks in the contexts of both high-resource and low-resource languages. In addition to that, we investigate the impact of the language families to which the pairs of languages belong. We identify that BLI does not measure the true degree of alignment in some cases and we propose solutions for them. We propose a novel stem-based BLI approach to evaluate two aligned embedding spaces that take into account the inflected nature of languages as opposed to the prevalent word-based BLI techniques. Further, we introduce a vocabulary pruning technique that is more informative in showing the degree of the alignment, especially performing BLI on multilingual embedding models. Often, combined embedding alignment techniques perform better while in certain cases multilingual embeddings perform better (mainly low-resource language cases).

[62] Spark-Prover-X1: Formal Theorem Proving Through Diverse Data Training

Xinyuan Zhou, Yi Lei, Xiaoyu Zhou, Jingyi Sun, Yu Zhu, Zhongyi Ye, Weitai Zhang, Quan Liu, Si Wei, Cong Liu

Main category: cs.CL

TL;DR: Spark-Prover-X1-7B is a 7B parameter model trained via a three-stage framework to enhance theorem proving capabilities in lightweight LLMs, achieving state-of-the-art performance among similarly-sized open-source models.

DetailsMotivation: To address the scarcity of diverse and high-quality formal language data that constrains progress in automated theorem proving with LLMs.

Method: Three-stage framework: 1) Continuous pre-training on mathematical corpus with novel data tasks including CoT-augmented state prediction, 2) Supervised Fine-tuning in expert iteration loop, 3) Group Relative Policy Optimization for challenging problems.

Result: Achieves 37.0% average pass rate (pass@32), solves 27 problems on PutnamBench, and achieves 24.0% on CombiBench, demonstrating state-of-the-art performance among similarly-sized models.

Conclusion: The diverse training data and progressively refined training pipeline effectively enhances formal reasoning capabilities of lightweight LLMs.

Abstract: Large Language Models (LLMs) have shown significant promise in automated theorem proving, yet progress is often constrained by the scarcity of diverse and high-quality formal language data. To address this issue, we introduce Spark-Prover-X1, a 7B parameter model trained via an three-stage framework designed to unlock the reasoning potential of more accessible and moderately-sized LLMs. The first stage infuses deep knowledge through continuous pre-training on a broad mathematical corpus, enhanced by a suite of novel data tasks. Key innovation is a “CoT-augmented state prediction” task to achieve fine-grained reasoning. The second stage employs Supervised Fine-tuning (SFT) within an expert iteration loop to specialize both the Spark-Prover-X1-7B and Spark-Formalizer-X1-7B models. Finally, a targeted round of Group Relative Policy Optimization (GRPO) is applied to sharpen the prover’s capabilities on the most challenging problems. To facilitate robust evaluation, particularly on problems from real-world examinations, we also introduce ExamFormal-Bench, a new benchmark dataset of 402 formal problems. Experimental results demonstrate that Spark-Prover-X1-7B achieves state-of-the-art performance among similarly-sized open-source models, attaining a 37.0% average pass rate (pass@32). It shows exceptional performance on difficult competition benchmarks, notably solving 27 problems on PutnamBench (pass@32) and achieving 24.0% on CombiBench (pass@32). Our work validates that this diverse training data and progressively refined training pipeline provides an effective path for enhancing the formal reasoning capabilities of lightweight LLMs. Both Spark-Prover-X1-7B and Spark-Formalizer-X1-7B, along with the ExamFormal-Bench dataset, are made publicly available at:https://www.modelscope.cn/organization/iflytek, https://gitcode.com/ifly_opensource.

[63] BeDiscovER: The Benchmark of Discourse Understanding in the Era of Reasoning Language Models

Chuyuan Li, Giuseppe Carenini

Main category: cs.CL

TL;DR: BeDiscovER is a comprehensive benchmark for evaluating discourse understanding in modern LLMs, covering 5 tasks across 52 datasets at lexicon, sentential, and document levels.

DetailsMotivation: To create an up-to-date evaluation suite for assessing discourse-level knowledge in modern reasoning language models, covering both established and novel discourse challenges.

Method: Compiled 5 publicly available discourse tasks across different levels (lexicon, sentential, documental) with 52 datasets, including discourse parsing, temporal relation extraction, discourse particle disambiguation, and multilingual discourse relation classification.

Result: State-of-the-art models show strong performance in arithmetic temporal reasoning but struggle with full document reasoning and subtle semantic/discourse phenomena like rhetorical relation recognition.

Conclusion: Current LLMs have limitations in comprehensive discourse understanding, particularly with document-level reasoning and nuanced discourse phenomena, highlighting areas for future improvement.

Abstract: We introduce BeDiscovER (Benchmark of Discourse Understanding in the Era of Reasoning Language Models), an up-to-date, comprehensive suite for evaluating the discourse-level knowledge of modern LLMs. BeDiscovER compiles 5 publicly available discourse tasks across discourse lexicon, (multi-)sentential, and documental levels, with in total 52 individual datasets. It covers both extensively studied tasks such as discourse parsing and temporal relation extraction, as well as some novel challenges such as discourse particle disambiguation (e.g., ``just’’), and also aggregates a shared task on Discourse Relation Parsing and Treebanking for multilingual and multi-framework discourse relation classification. We evaluate open-source LLMs: Qwen3 series, DeepSeek-R1, and frontier model such as GPT-5-mini on BeDiscovER, and find that state-of-the-art models exhibit strong performance in arithmetic aspect of temporal reasoning, but they struggle with full document reasoning and some subtle semantic and discourse phenomena, such as rhetorical relation recognition.

[64] Evaluating the Ability of Large Language Models to Identify Adherence to CONSORT Reporting Guidelines in Randomized Controlled Trials: A Methodological Evaluation Study

Zhichao He, Mouxiao Bian, Jianhong Zhu, Jiayuan Chen, Yunqiu Wang, Wenxia Zhao, Tianbin Li, Bing Han, Jie Xu, Junyan Wu

Main category: cs.CL

TL;DR: LLMs show modest accuracy in identifying CONSORT adherence in RCTs, with good performance on compliant items but poor detection of non-compliant and not applicable items, making them suitable only as preliminary screening tools.

DetailsMotivation: Manual verification of CONSORT adherence is laborious and time-consuming, creating bottlenecks in peer review and evidence synthesis. This study aimed to evaluate if LLMs could automate this process.

Method: Systematic evaluation of contemporary LLMs using a golden standard dataset of 150 published RCTs across medical specialties, under zero-shot settings. Primary outcome was macro-averaged F1-score for three-class classification.

Result: Overall performance was modest with top models Gemini-2.5-Flash and DeepSeek-R1 achieving macro F1 scores of 0.634. Models performed well on compliant items (F1 > 0.850) but poorly on non-compliant and not applicable items (F1 < 0.400). GPT-4o underperformed with F1-score of 0.521.

Conclusion: LLMs show potential as preliminary screening assistants for identifying well-reported items but cannot reliably detect reporting omissions or methodological flaws, making them unsuitable for replacing human expertise in trial quality appraisal.

Abstract: The Consolidated Standards of Reporting Trials statement is the global benchmark for transparent and high-quality reporting of randomized controlled trials. Manual verification of CONSORT adherence is a laborious, time-intensive process that constitutes a significant bottleneck in peer review and evidence synthesis. This study aimed to systematically evaluate the accuracy and reliability of contemporary LLMs in identifying the adherence of published RCTs to the CONSORT 2010 statement under a zero-shot setting. We constructed a golden standard dataset of 150 published RCTs spanning diverse medical specialties. The primary outcome was the macro-averaged F1-score for the three-class classification task, supplemented by item-wise performance metrics and qualitative error analysis. Overall model performance was modest. The top-performing models, Gemini-2.5-Flash and DeepSeek-R1, achieved nearly identical macro F1 scores of 0.634 and Cohen’s Kappa coefficients of 0.280 and 0.282, respectively, indicating only fair agreement with expert consensus. A striking performance disparity was observed across classes: while most models could identify compliant items with high accuracy (F1 score > 0.850), they struggled profoundly with identifying non-compliant and not applicable items, where F1 scores rarely exceeded 0.400. Notably, some high-profile models like GPT-4o underperformed, achieving a macro F1-score of only 0.521. LLMs show potential as preliminary screening assistants for CONSORT checks, capably identifying well-reported items. However, their current inability to reliably detect reporting omissions or methodological flaws makes them unsuitable for replacing human expertise in the critical appraisal of trial quality.

[65] Extracting Events Like Code: A Multi-Agent Programming Framework for Zero-Shot Event Extraction

Quanjiang Guo, Sijie Wang, Jinchuan Zhang, Ben Zhang, Zhao Kang, Ling Tian, Ke Yan

Main category: cs.CL

TL;DR: Agent-Event-Coder (AEC) is a multi-agent framework that treats zero-shot event extraction as a structured code-generation process, using specialized agents for retrieval, planning, coding, and verification to produce schema-consistent outputs.

DetailsMotivation: Zero-shot event extraction is challenging for LLMs due to complex reasoning needs, leading to incomplete outputs, misclassified triggers, missing arguments, and schema violations with direct prompting approaches.

Method: AEC decomposes event extraction into specialized subtasks handled by dedicated LLM agents: retrieval, planning, coding, and verification. Event schemas are represented as executable class definitions for deterministic validation and iterative refinement.

Result: Experiments across five diverse domains and six LLMs show AEC consistently outperforms prior zero-shot baselines, demonstrating improved precision, completeness, and schema consistency in event extraction.

Conclusion: Treating event extraction as a structured code-generation process through collaborative agent workflows enables LLMs to achieve precise, complete, and schema-consistent extractions in zero-shot settings.

Abstract: Zero-shot event extraction (ZSEE) remains a significant challenge for large language models (LLMs) due to the need for complex reasoning and domain-specific understanding. Direct prompting often yields incomplete or structurally invalid outputs–such as misclassified triggers, missing arguments, and schema violations. To address these limitations, we present Agent-Event-Coder (AEC), a novel multi-agent framework that treats event extraction like software engineering: as a structured, iterative code-generation process. AEC decomposes ZSEE into specialized subtasks–retrieval, planning, coding, and verification–each handled by a dedicated LLM agent. Event schemas are represented as executable class definitions, enabling deterministic validation and precise feedback via a verification agent. This programming-inspired approach allows for systematic disambiguation and schema enforcement through iterative refinement. By leveraging collaborative agent workflows, AEC enables LLMs to produce precise, complete, and schema-consistent extractions in zero-shot settings. Experiments across five diverse domains and six LLMs demonstrate that AEC consistently outperforms prior zero-shot baselines, showcasing the power of treating event extraction like code generation. The code and data are released on https://github.com/UESTC-GQJ/Agent-Event-Coder.

[66] A Comparative Analysis of Recurrent and Attention Architectures for Isolated Sign Language Recognition

Nigar Alishzade, Gulchin Abdullayeva

Main category: cs.CL

TL;DR: Comparative analysis shows Vanilla Transformer outperforms ConvLSTM for sign language recognition, achieving up to 76.8% Top-1 accuracy on Azerbaijani Sign Language and 88.3% on American Sign Language datasets.

DetailsMotivation: To systematically compare recurrent and attention-based neural architectures for isolated sign language recognition and understand their respective strengths and trade-offs.

Method: Implemented and evaluated two models - ConvLSTM (recurrent) and Vanilla Transformer (attention-based) on Azerbaijani Sign Language Dataset (AzSLD) and Word-Level American Sign Language (WLASL) dataset.

Result: Vanilla Transformer consistently outperformed ConvLSTM in both Top-1 and Top-5 accuracy across datasets, achieving 76.8% Top-1 accuracy on AzSLD and 88.3% on WLASL. ConvLSTM was more computationally efficient but less accurate.

Conclusion: Transformers excel in accuracy and signer independence while ConvLSTM offers computational efficiency and better temporal modeling, providing guidance for architecture selection based on application requirements and resource constraints.

Abstract: This study presents a systematic comparative analysis of recurrent and attention-based neural architectures for isolated sign language recognition. We implement and evaluate two representative models-ConvLSTM and Vanilla Transformer-on the Azerbaijani Sign Language Dataset (AzSLD) and the Word-Level American Sign Language (WLASL) dataset. Our results demonstrate that the attention-based Vanilla Transformer consistently outperforms the recurrent ConvLSTM in both Top-1 and Top-5 accuracy across datasets, achieving up to 76.8% Top-1 accuracy on AzSLD and 88.3% on WLASL. The ConvLSTM, while more computationally efficient, lags in recognition accuracy, particularly on smaller datasets. These findings highlight the complementary strengths of each paradigm: the Transformer excels in overall accuracy and signer independence, whereas the ConvLSTM offers advantages in computational efficiency and temporal modeling. The study provides a nuanced analysis of these trade-offs, offering guidance for architecture selection in sign language recognition systems depending on application requirements and resource constraints.

[67] Zero-Shot Grammar Competency Estimation Using Large Language Model Generated Pseudo Labels

Sourya Dipta Das, Shubham Kumar, Kuldeep Yadav

Main category: cs.CL

TL;DR: Zero-shot grammar competency estimation framework using LLM-generated pseudo labels from unlabeled data, achieving high accuracy without manual annotations.

DetailsMotivation: Grammar assessment in spoken language is challenging due to spontaneous, unstructured nature, and expert annotation requirements make large-scale data creation impractical.

Method: Use LLM-generated predictions on unlabeled data with grammar competency rubric prompts as pseudo labels, train transformer model with noise-handling framework.

Result: Model achieves high accuracy in grammar competency estimation; LLM choice and clean-to-noisy sample ratio critically affect performance and stability.

Conclusion: Framework enables scalable, low-resource grammar assessment systems with robust and interpretable performance.

Abstract: Grammar competency estimation is essential for assessing linguistic proficiency in both written and spoken language; however, the spoken modality presents additional challenges due to its spontaneous, unstructured, and disfluent nature. Developing accurate grammar scoring models further requires extensive expert annotation, making large-scale data creation impractical. To address these limitations, we propose a zero-shot grammar competency estimation framework that leverages unlabeled data and Large Language Models (LLMs) without relying on manual labels. During training, we employ LLM-generated predictions on unlabeled data by using grammar competency rubric-based prompts. These predictions, treated as pseudo labels, are utilized to train a transformer-based model through a novel training framework designed to handle label noise effectively. We show that the choice of LLM for pseudo-label generation critically affects model performance and that the ratio of clean-to-noisy samples during training strongly influences stability and accuracy. Finally, a qualitative analysis of error intensity and score prediction confirms the robustness and interpretability of our approach. Experimental results demonstrate the efficacy of our approach in estimating grammar competency scores with high accuracy, paving the way for scalable, low-resource grammar assessment systems.

[68] Distinguishing Repetition Disfluency from Morphological Reduplication in Bangla ASR Transcripts: A Novel Corpus and Benchmarking Analysis

Zaara Zabeen Arpa, Sadnam Sakib Apurbo, Nazia Karim Khan Oishee, Ajwad Abrar

Main category: cs.CL

TL;DR: First Bangla corpus distinguishing repetition disfluency from morphological reduplication in ASR transcripts, with fine-tuned BanglaBERT achieving 84.78% accuracy.

DetailsMotivation: Standard disfluency correction fails in Bangla by deleting valid linguistic information when word repetitions can be either ASR errors or grammatical reduplication.

Method: Created 20,000-row manually annotated Bangla corpus, benchmarked with multilingual LLMs (few-shot prompting) and fine-tuned encoder models (BanglaBERT).

Result: LLMs achieved up to 82.68% accuracy, while fine-tuned BanglaBERT achieved highest accuracy of 84.78% and F1 score of 0.677.

Conclusion: Establishes linguistically-informed baseline for semantic-preserving text normalization in Bangla, with fine-tuning proving superior to LLM prompting.

Abstract: Automatic Speech Recognition (ASR) transcripts, especially in low-resource languages like Bangla, contain a critical ambiguity: word-word repetitions can be either Repetition Disfluency (unintentional ASR error/hesitation) or Morphological Reduplication (a deliberate grammatical construct). Standard disfluency correction fails by erroneously deleting valid linguistic information. To solve this, we introduce the first publicly available, 20,000-row Bangla corpus, manually annotated to explicitly distinguish between these two phenomena in noisy ASR transcripts. We benchmark this novel resource using two paradigms: state-of-the-art multilingual Large Language Models (LLMs) and task-specific fine-tuning of encoder models. LLMs achieve competitive performance (up to 82.68% accuracy) with few-shot prompting. However, fine-tuning proves superior, with the language-specific BanglaBERT model achieving the highest accuracy of 84.78% and an F1 score of 0.677. This establishes a strong, linguistically-informed baseline and provides essential data for developing sophisticated, semantic-preserving text normalization systems for Bangla.

[69] TCM-5CEval: Extended Deep Evaluation Benchmark for LLM’s Comprehensive Clinical Research Competence in Traditional Chinese Medicine

Tianai Huang, Jiayuan Chen, Lu Lu, Pengcheng Chen, Tianbin Li, Bing Han, Wenchao Tang, Jie Xu, Ming Li

Main category: cs.CL

TL;DR: TCM-5CEval is a comprehensive benchmark evaluating LLMs in Traditional Chinese Medicine across 5 dimensions, revealing performance disparities and exposing critical reasoning fragility through permutation testing.

DetailsMotivation: LLMs show exceptional general capabilities but require rigorous evaluation in specialized, culturally-rich fields like TCM. Prior work (TCM-3CEval) identified knowledge gaps and cultural-contextual alignment needs, prompting a more granular assessment.

Method: Developed TCM-5CEval benchmark with 5 dimensions: Core Knowledge, Classical Literacy, Clinical Decision-making, Chinese Materia Medica, and Clinical Non-pharmacological Therapy. Evaluated 15 prominent LLMs and conducted permutation-based consistency testing to assess reasoning stability.

Result: Significant performance disparities among models, with deepseek_r1 and gemini_2_5_pro as top performers. Models proficient in foundational knowledge but struggle with classical text interpretation. Permutation testing revealed widespread fragility - all models showed substantial performance degradation with varied question option ordering, indicating positional bias sensitivity.

Conclusion: TCM-5CEval provides detailed diagnostic capabilities for LLMs in TCM and exposes fundamental weaknesses in reasoning stability. The benchmark is available on Medbench platform to promote standardized comparison and further research.

Abstract: Large language models (LLMs) have demonstrated exceptional capabilities in general domains, yet their application in highly specialized and culturally-rich fields like Traditional Chinese Medicine (TCM) requires rigorous and nuanced evaluation. Building upon prior foundational work such as TCM-3CEval, which highlighted systemic knowledge gaps and the importance of cultural-contextual alignment, we introduce TCM-5CEval, a more granular and comprehensive benchmark. TCM-5CEval is designed to assess LLMs across five critical dimensions: (1) Core Knowledge (TCM-Exam), (2) Classical Literacy (TCM-LitQA), (3) Clinical Decision-making (TCM-MRCD), (4) Chinese Materia Medica (TCM-CMM), and (5) Clinical Non-pharmacological Therapy (TCM-ClinNPT). We conducted a thorough evaluation of fifteen prominent LLMs, revealing significant performance disparities and identifying top-performing models like deepseek_r1 and gemini_2_5_pro. Our findings show that while models exhibit proficiency in recalling foundational knowledge, they struggle with the interpretative complexities of classical texts. Critically, permutation-based consistency testing reveals widespread fragilities in model inference. All evaluated models, including the highest-scoring ones, displayed a substantial performance degradation when faced with varied question option ordering, indicating a pervasive sensitivity to positional bias and a lack of robust understanding. TCM-5CEval not only provides a more detailed diagnostic tool for LLM capabilities in TCM but aldso exposes fundamental weaknesses in their reasoning stability. To promote further research and standardized comparison, TCM-5CEval has been uploaded to the Medbench platform, joining its predecessor in the “In-depth Challenge for Comprehensive TCM Abilities” special track.

[70] Translation Entropy: A Statistical Framework for Evaluating Translation Systems

Ronit D. Gross, Yanir Harel, Ido Kanter

Main category: cs.CL

TL;DR: A quantitative method for estimating translation entropy by analyzing token replacement probabilities that preserve identical translations, enabling objective benchmarking of machine translators.

DetailsMotivation: The need for quantitative objective methods to assess machine translation performance, as current methods lack measurable entropy-based evaluation despite the prevalence of encoder-decoder architectures.

Method: Analyze statistics of token replacements in pivot sentences that yield identical translations, calculating probabilities of token substitutions while preserving translation output to estimate translation entropy.

Result: Translation entropy can be measured and enhanced along decoder blocks; method allows quantitative ranking of translators and reveals asymmetric mutual translation entropy; multi-token replacement shows multiplicative degeneracy effect.

Conclusion: Translation entropy is established as a measurable property enabling objective benchmarking of artificial translators, with validation on MarianMT, T5-Base and NLLB-200 systems.

Abstract: The translation of written language has been known since the 3rd century BC; however, its necessity has become increasingly common in the information age. Today, many translators exist, based on encoder-decoder deep architectures, nevertheless, no quantitative objective methods are available to assess their performance, likely because the entropy of even a single language remains unknown. This study presents a quantitative method for estimating translation entropy, with the following key finding. Given a translator, several sentences that differ by only one selected token of a given pivot sentence yield identical translations. Analyzing the statistics of this phenomenon across an ensemble of such sentences, consisting each of a pivot selected token, yields the probabilities of replacing this specific token with others while preserving the translation. These probabilities constitute the entropy of the selected token, and the average across all selected pivot tokens provides an estimate of the translator’s overall translation entropy, which is enhanced along the decoder blocks. This entropic measure allows for the quantitative ranking of several publicly available translators and reveals whether mutual translation entropy is symmetric. Extending the proposed method to include the replacement of two tokens in a given pivot sentence demonstrates a multiplicative effect, where translation degeneracy is proportional to the product of the degeneracies of the two tokens. These findings establish translation entropy as a measurable property and objective benchmarking of artificial translators. Results are based on MarianMT, T5-Base and NLLB-200 translators.

[71] Evaluating Large Language Models for Diacritic Restoration in Romanian Texts: A Comparative Study

Mihai Dan Nadas, Laura Diosan

Main category: cs.CL

TL;DR: Evaluation of various LLMs for Romanian diacritic restoration, showing GPT-4o achieves high accuracy while Llama models show variability.

DetailsMotivation: Automatic diacritic restoration is crucial for text processing in languages with rich diacritical marks like Romanian.

Method: Tested multiple LLMs (GPT-3.5, GPT-4, GPT-4o, Gemini 1.0 Pro, Llama 2/3, Mixtral 8x7B, airoboros 70B, RoLlama 2 7B) using comprehensive corpus with various prompt templates from zero-shot to multi-shot instructions.

Result: GPT-4o achieves high diacritic restoration accuracy, consistently surpassing baseline, while Meta’s Llama family exhibits wider variability.

Conclusion: Model architecture, training data, and prompt design significantly impact diacritic restoration performance, outlining directions for improving NLP tools for diacritic-rich languages.

Abstract: Automatic diacritic restoration is crucial for text processing in languages with rich diacritical marks, such as Romanian. This study evaluates the performance of several large language models (LLMs) in restoring diacritics in Romanian texts. Using a comprehensive corpus, we tested models including OpenAI’s GPT-3.5, GPT-4, GPT-4o, Google’s Gemini 1.0 Pro, Meta’s Llama 2 and Llama 3, MistralAI’s Mixtral 8x7B Instruct, airoboros 70B, and OpenLLM-Ro’s RoLlama 2 7B, under multiple prompt templates ranging from zero-shot to complex multi-shot instructions. Results show that models such as GPT-4o achieve high diacritic restoration accuracy, consistently surpassing a neutral echo baseline, while others, including Meta’s Llama family, exhibit wider variability. These findings highlight the impact of model architecture, training data, and prompt design on diacritic restoration performance and outline promising directions for improving NLP tools for diacritic-rich languages.

[72] Seeing isn’t Hearing: Benchmarking Vision Language Models at Interpreting Spectrograms

Tyler Loakman, Joseph James, Chenghua Lin

Main category: cs.CL

TL;DR: VLMs struggle to interpret speech spectrograms and waveforms, performing near chance levels in phonetic transcription tasks despite being trained on multimodal data.

DetailsMotivation: To benchmark VLMs' ability to act as phoneticians by interpreting speech representations (spectrograms and waveforms) and understand if they can perform specialized tasks requiring domain-specific knowledge.

Method: Created a dataset of 4k+ English words with spectrogram/waveform figures, tested VLMs through multiple-choice tasks with phonemic/graphemic transcription prediction using distractor transcriptions based on phonemic edit distance.

Result: Both zero-shot and finetuned VLMs rarely performed above chance level, indicating poor understanding of speech representations.

Conclusion: VLMs require specific parametric knowledge for interpreting speech figures, not just paired multimodal samples, highlighting limitations in their phonetic analysis capabilities.

Abstract: With the rise of Large Language Models (LLMs) and their vision-enabled counterparts (VLMs), numerous works have investigated their capabilities in tasks that fuse the modalities of vision and language. In this work, we benchmark the extent to which VLMs are able to act as highly-trained phoneticians, interpreting spectrograms and waveforms of speech. To do this, we synthesise a novel dataset containing 4k+ English words spoken in isolation alongside stylistically consistent spectrogram and waveform figures. We test the ability of VLMs to understand these representations of speech through a multiple-choice task whereby models must predict the correct phonemic or graphemic transcription of a spoken word when presented amongst 3 distractor transcriptions that have been selected based on their phonemic edit distance to the ground truth. We observe that both zero-shot and finetuned models rarely perform above chance, demonstrating the requirement for specific parametric knowledge of how to interpret such figures, rather than paired samples alone.

[73] Souper-Model: How Simple Arithmetic Unlocks State-of-the-Art LLM Performance

Shalini Maiti, Amar Budhiraja, Bhavul Gauri, Gaurav Chaurasia, Anton Protopopov, Alexis Audran-Reiss, Michael Slater, Despoina Magka, Tatiana Shavrina, Roberta Raileanu, Yoram Bachrach

Main category: cs.CL

TL;DR: SoCE is a model souping method that uses benchmark composition to identify expert models for weakly-correlated categories and combines them with optimized weighted averaging instead of uniform weights.

DetailsMotivation: Traditional LLM training is resource-intensive, and existing model souping approaches use uniform averaging which doesn't account for category-specific expertise in models.

Method: Identify weakly-correlated benchmark categories, select expert models for each category cluster, and apply non-uniform weighted averaging to combine them.

Result: Improves performance and robustness across multiple domains including multilingual capabilities, tool calling, and math; achieves state-of-the-art results on Berkeley Function Calling Leaderboard.

Conclusion: SoCE provides a principled approach to model souping that outperforms uniform averaging by leveraging category-specific expertise through optimized weighted combination.

Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities across diverse domains, but their training remains resource- and time-intensive, requiring massive compute power and careful orchestration of training procedures. Model souping-the practice of averaging weights from multiple models of the same architecture-has emerged as a promising pre- and post-training technique that can enhance performance without expensive retraining. In this paper, we introduce Soup Of Category Experts (SoCE), a principled approach for model souping that utilizes benchmark composition to identify optimal model candidates and applies non-uniform weighted averaging to maximize performance. Contrary to previous uniform-averaging approaches, our method leverages the observation that benchmark categories often exhibit low inter-correlations in model performance. SoCE identifies “expert” models for each weakly-correlated category cluster and combines them using optimized weighted averaging rather than uniform weights. We demonstrate that the proposed method improves performance and robustness across multiple domains, including multilingual capabilities, tool calling, and math and achieves state-of-the-art results on the Berkeley Function Calling Leaderboard.

Shufan Yang, Zifeng Cheng, Zhiwei Jiang, Yafeng Yin, Cong Wang, Shiping Ge, Yuchen Fu, Qing Gu

Main category: cs.CL

TL;DR: RegionMarker is a semantic watermarking framework that protects Embedding-as-a-Service models from extraction attacks by defining trigger regions in low-dimensional space and embedding watermarks into text embeddings associated with these regions.

DetailsMotivation: Existing EaaS copyright protection methods are vulnerable to model extraction attacks and current watermarking approaches only resist limited attack types, failing to provide comprehensive protection against various threats.

Method: RegionMarker defines trigger regions in low-dimensional space using a secret dimensionality reduction matrix, randomly selects these regions, and injects watermarks into text embeddings associated with the regions. The approach uses text embeddings themselves as watermarks.

Result: Extensive experiments on various datasets demonstrate that RegionMarker effectively resists different attack methods, including paraphrasing and dimension-perturbation attacks, making it difficult for watermark removal attacks to evade detection.

Conclusion: RegionMarker provides comprehensive copyright protection for EaaS by being resilient to multiple attack types through its region-triggered semantic watermarking approach, thereby protecting model providers from economic losses.

Abstract: Embedding-as-a-Service (EaaS) is an effective and convenient deployment solution for addressing various NLP tasks. Nevertheless, recent research has shown that EaaS is vulnerable to model extraction attacks, which could lead to significant economic losses for model providers. For copyright protection, existing methods inject watermark embeddings into text embeddings and use them to detect copyright infringement. However, current watermarking methods often resist only a subset of attacks and fail to provide \textit{comprehensive} protection. To this end, we present the region-triggered semantic watermarking framework called RegionMarker, which defines trigger regions within a low-dimensional space and injects watermarks into text embeddings associated with these regions. By utilizing a secret dimensionality reduction matrix to project onto this subspace and randomly selecting trigger regions, RegionMarker makes it difficult for watermark removal attacks to evade detection. Furthermore, by embedding watermarks across the entire trigger region and using the text embedding as the watermark, RegionMarker is resilient to both paraphrasing and dimension-perturbation attacks. Extensive experiments on various datasets show that RegionMarker is effective in resisting different attack methods, thereby protecting the copyright of EaaS.

[75] AHaSIS: Shared Task on Sentiment Analysis for Arabic Dialects

Maram Alharbi, Salmane Chafik, Saad Ezzini, Ruslan Mitkov, Tharindu Ranasinghe, Hansi Hettiarachchi

Main category: cs.CL

TL;DR: Shared task on Arabic sentiment analysis using hotel reviews translated into Saudi and Moroccan dialects, with top system achieving 0.81 F1 score.

DetailsMotivation: The hospitality industry in Arab world needs advanced Arabic sentiment analysis tools to leverage customer feedback for service improvement.

Method: Created multi-dialect dataset from 538 sentiment-balanced hotel reviews, originally in MSA and translated into Saudi and Moroccan dialects, validated by native speakers.

Result: Over 40 teams registered, 12 submitted systems; top system achieved F1 score of 0.81.

Conclusion: Demonstrates feasibility but ongoing challenges in sentiment analysis across Arabic dialects for real-world customer experience applications.

Abstract: The hospitality industry in the Arab world increasingly relies on customer feedback to shape services, driving the need for advanced Arabic sentiment analysis tools. To address this challenge, the Sentiment Analysis on Arabic Dialects in the Hospitality Domain shared task focuses on Sentiment Detection in Arabic Dialects. This task leverages a multi-dialect, manually curated dataset derived from hotel reviews originally written in Modern Standard Arabic (MSA) and translated into Saudi and Moroccan (Darija) dialects. The dataset consists of 538 sentiment-balanced reviews spanning positive, neutral, and negative categories. Translations were validated by native speakers to ensure dialectal accuracy and sentiment preservation. This resource supports the development of dialect-aware NLP systems for real-world applications in customer experience analysis. More than 40 teams have registered for the shared task, with 12 submitting systems during the evaluation phase. The top-performing system achieved an F1 score of 0.81, demonstrating the feasibility and ongoing challenges of sentiment analysis across Arabic dialects.

[76] Donors and Recipients: On Asymmetric Transfer Across Tasks and Languages with Parameter-Efficient Fine-Tuning

Kajetan Dymkiewicz, Ivan Vulic, Helen Yannakoudakis, Eilam Shapira, Roi Reichart, Anna Korhonen

Main category: cs.CL

TL;DR: Study examines how PEFT/LoRA fine-tuning on single task-language pairs transfers to other tasks and languages, revealing consistent patterns of positive cross-language transfer within same tasks but collateral degradation across different tasks.

DetailsMotivation: To understand how improvements in one task or language affect other tasks and languages and their combinations, which remains poorly understood despite LLMs' strong performance across tasks and languages.

Method: Conducted controlled PEFT/LoRA study across multiple open-weight LLM families and sizes, fine-tuning each model on single task-language source and measuring transfer as percentage-point change versus baseline when evaluated on all other task-language target pairs.

Result: Uncovered two consistent patterns: 1) pronounced on-task vs. off-task asymmetry with positive Matched-Task (Cross-Language) transfer but collateral degradation in off-task transfer, 2) stable donor-recipient structure across languages and tasks (hub donors vs. brittle recipients).

Conclusion: Findings have implications for risk-aware fine-tuning and model specialization, highlighting the need to consider transfer effects when optimizing LLMs for specific tasks or languages.

Abstract: Large language models (LLMs) perform strongly across tasks and languages, yet how improvements in one task or language affect other tasks and languages and their combinations remains poorly understood. We conduct a controlled PEFT/LoRA study across multiple open-weight LLM families and sizes, treating task and language as transfer axes while conditioning on model family and size; we fine-tune each model on a single task-language source and measure transfer as the percentage-point change versus its baseline score when evaluated on all other task-language target pairs. We decompose transfer into (i) Matched-Task (Cross-Language), (ii) Matched-Language (Cross-Task), and (iii) Cross-Task (Cross-Language) regimes. We uncover two consistent general patterns. First, a pronounced on-task vs. off-task asymmetry: Matched-Task (Cross-Language) transfer is reliably positive, whereas off-task transfer often incurs collateral degradation. Second, a stable donor-recipient structure across languages and tasks (hub donors vs. brittle recipients). We outline implications for risk-aware fine-tuning and model specialisation.

[77] Can Large Language Models Function as Qualified Pediatricians? A Systematic Evaluation in Real-World Clinical Contexts

Siyu Zhu, Mouxiao Bian, Yue Xie, Yongyu Tang, Zhikang Yu, Tianbin Li, Pengcheng Chen, Bing Han, Jie Xu, Xiaoyan Dong

Main category: cs.CL

TL;DR: PEDIASBench evaluates 12 LLMs as pediatricians, finding they excel in basic knowledge but struggle with complex reasoning, dynamic decision-making, and humanistic care, suggesting potential for support roles but not independent practice.

DetailsMotivation: To assess whether large language models can function as competent pediatricians in real-world clinical settings by systematically evaluating their capabilities across multiple dimensions.

Method: Developed PEDIASBench evaluation framework assessing LLMs across three dimensions: basic knowledge application, dynamic diagnosis/treatment capability, and pediatric medical safety/ethics. Evaluated 12 models across 19 pediatric subspecialties and 211 diseases.

Result: Top models achieved >90% accuracy on basic knowledge but performance declined ~15% with complexity. Models struggled with integrative reasoning, real-time patient adaptation, and humanistic sensitivity. DeepSeek-R1 scored highest in case reasoning (0.58), Qwen2.5-72B best in ethics/safety (92.05%).

Conclusion: Current LLMs cannot independently perform pediatric care but show promise for decision support, education, and communication. Future development should focus on multimodal integration and clinical feedback loops to enhance safety and human-AI collaboration.

Abstract: With the rapid rise of large language models (LLMs) in medicine, a key question is whether they can function as competent pediatricians in real-world clinical settings. We developed PEDIASBench, a systematic evaluation framework centered on a knowledge-system framework and tailored to realistic clinical environments. PEDIASBench assesses LLMs across three dimensions: application of basic knowledge, dynamic diagnosis and treatment capability, and pediatric medical safety and medical ethics. We evaluated 12 representative models released over the past two years, including GPT-4o, Qwen3-235B-A22B, and DeepSeek-V3, covering 19 pediatric subspecialties and 211 prototypical diseases. State-of-the-art models performed well on foundational knowledge, with Qwen3-235B-A22B achieving over 90% accuracy on licensing-level questions, but performance declined ~15% as task complexity increased, revealing limitations in complex reasoning. Multiple-choice assessments highlighted weaknesses in integrative reasoning and knowledge recall. In dynamic diagnosis and treatment scenarios, DeepSeek-R1 scored highest in case reasoning (mean 0.58), yet most models struggled to adapt to real-time patient changes. On pediatric medical ethics and safety tasks, Qwen2.5-72B performed best (accuracy 92.05%), though humanistic sensitivity remained limited. These findings indicate that pediatric LLMs are constrained by limited dynamic decision-making and underdeveloped humanistic care. Future development should focus on multimodal integration and a clinical feedback-model iteration loop to enhance safety, interpretability, and human-AI collaboration. While current LLMs cannot independently perform pediatric care, they hold promise for decision support, medical education, and patient communication, laying the groundwork for a safe, trustworthy, and collaborative intelligent pediatric healthcare system.

[78] Mem-PAL: Towards Memory-based Personalized Dialogue Assistants for Long-term User-Agent Interaction

Zhaopei Huang, Qifeng Dai, Guozheng Wu, Xiaopeng Wu, Kehan Chen, Chuan Yu, Xubin Li, Tiezheng Ge, Wenxuan Wang, Qin Jin

Main category: cs.CL

TL;DR: PAL-Bench is a new benchmark for evaluating personalization in service-oriented dialogue assistants, featuring PAL-Set - the first Chinese dataset with multi-session user logs and dialogue histories, and H²Memory - a hierarchical memory framework for personalized response generation.

DetailsMotivation: Existing approaches overlook long-term interaction complexities and fail to capture users' subjective characteristics in service-oriented human-agent interactions, highlighting the need for personalized dialogue assistants that understand user-specific traits.

Method: Developed a multi-step LLM-based synthesis pipeline to create PAL-Set dataset, and proposed H²Memory - a hierarchical and heterogeneous memory framework with retrieval-augmented generation for personalized response generation.

Result: Comprehensive experiments on PAL-Bench and external datasets demonstrate the effectiveness of the proposed memory framework in improving personalized service-oriented interactions.

Conclusion: PAL-Bench provides a valuable benchmark for evaluating personalization capabilities, and H²Memory framework effectively addresses the gaps in capturing long-term user characteristics for personalized dialogue assistance.

Abstract: With the rise of smart personal devices, service-oriented human-agent interactions have become increasingly prevalent. This trend highlights the need for personalized dialogue assistants that can understand user-specific traits to accurately interpret requirements and tailor responses to individual preferences. However, existing approaches often overlook the complexities of long-term interactions and fail to capture users’ subjective characteristics. To address these gaps, we present PAL-Bench, a new benchmark designed to evaluate the personalization capabilities of service-oriented assistants in long-term user-agent interactions. In the absence of available real-world data, we develop a multi-step LLM-based synthesis pipeline, which is further verified and refined by human annotators. This process yields PAL-Set, the first Chinese dataset comprising multi-session user logs and dialogue histories, which serves as the foundation for PAL-Bench. Furthermore, to improve personalized service-oriented interactions, we propose H$^2$Memory, a hierarchical and heterogeneous memory framework that incorporates retrieval-augmented generation to improve personalized response generation. Comprehensive experiments on both our PAL-Bench and an external dataset demonstrate the effectiveness of the proposed memory framework.

[79] Non-Linear Scoring Model for Translation Quality Evaluation

Serge Gladkoff, Lifeng Han, Katerina Gasova

Main category: cs.CL

TL;DR: A non-linear scoring model for translation quality evaluation that uses logarithmic error tolerance to better align with human perception across different text lengths, replacing biased linear extrapolation methods.

DetailsMotivation: Traditional linear error-to-penalty scaling in translation quality evaluation produces biased judgments across different sample sizes, misaligning with expert intuition by over-penalizing short texts and under-penalizing long ones.

Method: Proposes a two-parameter logarithmic model E(x) = a * ln(1 + b * x) calibrated from tolerance points, supported by psychophysical evidence (Weber-Fechner law, Cognitive Load Theory) showing error perception diminishes logarithmically with scale.

Result: Empirical data from three enterprise environments confirms acceptable error counts grow logarithmically with sample size. The model improves interpretability, fairness, and inter-rater reliability while maintaining compatibility with existing evaluation workflows.

Conclusion: The logarithmic scoring model advances translation quality evaluation toward more accurate, scalable assessment that better reflects human perception, providing a stronger foundation for both human and AI-generated text evaluation.

Abstract: Analytic Translation Quality Evaluation (TQE), based on Multidimensional Quality Metrics (MQM), traditionally uses a linear error-to-penalty scale calibrated to a reference sample of 1000-2000 words. However, linear extrapolation biases judgment on samples of different sizes, over-penalizing short samples and under-penalizing long ones, producing misalignment with expert intuition. Building on the Multi-Range framework, this paper presents a calibrated, non-linear scoring model that better reflects how human content consumers perceive translation quality across samples of varying length. Empirical data from three large-scale enterprise environments shows that acceptable error counts grow logarithmically, not linearly, with sample size. Psychophysical and cognitive evidence, including the Weber-Fechner law and Cognitive Load Theory, supports this premise by explaining why the perceptual impact of additional errors diminishes while the cognitive burden grows with scale. We propose a two-parameter model E(x) = a * ln(1 + b * x), a, b > 0, anchored to a reference tolerance and calibrated from two tolerance points using a one-dimensional root-finding step. The model yields an explicit interval within which the linear approximation stays within +/-20 percent relative error and integrates into existing evaluation workflows with only a dynamic tolerance function added. The approach improves interpretability, fairness, and inter-rater reliability across both human and AI-generated translations. By operationalizing a perceptually valid scoring paradigm, it advances translation quality evaluation toward more accurate and scalable assessment. The model also provides a stronger basis for AI-based document-level evaluation aligned with human judgment. Implementation considerations for CAT/LQA systems and implications for human and AI-generated text evaluation are discussed.

[80] Aspect-Level Obfuscated Sentiment in Thai Financial Disclosures and Its Impact on Abnormal Returns

Attapol T. Rutherford, Sirisak Chueykamhang, Thachaparn Bunditlurdruk, Nanthicha Angsuwichitkul

Main category: cs.CL

TL;DR: This paper develops an Aspect-Based Sentiment Analysis (ABSA) approach to decode obfuscated sentiment in Thai financial annual reports, creates annotation guidelines, benchmarks classification models, and shows market reactions to specific aspects.

DetailsMotivation: Financial reports often contain obfuscated language that presents positive or neutral outlooks even when underlying conditions are unfavorable, making accurate sentiment analysis crucial for understanding market behavior.

Method: Developed specific guidelines for annotating obfuscated sentiment, annotated over 100 financial reports, benchmarked various text classification models, and conducted an event study to evaluate impact on stock prices.

Result: Demonstrated strong performance in sentiment classification and found that market reactions are selectively influenced by specific aspects within the reports.

Conclusion: The findings highlight the complexity of sentiment analysis in financial texts and emphasize the importance of addressing obfuscated language to accurately assess market sentiment.

Abstract: Understanding sentiment in financial documents is crucial for gaining insights into market behavior. These reports often contain obfuscated language designed to present a positive or neutral outlook, even when underlying conditions may be less favorable. This paper presents a novel approach using Aspect-Based Sentiment Analysis (ABSA) to decode obfuscated sentiment in Thai financial annual reports. We develop specific guidelines for annotating obfuscated sentiment in these texts and annotate more than one hundred financial reports. We then benchmark various text classification models on this annotated dataset, demonstrating strong performance in sentiment classification. Additionally, we conduct an event study to evaluate the real-world implications of our sentiment analysis on stock prices. Our results suggest that market reactions are selectively influenced by specific aspects within the reports. Our findings underscore the complexity of sentiment analysis in financial texts and highlight the importance of addressing obfuscated language to accurately assess market sentiment.

[81] Applying Large Language Models to Characterize Public Narratives

Elinor Poole-Dayan, Daniel T Kessler, Hannah Chiou, Margaret Hughes, Emily S Lin, Marshall Ganz, Deb Roy

Main category: cs.CL

TL;DR: LLMs can automate public narrative annotation with near-human performance (F1=0.80), enabling scalable analysis of civic stories and political rhetoric.

DetailsMotivation: Public narratives are important for leadership and civic mobilization but are hard to analyze systematically due to subjective interpretation and expensive expert annotation.

Method: Developed computational framework using LLMs to automate qualitative annotation based on expert codebook, evaluated against human experts across multiple narratives and codes.

Result: LLMs achieved near-human-expert performance with average F1 score of 0.80 across 8 narratives and 14 codes, successfully analyzed 22 stories and political speeches.

Conclusion: LLM-assisted annotation shows strong potential for scalable narrative analysis in civic storytelling, though limitations exist that require future research.

Abstract: Public Narratives (PNs) are key tools for leadership development and civic mobilization, yet their systematic analysis remains challenging due to their subjective interpretation and the high cost of expert annotation. In this work, we propose a novel computational framework that leverages large language models (LLMs) to automate the qualitative annotation of public narratives. Using a codebook we co-developed with subject-matter experts, we evaluate LLM performance against that of expert annotators. Our work reveals that LLMs can achieve near-human-expert performance, achieving an average F1 score of 0.80 across 8 narratives and 14 codes. We then extend our analysis to empirically explore how PN framework elements manifest across a larger dataset of 22 stories. Lastly, we extrapolate our analysis to a set of political speeches, establishing a novel lens in which to analyze political rhetoric in civic spaces. This study demonstrates the potential of LLM-assisted annotation for scalable narrative analysis and highlights key limitations and directions for future research in computational civic storytelling.

[82] Beyond SELECT: A Comprehensive Taxonomy-Guided Benchmark for Real-World Text-to-SQL Translation

Hao Wang, Yuanfeng Song, Xiaoming Yin, Xing Chen

Main category: cs.CL

TL;DR: Proposed a taxonomy for text-to-SQL classification and used it to create SQL-Synth dataset, which shows greater diversity than existing benchmarks and reveals limitations in current LLMs’ performance.

DetailsMotivation: Existing text-to-SQL datasets have limited coverage and fail to capture real-world diversity, necessitating better evaluation frameworks and more comprehensive datasets.

Method: Developed a taxonomy for text-to-SQL classification across multiple dimensions, then used this taxonomy with LLMs to synthesize a new dataset (SQL-Synth) through a guided pipeline.

Result: SQL-Synth exhibits greater diversity and coverage than existing benchmarks, and experiments show current LLMs perform poorly on it but can be substantially improved through fine-tuning.

Conclusion: The taxonomy enables comprehensive dataset analysis and LLM performance evaluation, and can guide construction of training data to improve text-to-SQL capabilities.

Abstract: Text-to-SQL datasets are essential for training and evaluating text-to-SQL models, but existing datasets often suffer from limited coverage and fail to capture the diversity of real-world applications. To address this, we propose a novel taxonomy for text-to-SQL classification based on dimensions including core intents, statement types, syntax structures, and key actions. Using this taxonomy, we evaluate widely used public text-to-SQL datasets (e.g., Spider and Bird) and reveal limitations in their coverage and diversity. We then introduce a taxonomy-guided dataset synthesis pipeline, yielding a new dataset named SQL-Synth. This approach combines the taxonomy with Large Language Models (LLMs) to ensure the dataset reflects the breadth and complexity of real-world text-to-SQL applications. Extensive analysis and experimental results validate the effectiveness of our taxonomy, as SQL-Synth exhibits greater diversity and coverage compared to existing benchmarks. Moreover, we uncover that existing LLMs typically fall short in adequately capturing the full range of scenarios, resulting in limited performance on SQL-Synth. However, fine-tuning can substantially improve their performance in these scenarios. The proposed taxonomy has significant potential impact, as it not only enables comprehensive analysis of datasets and the performance of different LLMs, but also guides the construction of training data for LLMs.

[83] Omni Memory System for Personalized, Long Horizon, Self-Evolving Agents

Piaohong Wang, Motong Tian, Jiaxian Li, Yuan Liang, Yuqing Wang, Qianben Chen, Tiannan Wang, Zhicong Lu, Jiawei Ma, Yuchen Eleanor Jiang, Wangchunshu Zhou

Main category: cs.CL

TL;DR: O-Mem is a novel memory framework that uses active user profiling to dynamically extract and update user characteristics from interactions, achieving state-of-the-art performance on memory benchmarks while improving response efficiency.

DetailsMotivation: Existing LLM-powered agents struggle with long-term interactions in complex environments due to limitations in contextual consistency and dynamic personalization. Current memory systems relying on semantic grouping often miss critical user information and introduce retrieval noise.

Method: O-Mem uses active user profiling to dynamically extract and update user characteristics and event records from proactive interactions. It supports hierarchical retrieval of persona attributes and topic-related context for adaptive personalized responses.

Result: O-Mem achieves 51.76% on LoCoMo benchmark (3% improvement over LangMem) and 62.99% on PERSONAMEM (3.5% improvement over A-Mem). It also boosts token and interaction response time efficiency compared to previous frameworks.

Conclusion: O-Mem demonstrates promising directions for developing efficient and human-like personalized AI assistants by addressing key limitations in memory systems for long-term interactions.

Abstract: Recent advancements in LLM-powered agents have demonstrated significant potential in generating human-like responses; however, they continue to face challenges in maintaining long-term interactions within complex environments, primarily due to limitations in contextual consistency and dynamic personalization. Existing memory systems often depend on semantic grouping prior to retrieval, which can overlook semantically irrelevant yet critical user information and introduce retrieval noise. In this report, we propose the initial design of O-Mem, a novel memory framework based on active user profiling that dynamically extracts and updates user characteristics and event records from their proactive interactions with agents. O-Mem supports hierarchical retrieval of persona attributes and topic-related context, enabling more adaptive and coherent personalized responses. O-Mem achieves 51.76% on the public LoCoMo benchmark, a nearly 3% improvement upon LangMem,the previous state-of-the-art, and it achieves 62.99% on PERSONAMEM, a 3.5% improvement upon A-Mem,the previous state-of-the-art. O-Mem also boosts token and interaction response time efficiency compared to previous memory frameworks. Our work opens up promising directions for developing efficient and human-like personalized AI assistants in the future.

[84] Why is “Chicago” Predictive of Deceptive Reviews? Using LLMs to Discover Language Phenomena from Lexical Cues

Jiaming Qu, Mengtian Guo, Yue Wang

Main category: cs.CL

TL;DR: Using LLMs to translate machine-learned lexical cues into human-understandable language phenomena for detecting deceptive reviews.

DetailsMotivation: Deceptive reviews mislead consumers, harm businesses, and undermine trust in online marketplaces. Machine learning classifiers can detect them but their features are difficult for humans to interpret.

Method: Using large language models (LLMs) to translate machine-learned lexical cues into human-understandable language phenomena that differentiate deceptive from genuine reviews.

Result: The language phenomena obtained are empirically grounded in data, generalizable across similar domains, and more predictive than phenomena from LLMs’ prior knowledge or in-context learning.

Conclusion: These language phenomena can help people critically assess online review credibility where deception detection classifiers are unavailable.

Abstract: Deceptive reviews mislead consumers, harm businesses, and undermine trust in online marketplaces. Machine learning classifiers can learn from large amounts of training examples to effectively distinguish deceptive reviews from genuine ones. However, the distinguishing features learned by these classifiers are often subtle, fragmented, and difficult for humans to interpret. In this work, we explore using large language models (LLMs) to translate machine-learned lexical cues into human-understandable language phenomena that can differentiate deceptive reviews from genuine ones. We show that language phenomena obtained in this manner are empirically grounded in data, generalizable across similar domains, and more predictive than phenomena either in LLMs’ prior knowledge or obtained through in-context learning. These language phenomena have the potential to aid people in critically assessing the credibility of online reviews in environments where deception detection classifiers are unavailable.

[85] Crossing Borders: A Multimodal Challenge for Indian Poetry Translation and Image Generation

Sofia Jamil, Kotla Sai Charan, Sriparna Saha, Koustava Goswami, Joseph K J

Main category: cs.CL

TL;DR: The paper proposes TAI framework using LLMs and diffusion models to translate and generate images for Indian poetry, addressing accessibility gaps for non-native speakers.

DetailsMotivation: Indian poetry's linguistic complexity and cultural richness pose comprehension challenges, especially for non-native speakers, and existing works have overlooked Indian language poems.

Method: TAI framework with translation module using Odds Ratio Preference Alignment Algorithm and image generation module using semantic graphs to capture metaphors and meanings.

Result: Experimental evaluation shows TAI Diffusion outperforms baselines in poem image generation, and introduces MorphoVerse Dataset with 1,570 poems across 21 Indian languages.

Conclusion: The work enhances accessibility of Indian poetry globally, supporting SDG goals for quality education and reduced inequalities through improved translation and visual comprehension.

Abstract: Indian poetry, known for its linguistic complexity and deep cultural resonance, has a rich and varied heritage spanning thousands of years. However, its layered meanings, cultural allusions, and sophisticated grammatical constructions often pose challenges for comprehension, especially for non-native speakers or readers unfamiliar with its context and language. Despite its cultural significance, existing works on poetry have largely overlooked Indian language poems. In this paper, we propose the Translation and Image Generation (TAI) framework, leveraging Large Language Models (LLMs) and Latent Diffusion Models through appropriate prompt tuning. Our framework supports the United Nations Sustainable Development Goals of Quality Education (SDG 4) and Reduced Inequalities (SDG 10) by enhancing the accessibility of culturally rich Indian-language poetry to a global audience. It includes (1) a translation module that uses an Odds Ratio Preference Alignment Algorithm to accurately translate morphologically rich poetry into English, and (2) an image generation module that employs a semantic graph to capture tokens, dependencies, and semantic relationships between metaphors and their meanings, to create visually meaningful representations of Indian poems. Our comprehensive experimental evaluation, including both human and quantitative assessments, demonstrates the superiority of TAI Diffusion in poem image generation tasks, outperforming strong baselines. To further address the scarcity of resources for Indian-language poetry, we introduce the Morphologically Rich Indian Language Poems MorphoVerse Dataset, comprising 1,570 poems across 21 low-resource Indian languages. By addressing the gap in poetry translation and visual comprehension, this work aims to broaden accessibility and enrich the reader’s experience.

[86] HAPO: Training Language Models to Reason Concisely via History-Aware Policy Optimization

Chengyu Huang, Zhengxin Zhang, Claire Cardie

Main category: cs.CL

TL;DR: HAPO is a method that uses historical length information to train LLMs to generate more concise reasoning while maintaining accuracy, achieving 33-59% length reduction with minimal accuracy loss.

DetailsMotivation: Current test-time scaling approaches for LLMs improve reasoning but produce verbose outputs and increase inference costs, without leveraging historical problem-solving information to progressively improve conciseness.

Method: HAPO tracks history states (minimum length of previous correct responses) and uses a novel length reward function to incentivize discovering more concise correct solutions, combined with correctness reward for joint optimization.

Result: HAPO-trained models achieved 33-59% length reduction on math benchmarks with only 2-5% accuracy drops, demonstrating effective induction of concise reasoning abilities.

Conclusion: HAPO successfully enables LLMs to progressively generate more efficient solutions by leveraging historical information, balancing correctness and conciseness through its reward structure.

Abstract: While scaling the length of responses at test-time has been shown to markedly improve the reasoning abilities and performance of large language models (LLMs), it often results in verbose outputs and increases inference cost. Prior approaches for efficient test-time scaling, typically using universal budget constraints or query-level length optimization, do not leverage historical information from previous encounters with the same problem during training. We hypothesize that this limits their ability to progressively make solutions more concise over time. To address this, we present History-Aware Policy Optimization (HAPO), which keeps track of a history state (e.g., the minimum length over previously generated correct responses) for each problem. HAPO employs a novel length reward function based on this history state to incentivize the discovery of correct solutions that are more concise than those previously found. Crucially, this reward structure avoids overly penalizing shorter incorrect responses with the goal of facilitating exploration towards more efficient solutions. By combining this length reward with a correctness reward, HAPO jointly optimizes for correctness and efficiency. We use HAPO to train DeepSeek-R1-Distill-Qwen-1.5B, DeepScaleR-1.5B-Preview, and Qwen-2.5-1.5B-Instruct, and evaluate HAPO on several math benchmarks that span various difficulty levels. Experiment results demonstrate that HAPO effectively induces LLMs’ concise reasoning abilities, producing length reductions of 33-59% with accuracy drops of only 2-5%.

[87] Generalist Foundation Models Are Not Clinical Enough for Hospital Operations

Lavender Y. Jiang, Angelica Chen, Xu Han, Xujin Chris Liu, Radhika Dua, Kevin Eaton, Frederick Wolff, Robert Steele, Jeff Zhang, Anton Alyakin, Qingkai Pan, Yanbing Chen, Karl L. Sangwon, Daniel A. Alber, Jaden Stryker, Jin Vivian Lee, Yindalon Aphinyanaphongs, Kyunghyun Cho, Eric Karl Oermann

Main category: cs.CL

TL;DR: Lang1 models pretrained on EHR data outperform general models on healthcare operational tasks after finetuning, showing specialized LLMs can compete with much larger generalist models when properly trained and evaluated on domain-specific tasks.

DetailsMotivation: General foundation models lack specialized knowledge for hospital operational decisions, requiring domain-specific training to handle critical healthcare tasks like readmission prediction and mortality forecasting.

Method: Developed Lang1 models (100M-7B parameters) pretrained on specialized corpus combining 80B clinical tokens from EHRs and 627B internet tokens, evaluated using ReMedE benchmark with 668,331 EHR notes across five critical healthcare tasks.

Result: After finetuning, Lang1-1B outperformed finetuned generalist models up to 70x larger and zero-shot models up to 671x larger, improving AUROC by 3.64%-6.75% and 1.66%-23.66% respectively. Models showed cross-task scaling and effective transfer to out-of-distribution settings.

Conclusion: Effective healthcare AI requires in-domain pretraining, supervised finetuning, and real-world evaluation beyond proxy benchmarks, demonstrating specialized LLMs can compete with generalist models in specialized healthcare tasks.

Abstract: Hospitals and healthcare systems rely on operational decisions that determine patient flow, cost, and quality of care. Despite strong performance on medical knowledge and conversational benchmarks, foundation models trained on general text may lack the specialized knowledge required for these operational decisions. We introduce Lang1, a family of models (100M-7B parameters) pretrained on a specialized corpus blending 80B clinical tokens from NYU Langone Health’s EHRs and 627B tokens from the internet. To rigorously evaluate Lang1 in real-world settings, we developed the REalistic Medical Evaluation (ReMedE), a benchmark derived from 668,331 EHR notes that evaluates five critical tasks: 30-day readmission prediction, 30-day mortality prediction, length of stay, comorbidity coding, and predicting insurance claims denial. In zero-shot settings, both general-purpose and specialized models underperform on four of five tasks (36.6%-71.7% AUROC), with mortality prediction being an exception. After finetuning, Lang1-1B outperforms finetuned generalist models up to 70x larger and zero-shot models up to 671x larger, improving AUROC by 3.64%-6.75% and 1.66%-23.66% respectively. We also observed cross-task scaling with joint finetuning on multiple tasks leading to improvement on other tasks. Lang1-1B effectively transfers to out-of-distribution settings, including other clinical tasks and an external health system. Our findings suggest that predictive capabilities for hospital operations require explicit supervised finetuning, and that this finetuning process is made more efficient by in-domain pretraining on EHR. Our findings support the emerging view that specialized LLMs can compete with generalist models in specialized tasks, and show that effective healthcare systems AI requires the combination of in-domain pretraining, supervised finetuning, and real-world evaluation beyond proxy benchmarks.

[88] DCRM: A Heuristic to Measure Response Pair Quality in Preference Optimization

Chengyu Huang, Tanya Goyal

Main category: cs.CL

TL;DR: The paper introduces DCRM, a metric to measure preference dataset quality for preference optimization, and proposes a best-of-N² pairing method that improves model performance.

DetailsMotivation: Current preference optimization methods don't adequately consider whether the differences between preferred and dispreferred responses match what models should learn.

Method: Use distance and reward margin to quantify response differences, combine them into DCRM metric, and propose best-of-N² pairing to select high-DCRM response pairs.

Result: Higher DCRM correlates with better learning outcomes. The proposed method produces datasets that improve performance on AlpacaEval, MT-Bench, and Arena-Hard benchmarks.

Conclusion: DCRM effectively measures preference dataset quality, and selecting high-DCRM pairs through best-of-N² pairing can enhance preference optimization performance.

Abstract: Recent research has attempted to associate preference optimization (PO) performance with the underlying preference datasets. In this work, our observation is that the differences between the preferred response $y^+$ and dispreferred response $y^-$ influence what LLMs can learn, which may not match the desirable differences to learn. Therefore, we use distance and reward margin to quantify these differences, and combine them to get Distance Calibrated Reward Margin (DCRM), a metric that measures the quality of a response pair for PO. Intuitively, DCRM encourages minimal noisy differences and maximal desired differences. With this, we study 3 types of commonly used preference datasets, classified along two axes: the source of the responses and the preference labeling function. We establish a general correlation between higher DCRM of the training set and better learning outcome. Inspired by this, we propose a best-of-$N^2$ pairing method that selects response pairs with the highest DCRM. Empirically, in various settings, our method produces training datasets that can further improve models’ performance on AlpacaEval, MT-Bench, and Arena-Hard over the existing training sets.

[89] Historical/temporal necessities/possibilities, and a logical theory of them in branching time

Fengkui Ju, Woxuan Zhou

Main category: cs.CL

TL;DR: The paper analyzes six modal notions of necessity and possibility related to time flow, develops a logical theory for them in branching time, and provides a sound and complete axiomatic system.

DetailsMotivation: To formalize linguistic perspectives on time flow by distinguishing between historical and temporal notions of necessity and possibility, focusing on how agents reason about expected and accepted timelines.

Method: The approach defines six modal operators based on an agent’s system of ontic rules that determine expected and accepted timelines. The logical theory uses branching time models with evaluation contexts including time flow models, ontic rules, timelines, and instants.

Result: The paper successfully formalizes the six notions of necessity and possibility, provides their logical semantics in branching time, and develops a complete axiomatic system for reasoning about these temporal modalities.

Conclusion: The framework offers a comprehensive logical treatment of temporal and historical modalities from a linguistic perspective, enabling formal reasoning about different types of necessity and possibility in branching time structures.

Abstract: In this paper, we do three kinds of work. First, we recognize four notions of necessity and two notions of possibility related to time flow, namely strong/weak historical/temporal necessities, as well as historical/temporal possibilities, which are motivated more from a linguistic perspective than from a philosophical one. Strong/weak historical necessities and historical possibility typically concern the possible futures of the present world, and strong/weak temporal necessities and temporal possibility concern possible timelines of alternatives of the present world. Second, we provide our approach to the six notions and present a logical theory of them in branching time. Our approach to the six notions is as follows. The agent has a system of ontic rules that determine expected timelines. She treats some ontic rules as undefeatable, determining accepted timelines. The domains of strong/weak historical necessities, respectively, consist of accepted and expected timelines passing through the present moment, and historical possibility is the dual of strong historical necessity. The domains of strong/weak temporal necessities, respectively, consist of accepted and expected timelines, and temporal possibility is the dual of strong temporal necessity. The logical theory has six operators: a last-moment operator, a next-moment operator, and four operators for the four notions of necessity. Formulas’ evaluation contexts consist of a tree-like model representing a time flow, a context representing the agent’s system of ontic rules, a timeline, and an instant. Third, we offer an axiomatic system for the logical theory and show its soundness and completeness.

[90] Simultaneous Machine Translation with Large Language Models

Minghan Wang, Jinming Zhao, Thuy-Trang Vu, Fatemeh Shiri, Ehsan Shareghi, Gholamreza Haffari

Main category: cs.CL

TL;DR: LLMs outperform dedicated MT models in simultaneous machine translation with better BLEU scores and latency, but face computational cost challenges.

DetailsMotivation: Real-world SimulMT systems need to handle noisy input, long contexts, and knowledge injection beyond just quality-latency trade-offs, requiring stronger language capabilities than dedicated MT models typically offer.

Method: Applied LLMs to SimulMT using existing incremental-decoding methods with a new RALCP algorithm for latency reduction, tested on Llama2-7b-chat model across nine languages from MUST-C dataset.

Result: LLM outperforms dedicated MT models in both BLEU and LAAL metrics, showing advantages in tuning efficiency and robustness, though computational cost remains a significant obstacle.

Conclusion: LLMs show promise for SimulMT with superior performance and robustness, but computational efficiency needs improvement for practical applications.

Abstract: Real-world simultaneous machine translation (SimulMT) systems face more challenges than just the quality-latency trade-off. They also need to address issues related to robustness with noisy input, processing long contexts, and flexibility for knowledge injection. These challenges demand models with strong language understanding and generation capabilities which may not often equipped by dedicated MT models. In this paper, we investigate the possibility of applying Large Language Models (LLM) to SimulMT tasks by using existing incremental-decoding methods with a newly proposed RALCP algorithm for latency reduction. We conducted experiments using the \texttt{Llama2-7b-chat} model on nine different languages from the MUST-C dataset. The results show that LLM outperforms dedicated MT models in terms of BLEU and LAAL metrics. Further analysis indicates that LLM has advantages in terms of tuning efficiency and robustness. However, it is important to note that the computational cost of LLM remains a significant obstacle to its application in SimulMT.

[91] Vashantor: A Large-scale Multilingual Benchmark Dataset for Automated Translation of Bangla Regional Dialects to Bangla Language

Fatema Tuj Johora Faria, Mukaffi Bin Moin, Ahmed Al Wase, Mehidi Ahmmed, Md. Rabius Sani, Tashreef Muhammad

Main category: cs.CL

TL;DR: This paper addresses the gap in translating Bangla regional dialects to standard Bangla by creating a dataset of 32,500 sentences and proposing two novel models: DialectBanglaT5 for translation and DialectBanglaBERT for region detection.

DetailsMotivation: There has been extensive research on Bangla-English translation but a noticeable gap in translating Bangla regional dialects into standard Bangla, despite the linguistic diversity of Bangla dialects contributing to cultural richness.

Method: Created a dataset of 32,500 sentences covering Bangla, Banglish, and English from five regional dialects. Proposed two novel models: DialectBanglaT5 for dialect-to-standard Bangla translation and DialectBanglaBERT for dialect region classification.

Result: DialectBanglaT5 achieved highest BLEU score of 71.93, METEOR of 0.8503, and lowest WER of 0.1470 and CER of 0.0791 on Mymensingh dialect. DialectBanglaBERT achieved 89.02% overall region classification accuracy with F1-scores of 0.9241 for Chittagong and 0.8736 for Mymensingh.

Conclusion: This is the first large-scale investigation of Bangla regional dialect translation and region detection. The proposed models demonstrate the potential of dialect-specific modeling and set new benchmarks for low-resource and dialect-rich language research.

Abstract: The Bangla linguistic variety is a fascinating mix of regional dialects that contributes to the cultural diversity of the Bangla-speaking community. Despite extensive study into translating Bangla to English, English to Bangla, and Banglish to Bangla in the past, there has been a noticeable gap in translating Bangla regional dialects into standard Bangla. In this study, we set out to fill this gap by creating a collection of 32,500 sentences, encompassing Bangla, Banglish, and English, representing five regional Bangla dialects. Our aim is to translate these regional dialects into standard Bangla and detect regions accurately. To tackle the translation and region detection tasks, we propose two novel models: DialectBanglaT5 for translating regional dialects into standard Bangla and DialectBanglaBERT for identifying the dialect’s region of origin. DialectBanglaT5 demonstrates superior performance across all dialects, achieving the highest BLEU score of 71.93, METEOR of 0.8503, and the lowest WER of 0.1470 and CER of 0.0791 on the Mymensingh dialect. It also achieves strong ROUGE scores across all dialects, indicating both accuracy and fluency in capturing dialectal nuances. In parallel, DialectBanglaBERT achieves an overall region classification accuracy of 89.02%, with notable F1-scores of 0.9241 for Chittagong and 0.8736 for Mymensingh, confirming its effectiveness in handling regional linguistic variation. This is the first large-scale investigation focused on Bangla regional dialect translation and region detection. Our proposed models highlight the potential of dialect-specific modeling and set a new benchmark for future research in low-resource and dialect-rich language settings.

[92] Conversational SimulMT: Efficient Simultaneous Translation with Large Language Models

Minghan Wang, Thuy-Trang Vu, Yuxia Wang, Ehsan Shareghi, Gholamreza Haffari

Main category: cs.CL

TL;DR: Proposes a conversational SimulMT framework using multi-turn-dialogue-based decoding to improve LLM inference efficiency for simultaneous machine translation.

DetailsMotivation: LLMs achieve good SimulMT performance but suffer from high inference cost and latency, creating a need for more efficient approaches.

Method: Multi-turn-dialogue-based decoding framework for conversational simultaneous machine translation using LLMs like Llama2-7b-chat.

Result: Achieves superior translation quality compared to specialized SimulMT models while maintaining comparable computational latency on two benchmarks.

Conclusion: The conversational framework successfully enhances LLM inference efficiency for SimulMT, balancing quality and latency effectively.

Abstract: Simultaneous machine translation (SimulMT) presents a challenging trade-off between translation quality and latency. Recent studies have shown that LLMs can achieve good performance in SimulMT tasks. However, this often comes at the expense of high inference cost and latency. In this paper, we propose a conversational SimulMT framework to enhance the inference efficiency of LLM-based SimulMT through multi-turn-dialogue-based decoding. Our experiments with Llama2-7b-chat on two SimulMT benchmarks demonstrate the superiority of LLM in translation quality while achieving comparable computational latency to specialized SimulMT models.

[93] DataGen: Unified Synthetic Dataset Generation via Large Language Models

Yue Huang, Siyuan Wu, Chujie Gao, Dongping Chen, Qihui Zhang, Yao Wan, Tianyi Zhou, Jianfeng Gao, Chaowei Xiao, Lichao Sun, Xiangliang Zhang

Main category: cs.CL

TL;DR: DataGen is an LLM-powered framework that generates diverse, accurate, and controllable datasets to address challenges in synthetic data generation, with applications in benchmarking and data augmentation.

DetailsMotivation: To overcome limitations in generalization, controllability, diversity, and truthfulness in existing LLM-based data generation frameworks.

Method: Uses attribute-guided generation and group checking for diversity; code-based mathematical assessment and retrieval-augmented generation for accuracy; user-specified constraints for customization.

Result: Produces superior quality datasets that enhance LLM benchmarking and improve capabilities in agent-oriented abilities and reasoning through data augmentation.

Conclusion: DataGen effectively addresses key challenges in synthetic data generation and supports dynamic benchmarking while enhancing LLM performance across multiple domains.

Abstract: Large Language Models (LLMs) such as GPT-4 and Llama3 have significantly impacted various fields by enabling high-quality synthetic data generation and reducing dependence on expensive human-generated datasets. Despite this, challenges remain in the areas of generalization, controllability, diversity, and truthfulness within the existing generative frameworks. To address these challenges, this paper presents DataGen, a comprehensive LLM-powered framework designed to produce diverse, accurate, and highly controllable datasets. DataGen is adaptable, supporting all types of text datasets and enhancing the generative process through innovative mechanisms. To augment data diversity, DataGen incorporates an attribute-guided generation module and a group checking feature. For accuracy, it employs a code-based mathematical assessment for label verification alongside a retrieval-augmented generation technique for factual validation. The framework also allows for user-specified constraints, enabling customization of the data generation process to suit particular requirements. Extensive experiments demonstrate the superior quality of data generated by DataGen, and each module within DataGen plays a critical role in this enhancement. Additionally, DataGen is applied in two practical scenarios: benchmarking LLMs and data augmentation. The results indicate that DataGen effectively supports dynamic and evolving benchmarking and that data augmentation improves LLM capabilities in various domains, including agent-oriented abilities and reasoning skills.

[94] ProFuser: Progressive Fusion of Large Language Models

Tianyuan Shi, Fanqi Wan, Canbin Huang, Xiaojun Quan, Chenliang Li, Ming Yan, Ji Zhang, Minhua Huang, Wu Kai

Main category: cs.CL

TL;DR: ProFuser is a novel model fusion method that evaluates model advantages using both training and inference modes, progressively transitioning between them to create more powerful and versatile language models.

DetailsMotivation: Existing fusion methods primarily use cross entropy on ground truth in teacher-forcing setup, which provides limited insight into model advantages. The paper aims to develop a more comprehensive assessment approach.

Method: Introduces ProFuser that evaluates model advantage through both cross entropy during training and inference outputs, with progressive transition from inference mode to training mode.

Result: Fused three models (Vicuna-7B-v1.5, Llama-2-7B-Chat, MPT-7B-8K-Chat) and demonstrated improved performance in knowledge, reasoning, and safety compared to baseline methods.

Conclusion: ProFuser effectively enhances model fusion by incorporating both training and inference modes, providing a more comprehensive approach to model advantage assessment.

Abstract: While fusing the capacities and advantages of various large language models offers a pathway to construct more powerful and versatile models, a fundamental challenge is to properly select advantageous model during training. Existing fusion methods primarily focus on the training mode that uses cross entropy on ground truth in a teacher-forcing setup to measure a model’s advantage, which may provide limited insight towards model advantage. In this paper, we introduce a novel approach that enhances the fusion process by incorporating both the training and inference modes. Our method evaluates model advantage not only through cross entropy during training but also by considering inference outputs, providing a more comprehensive assessment. To combine the two modes effectively, we introduce ProFuser to progressively transition from inference mode to training mode. To validate ProFuser’s effectiveness, we fused three models, including Vicuna-7B-v1.5, Llama-2-7B-Chat, and MPT-7B-8K-Chat, and demonstrated the improved performance in knowledge, reasoning, and safety compared to baseline methods.

[95] On the Limitations of Language Targeted Pruning: Investigating the Calibration Language Impact in Multilingual LLM Pruning

Simon Kurz, Jian-Jia Chen, Lucie Flek, Zhixue Zhao

Main category: cs.CL

TL;DR: This paper investigates how pruning multilingual LLMs affects performance across different languages, finding that while target-language calibration maintains perplexity, it doesn’t consistently improve downstream tasks due to loss of nuanced language-agnostic features.

DetailsMotivation: Previous LLM pruning research focused mainly on English text, despite multilingual LLMs being commonly used for non-English applications. The study aims to understand how pruning affects multilingual models across different languages and tasks.

Method: Conducted comprehensive empirical study comparing different calibration languages for pruning across diverse languages, tasks, models, and state-of-the-art pruning techniques. Analyzed latent subspaces, pruning masks, and individual neurons within pruned models.

Result: Calibration on target language effectively retains perplexity and yields high signal-to-noise ratios but does not consistently improve downstream task performance. Analysis reveals pruning preserves dominant language-specific features but loses nuanced language-agnostic features crucial for knowledge retention and reasoning.

Conclusion: Current pruning approaches have limitations - while they effectively preserve language-specific information, this is insufficient to counteract the loss of crucial language-agnostic features needed for knowledge and reasoning capabilities.

Abstract: Recent advances in large language model (LLM) pruning have shown state-of-the-art (SotA) compression results in post-training and retraining-free settings while maintaining high predictive performance. However, previous research mainly considered calibrating based on English text, despite the multilingual nature of modern LLMs and their frequent use in non-English languages. This analysis paper conducts an in-depth investigation of the performance and internal representation changes associated with pruning multilingual language models for monolingual applications. We present the first comprehensive empirical study, comparing different calibration languages for pruning multilingual models across diverse languages, tasks, models, and SotA pruning techniques. We further analyze the latent subspaces, pruning masks, and individual neurons within pruned models. Our results reveal that while calibration on the target language effectively retains perplexity and yields high signal-to-noise ratios, it does not consistently improve downstream task performance. Further analysis of internal representations at three different levels highlights broader limitations of current pruning approaches: While they effectively preserve dominant information like language-specific features, this is insufficient to counteract the loss of nuanced, language-agnostic features that are crucial for knowledge retention and reasoning.

[96] Contextual Breach: Assessing the Robustness of Transformer-based QA Models

Asir Saadat, Nahian Ibn Asad

Main category: cs.CL

TL;DR: Created a dataset with 7 types of adversarial noise at 5 intensity levels on SQuAD to test QA model robustness using standardized metrics.

DetailsMotivation: QA models are vulnerable to adversarial perturbations in real-world contexts, which degrade performance by distorting textual input.

Method: Developed a dataset with adversarial noise variations, applied standardized robustness metrics, and tested transformer-based QA models.

Result: Experiments revealed robustness vulnerabilities and provided insights into model performance under realistic noisy conditions.

Conclusion: The study highlights significant robustness issues in QA models and offers a framework for evaluating model resilience to adversarial text perturbations.

Abstract: Contextual question-answering models are susceptible to adversarial perturbations to input context, commonly observed in real-world scenarios. These adversarial noises are designed to degrade the performance of the model by distorting the textual input. We introduce a unique dataset that incorporates seven distinct types of adversarial noise into the context, each applied at five different intensity levels on the SQuAD dataset. To quantify the robustness, we utilize robustness metrics providing a standardized measure for assessing model performance across varying noise types and levels. Experiments on transformer-based question-answering models reveal robustness vulnerabilities and important insights into the model’s performance in realistic textual input.

[97] Is deeper always better? Replacing linear mappings with deep learning networks in the Discriminative Lexicon Model

Maria Heitmeier, Valeria Schmidt, Hendrik P. A. Lensch, R. Harald Baayen

Main category: cs.CL

TL;DR: This study compares linear vs deep learning approaches in cognitive language modeling, finding deep learning (DDL) works better for large datasets but linear methods (LDL/FIL) are more effective for incremental learning and some languages.

DetailsMotivation: To determine if deep learning can provide better understanding of language learning problems beyond linear methods in cognitive modeling of language.

Method: Replaced linear mappings in Discriminative Lexicon Model with deep dense neural networks (DDL), comparing performance across English, Dutch, Estonian, and Taiwan Mandarin datasets, and testing frequency-informed variants.

Result: DDL outperforms linear methods for large datasets (English/Dutch) and pseudo-morphological words, but underperforms for Estonian/Mandarin. Frequency-informed DDL (FIDDL) beats frequency-informed linear mappings (FIL), but linear methods are better for incremental learning.

Conclusion: Both linear and deep mappings are currently informative for understanding language, with each having specific strengths depending on dataset size, language characteristics, and learning requirements.

Abstract: Recently, deep learning models have increasingly been used in cognitive modelling of language. This study asks whether deep learning can help us to better understand the learning problem that needs to be solved by speakers, above and beyond linear methods. We utilise the Discriminative Lexicon Model introduced by Baayen and colleagues, which models comprehension and production with mappings between numeric form and meaning vectors. While so far, these mappings have been linear (Linear Discriminative Learning, LDL), in the present study we replace them with deep dense neural networks (Deep Discriminative Learning, DDL). We find that DDL affords more accurate mappings for large and diverse datasets from English and Dutch, but not necessarily for Estonian and Taiwan Mandarin. DDL outperforms LDL in particular for words with pseudo-morphological structure such as chol+er. Applied to average reaction times, we find that DDL is outperformed by frequency-informed linear mappings (FIL). However, DDL trained in a frequency-informed way (‘frequency-informed’ deep learning, FIDDL) substantially outperforms FIL. Finally, while linear mappings can very effectively be updated from trial-to-trial to model incremental lexical learning, deep mappings cannot do so as effectively. At present, both linear and deep mappings are informative for understanding language.

[98] Uncovering Factor Level Preferences to Improve Human-Model Alignment

Juhyun Oh, Eunsu Kim, Jiseon Kim, Wenda Xu, Inha Cha, William Yang Wang, Alice Oh

Main category: cs.CL

TL;DR: PROFILE is an automated framework that identifies factor-level preference misalignments between LLMs and humans, revealing a generation-discrimination gap that can be leveraged to improve LLM alignment.

DetailsMotivation: LLMs often diverge from human preferences in writing styles and verbosity, but existing evaluation methods lack explainability and fine-grained analysis to identify the root causes of these misalignments.

Method: PROFILE framework analyzes factor-level preference alignment across summarization, instruction-following, and document-based QA tasks through automated measurement of generation vs. discrimination capabilities.

Result: Significant discrepancy found: LLMs show poor factor-level alignment in text generation but strong alignment in discrimination tasks, revealing a generation-discrimination gap that can be exploited for improvement.

Conclusion: Factor-level analysis is valuable for identifying hidden misalignments, and the generation-discrimination gap provides practical opportunities to enhance LLM-human preference alignment through methods like fine-tuning with self-guidance.

Abstract: Large language models (LLMs) often exhibit tendencies that diverge from human preferences, such as favoring certain writing styles or producing overly verbose outputs. While crucial for improvement, identifying the factors driving these misalignments remains challenging due to existing evaluation methods’ reliance on coarse-grained comparisons and lack of explainability. To address this, we introduce PROFILE, an automated framework to uncover and measure factor-level preference alignment of humans and LLMs. Using PROFILE, we analyze preference alignment across three key tasks: summarization, instruction-following, and document-based QA. We find a significant discrepancy: while LLMs show poor factor-level alignment with human preferences when generating texts, they demonstrate strong alignment in discrimination tasks. We demonstrate how leveraging the identified generation-discrimination gap can be used to improve LLM alignment through multiple approaches, including fine-tuning with self-guidance. Our work highlights the value of factor-level analysis for identifying hidden misalignments and provides a practical framework for improving LLM-human preference alignment.

[99] Is Our Chatbot Telling Lies? Assessing Correctness of an LLM-based Dutch Support Chatbot

Herman Lassche, Michiel Overeem, Ayushi Rastogi

Main category: cs.CL

TL;DR: This study defines correctness for LLM-generated customer support responses in Dutch, develops automated assessment metrics based on support team decision-making, and achieves 55% accuracy in identifying wrong messages.

DetailsMotivation: AFAS wants to use LLMs to answer customer queries with minimal human input, but faces challenges in defining correctness for Dutch responses and assessing LLM output quality with limited training data.

Method: Leveraged NLG and automated answer grading literature to automate customer support team decision-making. Tested on binary questions and instructional questions, defining correctness based on how the support team makes decisions.

Result: The approach can identify wrong messages in 55% of cases, demonstrating potential for automatically assessing when chatbots provide incorrect or misleading answers.

Conclusion: The work contributes a definition and metrics for assessing correctness, plus suggestions to improve correctness regarding regional language and question type, showing promise for automated quality assessment of LLM-generated customer support responses.

Abstract: Companies support their customers using live chats and chatbots to gain their loyalty. AFAS is a Dutch company aiming to leverage the opportunity large language models (LLMs) offer to answer customer queries with minimal to no input from its customer support team. Adding to its complexity, it is unclear what makes a response correct, and that too in Dutch. Further, with minimal data available for training, the challenge is to identify whether an answer generated by a large language model is correct and do it on the fly. This study is the first to define the correctness of a response based on how the support team at AFAS makes decisions. It leverages literature on natural language generation and automated answer grading systems to automate the decision-making of the customer support team. We investigated questions requiring a binary response (e.g., Would it be possible to adjust tax rates manually?) or instructions (e.g., How would I adjust tax rate manually?) to test how close our automated approach reaches support rating. Our approach can identify wrong messages in 55% of the cases. This work demonstrates the potential for automatically assessing when our chatbot may provide incorrect or misleading answers. Specifically, we contribute (1) a definition and metrics for assessing correctness, and (2) suggestions to improve correctness with respect to regional language and question type.

[100] Unveiling Topological Structures from Language: A Survey of Topological Data Analysis Applications in NLP

Adaku Uchendu, Thai Le

Main category: cs.CL

TL;DR: Survey of 100 papers applying Topological Data Analysis (TDA) to NLP, categorizing approaches into theoretical (explaining linguistic phenomena) and non-theoretical (combining TDA with ML features).

DetailsMotivation: ML faces challenges with real-world data (imbalance, noise, insufficient labeling, high dimensionality), while TDA can capture data shape despite noise, but has limited adoption in NLP compared to other domains.

Method: Comprehensive survey of 100 papers, categorizing TDA applications in NLP into theoretical approaches (topological explanations of linguistic phenomena) and non-theoretical approaches (TDA combined with ML features using various numerical representations).

Result: Identified and categorized existing research efforts, showing TDA’s potential for NLP despite limited adoption. Created taxonomy of approaches and compiled resources.

Conclusion: TDA shows promise for NLP but faces challenges and unresolved questions. The field remains niche but has dedicated research community exploring its applications.

Abstract: The surge of data available on the Internet has led to the adoption of various computational methods to analyze and extract valuable insights from this wealth of information. Among these, the field of Machine Learning (ML) has thrived by leveraging data to extract meaningful insights. However, ML techniques face notable challenges when dealing with real-world data, often due to issues of imbalance, noise, insufficient labeling, and high dimensionality. To address these limitations, some researchers advocate for the adoption of Topological Data Analysis (TDA), a statistical approach that discerningly captures the intrinsic shape of data despite noise. Despite its potential, TDA has not gained as much traction within the Natural Language Processing (NLP) domain compared to structurally distinct areas like computer vision. Nevertheless, a dedicated community of researchers has been exploring the application of TDA in NLP, yielding 100 papers we comprehensively survey in this paper. Our findings categorize these efforts into theoretical and non-theoretical approaches. Theoretical approaches aim to explain linguistic phenomena from a topological viewpoint, while non-theoretical approaches merge TDA with ML features, utilizing diverse numerical representation techniques. We conclude by exploring the challenges and unresolved questions that persist in this niche field. Resources and a list of papers on this topic can be found at: https://github.com/AdaUchendu/AwesomeTDA4NLP.

[101] Understanding World or Predicting Future? A Comprehensive Survey of World Models

Jingtao Ding, Yunke Zhang, Yu Shang, Yuheng Zhang, Zefang Zong, Jie Feng, Yuan Yuan, Hongyuan Su, Nian Li, Nicholas Sukiennik, Fengli Xu, Yong Li

Main category: cs.CL

TL;DR: This survey provides a comprehensive review of world models in AI, categorizing them into two main functions: understanding current world states and predicting future dynamics, with applications in gaming, autonomous driving, robotics, and social simulation.

DetailsMotivation: The motivation stems from recent advancements in multimodal LLMs like GPT-4 and video generation models like Sora, which have renewed interest in world models as crucial components for achieving artificial general intelligence.

Method: The survey systematically categorizes world models into two primary functions: (1) constructing internal representations to understand world mechanisms, and (2) predicting future states for simulation and decision-making guidance. It examines current progress and explores applications across various domains.

Result: The review presents a comprehensive taxonomy of world models, analyzes their applications in key domains (generative games, autonomous driving, robotics, social simulacra), and provides a curated collection of representative papers with code repositories.

Conclusion: The survey identifies key challenges in world model development and outlines potential future research directions, emphasizing the importance of world models for advancing artificial general intelligence.

Abstract: The concept of world models has garnered significant attention due to advancements in multimodal large language models such as GPT-4 and video generation models such as Sora, which are central to the pursuit of artificial general intelligence. This survey offers a comprehensive review of the literature on world models. Generally, world models are regarded as tools for either understanding the present state of the world or predicting its future dynamics. This review presents a systematic categorization of world models, emphasizing two primary functions: (1) constructing internal representations to understand the mechanisms of the world, and (2) predicting future states to simulate and guide decision-making. Initially, we examine the current progress in these two categories. We then explore the application of world models in key domains, including generative games, autonomous driving, robotics, and social simulacra, with a focus on how each domain utilizes these aspects. Finally, we outline key challenges and provide insights into potential future research directions. We summarize the representative papers along with their code repositories in https://github.com/tsinghua-fib-lab/World-Model.

[102] Human-in-the-Loop Generation of Adversarial Texts: A Case Study on Tibetan Script

Xi Cao, Yuan Sun, Jiajun Li, Quzong Gesang, Nuo Qun, Tashi Nyima

Main category: cs.CL

TL;DR: HITL-GAT is a human-in-the-loop system for generating high-quality adversarial texts in lower-resourced languages, addressing challenges like linguistic differences, invalid generation, and model evolution.

DetailsMotivation: DNN language models are vulnerable to adversarial attacks, but existing work is English-centric, leaving lower-resourced languages understudied due to linguistic differences, limited resources, and generation quality issues.

Method: Interactive human-in-the-loop system (HITL-GAT) with three customized adversarial text generation methods, demonstrated through a Tibetan script case study.

Result: Established the first adversarial robustness benchmark for Tibetan script, providing a reference for other lower-resourced languages.

Conclusion: HITL-GAT effectively addresses challenges in adversarial text generation for lower-resourced languages and enables creation of sustainable robustness benchmarks.

Abstract: DNN-based language models excel across various NLP tasks but remain highly vulnerable to textual adversarial attacks. While adversarial text generation is crucial for NLP security, explainability, evaluation, and data augmentation, related work remains overwhelmingly English-centric, leaving the problem of constructing high-quality and sustainable adversarial robustness benchmarks for lower-resourced languages both difficult and understudied. First, method customization for lower-resourced languages is complicated due to linguistic differences and limited resources. Second, automated attacks are prone to generating invalid or ambiguous adversarial texts. Last but not least, language models continuously evolve and may be immune to parts of previously generated adversarial texts. To address these challenges, we introduce HITL-GAT, an interactive system based on a general approach to human-in-the-loop generation of adversarial texts. Additionally, we demonstrate the utility of HITL-GAT through a case study on Tibetan script, employing three customized adversarial text generation methods and establishing its first adversarial robustness benchmark, providing a valuable reference for other lower-resourced languages.

[103] PathRAG: Pruning Graph-based Retrieval Augmented Generation with Relational Paths

Boyu Chen, Zirui Guo, Zidan Yang, Yuluo Chen, Junze Chen, Zhenghao Liu, Chuan Shi, Cheng Yang

Main category: cs.CL

TL;DR: PathRAG improves graph-based RAG by retrieving key relational paths instead of redundant information, using flow-based pruning and path-based prompting to enhance LLM response quality.

DetailsMotivation: Current graph-based RAG methods suffer from redundant retrieved information and use flat structures in prompts, leading to suboptimal performance.

Method: Retrieves key relational paths from indexing graph, converts them to text, uses flow-based pruning to reduce redundancy, and employs path-based prompting for LLMs.

Result: Consistently outperforms state-of-the-art baselines across six datasets and five evaluation dimensions.

Conclusion: PathRAG effectively reduces redundant information and guides LLMs to generate more logical and coherent responses through path-based retrieval and prompting.

Abstract: Retrieval-augmented generation (RAG) improves the response quality of large language models (LLMs) by retrieving knowledge from external databases. Typical RAG approaches split the text database into chunks, organizing them in a flat structure for efficient searches. To better capture the inherent dependencies and structured relationships across the text database, researchers propose to organize textual information into an indexing graph, known asgraph-based RAG. However, we argue that the limitation of current graph-based RAG methods lies in the redundancy of the retrieved information, rather than its insufficiency. Moreover, previous methods use a flat structure to organize retrieved information within the prompts, leading to suboptimal performance. To overcome these limitations, we propose PathRAG, which retrieves key relational paths from the indexing graph, and converts these paths into textual form for prompting LLMs. Specifically, PathRAG effectively reduces redundant information with flow-based pruning, while guiding LLMs to generate more logical and coherent responses with path-based prompting. Experimental results show that PathRAG consistently outperforms state-of-the-art baselines across six datasets and five evaluation dimensions. The code is available at the following link: https://github.com/BUPT-GAMMA/PathRAG

[104] Aligning Extraction and Generation for Robust Retrieval-Augmented Generation

Hwanjun Song, Jeonghwan Choi, Minseok Kim

Main category: cs.CL

TL;DR: Ext2Gen is an extract-then-generate framework that jointly performs evidence selection and answer generation to improve LLM robustness against retrieval noise, eliminating the need for separate compression modules.

DetailsMotivation: Retrieval-augmented generation suffers from noise in retrieved content and uncertain placement of relevant chunks, leading to hallucinations in LLM outputs.

Method: Joint evidence selection and answer generation framework optimized through preference alignment with curated pairwise feedback, dynamically identifying relevant content while suppressing noise.

Result: Ext2Gen substantially enhances generation robustness and outperforms methods using independent compression models like Recomp, CompAct, and EXIT, with additional benefits from improved retrieval techniques.

Conclusion: Generation-side enhancements can address limitations that retrieval alone cannot overcome, producing more accurate and faithful answers even under noisy retrieval conditions.

Abstract: Retrieval-augmented generation (RAG) enhances LLMs with external knowledge, yet generation remains vulnerable to retrieval-induced noise and uncertain placement of relevant chunks, often causing hallucinations. We present Ext2Gen, an extract-then-generate framework that strengthens LLMs via joint evidence selection and answer generation, dynamically identifying query-relevant content while suppressing noise, thereby removing the need for any independent pre-generation compression module. Optimized through preference alignment with well-curated pairwise feedback, Ext2Gen produces accurate and faithful answers even under noisy or imprecise retrieval. Experiments demonstrate that it substantially enhances the robustness of the generation backbone and yields greater performance gains than methods relying on independent compression models, e.g., Recomp, CompAct, EXIT). It further benefits from improved retrieval techniques such as query rewriting, underscoring that generation-side enhancements address limitations that retrieval alone cannot overcome.

[105] Scaling Laws for Conditional Emergence of Multilingual Image Captioning via Generalization from Translation

Julian Spravil, Sebastian Houben, Sven Behnke

Main category: cs.CL

TL;DR: VLMs can learn zero-shot image captioning in languages only seen in translation tasks through cross-lingual transfer, following scaling laws influenced by model size, multilinguality, and training data.

DetailsMotivation: Address task-specific data scarcity in multilingual VLMs by enabling zero-shot image captioning in languages encountered only during translation training.

Method: Train encoder-decoder transformer VLMs (Florence-2 and Gemma-2 based, 0.4B-11.2B params) on synthetic dataset with image-aligned translations, using language prefixes to enable captioning emergence.

Result: Captioning emerges in languages only seen in translation tasks, following scaling laws based on model multilinguality, size, and seen samples; competitive performance achieved on downstream tasks.

Conclusion: Cross-lingual transfer enables zero-shot capabilities in VLMs, with scaling laws governing indirect learning of unseen task-language pairs.

Abstract: Cross-lingual, cross-task transfer is challenged by task-specific data scarcity, which becomes more severe as language support grows and is further amplified in vision-language models (VLMs). We investigate multilingual generalization in encoder-decoder transformer VLMs to enable zero-shot image captioning in languages encountered only in the translation task. In this setting, the encoder must learn to generate generalizable, task-aware latent vision representations to instruct the decoder via inserted cross-attention layers. To analyze scaling behavior, we train Florence-2 based and Gemma-2 based models (0.4B to 11.2B parameters) on a synthetic dataset using varying compute budgets. While all languages in the dataset have image-aligned translations, only a subset of them include image captions. Notably, we show that captioning can emerge using a language prefix, even when this language only appears in the translation task. We find that indirect learning of unseen task-language pairs adheres to scaling laws that are governed by the multilinguality of the model, model size, and seen training samples. Finally, we demonstrate that the scaling laws extend to downstream tasks, achieving competitive performance through fine-tuning in multimodal machine translation (Multi30K, CoMMuTE), lexical disambiguation (CoMMuTE), and image captioning (Multi30K, XM3600, COCO Karpathy).

[106] Explain with Visual Keypoints Like a Real Mentor! A Benchmark for Multimodal Solution Explanation

Jaewoo Park, Jungyang Park, Dongju Jang, Jiwan Chung, Byungwoo Yoo, Jaewoo Shin, Seonjoon Park, Taehyeong Kim, Youngjae Yu

Main category: cs.CL

TL;DR: The paper introduces a multimodal solution explanation task to evaluate LLMs’ ability to identify visual keypoints in math problems and generate explanations that incorporate these visual elements, highlighting current models’ limitations in mathematical visual grounding.

DetailsMotivation: Current LLM-generated explanations lack multimodal elements like diagrams and visual aids that human tutors routinely use, creating a gap in educational AI systems' ability to provide comprehensive explanations.

Method: Proposed a multimodal solution explanation task and created ME2 benchmark with 1,000 math problems annotated with visual keypoints and corresponding explanatory text that references visual elements.

Result: Current models struggle to identify visual keypoints and open-source models face notable difficulties in generating keypoint-based explanations, revealing significant gaps in mathematical visual grounding and visually grounded reasoning.

Conclusion: The multimodal solution explanation task and ME2 dataset will catalyze research on LLMs in education and promote their use as effective, explanation-oriented AI tutors by addressing visual reasoning capabilities.

Abstract: With the rapid advancement of mathematical reasoning capabilities in Large Language Models (LLMs), AI systems are increasingly being adopted in educational settings to support students’ comprehension of problem-solving processes. However, a critical component remains underexplored in current LLM-generated explanations: multimodal explanation. In real-world instructional contexts, human tutors routinely employ visual aids, such as diagrams, markings, and highlights, to enhance conceptual clarity. To bridge this gap, we introduce the multimodal solution explanation task, designed to evaluate whether models can identify visual keypoints, such as auxiliary lines, points, angles, and generate explanations that incorporate these key elements essential for understanding. To evaluate model performance on this task, we propose ME2, a multimodal benchmark consisting of 1,000 math problems annotated with visual keypoints and corresponding explanatory text that references those elements. Our empirical results show that current models struggle to identify visual keypoints. In the task of generating keypoint-based explanations, open-source models also face notable difficulties. This highlights a significant gap in current LLMs’ ability to perform mathematical visual grounding, engage in visually grounded reasoning, and provide explanations in educational contexts. We expect that the multimodal solution explanation task and the ME2 dataset will catalyze further research on LLMs in education and promote their use as effective, explanation-oriented AI tutors.

Shubham Kumar Nigam, Balaramamahanthi Deepak Patnaik, Shivam Mishra, Noel Shallum, Kripabandhu Ghosh, Arnab Bhattacharya

Main category: cs.CL

TL;DR: TathyaNyaya is the largest annotated dataset for Fact-based Judgment Prediction and Explanation (FJPE) in Indian legal context, paired with FactLegalLlama, an instruction-tuned LLM for generating high-quality legal explanations.

DetailsMotivation: To develop robust AI-driven legal decision-making tools that rely on factual data rather than complete legal texts, reflecting real-world judicial processes where factual data drives outcomes.

Method: Created TathyaNyaya dataset from Supreme Court and High Court judgments focusing on factual statements, and developed FactLegalLlama by instruction-tuning LLaMa-3-8B on this dataset. Used transformers for binary judgment prediction combined with FactLegalLlama for explanation generation.

Result: TathyaNyaya surpasses existing datasets in scale and diversity, establishing benchmarks for explainable AI in legal analysis. The framework enhances predictive performance and interpretability through factual precision and domain-specific tuning.

Conclusion: TathyaNyaya and FactLegalLlama serve as foundational resources for AI-assisted legal decision-making, emphasizing the importance of factual precision and domain-specific approaches for transparency and interpretability in legal AI systems.

Abstract: In the landscape of Fact-based Judgment Prediction and Explanation (FJPE), reliance on factual data is essential for developing robust and realistic AI-driven decision-making tools. This paper introduces TathyaNyaya, the largest annotated dataset for FJPE tailored to the Indian legal context, encompassing judgments from the Supreme Court of India and various High Courts. Derived from the Hindi terms “Tathya” (fact) and “Nyaya” (justice), the TathyaNyaya dataset is uniquely designed to focus on factual statements rather than complete legal texts, reflecting real-world judicial processes where factual data drives outcomes. Complementing this dataset, we present FactLegalLlama, an instruction-tuned variant of the LLaMa-3-8B Large Language Model (LLM), optimized for generating high-quality explanations in FJPE tasks. Finetuned on the factual data in TathyaNyaya, FactLegalLlama integrates predictive accuracy with coherent, contextually relevant explanations, addressing the critical need for transparency and interpretability in AI-assisted legal systems. Our methodology combines transformers for binary judgment prediction with FactLegalLlama for explanation generation, creating a robust framework for advancing FJPE in the Indian legal domain. TathyaNyaya not only surpasses existing datasets in scale and diversity but also establishes a benchmark for building explainable AI systems in legal analysis. The findings underscore the importance of factual precision and domain-specific tuning in enhancing predictive performance and interpretability, positioning TathyaNyaya and FactLegalLlama as foundational resources for AI-assisted legal decision-making.

[108] Accommodate Knowledge Conflicts in Retrieval-augmented LLMs: Towards Robust Response Generation in the Wild

Jiatai Wang, Zhiwei Xu, Di Jin, Xuewen Yang, Tao Li

Main category: cs.CL

TL;DR: Swin-VIB is a novel framework that uses variational information bottleneck models to help LLMs handle knowledge conflicts between internal memory and external information, improving response reliability and reducing uncertainty.

DetailsMotivation: LLMs often face knowledge conflicts from misinformation, biases, or outdated knowledge, which undermine response reliability and introduce uncertainty in decision-making.

Method: The proposed Swin-VIB framework integrates a pipeline of variational information bottleneck models to adapt the retrieved information difference, facilitating robust response generation in conflicting contexts.

Result: Extensive experiments show Swin-VIB outperforms all competitive baselines in multiple-choice task accuracy and improves EM values in open-ended QA tasks by at least 11.14%.

Conclusion: The framework effectively addresses knowledge conflicts in LLMs by leveraging information-theoretic principles to reduce uncertainty and improve response reliability.

Abstract: The proliferation of large language models (LLMs) has significantly advanced intelligent systems. Unfortunately, LLMs often face knowledge conflicts between internal memory and retrieved external information, arising from misinformation, biases, or outdated knowledge. These conflicts undermine response reliability and introduce uncertainty in decision-making. In this work, we analyze how LLMs navigate knowledge conflicts from an information-theoretic perspective and reveal that when conflicting and supplementary information exhibit significant differences, LLMs confidently resolve their preferences and alleviate the uncertainty during their response generation. When this difference is ambiguous, LLMs experience considerable uncertainty about their generation. Based on this insight, we propose Swin-VIB, a novel framework that integrates a pipeline of variational information bottleneck models to adapt the retrieved information difference, facilitating robust response generation of LLMs even in conflicting contexts. Extensive experiments confirm our theoretical analysis and demonstrate the performance of Swin-VIB. Notably, Swin-VIB outperforms all competitive baselines in terms of the accuracy of the multiple-choice task, while improving the EM values in the open-ended QA task by at least 11.14%.

[109] FRAME: Feedback-Refined Agent Methodology for Enhancing Medical Research Insights

Chengzhang Yu, Yiming Zhang, Zhixin Liu, Zenghui Ding, Yining Sun, Zhanpeng Jin

Main category: cs.CL

TL;DR: FRAME is a framework that improves automated medical paper generation using iterative refinement and structured feedback from three specialized agents, achieving quality comparable to human-authored papers.

DetailsMotivation: To address challenges in knowledge synthesis and quality assurance for automated scientific research using LLMs, particularly in the medical domain where accuracy and rigor are critical.

Method: Three-component architecture: Generator creates content, Evaluator assesses quality using metrics, Reflector provides improvement feedback. Built on structured dataset of 4,287 medical papers decomposed into research components.

Result: Significant improvements over conventional approaches (9.91% average gain with DeepSeek V3, comparable with GPT-4o Mini). Human evaluation shows quality comparable to human-authored papers, especially strong in synthesizing future research directions.

Conclusion: FRAME provides a robust foundation for automated medical research paper generation that maintains academic standards and can efficiently assist medical research.

Abstract: The automation of scientific research through large language models (LLMs) presents significant opportunities but faces critical challenges in knowledge synthesis and quality assurance. We introduce Feedback-Refined Agent Methodology (FRAME), a novel framework that enhances medical paper generation through iterative refinement and structured feedback. Our approach comprises three key innovations: (1) A structured dataset construction method that decomposes 4,287 medical papers into essential research components through iterative refinement; (2) A tripartite architecture integrating Generator, Evaluator, and Reflector agents that progressively improve content quality through metric-driven feedback; and (3) A comprehensive evaluation framework that combines statistical metrics with human-grounded benchmarks. Experimental results demonstrate FRAME’s effectiveness, achieving significant improvements over conventional approaches across multiple models (9.91% average gain with DeepSeek V3, comparable improvements with GPT-4o Mini) and evaluation dimensions. Human evaluation confirms that FRAME-generated papers achieve quality comparable to human-authored works, with particular strength in synthesizing future research directions. The results demonstrated our work could efficiently assist medical research by building a robust foundation for automated medical research paper generation while maintaining rigorous academic standards.

[110] The taggedPBC: Annotating a massive parallel corpus for crosslinguistic investigations

Hiram Ring

Main category: cs.CL

TL;DR: The taggedPBC is a large POS-tagged parallel dataset covering 1,940+ languages across 155 families and 78 isolates, enabling better crosslinguistic research by addressing limitations of existing datasets.

DetailsMotivation: To overcome limitations in existing crosslinguistic datasets that either have large data for few languages or small data for many languages, constraining universal language property claims.

Method: Developed a large tagged parallel dataset (taggedPBC) with POS-tagged text data from diverse languages, validated against SOTA taggers and hand-tagged corpora, and introduced the N1 ratio measure for word order analysis.

Result: The dataset shows high accuracy correlation with existing taggers and hand-tagged corpora. The N1 ratio correlates with expert word order determinations, enabling accurate classification of intransitive word order using Gaussian Naive Bayes.

Conclusion: taggedPBC represents a significant advancement for corpus-based crosslinguistic research, available via GitHub for collaboration, though further expansion is needed.

Abstract: Existing datasets available for crosslinguistic investigations have tended to focus on large amounts of data for a small group of languages or a small amount of data for a large number of languages. This means that claims based on these datasets are limited in what they reveal about universal properties of the human language faculty. While this has begun to change through the efforts of projects seeking to develop tagged corpora for a large number of languages, such efforts are still constrained by limits on resources. The current paper reports on a large tagged parallel dataset which has been developed to partially address this issue. The taggedPBC contains POS-tagged parallel text data from more than 1,940 languages, representing 155 language families and 78 isolates, dwarfing previously available resources. The accuracy of particular tags in this dataset is shown to correlate well with both existing SOTA taggers for high-resource languages (SpaCy, Trankit) as well as hand-tagged corpora (Universal Dependencies Treebanks). Additionally, a novel measure derived from this dataset, the N1 ratio, correlates with expert determinations of intransitive word order in three typological databases (WALS, Grambank, Autotyp) such that a Gaussian Naive Bayes classifier trained on this feature can accurately identify basic intransitive word order for languages not in those databases. While much work is still needed to expand and develop this dataset, the taggedPBC is an important step to enable corpus-based crosslinguistic investigations, and is made available for research and collaboration via GitHub.

[111] Beyond Chains: Bridging Large Language Models and Knowledge Bases in Complex Question Answering

Yihua Zhu, Qianying Liu, Akiko Aizawa, Hidetoshi Shimodaira

Main category: cs.CL

TL;DR: PDRR is a four-stage KBQA framework that predicts question types, decomposes questions into triples, retrieves KB information, and uses LLM reasoning to handle both simple and complex questions better than existing methods.

DetailsMotivation: LLM-only KBQA approaches suffer from outdated knowledge, hallucinations, and lack of transparency, while chain-based KG-RAG methods are limited to simple chain-structured questions due to lack of planning and logical structuring.

Method: Proposed PDRR framework with four stages: Predict (question type), Decompose (into structured triples), Retrieve (from KBs), and Reason (using LLM as agent to complete triples).

Result: Experimental results show PDRR consistently outperforms existing methods across various LLM backbones and achieves superior performance on both chain-structured and non-chain complex questions.

Conclusion: The PDRR framework effectively addresses limitations of both LLM-only and chain-based approaches by incorporating planning and logical structuring through semantic parsing-inspired decomposition.

Abstract: Knowledge Base Question Answering (KBQA) aims to answer natural language questions using structured knowledge from KBs. While LLM-only approaches offer generalization, they suffer from outdated knowledge, hallucinations, and lack of transparency. Chain-based KG-RAG methods address these issues by incorporating external KBs, but are limited to simple chain-structured questions due to the absence of planning and logical structuring. Inspired by semantic parsing methods, we propose PDRR: a four-stage framework consisting of Predict, Decompose, Retrieve, and Reason. Our method first predicts the question type and decomposes the question into structured triples. Then retrieves relevant information from KBs and guides the LLM as an agent to reason over and complete the decomposed triples. Experimental results demonstrate that PDRR consistently outperforms existing methods across various LLM backbones and achieves superior performance on both chain-structured and non-chain complex questions.

[112] Fooling the LVLM Judges: Visual Biases in LVLM-Based Evaluation

Yerin Hwang, Dongryeol Lee, Kyungmin Min, Taegwan Kang, Yong-il Kim, Kyomin Jung

Main category: cs.CL

TL;DR: LVLM judges are vulnerable to adversarial visual manipulations that systematically inflate text-image alignment scores, revealing critical robustness issues in current evaluation systems.

DetailsMotivation: Large vision-language models (LVLMs) are widely used for text-image alignment evaluation, but their robustness against visual manipulations remains underexplored. The study investigates whether adversarial visual biases can systematically fool LVLM judges.

Method: Introduced FRAME benchmark with diverse score distributions, defined visual biases, and tested LVLM judges’ vulnerability by injecting these biases into the benchmark. Also examined combined biases, pairwise evaluations, and prompt-based mitigation strategies.

Result: All tested LVLM judges showed vulnerability across all domains, consistently inflating scores for manipulated images. Combining multiple biases amplified effects, and visual biases persisted despite mitigation attempts.

Conclusion: Current LVLM evaluation systems have significant vulnerability to visual manipulations, highlighting the urgent need for more robust LVLM judges that can resist adversarial visual biases.

Abstract: Recently, large vision-language models (LVLMs) have emerged as the preferred tools for judging text-image alignment, yet their robustness along the visual modality remains underexplored. This work is the first study to address a key research question: Can adversarial visual manipulations systematically fool LVLM judges into assigning unfairly inflated scores? We define potential image induced biases within the context of T2I evaluation and examine how these biases affect the evaluations of LVLM judges. Moreover, we introduce a novel, fine-grained, multi-domain meta-evaluation benchmark named FRAME, which is deliberately constructed to exhibit diverse score distributions. By introducing the defined biases into the benchmark, we reveal that all tested LVLM judges exhibit vulnerability across all domains, consistently inflating scores for manipulated images. Further analysis reveals that combining multiple biases amplifies their effects, and pairwise evaluations are similarly susceptible. Moreover, we observe that visual biases persist under prompt-based mitigation strategies, highlighting the vulnerability of current LVLM evaluation systems and underscoring the urgent need for more robust LVLM judges.

[113] Leveraging Online Data to Enhance Medical Knowledge in a Small Persian Language Model

Mehrdad Ghassabi, Pedram Rostami, Hamidreza Baradaran Kashani, Amirhossein Poursina, Zahra Kazemi, Milad Tavakoli

Main category: cs.CL

TL;DR: First Persian medical dataset created with 20k doctor-patient Q&A pairs and 54M tokens from medical magazines. Fine-tuned aya-expanse-8b model achieved improved medical QA accuracy, passed Iranian medical exam, and boosted MMLU performance by 2.67%.

DetailsMotivation: Small language models struggle with specialized domains in low-resource languages like Persian, and no curated medical dataset existed for Persian despite available online medical content.

Method: Parameter-efficient fine-tuning using newly curated dataset of 20k doctor-patient Q&A pairs and 60% of 90M-token crawled corpus from Persian medical magazines on aya-expanse-8b baseline model.

Result: Fine-tuned model achieved improved medical question answering accuracy, successfully passed Iranian Basic Medical Science Entrance Exam (which baseline failed), and improved Persian-translated MMLU accuracy by average 2.67%.

Conclusion: Open-access online data can effectively enrich small language models for medical fields in low-resource languages, providing practical solutions for resource-constrained environments.

Abstract: The rapid advancement of language models has demonstrated the potential of artificial intelligence in the healthcare industry. However, small language models struggle with specialized domains in low-resource languages like Persian. While numerous medical-domain websites exist in Persian, no curated dataset or corpus has been available making ours the first of its kind. This study introduces a newly curated dataset comprising 20k doctor-patient Q&A pairs and 60% of a 90-million-token crawled corpus from medical magazines. Using a parameter-efficient fine-tuning approach, we enhanced the medical knowledge of the baseline model, aya-expanse-8b. Benchmark evaluations demonstrate that the fine-tuned model achieves improved accuracy in medical question answering and successfully passed the Iranian Basic Medical Science Entrance Exam (IBSEE) in September 2023, which the baseline model did not. Additionally, the fine-tuned model improved Persian-translated MMLU accuracy by an average of 2.67%. This work highlights the potential of leveraging open-access online data to enrich small language models in medical fields, providing a novel solution for Persian medical AI applications suitable for resource-constrained environments. Future research could explore multimodal input to further enhance performance.

[114] Chain-of-Thought Driven Adversarial Scenario Extrapolation for Robust Language Models

Md Rafi Ur Rashid, Vishnu Asutosh Dasu, Ye Wang, Gang Tan, Shagufta Mehnaz

Main category: cs.CL

TL;DR: ASE is a novel inference-time framework that uses Chain-of-Thought reasoning to enhance LLM safety by guiding models to contemplate adversarial scenarios before responding, achieving near-zero jailbreak rates while maintaining high usability.

DetailsMotivation: LLMs face growing safety risks like jailbreaks, toxic content, hallucinations, and bias, while existing defenses are either too narrow or sacrifice user experience through rigid rejection mechanisms.

Method: Adversarial Scenario Extrapolation (ASE) framework that leverages Chain-of-Thought reasoning to guide LLMs through self-generative processes of contemplating potential adversarial scenarios and formulating defensive strategies before responding to user queries.

Result: Achieves near-zero jailbreak attack success rates and minimal toxicity, reduces outright rejections to <4%, outperforms six state-of-the-art defenses in robustness-seamlessness trade-offs, with 92-99% accuracy on adversarial Q&A and 4-10x lower bias scores.

Conclusion: ASE transforms adversarial perception into an intrinsic cognitive process, setting a new paradigm for secure and natural human-AI interaction by simultaneously enhancing robustness and seamlessness.

Abstract: Large Language Models (LLMs) exhibit impressive capabilities, but remain susceptible to a growing spectrum of safety risks, including jailbreaks, toxic content, hallucinations, and bias. Existing defenses often address only a single threat type or resort to rigid outright rejection, sacrificing user experience and failing to generalize across diverse and novel attacks. This paper introduces Adversarial Scenario Extrapolation (ASE), a novel inference-time computation framework that leverages Chain-of-Thought (CoT) reasoning to simultaneously enhance LLM robustness and seamlessness. ASE guides the LLM through a self-generative process of contemplating potential adversarial scenarios and formulating defensive strategies before generating a response to the user query. Comprehensive evaluation on four adversarial benchmarks with four latest LLMs shows that ASE achieves near-zero jailbreak attack success rates and minimal toxicity, while slashing outright rejections to <4%. ASE outperforms six state-of-the-art defenses in robustness-seamlessness trade-offs, with 92-99% accuracy on adversarial Q&A and 4-10x lower bias scores. By transforming adversarial perception into an intrinsic cognitive process, ASE sets a new paradigm for secure and natural human-AI interaction.

[115] SCRum-9: Multilingual Stance Classification over Rumours on Social Media

Yue Li, Jake Vasilakes, Zhixue Zhao, Carolina Scarton

Main category: cs.CL

TL;DR: SCRum-9 is the largest multilingual stance classification dataset for rumour analysis across 9 languages, containing 7,516 tweets linked to 2.1k fact-checked claims with confidence annotations. The dataset enables benchmarking of LLMs and MLMs, explores synthetic data generation, and reveals model predictions often align with human second-choice labels.

DetailsMotivation: To address limitations in existing stance classification datasets by creating a more comprehensive multilingual resource that accounts for annotator variability, covers more languages, and links examples to verified claims for rumour analysis.

Method: Created SCRum-9 dataset with 7,516 tweets across 9 languages, annotated by native speakers with confidence scores. Benchmarked 5 LLMs and 2 MLMs using ICL and fine-tuning. Explored multilingual synthetic data generation from LLMs for fine-tuning smaller models.

Result: LLMs with weak ICL performance can generate valuable synthetic data that enables small MLMs to outperform zero-shot ICL in LLMs. Model predictions often match human annotators’ second-choice labels rather than diverging from human judgments.

Conclusion: SCRum-9 provides a valuable resource for multilingual rumour analysis research. Synthetic data from LLMs can effectively boost performance of smaller models, and model predictions show alignment with human uncertainty patterns in ambiguous cases.

Abstract: We introduce SCRum-9, the largest multilingual Stance Classification dataset for Rumour analysis in 9 languages, containing 7,516 tweets from X. SCRum-9 goes beyond existing stance classification datasets by covering more languages, linking examples to more fact-checked claims (2.1k), and including confidence-related annotations from multiple annotators to account for intra- and inter-annotator variability. Annotations were made by at least two native speakers per language, totalling more than 405 hours of annotation and 8,150 dollars in compensation. Further, SCRum-9 is used to benchmark five large language models (LLMs) and two multilingual masked language models (MLMs) in In-Context Learning (ICL) and fine-tuning setups. This paper also innovates by exploring the use of multilingual synthetic data for rumour stance classification, showing that even LLMs with weak ICL performance can produce valuable synthetic data for fine-tuning small MLMs, enabling them to achieve higher performance than zero-shot ICL in LLMs. Finally, we examine the relationship between model predictions and human uncertainty on ambiguous cases finding that model predictions often match the second-choice labels assigned by annotators, rather than diverging entirely from human judgments. SCRum-9 is publicly released to the research community with potential to foster further research on multilingual analysis of misleading narratives on social media.

[116] T^2Agent A Tool-augmented Multimodal Misinformation Detection Agent with Monte Carlo Tree Search

Xing Cui, Yueying Zou, Zekun Li, Peipei Li, Xinyuan Xu, Xuannan Liu, Huaibo Huang

Main category: cs.CL

TL;DR: A misinformation detection agent called \method that uses Monte Carlo Tree Search with an extensible toolkit to dynamically verify mixed-source multimodal misinformation.

DetailsMotivation: Existing methods use static pipelines and limited tools, which are inadequate for handling the complexity and diversity of real-world multimodal misinformation from mixed forgery sources.

Method: Proposes \method with an extensible toolkit (web search, forgery detection, consistency analysis) and MCTS with multi-source verification. Uses greedy search to select relevant tools, then MCTS with dual rewards for balanced exploration-exploitation.

Result: Extensive experiments show \method consistently outperforms existing baselines on challenging mixed-source multimodal misinformation benchmarks.

Conclusion: \method demonstrates strong potential as a training-free detector for complex multimodal misinformation, with effective tree search mechanisms and tool usage validated through ablation studies.

Abstract: Real-world multimodal misinformation often arises from mixed forgery sources, requiring dynamic reasoning and adaptive verification. However, existing methods mainly rely on static pipelines and limited tool usage, limiting their ability to handle such complexity and diversity. To address this challenge, we propose \method, a novel misinformation detection agent that incorporates an extensible toolkit with Monte Carlo Tree Search (MCTS). The toolkit consists of modular tools such as web search, forgery detection, and consistency analysis. Each tool is described using standardized templates, enabling seamless integration and future expansion. To avoid inefficiency from using all tools simultaneously, a greedy search-based selector is proposed to identify a task-relevant subset. This subset then serves as the action space for MCTS to dynamically collect evidence and perform multi-source verification. To better align MCTS with the multi-source nature of misinformation detection, \method~ extends traditional MCTS with multi-source verification, which decomposes the task into coordinated subtasks targeting different forgery sources. A dual reward mechanism containing a reasoning trajectory score and a confidence score is further proposed to encourage a balance between exploration across mixed forgery sources and exploitation for more reliable evidence. We conduct ablation studies to confirm the effectiveness of the tree search mechanism and tool usage. Extensive experiments further show that \method~ consistently outperforms existing baselines on challenging mixed-source multimodal misinformation benchmarks, demonstrating its strong potential as a training-free detector.

[117] Lookahead Q-Cache: Achieving More Consistent KV Cache Eviction via Pseudo Query

Yixuan Wang, Shiyu Ji, Yijun Liu, Yuzhuang Xu, Yang Xu, Qingfu Zhu, Wanxiang Che

Main category: cs.CL

TL;DR: LAQ proposes a novel KV cache eviction framework using pseudo lookahead queries to better approximate decoding-stage queries, achieving more consistent and accurate cache management under memory constraints.

DetailsMotivation: Existing KV cache eviction methods use prefilling-stage attention scores, causing inconsistency with actual inference queries, especially under tight memory budgets where KV cache memory usage grows substantially with longer sequences.

Method: LAQ generates low-cost pseudo lookahead queries to serve as observation windows for importance estimation, enabling more accurate KV cache eviction aligned with real inference scenarios.

Result: Experimental results on LongBench and Needle-in-a-Haystack benchmarks show LAQ outperforms existing methods across various budget levels, achieving 1-4 point improvement on LongBench under limited cache budget.

Conclusion: LAQ provides a more consistent and accurate KV cache eviction framework that is complementary to existing approaches and can be flexibly combined for further improvements.

Abstract: Large language models (LLMs) rely on key-value cache (KV cache) to accelerate decoding by reducing redundant computations. However, the KV cache memory usage grows substantially with longer text sequences, posing challenges for efficient deployment. Existing KV cache eviction methods prune tokens using prefilling-stage attention scores, causing inconsistency with actual inference queries, especially under tight memory budgets. In this paper, we propose Lookahead Q-Cache (LAQ), a novel eviction framework that generates low-cost pseudo lookahead queries to better approximate the true decoding-stage queries. By using these lookahead queries as the observation window for importance estimation, LAQ achieves more consistent and accurate KV cache eviction aligned with real inference scenarios. Experimental results on LongBench and Needle-in-a-Haystack benchmarks show that LAQ outperforms existing methods across various budget levels, achieving a 1 $\sim$ 4 point improvement on LongBench under limited cache budget. Moreover, LAQ is complementary to existing approaches and can be flexibly combined to yield further improvements.

[118] Language Model Distillation: A Temporal Difference Imitation Learning Perspective

Zishun Yu, Shangzhe Li, Xinhua Zhang

Main category: cs.CL

TL;DR: A framework for temporal difference-based distillation that exploits the distributional sparsity of teacher language models by operating on reduced action spaces.

DetailsMotivation: Large language models have high computational costs, and distillation is used to compress them. Existing methods can be viewed as behavior cloning, inspiring the use of reinforcement learning techniques for more efficient distillation.

Method: Leverages the observation that language models assign most probability mass to few tokens. Designs a temporal difference learning framework that operates on a reduced action space (vocabulary subset) rather than the full vocabulary.

Result: Demonstrates how practical algorithms can be derived from this framework and shows resulting performance improvements in distillation.

Conclusion: The proposed temporal difference-based distillation framework effectively exploits teacher model sparsity to create more efficient distillation methods with improved performance.

Abstract: Large language models have led to significant progress across many NLP tasks, although their massive sizes often incur substantial computational costs. Distillation has become a common practice to compress these large and highly capable models into smaller, more efficient ones. Many existing language model distillation methods can be viewed as behavior cloning from the perspective of imitation learning or inverse reinforcement learning. This viewpoint has inspired subsequent studies that leverage (inverse) reinforcement learning techniques, including variations of behavior cloning and temporal difference learning methods. Rather than proposing yet another specific temporal difference method, we introduce a general framework for temporal difference-based distillation by exploiting the distributional sparsity of the teacher model. Specifically, it is often observed that language models assign most probability mass to a small subset of tokens. Motivated by this observation, we design a temporal difference learning framework that operates on a reduced action space (a subset of vocabulary), and demonstrate how practical algorithms can be derived and the resulting performance improvements.

[119] REIC: RAG-Enhanced Intent Classification at Scale

Ziji Zhang, Michael Yang, Zhiyu Chen, Yingying Zhuang, Shu-Ting Pi, Qun Liu, Rajashekar Maragoud, Vy Nguyen, Anurag Beniwal

Main category: cs.CL

TL;DR: REIC is a retrieval-augmented generation approach for intent classification that outperforms traditional methods in large-scale customer service settings without frequent retraining.

DetailsMotivation: Address scalability challenges in intent classification as companies expand product lines, dealing with increasing intents and taxonomy variations across verticals.

Method: Uses retrieval-augmented generation (RAG) to dynamically incorporate relevant knowledge for precise classification.

Result: Outperforms traditional fine-tuning, zero-shot, and few-shot methods in both in-domain and out-of-domain scenarios on real-world datasets.

Conclusion: Demonstrates potential for real-world deployment in adaptive and large-scale intent classification systems.

Abstract: Accurate intent classification is critical for efficient routing in customer service, ensuring customers are connected with the most suitable agents while reducing handling times and operational costs. However, as companies expand their product lines, intent classification faces scalability challenges due to the increasing number of intents and variations in taxonomy across different verticals. In this paper, we introduce REIC, a Retrieval-augmented generation Enhanced Intent Classification approach, which addresses these challenges effectively. REIC leverages retrieval-augmented generation (RAG) to dynamically incorporate relevant knowledge, enabling precise classification without the need for frequent retraining. Through extensive experiments on real-world datasets, we demonstrate that REIC outperforms traditional fine-tuning, zero-shot, and few-shot methods in large-scale customer service settings. Our results highlight its effectiveness in both in-domain and out-of-domain scenarios, demonstrating its potential for real-world deployment in adaptive and large-scale intent classification systems.

[120] Compress, Gather, and Recompute: REFORMing Long-Context Processing in Transformers

Woomin Song, Sai Muralidhar Jayanthi, Srikanth Ronanki, Kanthashree Mysore Sathyendra, Jinwoo Shin, Aram Galstyan, Shubham Katiyar, Sravan Babu Bodapati

Main category: cs.CL

TL;DR: REFORM is a novel inference framework that efficiently processes extremely long contexts through two-phase chunk processing with compressed KV cache and selective recomputation, achieving significant performance gains and efficiency improvements.

DetailsMotivation: Address the challenge of processing extremely long contexts exceeding model limits, overcoming limitations of existing approaches like recurrent compression (poor information preservation) and random access (high memory requirements).

Method: Two-phase approach: 1) Incrementally process input chunks with compressed KV cache, cross-layer context embeddings, and early exit strategy; 2) Identify essential tokens via similarity matching and selectively recompute KV cache.

Result: Achieves over 52% and 34% performance gains on RULER and BABILong at 1M context length, outperforms baselines on Infinite-Bench, RepoEval, and MM-NIAH, reduces inference time by 30% and peak memory usage by 5%.

Conclusion: REFORM demonstrates both efficiency and superior performance across diverse tasks and domains, providing an effective solution for long-context processing challenges.

Abstract: As large language models increasingly gain popularity in real-world applications, processing extremely long contexts, often exceeding the model’s pre-trained context limits, has emerged as a critical challenge. While existing approaches to efficient long-context processing show promise, recurrent compression-based methods struggle with information preservation, whereas random access approaches require substantial memory resources. We introduce REFORM, a novel inference framework that efficiently handles long contexts through a two-phase approach. First, it incrementally processes input chunks while maintaining a compressed KV cache, constructs cross-layer context embeddings, and utilizes early exit strategy for improved efficiency. Second, it identifies and gathers essential tokens via similarity matching and selectively recomputes the KV cache. Compared to baselines, REFORM achieves over 52% and 34% performance gains on RULER and BABILong respectively at 1M context length. It also outperforms baselines on Infinite-Bench, RepoEval, and MM-NIAH, demonstrating flexibility across diverse tasks and domains. Additionally, REFORM reduces inference time by 30% and peak memory usage by 5%, achieving both efficiency and superior performance.

[121] Multimodal DeepResearcher: Generating Text-Chart Interleaved Reports From Scratch with Agentic Framework

Zhaorui Yang, Bo Pan, Han Wang, Yiyao Wang, Xingyu Liu, Luoxuan Weng, Yingchaojie Feng, Haozhe Feng, Minfeng Zhu, Bo Zhang, Wei Chen

Main category: cs.CL

TL;DR: Proposes Multimodal DeepResearcher, an agentic framework that generates interleaved text and visualizations for comprehensive reports using a structured visualization representation called FDV.

DetailsMotivation: Existing deep research frameworks focus only on text generation, leaving automated generation of multimodal reports with integrated visualizations underexplored despite visualizations being crucial for effective communication.

Method: Introduces Formal Description of Visualization (FDV) as structured textual representation for charts, and a 4-stage framework: researching, exemplar report textualization, planning, and multimodal report generation.

Result: Achieves 82% overall win rate over baseline using Claude 3.7 Sonnet model, with evaluation on MultimodalReportBench containing 100 diverse topics and 5 dedicated metrics.

Conclusion: The proposed framework effectively addresses the challenge of generating high-quality multimodal reports with integrated visualizations, demonstrating significant improvement over text-only approaches.

Abstract: Visualizations play a crucial part in effective communication of concepts and information. Recent advances in reasoning and retrieval augmented generation have enabled Large Language Models (LLMs) to perform deep research and generate comprehensive reports. Despite its progress, existing deep research frameworks primarily focus on generating text-only content, leaving the automated generation of interleaved texts and visualizations underexplored. This novel task poses key challenges in designing informative visualizations and effectively integrating them with text reports. To address these challenges, we propose Formal Description of Visualization (FDV), a structured textual representation of charts that enables LLMs to learn from and generate diverse, high-quality visualizations. Building on this representation, we introduce Multimodal DeepResearcher, an agentic framework that decomposes the task into four stages: (1) researching, (2) exemplar report textualization, (3) planning, and (4) multimodal report generation. For the evaluation of generated multimodal reports, we develop MultimodalReportBench, which contains 100 diverse topics served as inputs along with 5 dedicated metrics. Extensive experiments across models and evaluation methods demonstrate the effectiveness of Multimodal DeepResearcher. Notably, utilizing the same Claude 3.7 Sonnet model, Multimodal DeepResearcher achieves an 82% overall win rate over the baseline method.

[122] Accelerated Test-Time Scaling with Model-Free Speculative Sampling

Woomin Song, Saket Dingliwal, Sai Muralidhar Jayanthi, Bhavana Ganesh, Jinwoo Shin, Aram Galstyan, Sravan Babu Bodapati

Main category: cs.CL

TL;DR: STAND is a model-free speculative decoding method that accelerates language model reasoning by exploiting redundancy in reasoning paths, reducing latency by 60-65% without accuracy loss.

DetailsMotivation: Existing reasoning acceleration methods like best-of-N sampling and tree search require substantial computational resources, creating a performance-efficiency trade-off that needs addressing.

Method: Uses stochastic adaptive N-gram drafting with memory-efficient logit-based N-gram module, Gumbel-Top-K sampling, and data-driven tree construction to predict tokens without separate draft models.

Result: Reduces inference latency by 60-65% compared to standard autoregressive decoding while maintaining accuracy across multiple reasoning tasks (AIME-2024, GPQA-Diamond, LiveCodeBench).

Conclusion: STAND provides a plug-and-play solution that outperforms state-of-the-art speculative decoding methods and can be applied to any existing language model without additional training.

Abstract: Language models have demonstrated remarkable capabilities in reasoning tasks through test-time scaling techniques like best-of-N sampling and tree search. However, these approaches often demand substantial computational resources, creating a critical trade-off between performance and efficiency. We introduce STAND (STochastic Adaptive N-gram Drafting), a novel model-free speculative decoding approach that exploits the inherent redundancy in reasoning trajectories to achieve significant acceleration without compromising accuracy. Our analysis shows that reasoning paths frequently reuse similar reasoning patterns, enabling efficient model-free token prediction without requiring separate draft models. By introducing stochastic drafting and preserving probabilistic information through a memory-efficient logit-based N-gram module, combined with optimized Gumbel-Top-K sampling and data-driven tree construction, STAND significantly improves token acceptance rates. Extensive evaluations across multiple models and reasoning tasks (AIME-2024, GPQA-Diamond, and LiveCodeBench) demonstrate that STAND reduces inference latency by 60-65% compared to standard autoregressive decoding while maintaining accuracy. Furthermore, STAND consistently outperforms state-of-the-art speculative decoding methods across diverse inference patterns, including single-trajectory decoding, batch decoding, and test-time tree search. As a model-free approach, STAND can be applied to any existing language model without additional training, making it a powerful plug-and-play solution for accelerating language model reasoning.

[123] Joint Evaluation of Answer and Reasoning Consistency for Hallucination Detection in Large Reasoning Models

Changyue Wang, Weihang Su, Qingyao Ai, Yiqun Liu

Main category: cs.CL

TL;DR: RACE is a novel framework for detecting hallucinations in Large Reasoning Models by evaluating reasoning-answer consistency through four diagnostic signals.

DetailsMotivation: Existing hallucination detection methods fail to detect logical inconsistencies in reasoning traces of Large Reasoning Models, which are key sources of potential hallucination.

Method: RACE extracts essential reasoning steps and computes four diagnostic signals: inter-sample consistency of reasoning traces, entropy-based answer uncertainty, semantic alignment between reasoning and answers, and internal coherence of reasoning.

Result: Experiments across datasets and different LLMs show that RACE outperforms existing hallucination detection baselines.

Conclusion: RACE provides a robust and generalizable solution for evaluating Large Reasoning Models by detecting hallucinations in their reasoning traces.

Abstract: Large Reasoning Models (LRMs) extend large language models with explicit, multi-step reasoning traces to enhance transparency and performance on complex tasks. However, these reasoning traces can be redundant or logically inconsistent, becoming a new and hard-to-detect source of hallucination. Existing hallucination detection methods focus primarily on answer-level uncertainty and often fail to detect hallucinations or logical inconsistencies arising from the model’s reasoning trace. This oversight is particularly problematic for LRMs, where the explicit thinking trace is not only an important support to the model’s decision-making process but also a key source of potential hallucination. To this end, we propose RACE (Reasoning and Answer Consistency Evaluation), a novel framework specifically tailored for hallucination detection in LRMs. RACE operates by extracting essential reasoning steps and computing four diagnostic signals: inter-sample consistency of reasoning traces, entropy-based answer uncertainty, semantic alignment between reasoning and answers, and internal coherence of reasoning. The joint utilization of these signals makes RACE a more robust detector of hallucinations in LRMs. Experiments across datasets and different LLMs demonstrate that RACE outperforms existing hallucination detection baselines, offering a robust and generalizable solution for evaluating LRMs. The source code is available at https://github.com/bebr2/RACE

[124] MAPLE: Multi-Agent Adaptive Planning with Long-Term Memory for Table Reasoning

Ye Bai, Minghan Wang, Thuy-Trang Vu

Main category: cs.CL

TL;DR: MAPLE is a multi-agent framework that mimics human problem-solving for table-based QA, using specialized agents in a feedback loop with long-term memory to overcome LLM limitations in complex reasoning.

DetailsMotivation: Current LLMs struggle with complex table-based QA using single-pass inference, and existing approaches like Chain-of-Thought lack error detection and don't reuse problem-solving experiences, unlike human reasoning.

Method: MAPLE uses 4 specialized agents: Solver (ReAct reasoning), Checker (answer verification), Reflector (error diagnosis and strategy correction), and Archiver (long-term memory management for experience reuse and evolution).

Result: Experiments on WiKiTQ and TabFact show significant improvements over existing methods, achieving state-of-the-art performance across multiple LLM backbones.

Conclusion: The MAPLE framework successfully mimics human problem-solving through multi-agent collaboration with feedback loops and long-term memory, effectively addressing limitations of current LLM approaches for complex table-based QA.

Abstract: Table-based question answering requires complex reasoning capabilities that current LLMs struggle to achieve with single-pass inference. Existing approaches, such as Chain-of-Thought reasoning and question decomposition, lack error detection mechanisms and discard problem-solving experiences, contrasting sharply with how humans tackle such problems. In this paper, we propose MAPLE (Multi-agent Adaptive Planning with Long-term mEmory), a novel framework that mimics human problem-solving through specialized cognitive agents working in a feedback-driven loop. MAPLE integrates 4 key components: (1) a Solver using the ReAct paradigm for reasoning, (2) a Checker for answer verification, (3) a Reflector for error diagnosis and strategy correction, and (4) an Archiver managing long-term memory for experience reuse and evolution. Experiments on WiKiTQ and TabFact demonstrate significant improvements over existing methods, achieving state-of-the-art performance across multiple LLM backbones.

[125] Efficient Post-Training Refinement of Latent Reasoning in Large Language Models

Xinyuan Wang, Dongjie Wang, Wangyang Ying, Haoyue Bai, Nanxu Gong, Sixun Dong, Kunpeng Liu, Yanjie Fu

Main category: cs.CL

TL;DR: A lightweight post-training framework that refines latent reasoning trajectories using contrastive feedback and residual embedding refinement to improve reasoning accuracy without explicit intermediate steps.

DetailsMotivation: Chain-of-Thought prompting suffers from token overhead and fixed reasoning trajectories, while existing latent reasoning approaches lack effective methods for updating reasoning embeddings during post-training.

Method: Proposes two strategies: 1) Contrastive reasoning feedback comparing embeddings against strong/weak baselines, and 2) Residual embedding refinement that stabilizes updates by integrating current and historical gradients.

Result: Achieved 5% accuracy gain on MathQA without additional training, with extensive experiments on five reasoning benchmarks demonstrating effectiveness.

Conclusion: The proposed framework effectively refines latent reasoning trajectories, enabling improved reasoning accuracy through controlled embedding updates without explicit token generation.

Abstract: Reasoning is a key component of language understanding in Large Language Models. While Chain-of-Thought prompting enhances performance via explicit intermediate steps, it suffers from sufficient token overhead and a fixed reasoning trajectory, preventing step-wise refinement. Recent advances in latent reasoning address these limitations by refining internal reasoning processes directly in the model’s latent space, without producing explicit outputs. However, a key challenge remains: how to effectively update reasoning embeddings during post-training to guide the model toward more accurate solutions. To overcome this challenge, we propose a lightweight post-training framework that refines latent reasoning trajectories using two novel strategies: 1) Contrastive reasoning feedback, which compares reasoning embeddings against strong and weak baselines to infer effective update directions via embedding enhancement; 2) Residual embedding refinement, which stabilizes updates by progressively integrating current and historical gradients, enabling fast yet controlled convergence. Extensive experiments and case studies are conducted on five reasoning benchmarks to demonstrate the effectiveness of the proposed framework. Notably, a 5% accuracy gain on MathQA without additional training.

[126] ToxSyn: Reducing Bias in Hate Speech Detection via Synthetic Minority Data in Brazilian Portuguese

Iago Alves Brito, Julia Soares Dollis, Fernanda Bufon Färber, Diogo Fernandes Costa Silva, Arlindo Rodrigues Galvão Filho

Main category: cs.CL

TL;DR: ToxSyn is the first Portuguese large-scale corpus for multi-label hate speech detection across nine protected groups, featuring discourse-type annotations and crucial non-toxic counterexamples that reveal catastrophic generalization failures between social media and minority-specific contexts.

DetailsMotivation: Current hate speech detection systems lack large-scale, fine-grained training data, especially for non-English languages. Existing datasets have coarse labels and critically lack non-toxic counterexamples about minorities, making it impossible to distinguish genuine hate from benign discussion.

Method: Created ToxSyn via a controllable four-stage pipeline that generates Portuguese hate speech data with discourse-type annotations (capturing rhetorical strategies like sarcasm/dehumanization) and systematically includes non-toxic counterexamples about minorities.

Result: Experiments revealed catastrophic mutual generalization failure: models trained on social media fail on minority-specific contexts, and vice-versa. Summary metrics like Macro F1 mask model failures and are unreliable indicators of true performance.

Conclusion: Social media hate speech detection and minority-specific hate speech detection are distinct tasks. ToxSyn is publicly released to advance synthetic data generation and benchmark progress in hate speech detection for low- and mid-resource languages.

Abstract: The development of robust hate speech detection systems remains limited by the lack of large-scale, fine-grained training data, especially for languages beyond English. Existing corpora typically rely on coarse toxic/non-toxic labels, and the few that capture hate directed at specific minority groups critically lack the non-toxic counterexamples (i.e., benign text about minorities) required to distinguish genuine hate from mere discussion. We introduce ToxSyn, the first Portuguese large-scale corpus explicitly designed for multi-label hate speech detection across nine protected minority groups. Generated via a controllable four-stage pipeline, ToxSyn includes discourse-type annotations to capture rhetorical strategies of toxic language, such as sarcasm or dehumanization. Crucially, it systematically includes the non-toxic counterexamples absent in all other public datasets. Our experiments reveal a catastrophic, mutual generalization failure between social-media domains and ToxSyn: models trained on social media struggle to generalize to minority-specific contexts, and vice-versa. This finding indicates they are distinct tasks and exposes summary metrics like Macro F1 can be unreliable indicators of true model behavior, as they completely mask model failure. We publicly release ToxSyn at HuggingFace to foster reproducible research on synthetic data generation and benchmark progress in hate-speech detection for low- and mid-resource languages.

[127] Don’t Pay Attention

Mohammad Hammoud, Devang Acharya

Main category: cs.CL

TL;DR: Avey is a new architecture that replaces attention and recurrence with a ranker and neural processor, enabling efficient processing of arbitrarily long sequences while maintaining competitive performance on standard NLP tasks.

DetailsMotivation: To overcome the Transformer's limitations of fixed context windows and quadratic complexity, and recurrent models' reduced parallelism, by creating a more efficient architecture for long sequences.

Method: Pairs a ranker with an autoregressive neural processor to select and contextualize only the most relevant tokens, decoupling sequence length from context width.

Result: Avey performs comparably to Transformer on standard short-range NLP benchmarks and significantly outperforms it on long-range dependency modeling tasks.

Conclusion: Avey provides an effective alternative to both attention-based and recurrent architectures, enabling efficient processing of arbitrarily long sequences while maintaining strong performance.

Abstract: The Transformer has become the de facto standard for modern language models owing to its parallelizable training and effective autoregressive decoding. However, its fixed context window and the quadratic time and memory costs of its self-attention mechanism remain central bottlenecks. These constraints have revived interest in recurrent architectures that scale linearly with sequence length, but at the cost of reduced parallelism. In this paper, we introduce Avey, a new foundational architecture that breaks away from both attention and recurrence. Avey pairs a ranker with an autoregressive neural processor to select and contextualize only the most relevant tokens for any given token. Specifically, it decouples sequence length from context width, thus enabling effective and efficient processing of arbitrarily long sequences. Results show that Avey compares favorably to the Transformer across a variety of standard short-range NLP benchmarks, while significantly outperforming it on tasks requiring long-range dependency modeling.

[128] RATTENTION: Towards the Minimal Sliding Window Size in Local-Global Attention Models

Bailin Wang, Chang Lan, Chong Wang, Ruoming Pang

Main category: cs.CL

TL;DR: RATTENTION combines local attention with linear attention to capture out-of-window tokens, achieving full-attention performance with smaller 512-window sizes while maintaining training efficiency and improving long-context capabilities.

DetailsMotivation: Address the limitation of local attention models that completely disregard tokens outside their window, enabling efficiency gains in short-context regimes without performance degradation.

Method: Propose RATTENTION - local attention integrated with specialized linear attention mechanism to capture information from out-of-window tokens, implemented with custom Pallas kernels.

Result: RATTENTION with 512 window size consistently matches full-attention performance across diverse settings, achieves superior Pareto tradeoff between performance and efficiency, and shows enhanced long-context performance on RULER benchmark.

Conclusion: RATTENTION shifts the Pareto frontier for local-global attention models, enabling significant efficiency gains without compromising performance, with open-sourced implementation for further research.

Abstract: Local-global attention models have recently emerged as compelling alternatives to standard Transformers, promising improvements in both training and inference efficiency. However, the crucial choice of window size presents a Pareto tradeoff: larger windows maintain performance akin to full attention but offer minimal efficiency gains in short-context scenarios, while smaller windows can lead to performance degradation. Current models, such as Gemma2 and Mistral, adopt conservative window sizes (e.g., 4096 out of an 8192 pretraining length) to preserve performance. This work investigates strategies to shift this Pareto frontier, enabling local-global models to achieve efficiency gains even in short-context regimes. Our core motivation is to address the intrinsic limitation of local attention – its complete disregard for tokens outside the defined window. We explore RATTENTION, a variant of local attention integrated with a specialized linear attention mechanism designed to capture information from these out-of-window tokens. Pretraining experiments at the 3B and 12B scales demonstrate that RATTENTION achieves a superior Pareto tradeoff between performance and efficiency. As a sweet spot, RATTENTION with a window size of just 512 consistently matches the performance of full-attention models across diverse settings. Furthermore, the recurrent nature inherent in the linear attention component of RATTENTION contributes to enhanced long-context performance, as validated on the RULER benchmark. Crucially, these improvements do not compromise training efficiency; thanks to a specialized kernel implementation and the reduced window size, RATTENTION maintains training speeds comparable to existing state-of-the-art approaches. We open-sourced our Pallas kernels along with model codes to facilitate further research effort.

[129] Self-Organizing Language

P. Myles Eugenio, Anthony Beavers

Main category: cs.CL

TL;DR: A novel emergent local memory paradigm that creates continuous-learning parallel content-addressable memory with global order, demonstrating how local learning constraints produce topologically protected memories and emergent symbolic order, serving as a neuro-symbolic bridge.

DetailsMotivation: To understand how local constraints on uncoordinated learning can produce emergent symbolic order and topologically protected memories, and to explore the origin of human language without external data.

Method: Introduces emergent local memory - a continuous-learning completely-parallel content-addressable memory encoding global order through self-organizing dynamics.

Result: The system demonstrates the ability to produce human language without data by exploiting its own self-organizing dynamics, showing that words arise as a side-effect of emergent symbolic order.

Conclusion: This work answers essential questions about the existence and origin of human language data, revealing that human language patterns reflect a universal subregular mechanism of word formation.

Abstract: We introduce a novel paradigm of emergent local memory. It is a continuous-learning completely-parallel content-addressable memory encoding global order. It demonstrates how local constraints on uncoordinated learning can produce topologically protected memories realizing emergent symbolic order. It is therefore a neuro-symbolic bridge. It further has the ability to produce human language without data, by exploiting its own self-organizing dynamics. It teaches us that words arise as a side-effect of emergent symbolic order, and that human language patterns at all structural levels reflect a universal mechanism of word formation (which is subregular). This work answers essential questions about the existence & origin of all the human language data.

[130] The Trilemma of Truth in Large Language Models

Germans Savcisens, Tina Eliassi-Rad

Main category: cs.CL

TL;DR: sAwMIL is a new framework that uses multiple-instance learning and conformal prediction to probe LLM knowledge, revealing flaws in existing methods and showing truth/falsehood are encoded asymmetrically with a third distinct signal.

DetailsMotivation: To address flawed assumptions in existing LLM knowledge probing methods and develop a more reliable way to assess what LLMs actually "know" versus what they encode as probabilistic information.

Method: Introduced sAwMIL (Sparse-Aware Multiple-Instance Learning), a multiclass probing framework combining multiple-instance learning with conformal prediction, using internal LLM activations to classify statements as true, false, or neither.

Result: Evaluation across 16 LLMs showed: (1) common probing methods are unreliable and sometimes worse than zero-shot prompting; (2) truth and falsehood are encoded asymmetrically; (3) LLMs encode a third distinct signal beyond true/false.

Conclusion: Existing probing methods have fundamental flaws, and sAwMIL provides a more reliable framework for understanding how LLMs encode knowledge, revealing complex asymmetric encoding patterns with a third distinct signal type.

Abstract: The public often attributes human-like qualities to large language models (LLMs) and assumes they “know” certain things. In reality, LLMs encode information retained during training as internal probabilistic knowledge. This study examines existing methods for probing the veracity of that knowledge and identifies several flawed underlying assumptions. To address these flaws, we introduce sAwMIL (Sparse-Aware Multiple-Instance Learning), a multiclass probing framework that combines multiple-instance learning with conformal prediction. sAwMIL leverages internal activations of LLMs to classify statements as true, false, or neither. We evaluate sAwMIL across 16 open-source LLMs, including default and chat-based variants, on three new curated datasets. Our results show that (1) common probing methods fail to provide a reliable and transferable veracity direction and, in some settings, perform worse than zero-shot prompting; (2) truth and falsehood are not encoded symmetrically; and (3) LLMs encode a third type of signal that is distinct from both true and false.

[131] RAG-R1: Incentivizing the Search and Reasoning Capabilities of LLMs through Multi-query Parallelism

Zhiwen Tan, Jiaming Huang, Qintong Wu, Hongxuan Zhang, Chenyi Zhuang, Jinjie Gu

Main category: cs.CL

TL;DR: RAG-R1 introduces a two-stage training framework using multi-query parallelism to overcome limitations of single-query RAG methods, improving reasoning robustness and reducing inference latency.

DetailsMotivation: LLMs generate hallucinated or outdated content due to static internal knowledge, and existing RAG+RL methods suffer from prohibitive latency and brittleness in single-query mode.

Method: Two-stage training framework with multi-query parallelism that enables LLMs to adaptively leverage internal and external knowledge during reasoning, transitioning from single-query to parallel processing.

Result: Outperforms strongest baseline by up to 13.7% on seven QA benchmarks and decreases inference time by 11.1%.

Conclusion: Multi-query parallelism in RAG-R1 effectively enhances reasoning robustness while significantly reducing inference latency compared to single-query approaches.

Abstract: Large Language Models (LLMs), despite their remarkable capabilities, are prone to generating hallucinated or outdated content due to their static internal knowledge. While Retrieval-Augmented Generation (RAG) integrated with Reinforcement Learning (RL) offers a solution, these methods are fundamentally constrained by a single-query mode, leading to prohibitive latency and inherent brittleness. To overcome these limitations, we introduce RAG-R1, a novel two-stage training framework centered around multi-query parallelism. Our framework enables LLMs to adaptively leverage internal and external knowledge during the reasoning process while transitioning from the single-query mode to multi-query parallelism. This architectural shift bolsters reasoning robustness while significantly reducing inference latency. Extensive experiments on seven question-answering benchmarks confirm the superiority of our method, which outperforms the strongest baseline by up to 13.7% and decreases inference time by 11.1%.

[132] Exploiting Synergistic Cognitive Biases to Bypass Safety in LLMs

Xikang Yang, Biyu Zhou, Xuehai Tang, Jizhong Han, Songlin Hu

Main category: cs.CL

TL;DR: CognitiveAttack is a novel framework that exploits multi-bias interactions to bypass LLM safety mechanisms, achieving significantly higher attack success rates than existing methods.

DetailsMotivation: To address the overlooked vulnerability of LLM safety mechanisms to adversarial attacks that exploit cognitive biases, particularly the powerful but underexplored multi-bias interactions.

Method: Proposes CognitiveAttack framework using supervised fine-tuning and reinforcement learning to generate prompts embedding optimized combinations of cognitive biases, systematically leveraging both individual and combined biases.

Result: Achieved 60.1% attack success rate vs 31.6% for SOTA black-box method PAP, exposing significant vulnerabilities across 30 diverse LLMs, especially in open-source models.

Conclusion: Multi-bias interactions represent a powerful attack vector, highlighting critical limitations in current defense mechanisms and paving the way for more robust, human-aligned AI systems through interdisciplinary cognitive science-LLM safety integration.

Abstract: Large Language Models (LLMs) demonstrate impressive capabilities across a wide range of tasks, yet their safety mechanisms remain susceptible to adversarial attacks that exploit cognitive biases – systematic deviations from rational judgment. Unlike prior jailbreaking approaches focused on prompt engineering or algorithmic manipulation, this work highlights the overlooked power of multi-bias interactions in undermining LLM safeguards. We propose CognitiveAttack, a novel red-teaming framework that systematically leverages both individual and combined cognitive biases. By integrating supervised fine-tuning and reinforcement learning, CognitiveAttack generates prompts that embed optimized bias combinations, effectively bypassing safety protocols while maintaining high attack success rates. Experimental results reveal significant vulnerabilities across 30 diverse LLMs, particularly in open-source models. CognitiveAttack achieves a substantially higher attack success rate compared to the SOTA black-box method PAP (60.1% vs. 31.6%), exposing critical limitations in current defense mechanisms. These findings highlight multi-bias interactions as a powerful yet underexplored attack vector. This work introduces a novel interdisciplinary perspective by bridging cognitive science and LLM safety, paving the way for more robust and human-aligned AI systems.

[133] Unveiling the Influence of Amplifying Language-Specific Neurons

Inaya Rahmanisa, Lyzander Marciano Andrylie, Mahardika Krisna Ihsani, Alfan Farizki Wicaksono, Haryo Akbarianto Wibowo, Alham Fikri Aji

Main category: cs.CL

TL;DR: Amplifying language-specific neurons in multilingual LLMs can effectively steer outputs toward target languages but generally degrades cross-lingual performance, with limited benefits for cross-language transfer.

DetailsMotivation: To investigate the underexplored role of amplifying language-specific neurons in multilingual LLMs and understand their impact on model behavior across different languages, especially low-resource ones.

Method: Amplified language-specific neurons across 18 languages using three models trained in different languages, evaluated with Language Steering Shift (LSS) score and tested on downstream tasks including commonsense reasoning, knowledge, and translation.

Result: Optimal amplification factors effectively steer outputs toward target languages, improving self-language performance in some cases but generally degrading cross-language results on downstream tasks.

Conclusion: Language-specific neuron amplification benefits low-resource languages but provides limited advantage for cross-lingual transfer, highlighting their role in multilingual behavior.

Abstract: Language-specific neurons in LLMs that strongly correlate with individual languages have been shown to influence model behavior by deactivating them. However, their role in amplification remains underexplored. This work investigates the effect of amplifying language-specific neurons through interventions across 18 languages, including low-resource ones, using three models primarily trained in different languages. We compare amplification factors by their effectiveness in steering to the target language using a proposed Language Steering Shift (LSS) evaluation score, then evaluate it on downstream tasks: commonsense reasoning (XCOPA, XWinograd), knowledge (Include), and translation (FLORES). The optimal amplification factors effectively steer output toward nearly all tested languages. Intervention using this factor on downstream tasks improves self-language performance in some cases but generally degrades cross-language results. These findings highlight the effect of language-specific neurons in multilingual behavior, where amplification can be beneficial especially for low-resource languages, but provides limited advantage for cross-lingual transfer.

Shubham Kumar Nigam, Balaramamahanthi Deepak Patnaik, Shivam Mishra, Ajay Varghese Thomas, Noel Shallum, Kripabandhu Ghosh, Arnab Bhattacharya

Main category: cs.CL

TL;DR: NyayaRAG is a Retrieval-Augmented Generation framework that enhances Legal Judgment Prediction in India by incorporating statutory provisions and judicial precedents alongside case facts, improving both prediction accuracy and explanation quality.

DetailsMotivation: Previous LJP approaches in India focused only on internal case content, ignoring the crucial role of statutory provisions and judicial precedents in common law systems, leading to incomplete legal reasoning.

Method: Proposed NyayaRAG framework that provides models with factual case descriptions, relevant legal statutes, and semantically retrieved prior cases using a domain-specific pipeline for the Indian legal system.

Result: Augmenting factual inputs with structured legal knowledge significantly improves both predictive accuracy and explanation quality across various input configurations, as measured by standard metrics and LLM-based evaluators.

Conclusion: Incorporating statutory provisions and judicial precedents through RAG framework is essential for realistic legal judgment prediction in common law systems like India, leading to more accurate and explainable outcomes.

Abstract: Legal Judgment Prediction (LJP) has emerged as a key area in AI for law, aiming to automate judicial outcome forecasting and enhance interpretability in legal reasoning. While previous approaches in the Indian context have relied on internal case content such as facts, issues, and reasoning, they often overlook a core element of common law systems, which is reliance on statutory provisions and judicial precedents. In this work, we propose NyayaRAG, a Retrieval-Augmented Generation (RAG) framework that simulates realistic courtroom scenarios by providing models with factual case descriptions, relevant legal statutes, and semantically retrieved prior cases. NyayaRAG evaluates the effectiveness of these combined inputs in predicting court decisions and generating legal explanations using a domain-specific pipeline tailored to the Indian legal system. We assess performance across various input configurations using both standard lexical and semantic metrics as well as LLM-based evaluators such as G-Eval. Our results show that augmenting factual inputs with structured legal knowledge significantly improves both predictive accuracy and explanation quality.

[135] Understanding and Mitigating Political Stance Cross-topic Generalization in Large Language Models

Jiayi Zhang, Shu Yang, Junchao Wu, Derek F. Wong, Di Wang

Main category: cs.CL

TL;DR: Fine-tuning LLMs on political topics causes unintended political stance generalization across unrelated topics. The paper identifies political neurons and proposes InhibitFT method to mitigate this issue.

DetailsMotivation: Previous studies identified political stance manipulation through fine-tuning but lacked understanding of internal mechanisms and cross-topic generalization. The paper aims to systematically explore these internal mechanisms and develop mitigation methods.

Method: Proposed Political Neuron Localization through Activation Contrasting (PNLAC) to identify general and topic-specific political neurons. Developed InhibitFT, an inhibition-based fine-tuning method that selectively inhibits neurons to reduce cross-topic generalization.

Result: Identified two types of political neurons across four models and datasets. InhibitFT reduces cross-topic stance generalization by 20% on average while preserving topic-specific performance. Only 5% of neurons need inhibition for effective mitigation.

Conclusion: The study reveals internal mechanisms of political stance generalization in LLMs and provides an effective mitigation method. InhibitFT successfully reduces unintended cross-topic generalization while maintaining model performance on target topics.

Abstract: Fine-tuning Large Language Models on a political topic will significantly manipulate their political stance on various issues and unintentionally affect their stance on unrelated topics. While previous studies have proposed this issue, there is still a lack of understanding regarding the internal representations of these stances and the mechanisms that lead to unintended cross-topic generalization. In this paper, we systematically explore the internal mechanisms underlying this phenomenon from a neuron-level perspective and how to mitigate the cross-topic generalization of political fine-tuning. Firstly, we propose Political Neuron Localization through Activation Contrasting (PNLAC) to identify two distinct types of political neurons: general political neurons, which govern stance across multiple political topics, and topic-specific neurons} that affect the model’s political stance on individual topics. We find the existence of these political neuron types across four models and datasets through activation patching experiments. Leveraging these insights, we introduce InhibitFT, an inhibition-based fine-tuning method, effectively mitigating the cross-topic stance generalization. Experimental results demonstrate the robustness of identified neuron types across various models and datasets, and show that InhibitFT significantly reduces the cross-topic stance generalization by 20% on average, while preserving topic-specific performance. Moreover, we demonstrate that selectively inhibiting only 5% of neurons is sufficient to effectively mitigate the cross-topic stance generalization.

[136] NLP Methods May Actually Be Better Than Professors at Estimating Question Difficulty

Leonidas Zotos, Ivo Pascal de Jong, Matias Valdenegro-Toro, Andreea Ioana Sburlea, Malvina Nissim, Hedderik van Rijn

Main category: cs.CL

TL;DR: LLMs outperform professors in estimating True/False exam question difficulty, with supervised learning using LLM uncertainty achieving best results using only 42 training samples.

DetailsMotivation: Professors are not always good at estimating exam question difficulty, which is essential for developing quality exams.

Method: Compared various LLM-based methods with three professors on True/False questions in Neural Networks and Machine Learning. Used direct prompting with Gemini 2.5 and supervised learning with LLM uncertainties.

Result: Professors had limited ability to distinguish easy vs difficult questions. Gemini 2.5 outperformed professors, and supervised learning with LLM uncertainties achieved even better results using minimal training data.

Conclusion: Supervised learning using LLM uncertainty can help professors better estimate exam question difficulty, improving assessment quality.

Abstract: Estimating the difficulty of exam questions is essential for developing good exams, but professors are not always good at this task. We compare various Large Language Model-based methods with three professors in their ability to estimate what percentage of students will give correct answers on True/False exam questions in the areas of Neural Networks and Machine Learning. Our results show that the professors have limited ability to distinguish between easy and difficult questions and that they are outperformed by directly asking Gemini 2.5 to solve this task. Yet, we obtained even better results using uncertainties of the LLMs solving the questions in a supervised learning setting, using only 42 training samples. We conclude that supervised learning using LLM uncertainty can help professors better estimate the difficulty of exam questions, improving the quality of assessment.

[137] Efficient Reasoning for Large Reasoning Language Models via Certainty-Guided Reflection Suppression

Jiameng Huang, Baijiong Lin, Guhao Feng, Jierun Chen, Di He, Lu Hou

Main category: cs.CL

TL;DR: CGRS is a method that reduces overthinking in Large Reasoning Language Models by suppressing reflection triggers when the model is confident, cutting token usage by 18.5-41.9% while maintaining accuracy.

DetailsMotivation: LRLMs use reflection behaviors with trigger words like "Wait" and "Alternatively" to improve reasoning, but this causes overthinking - generating redundant reasoning steps that increase token usage, raise costs, and reduce practical utility.

Method: Certainty-Guided Reflection Suppression (CGRS) dynamically suppresses reflection trigger generation when the model shows high confidence in its current response, preventing redundant reflection cycles without model retraining or architectural changes.

Result: CGRS reduces token usage by 18.5% to 41.9% across four reasoning benchmarks while preserving accuracy, achieving optimal balance between length reduction and performance across various model architectures and scales (4B to 32B parameters).

Conclusion: CGRS is an effective, model-agnostic method that mitigates overthinking in LRLMs while maintaining reasoning quality, offering practical value for efficient reasoning without requiring retraining or architectural modifications.

Abstract: Recent Large Reasoning Language Models (LRLMs) employ long chain-of-thought reasoning with complex reflection behaviors, typically signaled by specific trigger words (e.g., “Wait” and “Alternatively”) to enhance performance. However, these reflection behaviors can lead to the overthinking problem where the generation of redundant reasoning steps that unnecessarily increase token usage, raise inference costs, and reduce practical utility. In this paper, we propose Certainty-Guided Reflection Suppression (CGRS), a novel method that mitigates overthinking in LRLMs while maintaining reasoning accuracy. CGRS operates by dynamically suppressing the model’s generation of reflection triggers when it exhibits high confidence in its current response, thereby preventing redundant reflection cycles without compromising output quality. Our approach is model-agnostic, requires no retraining or architectural modifications, and can be integrated seamlessly with existing autoregressive generation pipelines. Extensive experiments across four reasoning benchmarks (i.e., AIME24, AMC23, MATH500, and GPQA-D) demonstrate CGRS’s effectiveness: it reduces token usage by an average of 18.5% to 41.9% while preserving accuracy. It also achieves the optimal balance between length reduction and performance compared to state-of-the-art baselines. These results hold consistently across model architectures (e.g., DeepSeek-R1-Distill series, QwQ-32B, and Qwen3 family) and scales (4B to 32B parameters), highlighting CGRS’s practical value for efficient reasoning.

[138] You Don’t Need Pre-built Graphs for RAG: Retrieval Augmented Generation with Adaptive Reasoning Structures

Shengyuan Chen, Chuang Zhou, Zheng Yuan, Qinggang Zhang, Zeyang Cui, Hao Chen, Yilin Xiao, Jiannong Cao, Xiao Huang

Main category: cs.CL

TL;DR: LogicRAG is a framework that dynamically constructs logical reasoning structures at inference time to guide adaptive retrieval, eliminating the need for pre-built graphs and reducing token costs while improving performance.

DetailsMotivation: Existing GraphRAG methods require costly pre-built graphs that may not align with query-specific reasoning structures, leading to ineffective knowledge retrieval and high token costs.

Method: Decomposes queries into subproblems, constructs dependency DAGs, linearizes via topological sort, and applies graph/context pruning to reduce redundant retrieval and token costs.

Result: Extensive experiments show LogicRAG achieves superior performance and efficiency compared to state-of-the-art baselines.

Conclusion: LogicRAG provides an effective alternative to pre-built graph methods by dynamically extracting reasoning structures, enabling adaptive retrieval with reduced costs and improved accuracy.

Abstract: Large language models (LLMs) often suffer from hallucination, generating factually incorrect statements when handling questions beyond their knowledge and perception. Retrieval-augmented generation (RAG) addresses this by retrieving query-relevant contexts from knowledge bases to support LLM reasoning. Recent advances leverage pre-constructed graphs to capture the relational connections among distributed documents, showing remarkable performance in complex tasks. However, existing Graph-based RAG (GraphRAG) methods rely on a costly process to transform the corpus into a graph, introducing overwhelming token cost and update latency. Moreover, real-world queries vary in type and complexity, requiring different logic structures for accurate reasoning. The pre-built graph may not align with these required structures, resulting in ineffective knowledge retrieval. To this end, we propose a $\textbf{Logic}$-aware $\textbf{R}etrieval$-$\textbf{A}$ugmented $\textbf{G}$eneration framework ($\textbf{LogicRAG}$) that dynamically extracts reasoning structures at inference time to guide adaptive retrieval without any pre-built graph. LogicRAG begins by decomposing the input query into a set of subproblems and constructing a directed acyclic graph (DAG) to model the logical dependencies among them. To support coherent multi-step reasoning, LogicRAG then linearizes the graph using topological sort, so that subproblems can be addressed in a logically consistent order. Besides, LogicRAG applies graph pruning to reduce redundant retrieval and uses context pruning to filter irrelevant context, significantly reducing the overall token cost. Extensive experiments demonstrate that LogicRAG achieves both superior performance and efficiency compared to state-of-the-art baselines.

[139] SceneJailEval: A Scenario-Adaptive Multi-Dimensional Framework for Jailbreak Evaluation

Lai Jiang, Yuekang Li, Xiaohan Zhang, Youtao Ding, Li Pan

Main category: cs.CL

TL;DR: SceneJailEval is a scenario-adaptive multi-dimensional framework for jailbreak evaluation that overcomes the limitations of existing binary classification and unified multi-dimensional methods, achieving state-of-the-art performance.

DetailsMotivation: Current jailbreak evaluation methods have limitations: binary classification lacks harm severity quantification, while unified multi-dimensional frameworks suffer from scenario-specific mismatches (e.g., applying irrelevant dimensions like 'Relative Truthfulness' to hate speech scenarios).

Method: Proposed SceneJailEval with scenario-adaptive multi-dimensional framework, a novel 14-scenario dataset with rich jailbreak variants and regional cases, and robust extensibility for customized or emerging scenarios.

Result: Achieved F1 score of 0.917 on full-scenario dataset (+6% over SOTA) and 0.995 on JBB benchmark (+3% over SOTA), breaking through accuracy bottleneck in heterogeneous scenarios.

Conclusion: SceneJailEval successfully addresses the critical ‘one-size-fits-all’ limitation of existing methods and establishes superiority in jailbreak evaluation across diverse scenarios.

Abstract: Accurate jailbreak evaluation is critical for LLM red team testing and jailbreak research. Mainstream methods rely on binary classification (string matching, toxic text classifiers, and LLM-based methods), outputting only “yes/no” labels without quantifying harm severity. Emerged multi-dimensional frameworks (e.g., Security Violation, Relative Truthfulness and Informativeness) use unified evaluation standards across scenarios, leading to scenario-specific mismatches (e.g., “Relative Truthfulness” is irrelevant to “hate speech”), undermining evaluation accuracy. To address these, we propose SceneJailEval, with key contributions: (1) A pioneering scenario-adaptive multi-dimensional framework for jailbreak evaluation, overcoming the critical “one-size-fits-all” limitation of existing multi-dimensional methods, and boasting robust extensibility to seamlessly adapt to customized or emerging scenarios. (2) A novel 14-scenario dataset featuring rich jailbreak variants and regional cases, addressing the long-standing gap in high-quality, comprehensive benchmarks for scenario-adaptive evaluation. (3) SceneJailEval delivers state-of-the-art performance with an F1 score of 0.917 on our full-scenario dataset (+6% over SOTA) and 0.995 on JBB (+3% over SOTA), breaking through the accuracy bottleneck of existing evaluation methods in heterogeneous scenarios and solidifying its superiority.

[140] ReviewGraph: A Knowledge Graph Embedding Based Framework for Review Rating Prediction with Sentiment Features

A. J. W. de Vink, Natalia Amat-Lefort, Lifeng Han

Main category: cs.CL

TL;DR: ReviewGraph is a framework that converts customer reviews into knowledge graphs with sentiment scores to predict review ratings, achieving performance comparable to LLMs with better interpretability and lower computational cost.

DetailsMotivation: Understanding factors driving customer review ratings is critical in hospitality industry for improving guest satisfaction and business performance.

Method: Transform textual reviews into knowledge graphs using (subject, predicate, object) triples with sentiment scores, then use graph embeddings (Node2Vec) and sentiment features with machine learning classifiers for rating prediction.

Result: Performs similar to state-of-the-art models but with lower computational cost, achieves comparable performance to LLMs, outperforms traditional NLP baselines on agreement-based metrics like Cohen’s Kappa.

Conclusion: Graph-based representations offer advantages in interpretability, visual exploration, and RAG integration potential, laying groundwork for future research with advanced graph neural networks and fine-tuned LLM extraction methods.

Abstract: In the hospitality industry, understanding the factors that drive customer review ratings is critical for improving guest satisfaction and business performance. This work proposes ReviewGraph for Review Rating Prediction (RRP), a novel framework that transforms textual customer reviews into knowledge graphs by extracting (subject, predicate, object) triples and associating sentiment scores. Using graph embeddings (Node2Vec) and sentiment features, the framework predicts review rating scores through machine learning classifiers. We compare ReviewGraph performance with traditional NLP baselines (such as Bag of Words, TF-IDF, and Word2Vec) and large language models (LLMs), evaluating them in the HotelRec dataset. In comparison to the state of the art literature, our proposed model performs similar to their best performing model but with lower computational cost (without ensemble). While ReviewGraph achieves comparable predictive performance to LLMs and outperforms baselines on agreement-based metrics such as Cohen’s Kappa, it offers additional advantages in interpretability, visual exploration, and potential integration into Retrieval-Augmented Generation (RAG) systems. This work highlights the potential of graph-based representations for enhancing review analytics and lays the groundwork for future research integrating advanced graph neural networks and fine-tuned LLM-based extraction methods. We will share ReviewGraph output and platform open-sourced on our GitHub page https://github.com/aaronlifenghan/ReviewGraph

[141] Unintended Misalignment from Agentic Fine-Tuning: Risks and Mitigation

Dongyoon Hahm, Taywon Min, Woogyeol Jin, Kimin Lee

Main category: cs.CL

TL;DR: Fine-tuning LLMs for agentic tasks can unintentionally make them misaligned and more likely to execute harmful requests. PING addresses this by prepending natural language prefixes to guide refusal of harmful tasks while maintaining performance on benign ones.

DetailsMotivation: Safety concerns are often overlooked when fine-tuning LLMs for agentic capabilities, leading to unintentional misalignment where models become more likely to execute harmful tasks and less likely to refuse them.

Method: Proposed Prefix INjection Guard (PING) - an iterative approach that alternates between generating candidate natural language prefixes and selecting those that optimize both task performance and refusal behavior for harmful requests.

Result: PING significantly enhances safety of fine-tuned LLM agents without sacrificing effectiveness, outperforming existing prompting approaches across web navigation and code generation benchmarks. Analysis shows prefix tokens are crucial for behavior modification.

Conclusion: PING provides an effective method to maintain safety alignment in agentic LLMs during fine-tuning, using natural language prefixes to guide refusal behavior while preserving task performance.

Abstract: Beyond simple text generation, Large Language Models (LLMs) have evolved into agentic systems capable of planning and interacting with external tools to solve complex tasks. This evolution involves fine-tuning LLMs on agent-specific tasks to enhance their proficiency. However, safety concerns are frequently overlooked during this fine-tuning process. In this work, we show that aligned LLMs can become unintentionally misaligned, leading to a higher likelihood of executing harmful tasks and a reduced tendency to refuse them when fine-tuned to execute agentic tasks. To address these safety challenges, we propose Prefix INjection Guard (PING), a simple yet effective method that prepends automatically generated natural language prefixes to agent responses, guiding them to refuse harmful requests while preserving performance on benign tasks. Specifically, we introduce an iterative approach that alternates between (1) generating candidate prefixes and (2) selecting those that optimize both task performance and refusal behavior. Experimental results demonstrate that PING significantly enhances the safety of fine-tuned LLM agents without sacrificing their effectiveness. PING consistently outperforms existing prompting approaches across diverse benchmarks in both web navigation and code generation tasks. Our analysis of internal hidden states via linear probes reveals that prefix tokens are crucial for behavior modification, explaining the performance gains. WARNING: This paper contains contents that are unethical or offensive in nature.

[142] Better Language Model-Based Judging Reward Modeling through Scaling Comprehension Boundaries

Meiling Ning, Zhongbao Zhang, Junda Ye, Jiabao Guo, Qingyuan Guan

Main category: cs.CL

TL;DR: The paper proposes ESFP-RM, a two-stage reward model that reframes reward modeling as natural language inference and uses masked language models with contextual explanations to provide more stable and generalizable reward signals for reinforcement learning.

DetailsMotivation: To advance LM-based judging reward modeling by recognizing its formal consistency with natural language inference and scaling model comprehension boundaries for superior reward models.

Method: Proposes ESFP-RM: a two-stage reward model using explanation-based slot framework with masked language models, leveraging contextual explanations for better prediction.

Result: ESFP-RM delivers more stable and generalizable reward signals compared to generative reward models in both RLHF and out-of-distribution scenarios.

Conclusion: Reframing reward modeling as NLI and using MLMs with contextual explanations provides a promising path for building superior, more generalizable reward models.

Abstract: The emergence of LM-based judging reward modeling, represented by generative reward models, has successfully made reinforcement learning from AI feedback (RLAIF) efficient and scalable. To further advance this paradigm, we propose a core insight: this form of reward modeling shares fundamental formal consistency with natural language inference (NLI), a core task in natural language understanding. This reframed perspective points to a key path for building superior reward models: scaling the model’s comprehension boundaries. Pursuing this path, exploratory experiments on NLI tasks demonstrate that the slot prediction masked language models (MLMs) incorporating contextual explanations achieve significantly better performance compared to mainstream autoregressive models. Based on this key finding, we propose ESFP-RM, a two-stage LM-based judging reward model that utilizes an explanation based slot framework for prediction to fully leverage the advantages of MLMs. Extensive experiments demonstrate that in both reinforcement learning from human feedback (RLHF) and out-of-distribution (OOD) scenarios, the ESFP-RM framework delivers more stable and generalizable reward signals compared to generative reward models.

[143] RPRO: Ranked Preference Reinforcement Optimization for Enhancing Medical QA and Diagnostic Reasoning

Chia-Hsuan Hsu, Jun-En Ding, Hsin-Ling Hsu, Chih-Ho Hsu, Li-Hung Yao, Chun-Chieh Liao, Feng Liu, Fang-Ming Hung

Main category: cs.CL

TL;DR: RPRO is a reinforcement learning framework that enhances clinical reasoning in LLMs by combining preference optimization with quality-driven refinement, outperforming larger models on medical QA tasks.

DetailsMotivation: Existing LLMs generate reasoning chains that lack factual accuracy and clinical reliability in medical question answering, requiring improved methods for clinical chain-of-thought performance.

Method: Ranked Preference Reinforcement Optimization (RPRO) combines reinforcement learning with preference-driven reasoning refinement, using task-adaptive reasoning templates, probabilistic evaluation, groupwise ranking optimization based on Bradley-Terry model, and KL-divergence regularization.

Result: Experiments on PubMedQA, MedQA-USMLE, and FEMH clinical dataset show consistent improvements over baselines, with 2B-parameter model outperforming larger 7B-20B models including medical-specialized variants.

Conclusion: Combining preference optimization with quality-driven refinement provides a scalable and clinically grounded approach to building more reliable medical LLMs.

Abstract: Medical question answering requires advanced reasoning that integrates domain knowledge with logical inference. However, existing large language models (LLMs) often generate reasoning chains that lack factual accuracy and clinical reliability. We propose Ranked Preference Reinforcement Optimization (RPRO), a novel framework that combines reinforcement learning with preference-driven reasoning refinement to enhance clinical chain-of-thought (CoT) performance. RPRO distinguishes itself from prior approaches by employing task-adaptive reasoning templates and a probabilistic evaluation mechanism that aligns model outputs with established clinical workflows, while automatically identifying and correcting low-quality reasoning chains. Unlike traditional pairwise preference methods, RPRO introduces a groupwise ranking optimization based on the Bradley–Terry model and incorporates KL-divergence regularization for stable training. Experiments on PubMedQA, MedQA-USMLE, and a real-world clinical dataset from Far Eastern Memorial Hospital (FEMH) demonstrate consistent improvements over strong baselines. Remarkably, our 2B-parameter model outperforms much larger 7B–20B models, including medical-specialized variants. These findings demonstrate that combining preference optimization with quality-driven refinement provides a scalable and clinically grounded approach to building more reliable medical LLMs.

[144] GRAM-R$^2$: Self-Training Generative Foundation Reward Models for Reward Reasoning

Chenglong Wang, Yongyu Mu, Hang Zhou, Yifu Huo, Ziming Zhu, Jiali Zeng, Murun Yang, Bei Li, Xiaoyang Hao, Chunliang Zhang, Fandong Meng, Jingbo Zhu, Tong Xiao

Main category: cs.CL

TL;DR: GRAM-R² is a generative reward model that produces both preference labels and reward rationales through self-training on unlabeled data, serving as a foundation model for reward reasoning across various tasks.

DetailsMotivation: Current reward models heavily rely on large-scale labeled preference data, and existing pre-training approaches fail to instill explicit reasoning capabilities into reward models.

Method: Proposed a self-training approach that leverages unlabeled data to elicit reward reasoning, developing GRAM-R² - a generative reward model trained to produce preference labels and accompanying reward rationales.

Result: GRAM-R² consistently outperforms strong discriminative and generative baselines in experiments on response ranking, task adaptation, and reinforcement learning from human feedback.

Conclusion: GRAM-R² serves as an effective foundation model for reward reasoning that can be applied to various tasks with minimal fine-tuning, supporting downstream applications like response ranking and task-specific reward tuning.

Abstract: Significant progress in reward modeling over recent years has been driven by a paradigm shift from task-specific designs towards generalist reward models. Despite this trend, developing effective reward models remains a fundamental challenge: the heavy reliance on large-scale labeled preference data. Pre-training on abundant unlabeled data offers a promising direction, but existing approaches fall short of instilling explicit reasoning into reward models. To bridge this gap, we propose a self-training approach that leverages unlabeled data to elicit reward reasoning in reward models. Based on this approach, we develop GRAM-R$^2$, a generative reward model trained to produce not only preference labels but also accompanying reward rationales. GRAM-R$^2$ can serve as a foundation model for reward reasoning and can be applied to a wide range of tasks with minimal or no additional fine-tuning. It can support downstream applications such as response ranking and task-specific reward tuning. Experiments on response ranking, task adaptation, and reinforcement learning from human feedback demonstrate that GRAM-R$^2$ consistently delivers strong performance, outperforming several strong discriminative and generative baselines.

[145] MedFact: Benchmarking the Fact-Checking Capabilities of Large Language Models on Chinese Medical Texts

Jiayi He, Yangmin Huang, Qianyun Du, Xiangying Zhou, Zhiyang He, Jiaxue Hu, Xiaodong Tao, Lixian Lai

Main category: cs.CL

TL;DR: MedFact is a challenging Chinese medical fact-checking benchmark with 2,116 expert-annotated instances across diverse medical domains, designed to evaluate LLMs’ fact-checking capabilities in medical contexts.

DetailsMotivation: Deploying LLMs in medical applications requires robust fact-checking capabilities to ensure patient safety and regulatory compliance, but existing benchmarks lack the complexity and quality needed for medical fact-checking evaluation.

Method: Used a hybrid AI-human framework with iterative expert feedback to refine AI-driven, multi-criteria filtering for constructing high-quality benchmark data spanning 13 specialties, 8 error types, 4 writing styles, and 5 difficulty levels.

Result: Evaluation of 20 leading LLMs shows they can often determine if text contains errors but struggle with precise error localization, with top performers falling short of human performance. Models exhibit ‘over-criticism’ - misidentifying correct information as erroneous.

Conclusion: MedFact highlights significant challenges in deploying medical LLMs and provides essential resources to develop factually reliable medical AI systems, revealing that advanced reasoning techniques can exacerbate fact-checking errors.

Abstract: Deploying Large Language Models (LLMs) in medical applications requires fact-checking capabilities to ensure patient safety and regulatory compliance. We introduce MedFact, a challenging Chinese medical fact-checking benchmark with 2,116 expert-annotated instances from diverse real-world texts, spanning 13 specialties, 8 error types, 4 writing styles, and 5 difficulty levels. Construction uses a hybrid AI-human framework where iterative expert feedback refines AI-driven, multi-criteria filtering to ensure high quality and difficulty. We evaluate 20 leading LLMs on veracity classification and error localization, and results show models often determine if text contains errors but struggle to localize them precisely, with top performers falling short of human performance. Our analysis reveals the “over-criticism” phenomenon, a tendency for models to misidentify correct information as erroneous, which can be exacerbated by advanced reasoning techniques such as multi-agent collaboration and inference-time scaling. MedFact highlights the challenges of deploying medical LLMs and provides resources to develop factually reliable medical AI systems.

[146] Diagnose, Localize, Align: A Full-Stack Framework for Reliable LLM Multi-Agent Systems under Instruction Conflicts

Guancheng Wan, Leixin Sun, Longxu Dou, Zitong Shi, Fang Wu, Eric Hanchen Jiang, Wenke Huang, Guibin Zhang, Hejia Geng, Xiangru Tang, Zhenfei Yin, Yizhou Sun, Wei Wang

Main category: cs.CL

TL;DR: A framework to diagnose, localize, and align LLM-powered multi-agent systems to improve instruction hierarchy compliance under conflicts.

DetailsMotivation: LLM-powered multi-agent systems suffer from hierarchical compliance failures under instruction conflicts, where agents misprioritize system rules when faced with competing demands. Current metrics like pass@k fail to capture these micro-level violations.

Method: Three-stage framework: (1) CRAS metric to diagnose role adherence across four dimensions, (2) attention drift analysis to localize conflict resolution in middle layers, (3) SAIL surgical alignment using LoRA on focal layers with token-weighted DPO optimization.

Result: Improves instruction hierarchy compliance by +5.60% on AutoGen with MedQA benchmark without full-model finetuning.

Conclusion: Surgical alignment on localized layers effectively addresses hierarchical compliance failures in multi-agent systems while being more efficient than full-model finetuning.

Abstract: Large Language Model (LLM)-powered multi-agent systems (MAS) have rapidly advanced collaborative reasoning, tool use, and role-specialized coordination in complex tasks. However, reliability-critical deployment remains hindered by a systemic failure mode: hierarchical compliance under instruction conflicts (system-user, peer-peer), where agents misprioritize system-level rules in the presence of competing demands. Moreover, widely used macro-level metrics (e.g., pass@k) obscure these micro-level violations and offer little actionable guidance for remedy. In this work, we present a full-stack, three-stage framework: (1) Diagnose - Contextualized Role Adherence Score (CRAS), a query-wise, context-aware scoring metric that decomposes role adherence into four measurable dimensions; (2) Localize - attention drift analysis revealing that instruction conflicts are resolved by attention heads that are largely concentrated in middle layers; (3) Align - Surgical Alignment of Instruction Layers (SAIL), which installs LoRA only on the localized focal layers and optimizes a token-weighted DPO-style preference objective that credits tokens by their focal attentional contribution. Across standard benchmarks and MAS frameworks, our surgical approach improves instruction hierarchy compliance (e.g., +5.60% with AutoGen on MedQA) without full-model finetuning.

[147] Beyond Magic Words: Sharpness-Aware Prompt Evolving for Robust Large Language Models with TARE

Guancheng Wan, Lucheng Fu, Haoxin Liu, Yiqiao Jin, Hui Yi Leong, Eric Hanchen Jiang, Hejia Geng, Jinhe Bi, Yunpu Ma, Xiangru Tang, B. Aditya Prakash, Yizhou Sun, Wei Wang

Main category: cs.CL

TL;DR: TARE is a derivative-free prompt optimization framework that minimizes textual sharpness by alternating between adversarial paraphrase search and robust selection, creating prompts that maintain accuracy under semantic variations.

DetailsMotivation: Current prompt optimization methods focus only on point-wise accuracy and ignore paraphrase invariance, making automated prompt search brittle to small semantic-preserving changes that cause large performance swings.

Method: TARE alternates between inner adversarial search with hard paraphrases and outer robust selection. ATARE extends this with anisotropic weights to shape semantic neighborhoods and adaptive radius balancing exploration and fidelity.

Result: The methods outperform accuracy-only prompt search by creating prompts that preserve accuracy under paraphrasing while remaining computationally practical across diverse tasks.

Conclusion: Addressing textual sharpness through semantic neighborhood robustness leads to more stable and reliable prompt optimization that maintains performance under semantic variations.

Abstract: The performance of Large Language Models (LLMs) hinges on carefully engineered prompts. However, prevailing prompt optimization methods, ranging from heuristic edits and reinforcement learning to evolutionary search, primarily target point-wise accuracy. They seldom enforce paraphrase invariance or searching stability, and therefore cannot remedy this brittleness in practice. Automated prompt search remains brittle: small, semantically preserving paraphrases often cause large performance swings. We identify this brittleness as the textual sharpness of the prompt landscape. In this work, we provide the first formal treatment of textual sharpness in the discrete, semantic space of prompts, together with an operational robustness criterion over a semantic neighborhood; the design is black-box or API-only, requiring no gradients to update the model’s parameters. Then we introduce TARE (Textual Sharpness-Aware Evolving), a derivative-free framework that alternates between an inner, sampling-based adversarial search that stresses a prompt with hard paraphrases and an outer, robust selection that prefers candidates whose neighborhoods remain strong. We further propose ATARE, which learns anisotropic weights to shape the semantic neighborhood and adapts its radius over time to balance exploration and fidelity. Diverse tasks evaluate our methods, whose design for minimizing textual sharpness gap leads to prompts that preserve accuracy under paraphrasing, outperforming accuracy-only prompt search while remaining computationally practical.

[148] PET: Preference Evolution Tracking with LLM-Generated Explainable Distribution

Luyang Zhang, Jialu Wang, Shichao Zhu, Siyuan Peng, Beibei Li, Zhongcun Wang, Guangmou Pan, Yan Li, Yang Song

Main category: cs.CL

TL;DR: PET framework tracks evolving user preferences by modeling them as dynamic probability distributions over interpretable preference clusters, outperforming direct generation methods in ranking quality and long-tail content recommendation.

DetailsMotivation: Direct LLM generation for user preference prediction limits personalization, obscures decision-making, and exacerbates popularity bias, requiring a more transparent and holistic approach to preference modeling.

Method: PET reframes preference prediction as inferring dynamic probability distributions over stable preference clusters using logit-probing and generative classification techniques, enabling transparent preference learning.

Result: PET improves ranking quality by up to 40% in NDCG on public benchmarks and outperforms state-of-the-art production models by 7 times in NDCG for long-tail content on real-world short-video platform data.

Conclusion: PET transforms user profiling from direct preference list generation to transparent distributional preference mapping, enabling more explainable, fair, and diverse personalization systems.

Abstract: Understanding how user preference evolves over time is a fundamental challenge central to modern digital ecosystems, for which Large Language Models (LLMs) are an increasingly prominent and popular approach due to their ability to comprehend the rich semantic context within behavioral data. A common practice is to use LLMs to predict a user’s next action by directly generating a ranked list of preferred items. Although effective for short-term prediction, the end-to-end generation paradigm inherently limits personalization. Its opaque decision-making process obscures holistic user profiling and exacerbates popularity bias. To address these limitations, we propose Preference Evolution Tracking (PET), a framework that reframes the task as inferring a dynamic probability distribution over a stable and interpretable lattice of preference clusters. By applying logit-probing and generative classification techniques, PET infers a user’s preference as a probability distribution, enabling transparent preference learning. On public benchmarks (Yelp, MovieLens), PET improves ranking quality by up to 40% in NDCG over direct generation baselines. On a large-scale, real-world dataset from a short-video platform, it excels at ranking long-tail contents, significantly outperforming a SOTA production model by 7 times in the NDCG score. Ultimately, PET transforms the user profile model from direct preference list generation to a transparent distributional preference mapping, paving the way for more explainable, fair, and diverse personalization systems.

[149] Exposing the Cracks: Vulnerabilities of Retrieval-Augmented LLM-based Machine Translation

Yanming Sun, Runzhe Zhan, Chi Seng Cheang, Han Wu, Xuebo Liu, Yuyao Niu, Fengying Ye, Kaixin Lan, Lidia S. Chao, Derek F. Wong

Main category: cs.CL

TL;DR: REAL-MT systems show promise for idiomatic translation but degrade severely under noisy retrieval, especially for low-resource languages. Large reasoning models are even more susceptible to noise and rationalize incorrect contexts.

DetailsMotivation: To address the gap in understanding REAL-MT reliability under noisy retrieval contexts, which is common in real-world deployment but poorly studied.

Method: Proposed noise synthesis framework and new metrics to evaluate REAL-MT robustness systematically. Instantiated REAL-MT with Qwen-series models (standard LLMs and large reasoning models) and evaluated on idiomatic translation across language pairs under synthesized noise.

Result: Low-resource language pairs degrade more severely under noise than high-resource ones, often producing nonsensical translations. LRMs show no improvement in error correction and are more susceptible to noise, rationalizing incorrect contexts. Attention shifts away from source idiom to noisy content while confidence increases despite declining accuracy.

Conclusion: Current approaches have limitations, revealing a fundamental trade-off between robustness and clean-context performance. Highlights need for self-verifying integration mechanisms to improve REAL-MT reliability.

Abstract: \textbf{RE}trieval-\textbf{A}ugmented \textbf{L}LM-based \textbf{M}achine \textbf{T}ranslation (REAL-MT) shows promise for knowledge-intensive tasks like idiomatic translation, but its reliability under noisy retrieval contexts remains poorly understood despite this being a common challenge in real-world deployment. To address this gap, we propose a noise synthesis framework and new metrics to evaluate the robustness of REAL-MT systematically. Using this framework, we instantiate REAL-MT with Qwen-series models, including standard LLMs and large reasoning models (LRMs) with enhanced reasoning, and evaluate their performance on idiomatic translation across high-, medium-, and low-resource language pairs under synthesized noise. Our results show that low-resource language pairs, which rely more heavily on retrieved context, degrade more severely under noise than high-resource ones and often produce nonsensical translations. Although LRMs possess enhanced reasoning capabilities, they show no improvement in error correction and are even more susceptible to noise, tending to rationalize incorrect contexts. We find that this stems from an attention shift away from the source idiom to noisy content, while confidence increases despite declining accuracy, indicating poor calibration. To mitigate these issues, we investigate training-free and fine-tuning strategies, which improve robustness at the cost of performance in clean contexts, revealing a fundamental trade-off. Our findings highlight the limitations of current approaches, underscoring the need for self-verifying integration mechanisms.

[150] Read Between the Lines: A Benchmark for Uncovering Political Bias in Bangla News Articles

Nusrat Jahan Lia, Shubhashis Roy Dipta, Abdullah Khan Zehady, Naymul Islam, Madhusodan Chakraborty, Abdullah Al Wasif

Main category: cs.CL

TL;DR: First benchmark dataset for Bangla political bias detection with 200 news articles labeled for government-leaning, government-critique, and neutral stances, plus diagnostic analysis of 28 LLMs.

DetailsMotivation: Addressing the scarcity of annotated datasets and computational studies for Bangla political bias research, which requires understanding complex linguistic, cultural, and socio-political factors.

Method: Created a dataset of 200 politically significant Bangla news articles labeled for three stances, and conducted comprehensive evaluation of 28 proprietary and open-source LLMs using diagnostic analyses.

Result: LLMs showed strong performance in detecting government-critique content (F1 up to 0.83) but struggled significantly with neutral articles (F1 as low as 0.00), with tendency to over-predict government-leaning stances.

Conclusion: The dataset provides foundation for advancing Bangla stance detection research and offers insights for improving LLM performance in low-resource languages, highlighting challenges in detecting neutral content.

Abstract: Detecting media bias is crucial, specifically in the South Asian region. Despite this, annotated datasets and computational studies for Bangla political bias research remain scarce. Crucially because, political stance detection in Bangla news requires understanding of linguistic cues, cultural context, subtle biases, rhetorical strategies, code-switching, implicit sentiment, and socio-political background. To address this, we introduce the first benchmark dataset of 200 politically significant and highly debated Bangla news articles, labeled for government-leaning, government-critique, and neutral stances, alongside diagnostic analyses for evaluating large language models (LLMs). Our comprehensive evaluation of 28 proprietary and open-source LLMs shows strong performance in detecting government-critique content (F1 up to 0.83) but substantial difficulty with neutral articles (F1 as low as 0.00). Models also tend to over-predict government-leaning stances, often misinterpreting ambiguous narratives. This dataset and its associated diagnostics provide a foundation for advancing stance detection in Bangla media research and offer insights for improving LLM performance in low-resource languages.

[151] A Human Behavioral Baseline for Collective Governance in Software Projects

Mobina Noori, Mahasweta Chakraborti, Amy X Zhang, Seth Frey

Main category: cs.CL

TL;DR: Analysis of how open source communities evolve governance documents over time, showing expansion and balancing of participation categories without major shifts in prescriptive control.

DetailsMotivation: To understand how open source communities describe participation and control through governance documents and track their evolution over time.

Method: Analyzed 710 projects with paired snapshots, parsing governance text into actors, rules, actions, and objects, then measuring change using entropy (evenness), richness (diversity), and Jensen Shannon divergence (drift).

Result: Projects define more roles and actions over time with more even distribution, while rule composition remains stable. Governance grows by expanding and balancing participation categories without major shifts in prescriptive force.

Conclusion: The study provides a reproducible baseline for evaluating whether future AI-mediated workflows concentrate or redistribute authority in open source governance.

Abstract: We study how open source communities describe participation and control through version controlled governance documents. Using a corpus of 710 projects with paired snapshots, we parse text into actors, rules, actions, and objects, then group them and measure change with entropy for evenness, richness for diversity, and Jensen Shannon divergence for drift. Projects define more roles and more actions over time, and these are distributed more evenly, while the composition of rules remains stable. These findings indicate that governance grows by expanding and balancing categories of participation without major shifts in prescriptive force. The analysis provides a reproducible baseline for evaluating whether future AI mediated workflows concentrate or redistribute authority.

[152] Building a Macedonian Recipe Dataset: Collection, Parsing, and Comparative Analysis

Darko Sasanski, Dimitar Peshevski, Riste Stojanov, Dimitar Trajanov

Main category: cs.CL

TL;DR: First systematic construction of a Macedonian recipe dataset through web scraping and structured parsing, with analysis of distinctive ingredient patterns in Macedonian cuisine.

DetailsMotivation: Macedonian recipes are under-represented in computational gastronomy research despite the need for diverse datasets capturing regional culinary traditions.

Method: Web scraping and structured parsing of Macedonian recipes, with normalization of heterogeneous ingredient descriptions (units, quantities, descriptors), followed by exploratory analysis using Pointwise Mutual Information and Lift score to identify ingredient co-occurrence patterns.

Result: Created a new Macedonian recipe dataset and identified distinctive ingredient combinations that characterize Macedonian cuisine through frequency and co-occurrence analysis.

Conclusion: The dataset provides a valuable resource for studying food culture in underrepresented languages and reveals unique patterns of Macedonian culinary tradition.

Abstract: Computational gastronomy increasingly relies on diverse, high-quality recipe datasets to capture regional culinary traditions. Although there are large-scale collections for major languages, Macedonian recipes remain under-represented in digital research. In this work, we present the first systematic effort to construct a Macedonian recipe dataset through web scraping and structured parsing. We address challenges in processing heterogeneous ingredient descriptions, including unit, quantity, and descriptor normalization. An exploratory analysis of ingredient frequency and co-occurrence patterns, using measures such as Pointwise Mutual Information and Lift score, highlights distinctive ingredient combinations that characterize Macedonian cuisine. The resulting dataset contributes a new resource for studying food culture in underrepresented languages and offers insights into the unique patterns of Macedonian culinary tradition.

[153] DeceptionBench: A Comprehensive Benchmark for AI Deception Behaviors in Real-world Scenarios

Yao Huang, Yitong Sun, Yichi Zhang, Ruochen Zhang, Yinpeng Dong, Xingxing Wei

Main category: cs.CL

TL;DR: DeceptionBench is the first benchmark that systematically evaluates deceptive behaviors in LLMs across five societal domains, revealing critical vulnerabilities and amplified deception under reinforcement dynamics.

DetailsMotivation: The rapid enhancement of LLM capabilities introduces emergent deceptive behaviors that pose severe risks in high-stakes deployments, yet characterization of deception across realistic scenarios remains underexplored.

Method: Established a benchmark with 150 scenarios across Economy, Healthcare, Education, Social Interaction, and Entertainment domains, with over 1,000 samples. Evaluated intrinsic patterns (egoistic vs sycophantic behaviors) and extrinsic factors (neutral conditions, reward-based incentivization, coercive pressures) with multi-turn interaction loops.

Result: Extensive experiments reveal critical vulnerabilities in LLMs and LRMs, particularly amplified deception under reinforcement dynamics, showing models lack robust resistance to manipulative contextual cues.

Conclusion: Current models urgently need advanced safeguards against various deception behaviors, as demonstrated by their vulnerability to manipulative contextual cues and reinforcement dynamics.

Abstract: Despite the remarkable advances of Large Language Models (LLMs) across diverse cognitive tasks, the rapid enhancement of these capabilities also introduces emergent deceptive behaviors that may induce severe risks in high-stakes deployments. More critically, the characterization of deception across realistic real-world scenarios remains underexplored. To bridge this gap, we establish DeceptionBench, the first benchmark that systematically evaluates how deceptive tendencies manifest across different societal domains, what their intrinsic behavioral patterns are, and how extrinsic factors affect them. Specifically, on the static count, the benchmark encompasses 150 meticulously designed scenarios in five domains, i.e., Economy, Healthcare, Education, Social Interaction, and Entertainment, with over 1,000 samples, providing sufficient empirical foundations for deception analysis. On the intrinsic dimension, we explore whether models exhibit self-interested egoistic tendencies or sycophantic behaviors that prioritize user appeasement. On the extrinsic dimension, we investigate how contextual factors modulate deceptive outputs under neutral conditions, reward-based incentivization, and coercive pressures. Moreover, we incorporate sustained multi-turn interaction loops to construct a more realistic simulation of real-world feedback dynamics. Extensive experiments across LLMs and Large Reasoning Models (LRMs) reveal critical vulnerabilities, particularly amplified deception under reinforcement dynamics, demonstrating that current models lack robust resistance to manipulative contextual cues and the urgent need for advanced safeguards against various deception behaviors. Code and resources are publicly available at https://github.com/Aries-iai/DeceptionBench.

[154] InfiMed-ORBIT: Aligning LLMs on Open-Ended Complex Tasks via Rubric-Based Incremental Training

Pengkai Wang, Qi Zuo, Pengwei Liu, Zhijie Sang, Congkai Xie, Hongxia Yang

Main category: cs.CL

TL;DR: ORBIT is a rubric-based incremental training framework for medical dialogue that uses dynamically constructed rubrics as adaptive guides for reinforcement learning, avoiding the need for external knowledge bases or task-specific fine-tuning.

DetailsMotivation: Traditional RL methods deteriorate in open-ended domains like medical consultation where feedback is ambiguous and context-dependent, leading to risks like reward hacking in high-stakes medical dialogue.

Method: Integrates synthetic dialogue generation with dynamically constructed rubrics that serve as adaptive guides for incremental RL, using rubric-driven feedback and general-purpose LLMs as judges without task-specific fine-tuning.

Result: Applied to Qwen3-4B-Instruct model, ORBIT raises HealthBench-Hard score from 7.0 to 27.5 using only 2k training samples, achieving SOTA performance for models at this scale and competing with strongest open-source baselines.

Conclusion: Rubric-guided RL consistently improves consultation quality across diverse medical scenarios and generalizes to other domains like instruction-following, highlighting the generality of rubric-based feedback.

Abstract: Reinforcement learning has powered many of the recent breakthroughs in large language models, especially for tasks where rewards can be computed automatically, such as code generation. However, these methods deteriorate in open-ended domains like medical consultation, where feedback is inherently ambiguous, highly context-dependent, and cannot be reduced to a reliable scalar signal. In such settings, RL must either rely on supervision-intensive reward models that often fail to generalize, or it falls into pathological behaviors such as reward hacking - an especially troubling risk for high-stakes medical dialogue. To address these limitations, we introduce ORBIT, an open-ended rubric-based incremental training framework for high-stakes medical dialogue. ORBIT integrates synthetic dialogue generation with dynamically constructed rubrics that serve as adaptive guides for incremental RL. Instead of relying on external medical knowledge bases or handcrafted rule sets, ORBIT uses rubric-driven feedback to steer the learning process. Its judge component can be instantiated with general-purpose instruction-following LLMs, removing the need for any task-specific fine-tuning. Applied to the Qwen3-4B-Instruct model, ORBIT raises the HealthBench-Hard score from 7.0 to 27.5 using only 2k training samples, achieving SOTA performance for models at this scale. With larger rubric datasets, ORBIT-trained models further compete with the strongest open-source baselines on HealthBench-Hard. Our analysis shows that rubric-guided RL consistently improves consultation quality across diverse medical scenarios. We also apply such rubric generation and training pipeline to InfoBench, where ORBIT enhances instruction-following performance, highlighting the generality of rubric-based feedback.

[155] Chain-of-Conceptual-Thought Elicits Daily Conversation in Large Language Models

Qingqing Gu, Dan Wang, Yue Zhao, Xiaoyu Wang, Zhonglin Jiang, Yong Chen, Hongyan Li, Luo Ji

Main category: cs.CL

TL;DR: Proposes Chain of Conceptual Thoughts (CoCT), a prompt-based paradigm that guides LLMs to first generate concept tags (emotions, strategies, topics) then detailed content, improving performance in open-domain tasks like conversations.

DetailsMotivation: Chain-of-Thought (CoT) has limited effectiveness for open-domain tasks where reasoning steps aren't clearly defined, requiring a new approach for tasks without logical transitions.

Method: CoCT paradigm where LLMs first produce concept tags (emotions, strategies, topics) then generate detailed content following these concepts, implementing hierarchical thinking.

Result: CoCT outperforms baselines including self-refine, ECoT, SoT and RAG in daily and emotional support conversations across in-domain and out-of-domain concept settings, validated by automatic, human and LLM-based evaluations.

Conclusion: CoCT provides a potential solution for LLM prompting paradigm that can be applied to a wider scope of tasks beyond traditional reasoning domains.

Abstract: Chain-of-Thought (CoT) is widely applied to enhance the LLM capability in math, coding and reasoning tasks. However, its performance is limited for open-domain tasks, when there are no clearly defined reasoning steps or logical transitions. To mitigate such challenges, we propose a new prompt-based paradigm called Chain of Conceptual Thoughts (CoCT), which suggests the LLM first to produce the tag of concepts, then complete the detailed content following the concept. To encourage this hierarchical way of thinking, we implement the concepts with emotions, strategies and topics. We experiment with this paradigm in daily and emotional support conversations, covering tasks with both in-domain and out-of-domain concept settings. Automatic, human, and LLM-based evaluations reveal that CoCT surpasses several prompt-based baselines such as self-refine, ECoT, SoT and RAG, suggesting a potential solution of LLM prompting paradigm for a wider scope of tasks.

[156] When Facts Change: Probing LLMs on Evolving Knowledge with evolveQA

Nishanth Sridhar Nakshatri, Shamik Roy, Manoj Ghuhan Arivazhagan, Hanhan Zhou, Vinayshekhar Bannihatti Kumar, Rashmi Gangadharaiah

Main category: cs.CL

TL;DR: EvolveQA is a new benchmark for evaluating LLMs’ ability to handle temporal knowledge conflicts using real-world time-stamped corpora, showing up to 31% performance drop compared to static knowledge.

DetailsMotivation: Existing benchmarks for temporal knowledge conflicts use structured knowledge bases focusing on popular entities and lack dynamic structure to fairly evaluate LLMs with different knowledge cut-off dates.

Method: Constructed from 3 real-world time-stamped corpora (AWS updates, Azure changes, WHO disease outbreaks), identifies natural knowledge evolution and generates questions with gold answers tailored to different LLM knowledge cut-off dates.

Result: Evaluation of 12 LLMs across 3 knowledge probing formats shows significant performance drops of up to 31% on evolveQA compared to static knowledge questions.

Conclusion: EvolveQA effectively reveals LLMs’ limitations in handling temporally evolving knowledge and provides a fair evaluation framework for different knowledge cut-off dates.

Abstract: LLMs often fail to handle temporal knowledge conflicts–contradictions arising when facts evolve over time within their training data. Existing studies evaluate this phenomenon through benchmarks built on structured knowledge bases like Wikidata, but they focus on widely-covered, easily-memorized popular entities and lack the dynamic structure needed to fairly evaluate LLMs with different knowledge cut-off dates. We introduce evolveQA, a benchmark specifically designed to evaluate LLMs on temporally evolving knowledge, constructed from 3 real-world, time-stamped corpora: AWS updates, Azure changes, and WHO disease outbreak reports. Our framework identifies naturally occurring knowledge evolution and generates questions with gold answers tailored to different LLM knowledge cut-off dates. Through extensive evaluation of 12 open and closed-source LLMs across 3 knowledge probing formats, we demonstrate significant performance drops of up to 31% on evolveQA compared to static knowledge questions.

[157] SelecTKD: Selective Token-Weighted Knowledge Distillation for LLMs

Haiduo Huang, Jiangcheng Song, Yadong Zhang, Pengju Ren

Main category: cs.CL

TL;DR: SelecTKD is a selective token-weighted knowledge distillation framework that uses a propose-and-verify mechanism to focus learning on high-confidence teacher tokens, improving student model performance across various tasks.

DetailsMotivation: Standard knowledge distillation applies uniform token-wise loss regardless of teacher confidence, which amplifies noisy signals and is harmful under large teacher-student capacity gaps.

Method: Uses a propose-and-verify procedure where student proposes tokens that are verified by teacher (greedy Top-k or non-greedy Spec-k variants). Accepted tokens receive full loss while rejected tokens are masked/down-weighted.

Result: Consistently improves strong baselines and achieves state-of-the-art results for small models across instruction following, mathematical reasoning, code generation, and VLM settings without architectural changes.

Conclusion: SelecTKD provides an effective plug-and-play distillation framework that shifts focus from measuring divergence to selective learning, stabilizing optimization and improving performance across diverse tasks.

Abstract: Knowledge distillation (KD) is a standard route to compress Large Language Models (LLMs) into compact students, yet most pipelines uniformly apply token-wise loss regardless of teacher confidence. This indiscriminate supervision amplifies noisy, high-entropy signals and is especially harmful under large teacher-student capacity gaps. We introduce SelecTKD, a plug-and-play Selective Token-Weighted distillation framework that shifts the focus from “how to measure divergence” to “where to apply learning”. At each step, the student proposes tokens that are verified by the teacher through a robust propose-and-verify procedure with two variants: greedy Top-k and non-greedy Spec-k. Accepted tokens receive full loss, while rejected tokens are masked or down-weighted. This objective-agnostic design works with on- and off-policy data, induces an implicit curriculum quantified by Token Acceptance Rate (TAR), and stabilizes optimization. Across instruction following, mathematical reasoning, code generation, and a VLM setting, SelecTKD consistently improves strong baselines and achieves state-of-the-art results for small models without architectural changes or extra reference models.

[158] A Survey on Unlearning in Large Language Models

Ruichen Qiu, Jiajun Tan, Jiayue Pu, Honglin Wang, Xiao-Shan Gao, Fei Sun

Main category: cs.CL

TL;DR: This survey systematically reviews LLM unlearning methods, categorizing them by intervention phase and strategy, and analyzes evaluation paradigms including datasets and metrics.

DetailsMotivation: To address risks from memorized sensitive information in LLMs and align with legal standards through selective knowledge erasure without compromising performance.

Method: Systematic review of 180+ papers with novel taxonomy categorizing unlearning methods by intervention phase (parameter modification vs. selection), and multidimensional analysis of evaluation paradigms including 18 benchmarks and 10 knowledge memorization metric categories.

Result: Provides comprehensive framework for understanding LLM unlearning methods and evaluation approaches, enabling deeper insights and comparative analysis across different strategies.

Conclusion: The survey advances LLM unlearning field and secure AI development by establishing systematic categorization and evaluation framework, while identifying current challenges and future directions.

Abstract: Large Language Models (LLMs) demonstrate remarkable capabilities, but their training on massive corpora poses significant risks from memorized sensitive information. To mitigate these issues and align with legal standards, unlearning has emerged as a critical technique to selectively erase specific knowledge from LLMs without compromising their overall performance. This survey provides a systematic review of over 180 papers on LLM unlearning published since 2021. First, it introduces a novel taxonomy that categorizes unlearning methods based on the phase in the LLM pipeline of the intervention. This framework further distinguishes between parameter modification and parameter selection strategies, thus enabling deeper insights and more informed comparative analysis. Second, it offers a multidimensional analysis of evaluation paradigms. For datasets, we compare 18 existing benchmarks from the perspectives of task format, content, and experimental paradigms to offer actionable guidance. For metrics, we move beyond mere enumeration by dividing knowledge memorization metrics into 10 categories to analyze their advantages and applicability, while also reviewing metrics for model utility, robustness, and efficiency. By discussing current challenges and future directions, this survey aims to advance the field of LLM unlearning and the development of secure AI systems.

[159] Multi-Personality Generation of LLMs at Decoding-time

Rongxin Chen, Yunfan Li, Yige Yuan, Bingbing Xu, Huawei Shen

Main category: cs.CL

TL;DR: Proposes a novel Multi-Personality Generation framework that enables LLMs to embody multiple personalization attributes simultaneously without retraining, using speculative chunk-level rejection sampling for efficient implementation.

DetailsMotivation: Existing methods for multi-personality generation in LLMs are either costly (retraining-based) or limited in flexibility (decoding-time methods relying on external models/heuristics), creating a need for more efficient and robust solutions.

Method: MPG framework under decoding-time combination paradigm that leverages implicit density ratios in single-dimensional models, implemented via Speculative Chunk-level based Rejection sampling (SCR) which generates responses in chunks and validates them parallelly.

Result: Experiments on MBTI personality and Role-Playing show improvements up to 16%-18% in multi-personality generation effectiveness.

Conclusion: The proposed MPG framework provides an efficient and flexible solution for multi-personality generation in LLMs without requiring retraining or external models, achieving significant performance improvements.

Abstract: Multi-personality generation for LLMs, enabling simultaneous embodiment of multiple personalization attributes, is a fundamental challenge. Existing retraining-based approaches are costly and poorly scalable, while decoding-time methods often rely on external models or heuristics, limiting flexibility and robustness. In this paper, we propose a novel Multi-Personality Generation (MPG) framework under the decoding-time combination paradigm. It flexibly controls multi-personality without relying on scarce multi-dimensional models or extra training, leveraging implicit density ratios in single-dimensional models as a “free lunch” to reformulate the task as sampling from a target strategy aggregating these ratios. To implement MPG efficiently, we design Speculative Chunk-level based Rejection sampling (SCR), which generates responses in chunks and parallelly validates them via estimated thresholds within a sliding window. This significantly reduces computational overhead while maintaining high-quality generation. Experiments on MBTI personality and Role-Playing demonstrate the effectiveness of MPG, showing improvements up to 16%-18%. Code and data are available at https://github.com/Libra117/MPG .

[160] Silenced Biases: The Dark Side LLMs Learned to Refuse

Rom Himelstein, Amit LeVi, Brit Youngmann, Yaniv Nemcovsky, Avi Mendelson

Main category: cs.CL

TL;DR: The paper introduces the Silenced Bias Benchmark (SBB) to uncover hidden biases in safety-aligned LLMs that are masked by refusal responses, using activation steering to reveal underlying unfair preferences.

DetailsMotivation: Current fairness evaluation methods for LLMs often misinterpret refusal responses as positive fairness indicators, creating a false sense of fairness while overlooking deeper biases concealed by safety alignment.

Method: Proposed Silenced Bias Benchmark (SBB) that uses activation steering to reduce model refusals during question-answer evaluations, allowing detection of silenced biases in the latent space without relying on prompt manipulation.

Result: The approach reveals an alarming distinction between models’ direct refusal responses and their underlying fairness issues, exposing biases that were effectively concealed by safety alignment.

Conclusion: SBB provides a scalable fairness evaluation framework that can expand to new demographic groups and subjects, encouraging development of fair models beyond the masking effects of alignment training.

Abstract: Safety-aligned large language models (LLMs) are becoming increasingly widespread, especially in sensitive applications where fairness is essential and biased outputs can cause significant harm. However, evaluating the fairness of models is a complex challenge, and approaches that do so typically utilize standard question-answer (QA) styled schemes. Such methods often overlook deeper issues by interpreting the model’s refusal responses as positive fairness measurements, which creates a false sense of fairness. In this work, we introduce the concept of silenced biases, which are unfair preferences encoded within models’ latent space and are effectively concealed by safety-alignment. Previous approaches that considered similar indirect biases often relied on prompt manipulation or handcrafted implicit queries, which present limited scalability and risk contaminating the evaluation process with additional biases. We propose the Silenced Bias Benchmark (SBB), which aims to uncover these biases by employing activation steering to reduce model refusals during QA. SBB supports easy expansion to new demographic groups and subjects, presenting a fairness evaluation framework that encourages the future development of fair models and tools beyond the masking effects of alignment training. We demonstrate our approach over multiple LLMs, where our findings expose an alarming distinction between models’ direct responses and their underlying fairness issues.

[161] Evaluating LLMs’ Reasoning Over Ordered Procedural Steps

Adrita Anika, Md Messal Monem Miah

Main category: cs.CL

TL;DR: LLMs struggle with reconstructing correct sequences from shuffled procedural steps, especially with longer sequences and more severe shuffling, as evaluated using ranking and sequence alignment metrics.

DetailsMotivation: Reasoning over procedural sequences is critical for LLMs, and understanding their limitations in reconstructing globally ordered sequences from shuffled steps is important for domains like food recipes where sequencing affects outcomes.

Method: Evaluated several LLMs on a curated dataset of food recipes using zero-shot and few-shot settings, with metrics including Kendall’s Tau, NLCS, and NED to capture different aspects of ordering quality.

Result: Model performance declines with increasing sequence length and greater step displacement (more severe shuffling), highlighting limitations in procedural reasoning.

Conclusion: Current LLMs have significant limitations in procedural reasoning tasks, particularly when dealing with longer sequences and more disordered inputs, indicating areas for future improvement.

Abstract: Reasoning over procedural sequences, where the order of steps directly impacts outcomes, is a critical capability for large language models (LLMs). In this work, we study the task of reconstructing globally ordered sequences from shuffled procedural steps, using a curated dataset of food recipes, a domain where correct sequencing is essential for task success. We evaluate several LLMs under zero-shot and few-shot settings and present a comprehensive evaluation framework that adapts established metrics from ranking and sequence alignment. These include Kendall’s Tau, Normalized Longest Common Subsequence (NLCS), and Normalized Edit Distance (NED), which capture complementary aspects of ordering quality. Our analysis shows that model performance declines with increasing sequence length, reflecting the added complexity of longer procedures. We also find that greater step displacement in the input, corresponding to more severe shuffling, leads to further degradation. These findings highlight the limitations of current LLMs in procedural reasoning, especially with longer and more disordered inputs.

[162] Self-Correction Distillation for Structured Data Question Answering

Yushan Zhu, Wen Zhang, Long Jin, Mengshu Sun, Ling Zhong, Zhiqiang Liu, Juan Li, Lei Liang, Chong Long, Chao Deng, Junlan Feng

Main category: cs.CL

TL;DR: Proposes Self-Correction Distillation (SCD) method to improve structured data QA for small LLMs by transferring query-generation and error-correction capabilities from large LLMs.

DetailsMotivation: Small-scale LLMs struggle with structured data QA due to errors in generating structured queries, while existing unified frameworks like TrustUQA work better with large LLMs.

Method: SCD uses Error Prompt Mechanism (EPM) to detect errors and provide customized messages during inference, plus two-stage distillation to transfer capabilities from large to small LLMs.

Result: SCD achieves best performance on 5 benchmarks with 3 structured data types using 8B models, closely approaching GPT4 on some datasets. EPM also helps large LLMs surpass SOTA results.

Conclusion: SCD effectively improves small LLMs’ structured data QA capabilities through self-correction distillation and error detection mechanisms.

Abstract: Structured data question answering (QA), including table QA, Knowledge Graph (KG) QA, and temporal KG QA, is a pivotal research area. Advances in large language models (LLMs) have driven significant progress in unified structural QA frameworks like TrustUQA. However, these frameworks face challenges when applied to small-scale LLMs since small-scale LLMs are prone to errors in generating structured queries. To improve the structured data QA ability of small-scale LLMs, we propose a self-correction distillation (SCD) method. In SCD, an error prompt mechanism (EPM) is designed to detect errors and provide customized error messages during inference, and a two-stage distillation strategy is designed to transfer large-scale LLMs’ query-generation and error-correction capabilities to small-scale LLM. Experiments across 5 benchmarks with 3 structured data types demonstrate that our SCD achieves the best performance and superior generalization on small-scale LLM (8B) compared to other distillation methods, and closely approaches the performance of GPT4 on some datasets. Furthermore, large-scale LLMs equipped with EPM surpass the state-of-the-art results on most datasets.

[163] VocalBench-zh: Decomposing and Benchmarking the Speech Conversational Abilities in Mandarin Context

Heyang Liu, Ziyang Cheng, Yuhao Wang, Hongcheng Liu, Yiqi Li, Ronghua Wu, Qunshan Gu, Yanfeng Wang, Yu Wang

Main category: cs.CL

TL;DR: VocalBench-zh is a comprehensive Mandarin speech-to-speech evaluation benchmark with 10 subsets and 10K+ instances, addressing the lack of systematic evaluation tools for Chinese speech interaction models.

DetailsMotivation: The scarcity of comprehensive speech-to-speech benchmarks in Mandarin contexts impedes systematic evaluation for developers and hinders fair model comparison for users.

Method: Created VocalBench-zh, an ability-level divided evaluation suite with 10 well-crafted subsets and over 10K high-quality instances, covering 12 user-oriented characters adapted to Mandarin context.

Result: Evaluation of 14 mainstream models revealed common challenges for current approaches and highlighted the need for new insights into next-generation speech interactive systems.

Conclusion: VocalBench-zh provides a systematic evaluation framework for Mandarin speech-to-speech models, enabling better model comparison and development of improved speech interaction systems.

Abstract: The development of multi-modal large language models (LLMs) leads to intelligent approaches capable of speech interactions. As one of the most widely spoken languages globally, Mandarin is supported by most models to enhance their applicability and reach. However, the scarcity of comprehensive speech-to-speech (S2S) benchmarks in Mandarin contexts impedes systematic evaluation for developers and hinders fair model comparison for users. In this work, we propose VocalBench-zh, an ability-level divided evaluation suite adapted to Mandarin context consisting of 10 well-crafted subsets and over 10K high-quality instances, covering 12 user-oriented characters. The evaluation experiment on 14 mainstream models reveals the common challenges for current routes, and highlights the need for new insights into next-generation speech interactive systems. The evaluation codes and datasets will be available at https://github.com/SJTU-OmniAgent/VocalBench-zh.

[164] A Super-Learner with Large Language Models for Medical Emergency Advising

Sergey K. Aityan, Abdolreza Mosaddegh, Rolando Herrero, Haitham Tayyar, Jiang Han, Vikram Sawant, Qi Chen, Rishabh Jain, Aruna Senthamaraikannan, Stephen Wood, Manuel Mersini, Rita Lazzaro, Mario Balzaneli, Nicola Iacovazzo, Ciro Gargiulo Isacco

Main category: cs.CL

TL;DR: A super-learner MEDAS system combining five major LLMs (Gemini, Llama, Grok, GPT, Claude) achieved 70% diagnostic accuracy for emergency medicine cases, outperforming individual LLMs (58-65%) and human doctors.

DetailsMotivation: To improve medical decision-support systems in emergency medicine by leveraging the combined capabilities of multiple LLMs through ensemble learning.

Method: Built a super-learner system using meta-learning to integrate five major LLMs, learning each model’s specific capabilities to leverage collective diagnostic power.

Result: Individual LLMs achieved 58-65% accuracy, while the super-learner reached 70% accuracy, with at least one LLM achieving 85% accuracy on specific cases.

Conclusion: Meta-learning ensemble approaches can significantly enhance diagnostic accuracy in emergency medicine by combining the strengths of multiple LLMs, outperforming both individual models and human doctors.

Abstract: Medical decision-support and advising systems are critical for emergency physicians to quickly and accurately assess patients’ conditions and make diagnosis. Artificial Intelligence (AI) has emerged as a transformative force in healthcare in recent years and Large Language Models (LLMs) have been employed in various fields of medical decision-support systems. We studied responses of a group of different LLMs to real cases in emergency medicine. The results of our study on five most renown LLMs showed significant differences in capabilities of Large Language Models for diagnostics acute diseases in medical emergencies with accuracy ranging between 58% and 65%. This accuracy significantly exceeds the reported accuracy of human doctors. We built a super-learner MEDAS (Medical Emergency Diagnostic Advising System) of five major LLMs - Gemini, Llama, Grok, GPT, and Claude). The super-learner produces higher diagnostic accuracy, 70%, even with a quite basic meta-learner. However, at least one of the integrated LLMs in the same super-learner produces 85% correct diagnoses. The super-learner integrates a cluster of LLMs using a meta-learner capable of learning different capabilities of each LLM to leverage diagnostic accuracy of the model by collective capabilities of all LLMs in the cluster. The results of our study showed that aggregated diagnostic accuracy provided by a meta-learning approach exceeds that of any individual LLM, suggesting that the super-learner can take advantage of the combined knowledge of the medical datasets used to train the group of LLMs.

[165] C$^3$TG: Conflict-aware, Composite, and Collaborative Controlled Text Generation

Yu Li, Zhe Yang, Yi Huang, Xin Liu, Guilin Qi

Main category: cs.CL

TL;DR: C³TG is a two-phase framework for fine-grained multi-dimensional text attribute control that uses selective classifier pairing and iterative optimization to resolve attribute conflicts without model modifications.

DetailsMotivation: Current LLM control methods struggle with precise multi-attribute control, lack coordination for conflicting attributes, and don't incorporate iterative optimization in the generation pipeline.

Method: Two-phase framework: generation phase pairs LLM with required attribute classifiers using weighted KL-divergence; optimization phase uses energy function with classifier scores and penalty terms for iterative conflict resolution.

Result: Significantly outperforms baselines in attribute accuracy, linguistic fluency, output diversity while reducing toxicity, across 17 attribute dimensions.

Conclusion: C³TG provides an effective, flexible solution for multi-dimensional text attribute control that requires no costly model modifications.

Abstract: Recent advancements in large language models (LLMs) have demonstrated remarkable text generation capabilities. However, controlling specific attributes of generated text remains challenging without architectural modifications or extensive fine-tuning. Current methods typically toggle a single, basic attribute but struggle with precise multi-attribute control. In scenarios where attribute requirements conflict, existing methods lack coordination mechanisms, causing interference between desired attributes. Furthermore, these methods fail to incorporate iterative optimization processes in the controlled generation pipeline. To address these limitations, we propose Conflict-aware, Composite, and Collaborative Controlled Text Generation (C$^3$TG), a two-phase framework for fine-grained, multi-dimensional text attribute control. During generation, C$^3$TG selectively pairs the LLM with the required attribute classifiers from the 17 available dimensions and employs weighted KL-divergence to adjust token probabilities. The optimization phase then leverages an energy function combining classifier scores and penalty terms to resolve attribute conflicts through iterative feedback, enabling precise control over multiple dimensions simultaneously while preserving natural text flow. Experiments show that C$^3$TG significantly outperforms baselines across multiple metrics including attribute accuracy, linguistic fluency, and output diversity, while simultaneously reducing toxicity. These results establish C$^3$TG as an effective and flexible solution for multi-dimensional text attribute control that requires no costly model modifications.

[166] AMaPO: Adaptive Margin-attached Preference Optimization for Language Model Alignment

Ruibo Deng, Duanyu Feng, Wenqiang Lei

Main category: cs.CL

TL;DR: AMaPO resolves the Overfitting-Underfitting Dilemma in offline preference optimization by using adaptive margins to dynamically reallocate learning effort between correctly and incorrectly ranked samples.

DetailsMotivation: Current offline preference optimization methods suffer from the Overfitting-Underfitting Dilemma where excessive gradients are applied to correctly ranked samples while insufficient signals are provided for misranked ones, limiting ranking accuracy.

Method: Proposes Adaptive Margin-attached Preference Optimization (AMaPO) with instance-wise adaptive margins refined by Z-normalization and exponential scaling to amplify gradients for misranked samples and suppress them for correct ones.

Result: AMaPO achieves better ranking accuracy and superior downstream alignment performance compared to existing methods, and targeted analysis confirms it successfully mitigates overfitting and underfitting issues.

Conclusion: AMaPO provides a principled solution to the fundamental Overfitting-Underfitting Dilemma in offline preference optimization, offering improved performance through dynamic gradient reallocation.

Abstract: Offline preference optimization offers a simpler and more stable alternative to RLHF for aligning language models. However, their effectiveness is critically dependent on ranking accuracy, a metric where further gains are highly impactful. This limitation arises from a fundamental problem that we identify and formalize as the Overfitting-Underfitting Dilemma: current margin designs cause models to apply excessive, wasteful gradients to correctly ranked samples (overfitting) while providing insufficient corrective signals for misranked ones (underfitting). To resolve this dilemma, we propose Adaptive Margin-attached Preference Optimization (AMaPO), a simple yet principled algorithm. AMaPO employs an instance-wise adaptive margin, refined by Z-normalization and exponential scaling, which dynamically reallocates learning effort by amplifying gradients for misranked samples and suppressing them for correct ones. Extensive experiments on widely used benchmarks demonstrate that AMaPO not only achieves better ranking accuracy and superior downstream alignment performance, but targeted analysis also confirms that it successfully mitigates the core overfitting and underfitting issues.

[167] Do Language Models Associate Sound with Meaning? A Multimodal Study of Sound Symbolism

Jinhong Jeong, Sunghyun Lee, Jaeyoung Lee, Seonah Han, Youngjae Yu

Main category: cs.CL

TL;DR: MLLMs show phonetic intuition aligned with linguistic research on sound symbolism, using attention patterns to focus on iconic phonemes across text and audio modalities.

DetailsMotivation: To investigate how Multimodal Large Language Models interpret auditory information through sound symbolism (phonetic iconicity) as a probe into their understanding of human languages.

Method: Created LEX-ICON dataset with 8,052 words from 4 languages and 2,930 pseudo-words, analyzed MLLMs’ layer-wise processing using phoneme-level attention scores across 25 semantic dimensions for both text and audio inputs.

Result: MLLMs demonstrate phonetic intuitions consistent with linguistic research and show phonosemantic attention patterns that emphasize iconic phonemes across multiple semantic dimensions.

Conclusion: This work bridges AI and cognitive linguistics, providing the first large-scale quantitative analysis of phonetic iconicity in MLLMs’ interpretability, revealing their ability to capture sound-meaning associations.

Abstract: Sound symbolism is a linguistic concept that refers to non-arbitrary associations between phonetic forms and their meanings. We suggest that this can be a compelling probe into how Multimodal Large Language Models (MLLMs) interpret auditory information in human languages. We investigate MLLMs’ performance on phonetic iconicity across textual (orthographic and IPA) and auditory forms of inputs with up to 25 semantic dimensions (e.g., sharp vs. round), observing models’ layer-wise information processing by measuring phoneme-level attention fraction scores. To this end, we present LEX-ICON, an extensive mimetic word dataset consisting of 8,052 words from four natural languages (English, French, Japanese, and Korean) and 2,930 systematically constructed pseudo-words, annotated with semantic features applied across both text and audio modalities. Our key findings demonstrate (1) MLLMs’ phonetic intuitions that align with existing linguistic research across multiple semantic dimensions and (2) phonosemantic attention patterns that highlight models’ focus on iconic phonemes. These results bridge domains of artificial intelligence and cognitive linguistics, providing the first large-scale, quantitative analyses of phonetic iconicity in terms of MLLMs’ interpretability.

[168] BhashaKritika: Building Synthetic Pretraining Data at Scale for Indic Languages

Guduru Manoj, Neel Prabhanjan Rachamalla, Ashish Kulkarni, Gautam Rajeev, Jay Piplodiya, Arul Menezes, Shaharukh Khan, Souvik Rana, Manya Sah, Chandra Khatri, Shubham Agarwal

Main category: cs.CL

TL;DR: Systematic study on generating and evaluating synthetic multilingual pretraining data for Indic languages, creating BhashaKritika dataset (540B tokens) using 5 techniques across 10 languages, with comprehensive quality evaluation pipeline.

DetailsMotivation: Address the uneven distribution of LLM benefits across languages by providing scalable synthetic pretraining data for low-resource Indic languages as an alternative to limited natural data.

Method: Constructed large-scale synthetic dataset using 5 generation techniques, explored grounding in documents/personas/topics, compared translation vs native generation, and developed modular quality evaluation pipeline with script/language detection, metadata checks, n-gram analysis, and perplexity filtering.

Result: Created BhashaKritika dataset (540B tokens), identified key trade-offs in generation strategies, and established best practices for multilingual corpus construction through empirical model runs.

Conclusion: The study provides a systematic framework for generating and evaluating synthetic multilingual pretraining data, highlighting effective strategies for low-resource language settings and enabling robust quality control across diverse linguistic contexts.

Abstract: In the context of pretraining of Large Language Models (LLMs), synthetic data has emerged as an alternative for generating high-quality pretraining data at scale. This is particularly beneficial in low-resource language settings where the benefits of recent LLMs have been unevenly distributed across languages. In this work, we present a systematic study on the generation and evaluation of synthetic multilingual pretraining data for Indic languages, where we construct a large-scale synthetic dataset BhashaKritika, comprising 540B tokens using 5 different techniques for 10 languages. We explore the impact of grounding generation in documents, personas, and topics. We analyze how language choice, both in the prompt instructions and document grounding, affects data quality, and we compare translations of English content with native generation in Indic languages. To support scalable and language-sensitive evaluation, we introduce a modular quality evaluation pipeline that integrates script and language detection, metadata consistency checks, n-gram repetition analysis, and perplexity-based filtering using KenLM models. Our framework enables robust quality control across diverse scripts and linguistic contexts. Empirical results through model runs reveal key trade-offs in generation strategies and highlight best practices for constructing effective multilingual corpora.

cs.CV

[169] Psychological stress during Examination and its estimation by handwriting in answer script

Abhijeet Kumar, Chetan Agarwal, Pronoy B. Neogi, Mayank Goswami

Main category: cs.CV

TL;DR: AI system analyzes handwritten exam scripts to quantify student stress levels using OCR and sentiment analysis, creating a Stress Index through model voting and anomaly detection.

DetailsMotivation: To go beyond traditional grading by providing insights into students' cognitive and emotional states during exams through handwriting analysis.

Method: Uses high-resolution image processing, TrOCR for text recognition, RoBERTa-based sentiment analysis with entropy fusion, five-model voting mechanism, and unsupervised anomaly detection.

Result: Developed a robust framework that generates a numerical Stress Index from handwritten examination scripts.

Conclusion: The system presents an innovative data-driven approach in academic forensics for quantifying psychological stress through handwriting analysis.

Abstract: This research explores the fusion of graphology and artificial intelligence to quantify psychological stress levels in students by analyzing their handwritten examination scripts. By leveraging Optical Character Recognition and transformer based sentiment analysis models, we present a data driven approach that transcends traditional grading systems, offering deeper insights into cognitive and emotional states during examinations. The system integrates high resolution image processing, TrOCR, and sentiment entropy fusion using RoBERTa based models to generate a numerical Stress Index. Our method achieves robustness through a five model voting mechanism and unsupervised anomaly detection, making it an innovative framework in academic forensics.

[170] Real-time pothole detection with onboard sensors and camera on vehicles

Aswath Muthuselvam, Jeevak Raj S, Mohanaprasad K

Main category: cs.CV

TL;DR: Real-time pothole detection using vehicle sensors and SVM classifier with 98.1% accuracy on 2km road with 26 potholes.

DetailsMotivation: Road conditions are crucial for daily commute and traffic flow. Small cracks can develop into large potholes due to temperature changes and vehicle pressure, requiring frequent monitoring.

Method: Used onboard vehicle sensors and SVM classifier for real-time pothole detection.

Result: Achieved 98.1% accuracy on data collected from 2km local road containing 26 potholes.

Conclusion: Proposed method enables effective large-scale pothole management through real-time detection using vehicle sensors.

Abstract: Road conditions play an important role in our everyday commute. With the proliferating number of vehicles on the road each year, it has become necessary to access the road conditions very frequently, this would ensure that the traffic also flows smoothly. Even the smallest crack in the road could be easily be chipped into a large pothole due to changing surface temperatures of the road and from the force of vehicles riding over it. In this paper, we have addressed how we could better identify these potholes in realtime with the help of onboard sensors in vehicles so that the data could be useful for analysis and better management of potholes on a large scale. For the implementation, we used an SVM classifier to detect potholes, we achieved 98.1% accuracy based on data collected from a local road for about 2 km which had 26 potholes distributed along the road. Code is available at: https://github.com/aswathselvam/Potholes

[171] A Method for Identifying Farmland System Habitat Types Based on the Dynamic-Weighted Feature Fusion Network Model

Kesong Zheng, Zhi Song, Peizhou Li, Shuyi Yao, Zhenxing Bian

Main category: cs.CV

TL;DR: Proposed DWFF-Net with dynamic-weighted feature fusion for cultivated land habitat segmentation, achieving improved accuracy over baseline models.

DetailsMotivation: Lack of standardized habitat classification system for cultivated land ecosystems, incomplete habitat type coverage, and inability of existing models to effectively integrate semantic and texture features for multi-scale habitats.

Method: Developed annotated ultra-high-resolution remote sensing dataset with 15 habitat categories. Proposed DWFF-Net using frozen-parameter DINOv3 encoder, data-level adaptive dynamic weighting for feature fusion, dynamic weight computation network in decoder, and hybrid loss function.

Result: Achieved mIoU of 0.6979 and F1-score of 0.8049 on constructed dataset, outperforming baseline by 0.021 and 0.0161 respectively. Improved IoU for micro-habitat categories like field ridges.

Conclusion: Established habitat identification framework enabling sub-meter precision habitat mapping at low cost, providing technical support for fine-grained habitat monitoring in cultivated landscapes.

Abstract: Addressing the current lack of a standardized habitat classification system for cultivated land ecosystems, incomplete coverage of habitat types, and the inability of existing models to effectively integrate semantic and texture features-resulting in insufficient segmentation accuracy and blurred boundaries for multi-scale habitats (e.g., large-scale field plots and micro-habitats)-this study developed a comprehensively annotated ultra-high-resolution remote sensing image dataset encompassing 15 categories of cultivated land system habitats. Furthermore, we propose a Dynamic-Weighted Feature Fusion Network (DWFF-Net). The encoder of this model utilizes a frozen-parameter DINOv3 to extract foundational features. By analyzing the relationships between different category images and feature maps, we introduce a data-level adaptive dynamic weighting strategy for feature fusion. The decoder incorporates a dynamic weight computation network to achieve thorough integration of multi-layer features, and a hybrid loss function is adopted to optimize model training. Experimental results on the constructed dataset demonstrate that the proposed model achieves a mean Intersection over Union (mIoU) of 0.6979 and an F1-score of 0.8049, outperforming the baseline network by 0.021 and 0.0161, respectively. Ablation studies further confirm the complementary nature of multi-layer feature fusion, which effectively improves the IoU for micro-habitat categories such as field ridges. This study establishes a habitat identification framework for cultivated land systems based on adaptive multi-layer feature fusion, enabling sub-meter precision habitat mapping at a low cost and providing robust technical support for fine-grained habitat monitoring in cultivated landscapes.

[172] Calibrated Multimodal Representation Learning with Missing Modalities

Xiaohao Liu, Xiaobo Xia, Jiaheng Wei, Shuo Yang, Xiu Su, See-Kiong Ng, Tat-Seng Chua

Main category: cs.CV

TL;DR: CalMRL addresses multimodal representation learning with missing modalities by calibrating incomplete alignments through representation-level imputation and bi-step learning.

DetailsMotivation: Existing multimodal learning methods require all modalities to be present, making it challenging to utilize datasets with missing modalities. The paper identifies an 'anchor shift' problem where observed modalities align with suboptimal anchors when some modalities are missing.

Method: Proposes CalMRL which leverages modality priors and connections to model missing modality imputation at representation level. Uses bi-step learning with closed-form solution for posterior distribution of shared latents to resolve optimization challenges.

Result: Extensive experiments demonstrate CalMRL’s superiority in handling missing modalities. The method provides new flexibility to utilize data with missing modalities that was previously unusable.

Conclusion: CalMRL effectively mitigates the anchor shift problem in multimodal representation learning with missing modalities, offering theoretical guarantees and practical improvements for real-world datasets with incomplete modalities.

Abstract: Multimodal representation learning harmonizes distinct modalities by aligning them into a unified latent space. Recent research generalizes traditional cross-modal alignment to produce enhanced multimodal synergy but requires all modalities to be present for a common instance, making it challenging to utilize prevalent datasets with missing modalities. We provide theoretical insights into this issue from an anchor shift perspective. Observed modalities are aligned with a local anchor that deviates from the optimal one when all modalities are present, resulting in an inevitable shift. To address this, we propose CalMRL for multimodal representation learning to calibrate incomplete alignments caused by missing modalities. Specifically, CalMRL leverages the priors and the inherent connections among modalities to model the imputation for the missing ones at the representation level. To resolve the optimization dilemma, we employ a bi-step learning method with the closed-form solution of the posterior distribution of shared latents. We validate its mitigation of anchor shift and convergence with theoretical guidance. By equipping the calibrated alignment with the existing advanced method, we offer new flexibility to absorb data with missing modalities, which is originally unattainable. Extensive experiments and comprehensive analyses demonstrate the superiority of CalMRL. Our code, model checkpoints, and evaluation raw data will be publicly available.

[173] AGENet: Adaptive Edge-aware Geodesic Distance Learning for Few-Shot Medical Image Segmentation

Ziyuan Gao

Main category: cs.CV

TL;DR: AGENet is a few-shot medical image segmentation framework that uses edge-aware geodesic distance learning and adaptive prototype extraction to improve boundary delineation with minimal training data.

DetailsMotivation: Medical image segmentation requires large annotated datasets, creating bottlenecks for clinical applications. Existing few-shot methods have suboptimal boundary delineation, especially for anatomically similar regions without sufficient spatial context.

Method: Combines edge-aware geodesic distance learning with iterative Fast Marching refinement, adaptive prototype extraction using spatially-weighted aggregation, and adaptive parameter learning that adjusts to different organ characteristics.

Result: Extensive experiments show improvements over state-of-the-art methods, with reduced boundary errors while maintaining computational efficiency.

Conclusion: AGENet is highly suitable for clinical applications requiring precise segmentation with limited annotated data, leveraging predictable geometric patterns in medical structures.

Abstract: Medical image segmentation requires large annotated datasets, creating a significant bottleneck for clinical applications. While few-shot segmentation methods can learn from minimal examples, existing approaches demonstrate suboptimal performance in precise boundary delineation for medical images, particularly when anatomically similar regions appear without sufficient spatial context. We propose AGENet (Adaptive Geodesic Edge-aware Network), a novel framework that incorporates spatial relationships through edge-aware geodesic distance learning. Our key insight is that medical structures follow predictable geometric patterns that can guide prototype extraction even with limited training data. Unlike methods relying on complex architectural components or heavy neural networks, our approach leverages computationally lightweight geometric modeling. The framework combines three main components: (1) An edge-aware geodesic distance learning module that respects anatomical boundaries through iterative Fast Marching refinement, (2) adaptive prototype extraction that captures both global structure and local boundary details via spatially-weighted aggregation, and (3) adaptive parameter learning that automatically adjusts to different organ characteristics. Extensive experiments across diverse medical imaging datasets demonstrate improvements over state-of-the-art methods. Notably, our method reduces boundary errors compared to existing approaches while maintaining computational efficiency, making it highly suitable for clinical applications requiring precise segmentation with limited annotated data.

[174] EPSegFZ: Efficient Point Cloud Semantic Segmentation for Few- and Zero-Shot Scenarios with Language Guidance

Jiahui Wang, Haiyue Zhu, Haoren Guo, Abdullah Al Mamun, Cheng Xiang, Tong Heng Lee

Main category: cs.CV

TL;DR: EPSegFZ is a pre-training-free network for few- and zero-shot 3D point cloud semantic segmentation that uses prototype-enhanced attention and language-guided embeddings to improve performance without relying on pre-training.

DetailsMotivation: Current few-shot 3D point cloud segmentation methods rely heavily on pre-training, limiting flexibility, and fail to fully utilize textual information from support sets, which restricts zero-shot capabilities.

Method: Uses Prototype-Enhanced Registers Attention (ProERA) and Dual Relative Positional Encoding (DRPE) for feature extraction without pre-training, plus Language-Guided Prototype Embedding (LGPE) to leverage textual information for few-shot learning and enable zero-shot inference.

Result: Outperforms state-of-the-art methods by 5.68% on S3DIS and 3.82% on ScanNet benchmarks.

Conclusion: The proposed pre-training-free approach effectively addresses limitations of existing methods by better utilizing support set information and enabling both few-shot and zero-shot segmentation capabilities.

Abstract: Recent approaches for few-shot 3D point cloud semantic segmentation typically require a two-stage learning process, i.e., a pre-training stage followed by a few-shot training stage. While effective, these methods face overreliance on pre-training, which hinders model flexibility and adaptability. Some models tried to avoid pre-training yet failed to capture ample information. In addition, current approaches focus on visual information in the support set and neglect or do not fully exploit other useful data, such as textual annotations. This inadequate utilization of support information impairs the performance of the model and restricts its zero-shot ability. To address these limitations, we present a novel pre-training-free network, named Efficient Point Cloud Semantic Segmentation for Few- and Zero-shot scenarios. Our EPSegFZ incorporates three key components. A Prototype-Enhanced Registers Attention (ProERA) module and a Dual Relative Positional Encoding (DRPE)-based cross-attention mechanism for improved feature extraction and accurate query-prototype correspondence construction without pre-training. A Language-Guided Prototype Embedding (LGPE) module that effectively leverages textual information from the support set to improve few-shot performance and enable zero-shot inference. Extensive experiments show that our method outperforms the state-of-the-art method by 5.68% and 3.82% on the S3DIS and ScanNet benchmarks, respectively.

[175] Task-Aware 3D Affordance Segmentation via 2D Guidance and Geometric Refinement

Lian He, Meng Liu, Qilang Ye, Yu Zhou, Xiang Deng, Gangyi Ding

Main category: cs.CV

TL;DR: TASA is a geometry-optimized framework for 3D scene-level affordance segmentation that combines 2D semantic cues and 3D geometric reasoning in a coarse-to-fine manner to efficiently identify manipulable points from language instructions.

DetailsMotivation: Existing methods for 3D scene-level affordance understanding focus on object-level affordances or inefficiently lift 2D predictions to 3D, neglecting rich geometric structure information and incurring high computational costs.

Method: TASA features a task-aware 2D affordance detection module to identify manipulable points from language and visual inputs, guiding task-relevant view selection. It then uses a 3D affordance refinement module to integrate 2D semantic priors with local 3D geometry.

Result: Experiments on SceneFun3D demonstrate that TASA significantly outperforms baselines in both accuracy and efficiency for scene-level affordance segmentation.

Conclusion: TASA effectively addresses limitations of existing methods by jointly leveraging 2D semantic cues and 3D geometric reasoning, achieving superior performance in 3D scene-level affordance understanding from natural language instructions.

Abstract: Understanding 3D scene-level affordances from natural language instructions is essential for enabling embodied agents to interact meaningfully in complex environments. However, this task remains challenging due to the need for semantic reasoning and spatial grounding. Existing methods mainly focus on object-level affordances or merely lift 2D predictions to 3D, neglecting rich geometric structure information in point clouds and incurring high computational costs. To address these limitations, we introduce Task-Aware 3D Scene-level Affordance segmentation (TASA), a novel geometry-optimized framework that jointly leverages 2D semantic cues and 3D geometric reasoning in a coarse-to-fine manner. To improve the affordance detection efficiency, TASA features a task-aware 2D affordance detection module to identify manipulable points from language and visual inputs, guiding the selection of task-relevant views. To fully exploit 3D geometric information, a 3D affordance refinement module is proposed to integrate 2D semantic priors with local 3D geometry, resulting in accurate and spatially coherent 3D affordance masks. Experiments on SceneFun3D demonstrate that TASA significantly outperforms the baselines in both accuracy and efficiency in scene-level affordance segmentation.

[176] Scaling Spatial Intelligence with Multimodal Foundation Models

Zhongang Cai, Ruisi Wang, Chenyang Gu, Fanyi Pu, Junxiang Xu, Yubo Wang, Wanqi Yin, Zhitao Yang, Chen Wei, Qingping Sun, Tongxi Zhou, Jiaqi Li, Hui En Pang, Oscar Qian, Yukun Wei, Zhiqian Lin, Xuanke Shi, Kewang Deng, Xiaoyang Han, Zukai Chen, Xiangyu Fan, Hanming Deng, Lewei Lu, Liang Pan, Bo Li, Ziwei Liu, Quan Wang, Dahua Lin, Lei Yang

Main category: cs.CV

TL;DR: Scaling up multimodal foundation models to cultivate spatial intelligence through systematic data curation and training, achieving state-of-the-art performance across multiple spatial benchmarks while maintaining strong general multimodal understanding.

DetailsMotivation: Despite progress in multimodal foundation models, they still exhibit surprising deficiencies in spatial intelligence, highlighting the need for specialized approaches to develop robust spatial reasoning capabilities.

Method: Systematically curated SenseNova-SI-8M dataset with eight million diverse data samples under rigorous spatial capability taxonomy, built upon established multimodal foundations including Qwen3-VL, InternVL3, and Bagel models.

Result: Achieved unprecedented performance: 68.7% on VSI-Bench, 43.3% on MMSI, 85.6% on MindCube, 54.6% on ViewSpatial, and 50.1% on SITE, while maintaining 84.9% on MMBench-En for general multimodal understanding.

Conclusion: Demonstrated successful cultivation of spatial intelligence through systematic data scaling, showing emergent generalization capabilities and validating downstream application potential. The project is ongoing with publicly released models to facilitate further research.

Abstract: Despite remarkable progress, multimodal foundation models still exhibit surprising deficiencies in spatial intelligence. In this work, we explore scaling up multimodal foundation models to cultivate spatial intelligence within the SenseNova-SI family, built upon established multimodal foundations including visual understanding models (i.e., Qwen3-VL and InternVL3) and unified understanding and generation models (i.e., Bagel). We take a principled approach to constructing high-performing and robust spatial intelligence by systematically curating SenseNova-SI-8M: eight million diverse data samples under a rigorous taxonomy of spatial capabilities. SenseNova-SI demonstrates unprecedented performance across a broad range of spatial intelligence benchmarks: 68.7% on VSI-Bench, 43.3% on MMSI, 85.6% on MindCube, 54.6% on ViewSpatial, and 50.1% on SITE, while maintaining strong general multimodal understanding (e.g., 84.9% on MMBench-En). More importantly, we analyze the impact of data scaling, discuss early signs of emergent generalization capabilities enabled by diverse data training, analyze the risk of overfitting and language shortcuts, present a preliminary study on spatial chain-of-thought reasoning, and validate the potential downstream application. SenseNova-SI is an ongoing project, and this report will be updated continuously. All newly trained multimodal foundation models are publicly released to facilitate further research in this direction.

[177] LE-CapsNet: A Light and Enhanced Capsule Network

Pouya Shiri, Amirali Baniasadi

Main category: cs.CV

TL;DR: LE-CapsNet is a lightweight, enhanced variant of Capsule Network that achieves higher accuracy and 4x faster inference while being more robust to affine transformations.

DetailsMotivation: CapsNet has advantages like better detection of overlapping categories and transformed images, but suffers from slow speed, high resource consumption, and lower accuracy compared to CNNs.

Method: Proposed LE-CapsNet as a light, enhanced variant of CapsNet with optimized structure to reduce parameters and improve performance.

Result: LE-CapsNet achieves 76.73% accuracy on CIFAR-10 with only 3.8M weights and 4x faster inference than CapsNet. It also achieves 94.3% accuracy on AffNIST (vs CapsNet’s 90.52%), showing better robustness to affine transformations.

Conclusion: LE-CapsNet successfully addresses CapsNet’s limitations by providing a more efficient, accurate, and robust alternative that maintains the benefits of capsule networks while significantly improving performance.

Abstract: Capsule Network (CapsNet) classifier has several advantages over CNNs, including better detection of images containing overlapping categories and higher accuracy on transformed images. Despite the advantages, CapsNet is slow due to its different structure. In addition, CapsNet is resource-hungry, includes many parameters and lags in accuracy compared to CNNs. In this work, we propose LE-CapsNet as a light, enhanced and more accurate variant of CapsNet. Using 3.8M weights, LECapsNet obtains 76.73% accuracy on the CIFAR-10 dataset while performing inference 4x faster than CapsNet. In addition, our proposed network is more robust at detecting images with affine transformations compared to CapsNet. We achieve 94.3% accuracy on the AffNIST dataset (compared to CapsNet 90.52%).

[178] Target-Balanced Score Distillation

Zhou Xu, Qi Wang, Yuxiao Yang, Luyuan Zhang, Zhang Liang, Yang Li

Main category: cs.CV

TL;DR: TBSD resolves the trade-off in SDS variants between texture quality and shape distortion by formulating 3D generation as multi-objective optimization with adaptive negative prompt utilization.

DetailsMotivation: Vanilla SDS suffers from over-saturation and over-smoothing, while existing variants with negative prompts face a critical trade-off between limited texture optimization and significant texture gains with shape distortion.

Method: Target-Balanced Score Distillation (TBSD) formulates generation as a multi-objective optimization problem and introduces an adaptive strategy that effectively resolves the trade-off by properly utilizing Target Negative Prompts.

Result: Extensive experiments demonstrate that TBSD significantly outperforms existing state-of-the-art methods, yielding 3D assets with high-fidelity textures and geometrically accurate shape.

Conclusion: TBSD successfully resolves the fundamental trade-off in SDS variants by adaptively balancing texture realism and shape preservation through multi-objective optimization.

Abstract: Score Distillation Sampling (SDS) enables 3D asset generation by distilling priors from pretrained 2D text-to-image diffusion models, but vanilla SDS suffers from over-saturation and over-smoothing. To mitigate this issue, recent variants have incorporated negative prompts. However, these methods face a critical trade-off: limited texture optimization, or significant texture gains with shape distortion. In this work, we first conduct a systematic analysis and reveal that this trade-off is fundamentally governed by the utilization of the negative prompts, where Target Negative Prompts (TNP) that embed target information in the negative prompts dramatically enhancing texture realism and fidelity but inducing shape distortions. Informed by this key insight, we introduce the Target-Balanced Score Distillation (TBSD). It formulates generation as a multi-objective optimization problem and introduces an adaptive strategy that effectively resolves the aforementioned trade-off. Extensive experiments demonstrate that TBSD significantly outperforms existing state-of-the-art methods, yielding 3D assets with high-fidelity textures and geometrically accurate shape.

[179] CompressNAS : A Fast and Efficient Technique for Model Compression using Decomposition

Sudhakar Sah, Nikhil Chabbra, Matthieu Durnerin

Main category: cs.CV

TL;DR: CompressNAS is a neural architecture search framework that globally optimizes rank selection in tensor decomposition to compress CNNs for deployment on resource-constrained devices like MCUs and NPUs.

DetailsMotivation: Deep CNNs are becoming too large and computationally demanding for deployment on microcontrollers and lightweight NPUs, requiring effective compression methods that balance parameter reduction with minimal accuracy loss.

Method: Uses a MicroNAS-inspired framework that treats rank selection as a global search problem, employing fast accuracy estimators to evaluate candidate tensor decompositions (like Tucker factorization) under memory and accuracy constraints.

Result: Achieved 8x compression of ResNet-18 on ImageNet with <4% accuracy drop; 2x compression of YOLOv5s on COCO with no accuracy loss; 2x compression of YOLOv5n with 2.5% accuracy drop. Introduced STResNet family with competitive performance.

Conclusion: CompressNAS provides an effective framework for globally optimizing CNN compression through tensor decomposition, enabling significant model size reduction while maintaining competitive accuracy for deployment on resource-constrained devices.

Abstract: Deep Convolutional Neural Networks (CNNs) are increasingly difficult to deploy on microcontrollers (MCUs) and lightweight NPUs (Neural Processing Units) due to their growing size and compute demands. Low-rank tensor decomposition, such as Tucker factorization, is a promising way to reduce parameters and operations with reasonable accuracy loss. However, existing approaches select ranks locally and often ignore global trade-offs between compression and accuracy. We introduce CompressNAS, a MicroNAS-inspired framework that treats rank selection as a global search problem. CompressNAS employs a fast accuracy estimator to evaluate candidate decompositions, enabling efficient yet exhaustive rank exploration under memory and accuracy constraints. In ImageNet, CompressNAS compresses ResNet-18 by 8x with less than 4% accuracy drop; on COCO, we achieve 2x compression of YOLOv5s without any accuracy drop and 2x compression of YOLOv5n with a 2.5% drop. Finally, we present a new family of compressed models, STResNet, with competitive performance compared to other efficient models.

[180] MIRROR: Multi-Modal Pathological Self-Supervised Representation Learning via Modality Alignment and Retention

Tianyi Wang, Jianan Fan, Dingxin Zhang, Dongnan Liu, Yong Xia, Heng Huang, Weidong Cai

Main category: cs.CV

TL;DR: MIRROR is a multi-modal self-supervised learning method that integrates histopathology and transcriptomics data for oncology applications, focusing on both modality alignment and retention to handle their pronounced heterogeneity.

DetailsMotivation: Histopathology and transcriptomics provide orthogonal yet complementary insights into cancer, but their inherent heterogeneity makes conventional multi-modal integration methods insufficient as they primarily focus on alignment while neglecting modality-specific structure retention.

Method: MIRROR uses dedicated encoders for each modality, a modality alignment module for integration, a modality retention module to preserve unique attributes, and a style clustering module to reduce redundancy and enhance disease-relevant information through clustering space modeling.

Result: Extensive evaluations on TCGA cohorts for cancer subtyping and survival analysis demonstrate MIRROR’s superior performance in constructing comprehensive oncological feature representations.

Conclusion: MIRROR effectively integrates histopathology and transcriptomics data while maintaining modality-specific fidelity, benefiting cancer diagnosis through comprehensive feature representations.

Abstract: Histopathology and transcriptomics are fundamental modalities in oncology, encapsulating the morphological and molecular aspects of the disease. Multi-modal self-supervised learning has demonstrated remarkable potential in learning pathological representations by integrating diverse data sources. Conventional multi-modal integration methods primarily emphasize modality alignment, while paying insufficient attention to retaining the modality-specific structures. However, unlike conventional scenarios where multi-modal inputs share highly overlapping features, histopathology and transcriptomics exhibit pronounced heterogeneity, offering orthogonal yet complementary insights. Histopathology provides morphological and spatial context, elucidating tissue architecture and cellular topology, whereas transcriptomics delineates molecular signatures through gene expression patterns. This inherent disparity introduces a major challenge in aligning them while maintaining modality-specific fidelity. To address these challenges, we present MIRROR, a novel multi-modal representation learning method designed to foster both modality alignment and retention. MIRROR employs dedicated encoders to extract comprehensive features for each modality, which is further complemented by a modality alignment module to achieve seamless integration between phenotype patterns and molecular profiles. Furthermore, a modality retention module safeguards unique attributes from each modality, while a style clustering module mitigates redundancy and enhances disease-relevant information by modeling and aligning consistent pathological signatures within a clustering space. Extensive evaluations on TCGA cohorts for cancer subtyping and survival analysis highlight MIRROR’s superior performance, demonstrating its effectiveness in constructing comprehensive oncological feature representations and benefiting the cancer diagnosis.

[181] AdaptFly: Prompt-Guided Adaptation of Foundation Models for Low-Altitude UAV Networks

Jiao Chen, Haoyi Wang, Jianhua Tang, Junyi Wang

Main category: cs.CV

TL;DR: AdaptFly is a prompt-guided test-time adaptation framework for UAV semantic segmentation that enables lightweight adaptation without weight updates, supporting both resource-limited and resource-massive UAVs through shared knowledge collaboration.

DetailsMotivation: Semantic segmentation models for UAVs deteriorate under weather, lighting, and viewpoint changes, but resource constraints prevent gradient-based adaptation while independent adaptation wastes shared experience across UAV fleets.

Method: Uses two adaptation modes: lightweight token-prompt retrieval from shared memory for resource-limited UAVs, and gradient-free sparse visual prompt optimization via CMA-ES for resource-massive UAVs, with activation-statistic detection and cross-UAV knowledge sharing.

Result: Significantly improves segmentation accuracy and robustness over static models and state-of-the-art TTA baselines on UAVid and VDD benchmarks, with real-world validation under diverse weather conditions.

Conclusion: Provides a practical path to resilient, communication-efficient perception for low-altitude UAV networks through collaborative adaptation with minimal bandwidth overhead.

Abstract: Low-altitude Unmanned Aerial Vehicle (UAV) networks rely on robust semantic segmentation as a foundational enabler for distributed sensing-communication-control co-design across heterogeneous agents within the network. However, segmentation foundation models deteriorate quickly under weather, lighting, and viewpoint drift. Resource-limited UAVs cannot run gradient-based test-time adaptation, while resource-massive UAVs adapt independently, wasting shared experience. To address these challenges, we propose AdaptFly, a prompt-guided test-time adaptation framework that adjusts segmentation models without weight updates. AdaptFly features two complementary adaptation modes. For resource-limited UAVs, it employs lightweight token-prompt retrieval from a shared global memory. For resource-massive UAVs, it uses gradient-free sparse visual prompt optimization via Covariance Matrix Adaptation Evolution Strategy. An activation-statistic detector triggers adaptation, while cross-UAV knowledge pool consolidates prompt knowledge and enables fleet-wide collaboration with negligible bandwidth overhead. Extensive experiments on UAVid and VDD benchmarks, along with real-world UAV deployments under diverse weather conditions, demonstrate that AdaptFly significantly improves segmentation accuracy and robustness over static models and state-of-the-art TTA baselines. The results highlight a practical path to resilient, communication-efficient perception in the emerging low-altitude economy.

[182] Do Blind Spots Matter for Word-Referent Mapping? A Computational Study with Infant Egocentric Video

Zekai Shi, Zhixi Cai, Kalin Stefanov

Main category: cs.CV

TL;DR: A biologically plausible masking strategy for visual representation learning that mimics human visual processing, enabling effective word-referent mapping from child development data.

DetailsMotivation: Children learn words by linking spoken utterances to visual referents, but face the challenge of interpreting new words from countless possible meanings. The research aims to develop a biologically plausible learning strategy that mimics human visual processing.

Method: Uses masked autoencoder with novel masking strategy based on human eye blind spot knowledge, mimicking how the brain fills visual gaps. The pretrained encoder is then used in contrastive learning-based video-text model for word-referent mapping.

Result: The proposed biologically plausible masking strategy is at least as effective as standard random masking for learning word-referent mappings from cross-situational and temporally extended episodes.

Conclusion: Biologically inspired masking strategies can be equally effective as standard approaches while being more biologically justified, offering promising directions for developmental AI systems.

Abstract: Typically, children start to learn their first words between 6 and 9 months, linking spoken utterances to their visual referents. Without prior knowledge, a word encountered for the first time can be interpreted in countless ways; it might refer to any of the objects in the environment, their components, or attributes. Using longitudinal, egocentric, and ecologically valid data from the experience of one child, in this work, we propose a self-supervised and biologically plausible strategy to learn strong visual representations. Our masked autoencoder-based visual backbone incorporates knowledge about the blind spot in human eyes to define a novel masking strategy. This mask and reconstruct approach attempts to mimic the way the human brain fills the gaps in the eyes’ field of view. This represents a significant shift from standard random masking strategies, which are difficult to justify from a biological perspective. The pretrained encoder is utilized in a contrastive learning-based video-text model capable of acquiring word-referent mappings. Extensive evaluation suggests that the proposed biologically plausible masking strategy is at least as effective as random masking for learning word-referent mappings from cross-situational and temporally extended episodes.

[183] GROVER: Graph-guided Representation of Omics and Vision with Expert Regulation for Adaptive Spatial Multi-omics Fusion

Yongjun Xiao, Dian Meng, Xinlei Huang, Yanran Liu, Shiwei Ruan, Ziyue Qiao, Xubin Zheng

Main category: cs.CV

TL;DR: GROVER is a novel framework for adaptive integration of spatial multi-omics data that addresses challenges in multimodal fusion through graph-guided representation learning, contrastive alignment, and dynamic expert routing.

DetailsMotivation: Spatial omics data lacks pathological morphological context from histopathological images, and integrating these multimodal sources is essential for comprehensive tissue analysis but challenging due to heterogeneity, resolution mismatches, and biological perturbations.

Method: Uses Graph Convolutional Network encoder based on Kolmogorov-Arnold Networks to capture nonlinear dependencies, spot-feature-pair contrastive learning for cross-modal alignment, and dynamic expert routing mechanism to adaptively select informative modalities per spot.

Result: GROVER outperforms state-of-the-art baselines on real-world spatial omics datasets, demonstrating robust and reliable multimodal integration performance.

Conclusion: GROVER provides an effective solution for adaptive integration of spatial multi-omics data, successfully addressing key challenges in multimodal fusion through its novel architectural components.

Abstract: Effectively modeling multimodal spatial omics data is critical for understanding tissue complexity and underlying biological mechanisms. While spatial transcriptomics, proteomics, and epigenomics capture molecular features, they lack pathological morphological context. Integrating these omics with histopathological images is therefore essential for comprehensive disease tissue analysis. However, substantial heterogeneity across omics, imaging, and spatial modalities poses significant challenges. Naive fusion of semantically distinct sources often leads to ambiguous representations. Additionally, the resolution mismatch between high-resolution histology images and lower-resolution sequencing spots complicates spatial alignment. Biological perturbations during sample preparation further distort modality-specific signals, hindering accurate integration. To address these challenges, we propose Graph-guided Representation of Omics and Vision with Expert Regulation for Adaptive Spatial Multi-omics Fusion (GROVER), a novel framework for adaptive integration of spatial multi-omics data. GROVER leverages a Graph Convolutional Network encoder based on Kolmogorov-Arnold Networks to capture the nonlinear dependencies between each modality and its associated spatial structure, thereby producing expressive, modality-specific embeddings. To align these representations, we introduce a spot-feature-pair contrastive learning strategy that explicitly optimizes the correspondence across modalities at each spot. Furthermore, we design a dynamic expert routing mechanism that adaptively selects informative modalities for each spot while suppressing noisy or low-quality inputs. Experiments on real-world spatial omics datasets demonstrate that GROVER outperforms state-of-the-art baselines, providing a robust and reliable solution for multimodal integration.

[184] Exposing DeepFakes via Hyperspectral Domain Mapping

Aditya Mehta, Swarnim Chaudhary, Pratik Narang, Jagat Sesh Challa

Main category: cs.CV

TL;DR: HSI-Detect is a two-stage pipeline that converts RGB images to 31-channel hyperspectral images to enhance Deepfake detection by revealing manipulation artifacts in specific frequency bands.

DetailsMotivation: Modern generative models create highly realistic images that can fool both human perception and automated detection systems, which typically only analyze three RGB channels.

Method: Propose HSI-Detect, a two-stage pipeline that reconstructs 31-channel hyperspectral images from standard RGB input and performs detection in the hyperspectral domain.

Result: Evaluation on FaceForensics++ dataset shows consistent improvements over RGB-only baselines, with hyperspectral imaging amplifying manipulation artifacts that are weak or invisible in RGB.

Conclusion: Spectral-domain mapping shows promise for Deepfake detection by expanding input representation into denser spectral bands to reveal hidden manipulation artifacts.

Abstract: Modern generative and diffusion models produce highly realistic images that can mislead human perception and even sophisticated automated detection systems. Most detection methods operate in RGB space and thus analyze only three spectral channels. We propose HSI-Detect, a two-stage pipeline that reconstructs a 31-channel hyperspectral image from a standard RGB input and performs detection in the hyperspectral domain. Expanding the input representation into denser spectral bands amplifies manipulation artifacts that are often weak or invisible in the RGB domain, particularly in specific frequency bands. We evaluate HSI-Detect across FaceForensics++ dataset and show the consistent improvements over RGB-only baselines, illustrating the promise of spectral-domain mapping for Deepfake detection.

[185] Toward bilipshiz geometric models

Yonatan Sverdlov, Eitan Rosen, Nadav Dym

Main category: cs.CV

TL;DR: The paper examines whether invariant neural networks for point clouds preserve symmetry-aware distances through bi-Lipschitz equivalence, showing standard networks fail this property and proposing modifications to achieve it.

DetailsMotivation: Recent work in Equivariant learning highlights advantages of bi-Lipschitz models, motivating examination of whether invariant point cloud networks preserve natural symmetry-aware distances.

Method: Analyzed two symmetry-aware metrics (Procrustes Matching and Hard Gromov Wasserstein), showed they’re not bi-Lipschitz equivalent, modified standard invariant networks to achieve bi-Lipschitz guarantees.

Result: Standard invariant networks are not bi-Lipschitz with respect to PM metric, but modified versions can obtain bi-Lipschitz guarantees and show advantages in correspondence tasks.

Conclusion: Proposed bi-Lipschitz invariant models outperform standard invariant models for finding correspondences between 3D point clouds.

Abstract: Many neural networks for point clouds are, by design, invariant to the symmetries of this datatype: permutations and rigid motions. The purpose of this paper is to examine whether such networks preserve natural symmetry aware distances on the point cloud spaces, through the notion of bi-Lipschitz equivalence. This inquiry is motivated by recent work in the Equivariant learning literature which highlights the advantages of bi-Lipschitz models in other scenarios. We consider two symmetry aware metrics on point clouds: (a) The Procrustes Matching (PM) metric and (b) Hard Gromov Wasserstien distances. We show that these two distances themselves are not bi-Lipschitz equivalent, and as a corollary deduce that popular invariant networks for point clouds are not bi-Lipschitz with respect to the PM metric. We then show how these networks can be modified so that they do obtain bi-Lipschitz guarantees. Finally, we provide initial experiments showing the advantage of the proposed bi-Lipschitz model over standard invariant models, for the tasks of finding correspondences between 3D point clouds.

[186] Concept-RuleNet: Grounded Multi-Agent Neurosymbolic Reasoning in Vision Language Models

Sanchit Sinha, Guangzhi Xiong, Zhenghao He, Aidong Zhang

Main category: cs.CV

TL;DR: Concept-RuleNet is a neurosymbolic system that combines visual concept mining with symbolic reasoning to improve interpretability and reduce hallucinations in vision-language models.

DetailsMotivation: Current VLMs lack interpretability and hallucinate facts, especially on out-of-distribution data. Neurosymbolic methods exist but their symbols are weakly grounded in visual data, being extracted only from task labels.

Method: Multi-agent system with: 1) multimodal concept generator that mines visual concepts from training images, 2) symbol discovery conditioned on visual concepts, 3) LLM reasoner that composes symbols into first-order rules, 4) vision verifier that quantifies symbol presence during inference.

Result: Outperforms state-of-the-art neurosymbolic baselines by average 5% on five benchmarks (including medical imaging and underrepresented datasets), reduces hallucinated symbols in rules by up to 50%.

Conclusion: The system successfully reinstates visual grounding while maintaining transparent reasoning, providing explicit reasoning pathways and reducing hallucinations in vision-language tasks.

Abstract: Modern vision-language models (VLMs) deliver impressive predictive accuracy yet offer little insight into ‘why’ a decision is reached, frequently hallucinating facts, particularly when encountering out-of-distribution data. Neurosymbolic frameworks address this by pairing black-box perception with interpretable symbolic reasoning, but current methods extract their symbols solely from task labels, leaving them weakly grounded in the underlying visual data. In this paper, we introduce a multi-agent system - Concept-RuleNet that reinstates visual grounding while retaining transparent reasoning. Specifically, a multimodal concept generator first mines discriminative visual concepts directly from a representative subset of training images. Next, these visual concepts are utilized to condition symbol discovery, anchoring the generations in real image statistics and mitigating label bias. Subsequently, symbols are composed into executable first-order rules by a large language model reasoner agent - yielding interpretable neurosymbolic rules. Finally, during inference, a vision verifier agent quantifies the degree of presence of each symbol and triggers rule execution in tandem with outputs of black-box neural models, predictions with explicit reasoning pathways. Experiments on five benchmarks, including two challenging medical-imaging tasks and three underrepresented natural-image datasets, show that our system augments state-of-the-art neurosymbolic baselines by an average of 5% while also reducing the occurrence of hallucinated symbols in rules by up to 50%.

[187] Batch Transformer Architecture: Case of Synthetic Image Generation for Emotion Expression Facial Recognition

Stanislav Selitskiy

Main category: cs.CV

TL;DR: Proposes Batch Transformers with implicit sparse attention to important dimensions, reducing bottleneck size in encoder-decoder architectures for improved synthetic face image generation with makeup and occlusion.

DetailsMotivation: To reduce computational bottlenecks in Transformer architectures by focusing attention on important dimensions rather than entire sequences, enabling more efficient processing for tasks like face recognition with limited data.

Method: Uses implicit sparse attention mechanism that selects primary components/dimensions instead of attending to entire sequences. Applied to encoder-decoder ANN architectures for synthetic image generation, specifically tested on face recognition with makeup and occlusion datasets.

Result: Achieved significant reduction in bottleneck size while increasing variability of limited original datasets for face recognition tasks involving makeup and occlusion.

Conclusion: Batch Transformers with dimension-wise attention provide an efficient alternative to traditional Transformers, enabling better performance on data-limited tasks through improved feature selection and reduced computational requirements.

Abstract: A novel Transformer variation architecture is proposed in the implicit sparse style. Unlike “traditional” Transformers, instead of attention to sequential or batch entities in their entirety of whole dimensionality, in the proposed Batch Transformers, attention to the “important” dimensions (primary components) is implemented. In such a way, the “important” dimensions or feature selection allows for a significant reduction of the bottleneck size in the encoder-decoder ANN architectures. The proposed architecture is tested on the synthetic image generation for the face recognition task in the case of the makeup and occlusion data set, allowing for increased variability of the limited original data set.

[188] Image-POSER: Reflective RL for Multi-Expert Image Generation and Editing

Hossein Mohebbi, Mohammed Abdulrahman, Yanting Miao, Pascal Poupart, Suraj Kothawade

Main category: cs.CV

TL;DR: Image-POSER is a reinforcement learning framework that orchestrates multiple text-to-image and image-to-image models to handle long, compositional prompts through dynamic task decomposition and structured feedback.

DetailsMotivation: Current text-to-image models struggle with long, compositional prompts typical of creative workflows, requiring a system that can reliably execute complex multi-step image generation tasks.

Method: Uses reflective reinforcement learning to orchestrate a registry of pretrained experts, handles prompts through dynamic task decomposition, and supervises alignment via structured feedback from a vision-language model critic, casting image synthesis as a Markov Decision Process.

Result: Outperforms baselines including frontier models across industry-standard and custom benchmarks in alignment, fidelity, and aesthetics, and is consistently preferred in human evaluations.

Conclusion: Reinforcement learning can enable AI systems to autonomously decompose, reorder, and combine visual models, advancing towards general-purpose visual assistants.

Abstract: Recent advances in text-to-image generation have produced strong single-shot models, yet no individual system reliably executes the long, compositional prompts typical of creative workflows. We introduce Image-POSER, a reflective reinforcement learning framework that (i) orchestrates a diverse registry of pretrained text-to-image and image-to-image experts, (ii) handles long-form prompts end-to-end through dynamic task decomposition, and (iii) supervises alignment at each step via structured feedback from a vision-language model critic. By casting image synthesis and editing as a Markov Decision Process, we learn non-trivial expert pipelines that adaptively combine strengths across models. Experiments show that Image-POSER outperforms baselines, including frontier models, across industry-standard and custom benchmarks in alignment, fidelity, and aesthetics, and is consistently preferred in human evaluations. These results highlight that reinforcement learning can endow AI systems with the capacity to autonomously decompose, reorder, and combine visual models, moving towards general-purpose visual assistants.

[189] SOTFormer: A Minimal Transformer for Unified Object Tracking and Trajectory Prediction

Zhongping Dong, Pengyang Yu, Shuangjian Li, Liming Chen, Mohand Tahar Kechadi

Main category: cs.CV

TL;DR: SOTFormer is a constant-memory temporal transformer that unifies object detection, tracking, and short-term trajectory prediction in a single end-to-end framework, achieving real-time performance with stable identity propagation.

DetailsMotivation: Address challenges in single-object tracking and motion forecasting under occlusion, scale variation, and temporal drift that disrupt temporal coherence for real-time perception.

Method: Uses a ground-truth-primed memory and burn-in anchor loss for stable initialization, with a single lightweight temporal-attention layer that refines embeddings across frames for fixed GPU memory usage.

Result: Achieves 76.3 AUC and 53.7 FPS (AMP, 4.3 GB VRAM) on Mini-LaSOT benchmark, outperforming transformer baselines like TrackFormer and MOTRv2 under fast motion, scale change, and occlusion.

Conclusion: SOTFormer demonstrates effective unification of detection, tracking, and prediction with constant memory requirements and real-time performance, providing robust handling of challenging scenarios.

Abstract: Accurate single-object tracking and short-term motion forecasting remain challenging under occlusion, scale variation, and temporal drift, which disrupt the temporal coherence required for real-time perception. We introduce \textbf{SOTFormer}, a minimal constant-memory temporal transformer that unifies object detection, tracking, and short-horizon trajectory prediction within a single end-to-end framework. Unlike prior models with recurrent or stacked temporal encoders, SOTFormer achieves stable identity propagation through a ground-truth-primed memory and a burn-in anchor loss that explicitly stabilizes initialization. A single lightweight temporal-attention layer refines embeddings across frames, enabling real-time inference with fixed GPU memory. On the Mini-LaSOT (20%) benchmark, SOTFormer attains 76.3 AUC and 53.7 FPS (AMP, 4.3 GB VRAM), outperforming transformer baselines such as TrackFormer and MOTRv2 under fast motion, scale change, and occlusion.

[190] MP-GFormer: A 3D-Geometry-Aware Dynamic Graph Transformer Approach for Machining Process Planning

Fatemeh Elhambakhsh, Gaurav Ameta, Aditi Roy, Hyunwoong Ko

Main category: cs.CV

TL;DR: MP-GFormer is a 3D-geometry-aware dynamic graph transformer that integrates evolving 3D geometric representations to predict machining operation sequences, achieving significant improvements over state-of-the-art methods.

DetailsMotivation: Existing dynamic graph learning approaches in machining process planning fail to incorporate 3D geometric information of parts, lacking domain awareness in predicting machining operation sequences.

Method: Proposes MP-GFormer, a dynamic graph transformer that integrates evolving 3D geometric representations through attention mechanism using StereoLithography surface meshes and boundary representation for initial designs.

Result: Achieves 24% improvement in accuracy for main operation predictions and 36% improvement for sub-operation predictions compared to state-of-the-art approaches on synthesized dataset.

Conclusion: Integrating 3D geometric information into dynamic graph learning significantly enhances machining operation sequence prediction accuracy, demonstrating the importance of domain-aware geometric representations.

Abstract: Machining process planning (MP) is inherently complex due to structural and geometrical dependencies among part features and machining operations. A key challenge lies in capturing dynamic interdependencies that evolve with distinct part geometries as operations are performed. Machine learning has been applied to address challenges in MP, such as operation selection and machining sequence prediction. Dynamic graph learning (DGL) has been widely used to model dynamic systems, thanks to its ability to integrate spatio-temporal relationships. However, in MP, while existing DGL approaches can capture these dependencies, they fail to incorporate three-dimensional (3D) geometric information of parts and thus lack domain awareness in predicting machining operation sequences. To address this limitation, we propose MP-GFormer, a 3D-geometry-aware dynamic graph transformer that integrates evolving 3D geometric representations into DGL through an attention mechanism to predict machining operation sequences. Our approach leverages StereoLithography surface meshes representing the 3D geometry of a part after each machining operation, with the boundary representation method used for the initial 3D designs. We evaluate MP-GFormer on a synthesized dataset and demonstrate that the method achieves improvements of 24% and 36% in accuracy for main and sub-operation predictions, respectively, compared to state-of-the-art approaches.

[191] DCA-LUT: Deep Chromatic Alignment with 5D LUT for Purple Fringing Removal

Jialang Lu, Shuning Sun, Pu Wang, Chen Wu, Feng Gao, Lina Gong, Dianjie Lu, Guijuan Zhang, Zhuoran Zheng

Main category: cs.CV

TL;DR: DCA-LUT is a deep learning framework for purple fringing removal using chromatic-aware coordinate transformation and 5D LUT for color correction, achieving state-of-the-art results.

DetailsMotivation: Purple fringing caused by Longitudinal Chromatic Aberration degrades image quality, and traditional solutions rely on expensive hardware or handcrafted features, ignoring data-driven approaches.

Method: Introduces Chromatic-Aware Coordinate Transformation module to learn image-adaptive color space, isolating fringing into a dedicated dimension, then uses 5D Look-Up Table for color correction. Created PF-Synth dataset for training.

Result: Extensive experiments on synthetic and real-world datasets demonstrate state-of-the-art performance in purple fringing removal.

Conclusion: DCA-LUT provides an effective deep learning solution for purple fringing removal that outperforms traditional methods.

Abstract: Purple fringing, a persistent artifact caused by Longitudinal Chromatic Aberration (LCA) in camera lenses, has long degraded the clarity and realism of digital imaging. Traditional solutions rely on complex and expensive apochromatic (APO) lens hardware and the extraction of handcrafted features, ignoring the data-driven approach. To fill this gap, we introduce DCA-LUT, the first deep learning framework for purple fringing removal. Inspired by the physical root of the problem, the spatial misalignment of RGB color channels due to lens dispersion, we introduce a novel Chromatic-Aware Coordinate Transformation (CA-CT) module, learning an image-adaptive color space to decouple and isolate fringing into a dedicated dimension. This targeted separation allows the network to learn a precise ``purple fringe channel", which then guides the accurate restoration of the luminance channel. The final color correction is performed by a learned 5D Look-Up Table (5D LUT), enabling efficient and powerful% non-linear color mapping. To enable robust training and fair evaluation, we constructed a large-scale synthetic purple fringing dataset (PF-Synth). Extensive experiments in synthetic and real-world datasets demonstrate that our method achieves state-of-the-art performance in purple fringing removal.

[192] Defending Unauthorized Model Merging via Dual-Stage Weight Protection

Wei-Jia Chen, Min-Yen Tsai, Cheng-Yi Lee, Chia-Mu Yu

Main category: cs.CV

TL;DR: MergeGuard is a dual-stage weight protection framework that prevents unauthorized model merging by disrupting merging compatibility while maintaining task performance.

DetailsMotivation: The proliferation of pretrained models and open repositories enables unauthorized model merging, which violates intellectual property rights and undermines model ownership and accountability.

Method: Two-stage approach: (1) redistribute task-relevant information across layers via L2-regularized optimization, (2) inject structured perturbations to misalign task subspaces and break curvature compatibility in the loss landscape.

Result: Extensive experiments on vision (ViT-L-14) and language (Llama2, Gemma2, Mistral) models show MergeGuard reduces merged model accuracy by up to 90% with less than 1.5% performance loss on the protected model.

Conclusion: MergeGuard effectively protects models from unauthorized merging by reshaping parameter geometry to cause destructive interference in merged models while maintaining original model functionality.

Abstract: The rapid proliferation of pretrained models and open repositories has made model merging a convenient yet risky practice, allowing free-riders to combine fine-tuned models into a new multi-capability model without authorization. Such unauthorized model merging not only violates intellectual property rights but also undermines model ownership and accountability. To address this issue, we present MergeGuard, a proactive dual-stage weight protection framework that disrupts merging compatibility while maintaining task fidelity. In the first stage, we redistribute task-relevant information across layers via L2-regularized optimization, ensuring that important gradients are evenly dispersed. In the second stage, we inject structured perturbations to misalign task subspaces, breaking curvature compatibility in the loss landscape. Together, these stages reshape the model’s parameter geometry such that merged models collapse into destructive interference while the protected model remains fully functional. Extensive experiments on both vision (ViT-L-14) and language (Llama2, Gemma2, Mistral) models demonstrate that MergeGuard reduces merged model accuracy by up to 90% with less than 1.5% performance loss on the protected model.

[193] Prompt-Conditioned FiLM and Multi-Scale Fusion on MedSigLIP for Low-Dose CT Quality Assessment

Tolga Demiroglu, Mehmet Ozan Unal, Metin Ertas, Isa Yildirim

Main category: cs.CV

TL;DR: A prompt-conditioned framework using MedSigLIP with FiLM and multi-scale pooling for LDCT quality assessment, achieving state-of-the-art results on LDCTIQA2023 challenge.

DetailsMotivation: To enable data-efficient learning and rapid adaptation in medical image quality assessment by conditioning patch-token features on clinical intent through text prompts.

Method: Built on MedSigLIP with Feature-wise Linear Modulation (FiLM) for injecting textual priors, combined with multi-scale pooling (global, local, texture-aware) through separate regression heads fused by lightweight MLP, trained with pairwise ranking loss.

Result: Achieved PLCC = 0.9575, SROCC = 0.9561, and KROCC = 0.8301 on LDCTIQA2023 with only 1,000 training images, surpassing top-ranked published challenge submissions.

Conclusion: The prompt-guided approach demonstrates effectiveness in medical image quality assessment, enabling superior performance with limited training data through clinical intent conditioning.

Abstract: We propose a prompt-conditioned framework built on MedSigLIP that injects textual priors via Feature-wise Linear Modulation (FiLM) and multi-scale pooling. Text prompts condition patch-token features on clinical intent, enabling data-efficient learning and rapid adaptation. The architecture combines global, local, and texture-aware pooling through separate regression heads fused by a lightweight MLP, trained with pairwise ranking loss. Evaluated on the LDCTIQA2023 (a public LDCT quality assessment challenge) with 1,000 training images, we achieve PLCC = 0.9575, SROCC = 0.9561, and KROCC = 0.8301, surpassing the top-ranked published challenge submissions and demonstrating the effectiveness of our prompt-guided approach.

[194] FocusSDF: Boundary-Aware Learning for Medical Image Segmentation via Signed Distance Supervision

Muzammal Shafique, Nasir Rahim, Jamil Ahmad, Mohammad Siadat, Khalid Malik, Ghaus Malik

Main category: cs.CV

TL;DR: FocusSDF is a novel boundary-aware loss function for medical image segmentation that uses signed distance functions to prioritize boundary regions, outperforming existing distance-based loss functions across multiple medical imaging tasks.

DetailsMotivation: Most medical image segmentation models lack explicit boundary encoding, leading to persistent challenges in boundary preservation despite overall segmentation progress.

Method: Proposed FocusSDF loss function based on signed distance functions that adaptively assigns higher weights to pixels closer to lesion/organ boundaries, making the network boundary-aware.

Result: Extensive evaluations across 5 state-of-the-art models including MedSAM, using 4 distance-based loss functions on diverse medical datasets (cerebral aneurysm, stroke, liver, breast tumor) consistently showed FocusSDF’s superior performance.

Conclusion: FocusSDF effectively addresses boundary preservation challenges in medical image segmentation and demonstrates consistent superiority over existing distance transform based loss functions.

Abstract: Segmentation of medical images constitutes an essential component of medical image analysis, providing the foundation for precise diagnosis and efficient therapeutic interventions in clinical practices. Despite substantial progress, most segmentation models do not explicitly encode boundary information; as a result, making boundary preservation a persistent challenge in medical image segmentation. To address this challenge, we introduce FocusSDF, a novel loss function based on the signed distance functions (SDFs), which redirects the network to concentrate on boundary regions by adaptively assigning higher weights to pixels closer to the lesion or organ boundary, effectively making it boundary aware. To rigorously validate FocusSDF, we perform extensive evaluations against five state-of-the-art medical image segmentation models, including the foundation model MedSAM, using four distance-based loss functions across diverse datasets covering cerebral aneurysm, stroke, liver, and breast tumor segmentation tasks spanning multiple imaging modalities. The experimental results consistently demonstrate the superior performance of FocusSDF over existing distance transform based loss functions.

[195] Lacking Data? No worries! How synthetic images can alleviate image scarcity in wildlife surveys: a case study with muskox (Ovibos moschatus)

Simon Durand, Samuel Foucher, Alexandre Delplanque, Joëlle Taillon, Jérôme Théau

Main category: cs.CV

TL;DR: Study shows synthetic imagery improves muskox detection in zero-shot and few-shot settings when training data is limited, with diminishing returns beyond 100% synthetic data supplementation.

DetailsMotivation: Traditional wildlife monitoring methods are resource-intensive and limited by logistical challenges, especially for sparsely distributed species like muskoxen. Deep learning models face limitations due to small datasets.

Method: Compared baseline model trained on real imagery with 5 zero-shot and 5 few-shot models incorporating progressively more synthetic imagery. Zero-shot used only synthetic data, while few-shot combined real and synthetic images.

Result: Zero-shot models showed improved detection performance with added synthetic imagery, plateauing beyond 100% supplementation. Few-shot models achieved better recall and slightly higher accuracy with synthetic-real combinations, though not statistically significant.

Conclusion: Synthetic imagery enables accurate object detection models for wildlife monitoring when real data is scarce, allowing monitoring of rare/inaccessible species and increased monitoring frequency without requiring initial real data.

Abstract: Accurate population estimates are essential for wildlife management, providing critical insights into species abundance and distribution. Traditional survey methods, including visual aerial counts and GNSS telemetry tracking, are widely used to monitor muskox populations in Arctic regions. These approaches are resource intensive and constrained by logistical challenges. Advances in remote sensing, artificial intelligence, and high resolution aerial imagery offer promising alternatives for wildlife detection. Yet, the effectiveness of deep learning object detection models (ODMs) is often limited by small datasets, making it challenging to train robust ODMs for sparsely distributed species like muskoxen. This study investigates the integration of synthetic imagery (SI) to supplement limited training data and improve muskox detection in zero shot (ZS) and few-shot (FS) settings. We compared a baseline model trained on real imagery with 5 ZS and 5 FS models that incorporated progressively more SI in the training set. For the ZS models, where no real images were included in the training set, adding SI improved detection performance. As more SI were added, performance in precision, recall and F1 score increased, but eventually plateaued, suggesting diminishing returns when SI exceeded 100% of the baseline model training dataset. For FS models, combining real and SI led to better recall and slightly higher overall accuracy compared to using real images alone, though these improvements were not statistically significant. Our findings demonstrate the potential of SI to train accurate ODMs when data is scarce, offering important perspectives for wildlife monitoring by enabling rare or inaccessible species to be monitored and to increase monitoring frequency. This approach could be used to initiate ODMs without real data and refine it as real images are acquired over time.

[196] Advancing Annotat3D with Harpia: A CUDA-Accelerated Library For Large-Scale Volumetric Data Segmentation

Camila Machado de Araujo, Egon P. B. S. Borges, Ricardo Marcelo Canteiro Grangeiro, Allan Pinto

Main category: cs.CV

TL;DR: Introduces Harpia, a CUDA-based library for Annotat3D that enables scalable, interactive segmentation of large 3D datasets in HPC environments with strict memory control and GPU acceleration.

DetailsMotivation: High-resolution volumetric imaging generates massive datasets that challenge existing tools for efficient processing, segmentation, and interactive exploration in scientific imaging workflows.

Method: Developed Harpia - a CUDA-based processing library with strict memory control, native chunked execution, GPU-accelerated filtering, annotation, and quantification tools for handling datasets exceeding single-GPU memory capacity.

Result: Significant improvements in processing speed, memory efficiency, and scalability compared to NVIDIA cuCIM and scikit-image frameworks. Enables reliable operation on large datasets with interactive human-in-the-loop interface.

Conclusion: Harpia’s efficient GPU resource management and interactive interface make it particularly suitable for collaborative scientific imaging workflows in shared HPC infrastructures.

Abstract: High-resolution volumetric imaging techniques, such as X-ray tomography and advanced microscopy, generate increasingly large datasets that challenge existing tools for efficient processing, segmentation, and interactive exploration. This work introduces new capabilities to Annotat3D through Harpia, a new CUDA-based processing library designed to support scalable, interactive segmentation workflows for large 3D datasets in high-performance computing (HPC) and remote-access environments. Harpia features strict memory control, native chunked execution, and a suite of GPU-accelerated filtering, annotation, and quantification tools, enabling reliable operation on datasets exceeding single-GPU memory capacity. Experimental results demonstrate significant improvements in processing speed, memory efficiency, and scalability compared to widely used frameworks such as NVIDIA cuCIM and scikit-image. The system’s interactive, human-in-the-loop interface, combined with efficient GPU resource management, makes it particularly suitable for collaborative scientific imaging workflows in shared HPC infrastructures.

[197] C3Net: Context-Contrast Network for Camouflaged Object Detection

Baber Jan, Aiman H. El-Maleh, Abdul Jabbar Siddiqui, Abdul Bais, Saeed Anwar

Main category: cs.CV

TL;DR: C3Net is a novel dual-pathway decoder architecture for camouflaged object detection that addresses six fundamental challenges through specialized components including Edge Refinement Pathway, Contextual Localization Pathway, and Attentive Fusion Module.

DetailsMotivation: Camouflaged object detection is challenging because objects blend seamlessly with surroundings through similar colors, textures, and patterns, causing failures in both traditional segmentation methods and modern foundation models.

Method: Proposes C3Net with dual-pathway decoder: Edge Refinement Pathway uses gradient-initialized Edge Enhancement Modules for boundary recovery, and Contextual Localization Pathway employs Image-based Context Guidance for intrinsic saliency suppression. An Attentive Fusion Module combines both pathways via spatial gating.

Result: Achieves state-of-the-art performance with S-measures of 0.898 on COD10K, 0.904 on CAMO, and 0.913 on NC4K datasets while maintaining efficient processing.

Conclusion: Complex, multifaceted detection challenges require architectural innovation with specialized components working synergistically for comprehensive coverage beyond isolated improvements.

Abstract: Camouflaged object detection identifies objects that blend seamlessly with their surroundings through similar colors, textures, and patterns. This task challenges both traditional segmentation methods and modern foundation models, which fail dramatically on camouflaged objects. We identify six fundamental challenges in COD: Intrinsic Similarity, Edge Disruption, Extreme Scale Variation, Environmental Complexities, Contextual Dependencies, and Salient-Camouflaged Object Disambiguation. These challenges frequently co-occur and compound the difficulty of detection, requiring comprehensive architectural solutions. We propose C3Net, which addresses all challenges through a specialized dual-pathway decoder architecture. The Edge Refinement Pathway employs gradient-initialized Edge Enhancement Modules to recover precise boundaries from early features. The Contextual Localization Pathway utilizes our novel Image-based Context Guidance mechanism to achieve intrinsic saliency suppression without external models. An Attentive Fusion Module synergistically combines the two pathways via spatial gating. C3Net achieves state-of-the-art performance with S-measures of 0.898 on COD10K, 0.904 on CAMO, and 0.913 on NC4K, while maintaining efficient processing. C3Net demonstrates that complex, multifaceted detection challenges require architectural innovation, with specialized components working synergistically to achieve comprehensive coverage beyond isolated improvements. Code, model weights, and results are available at https://github.com/Baber-Jan/C3Net.

[198] Prompt Triage: Structured Optimization Enhances Vision-Language Model Performance on Medical Imaging Benchmarks

Arnav Singhvi, Vasiliki Bikia, Asad Aali, Akshay Chaudhari, Roxana Daneshjou

Main category: cs.CV

TL;DR: Automated prompt optimization using DSPy framework significantly improves medical vision-language model performance, achieving 53% median relative improvement over zero-shot baselines without requiring large datasets or manual prompt engineering.

DetailsMotivation: Vision-language models underperform on medical tasks, and current solutions like finetuning require large datasets/compute while manual prompt engineering is hard to generalize and inaccessible to medical institutions.

Method: Adapted DSPy framework for automated prompt optimization, implementing pipelines for 5 medical imaging tasks across radiology, gastroenterology, and dermatology, evaluating 10 open-source VLMs with 4 prompt optimization techniques.

Result: Optimized pipelines achieved 53% median relative improvement over zero-shot baselines, with largest gains from 300% to 3,400% on tasks with low zero-shot performance. Approach is scalable, preserves data privacy, and works with open-source models.

Conclusion: Automated prompt optimization has substantial potential for medical AI systems, enabling significant performance gains while reducing dependence on manual prompt design, allowing clinicians to focus on patient care.

Abstract: Vision-language foundation models (VLMs) show promise for diverse imaging tasks but often underperform on medical benchmarks. Prior efforts to improve performance include model finetuning, which requires large domain-specific datasets and significant compute, or manual prompt engineering, which is hard to generalize and often inaccessible to medical institutions seeking to deploy these tools. These challenges motivate interest in approaches that draw on a model’s embedded knowledge while abstracting away dependence on human-designed prompts to enable scalable, weight-agnostic performance improvements. To explore this, we adapt the Declarative Self-improving Python (DSPy) framework for structured automated prompt optimization in medical vision-language systems through a comprehensive, formal evaluation. We implement prompting pipelines for five medical imaging tasks across radiology, gastroenterology, and dermatology, evaluating 10 open-source VLMs with four prompt optimization techniques. Optimized pipelines achieved a median relative improvement of 53% over zero-shot prompting baselines, with the largest gains ranging from 300% to 3,400% on tasks where zero-shot performance is low. These results highlight the substantial potential of applying automated prompt optimization to medical AI systems, demonstrating significant gains for vision-based applications requiring accurate clinical image interpretation. By reducing dependence on prompt design to elicit intended outputs, these techniques allow clinicians to focus on patient care and clinical decision-making. Furthermore, our experiments offer scalability and preserve data privacy, demonstrating performance improvement on open-source VLMs. We publicly release our evaluation pipelines to support reproducible research on specialized medical tasks, available at https://github.com/DaneshjouLab/prompt-triage-lab.

[199] MSRNet: A Multi-Scale Recursive Network for Camouflaged Object Detection

Leena Alghamdi, Muhammad Usman, Hafeez Anwar, Abdul Bais, Saeed Anwar

Main category: cs.CV

TL;DR: Proposes a Multi-Scale Recursive Network (MSRNet) for camouflaged object detection using Pyramid Vision Transformer backbone with attention-based scale integration and recursive-feedback decoding to handle small and multiple objects in complex scenarios.

DetailsMotivation: Current camouflaged object detection methods struggle with precise detection in complex scenarios, especially with small and multiple objects, indicating need for improved methods that can handle objects blending seamlessly into environments.

Method: Uses Pyramid Vision Transformer backbone for multi-scale feature extraction, Attention-Based Scale Integration Units for selective feature merging, and recursive-feedback decoding with Multi-Granularity Fusion Units for precise object detection.

Result: Achieves state-of-the-art results on two benchmark datasets and ranks second on two other datasets, successfully detecting small and multiple camouflaged objects.

Conclusion: The proposed MSRNet effectively addresses challenges in camouflaged object detection through multi-scale learning and recursive feature optimization, demonstrating improved performance in complex scenarios.

Abstract: Camouflaged object detection is an emerging and challenging computer vision task that requires identifying and segmenting objects that blend seamlessly into their environments due to high similarity in color, texture, and size. This task is further complicated by low-light conditions, partial occlusion, small object size, intricate background patterns, and multiple objects. While many sophisticated methods have been proposed for this task, current methods still struggle to precisely detect camouflaged objects in complex scenarios, especially with small and multiple objects, indicating room for improvement. We propose a Multi-Scale Recursive Network that extracts multi-scale features via a Pyramid Vision Transformer backbone and combines them via specialized Attention-Based Scale Integration Units, enabling selective feature merging. For more precise object detection, our decoder recursively refines features by incorporating Multi-Granularity Fusion Units. A novel recursive-feedback decoding strategy is developed to enhance global context understanding, helping the model overcome the challenges in this task. By jointly leveraging multi-scale learning and recursive feature optimization, our proposed method achieves performance gains, successfully detecting small and multiple camouflaged objects. Our model achieves state-of-the-art results on two benchmark datasets for camouflaged object detection and ranks second on the remaining two. Our codes, model weights, and results are available at \href{https://github.com/linaagh98/MSRNet}{https://github.com/linaagh98/MSRNet}.

[200] PI-NAIM: Path-Integrated Neural Adaptive Imputation Model

Afifa Khaled, Ebrahim Hamid Sumiea

Main category: cs.CV

TL;DR: PI-NAIM is a dual-path architecture that dynamically routes samples to optimal imputation methods based on missingness complexity, combining statistical and neural approaches for medical data with missing modalities.

DetailsMotivation: Address the challenge of missing modalities in medical imaging and multi-modal clinical settings, where existing imputation methods lack representational capacity or are computationally expensive.

Method: Dual-path architecture with intelligent routing (low missingness to MICE, complex patterns to GAIN with temporal analysis), cross-path attention fusion using missingness-aware embeddings, and end-to-end joint optimization.

Result: State-of-the-art performance with RMSE of 0.108 (vs. baselines’ 0.119-0.152) and AUROC of 0.812 for mortality prediction on MIMIC-III and multimodal benchmarks.

Conclusion: PI-NAIM provides a unified solution for real-world scenarios with incomplete data, enabling seamless integration into vision pipelines handling missing modalities or corrupted inputs.

Abstract: Medical imaging and multi-modal clinical settings often face the challange of missing modality in their diagnostic pipelines. Existing imputation methods either lack representational capacity or are computationally expensive. We propose PI-NAIM, a novel dual-path architecture that dynamically routes samples to optimized imputation approaches based on missingness complexity. Our framework integrates: (1) intelligent path routing that directs low missingness samples to efficient statistical imputation (MICE) and complex patterns to powerful neural networks (GAIN with temporal analysis); (2) cross-path attention fusion that leverages missingness-aware embeddings to intelligently combine both branches; and (3) end-to-end joint optimization of imputation accuracy and downstream task performance. Extensive experiments on MIMIC-III and multimodal benchmarks demonstrate state-of-the-art performance, achieving RMSE of 0.108 (vs. baselines’ 0.119-0.152) and substantial gains in downstream tasks with an AUROC of 0.812 for mortality prediction. PI-NAIM’s modular design enables seamless integration into vision pipelines handling incomplete sensor measurements, missing modalities, or corrupted inputs, providing a unified solution for real-world scenario. The code is publicly available at https://github.com/AfifaKhaled/PI-NAIM-Path-Integrated-Neural-Adaptive-Imputation-Model

[201] Seeing the Forest and the Trees: Query-Aware Tokenizer for Long-Video Multimodal Language Models

Siyou Li, Huanan Wu, Juexi Shao, Yinghao Ma, Yujian Gan, Yihao Luo, Yuwei Wang, Dong Nie, Lu Wang, Wengqing Wu, Le Zhang, Massimo Poesio, Juntao Yu

Main category: cs.CV

TL;DR: QTSplus is a lightweight visual token selection module that dynamically selects important video tokens based on text queries, reducing vision tokens by 89% and latency by 28% while maintaining near-parity accuracy on long video understanding tasks.

DetailsMotivation: Long video understanding faces challenges due to linearly growing vision tokens causing attention cost explosion, memory issues, and latency problems in multimodal large language models.

Method: QTSplus uses cross-attention scoring, predicts instance-specific retention budgets based on query complexity, and selects top-n tokens with differentiable straight-through estimator during training and hard gate at inference, plus a small re-encoder for temporal order preservation.

Result: Achieves 89% vision stream compression and 28% latency reduction on long videos, with near-parity accuracy overall and significant improvements (+20.5 and +5.6 points) on TempCompass direction and order accuracies compared to original Qwen models.

Conclusion: QTSplus is an effective, general mechanism for scaling MLLMs to real-world long-video scenarios while preserving task-relevant evidence, demonstrating practical efficiency gains without sacrificing performance.

Abstract: Despite the recent advances in the video understanding ability of multimodal large language models (MLLMs), long video understanding remains a challenge. One of the main issues is that the number of vision tokens grows linearly with video length, which causes an explosion in attention cost, memory, and latency. To solve this challenge, we present Query-aware Token Selector (\textbf{QTSplus}), a lightweight yet powerful visual token selection module that serves as an information gate between the vision encoder and LLMs. Given a text query and video tokens, QTSplus dynamically selects the most important visual evidence for the input text query by (i) scoring visual tokens via cross-attention, (ii) \emph{predicting} an instance-specific retention budget based on the complexity of the query, and (iii) \emph{selecting} Top-$n$ tokens with a differentiable straight-through estimator during training and a hard gate at inference. Furthermore, a small re-encoder preserves temporal order using absolute time information, enabling second-level localization while maintaining global coverage. Integrated into Qwen2.5-VL, QTSplus compresses the vision stream by up to \textbf{89%} and reduces end-to-end latency by \textbf{28%} on long videos. The evaluation on eight long video understanding benchmarks shows near-parity accuracy overall when compared with the original Qwen models and outperforms the original model by \textbf{+20.5} and \textbf{+5.6} points respectively on TempCompass direction and order accuracies. These results show that QTSplus is an effective, general mechanism for scaling MLLMs to real-world long-video scenarios while preserving task-relevant evidence. We will make all code, data, and trained models’ weights publicly available.

[202] From Events to Clarity: The Event-Guided Diffusion Framework for Dehazing

Ling Wang, Yunfan Lu, Wenzong Ma, Huizai Yao, Pengteng Li, Hui Xiong

Main category: cs.CV

TL;DR: First use of event cameras for image dehazing, leveraging their high dynamic range (HDR) capabilities through a diffusion model that transfers HDR structural cues from events to generate clear images from hazy inputs.

DetailsMotivation: Traditional RGB-based dehazing methods suffer from limited dynamic range, making dehazing ill-posed and prone to erasing structure and illumination details. Event cameras offer superior HDR (120 dB vs 60 dB) and microsecond latency, making them ideal for hazy scenes.

Method: Proposed an event-guided diffusion model that maps sparse HDR event features (edges, corners) into diffusion latent space. This provides precise structural guidance during generation to transfer HDR information from events to frames.

Result: Achieved state-of-the-art results on two benchmarks and a newly collected drone dataset in heavy haze (AQI = 341) with synchronized RGB and event sensors.

Conclusion: Event cameras combined with diffusion models effectively address dehazing challenges by leveraging HDR structural cues, improving visual realism and reducing semantic drift in hazy conditions.

Abstract: Clear imaging under hazy conditions is a critical task. Prior-based and neural methods have improved results. However, they operate on RGB frames, which suffer from limited dynamic range. Therefore, dehazing remains ill-posed and can erase structure and illumination details. To address this, we use event cameras for dehazing for the \textbf{first time}. Event cameras offer much higher HDR ($120 dBvs.60 dB$) and microsecond latency, therefore they suit hazy scenes. In practice, transferring HDR cues from events to frames is hard because real paired data are scarce. To tackle this, we propose an event-guided diffusion model that utilizes the strong generative priors of diffusion models to reconstruct clear images from hazy inputs by effectively transferring HDR information from events. Specifically, we design an event-guided module that maps sparse HDR event features, \textit{e.g.,} edges, corners, into the diffusion latent space. This clear conditioning provides precise structural guidance during generation, improves visual realism, and reduces semantic drift. For real-world evaluation, we collect a drone dataset in heavy haze (AQI = 341) with synchronized RGB and event sensors. Experiments on two benchmarks and our dataset achieve state-of-the-art results.

[203] Evaluation of Attention Mechanisms in U-Net Architectures for Semantic Segmentation of Brazilian Rock Art Petroglyphs

Leonardi Melo, Luís Gustavo, Dimmy Magalhães, Lucciani Vieira, Mauro Araújo

Main category: cs.CV

TL;DR: Comparative analysis of three U-Net variants for rock art petroglyph segmentation, showing attention-enhanced architectures outperform baseline with 2.5-2.9% Dice Score improvement.

DetailsMotivation: To develop effective semantic segmentation methods for digital preservation of Brazilian archaeological rock art petroglyphs using deep learning approaches.

Method: Evaluated three U-Net architectures: baseline BEGL-UNet, Attention-Residual BEGL-UNet with residual blocks and gated attention, and Spatial Channel Attention BEGL-UNet using CBAM modules. All used BEGL loss combining binary cross-entropy with Gaussian edge enhancement. Tested on Poço da Bebidinha Archaeological Complex images with 5-fold cross-validation.

Result: Attention-Residual BEGL-UNet achieved best performance: Dice Score 0.710, validation loss 0.067, recall 0.854. Spatial Channel Attention BEGL-UNet: DSC 0.707, recall 0.857. Baseline BEGL-UNet: DSC 0.690. Attention mechanisms improved DSC by 2.5-2.9% over baseline.

Conclusion: Attention mechanisms significantly enhance semantic segmentation performance for archaeological heritage preservation, with Attention-Residual BEGL-UNet being the most effective architecture for rock art petroglyph segmentation.

Abstract: This study presents a comparative analysis of three U-Net-based architectures for semantic segmentation of rock art petroglyphs from Brazilian archaeological sites. The investigated architectures were: (1) BEGL-UNet with Border-Enhanced Gaussian Loss function; (2) Attention-Residual BEGL-UNet, incorporating residual blocks and gated attention mechanisms; and (3) Spatial Channel Attention BEGL-UNet, which employs spatial-channel attention modules based on Convolutional Block Attention Module. All implementations employed the BEGL loss function combining binary cross-entropy with Gaussian edge enhancement. Experiments were conducted on images from the Poço da Bebidinha Archaeological Complex, Piauí, Brazil, using 5-fold cross-validation. Among the architectures, Attention-Residual BEGL-UNet achieved the best overall performance with Dice Score of 0.710, validation loss of 0.067, and highest recall of 0.854. Spatial Channel Attention BEGL-UNet obtained comparable performance with DSC of 0.707 and recall of 0.857. The baseline BEGL-UNet registered DSC of 0.690. These results demonstrate the effectiveness of attention mechanisms for archaeological heritage digital preservation, with Dice Score improvements of 2.5-2.9% over the baseline.

[204] From Classification to Cross-Modal Understanding: Leveraging Vision-Language Models for Fine-Grained Renal Pathology

Zhenhao Guo, Rachit Saluja, Tianyuan Yao, Quan Liu, Junchao Zhu, Haibo Wang, Daniel Reisenbüchler, Yuankai Huo, Benjamin Liechty, David J. Pisapia, Kenji Ikemura, Steven Salvatoree, Surya Seshane, Mert R. Sabuncu, Yihe Yang, Ruining Deng

Main category: cs.CV

TL;DR: This paper evaluates vision-language models for fine-grained glomerular subtyping in kidney biopsy interpretation under few-shot learning constraints, finding that pathology-specialized models with vanilla fine-tuning perform best even with limited labeled data.

DetailsMotivation: Fine-grained glomerular subtyping is clinically important but suffers from scarce labels. Existing approaches use coarse classification with image-only models, leaving unclear how VLMs should be adapted for meaningful subtyping under data constraints.

Method: Model fine-grained glomerular subtyping as few-shot problem; systematically evaluate pathology-specialized and general-purpose VLMs; analyze classification performance and representation geometry including feature alignment and subtype separability.

Result: Pathology-specialized VLMs with vanilla fine-tuning are most effective, achieving substantial gains in discrimination and calibration with only 4-8 examples per subtype. Discrimination between positive/negative examples is as important as image-text alignment.

Conclusion: Supervision level and adaptation strategy jointly shape diagnostic performance and multimodal structure, providing guidance for model selection, adaptation strategies, and annotation investment in clinical settings.

Abstract: Fine-grained glomerular subtyping is central to kidney biopsy interpretation, but clinically valuable labels are scarce and difficult to obtain. Existing computational pathology approaches instead tend to evaluate coarse diseased classification under full supervision with image-only models, so it remains unclear how vision-language models (VLMs) should be adapted for clinically meaningful subtyping under data constraints. In this work, we model fine-grained glomerular subtyping as a clinically realistic few-shot problem and systematically evaluate both pathology-specialized and general-purpose vision-language models under this setting. We assess not only classification performance (accuracy, AUC, F1) but also the geometry of the learned representations, examining feature alignment between image and text embeddings and the separability of glomerular subtypes. By jointly analyzing shot count, model architecture and domain knowledge, and adaptation strategy, this study provides guidance for future model selection and training under real clinical data constraints. Our results indicate that pathology-specialized vision-language backbones, when paired with the vanilla fine-tuning, are the most effective starting point. Even with only 4-8 labeled examples per glomeruli subtype, these models begin to capture distinctions and show substantial gains in discrimination and calibration, though additional supervision continues to yield incremental improvements. We also find that the discrimination between positive and negative examples is as important as image-text alignment. Overall, our results show that supervision level and adaptation strategy jointly shape both diagnostic performance and multimodal structure, providing guidance for model selection, adaptation strategies, and annotation investment.

[205] BeyondFacial: Identity-Preserving Personalized Generation Beyond Facial Close-ups

Songsong Zhang, Chuanqi Tang, Hongguang Zhang, Guijian Tang, Minglong Li, Xueqiong Li, Shaowu Yang, Yuanxi Peng, Wenjing Yang, Jing Zhao

Main category: cs.CV

TL;DR: A novel IPPG method that overcomes facial close-up limitations by separating identity and semantics, enabling better scene generation while preserving identity fidelity.

DetailsMotivation: Existing IPPG methods overemphasize facial regions, resulting in outputs dominated by facial close-ups with weak visual narrativity and poor semantic consistency under complex text prompts.

Method: Proposes Dual-Line Inference pipeline with identity-semantic separation, Identity Adaptive Fusion strategy for deferred fusion, and Identity Aggregation Prepending module to replace random initializations.

Result: Achieves stable and effective performance in IPPG tasks beyond facial close-ups, enabling efficient generation without manual masking or fine-tuning.

Conclusion: The method breaks facial close-up constraints, facilitates film-level character-scene creation, and provides richer personalized generation capabilities as a plug-and-play component.

Abstract: Identity-Preserving Personalized Generation (IPPG) has advanced film production and artistic creation, yet existing approaches overemphasize facial regions, resulting in outputs dominated by facial close-ups.These methods suffer from weak visual narrativity and poor semantic consistency under complex text prompts, with the core limitation rooted in identity (ID) feature embeddings undermining the semantic expressiveness of generative models. To address these issues, this paper presents an IPPG method that breaks the constraint of facial close-ups, achieving synergistic optimization of identity fidelity and scene semantic creation. Specifically, we design a Dual-Line Inference (DLI) pipeline with identity-semantic separation, resolving the representation conflict between ID and semantics inherent in traditional single-path architectures. Further, we propose an Identity Adaptive Fusion (IdAF) strategy that defers ID-semantic fusion to the noise prediction stage, integrating adaptive attention fusion and noise decision masking to avoid ID embedding interference on semantics without manual masking. Finally, an Identity Aggregation Prepending (IdAP) module is introduced to aggregate ID information and replace random initializations, further enhancing identity preservation. Experimental results validate that our method achieves stable and effective performance in IPPG tasks beyond facial close-ups, enabling efficient generation without manual masking or fine-tuning. As a plug-and-play component, it can be rapidly deployed in existing IPPG frameworks, addressing the over-reliance on facial close-ups, facilitating film-level character-scene creation, and providing richer personalized generation capabilities for related domains.

[206] Dynamic Parameter Optimization for Highly Transferable Transformation-Based Attacks

Jiaming Liang, Chi-Man Pun

Main category: cs.CV

TL;DR: The paper addresses limitations in transformation-based adversarial attacks by revealing dynamic transferability patterns and proposing an efficient parameter optimization method that reduces complexity from O(m^n) to O(nlogm).

DetailsMotivation: Existing transformation-based attacks suffer from blind spots in parameter optimization: they only consider low-iteration settings, use uniform parameters across different scenarios, and rely on computationally expensive grid search methods.

Method: The authors conduct empirical studies revealing three dynamic patterns of transferability, propose a Concentric Decay Model (CDM) to explain these patterns, and develop Dynamic Parameter Optimization (DPO) based on the rise-then-fall pattern.

Result: Comprehensive experiments show that DPO significantly improves transferability across different surrogate models, iterations, and tasks while reducing computational complexity.

Conclusion: The proposed DPO method effectively addresses the limitations of existing transformation-based attacks by providing an efficient parameter optimization approach that enhances transferability with reduced computational overhead.

Abstract: Despite their wide application, the vulnerabilities of deep neural networks raise societal concerns. Among them, transformation-based attacks have demonstrated notable success in transfer attacks. However, existing attacks suffer from blind spots in parameter optimization, limiting their full potential. Specifically, (1) prior work generally considers low-iteration settings, yet attacks perform quite differently at higher iterations, so characterizing overall performance based only on low-iteration results is misleading. (2) Existing attacks use uniform parameters for different surrogate models, iterations, and tasks, which greatly impairs transferability. (3) Traditional transformation parameter optimization relies on grid search. For n parameters with m steps each, the complexity is O(mn). Large computational overhead limits further optimization of parameters. To address these limitations, we conduct an empirical study with various transformations as baselines, revealing three dynamic patterns of transferability with respect to parameter strength. We further propose a novel Concentric Decay Model (CDM) to effectively explain these patterns. Building on these insights, we propose an efficient Dynamic Parameter Optimization (DPO) based on the rise-then-fall pattern, reducing the complexity to O(nlogm). Comprehensive experiments on existing transformation-based attacks across different surrogate models, iterations, and tasks demonstrate that our DPO can significantly improve transferability.

[207] LithoSeg: A Coarse-to-Fine Framework for High-Precision Lithography Segmentation

Xinyu He, Botong Zhao, Bingbing Li, Shujing Lyu, Jiwei Shen, Yue Lu

Main category: cs.CV

TL;DR: LithoSeg: A coarse-to-fine network for precise lithography SEM image segmentation using SAM with human-in-the-loop bootstrapping and 1D regression refinement.

DetailsMotivation: Accurate segmentation of lithography SEM images is crucial for semiconductor process control and yield optimization, but existing methods lack precision and robustness for diverse pattern geometries.

Method: Coarse stage: Human-in-the-Loop Bootstrapping with SAM for robust segmentation. Fine stage: Convert 2D segmentation to 1D regression by sampling groove-normal profiles and refining with lightweight MLP.

Result: Outperforms previous approaches in segmentation accuracy and metrology precision while requiring less supervision.

Conclusion: LithoSeg offers promising prospects for real-world semiconductor manufacturing applications with improved precision and reduced supervision requirements.

Abstract: Accurate segmentation and measurement of lithography scanning electron microscope (SEM) images are crucial for ensuring precise process control, optimizing device performance, and advancing semiconductor manufacturing yield. Lithography segmentation requires pixel-level delineation of groove contours and consistent performance across diverse pattern geometries and process window. However, existing methods often lack the necessary precision and robustness, limiting their practical applicability. To overcome this challenge, we propose LithoSeg, a coarse-to-fine network tailored for lithography segmentation. In the coarse stage, we introduce a Human-in-the-Loop Bootstrapping scheme for the Segment Anything Model (SAM) to attain robustness with minimal supervision. In the subsequent fine stage, we recast 2D segmentation as 1D regression problem by sampling groove-normal profiles using the coarse mask and performing point-wise refinement with a lightweight MLP. LithoSeg outperforms previous approaches in both segmentation accuracy and metrology precision while requiring less supervision, offering promising prospects for real-world applications.

[208] Uncertainty-Guided Selective Adaptation Enables Cross-Platform Predictive Fluorescence Microscopy

Kai-Wen K. Yang, Andrew Bai, Alexandra Bermudez, Yunqi Hong, Zoe Latham, Iris Sloan, Michael Liu, Vishrut Goyal, Cho-Jui Hsieh, Neil Y. C. Lin

Main category: cs.CV

TL;DR: SIT-ADDA-Auto: A self-configuring domain adaptation framework that adapts only early convolutional layers while freezing deeper layers, enabling robust microscopy image transfer across instruments and settings without target labels.

DetailsMotivation: Deep learning models often fail when applied to microscopy images from new instruments or acquisition settings, and conventional domain adaptation methods disrupt learned semantic representations by retraining entire networks.

Method: SIT-ADDA-Auto integrates shallow-layer adversarial alignment with predictive uncertainty to automatically select adaptation depth, adapting only the earliest convolutional layers while freezing deeper layers without requiring target labels.

Result: The method improves reconstruction and downstream segmentation across exposure/illumination shifts, cross-instrument transfer, and multiple stains, outperforming full-encoder adaptation and non-adversarial baselines with reduced semantic feature drift.

Conclusion: Provides a design rule for label-free adaptation in microscopy and a practical recipe for field settings, demonstrating that adapting only early layers yields reliable transfer while preserving semantic representations.

Abstract: Deep learning is transforming microscopy, yet models often fail when applied to images from new instruments or acquisition settings. Conventional adversarial domain adaptation (ADDA) retrains entire networks, often disrupting learned semantic representations. Here, we overturn this paradigm by showing that adapting only the earliest convolutional layers, while freezing deeper layers, yields reliable transfer. Building on this principle, we introduce Subnetwork Image Translation ADDA with automatic depth selection (SIT-ADDA-Auto), a self-configuring framework that integrates shallow-layer adversarial alignment with predictive uncertainty to automatically select adaptation depth without target labels. We demonstrate robustness via multi-metric evaluation, blinded expert assessment, and uncertainty-depth ablations. Across exposure and illumination shifts, cross-instrument transfer, and multiple stains, SIT-ADDA improves reconstruction and downstream segmentation over full-encoder adaptation and non-adversarial baselines, with reduced drift of semantic features. Our results provide a design rule for label-free adaptation in microscopy and a recipe for field settings; the code is publicly available.

[209] Enhancing Road Safety Through Multi-Camera Image Segmentation with Post-Encroachment Time Analysis

Shounak Ray Chaudhuri, Arash Jahangiri, Christopher Paolini

Main category: cs.CV

TL;DR: A multi-camera computer vision framework for real-time traffic safety assessment using Post-Encroachment Time (PET) computation at signalized intersections, achieving sub-second precision on edge devices.

DetailsMotivation: Traditional crash-based traffic safety studies are limited by data sparsity and latency, creating a need for real-time, continuous safety assessment methods.

Method: Four synchronized cameras with YOLOv11 segmentation for vehicle detection, homography transformation to bird’s-eye view, and novel pixel-level PET algorithm without fixed cells, running on NVIDIA Jetson AGX Xavier edge devices.

Result: The framework identifies high-risk regions with sub-second precision, achieves 2.68 FPS for 800x800 pixel heatmaps, and provides fine-grained hazard visualization accurate to 3.3 sq-cm.

Conclusion: Validates feasibility of decentralized vision-based PET analysis for intelligent transportation systems, offering replicable methodology for high-resolution, real-time, and scalable intersection safety evaluation.

Abstract: Traffic safety analysis at signalized intersections is vital for reducing vehicle and pedestrian collisions, yet traditional crash-based studies are limited by data sparsity and latency. This paper presents a novel multi-camera computer vision framework for real-time safety assessment through Post-Encroachment Time (PET) computation, demonstrated at the intersection of H Street and Broadway in Chula Vista, California. Four synchronized cameras provide continuous visual coverage, with each frame processed on NVIDIA Jetson AGX Xavier devices using YOLOv11 segmentation for vehicle detection. Detected vehicle polygons are transformed into a unified bird’s-eye map using homography matrices, enabling alignment across overlapping camera views. A novel pixel-level PET algorithm measures vehicle position without reliance on fixed cells, allowing fine-grained hazard visualization via dynamic heatmaps, accurate to 3.3 sq-cm. Timestamped vehicle and PET data is stored in an SQL database for long-term monitoring. Results over various time intervals demonstrate the framework’s ability to identify high-risk regions with sub-second precision and real-time throughput on edge devices, producing data for an 800 x 800 pixel logarithmic heatmap at an average of 2.68 FPS. This study validates the feasibility of decentralized vision-based PET analysis for intelligent transportation systems, offering a replicable methodology for high-resolution, real-time, and scalable intersection safety evaluation.

[210] LIHE: Linguistic Instance-Split Hyperbolic-Euclidean Framework for Generalized Weakly-Supervised Referring Expression Comprehension

Xianglong Shi, Silin Cheng, Sirui Zhao, Yunhan Jiang, Enhong Chen, Yang Liu, Sebastien Ourselin

Main category: cs.CV

TL;DR: LIHE is a novel framework for Weakly-Supervised Generalized Referring Expression Comprehension that handles expressions with variable numbers of referents using a two-stage approach with hybrid Euclidean-hyperbolic similarity.

DetailsMotivation: Existing WREC methods are limited by one-to-one mapping assumptions and cannot handle expressions with zero or multiple targets in realistic scenarios, creating a need for more practical generalized approaches.

Method: Two-stage LIHE framework: 1) Referential Decoupling predicts target count and decomposes complex expressions into sub-expressions, 2) Referent Grounding localizes sub-expressions using HEMix hybrid similarity module combining Euclidean proximity and hyperbolic geometry.

Result: LIHE establishes the first effective weakly supervised WGREC baseline on gRefCOCO and Ref-ZOM datasets, while HEMix improves IoU@0.5 by up to 2.5% on standard REC benchmarks.

Conclusion: The proposed LIHE framework successfully addresses the challenges of WGREC by handling variable referent counts and preventing semantic collapse through hybrid Euclidean-hyperbolic similarity modeling.

Abstract: Existing Weakly-Supervised Referring Expression Comprehension (WREC) methods, while effective, are fundamentally limited by a one-to-one mapping assumption, hindering their ability to handle expressions corresponding to zero or multiple targets in realistic scenarios. To bridge this gap, we introduce the Weakly-Supervised Generalized Referring Expression Comprehension task (WGREC), a more practical paradigm that handles expressions with variable numbers of referents. However, extending WREC to WGREC presents two fundamental challenges: supervisory signal ambiguity, where weak image-level supervision is insufficient for training a model to infer the correct number and identity of referents, and semantic representation collapse, where standard Euclidean similarity forces hierarchically-related concepts into non-discriminative clusters, blurring categorical boundaries. To tackle these challenges, we propose a novel WGREC framework named Linguistic Instance-Split Hyperbolic-Euclidean (LIHE), which operates in two stages. The first stage, Referential Decoupling, predicts the number of target objects and decomposes the complex expression into simpler sub-expressions. The second stage, Referent Grounding, then localizes these sub-expressions using HEMix, our innovative hybrid similarity module that synergistically combines the precise alignment capabilities of Euclidean proximity with the hierarchical modeling strengths of hyperbolic geometry. This hybrid approach effectively prevents semantic collapse while preserving fine-grained distinctions between related concepts. Extensive experiments demonstrate LIHE establishes the first effective weakly supervised WGREC baseline on gRefCOCO and Ref-ZOM, while HEMix achieves consistent improvements on standard REC benchmarks, improving IoU@0.5 by up to 2.5%. The code is available at https://anonymous.4open.science/r/LIHE.

[211] Null-Space Diffusion Distillation for Efficient Photorealistic Lensless Imaging

Jose Reinaldo Cunha Santos A V Silva Neto, Hodaka Kawachi, Yasushi Yagi, Tomoya Nakamura

Main category: cs.CV

TL;DR: NSDD distills iterative diffusion solvers for lensless imaging, achieving photorealistic reconstructions without paired supervision while being fast and memory-efficient.

DetailsMotivation: To avoid biases from paired lensless-lensed supervision and address the limitations of generic diffusion priors in noisy, ill-posed lensless deconvolution settings.

Method: Introduces Null-Space Diffusion Distillation (NSDD), a single-pass student model that distills the null-space component of iterative DDNM+ solver, conditioned on lensless measurement and range-space anchor.

Result: NSDD achieves near-teacher perceptual quality (second-best LPIPS), outperforms DPS and classical baselines, and is the second fastest method behind Wiener while preserving measurement consistency.

Conclusion: NSDD provides a practical path toward fast, ground-truth-free, photorealistic lensless imaging by separating range-space enforcement from null-space diffusion-prior updates.

Abstract: State-of-the-art photorealistic reconstructions for lensless cameras often rely on paired lensless-lensed supervision, which can bias models due to lens-lensless domain mismatch. To avoid this, ground-truth-free diffusion priors are attractive; however, generic formulations tuned for conventional inverse problems often break under the noisy, highly multiplexed, and ill-posed lensless deconvolution setting. We observe that methods which separate range-space enforcement from null-space diffusion-prior updates yield stable, realistic reconstructions. Building on this, we introduce Null-Space Diffusion Distillation (NSDD): a single-pass student that distills the null-space component of an iterative DDNM+ solver, conditioned on the lensless measurement and on a range-space anchor. NSDD preserves measurement consistency and achieves photorealistic results without paired supervision at a fraction of the runtime and memory. On Lensless-FFHQ and PhlatCam, NSDD is the second fastest, behind Wiener, and achieves near-teacher perceptual quality (second-best LPIPS, below DDNM+), outperforming DPS and classical convex baselines. These results suggest a practical path toward fast, ground-truth-free, photorealistic lensless imaging.

[212] Bridging Vision and Language for Robust Context-Aware Surgical Point Tracking: The VL-SurgPT Dataset and Benchmark

Rulin Zhou, Wenlong He, An Wang, Jianhang Zhang, Xuanhui Zeng, Xi Zhang, Chaowei Zhu, Haijun Hu, Hongliang Ren

Main category: cs.CV

TL;DR: VL-SurgPT is the first large-scale multimodal surgical tracking dataset with 908 video clips that combines visual tracking with textual descriptions of point status, enabling context-aware tracking systems for challenging surgical conditions.

DetailsMotivation: Existing surgical tracking datasets lack semantic context to understand tracking failure mechanisms in complex surgical environments with smoke occlusion, specular reflections, and tissue deformation.

Method: Created VL-SurgPT dataset with 754 tissue tracking clips (17,171 points) and 154 instrument tracking clips (7 instrument types), established benchmarks with 8 tracking methods, and proposed TG-SurgPT - a text-guided approach that leverages semantic descriptions.

Result: Incorporating point status information significantly improves tracking accuracy and reliability, especially in adverse visual scenarios where conventional vision-only methods struggle.

Conclusion: Bridging visual and linguistic modalities enables development of context-aware tracking systems crucial for advancing computer-assisted surgery applications that maintain performance under challenging intraoperative conditions.

Abstract: Accurate point tracking in surgical environments remains challenging due to complex visual conditions, including smoke occlusion, specular reflections, and tissue deformation. While existing surgical tracking datasets provide coordinate information, they lack the semantic context necessary to understand tracking failure mechanisms. We introduce VL-SurgPT, the first large-scale multimodal dataset that bridges visual tracking with textual descriptions of point status in surgical scenes. The dataset comprises 908 in vivo video clips, including 754 for tissue tracking (17,171 annotated points across five challenging scenarios) and 154 for instrument tracking (covering seven instrument types with detailed keypoint annotations). We establish comprehensive benchmarks using eight state-of-the-art tracking methods and propose TG-SurgPT, a text-guided tracking approach that leverages semantic descriptions to improve robustness in visually challenging conditions. Experimental results demonstrate that incorporating point status information significantly improves tracking accuracy and reliability, particularly in adverse visual scenarios where conventional vision-only methods struggle. By bridging visual and linguistic modalities, VL-SurgPT enables the development of context-aware tracking systems crucial for advancing computer-assisted surgery applications that can maintain performance even under challenging intraoperative conditions.

[213] GCAgent: Long-Video Understanding via Schematic and Narrative Episodic Memory

Jeong Hun Yeo, Sangyun Chung, Sungjune Park, Dae Hoe Kim, Jinyoung Moon, Yong Man Ro

Main category: cs.CV

TL;DR: GCAgent is a novel framework for long-video understanding that uses Schematic and Narrative Episodic Memory to model events and their relationships, solving long-term dependency problems and achieving state-of-the-art performance on Video-MME benchmark.

DetailsMotivation: Existing MLLMs struggle with long-video understanding due to token limitations and inability to capture long-term temporal dependencies and global context for deep video reasoning.

Method: Proposes GCAgent framework with Schematic and Narrative Episodic Memory that structurally models events and their causal/temporal relations. Operates in Perception-Action-Reflection cycle using Memory Manager for context-aware inference.

Result: Achieves up to 23.5% accuracy improvement on Video-MME Long split over baseline, 73.4% accuracy on Long split, and highest overall average (71.9%) on Video-MME benchmark among 7B-scale MLLMs.

Conclusion: The agent-based reasoning paradigm with structured memory effectively enables cognitively-inspired long-video understanding, establishing new state-of-the-art performance.

Abstract: Long-video understanding remains a significant challenge for Multimodal Large Language Models (MLLMs) due to inherent token limitations and the complexity of capturing long-term temporal dependencies. Existing methods often fail to capture the global context and complex event relationships necessary for deep video reasoning. To address this, we introduce GCAgent, a novel Global-Context-Aware Agent framework that achieves comprehensive long-video understanding. Our core innovation is the Schematic and Narrative Episodic Memory. This memory structurally models events and their causal and temporal relations into a concise, organized context, fundamentally resolving the long-term dependency problem. Operating in a multi-stage Perception-Action-Reflection cycle, our GCAgent utilizes a Memory Manager to retrieve relevant episodic context for robust, context-aware inference. Extensive experiments confirm that GCAgent significantly enhances long-video understanding, achieving up to 23.5% accuracy improvement on the Video-MME Long split over a strong MLLM baseline. Furthermore, our framework establishes state-of-the-art performance among comparable 7B-scale MLLMs, achieving 73.4% accuracy on the Long split and the highest overall average (71.9%) on the Video-MME benchmark, validating our agent-based reasoning paradigm and structured memory for cognitively-inspired long-video understanding.

[214] VPHO: Joint Visual-Physical Cue Learning and Aggregation for Hand-Object Pose Estimation

Jun Zhou, Chi Xu, Kaifeng Tang, Yuting Ge, Tingrui Guo, Li Cheng

Main category: cs.CV

TL;DR: A novel framework that jointly integrates visual and physical cues for 3D hand-object pose estimation from single RGB images, overcoming limitations of visual-only methods and non-differentiable physics engines.

DetailsMotivation: Existing methods rely on visual cues alone, producing physically implausible results with interpenetration or non-contact issues. Current physics integration approaches compromise visual consistency and end-to-end trainability.

Method: Two key components: 1) Joint visual-physical cue learning for comprehensive representation learning, 2) Candidate pose aggregation that refines multiple diffusion-generated poses using both visual and physical predictions.

Result: Significantly outperforms state-of-the-art approaches in both pose accuracy and physical plausibility through extensive experiments.

Conclusion: The proposed framework successfully achieves visually consistent and physically plausible hand-object pose estimation by integrating visual and physical cues in an end-to-end trainable manner.

Abstract: Estimating the 3D poses of hands and objects from a single RGB image is a fundamental yet challenging problem, with broad applications in augmented reality and human-computer interaction. Existing methods largely rely on visual cues alone, often producing results that violate physical constraints such as interpenetration or non-contact. Recent efforts to incorporate physics reasoning typically depend on post-optimization or non-differentiable physics engines, which compromise visual consistency and end-to-end trainability. To overcome these limitations, we propose a novel framework that jointly integrates visual and physical cues for hand-object pose estimation. This integration is achieved through two key ideas: 1) joint visual-physical cue learning: The model is trained to extract 2D visual cues and 3D physical cues, thereby enabling more comprehensive representation learning for hand-object interactions; 2) candidate pose aggregation: A novel refinement process that aggregates multiple diffusion-generated candidate poses by leveraging both visual and physical predictions, yielding a final estimate that is visually consistent and physically plausible. Extensive experiments demonstrate that our method significantly outperforms existing state-of-the-art approaches in both pose accuracy and physical plausibility.

[215] Improved Masked Image Generation with Knowledge-Augmented Token Representations

Guotao Liang, Baoquan Zhang, Zhiyuan Wen, Zihao Han, Yunming Ye

Main category: cs.CV

TL;DR: KA-MIG enhances masked image generation by incorporating explicit token-level semantic dependency knowledge from training data through three types of knowledge graphs, improving generation quality.

DetailsMotivation: Existing MIG methods struggle to learn semantic dependencies from visual tokens because individual tokens lack clear semantic meanings and sequences are long, limiting performance.

Method: Proposes KA-MIG framework with three knowledge graphs (co-occurrence, semantic similarity, position-token incompatibility), a graph-aware encoder for token/position representations, and lightweight fusion with existing MIG methods.

Result: Experimental results show improved performance for class-conditional image generation on ImageNet compared to existing MIG methods.

Conclusion: Incorporating explicit token-level semantic dependency knowledge as priors effectively enhances the model’s ability to capture semantic dependencies and improves generation quality in masked image generation.

Abstract: Masked image generation (MIG) has demonstrated remarkable efficiency and high-fidelity images by enabling parallel token prediction. Existing methods typically rely solely on the model itself to learn semantic dependencies among visual token sequences. However, directly learning such semantic dependencies from data is challenging because the individual tokens lack clear semantic meanings, and these sequences are usually long. To address this limitation, we propose a novel Knowledge-Augmented Masked Image Generation framework, named KA-MIG, which introduces explicit knowledge of token-level semantic dependencies (\emph{i.e.}, extracted from the training data) as priors to learn richer representations for improving performance. In particular, we explore and identify three types of advantageous token knowledge graphs, including two positive and one negative graphs (\emph{i.e.}, the co-occurrence graph, the semantic similarity graph, and the position-token incompatibility graph). Based on three prior knowledge graphs, we design a graph-aware encoder to learn token and position-aware representations. After that, a lightweight fusion mechanism is introduced to integrate these enriched representations into the existing MIG methods. Resorting to such prior knowledge, our method effectively enhances the model’s ability to capture semantic dependencies, leading to improved generation quality. Experimental results demonstrate that our method improves upon existing MIG for class-conditional image generation on ImageNet.

[216] SRSplat: Feed-Forward Super-Resolution Gaussian Splatting from Sparse Multi-View Images

Xinyuan Hu, Changyue Shi, Chuxiao Yang, Minghao Chen, Jiajun Ding, Tao Wei, Chen Wei, Zhou Yu, Min Tan

Main category: cs.CV

TL;DR: SRSplat is a feed-forward framework that reconstructs high-resolution 3D scenes from sparse, low-resolution images by leveraging external reference images and internal texture cues to compensate for missing high-frequency details.

DetailsMotivation: Existing methods fail to recover fine texture details from sparse, low-resolution images due to inherent lack of high-frequency information, limiting their practical applications in autonomous driving and embodied AI.

Method: Uses MLLMs and diffusion models to generate scene-specific reference gallery, integrates external information via Reference-Guided Feature Enhancement module, trains decoder to predict Gaussian primitives, and refines them with Texture-Aware Density Control based on internal texture richness.

Result: Outperforms existing methods on RealEstate10K, ACID, and DTU datasets, demonstrating strong cross-dataset and cross-resolution generalization capabilities.

Conclusion: SRSplat effectively addresses the texture detail recovery problem in 3D reconstruction from sparse LR images through joint external reference and internal texture optimization.

Abstract: Feed-forward 3D reconstruction from sparse, low-resolution (LR) images is a crucial capability for real-world applications, such as autonomous driving and embodied AI. However, existing methods often fail to recover fine texture details. This limitation stems from the inherent lack of high-frequency information in LR inputs. To address this, we propose \textbf{SRSplat}, a feed-forward framework that reconstructs high-resolution 3D scenes from only a few LR views. Our main insight is to compensate for the deficiency of texture information by jointly leveraging external high-quality reference images and internal texture cues. We first construct a scene-specific reference gallery, generated for each scene using Multimodal Large Language Models (MLLMs) and diffusion models. To integrate this external information, we introduce the \textit{Reference-Guided Feature Enhancement (RGFE)} module, which aligns and fuses features from the LR input images and their reference twin image. Subsequently, we train a decoder to predict the Gaussian primitives using the multi-view fused feature obtained from \textit{RGFE}. To further refine predicted Gaussian primitives, we introduce \textit{Texture-Aware Density Control (TADC)}, which adaptively adjusts Gaussian density based on the internal texture richness of the LR inputs. Extensive experiments demonstrate that our SRSplat outperforms existing methods on various datasets, including RealEstate10K, ACID, and DTU, and exhibits strong cross-dataset and cross-resolution generalization capabilities.

[217] FedSDA: Federated Stain Distribution Alignment for Non-IID Histopathological Image Classification

Cheng-Chang Tsai, Kai-Wen Cheng, Chun-Shien Lu

Main category: cs.CV

TL;DR: FedSDA addresses non-IID data in federated learning for histopathological images by aligning stain distributions across clients using diffusion models, improving model performance while maintaining privacy.

DetailsMotivation: Non-IID data poses a major challenge in federated learning, especially for histopathological images with feature distribution shifts. Existing approaches have limited attention on addressing this from a data distribution perspective.

Method: Proposed Federated Stain Distribution Alignment (FedSDA) that uses diffusion models and stain separation to align client stain distributions with a target distribution, while circumventing privacy risks by avoiding training diffusion models on raw data.

Result: FedSDA effectively improves baseline methods that focus on mitigating model update disparities and outperforms other non-IID data distribution approaches. Extensive experiments demonstrate its effectiveness.

Conclusion: FedSDA provides valuable practical insights for computational pathology by successfully addressing non-IID data issues through stain distribution alignment in federated learning frameworks.

Abstract: Federated learning (FL) has shown success in collaboratively training a model among decentralized data resources without directly sharing privacy-sensitive training data. Despite recent advances, non-IID (non-independent and identically distributed) data poses an inevitable challenge that hinders the use of FL. In this work, we address the issue of non-IID histopathological images with feature distribution shifts from an intuitive perspective that has only received limited attention. Specifically, we address this issue from the perspective of data distribution by solely adjusting the data distributions of all clients. Building on the success of diffusion models in fitting data distributions and leveraging stain separation to extract the pivotal features that are closely related to the non-IID properties of histopathological images, we propose a Federated Stain Distribution Alignment (FedSDA) method. FedSDA aligns the stain distribution of each client with a target distribution in an FL framework to mitigate distribution shifts among clients. Furthermore, considering that training diffusion models on raw data in FL has been shown to be susceptible to privacy leakage risks, we circumvent this problem while still effectively achieving alignment. Extensive experimental results show that FedSDA is not only effective in improving baselines that focus on mitigating disparities across clients’ model updates but also outperforms baselines that address the non-IID data issues from the perspective of data distribution. We show that FedSDA provides valuable and practical insights for the computational pathology community.

[218] DCMM-Transformer: Degree-Corrected Mixed-Membership Attention for Medical Imaging

Huimin Cheng, Xiaowei Yu, Shushan Wu, Luyang Fang, Chao Cao, Jing Zhang, Tianming Liu, Dajiang Zhu, Wenxuan Zhong, Ping Ma

Main category: cs.CV

TL;DR: DCMM-Transformer is a novel Vision Transformer architecture for medical images that incorporates anatomical groupings through a differentiable Degree-Corrected Mixed-Membership model as an additive bias in self-attention, overcoming limitations of previous methods.

DetailsMotivation: Standard Vision Transformers fail to exploit latent anatomical groupings in medical images (organs, tissues, pathological regions), while existing approaches like SBM-Transformer suffer from non-differentiability, training instability, and inability to model complex community structure.

Method: Propose DCMM-Transformer that uses a Degree-Corrected Mixed-Membership model as an additive bias in self-attention, enabling differentiable and interpretable incorporation of community structure and degree heterogeneity without binary sampling or multiplicative masking.

Result: Comprehensive experiments across diverse medical imaging datasets (brain, chest, breast, ocular) demonstrate superior performance and generalizability. Learned group structure produces anatomically meaningful and semantically coherent attention maps.

Conclusion: DCMM-Transformer effectively incorporates anatomical structure in medical image analysis through a differentiable approach, achieving better performance while enhancing interpretability with structured attention modulation.

Abstract: Medical images exhibit latent anatomical groupings, such as organs, tissues, and pathological regions, that standard Vision Transformers (ViTs) fail to exploit. While recent work like SBM-Transformer attempts to incorporate such structures through stochastic binary masking, they suffer from non-differentiability, training instability, and the inability to model complex community structure. We present DCMM-Transformer, a novel ViT architecture for medical image analysis that incorporates a Degree-Corrected Mixed-Membership (DCMM) model as an additive bias in self-attention. Unlike prior approaches that rely on multiplicative masking and binary sampling, our method introduces community structure and degree heterogeneity in a fully differentiable and interpretable manner. Comprehensive experiments across diverse medical imaging datasets, including brain, chest, breast, and ocular modalities, demonstrate the superior performance and generalizability of the proposed approach. Furthermore, the learned group structure and structured attention modulation substantially enhance interpretability by yielding attention maps that are anatomically meaningful and semantically coherent.

[219] JAFAR: Jack up Any Feature at Any Resolution

Paul Couairon, Loick Chambon, Louis Serrano, Jean-Emmanuel Haugeard, Matthieu Cord, Nicolas Thome

Main category: cs.CV

TL;DR: JAFAR is a lightweight attention-based feature upsampler that enhances spatial resolution of vision encoder features to arbitrary target resolutions using Spatial Feature Transform modulation, showing strong generalization from low to high scales without high-resolution supervision.

DetailsMotivation: Foundation Vision Encoders produce low-resolution spatial features that need upsampling for high-resolution downstream tasks, but existing methods lack flexibility and performance.

Method: Uses attention-based module with Spatial Feature Transform (SFT) modulation to align high-resolution queries from low-level features with semantically enriched low-resolution keys, trained at low upsampling ratios.

Result: Effectively recovers fine-grained spatial details and consistently outperforms existing feature upsampling methods across diverse downstream tasks.

Conclusion: JAFAR provides a flexible and effective solution for feature upsampling that generalizes well from low to high scales without requiring high-resolution supervision.

Abstract: Foundation Vision Encoders have become essential for a wide range of dense vision tasks. However, their low-resolution spatial feature outputs necessitate feature upsampling to produce the high-resolution modalities required for downstream tasks. In this work, we introduce JAFAR, a lightweight and flexible feature upsampler that enhances the spatial resolution of visual features from any Foundation Vision Encoder to an arbitrary target resolution. JAFAR employs an attention-based module designed to promote semantic alignment between high-resolution queries, derived from low-level image features, and semantically enriched low-resolution keys, using Spatial Feature Transform (SFT) modulation. Notably, despite the absence of high-resolution supervision, we demonstrate that learning at low upsampling ratios and resolutions generalizes remarkably well to significantly higher output scales. Extensive experiments show that JAFAR effectively recovers fine-grained spatial details and consistently outperforms existing feature upsampling methods across a diverse set of downstream tasks. Project page at https://jafar-upsampler.github.io

[220] DeiTFake: Deepfake Detection Model using DeiT Multi-Stage Training

Saksham Kumar, Ashish Singh, Srinivasarao Thota, Sunil Kumar Singh, Chandan Kumar

Main category: cs.CV

TL;DR: DeiTFake: A DeiT-based transformer model with two-stage progressive training that achieves 99.22% accuracy for deepfake detection using knowledge distillation and specialized augmentations.

DetailsMotivation: Deepfakes pose major threats to digital media integrity, requiring robust detection methods to identify manipulated content.

Method: Two-stage progressive training: initial transfer learning with standard augmentations followed by fine-tuning with advanced affine and deepfake-specific augmentations, using DeiT’s knowledge distillation to capture subtle manipulation artifacts.

Result: Achieved 98.71% accuracy after stage one and 99.22% accuracy with AUROC of 0.9997 after stage two, outperforming latest OpenForensics baselines on 190,335 images.

Conclusion: The approach provides practical benchmarks for facial deepfake detection and demonstrates the effectiveness of progressive training with specialized augmentations.

Abstract: Deepfakes are major threats to the integrity of digital media. We propose DeiTFake, a DeiT-based transformer and a novel two-stage progressive training strategy with increasing augmentation complexity. The approach applies an initial transfer-learning phase with standard augmentations followed by a fine-tuning phase using advanced affine and deepfake-specific augmentations. DeiT’s knowledge distillation model captures subtle manipulation artifacts, increasing robustness of the detection model. Trained on the OpenForensics dataset (190,335 images), DeiTFake achieves 98.71% accuracy after stage one and 99.22% accuracy with an AUROC of 0.9997, after stage two, outperforming the latest OpenForensics baselines. We analyze augmentation impact and training schedules, and provide practical benchmarks for facial deepfake detection.

[221] UniABG: Unified Adversarial View Bridging and Graph Correspondence for Unsupervised Cross-View Geo-Localization

Cuiqun Chen, Qi Chen, Bin Yang, Xingyi Zhang

Main category: cs.CV

TL;DR: UniABG is a dual-stage unsupervised cross-view geo-localization framework that uses adversarial view bridging and graph-based calibration to achieve state-of-the-art performance without pairwise annotations.

DetailsMotivation: Supervised CVGL methods require extensive pairwise annotations which limit scalability, while unsupervised methods suffer from noisy pseudo-labels due to cross-view domain gaps.

Method: Two-stage approach: 1) View-Aware Adversarial Bridging (VAAB) for view-invariant features and robust pseudo-labels, 2) Heterogeneous Graph Filtering Calibration (HGFC) with dual inter-view structure graphs for reliable view correspondence.

Result: Achieves state-of-the-art unsupervised performance: +10.63% AP on University-1652 and +16.73% AP on SUES-200 for Satellite→Drone matching, even surpassing supervised baselines.

Conclusion: UniABG effectively addresses unsupervised CVGL challenges through adversarial bridging and graph-based calibration, demonstrating strong performance without annotation requirements.

Abstract: Cross-view geo-localization (CVGL) matches query images ($\textit{e.g.}$, drone) to geographically corresponding opposite-view imagery ($\textit{e.g.}$, satellite). While supervised methods achieve strong performance, their reliance on extensive pairwise annotations limits scalability. Unsupervised alternatives avoid annotation costs but suffer from noisy pseudo-labels due to intrinsic cross-view domain gaps. To address these limitations, we propose $\textit{UniABG}$, a novel dual-stage unsupervised cross-view geo-localization framework integrating adversarial view bridging with graph-based correspondence calibration. Our approach first employs View-Aware Adversarial Bridging (VAAB) to model view-invariant features and enhance pseudo-label robustness. Subsequently, Heterogeneous Graph Filtering Calibration (HGFC) refines cross-view associations by constructing dual inter-view structure graphs, achieving reliable view correspondence. Extensive experiments demonstrate state-of-the-art unsupervised performance, showing that UniABG improves Satellite $\rightarrow$ Drone AP by +10.63% on University-1652 and +16.73% on SUES-200, even surpassing supervised baselines. The source code is available at https://github.com/chenqi142/UniABG

[222] PipeDiT: Accelerating Diffusion Transformers in Video Generation with Task Pipelining and Model Decoupling

Sijie Wang, Qiang Wang, Shaohuai Shi

Main category: cs.CV

TL;DR: PipeDiT is a pipelining framework that accelerates video generation by optimizing GPU utilization and reducing memory consumption through sequence parallelism, decoupled VAE processing, and attention co-processing.

DetailsMotivation: Current diffusion transformer (DiT) based video generation models suffer from slow inference speeds and high memory consumption, limiting their practical deployment despite their remarkable capabilities.

Method: Three main innovations: 1) PipeSP for pipelined sequence parallelism across GPUs, 2) DeDiVAE to decouple diffusion and VAE modules into separate GPU groups for pipelined execution, 3) Attention co-processing (Aco) to better utilize VAE group GPU resources.

Result: Integrated into OpenSoraPlan and HunyuanVideo frameworks, PipeDiT achieves 1.06x to 4.02x speedups over baseline systems across various resolution and timestep configurations on 8-GPU systems.

Conclusion: PipeDiT effectively addresses the inference latency and memory consumption challenges in DiT-based video generation, enabling more practical deployment of state-of-the-art video generation models.

Abstract: Video generation has been advancing rapidly, and diffusion transformer (DiT) based models have demonstrated remark- able capabilities. However, their practical deployment is of- ten hindered by slow inference speeds and high memory con- sumption. In this paper, we propose a novel pipelining frame- work named PipeDiT to accelerate video generation, which is equipped with three main innovations. First, we design a pipelining algorithm (PipeSP) for sequence parallelism (SP) to enable the computation of latent generation and commu- nication among multiple GPUs to be pipelined, thus reduc- ing inference latency. Second, we propose DeDiVAE to de- couple the diffusion module and the variational autoencoder (VAE) module into two GPU groups, whose executions can also be pipelined to reduce memory consumption and infer- ence latency. Third, to better utilize the GPU resources in the VAE group, we propose an attention co-processing (Aco) method to further reduce the overall video generation latency. We integrate our PipeDiT into both OpenSoraPlan and Hun- yuanVideo, two state-of-the-art open-source video generation frameworks, and conduct extensive experiments on two 8- GPU systems. Experimental results show that, under many common resolution and timestep configurations, our PipeDiT achieves 1.06x to 4.02x speedups over OpenSoraPlan and HunyuanVideo.

[223] MovSemCL: Movement-Semantics Contrastive Learning for Trajectory Similarity

Zhichen Lai, Hua Lu, Huan Li, Jialiang Li, Christian S. Jensen

Main category: cs.CV

TL;DR: MovSemCL is a movement-semantics contrastive learning framework for trajectory similarity computation that addresses limitations in existing methods by modeling trajectory semantics and hierarchy efficiently, reducing computational costs, and using physically plausible augmentations.

DetailsMotivation: Existing learning-based trajectory similarity methods have three key limitations: insufficient modeling of trajectory semantics and hierarchy, high computational costs from point-wise encoding, and use of physically implausible augmentations that distort trajectory semantics.

Method: Transforms GPS trajectories into movement-semantics features, segments them into patches, uses intra- and inter-patch attentions to encode local and global patterns, and employs curvature-guided augmentation that preserves informative segments while masking redundant ones.

Result: Outperforms state-of-the-art methods with mean ranks close to ideal value of 1 at similarity search tasks, improvements up to 20.3% at heuristic approximation, and reduces inference latency by up to 43.4%.

Conclusion: MovSemCL effectively addresses key limitations in trajectory similarity computation through hierarchical representation, efficient encoding, and physically plausible augmentations, achieving superior performance with reduced computational costs.

Abstract: Trajectory similarity computation is fundamental functionality that is used for, e.g., clustering, prediction, and anomaly detection. However, existing learning-based methods exhibit three key limitations: (1) insufficient modeling of trajectory semantics and hierarchy, lacking both movement dynamics extraction and multi-scale structural representation; (2) high computational costs due to point-wise encoding; and (3) use of physically implausible augmentations that distort trajectory semantics. To address these issues, we propose MovSemCL, a movement-semantics contrastive learning framework for trajectory similarity computation. MovSemCL first transforms raw GPS trajectories into movement-semantics features and then segments them into patches. Next, MovSemCL employs intra- and inter-patch attentions to encode local as well as global trajectory patterns, enabling efficient hierarchical representation and reducing computational costs. Moreover, MovSemCL includes a curvature-guided augmentation strategy that preserves informative segments (e.g., turns and intersections) and masks redundant ones, generating physically plausible augmented views. Experiments on real-world datasets show that MovSemCL is capable of outperforming state-of-the-art methods, achieving mean ranks close to the ideal value of 1 at similarity search tasks and improvements by up to 20.3% at heuristic approximation, while reducing inference latency by up to 43.4%.

[224] Learning to Hear by Seeing: It’s Time for Vision Language Models to Understand Artistic Emotion from Sight and Sound

Dengming Zhang, Weitao You, Jingxiong Li, Weishen Lin, Wenda Shi, Xue Zhao, Heda Zuo, Junxian Wu, Lingyun Sun

Main category: cs.CV

TL;DR: VAEmotionLLM is a two-stage framework that teaches visual language models to hear with limited audio pretraining and understand emotion across modalities, achieving SOTA on emotion understanding benchmarks.

DetailsMotivation: Current AVLMs require large-scale audio pretraining and overlook emotion intentionally expressed in artworks, while emotion understanding is critical for making LLMs more human-aligned.

Method: Two-stage framework: 1) VG-Align distills visual pathway into audio pathway by aligning token distributions, 2) EmoAdapter with Emotion Enhancer and Supervisor injects emotion-sensitive residuals and applies emotion supervision.

Result: Achieves state-of-the-art results on ArtEmoBenchmark, outperforming audio-only, visual-only, and audio-visual baselines. Ablations show proposed components are complementary.

Conclusion: The framework enables efficient audio-visual emotion understanding with limited audio pretraining, demonstrating effectiveness for cross-modal emotion comprehension in artistic contexts.

Abstract: Emotion understanding is critical for making Large Language Models (LLMs) more general, reliable, and aligned with humans. Art conveys emotion through the joint design of visual and auditory elements, yet most prior work is human-centered or single-modality, overlooking the emotion intentionally expressed by the artwork. Meanwhile, current Audio-Visual Language Models (AVLMs) typically require large-scale audio pretraining to endow Visual Language Models (VLMs) with hearing, which limits scalability. We present Vision Anchored Audio-Visual Emotion LLM (VAEmotionLLM), a two-stage framework that teaches a VLM to hear by seeing with limited audio pretraining and to understand emotion across modalities. In Stage 1, Vision-Guided Audio Alignment (VG-Align) distills the frozen visual pathway into a new audio pathway by aligning next-token distributions of the shared LLM on synchronized audio-video clips, enabling hearing without a large audio dataset. In Stage 2, a lightweight Cross-Modal Emotion Adapter (EmoAdapter), composed of the Emotion Enhancer and the Emotion Supervisor, injects emotion-sensitive residuals and applies emotion supervision to enhance cross-modal emotion understanding. We also construct ArtEmoBenchmark, an art-centric emotion benchmark that evaluates content and emotion understanding under audio-only, visual-only, and audio-visual inputs. VAEmotionLLM achieves state-of-the-art results on ArtEmoBenchmark, outperforming audio-only, visual-only, and audio-visual baselines. Ablations show that the proposed components are complementary.

[225] Point Cloud Quantization through Multimodal Prompting for 3D Understanding

Hongxuan Li, Wencheng Zhu, Huiying Xu, Xinzhong Zhu, Pengfei Zhu

Main category: cs.CV

TL;DR: A multimodal prompting-driven quantization framework for point cloud analysis that uses text embeddings as robust prototype priors and adaptively refines them through multimodal prompts.

DetailsMotivation: Current prototype-based quantization methods lack representativeness and interpretability, while multimodal alignment shows promise but needs better integration for point cloud analysis.

Method: Uses text embeddings as prototype priors, multimodal prompts for adaptive refinement, dual-constrained quantization space with compactness/separation regularization, and Gumbel-Softmax for differentiable discretization.

Result: Superior performance demonstrated on ModelNet40 and ScanObjectNN datasets, showing effectiveness in encoding both geometric and semantic information.

Conclusion: The proposed framework successfully addresses limitations of current quantization methods by leveraging multimodal alignment and adaptive prompting, achieving robust hybrid representations for point cloud analysis.

Abstract: Vector quantization has emerged as a powerful tool in large-scale multimodal models, unifying heterogeneous representations through discrete token encoding. However, its effectiveness hinges on robust codebook design. Current prototype-based approaches relying on trainable vectors or clustered centroids fall short in representativeness and interpretability, even as multimodal alignment demonstrates its promise in vision-language models. To address these limitations, we propose a simple multimodal prompting-driven quantization framework for point cloud analysis. Our methodology is built upon two core insights: 1) Text embeddings from pre-trained models inherently encode visual semantics through many-to-one contrastive alignment, naturally serving as robust prototype priors; and 2) Multimodal prompts enable adaptive refinement of these prototypes, effectively mitigating vision-language semantic gaps. The framework introduces a dual-constrained quantization space, enforced by compactness and separation regularization, which seamlessly integrates visual and prototype features, resulting in hybrid representations that jointly encode geometric and semantic information. Furthermore, we employ Gumbel-Softmax relaxation to achieve differentiable discretization while maintaining quantization sparsity. Extensive experiments on the ModelNet40 and ScanObjectNN datasets clearly demonstrate the superior effectiveness of the proposed method.

[226] Supervised Multilabel Image Classification Using Residual Networks with Probabilistic Reasoning

Lokender Singh, Saksham Kumar, Chandan Kumar

Main category: cs.CV

TL;DR: Novel multilabel image classification method using modified ResNet-101 with probabilistic reasoning, achieving state-of-the-art 0.794 mAP on COCO-2014 dataset.

DetailsMotivation: Address challenges in multilabel image categorization by modeling label dependencies and uncertainties, which are common in real-world computer vision applications.

Method: Modified ResNet-101 architecture integrated with probabilistic reasoning to simulate label dependencies and uncertainties in multilabel scenarios.

Result: Achieved 0.794 mAP on COCO-2014, outperforming ResNet-SRN (0.771) and Vision Transformer baselines (0.785), with improved precision-recall scores.

Conclusion: Integrating probabilistic reasoning into deep learning models effectively addresses multilabel classification challenges and achieves state-of-the-art performance.

Abstract: Multilabel image categorization has drawn interest recently because of its numerous computer vision applications. The proposed work introduces a novel method for classifying multilabel images using the COCO-2014 dataset and a modified ResNet-101 architecture. By simulating label dependencies and uncertainties, the approach uses probabilistic reasoning to improve prediction accuracy. Extensive tests show that the model outperforms earlier techniques and approaches to state-of-the-art outcomes in multilabel categorization. The work also thoroughly assesses the model’s performance using metrics like precision-recall score and achieves 0.794 mAP on COCO-2014, outperforming ResNet-SRN (0.771) and Vision Transformer baselines (0.785). The novelty of the work lies in integrating probabilistic reasoning into deep learning models to effectively address the challenges presented by multilabel scenarios.

[227] SemanticStitch: Enhancing Image Coherence through Foreground-Aware Seam Carving

Ji-Ping Jin, Chen-Bin Feng, Rui Fan, Chi-Man Vong

Main category: cs.CV

TL;DR: SemanticStitch is a deep learning framework that uses semantic priors to preserve foreground object integrity in image stitching, overcoming traditional methods’ limitations with a novel loss function and specialized datasets.

DetailsMotivation: Traditional image stitching methods struggle with misalignments and visual discrepancies due to varying capture angles, positions, and object movements, and they neglect semantic information which disrupts foreground continuity.

Method: Introduces SemanticStitch framework with semantic priors for foreground objects and a novel loss function that emphasizes semantic integrity of salient objects to enhance stitching quality.

Result: Experimental results show substantial improvements over traditional techniques, with the method providing robust support for practical applications as validated on two specialized real-world datasets.

Conclusion: SemanticStitch effectively addresses semantic disruption issues in image stitching by incorporating semantic awareness, significantly improving visual coherence and foreground object preservation compared to traditional approaches.

Abstract: Image stitching often faces challenges due to varying capture angles, positional differences, and object movements, leading to misalignments and visual discrepancies. Traditional seam carving methods neglect semantic information, causing disruptions in foreground continuity. We introduce SemanticStitch, a deep learning-based framework that incorporates semantic priors of foreground objects to preserve their integrity and enhance visual coherence. Our approach includes a novel loss function that emphasizes the semantic integrity of salient objects, significantly improving stitching quality. We also present two specialized real-world datasets to evaluate our method’s effectiveness. Experimental results demonstrate substantial improvements over traditional techniques, providing robust support for practical applications.

[228] Teaching Prompts to Coordinate: Hierarchical Layer-Grouped Prompt Tuning for Continual Learning

Shengqin Jiang, Tianqi Kong, Yuankai Qi, Haokui Zhang, Lina Yao, Quan Z. Sheng, Qingshan Liu, Ming-Hsuan Yang

Main category: cs.CV

TL;DR: Proposes hierarchical layer-grouped prompt tuning for continual learning to reduce catastrophic forgetting by grouping layers and using a shared root prompt to generate sub-prompts.

DetailsMotivation: Existing prompt-based continual learning methods attach independent task-specific prompts to each layer, which provides flexibility but makes layers susceptible to unnecessary updates and increases catastrophic forgetting risk when prompts are aggregated across tasks.

Method: Hierarchical layer-grouped prompt tuning where: (i) layers in same group share similar prompts adjusted by position encoding to preserve pre-trained model relationships, (ii) single task-specific root prompt generates sub-prompts for each layer group to enhance synergy and reduce independence.

Result: Extensive experiments across four benchmarks show favorable performance compared to several state-of-the-art continual learning methods.

Conclusion: The proposed hierarchical layer-grouped prompt tuning method effectively improves model stability and reduces catastrophic forgetting in continual learning scenarios.

Abstract: Prompt-based continual learning methods fine-tune only a small set of additional learnable parameters while keeping the pre-trained model’s parameters frozen. It enables efficient adaptation to new tasks while mitigating the risk of catastrophic forgetting. These methods typically attach one independent task-specific prompt to each layer of pre-trained models to locally modulate its features, ensuring that the layer’s representation aligns with the requirements of the new task. However, although introducing learnable prompts independently at each layer provides high flexibility for adapting to new tasks, this overly flexible tuning could make certain layers susceptible to unnecessary updates. As all prompts till the current task are added together as a final prompt for all seen tasks, the model may easily overwrite feature representations essential to previous tasks, which increases the risk of catastrophic forgetting. To address this issue, we propose a novel hierarchical layer-grouped prompt tuning method for continual learning. It improves model stability in two ways: (i) Layers in the same group share roughly the same prompts, which are adjusted by position encoding. This helps preserve the intrinsic feature relationships and propagation pathways of the pre-trained model within each group. (ii) It utilizes a single task-specific root prompt to learn to generate sub-prompts for each layer group. In this way, all sub-prompts are conditioned on the same root prompt, enhancing their synergy and reducing independence. Extensive experiments across four benchmarks demonstrate that our method achieves favorable performance compared with several state-of-the-art methods.

[229] Learning from Dense Events: Towards Fast Spiking Neural Networks Training via Event Dataset Distillatio

Shuhan Ye, Yi Yu, Qixin Zhang, Chenqi Kong, Qiangqiang Wu, Kun Wang, Xudong Jiang

Main category: cs.CV

TL;DR: PACE is a dataset distillation framework for spiking neural networks (SNNs) and event-based vision that compresses large training datasets into compact synthetic ones, enabling fast SNN training with significant reductions in training time and storage costs.

DetailsMotivation: SNNs are energy-efficient alternatives for event-based vision systems but remain costly to train due to temporal coding, limiting their practical deployment. PACE aims to alleviate this high training cost.

Method: PACE uses two core modules: ST-DSM (which densifies spike-based features and performs spatiotemporal matching of amplitude and phase) and PEQ-N (a plug-and-play probabilistic integer quantizer compatible with standard event-frame pipelines).

Result: PACE outperforms existing methods on DVS-Gesture, CIFAR10-DVS, and N-MNIST datasets, achieving 84.4% accuracy on N-MNIST (85% of full training set performance) while reducing training time by 50× and storage cost by 6000×.

Conclusion: PACE enables minute-scale SNN training and efficient edge deployment by creating compact dataset surrogates that maintain high performance while dramatically reducing computational and storage requirements.

Abstract: Event cameras sense brightness changes and output binary asynchronous event streams, attracting increasing attention. Their bio-inspired dynamics align well with spiking neural networks (SNNs), offering a promising energy-efficient alternative to conventional vision systems. However, SNNs remain costly to train due to temporal coding, which limits their practical deployment. To alleviate the high training cost of SNNs, we introduce \textbf{PACE} (Phase-Aligned Condensation for Events), the first dataset distillation framework to SNNs and event-based vision. PACE distills a large training dataset into a compact synthetic one that enables fast SNN training, which is achieved by two core modules: \textbf{ST-DSM} and \textbf{PEQ-N}. ST-DSM uses residual membrane potentials to densify spike-based features (SDR) and to perform fine-grained spatiotemporal matching of amplitude and phase (ST-SM), while PEQ-N provides a plug-and-play straight through probabilistic integer quantizer compatible with standard event-frame pipelines. Across DVS-Gesture, CIFAR10-DVS, and N-MNIST datasets, PACE outperforms existing coreset selection and dataset distillation baselines, with particularly strong gains on dynamic event streams and at low or moderate IPC. Specifically, on N-MNIST, it achieves (84.4%) accuracy, about (85%) of the full training set performance, while reducing training time by more than (50\times) and storage cost by (6000\times), yielding compact surrogates that enable minute-scale SNN training and efficient edge deployment.

[230] Sparse by Rule: Probability-Based N:M Pruning for Spiking Neural Networks

Shuhan Ye, Yi Yu, Qixin Zhang, Chenqi Kong, Qiangqiang Wu, Xudong Jiang, Dacheng Tao

Main category: cs.CV

TL;DR: SpikeNM is the first semi-structured N:M pruning framework for SNNs that enables hardware-friendly sparse patterns while maintaining accuracy through novel parameterization and neuroscience-inspired distillation.

DetailsMotivation: Existing SNN pruning methods face a trade-off: unstructured pruning achieves high sparsity but is hardware-unfriendly, while structured pruning is deployable but lacks flexibility and often degrades accuracy. There's a need for a middle ground that combines hardware efficiency with accuracy preservation.

Method: SpikeNM uses semi-structured N:M pruning with M-way basis-logit parameterization and differentiable top-k sampler to linearize complexity. It also employs eligibility-inspired distillation (EID) to align mask probabilities with spiking dynamics using temporally accumulated credits as soft targets.

Result: At 2:4 sparsity, SpikeNM maintains and even improves accuracy across mainstream datasets while producing hardware-amenable sparse patterns that complement intrinsic spike sparsity.

Conclusion: SpikeNM successfully bridges the gap between unstructured and structured pruning for SNNs, providing a hardware-friendly semi-structured approach that preserves accuracy and enables more aggressive sparsification through innovative parameterization and neuroscience-inspired techniques.

Abstract: Brain-inspired Spiking neural networks (SNNs) promise energy-efficient intelligence via event-driven, sparse computation, but deeper architectures inflate parameters and computational cost, hindering their edge deployment. Recent progress in SNN pruning helps alleviate this burden, yet existing efforts fall into only two families: \emph{unstructured} pruning, which attains high sparsity but is difficult to accelerate on general hardware, and \emph{structured} pruning, which eases deployment but lack flexibility and often degrades accuracy at matched sparsity. In this work, we introduce \textbf{SpikeNM}, the first SNN-oriented \emph{semi-structured} (N{:}M) pruning framework that learns sparse SNNs \emph{from scratch}, enforcing \emph{at most (N)} non-zeros per (M)-weight block. To avoid the combinatorial space complexity (\sum_{k=1}^{N}\binom{M}{k}) growing exponentially with (M), SpikeNM adopts an (M)-way basis-logit parameterization with a differentiable top-(k) sampler, \emph{linearizing} per-block complexity to (\mathcal O(M)) and enabling more aggressive sparsification. Further inspired by neuroscience, we propose \emph{eligibility-inspired distillation} (EID), which converts temporally accumulated credits into block-wise soft targets to align mask probabilities with spiking dynamics, reducing sampling variance and stabilizing search under high sparsity. Experiments show that at (2{:}4) sparsity, SpikeNM maintains and even with gains across main-stream datasets, while yielding hardware-amenable patterns that complement intrinsic spike sparsity.

[231] DINOv3-Guided Cross Fusion Framework for Semantic-aware CT generation from MRI and CBCT

Xianhao Zhou, Jianghao Wu, Ku Zhao, Jinlong He, Huangxuan Zhao, Lei Chen, Shaoting Zhang, Guotai Wang

Main category: cs.CV

TL;DR: Proposed DGCF framework combines frozen DINOv3 Transformer with trainable CNN for CT synthesis from CBCT/MRI, achieving SOTA performance via hierarchical fusion and semantic-aware losses.

DetailsMotivation: Existing CNN models lack global semantic understanding, while Transformers overfit small medical datasets. Need balanced approach for medical image translation.

Method: DINOv3-Guided Cross Fusion (DGCF) integrates frozen self-supervised DINOv3 Transformer with trainable CNN encoder-decoder, using cross fusion module and Multi-Level DINOv3 Perceptual loss.

Result: Achieved state-of-the-art performance on SynthRAD2023 pelvic dataset for MRI→CT and CBCT→CT translation tasks in terms of MS-SSIM, PSNR and segmentation-based metrics.

Conclusion: First work to employ DINOv3 representations for medical image translation, demonstrating potential of self-supervised Transformer guidance for semantic-aware CT synthesis.

Abstract: Generating synthetic CT images from CBCT or MRI has a potential for efficient radiation dose planning and adaptive radiotherapy. However, existing CNN-based models lack global semantic understanding, while Transformers often overfit small medical datasets due to high model capacity and weak inductive bias. To address these limitations, we propose a DINOv3-Guided Cross Fusion (DGCF) framework that integrates a frozen self-supervised DINOv3 Transformer with a trainable CNN encoder-decoder. It hierarchically fuses global representation of Transformer and local features of CNN via a learnable cross fusion module, achieving balanced local appearance and contextual representation. Furthermore, we introduce a Multi-Level DINOv3 Perceptual (MLDP) loss that encourages semantic similarity between synthetic CT and the ground truth in DINOv3’s feature space. Experiments on the SynthRAD2023 pelvic dataset demonstrate that DGCF achieved state-of-the-art performance in terms of MS-SSIM, PSNR and segmentation-based metrics on both MRI$\rightarrow$CT and CBCT$\rightarrow$CT translation tasks. To the best of our knowledge, this is the first work to employ DINOv3 representations for medical image translation, highlighting the potential of self-supervised Transformer guidance for semantic-aware CT synthesis. The code is available at https://github.com/HiLab-git/DGCF.

[232] Adaptive Begin-of-Video Tokens for Autoregressive Video Diffusion Models

Tianle Cheng, Zeyan Zhang, Kaifeng Gao, Jun Xiao

Main category: cs.CV

TL;DR: Proposes Adaptive Begin-of-Video Tokens (ada-BOV) and refinement strategy for autoregressive video diffusion models to improve long video generation with better consistency and motion dynamics.

DetailsMotivation: Current autoregressive video diffusion models struggle with fragile consistency and poor motion dynamics when generating long videos, either suffering from denoising latency/error accumulation (chunk-based) or consistency issues (stream denoising).

Method: Introduces ada-BOV tokens that adaptively absorb denoised preceding frames via adaptive layer norm modulation, plus a refinement strategy that decouples sampling trajectory length from attention window constraints, and a disturbance-augmented training noise schedule.

Result: Achieves compelling qualitative and quantitative results across multiple metrics, demonstrating improved global consistency and local dynamics in long video generation.

Conclusion: The proposed ada-BOV tokens and refinement strategy effectively address limitations in existing autoregressive video diffusion models, enabling high-quality long video generation with better consistency and motion dynamics.

Abstract: Recent advancements in diffusion-based video generation have produced impressive and high-fidelity short videos. To extend these successes to generate coherent long videos, most video diffusion models (VDMs) generate videos in an autoregressive manner, i.e., generating subsequent frames conditioned on previous ones. There are generally two primary paradigms: chunk-based extension and stream denoising. The former directly concatenates previous clean frames as conditioning, suffering from denoising latency and error accumulation. The latter maintains the denoising sequence with monotonically increasing noise levels. In each denoising iteration, one clean frame is produced while a new pure noise is simultaneously appended, enabling live-stream sampling. However, it struggles with fragile consistency and poor motion dynamics. In this paper, we propose Adaptive Begin-of-Video Tokens (ada-BOV) for autoregressive VDMs. The BOV tokens are special learnable embeddings on VDMs. They adaptively absorb denoised preceding frames via an adaptive-layer-norm-like modulation. This design preserves the global consistency while allowing for flexible conditioning in dynamic scenarios. To ensure the quality of local dynamics essential in modulating BOV tokens, we further propose a refinement strategy for stream denoising. It decouples the sampling trajectory length from the attention window size constraint, leading to improved local guidance and overall imaging quality. We also propose a disturbance-augmented training noise schedule, which balances the convergence speed with model robustness for the stream denoising. Extensive experiments demonstrate that our method achieves compelling qualitative and quantitative results across multiple metrics.

[233] Did Models Sufficient Learn? Attribution-Guided Training via Subset-Selected Counterfactual Augmentation

Yannan Chen, Ruoyu Chen, Bin Zeng, Wei Wang, Shiming Liu, Qunli Zhang, Zheng Hu, Laiyuan Wang, Yaowei Wang, Xiaochun Cao

Main category: cs.CV

TL;DR: SS-CA integrates counterfactual explanations into training to address models’ incomplete causal learning by using attribution methods to identify critical regions, replacing them with natural background, and training on both original and augmented samples.

DetailsMotivation: Models often rely on limited sufficient causes for predictions, making them sensitive to distribution shifts. While attribution methods identify critical regions, masking these areas causes model misclassification while humans still recognize the target, revealing insufficient causal dependencies.

Method: Propose Subset-Selected Counterfactual Augmentation (SS-CA) using Counterfactual LIMA to identify minimal spatial region sets whose removal alters predictions. Replace identified regions with natural background and train jointly on augmented and original samples.

Result: Extensive experiments on ImageNet variants show SS-CA improves in-distribution generalization and achieves superior performance on out-of-distribution benchmarks (ImageNet-R, ImageNet-S). Models also exhibit enhanced generalization under noise perturbations.

Conclusion: SS-CA effectively uses interpretability insights to correct model deficiencies, improving both performance and robustness by mitigating incomplete causal learning through targeted counterfactual augmentation.

Abstract: In current visual model training, models often rely on only limited sufficient causes for their predictions, which makes them sensitive to distribution shifts or the absence of key features. Attribution methods can accurately identify a model’s critical regions. However, masking these areas to create counterfactuals often causes the model to misclassify the target, while humans can still easily recognize it. This divergence highlights that the model’s learned dependencies may not be sufficiently causal. To address this issue, we propose Subset-Selected Counterfactual Augmentation (SS-CA), which integrates counterfactual explanations directly into the training process for targeted intervention. Building on the subset-selection-based LIMA attribution method, we develop Counterfactual LIMA to identify minimal spatial region sets whose removal can selectively alter model predictions. Leveraging these attributions, we introduce a data augmentation strategy that replaces the identified regions with natural background, and we train the model jointly on both augmented and original samples to mitigate incomplete causal learning. Extensive experiments across multiple ImageNet variants show that SS-CA improves generalization on in-distribution (ID) test data and achieves superior performance on out-of-distribution (OOD) benchmarks such as ImageNet-R and ImageNet-S. Under perturbations including noise, models trained with SS-CA also exhibit enhanced generalization, demonstrating that our approach effectively uses interpretability insights to correct model deficiencies and improve both performance and robustness.

[234] BdSL-SPOTER: A Transformer-Based Framework for Bengali Sign Language Recognition with Cultural Adaptation

Sayad Ibna Azad, Md. Atiqur Rahman

Main category: cs.CV

TL;DR: BdSL-SPOTER is a pose-based transformer framework for Bengali Sign Language recognition that achieves 97.92% accuracy with improved efficiency over baseline models.

DetailsMotivation: To develop an accurate and efficient recognition system for Bengali Sign Language (BdSL) that addresses the challenges of limited data and computational constraints in low-resource regional sign languages.

Method: Extends SPOTER paradigm with cultural-specific preprocessing, uses a compact four-layer transformer encoder with optimized learnable positional encodings, and employs curriculum learning for better generalization and faster convergence.

Result: Achieved 97.92% Top-1 validation accuracy on BdSLW60 benchmark, representing 22.82% improvement over Bi-LSTM baseline, with reduced parameters, lower FLOPs, and higher FPS.

Conclusion: BdSL-SPOTER provides a practical framework for real-world accessibility applications and serves as a scalable model for other low-resource regional sign languages.

Abstract: We introduce BdSL-SPOTER, a pose-based transformer framework for accurate and efficient recognition of Bengali Sign Language (BdSL). BdSL-SPOTER extends the SPOTER paradigm with cultural specific preprocessing and a compact four-layer transformer encoder featuring optimized learnable positional encodings, while employing curriculum learning to enhance generalization on limited data and accelerate convergence. On the BdSLW60 benchmark, it achieves 97.92% Top-1 validation accuracy, representing a 22.82% improvement over the Bi-LSTM baseline, all while keeping computational costs low. With its reduced number of parameters, lower FLOPs, and higher FPS, BdSL-SPOTER provides a practical framework for real-world accessibility applications and serves as a scalable model for other low-resource regional sign languages.

[235] TEMPO: Global Temporal Building Density and Height Estimation from Satellite Imagery

Tammy Glazer, Gilles Q. Hacheme, Akram Zaytar, Luana Marotti, Amy Michaels, Girmaw Abebe Tadesse, Kevin White, Rahul Dodhia, Andrew Zolli, Inbal Becker-Reshef, Juan M. Lavista Ferres, Caleb Robinson

Main category: cs.CV

TL;DR: TEMPO is a global dataset providing quarterly building density and height maps from 2018-2025 using deep learning on satellite imagery, achieving high accuracy and temporal stability.

DetailsMotivation: To enable large-scale monitoring of development patterns and climate impacts by creating temporally resolved global building data at low computational cost.

Method: Multi-task deep learning model trained on building footprint/height data paired with quarterly PlanetScope satellite images, predicting at 37.6-meter resolution.

Result: Achieves 85-88% F1 score on validation, 0.96 trend-consistency score, and captures quarterly settlement changes efficiently.

Conclusion: TEMPO provides cost-effective, temporally stable building monitoring essential for global resilience and adaptation efforts.

Abstract: We present TEMPO, a global, temporally resolved dataset of building density and height derived from high-resolution satellite imagery using deep learning models. We pair building footprint and height data from existing datasets with quarterly PlanetScope basemap satellite images to train a multi-task deep learning model that predicts building density and building height at a 37.6-meter per pixel resolution. We apply this model to global PlanetScope basemaps from Q1 2018 through Q2 2025 to create global, temporal maps of building density and height. We validate these maps by comparing against existing building footprint datasets. Our estimates achieve an F1 score between 85% and 88% on different hand-labeled subsets, and are temporally stable, with a 0.96 five-year trend-consistency score. TEMPO captures quarterly changes in built settlements at a fraction of the computational cost of comparable approaches, unlocking large-scale monitoring of development patterns and climate impacts essential for global resilience and adaptation efforts.

[236] Fine-Grained DINO Tuning with Dual Supervision for Face Forgery Detection

Tianxiang Zhang, Peipeng Yu, Zhihua Xia, Longchen Dai, Xiaoyu Zhou, Hui Gao

Main category: cs.CV

TL;DR: DFF-Adapter enhances DINOv2 for deepfake detection by using lightweight multi-head LoRA modules for fine-grained manipulation type classification alongside authenticity detection, achieving state-of-the-art performance with only 3.5M parameters.

DetailsMotivation: Existing deepfake detection methods treat detection as generic binary classification, overlooking distinct artifacts from different manipulation methods, which limits detection effectiveness.

Method: Proposes DFF-Adapter with lightweight multi-head LoRA modules integrated into every transformer block of DINOv2, enabling multi-task learning for both authenticity detection and fine-grained manipulation type classification with shared knowledge propagation.

Result: Achieves detection accuracy comparable to or surpassing current state-of-the-art methods while using only 3.5M trainable parameters, demonstrating parameter efficiency.

Conclusion: The proposed DFF-Adapter effectively addresses deepfake detection by leveraging fine-grained manipulation classification to enhance artifact sensitivity, providing an efficient and powerful solution for information integrity protection.

Abstract: The proliferation of sophisticated deepfakes poses significant threats to information integrity. While DINOv2 shows promise for detection, existing fine-tuning approaches treat it as generic binary classification, overlooking distinct artifacts inherent to different deepfake methods. To address this, we propose a DeepFake Fine-Grained Adapter (DFF-Adapter) for DINOv2. Our method incorporates lightweight multi-head LoRA modules into every transformer block, enabling efficient backbone adaptation. DFF-Adapter simultaneously addresses authenticity detection and fine-grained manipulation type classification, where classifying forgery methods enhances artifact sensitivity. We introduce a shared branch propagating fine-grained manipulation cues to the authenticity head. This enables multi-task cooperative optimization, explicitly enhancing authenticity discrimination with manipulation-specific knowledge. Utilizing only 3.5M trainable parameters, our parameter-efficient approach achieves detection accuracy comparable to or even surpassing that of current complex state-of-the-art methods.

[237] MediRound: Multi-Round Entity-Level Reasoning Segmentation in Medical Images

Qinyue Tong, Ziqian Lu, Jun Liu, Rui Zuo, Zheming Lu

Main category: cs.CV

TL;DR: MEMR-Seg introduces multi-round entity-level reasoning for medical image segmentation, addressing limitations of single-round dialogue methods. The approach includes a new dataset (MR-MedSeg) and baseline model (MediRound) with a Judgment & Correction Mechanism to handle error propagation.

DetailsMotivation: Existing medical segmentation methods are task-specific and lack interactivity. While text-prompt approaches enable user-driven segmentation, they are limited to single-round dialogues and cannot perform multi-round reasoning, which is crucial for complex medical scenarios.

Method: Proposed MEMR-Seg task for multi-round entity-level reasoning segmentation. Built MR-MedSeg dataset with 177K multi-round medical segmentation dialogues. Developed MediRound baseline model with a lightweight Judgment & Correction Mechanism to mitigate error propagation in the chain-like multi-round pipeline.

Result: Experimental results show the method effectively addresses the MEMR-Seg task and outperforms conventional medical referring segmentation methods.

Conclusion: MEMR-Seg enables interactive multi-round reasoning for medical image segmentation, providing a more flexible and reasoning-based approach compared to existing single-round methods.

Abstract: Despite the progress in medical image segmentation, most existing methods remain task-specific and lack interactivity. Although recent text-prompt-based segmentation approaches enhance user-driven and reasoning-based segmentation, they remain confined to single-round dialogues and fail to perform multi-round reasoning. In this work, we introduce Multi-Round Entity-Level Medical Reasoning Segmentation (MEMR-Seg), a new task that requires generating segmentation masks through multi-round queries with entity-level reasoning. To support this task, we construct MR-MedSeg, a large-scale dataset of 177K multi-round medical segmentation dialogues, featuring entity-based reasoning across rounds. Furthermore, we propose MediRound, an effective baseline model designed for multi-round medical reasoning segmentation. To mitigate the inherent error propagation in the chain-like pipeline of multi-round segmentation, we introduce a lightweight yet effective Judgment & Correction Mechanism during model inference. Experimental results demonstrate that our method effectively addresses the MEMR-Seg task and outperforms conventional medical referring segmentation methods.

[238] RadarMP: Motion Perception for 4D mmWave Radar in Autonomous Driving

Ruiqi Cheng, Huijun Di, Jian Li, Feng Liu, Wei Liang

Main category: cs.CV

TL;DR: RadarMP is a unified method for joint radar target detection and 3D scene flow estimation using low-level radar signals, achieving reliable motion perception across various weather conditions without explicit annotations.

DetailsMotivation: 4D mmWave radar provides all-weather operation but suffers from sparse and noisy data, limiting motion perception capabilities when optical sensors fail in adverse weather conditions.

Method: Jointly models radar target detection and motion estimation in unified architecture using self-supervised loss functions guided by Doppler shifts and echo intensity for spatial and motion consistency.

Result: Outperforms radar-based decoupled motion perception pipelines and achieves reliable motion perception across diverse weather and illumination conditions on public datasets.

Conclusion: RadarMP enhances perception capabilities for full-scenario autonomous driving systems by providing precise 3D scene motion perception in challenging weather conditions.

Abstract: Accurate 3D scene motion perception significantly enhances the safety and reliability of an autonomous driving system. Benefiting from its all-weather operational capability and unique perceptual properties, 4D mmWave radar has emerged as an essential component in advanced autonomous driving. However, sparse and noisy radar points often lead to imprecise motion perception, leaving autonomous vehicles with limited sensing capabilities when optical sensors degrade under adverse weather conditions. In this paper, we propose RadarMP, a novel method for precise 3D scene motion perception using low-level radar echo signals from two consecutive frames. Unlike existing methods that separate radar target detection and motion estimation, RadarMP jointly models both tasks in a unified architecture, enabling consistent radar point cloud generation and pointwise 3D scene flow prediction. Tailored to radar characteristics, we design specialized self-supervised loss functions guided by Doppler shifts and echo intensity, effectively supervising spatial and motion consistency without explicit annotations. Extensive experiments on the public dataset demonstrate that RadarMP achieves reliable motion perception across diverse weather and illumination conditions, outperforming radar-based decoupled motion perception pipelines and enhancing perception capabilities for full-scenario autonomous driving systems.

[239] OAD-Promoter: Enhancing Zero-shot VQA using Large Language Models with Object Attribute Description

Quanxing Xu, Ling Zhou, Feifei Zhang, Jinyu Tian, Rubing Huang

Main category: cs.CV

TL;DR: OAD-Promoter is a novel approach that enhances LLM-based Visual Question Answering by mitigating language bias and improving out-of-distribution generalization through object-concentrated example generation, memory knowledge assistance, and optimized prompting.

DetailsMotivation: LLMs in VQA inherit language biases from massive training data, making predictions unreliable and struggling with out-of-distribution generalization despite strong knowledge reasoning capabilities.

Method: OAD-Promoter has three components: Object-concentrated Example Generation (OEG) for global captions and object samples, Memory Knowledge Assistance (MKA) for retrieving relevant knowledge from stored examples, and OAD Prompt for optimized LLM inference.

Result: Experiments show OAD-Promoter significantly improves LLM-based VQA performance in few-shot or zero-shot settings, achieving new state-of-the-art results.

Conclusion: OAD-Promoter effectively addresses language bias and domain-shift challenges in LLM-based VQA, enhancing reliability and generalization capabilities.

Abstract: Large Language Models (LLMs) have become a crucial tool in Visual Question Answering (VQA) for handling knowledge-intensive questions in few-shot or zero-shot scenarios. However, their reliance on massive training datasets often causes them to inherit language biases during the acquisition of knowledge. This limitation imposes two key constraints on existing methods: (1) LLM predictions become less reliable due to bias exploitation, and (2) despite strong knowledge reasoning capabilities, LLMs still struggle with out-of-distribution (OOD) generalization. To address these issues, we propose Object Attribute Description Promoter (OAD-Promoter), a novel approach for enhancing LLM-based VQA by mitigating language bias and improving domain-shift robustness. OAD-Promoter comprises three components: the Object-concentrated Example Generation (OEG) module, the Memory Knowledge Assistance (MKA) module, and the OAD Prompt. The OEG module generates global captions and object-concentrated samples, jointly enhancing visual information input to the LLM and mitigating bias through complementary global and regional visual cues. The MKA module assists the LLM in handling OOD samples by retrieving relevant knowledge from stored examples to support questions from unseen domains. Finally, the OAD Prompt integrates the outputs of the preceding modules to optimize LLM inference. Experiments demonstrate that OAD-Promoter significantly improves the performance of LLM-based VQA methods in few-shot or zero-shot settings, achieving new state-of-the-art results.

[240] Compression and Inference of Spiking Neural Networks on Resource-Constrained Hardware

Karol C. Jurzec, Tomasz Szydlo, Maciej Wielgosz

Main category: cs.CV

TL;DR: A lightweight C-based runtime for efficient SNN inference on edge devices with optimizations that reduce latency and memory while maintaining accuracy.

DetailsMotivation: SNNs offer advantages for temporal processing and energy efficiency on resource-constrained hardware, but training and deployment remain challenging.

Method: Trained models from SNNTorch are translated to compact C representation with static, cache-friendly data layouts and preallocation. Sparse spiking activity is exploited to prune inactive neurons and synapses.

Result: Achieved ~10x speedups on desktop CPU with additional gains from pruning, large memory reductions enabling microcontroller deployment (Arduino Portenta H7), and functional parity with Python baseline on N-MNIST and ST-MNIST datasets.

Conclusion: SNNs can be executed efficiently on conventional embedded platforms when paired with an optimized runtime and spike-driven model compression.

Abstract: Spiking neural networks (SNNs) communicate via discrete spikes in time rather than continuous activations. Their event-driven nature offers advantages for temporal processing and energy efficiency on resource-constrained hardware, but training and deployment remain challenging. We present a lightweight C-based runtime for SNN inference on edge devices and optimizations that reduce latency and memory without sacrificing accuracy. Trained models exported from SNNTorch are translated to a compact C representation; static, cache-friendly data layouts and preallocation avoid interpreter and allocation overheads. We further exploit sparse spiking activity to prune inactive neurons and synapses, shrinking computation in upstream convolutional layers. Experiments on N-MNIST and ST-MNIST show functional parity with the Python baseline while achieving ~10 speedups on desktop CPU and additional gains with pruning, together with large memory reductions that enable microcontroller deployment (Arduino Portenta H7). Results indicate that SNNs can be executed efficiently on conventional embedded platforms when paired with an optimized runtime and spike-driven model compression. Code: https://github.com/karol-jurzec/snn-generator/

[241] MAVIS: A Benchmark for Multimodal Source Attribution in Long-form Visual Question Answering

Seokwon Song, Minsu Park, Gunhee Kim

Main category: cs.CV

TL;DR: MAVIS is the first benchmark for evaluating multimodal source attribution systems that handle visual questions, retrieve multimodal evidence, and generate cited long-form answers.

DetailsMotivation: Existing source attribution work has focused on text-only scenarios and overlooked multimodality, creating a need for multimodal evaluation benchmarks.

Method: Developed a dataset of 157K visual QA instances with fact-level citations to multimodal documents, created fine-grained automatic metrics for informativeness, groundedness, and fluency, and evaluated various prompting methods and LVLMs with multimodal RAG.

Result: LVLMs with multimodal RAG generate more informative and fluent answers than unimodal RAG, but show weaker groundedness for image documents. There’s a trade-off between informativeness and groundedness across prompting methods.

Conclusion: Mitigating contextual bias in interpreting image documents is crucial for future research in multimodal source attribution systems.

Abstract: Source attribution aims to enhance the reliability of AI-generated answers by including references for each statement, helping users validate the provided answers. However, existing work has primarily focused on text-only scenario and largely overlooked the role of multimodality. We introduce MAVIS, the first benchmark designed to evaluate multimodal source attribution systems that understand user intent behind visual questions, retrieve multimodal evidence, and generate long-form answers with citations. Our dataset comprises 157K visual QA instances, where each answer is annotated with fact-level citations referring to multimodal documents. We develop fine-grained automatic metrics along three dimensions of informativeness, groundedness, and fluency, and demonstrate their strong correlation with human judgments. Our key findings are threefold: (1) LVLMs with multimodal RAG generate more informative and fluent answers than unimodal RAG, but they exhibit weaker groundedness for image documents than for text documents, a gap amplified in multimodal settings. (2) Given the same multimodal documents, there is a trade-off between informativeness and groundedness across different prompting methods. (3) Our proposed method highlights mitigating contextual bias in interpreting image documents as a crucial direction for future research. The dataset and experimental code are available at https://github.com/seokwon99/MAVIS

[242] Breaking the Modality Wall: Time-step Mixup for Efficient Spiking Knowledge Transfer from Static to Event Domain

Yuqi Xie, Shuhan Ye, Yi Yu, Chong Wang, Qixin Zhang, Jiazhen Xu, Le Shen, Yuanbin Qian, Jiangbo Qian, Guoqi Li

Main category: cs.CV

TL;DR: TMKT is a cross-modal training framework that uses Time-step Mixup to bridge the gap between RGB and DVS modalities for spiking neural networks, enabling more effective knowledge transfer and better performance in spiking image classification.

DetailsMotivation: Event cameras and SNNs offer energy-efficient visual intelligence, but scarce event data and sparse DVS outputs make training difficult. Existing RGB-to-DVS knowledge transfer methods underperform due to substantial modality distribution gaps.

Method: Proposes TMKT with Time-step Mixup (TSM) strategy that interpolates RGB and DVS inputs at various time steps. Uses two modality-aware objectives: Modality Aware Guidance (MAG) for per-frame supervision and Mixup Ratio Perception (MRP) for sequence-level mix ratio estimation.

Result: TMKT enables smoother knowledge transfer, mitigates modality mismatch during training, and achieves superior performance in spiking image classification tasks across diverse benchmarks and multiple SNN backbones.

Conclusion: The proposed TMKT framework with Time-step Mixup strategy effectively bridges the modality gap between RGB and DVS, providing stable optimization and improved performance for spiking neural networks in visual intelligence tasks.

Abstract: The integration of event cameras and spiking neural networks (SNNs) promises energy-efficient visual intelligence, yet scarce event data and the sparsity of DVS outputs hinder effective training. Prior knowledge transfers from RGB to DVS often underperform because the distribution gap between modalities is substantial. In this work, we present Time-step Mixup Knowledge Transfer (TMKT), a cross-modal training framework with a probabilistic Time-step Mixup (TSM) strategy. TSM exploits the asynchronous nature of SNNs by interpolating RGB and DVS inputs at various time steps to produce a smooth curriculum within each sequence, which reduces gradient variance and stabilizes optimization with theoretical analysis. To employ auxiliary supervision from TSM, TMKT introduces two lightweight modality-aware objectives, Modality Aware Guidance (MAG) for per-frame source supervision and Mixup Ratio Perception (MRP) for sequence-level mix ratio estimation, which explicitly align temporal features with the mixing schedule. TMKT enables smoother knowledge transfer, helps mitigate modality mismatch during training, and achieves superior performance in spiking image classification tasks. Extensive experiments across diverse benchmarks and multiple SNN backbones, together with ablations, demonstrate the effectiveness of our method.

[243] FIA-Edit: Frequency-Interactive Attention for Efficient and High-Fidelity Inversion-Free Text-Guided Image Editing

Kaixiang Yang, Boyang Shen, Xin Li, Yuchen Dai, Yuxuan Luo, Yueran Ma, Wei Fang, Qiang Li, Zhiwei Wang

Main category: cs.CV

TL;DR: FIA-Edit is an inversion-free image editing framework that uses Frequency-Interactive Attention to achieve high-fidelity edits while preserving source structure and semantics, with applications in both general and medical domains.

DetailsMotivation: Existing flow-based inversion-free methods often fail to effectively integrate source information, leading to poor background preservation, spatial inconsistencies, and over-editing.

Method: Uses Frequency-Interactive Attention with two key components: Frequency Representation Interaction (FRI) module for cross-domain alignment via frequency component exchange, and Feature Injection (FIJ) module that incorporates source-side queries, keys, values, and text embeddings into target branch’s cross-attention.

Result: Achieves high-fidelity editing at low computational cost (~6s per 512*512 image on RTX 4090), outperforms existing methods in visual quality, background fidelity, and controllability, and successfully extends to medical applications for surgical image augmentation.

Conclusion: FIA-Edit provides an efficient and effective solution for text-guided image editing that preserves source structure while enabling precise semantic edits, with demonstrated value in both general and clinical applications.

Abstract: Text-guided image editing has advanced rapidly with the rise of diffusion models. While flow-based inversion-free methods offer high efficiency by avoiding latent inversion, they often fail to effectively integrate source information, leading to poor background preservation, spatial inconsistencies, and over-editing due to the lack of effective integration of source information. In this paper, we present FIA-Edit, a novel inversion-free framework that achieves high-fidelity and semantically precise edits through a Frequency-Interactive Attention. Specifically, we design two key components: (1) a Frequency Representation Interaction (FRI) module that enhances cross-domain alignment by exchanging frequency components between source and target features within self-attention, and (2) a Feature Injection (FIJ) module that explicitly incorporates source-side queries, keys, values, and text embeddings into the target branch’s cross-attention to preserve structure and semantics. Comprehensive and extensive experiments demonstrate that FIA-Edit supports high-fidelity editing at low computational cost (~6s per 512 * 512 image on an RTX 4090) and consistently outperforms existing methods across diverse tasks in visual quality, background fidelity, and controllability. Furthermore, we are the first to extend text-guided image editing to clinical applications. By synthesizing anatomically coherent hemorrhage variations in surgical images, FIA-Edit opens new opportunities for medical data augmentation and delivers significant gains in downstream bleeding classification. Our project is available at: https://github.com/kk42yy/FIA-Edit.

[244] Codebook-Centric Deep Hashing: End-to-End Joint Learning of Semantic Hash Centers and Neural Hash Function

Shuo Yin, Zhiyuan Yin, Yuqing Hou, Rui Liu, Yong Chen, Dell Zhang

Main category: cs.CV

TL;DR: CRH is an end-to-end deep hashing framework that dynamically reassigns hash centers from a preset codebook while jointly optimizing the hash function, eliminating the need for explicit center optimization phases and improving semantic relationship integration.

DetailsMotivation: Existing hash center-based methods suffer from random center initialization that ignores inter-class semantic relationships, and two-stage approaches introduce complexity, computational overhead, and performance gaps due to stage-wise discrepancies.

Method: Proposes Center-Reassigned Hashing (CRH) with dynamic hash center reassignment from a preset codebook, joint optimization of hash function, and a multi-head mechanism to enhance representational capacity of hash centers.

Result: Extensive experiments on three benchmarks show CRH learns semantically meaningful hash centers and outperforms state-of-the-art deep hashing methods in retrieval tasks.

Conclusion: CRH provides an effective end-to-end solution for deep hashing that dynamically adapts hash centers to data distribution while capturing richer semantic structures through multi-head mechanisms.

Abstract: Hash center-based deep hashing methods improve upon pairwise or triplet-based approaches by assigning fixed hash centers to each class as learning targets, thereby avoiding the inefficiency of local similarity optimization. However, random center initialization often disregards inter-class semantic relationships. While existing two-stage methods mitigate this by first refining hash centers with semantics and then training the hash function, they introduce additional complexity, computational overhead, and suboptimal performance due to stage-wise discrepancies. To address these limitations, we propose $\textbf{Center-Reassigned Hashing (CRH)}$, an end-to-end framework that $\textbf{dynamically reassigns hash centers}$ from a preset codebook while jointly optimizing the hash function. Unlike previous methods, CRH adapts hash centers to the data distribution $\textbf{without explicit center optimization phases}$, enabling seamless integration of semantic relationships into the learning process. Furthermore, $\textbf{a multi-head mechanism}$ enhances the representational capacity of hash centers, capturing richer semantic structures. Extensive experiments on three benchmarks demonstrate that CRH learns semantically meaningful hash centers and outperforms state-of-the-art deep hashing methods in retrieval tasks.

[245] Rethinking Multimodal Point Cloud Completion: A Completion-by-Correction Perspective

Wang Luo, Di Wu, Hengyuan Na, Yinlin Zhu, Miao Hu, Guocong Quan

Main category: cs.CV

TL;DR: Proposes Completion-by-Correction paradigm for point cloud completion, shifting from unconstrained synthesis to guided refinement using topologically complete shape priors, with PGNet framework achieving state-of-the-art results.

DetailsMotivation: Current multimodal completion methods suffer from structural inconsistencies and topological artifacts due to limited geometric and semantic constraints in the Completion-by-Inpainting paradigm.

Method: PGNet: multi-stage framework with dual-feature encoding, coarse scaffold synthesis, and hierarchical correction via Completion-by-Correction paradigm using pretrained image-to-3D shape priors.

Result: Achieves 23.5% improvement in Chamfer Distance and 7.1% improvement in F-score on ShapeNetViPC dataset compared to state-of-the-art baselines.

Conclusion: Completion-by-Correction paradigm enables structurally consistent and observation-aligned reconstruction by shifting from unconstrained synthesis to guided refinement with topologically complete shape priors.

Abstract: Point cloud completion aims to reconstruct complete 3D shapes from partial observations, which is a challenging problem due to severe occlusions and missing geometry. Despite recent advances in multimodal techniques that leverage complementary RGB images to compensate for missing geometry, most methods still follow a Completion-by-Inpainting paradigm, synthesizing missing structures from fused latent features. We empirically show that this paradigm often results in structural inconsistencies and topological artifacts due to limited geometric and semantic constraints. To address this, we rethink the task and propose a more robust paradigm, termed Completion-by-Correction, which begins with a topologically complete shape prior generated by a pretrained image-to-3D model and performs feature-space correction to align it with the partial observation. This paradigm shifts completion from unconstrained synthesis to guided refinement, enabling structurally consistent and observation-aligned reconstruction. Building upon this paradigm, we introduce PGNet, a multi-stage framework that conducts dual-feature encoding to ground the generative prior, synthesizes a coarse yet structurally aligned scaffold, and progressively refines geometric details via hierarchical correction. Experiments on the ShapeNetViPC dataset demonstrate the superiority of PGNet over state-of-the-art baselines in terms of average Chamfer Distance (-23.5%) and F-score (+7.1%).

[246] MixAR: Mixture Autoregressive Image Generation

Jinyuan Hu, Jiayou Zhang, Shaobo Cui, Kun Zhang, Guangyi Chen

Main category: cs.CV

TL;DR: MixAR is a novel framework that uses mixture training paradigms to inject discrete tokens as prior guidance for continuous autoregressive modeling, improving generation fidelity while maintaining computational efficiency.

DetailsMotivation: Autoregressive approaches in discrete token spaces discard fine-grained information due to quantization and limited codebook size, while continuous latent spaces pose challenges for efficient autoregressive modeling due to their vast and unstructured nature.

Method: MixAR uses factorized formulation with discrete tokens as prior guidance for continuous AR prediction. It explores three mixture strategies: DC-SA (self-attention), DC-CA (cross-attention), and DC-Mix (replacing mask tokens with discrete counterparts). Also introduces TI-Mix to bridge training-inference distribution gaps.

Result: DC-Mix strategy achieves favorable balance between computational efficiency and generation fidelity, and TI-Mix consistently improves performance by aligning training and generation distributions.

Conclusion: MixAR successfully addresses challenges in continuous autoregressive modeling by leveraging discrete tokens as guidance, with DC-Mix and TI-Mix proving particularly effective for improving generation quality while maintaining efficiency.

Abstract: Autoregressive (AR) approaches, which represent images as sequences of discrete tokens from a finite codebook, have achieved remarkable success in image generation. However, the quantization process and the limited codebook size inevitably discard fine-grained information, placing bottlenecks on fidelity. Motivated by this limitation, recent studies have explored autoregressive modeling in continuous latent spaces, which offers higher generation quality. Yet, unlike discrete tokens constrained by a fixed codebook, continuous representations lie in a vast and unstructured space, posing significant challenges for efficient autoregressive modeling. To address these challenges, we introduce MixAR, a novel framework that leverages mixture training paradigms to inject discrete tokens as prior guidance for continuous AR modeling. MixAR is a factorized formulation that leverages discrete tokens as prior guidance for continuous autoregressive prediction. We investigate several discrete-continuous mixture strategies, including self-attention (DC-SA), cross-attention (DC-CA), and a simple approach (DC-Mix) that replaces homogeneous mask tokens with informative discrete counterparts. Moreover, to bridge the gap between ground-truth training tokens and inference tokens produced by the pre-trained AR model, we propose Training-Inference Mixture (TI-Mix) to achieve consistent training and generation distributions. In our experiments, we demonstrate a favorable balance of the DC-Mix strategy between computational efficiency and generation fidelity, and consistent improvement of TI-Mix.

[247] MMRINet: Efficient Mamba-Based Segmentation with Dual-Path Refinement for Low-Resource MRI Analysis

Abdelrahman Elsayed, Ahmed Jaheen, Mohammad Yaqub

Main category: cs.CV

TL;DR: MMRINet is a lightweight brain tumor segmentation model using Mamba state-space models for efficient volumetric context modeling, achieving strong performance with only ~2.5M parameters.

DetailsMotivation: Address the computational challenges of deep 3D networks for brain tumor segmentation in resource-constrained clinical settings where such models are prohibitive.

Method: Proposes MMRINet with linear-complexity Mamba state-space models replacing quadratic attention, Dual-Path Feature Refinement modules for feature diversity, and Progressive Feature Aggregation for multi-scale fusion.

Result: Achieves average Dice score of 0.752 and average HD95 of 12.23 in BraTS-Lighthouse SSA 2025 with only ~2.5M parameters.

Conclusion: MMRINet demonstrates efficient and accurate brain tumor segmentation suitable for low-resource clinical environments.

Abstract: Automated brain tumor segmentation in multi-parametric MRI remains challenging in resource-constrained settings where deep 3D networks are computationally prohibitive. We propose MMRINet, a lightweight architecture that replaces quadratic-complexity attention with linear-complexity Mamba state-space models for efficient volumetric context modeling. Novel Dual-Path Feature Refinement (DPFR) modules maximize feature diversity without additional data requirements, while Progressive Feature Aggregation (PFA) enables effective multi-scale fusion. In the BraTS-Lighthouse SSA 2025, our model achieves strong performance with an average Dice score of (0.752) and an average HD95 of (12.23) with only ~2.5M parameters, demonstrating efficient and accurate segmentation suitable for low-resource clinical environments. Our GitHub repository can be accessed here: github.com/BioMedIA-MBZUAI/MMRINet.

[248] Cross-View Cross-Modal Unsupervised Domain Adaptation for Driver Monitoring System

Aditi Bhalla, Christian Hellert, Enkelejda Kasneci

Main category: cs.CV

TL;DR: A novel two-phase framework for cross-view, cross-modal unsupervised domain adaptation in driver activity recognition that improves real-world deployment robustness by addressing both viewpoint variations and domain shifts.

DetailsMotivation: Driver distraction causes many accidents, but current deep learning methods struggle with real-world deployment due to camera viewpoint variations and domain shifts across different vehicle configurations.

Method: Two-phase approach: 1) Learn view-invariant and action-discriminative features using contrastive learning on multi-view data within a single modality, 2) Perform domain adaptation to new modalities using information bottleneck loss without requiring labeled data from the new domain.

Result: Improves top-1 accuracy on RGB video data by almost 50% compared to supervised contrastive learning-based cross-view methods, and outperforms unsupervised domain adaptation-only methods by up to 5% using the same video transformer backbone.

Conclusion: The proposed joint framework effectively addresses both cross-view generalization and domain adaptation challenges, enabling more robust and scalable deployment of driver monitoring systems across diverse vehicle configurations.

Abstract: Driver distraction remains a leading cause of road traffic accidents, contributing to thousands of fatalities annually across the globe. While deep learning-based driver activity recognition methods have shown promise in detecting such distractions, their effectiveness in real-world deployments is hindered by two critical challenges: variations in camera viewpoints (cross-view) and domain shifts such as change in sensor modality or environment. Existing methods typically address either cross-view generalization or unsupervised domain adaptation in isolation, leaving a gap in the robust and scalable deployment of models across diverse vehicle configurations. In this work, we propose a novel two-phase cross-view, cross-modal unsupervised domain adaptation framework that addresses these challenges jointly on real-time driver monitoring data. In the first phase, we learn view-invariant and action-discriminative features within a single modality using contrastive learning on multi-view data. In the second phase, we perform domain adaptation to a new modality using information bottleneck loss without requiring any labeled data from the new domain. We evaluate our approach using state-of-the art video transformers (Video Swin, MViT) and multi modal driver activity dataset called Drive&Act, demonstrating that our joint framework improves top-1 accuracy on RGB video data by almost 50% compared to a supervised contrastive learning-based cross-view method, and outperforms unsupervised domain adaptation-only methods by up to 5%, using the same video transformer backbone.

[249] Bridging Granularity Gaps: Hierarchical Semantic Learning for Cross-domain Few-shot Segmentation

Sujun Sun, Haowen Gu, Cheng Xie, Yanxu Ren, Mingwu Ren, Haofeng Zhang

Main category: cs.CV

TL;DR: Proposes Hierarchical Semantic Learning (HSL) framework for Cross-domain Few-shot Segmentation to address segmentation granularity gaps, achieving state-of-the-art performance.

DetailsMotivation: Existing CD-FSS methods focus on style gaps but ignore segmentation granularity gaps, leading to insufficient semantic discriminability for novel classes in target domains.

Method: Uses Dual Style Randomization (DSR) for diverse style simulation, Hierarchical Semantic Mining (HSM) with multi-scale superpixels for granularity learning, and Prototype Confidence-modulated Thresholding (PCMT) for segmentation ambiguity reduction.

Result: Extensive experiments on four target domain datasets demonstrate state-of-the-art performance.

Conclusion: The proposed HSL framework effectively addresses segmentation granularity gaps and enhances semantic discriminability in cross-domain few-shot segmentation.

Abstract: Cross-domain Few-shot Segmentation (CD-FSS) aims to segment novel classes from target domains that are not involved in training and have significantly different data distributions from the source domain, using only a few annotated samples, and recent years have witnessed significant progress on this task. However, existing CD-FSS methods primarily focus on style gaps between source and target domains while ignoring segmentation granularity gaps, resulting in insufficient semantic discriminability for novel classes in target domains. Therefore, we propose a Hierarchical Semantic Learning (HSL) framework to tackle this problem. Specifically, we introduce a Dual Style Randomization (DSR) module and a Hierarchical Semantic Mining (HSM) module to learn hierarchical semantic features, thereby enhancing the model’s ability to recognize semantics at varying granularities. DSR simulates target domain data with diverse foreground-background style differences and overall style variations through foreground and global style randomization respectively, while HSM leverages multi-scale superpixels to guide the model to mine intra-class consistency and inter-class distinction at different granularities. Additionally, we also propose a Prototype Confidence-modulated Thresholding (PCMT) module to mitigate segmentation ambiguity when foreground and background are excessively similar. Extensive experiments are conducted on four popular target domain datasets, and the results demonstrate that our method achieves state-of-the-art performance.

[250] OmniSparse: Training-Aware Fine-Grained Sparse Attention for Long-Video MLLMs

Feng Chen, Yefei He, Shaoxuan He, Yuanyu He, Jing Liu, Lequan Lin, Akide Liu, Zhaoyang Li, Jiyuan Zhang, Zhenbang Sun, Bohan Zhuang, Qi Wu

Main category: cs.CV

TL;DR: OmniSparse is a training-aware fine-grained sparse attention framework for long-video MLLMs that achieves 2.7x speedup and 2.4x memory reduction while maintaining full attention performance.

DetailsMotivation: Existing sparse attention methods fail to bridge training-inference gap and lack fine-grained token selection across queries, KV, and heads, leading to suboptimal performance and limited acceleration.

Method: Three adaptive mechanisms: (1) lazy-active query classification, (2) head-level dynamic KV budget allocation based on flattest head, (3) KV cache slimming via selective visual KV fetching.

Result: Matches full attention performance while achieving 2.7x speedup during prefill and 2.4x memory reduction during decoding.

Conclusion: OmniSparse effectively bridges training-inference gap and enables fine-grained sparse attention for long-video MLLMs with significant acceleration and memory efficiency.

Abstract: Existing sparse attention methods primarily target inference-time acceleration by selecting critical tokens under predefined sparsity patterns. However, they often fail to bridge the training-inference gap and lack the capacity for fine-grained token selection across multiple dimensions such as queries, key-values (KV), and heads, leading to suboptimal performance and limited acceleration gains. In this paper, we introduce OmniSparse, a training-aware fine-grained sparse attention framework for long-video MLLMs, which operates in both training and inference with dynamic token budget allocation. Specifically, OmniSparse contains three adaptive and complementary mechanisms: (1) query selection via lazy-active classification, retaining active queries that capture broad semantic similarity while discarding most lazy ones that focus on limited local context and exhibit high functional redundancy; (2) KV selection with head-level dynamic budget allocation, where a shared budget is determined based on the flattest head and applied uniformly across all heads to ensure attention recall; and (3) KV cache slimming to reduce head-level redundancy by selectively fetching visual KV cache according to the head-level decoding query pattern. Experimental results show that OmniSparse matches the performance of full attention while achieving up to 2.7x speedup during prefill and 2.4x memory reduction during decoding.

[251] LSS3D: Learnable Spatial Shifting for Consistent and High-Quality 3D Generation from Single-Image

Zhuojiang Cai, Yiheng Zhang, Meitong Guo, Mingdao Wang, Yuwang Wang

Main category: cs.CV

TL;DR: LSS3D is a high-quality image-to-3D approach that uses learnable spatial shifting to address multi-view inconsistencies and non-frontal input views, achieving superior 3D generation with complete geometric details and clean textures.

DetailsMotivation: Existing multi-view diffusion-based 3D generation methods suffer from shape and texture misalignment across views, leading to incomplete geometric details and textural ghosting, with poor robustness to oblique perspective inputs.

Method: Assign learnable spatial shifting parameters to each view and adjust views toward spatially consistent targets guided by reconstructed mesh, with input view as extra constraint for optimization to enhance robustness to non-frontal angles.

Result: Consistently achieves leading results in both geometric and texture evaluation metrics across flexible input viewpoints, with more complete geometric details and clean textures.

Conclusion: LSS3D effectively handles multi-view inconsistencies and non-frontal input views through learnable spatial shifting, providing high-quality 3D generation and contributing a comprehensive quantitative evaluation pipeline to the community.

Abstract: Recently, multi-view diffusion-based 3D generation methods have gained significant attention. However, these methods often suffer from shape and texture misalignment across generated multi-view images, leading to low-quality 3D generation results, such as incomplete geometric details and textural ghosting. Some methods are mainly optimized for the frontal perspective and exhibit poor robustness to oblique perspective inputs. In this paper, to tackle the above challenges, we propose a high-quality image-to-3D approach, named LSS3D, with learnable spatial shifting to explicitly and effectively handle the multiview inconsistencies and non-frontal input view. Specifically, we assign learnable spatial shifting parameters to each view, and adjust each view towards a spatially consistent target, guided by the reconstructed mesh, resulting in high-quality 3D generation with more complete geometric details and clean textures. Besides, we include the input view as an extra constraint for the optimization, further enhancing robustness to non-frontal input angles, especially for elevated viewpoint inputs. We also provide a comprehensive quantitative evaluation pipeline that can contribute to the community in performance comparisons. Extensive experiments demonstrate that our method consistently achieves leading results in both geometric and texture evaluation metrics across more flexible input viewpoints.

[252] GeoMVD: Geometry-Enhanced Multi-View Generation Model Based on Geometric Information Extraction

Jiaqi Wu, Yaosen Chen, Shuyuan Zhu

Main category: cs.CV

TL;DR: Proposes Geometry-guided Multi-View Diffusion Model for consistent multi-view image generation using geometric information extraction and attention mechanisms.

DetailsMotivation: Address computational challenges in maintaining cross-view consistency and generating high-resolution outputs in multi-view image generation for applications like 3D reconstruction, VR, and AR.

Method: Uses multi-view geometry extraction with depth/normal maps and segmentation masks, decoupled geometry-enhanced attention, adaptive learning strategy, iterative refinement, and dynamic geometry intensity adjustment.

Result: Generates images that are both consistent across views and rich in detail, improving overall image quality and detail preservation.

Conclusion: The proposed model effectively addresses cross-view consistency and detail quality issues in multi-view image generation through geometric guidance and adaptive mechanisms.

Abstract: Multi-view image generation holds significant application value in computer vision, particularly in domains like 3D reconstruction, virtual reality, and augmented reality. Most existing methods, which rely on extending single images, face notable computational challenges in maintaining cross-view consistency and generating high-resolution outputs. To address these issues, we propose the Geometry-guided Multi-View Diffusion Model, which incorporates mechanisms for extracting multi-view geometric information and adjusting the intensity of geometric features to generate images that are both consistent across views and rich in detail. Specifically, we design a multi-view geometry information extraction module that leverages depth maps, normal maps, and foreground segmentation masks to construct a shared geometric structure, ensuring shape and structural consistency across different views. To enhance consistency and detail restoration during generation, we develop a decoupled geometry-enhanced attention mechanism that strengthens feature focus on key geometric details, thereby improving overall image quality and detail preservation. Furthermore, we apply an adaptive learning strategy that fine-tunes the model to better capture spatial relationships and visual coherence between the generated views, ensuring realistic results. Our model also incorporates an iterative refinement process that progressively improves the output quality through multiple stages of image generation. Finally, a dynamic geometry information intensity adjustment mechanism is proposed to adaptively regulate the influence of geometric data, optimizing overall quality while ensuring the naturalness of generated images. More details can be found on the project page: https://github.com/SobeyMIL/GeoMVD.com.

[253] A Novel AI-Driven System for Real-Time Detection of Mirror Absence, Helmet Non-Compliance, and License Plates Using YOLOv8 and OCR

Nishant Vasantkumar Hegde, Aditi Agarwal, Minal Moharir

Main category: cs.CV

TL;DR: AI system using YOLOv8 and EasyOCR automates traffic violation detection for helmet laws and rear-view mirror presence, achieving high precision (0.9147) and recall (0.886) with real-time monitoring capabilities.

DetailsMotivation: Manual enforcement of helmet laws and vehicle safety standards is resource-intensive and inconsistent, creating a need for automated solutions to enhance road safety enforcement efficiency.

Method: Leverages YOLOv8 for object detection and EasyOCR for license plate recognition, trained on custom annotated dataset with data augmentation. Uses Streamlit-based interface for real-time monitoring and includes advanced image preprocessing for challenging conditions.

Result: Achieves overall precision of 0.9147, recall of 0.886, mAP@50 of 0.843, and mAP@50:95 of 0.503, demonstrating strong detection capability under strict IoU thresholds.

Conclusion: Presents a practical and effective automated solution for traffic rule enforcement with considerations for real-world deployment, particularly innovative for detecting rear-view mirror absence on motorcycles.

Abstract: Road safety is a critical global concern, with manual enforcement of helmet laws and vehicle safety standards (e.g., rear-view mirror presence) being resource-intensive and inconsistent. This paper presents an AI-powered system to automate traffic violation detection, significantly enhancing enforcement efficiency and road safety. The system leverages YOLOv8 for robust object detection and EasyOCR for license plate recognition. Trained on a custom dataset of annotated images (augmented for diversity), it identifies helmet non-compliance, the absence of rear-view mirrors on motorcycles, an innovative contribution to automated checks, and extracts vehicle registration numbers. A Streamlit-based interface facilitates real-time monitoring and violation logging. Advanced image preprocessing enhances license plate recognition, particularly under challenging conditions. Based on evaluation results, the model achieves an overall precision of 0.9147, a recall of 0.886, and a mean Average Precision (mAP@50) of 0.843. The mAP@50 95 of 0.503 further indicates strong detection capability under stricter IoU thresholds. This work demonstrates a practical and effective solution for automated traffic rule enforcement, with considerations for real-world deployment discussed.

[254] Mixture of States: Routing Token-Level Dynamics for Multimodal Generation

Haozhe Liu, Ding Liu, Mingchen Zhuge, Zijian Zhou, Tian Xie, Sen He, Yukang Yang, Shuming Liu, Yuren Cong, Jiadong Guo, Hongyu Xu, Ke Xu, Kam-Woh Ng, Juan C. Pérez, Juan-Manuel~Pérez-Rúa, Tao Xiang, Wei Liu, Shikun Liu, Jürgen Schmidhuber

Main category: cs.CV

TL;DR: MoS introduces a novel fusion paradigm for multimodal diffusion models using token-wise routing to create flexible, state-based interactions between modalities with minimal computational overhead.

DetailsMotivation: To develop a more efficient and flexible fusion mechanism for multimodal diffusion models that can precisely align token-level features across modalities while maintaining computational efficiency.

Method: Uses a learnable token-wise router with ε-greedy training strategy that sparsely selects top-k hidden states, creating denoising timestep- and input-dependent interactions between modalities.

Result: Achieves state-of-the-art results in text-to-image generation and editing with only 3B to 5B parameters, matching or surpassing models up to 4× larger.

Conclusion: MoS establishes a flexible and compute-efficient paradigm for scaling multimodal diffusion models through sparse, state-based modality fusion.

Abstract: We introduce MoS (Mixture of States), a novel fusion paradigm for multimodal diffusion models that merges modalities using flexible, state-based interactions. The core of MoS is a learnable, token-wise router that creates denoising timestep- and input-dependent interactions between modalities’ hidden states, precisely aligning token-level features with the diffusion trajectory. This router sparsely selects the top-$k$ hidden states and is trained with an $ε$-greedy strategy, efficiently selecting contextual features with minimal learnable parameters and negligible computational overhead. We validate our design with text-to-image generation (MoS-Image) and editing (MoS-Editing), which achieve state-of-the-art results. With only 3B to 5B parameters, our models match or surpass counterparts up to $4\times$ larger. These findings establish MoS as a flexible and compute-efficient paradigm for scaling multimodal diffusion models.

[255] FaNe: Towards Fine-Grained Cross-Modal Contrast with False-Negative Reduction and Text-Conditioned Sparse Attention

Peng Zhang, Zhihui Lai, Wenting Chen, Xu Wu, Heng Kong

Main category: cs.CV

TL;DR: FaNe is a semantic-enhanced vision-language pre-training framework that addresses false negatives from similar texts and improves fine-grained cross-modal alignment in medical imaging.

DetailsMotivation: Existing medical VLP methods suffer from false negatives caused by semantically similar texts and lack sufficient fine-grained cross-modal alignment, limiting their effectiveness.

Method: Uses semantic-aware positive pair mining with adaptive normalization, text-conditioned sparse attention pooling for localized alignment, and hard-negative aware contrastive loss with adaptive reweighting.

Result: Achieves state-of-the-art performance on five medical imaging benchmarks for classification, detection, and segmentation tasks.

Conclusion: FaNe effectively addresses false negatives and enables fine-grained cross-modal alignment, demonstrating superior performance in medical vision-language pre-training.

Abstract: Medical vision-language pre-training (VLP) offers significant potential for advancing medical image understanding by leveraging paired image-report data. However, existing methods are limited by Fa}lse Negatives (FaNe) induced by semantically similar texts and insufficient fine-grained cross-modal alignment. To address these limitations, we propose FaNe, a semantic-enhanced VLP framework. To mitigate false negatives, we introduce a semantic-aware positive pair mining strategy based on text-text similarity with adaptive normalization. Furthermore, we design a text-conditioned sparse attention pooling module to enable fine-grained image-text alignment through localized visual representations guided by textual cues. To strengthen intra-modal discrimination, we develop a hard-negative aware contrastive loss that adaptively reweights semantically similar negatives. Extensive experiments on five downstream medical imaging benchmarks demonstrate that FaNe achieves state-of-the-art performance across image classification, object detection, and semantic segmentation, validating the effectiveness of our framework.

[256] MiniGPT-Pancreas: Multimodal Large Language Model for Pancreas Cancer Classification and Detection

Andrea Moglia, Elia Clement Nastasio, Luca Mainardi, Pietro Cerveri

Main category: cs.CV

TL;DR: MiniGPT-Pancreas is a multimodal AI chatbot that integrates CT scans and text to assist clinicians in pancreas cancer diagnosis, achieving good classification accuracy but needing improvement in tumor detection.

DetailsMotivation: Pancreas imaging is challenging due to the organ's small size, blurred boundaries, and variable shape/position among patients, creating a need for AI assistance in cancer diagnosis.

Method: Fine-tuned MiniGPT-v2 model using cascaded training for pancreas detection, tumor classification, and tumor detection with multimodal prompts combining questions and CT scans from NIH, MSD, and AbdomenCT-1k datasets.

Result: Achieved IoU of 0.595/0.550 for pancreas detection, 87.6% accuracy for cancer classification, and multi-organ detection with liver IoU 0.8399, but lower tumor detection IoU of 0.168.

Conclusion: Promising solution for pancreas tumor classification but needs improvement in detection tasks, especially for pancreas tumors.

Abstract: Problem: Pancreas radiological imaging is challenging due to the small size, blurred boundaries, and variability of shape and position of the organ among patients. Goal: In this work we present MiniGPT-Pancreas, a Multimodal Large Language Model (MLLM), as an interactive chatbot to support clinicians in pancreas cancer diagnosis by integrating visual and textual information. Methods: MiniGPT-v2, a general-purpose MLLM, was fine-tuned in a cascaded way for pancreas detection, tumor classification, and tumor detection with multimodal prompts combining questions and computed tomography scans from the National Institute of Health (NIH), and Medical Segmentation Decathlon (MSD) datasets. The AbdomenCT-1k dataset was used to detect the liver, spleen, kidney, and pancreas. Results: MiniGPT-Pancreas achieved an Intersection over Union (IoU) of 0.595 and 0.550 for the detection of pancreas on NIH and MSD datasets, respectively. For the pancreas cancer classification task on the MSD dataset, accuracy, precision, and recall were 0.876, 0.874, and 0.878, respectively. When evaluating MiniGPT-Pancreas on the AbdomenCT-1k dataset for multi-organ detection, the IoU was 0.8399 for the liver, 0.722 for the kidney, 0.705 for the spleen, and 0.497 for the pancreas. For the pancreas tumor detection task, the IoU score was 0.168 on the MSD dataset. Conclusions: MiniGPT-Pancreas represents a promising solution to support clinicians in the classification of pancreas images with pancreas tumors. Future research is needed to improve the score on the detection task, especially for pancreas tumors.

[257] Suppressing VLM Hallucinations with Spectral Representation Filtering

Ameen Ali, Tamim Zoabi, Lior Wolf

Main category: cs.CV

TL;DR: SRF is a training-free method that reduces hallucinations in vision-language models by filtering out low-rank hallucination modes from feature representations using spectral analysis.

DetailsMotivation: Vision-language models often produce hallucinations due to over-reliance on language priors and imprecise cross-modal grounding, leading to descriptions of non-existent objects, attributes, or relations.

Method: Spectral Representation Filtering (SRF) identifies hallucination modes through eigendecomposition of covariance differences between truthful and hallucinatory caption features, then applies a soft spectral filter to attenuate these modes in deeper VLM layers.

Result: SRF consistently reduces hallucination rates across LLaVA-1.5, MiniGPT-4, and mPLUG-Owl2 models on MSCOCO, POPE-VQA, and other benchmarks, achieving state-of-the-art faithfulness without degrading caption quality.

Conclusion: SRF provides an effective post-hoc solution for hallucination suppression that requires no training, incurs zero inference overhead, and works across multiple VLM families while preserving semantic fidelity.

Abstract: Vision-language models (VLMs) frequently produce hallucinations in the form of descriptions of objects, attributes, or relations that do not exist in the image due to over-reliance on language priors and imprecise cross-modal grounding. We introduce Spectral Representation Filtering (SRF), a lightweight, training-free method to suppress such hallucinations by analyzing and correcting the covariance structure of the model’s representations. SRF identifies low-rank hallucination modes through eigendecomposition of the covariance of the differences between features collected for truthful and hallucinatory captions, revealing structured biases in the feature space. A soft spectral filter then attenuates these modes in the feed-forward projection weights of deeper vLLM layers, equalizing feature variance while preserving semantic fidelity. Unlike decoding or retraining-based approaches, SRF operates entirely post-hoc, incurs zero inference overhead, and requires no architectural modifications. Across three families of VLMs (LLaVA-1.5, MiniGPT-4, and mPLUG-Owl2), SRF consistently reduces hallucination rates on MSCOCO, POPE-VQA, and other visual tasks benchmarks, achieving state-of-the-art faithfulness without degrading caption quality.

[258] Model Inversion Attack Against Deep Hashing

Dongdong Zhao, Qiben Xu, Ranxin Fang, Baogang Song

Main category: cs.CV

TL;DR: DHMI is the first diffusion-based model inversion attack framework for deep hashing systems, successfully reconstructing high-quality training images from hash codes even in black-box settings where no training hash codes are available.

DetailsMotivation: Deep hashing introduces severe privacy risks as hash codes can potentially be used to reconstruct original training data, leading to threats like biometric forgery and privacy breaches. However, model inversion attacks specifically for deep hashing remain unexplored due to challenges like inaccessibility of training hash codes and the discrete Hamming space.

Method: DHMI clusters an auxiliary dataset to derive semantic hash centers as surrogate anchors, then uses surrogate-guided denoising optimization with a novel attack metric (fusing classification consistency and hash proximity) to dynamically select candidate samples. A cluster of surrogate models guides refinement to generate high-fidelity, semantically consistent images.

Result: Experiments on multiple datasets show DHMI successfully reconstructs high-resolution, high-quality images even under the most challenging black-box setting. It outperforms existing state-of-the-art model inversion attacks in black-box scenarios.

Conclusion: DHMI demonstrates practical efficacy and confirms the critical privacy risks inherent in deep hashing systems, highlighting the need for better privacy protection in deep hashing applications.

Abstract: Deep hashing improves retrieval efficiency through compact binary codes, yet it introduces severe and often overlooked privacy risks. The ability to reconstruct original training data from hash codes could lead to serious threats such as biometric forgery and privacy breaches. However, model inversion attacks specifically targeting deep hashing models remain unexplored, leaving their security implications unexamined. This research gap stems from the inaccessibility of genuine training hash codes and the highly discrete Hamming space, which prevents existing methods from adapting to deep hashing. To address these challenges, we propose DHMI, the first diffusion-based model inversion framework designed for deep hashing. DHMI first clusters an auxiliary dataset to derive semantic hash centers as surrogate anchors. It then introduces a surrogate-guided denoising optimization method that leverages a novel attack metric (fusing classification consistency and hash proximity) to dynamically select candidate samples. A cluster of surrogate models guides the refinement of these candidates, ensuring the generation of high-fidelity and semantically consistent images. Experiments on multiple datasets demonstrate that DHMI successfully reconstructs high-resolution, high-quality images even under the most challenging black-box setting, where no training hash codes are available. Our method outperforms the existing state-of-the-art model inversion attacks in black-box scenarios, confirming both its practical efficacy and the critical privacy risks inherent in deep hashing systems.

[259] Fusionista2.0: Efficiency Retrieval System for Large-Scale Datasets

Huy M. Le, Dat Tien Nguyen, Phuc Binh Nguyen, Gia-Bao Le-Tran, Phu Truong Thien, Cuong Dinh, Minh Nguyen, Nga Nguyen, Thuy T. N. Nguyen, Huy Gia Ngo, Tan Nhat Nguyen, Binh T. Nguyen, Monojit Choudhury

Main category: cs.CV

TL;DR: Fusionista2.0 is an optimized video retrieval system that reduces retrieval time by 75% while improving accuracy and user satisfaction through streamlined processing and an enhanced interface.

DetailsMotivation: To meet the Video Browser Showdown's demand for accurate results under strict time constraints by creating a faster, more user-friendly video retrieval system.

Method: Re-engineered core modules: ffmpeg for fast keyframe extraction, Vintern-1B-v3.5 for multilingual OCR, faster-whisper for real-time ASR, and lightweight vision-language models for efficient question answering. Also redesigned UI for better responsiveness and workflow.

Result: Retrieval time reduced by up to 75%, with increased accuracy and user satisfaction. System proved competitive for large-scale video search.

Conclusion: Fusionista2.0 successfully delivers a fast, accurate, and user-friendly video retrieval system optimized for time-constrained scenarios like VBS.

Abstract: The Video Browser Showdown (VBS) challenges systems to deliver accurate results under strict time constraints. To meet this demand, we present Fusionista2.0, a streamlined video retrieval system optimized for speed and usability. All core modules were re-engineered for efficiency: preprocessing now relies on ffmpeg for fast keyframe extraction, optical character recognition uses Vintern-1B-v3.5 for robust multilingual text recognition, and automatic speech recognition employs faster-whisper for real-time transcription. For question answering, lightweight vision-language models provide quick responses without the heavy cost of large models. Beyond these technical upgrades, Fusionista2.0 introduces a redesigned user interface with improved responsiveness, accessibility, and workflow efficiency, enabling even non-expert users to retrieve relevant content rapidly. Evaluations demonstrate that retrieval time was reduced by up to 75% while accuracy and user satisfaction both increased, confirming Fusionista2.0 as a competitive and user-friendly system for large-scale video search.

[260] A Disease-Aware Dual-Stage Framework for Chest X-ray Report Generation

Puzhen Wu, Hexin Dong, Yi Lin, Yihao Ding, Yifan Peng

Main category: cs.CV

TL;DR: A dual-stage disease-aware framework for chest X-ray report generation that learns disease-specific semantic tokens and integrates them with visual features to improve clinical accuracy.

DetailsMotivation: Existing approaches lack sufficient disease-awareness in visual representations and adequate vision-language alignment for medical image analysis, leading to overlooked pathological features and inaccurate reports.

Method: Two-stage approach: Stage 1 learns Disease-Aware Semantic Tokens (DASTs) via cross-attention and multi-label classification with contrastive learning; Stage 2 uses Disease-Visual Attention Fusion (DVAF) and Dual-Modal Similarity Retrieval (DMSR) to integrate disease representations and retrieve relevant exemplars.

Result: Achieves state-of-the-art performance on CheXpert Plus, IU X-ray, and MIMIC-CXR datasets with significant improvements in clinical accuracy and linguistic quality.

Conclusion: The proposed disease-aware framework effectively addresses limitations in existing chest X-ray report generation methods by enhancing disease awareness and vision-language alignment.

Abstract: Radiology report generation from chest X-rays is an important task in artificial intelligence with the potential to greatly reduce radiologists’ workload and shorten patient wait times. Despite recent advances, existing approaches often lack sufficient disease-awareness in visual representations and adequate vision-language alignment to meet the specialized requirements of medical image analysis. As a result, these models usually overlook critical pathological features on chest X-rays and struggle to generate clinically accurate reports. To address these limitations, we propose a novel dual-stage disease-aware framework for chest X-ray report generation. In Stage1, our model learns Disease-Aware Semantic Tokens (DASTs) corresponding to specific pathology categories through cross-attention mechanisms and multi-label classification, while simultaneously aligning vision and language representations via contrastive learning. In Stage2, we introduce a Disease-Visual Attention Fusion (DVAF) module to integrate disease-aware representations with visual features, along with a Dual-Modal Similarity Retrieval (DMSR) mechanism that combines visual and disease-specific similarities to retrieve relevant exemplars, providing contextual guidance during report generation. Extensive experiments on benchmark datasets (i.e., CheXpert Plus, IU X-ray, and MIMIC-CXR) demonstrate that our disease-aware framework achieves state-of-the-art performance in chest X-ray report generation, with significant improvements in clinical accuracy and linguistic quality.

[261] CrossVid: A Comprehensive Benchmark for Evaluating Cross-Video Reasoning in Multimodal Large Language Models

Jingyao Li, Jingyun Wang, Molin Tan, Haochen Wang, Cilin Yan, Likun Shi, Jiayin Cai, Xiaolong Jiang, Yao Hu

Main category: cs.CV

TL;DR: CrossVid is the first benchmark for evaluating multimodal large language models’ cross-video reasoning capabilities, featuring 5,331 videos and 9,015 QA pairs across 10 tasks.

DetailsMotivation: Existing video understanding benchmarks focus on single-video analysis and lack assessment of models' ability to reason across multiple videos simultaneously, which is crucial for real-world video understanding scenarios.

Method: Created CrossVid benchmark with hierarchical tasks covering four high-level dimensions and ten specific tasks, providing diverse question formats (single-choice, multiple-choice, open-ended) to comprehensively evaluate spatial-temporal reasoning in cross-video contexts.

Result: Gemini-2.5-Pro performed best with 50.4% average accuracy, but most MLLMs struggle with cross-video reasoning due to inability to integrate or compare evidence across multiple videos.

Conclusion: CrossVid reveals significant limitations in current MLLMs’ cross-video reasoning capabilities and provides a foundation for future improvements in this area.

Abstract: Cross-Video Reasoning (CVR) presents a significant challenge in video understanding, which requires simultaneous understanding of multiple videos to aggregate and compare information across groups of videos. Most existing video understanding benchmarks focus on single-video analysis, failing to assess the ability of multimodal large language models (MLLMs) to simultaneously reason over various videos. Recent benchmarks evaluate MLLMs’ capabilities on multi-view videos that capture different perspectives of the same scene. However, their limited tasks hinder a thorough assessment of MLLMs in diverse real-world CVR scenarios. To this end, we introduce CrossVid, the first benchmark designed to comprehensively evaluate MLLMs’ spatial-temporal reasoning ability in cross-video contexts. Firstly, CrossVid encompasses a wide spectrum of hierarchical tasks, comprising four high-level dimensions and ten specific tasks, thereby closely reflecting the complex and varied nature of real-world video understanding. Secondly, CrossVid provides 5,331 videos, along with 9,015 challenging question-answering pairs, spanning single-choice, multiple-choice, and open-ended question formats. Through extensive experiments on various open-source and closed-source MLLMs, we observe that Gemini-2.5-Pro performs best on CrossVid, achieving an average accuracy of 50.4%. Notably, our in-depth case study demonstrates that most current MLLMs struggle with CVR tasks, primarily due to their inability to integrate or compare evidence distributed across multiple videos for reasoning. These insights highlight the potential of CrossVid to guide future advancements in enhancing MLLMs’ CVR capabilities.

[262] ZoomEarth: Active Perception for Ultra-High-Resolution Geospatial Vision-Language Tasks

Ruixun Liu, Bowen Fu, Jiayi Song, Kaiyu Li, Wanchen Li, Lanxuan Xue, Hui Qiao, Weizhan Zhang, Deyu Meng, Xiangyong Cao

Main category: cs.CV

TL;DR: ZoomEarth is an adaptive cropping-zooming framework for ultra-high-resolution remote sensing images that uses active perception to revisit information-rich regions, achieving state-of-the-art performance on multiple benchmarks.

DetailsMotivation: Existing methods for processing ultra-high-resolution remote sensing images suffer from redundancy when handling finer visual inputs due to passive perception paradigms, limiting effective processing of rich fine-grained information.

Method: Proposed ZoomEarth framework with adaptive cropping-zooming using Region-Guided reward, trained via supervised fine-tuning and Group Relative Policy Optimization on the LRS-GRO benchmark dataset containing 17 question types across global, region, and object levels.

Result: Achieves state-of-the-art performance on LRS-GRO and zero-shot performance on three public UHR remote sensing benchmarks. Can be integrated with downstream models for cloud removal, denoising, segmentation, and image editing.

Conclusion: ZoomEarth demonstrates strong versatility and extensibility through active perception paradigm, enabling effective processing of ultra-high-resolution remote sensing images with seamless integration into various downstream applications.

Abstract: Ultra-high-resolution (UHR) remote sensing (RS) images offer rich fine-grained information but also present challenges in effective processing. Existing dynamic resolution and token pruning methods are constrained by a passive perception paradigm, suffering from increased redundancy when obtaining finer visual inputs. In this work, we explore a new active perception paradigm that enables models to revisit information-rich regions. First, we present LRS-GRO, a large-scale benchmark dataset tailored for active perception in UHR RS processing, encompassing 17 question types across global, region, and object levels, annotated via a semi-automatic pipeline. Building on LRS-GRO, we propose ZoomEarth, an adaptive cropping-zooming framework with a novel Region-Guided reward that provides fine-grained guidance. Trained via supervised fine-tuning (SFT) and Group Relative Policy Optimization (GRPO), ZoomEarth achieves state-of-the-art performance on LRS-GRO and, in the zero-shot setting, on three public UHR remote sensing benchmarks. Furthermore, ZoomEarth can be seamlessly integrated with downstream models for tasks such as cloud removal, denoising, segmentation, and image editing through simple tool interfaces, demonstrating strong versatility and extensibility.

[263] TM-UNet: Token-Memory Enhanced Sequential Modeling for Efficient Medical Image Segmentation

Yaxuan Jiao, Qing Xu, Yuxiang Luo, Xiangjian He, Zhen Chen, Wenting Duan

Main category: cs.CV

TL;DR: TM-UNet is a lightweight medical image segmentation framework that uses token sequence modeling with memory mechanisms to achieve efficient global reasoning with linear complexity.

DetailsMotivation: Transformer-based methods achieve good results but have high computational costs that hinder clinical deployment, requiring more efficient alternatives.

Method: Proposes TM-UNet with multi-scale token-memory (MSTM) blocks that transform 2D features into token sequences using spatial scanning and matrix memory cells to selectively retain contextual information with exponential gating and parallel pooling.

Result: TM-UNet outperforms state-of-the-art methods across diverse medical segmentation tasks with substantially reduced computation cost.

Conclusion: The proposed token-memory mechanism enables efficient global reasoning for medical segmentation with linear complexity, making it suitable for clinical deployment.

Abstract: Medical image segmentation is essential for clinical diagnosis and treatment planning. Although transformer-based methods have achieved remarkable results, their high computational cost hinders clinical deployment. To address this issue, we propose TM-UNet, a novel lightweight framework that integrates token sequence modeling with an efficient memory mechanism for efficient medical segmentation. Specifically, we introduce a multi-scale token-memory (MSTM) block that transforms 2D spatial features into token sequences through strategic spatial scanning, leveraging matrix memory cells to selectively retain and propagate discriminative contextual information across tokens. This novel token-memory mechanism acts as a dynamic knowledge store that captures long-range dependencies with linear complexity, enabling efficient global reasoning without redundant computation. Our MSTM block further incorporates exponential gating to identify token effectiveness and multi-scale contextual extraction via parallel pooling operations, enabling hierarchical representation learning without computational overhead. Extensive experiments demonstrate that TM-UNet outperforms state-of-the-art methods across diverse medical segmentation tasks with substantially reduced computation cost. The code is available at https://github.com/xq141839/TM-UNet.

[264] D$^{3}$ToM: Decider-Guided Dynamic Token Merging for Accelerating Diffusion MLLMs

Shuochen Chang, Xiaofeng Zhang, Qingyang Liu, Li Niu

Main category: cs.CV

TL;DR: D³ToM is a dynamic token merging method that accelerates diffusion-based multimodal LLMs by selectively merging redundant visual tokens during denoising steps, reducing computational complexity while maintaining performance.

DetailsMotivation: Diffusion MLLMs suffer from slow inference due to full bidirectional self-attention over entire sequences, resulting in cubic decoding complexity that becomes impractical with thousands of visual tokens.

Method: Uses decider tokens from previous denoising steps to build importance maps, maintains most salient tokens, and merges redundant ones through similarity-based aggregation. Features dynamic merge ratios that vary with each denoising step.

Result: Extensive experiments show D³ToM accelerates inference while preserving competitive performance under equivalent computational budgets.

Conclusion: D³ToM provides an effective plug-and-play solution for accelerating Diffusion MLLMs through dynamic token merging without altering model parameters.

Abstract: Diffusion-based multimodal large language models (Diffusion MLLMs) have recently demonstrated impressive non-autoregressive generative capabilities across vision-and-language tasks. However, Diffusion MLLMs exhibit substantially slower inference than autoregressive models: Each denoising step employs full bidirectional self-attention over the entire sequence, resulting in cubic decoding complexity that becomes computationally impractical with thousands of visual tokens. To address this challenge, we propose D$^{3}$ToM, a Decider-guided dynamic token merging method that dynamically merges redundant visual tokens at different denoising steps to accelerate inference in Diffusion MLLMs. At each denoising step, D$^{3}$ToM uses decider tokens-the tokens generated in the previous denoising step-to build an importance map over all visual tokens. Then it maintains a proportion of the most salient tokens and merges the remainder through similarity-based aggregation. This plug-and-play module integrates into a single transformer layer, physically shortening the visual token sequence for all subsequent layers without altering model parameters. Moreover, D$^{3}$ToM employs a merge ratio that dynamically varies with each denoising step, aligns with the native decoding process of Diffusion MLLMs, achieving superior performance under equivalent computational budgets. Extensive experiments show that D$^{3}$ToM accelerates inference while preserving competitive performance. The code is released at https://github.com/bcmi/D3ToM-Diffusion-MLLM.

[265] One target to align them all: LiDAR, RGB and event cameras extrinsic calibration for Autonomous Driving

Andrea Bertogalli, Giacomo Boracchi, Luca Magri

Main category: cs.CV

TL;DR: Novel multi-modal extrinsic calibration framework for event cameras, LiDARs, and RGB cameras using a specialized 3D calibration target with features for each modality, enabling one-shot joint calibration instead of pairwise methods.

DetailsMotivation: Need for precise multi-sensor alignment in autonomous driving systems, particularly addressing the challenging calibration of event cameras alongside traditional sensors like LiDAR and RGB cameras.

Method: Uses a custom 3D calibration target with planes for LiDARs, ChArUco patterns for RGB cameras, and active LED patterns for event cameras, enabling simultaneous perception by all three modalities in a single calibration process.

Result: Validated through extensive experiments on custom autonomous driving dataset, demonstrating accurate and robust calibration performance across all three sensor types.

Conclusion: The proposed framework successfully enables one-shot joint extrinsic calibration of multi-modal sensor systems, providing a practical solution for complex autonomous driving vision systems where precise sensor alignment is critical.

Abstract: We present a novel multi-modal extrinsic calibration framework designed to simultaneously estimate the relative poses between event cameras, LiDARs, and RGB cameras, with particular focus on the challenging event camera calibration. Core of our approach is a novel 3D calibration target, specifically designed and constructed to be concurrently perceived by all three sensing modalities. The target encodes features in planes, ChArUco, and active LED patterns, each tailored to the unique characteristics of LiDARs, RGB cameras, and event cameras respectively. This unique design enables a one-shot, joint extrinsic calibration process, in contrast to existing approaches that typically rely on separate, pairwise calibrations. Our calibration pipeline is designed to accurately calibrate complex vision systems in the context of autonomous driving, where precise multi-sensor alignment is critical. We validate our approach through an extensive experimental evaluation on a custom built dataset, recorded with an advanced autonomous driving sensor setup, confirming the accuracy and robustness of our method.

[266] Rethinking Bias in Generative Data Augmentation for Medical AI: a Frequency Recalibration Method

Chi Liu, Jincheng Liu, Congcong Zhu, Minghao Wang, Sheng Shen, Jia Gu, Tianqing Zhu, Wanlei Zhou

Main category: cs.CV

TL;DR: FreRec addresses bias in generative data augmentation for medical AI by recalibrating frequency misalignment between real and synthesized images, improving downstream classification performance.

DetailsMotivation: Medical AI suffers from data scarcity and generative data augmentation (GDA) often introduces detrimental biases, particularly frequency misalignment that harms downstream tasks.

Method: Proposes Frequency Recalibration (FreRec) with two components: Statistical High-frequency Replacement (SHR) for rough alignment and Reconstructive High-frequency Mapping (RHM) for quality enhancement and detail reconstruction.

Result: Extensive experiments on brain MRIs, chest X-rays, and fundus images show FreRec significantly improves medical image classification performance compared to uncalibrated AI-synthesized samples.

Conclusion: FreRec is an effective standalone post-processing method compatible with any generative model that reduces frequency distributional discrepancy and enhances GDA reliability in medical applications.

Abstract: Developing Medical AI relies on large datasets and easily suffers from data scarcity. Generative data augmentation (GDA) using AI generative models offers a solution to synthesize realistic medical images. However, the bias in GDA is often underestimated in medical domains, with concerns about the risk of introducing detrimental features generated by AI and harming downstream tasks. This paper identifies the frequency misalignment between real and synthesized images as one of the key factors underlying unreliable GDA and proposes the Frequency Recalibration (FreRec) method to reduce the frequency distributional discrepancy and thus improve GDA. FreRec involves (1) Statistical High-frequency Replacement (SHR) to roughly align high-frequency components and (2) Reconstructive High-frequency Mapping (RHM) to enhance image quality and reconstruct high-frequency details. Extensive experiments were conducted in various medical datasets, including brain MRIs, chest X-rays, and fundus images. The results show that FreRec significantly improves downstream medical image classification performance compared to uncalibrated AI-synthesized samples. FreRec is a standalone post-processing step that is compatible with any generative model and can integrate seamlessly with common medical GDA pipelines.

[267] LiDAR-GS++:Improving LiDAR Gaussian Reconstruction via Diffusion Priors

Qifeng Chen, Jiarun Liu, Rengan Xie, Tao Tang, Sicong Du, Yiru Zhao, Yuchi Huo, Sheng Yang

Main category: cs.CV

TL;DR: LiDAR-GS++ enhances LiDAR Gaussian Splatting with diffusion priors to address artifacts in extrapolated novel view synthesis, achieving state-of-the-art performance for both interpolated and extrapolated viewpoints.

DetailsMotivation: Current GS-based LiDAR rendering methods suffer from artifacts in extrapolated novel view synthesis due to incomplete reconstruction from single traversal scans.

Method: Introduces a controllable LiDAR generation model using diffusion priors conditioned on coarsely extrapolated rendering to produce extra geometry-consistent scans, with an effective distillation mechanism for expansive reconstruction.

Result: Achieves state-of-the-art performance on multiple public datasets for both interpolated and extrapolated viewpoints, surpassing existing GS and NeRF-based methods.

Conclusion: LiDAR-GS++ enables real-time, high-fidelity LiDAR re-simulation with global geometric consistency while preserving detailed scene surfaces, addressing limitations of single-scan reconstruction.

Abstract: Recent GS-based rendering has made significant progress for LiDAR, surpassing Neural Radiance Fields (NeRF) in both quality and speed. However, these methods exhibit artifacts in extrapolated novel view synthesis due to the incomplete reconstruction from single traversal scans. To address this limitation, we present LiDAR-GS++, a LiDAR Gaussian Splatting reconstruction method enhanced by diffusion priors for real-time and high-fidelity re-simulation on public urban roads. Specifically, we introduce a controllable LiDAR generation model conditioned on coarsely extrapolated rendering to produce extra geometry-consistent scans and employ an effective distillation mechanism for expansive reconstruction. By extending reconstruction to under-fitted regions, our approach ensures global geometric consistency for extrapolative novel views while preserving detailed scene surfaces captured by sensors. Experiments on multiple public datasets demonstrate that LiDAR-GS++ achieves state-of-the-art performance for both interpolated and extrapolated viewpoints, surpassing existing GS and NeRF-based methods.

[268] DenseAnnotate: Enabling Scalable Dense Caption Collection for Images and 3D Scenes via Spoken Descriptions

Xiaoyu Lin, Aniket Ghorpade, Hansheng Zhu, Justin Qiu, Dea Rrozhani, Monica Lama, Mick Yang, Zixuan Bian, Ruohan Ren, Alan B. Hong, Jiatao Gu, Chris Callison-Burch

Main category: cs.CV

TL;DR: DenseAnnotate is an audio-driven platform that enables efficient creation of dense, fine-grained annotations for images and 3D assets through speech narration, addressing limitations of traditional text-based annotation methods.

DetailsMotivation: Current training datasets for multimodal LLMs rely on sparse annotations from internet mining or manual typing, which capture only limited visual content. Dense annotations are more valuable but scarce, especially for specialized domains like multicultural imagery and 3D assets.

Method: An audio-driven online annotation platform where annotators narrate observations aloud while synchronously linking spoken phrases to image regions or 3D scene parts. The platform incorporates speech-to-text transcription and region-of-attention marking.

Result: Created a human-annotated multimodal dataset with 3,531 images, 898 3D scenes, 7,460 3D objects, and audio-aligned dense annotations in 20 languages. Models trained on this dataset showed 5% improvement in multilingual, 47% in cultural alignment, and 54% in 3D spatial capabilities.

Conclusion: DenseAnnotate provides a feasible approach for creating high-quality dense annotations that significantly enhance model performance across multiple domains, making it valuable for future vision-language research and various multimodal tasks.

Abstract: With the rapid adoption of multimodal large language models (MLLMs) across diverse applications, there is a pressing need for task-centered, high-quality training data. A key limitation of current training datasets is their reliance on sparse annotations mined from the Internet or entered via manual typing that capture only a fraction of an image’s visual content. Dense annotations are more valuable but remain scarce. Traditional text-based annotation pipelines are poorly suited for creating dense annotations: typing limits expressiveness, slows annotation speed, and underrepresents nuanced visual features, especially in specialized areas such as multicultural imagery and 3D asset annotation. In this paper, we present DenseAnnotate, an audio-driven online annotation platform that enables efficient creation of dense, fine-grained annotations for images and 3D assets. Annotators narrate observations aloud while synchronously linking spoken phrases to image regions or 3D scene parts. Our platform incorporates speech-to-text transcription and region-of-attention marking. To demonstrate the effectiveness of DenseAnnotate, we conducted case studies involving over 1,000 annotators across two domains: culturally diverse images and 3D scenes. We curate a human-annotated multi-modal dataset of 3,531 images, 898 3D scenes, and 7,460 3D objects, with audio-aligned dense annotations in 20 languages, including 8,746 image captions, 2,000 scene captions, and 19,000 object captions. Models trained on this dataset exhibit improvements of 5% in multilingual, 47% in cultural alignment, and 54% in 3D spatial capabilities. Our results show that our platform offers a feasible approach for future vision-language research and can be applied to various tasks and diverse types of data.

[269] Learning Time in Static Classifiers

Xi Ding, Lei Wang, Piotr Koniusz, Yongsheng Gao

Main category: cs.CV

TL;DR: A framework that adds temporal reasoning to standard classifiers without architectural changes, using a Support-Exemplar-Query paradigm to learn from temporally coherent data trajectories.

DetailsMotivation: Real-world visual data evolves gradually over time, but conventional classifiers assume temporal independence, limiting their ability to capture dynamic changes in pose, lighting, object state, or scene context.

Method: Support-Exemplar-Query (SEQ) learning paradigm that structures training data into temporally coherent trajectories, enabling learning of class-specific temporal prototypes and sequence alignment via differentiable soft-DTW loss with multi-term objective for semantic consistency and temporal smoothness.

Result: Effective in both static and temporal tasks: enhances performance on fine-grained and ultra-fine-grained image classification, and delivers precise, temporally consistent predictions in video anomaly detection.

Conclusion: The approach bridges static and temporal learning in a modular, data-efficient manner using only simple classifiers on pre-extracted features, introducing temporal inductive bias through loss design alone.

Abstract: Real-world visual data rarely presents as isolated, static instances. Instead, it often evolves gradually over time through variations in pose, lighting, object state, or scene context. However, conventional classifiers are typically trained under the assumption of temporal independence, limiting their ability to capture such dynamics. We propose a simple yet effective framework that equips standard feedforward classifiers with temporal reasoning, all without modifying model architectures or introducing recurrent modules. At the heart of our approach is a novel Support-Exemplar-Query (SEQ) learning paradigm, which structures training data into temporally coherent trajectories. These trajectories enable the model to learn class-specific temporal prototypes and align prediction sequences via a differentiable soft-DTW loss. A multi-term objective further promotes semantic consistency and temporal smoothness. By interpreting input sequences as evolving feature trajectories, our method introduces a strong temporal inductive bias through loss design alone. This proves highly effective in both static and temporal tasks: it enhances performance on fine-grained and ultra-fine-grained image classification, and delivers precise, temporally consistent predictions in video anomaly detection. Despite its simplicity, our approach bridges static and temporal learning in a modular and data-efficient manner, requiring only a simple classifier on top of pre-extracted features.

[270] Co-Layout: LLM-driven Co-optimization for Interior Layout

Chucheng Xiang, Ruchao Bao, Biyin Feng, Wenzheng Wu, Zhongyuan Liu, Yirui Guan, Ligang Liu

Main category: cs.CV

TL;DR: A framework combining LLMs and integer programming for automated interior design that jointly optimizes room layout and furniture placement from text prompts.

DetailsMotivation: To create an automated interior design system that can interpret textual descriptions and generate optimal spatial arrangements while addressing key design requirements like connectivity and accessibility.

Method: Uses LLM-driven agent workflow to extract design constraints, encodes them into grid-based representation, and applies coarse-to-fine integer programming optimization with corridor connectivity, room accessibility, and spatial exclusivity constraints.

Result: Significantly outperforms existing two-stage design pipelines in solution quality and achieves notable computational efficiency through the coarse-to-fine optimization strategy.

Conclusion: The joint optimization framework effectively combines LLM interpretation with mathematical optimization to produce high-quality interior designs efficiently.

Abstract: We present a novel framework for automated interior design that combines large language models (LLMs) with grid-based integer programming to jointly optimize room layout and furniture placement. Given a textual prompt, the LLM-driven agent workflow extracts structured design constraints related to room configurations and furniture arrangements. These constraints are encoded into a unified grid-based representation inspired by ``Modulor". Our formulation accounts for key design requirements, including corridor connectivity, room accessibility, spatial exclusivity, and user-specified preferences. To improve computational efficiency, we adopt a coarse-to-fine optimization strategy that begins with a low-resolution grid to solve a simplified problem and guides the solution at the full resolution. Experimental results across diverse scenarios demonstrate that our joint optimization approach significantly outperforms existing two-stage design pipelines in solution quality, and achieves notable computational efficiency through the coarse-to-fine strategy.

[271] SpaceVLM: Sub-Space Modeling of Negation in Vision-Language Models

Sepehr Kazemi Ranjbar, Kumail Alhamoud, Marzyeh Ghassemi

Main category: cs.CV

TL;DR: Training-free framework improves VLMs’ negation understanding by modeling negation as a subspace in embedding space, achieving 30% average improvement without compromising zero-shot performance.

DetailsMotivation: VLMs struggle with negation and existing fine-tuning methods compromise zero-shot performance on affirmative prompts.

Method: Models negation as a subspace in joint embedding space using spherical caps around embeddings, scoring images by proximity to affirmative and distance from negated concepts.

Result: 30% average improvement on negation understanding across retrieval, MCQ, and text-to-image tasks, closing gap between affirmative and negated prompts.

Conclusion: Proposed training-free method effectively handles negation in VLMs while preserving zero-shot capabilities that fine-tuned models lose.

Abstract: Vision-Language Models (VLMs) struggle with negation. Given a prompt like “retrieve (or generate) a street scene without pedestrians,” they often fail to respect the “not.” Existing methods address this limitation by fine-tuning on large negation datasets, but such retraining often compromises the model’s zero-shot performance on affirmative prompts. We show that the embedding space of VLMs, such as CLIP, can be divided into semantically consistent subspaces. Based on this property, we propose a training-free framework that models negation as a subspace in the joint embedding space rather than a single point (Figure 1). To find the matching image for a caption such as “A but not N,” we construct two spherical caps around the embeddings of A and N, and we score images by the central direction of the region that is close to A and far from N. Across retrieval, MCQ, and text-to-image tasks, our method improves negation understanding by about 30% on average over prior methods. It closes the gap between affirmative and negated prompts while preserving the zero-shot performance that fine-tuned models fail to maintain. Code will be released upon publication.

[272] Ground Plane Projection for Improved Traffic Analytics at Intersections

Sajjad Pakdamansavoji, Kumar Vaibhav Jha, Baher Abdulhai, James H Elder

Main category: cs.CV

TL;DR: Back-projecting vehicle detections from infrastructure cameras to ground plane coordinates improves turning movement count accuracy compared to image plane analysis, with multi-camera fusion providing highest accuracy.

DetailsMotivation: Accurate turning movement counts at intersections are crucial for signal control, traffic management and urban planning, but current computer vision systems analyze traffic in the image plane which may not be optimal.

Method: Back-projecting vehicle detections from infrastructure cameras to real-world 3D ground plane coordinates, with both single-camera and multi-camera weak fusion approaches.

Result: Back-projection yields more accurate trajectory classification and turning movement counts than image plane analysis, with multi-camera fusion achieving even higher accuracy.

Conclusion: Traffic should be analyzed on the ground plane rather than the image plane for more accurate turning movement counts.

Abstract: Accurate turning movement counts at intersections are important for signal control, traffic management and urban planning. Computer vision systems for automatic turning movement counts typically rely on visual analysis in the image plane of an infrastructure camera. Here we explore potential advantages of back-projecting vehicles detected in one or more infrastructure cameras to the ground plane for analysis in real-world 3D coordinates. For single-camera systems we find that back-projection yields more accurate trajectory classification and turning movement counts. We further show that even higher accuracy can be achieved through weak fusion of back-projected detections from multiple cameras. These results suggeest that traffic should be analyzed on the ground plane, not the image plane

[273] CLAReSNet: When Convolution Meets Latent Attention for Hyperspectral Image Classification

Asmit Bandyopadhyay, Anindita Das Bhattacharjee, Rakesh Das

Main category: cs.CV

TL;DR: CLAReSNet is a hybrid CNN-transformer architecture for hyperspectral image classification that combines multi-scale convolutional feature extraction with transformer-style attention via an adaptive latent bottleneck, achieving state-of-the-art performance on benchmark datasets.

DetailsMotivation: HSI classification faces challenges including high spectral dimensionality, complex spectral-spatial correlations, limited training samples, and severe class imbalance. Existing methods using CNNs or transformers in isolation yield suboptimal results due to quadratic complexity and insufficient inductive biases.

Method: Proposes CLAReSNet with multi-scale convolutional stem with deep residual blocks and enhanced CBAM for spatial features, followed by spectral encoder combining bidirectional RNNs with Multi-Scale Spectral Latent Attention (MSLA) that reduces complexity from O(T²D) to O(Tlog(T)D) via adaptive latent token allocation.

Result: Achieved state-of-the-art performance on Indian Pines (99.71% accuracy) and Salinas (99.96% accuracy) datasets, significantly surpassing HybridSN, SSRN, and SpectralFormer. Learned embeddings show superior inter-class separability and compact intra-class clustering.

Conclusion: CLAReSNet effectively addresses HSI classification challenges under limited samples and severe class imbalance through its hybrid architecture that integrates multi-scale convolutional extraction with transformer-style attention via adaptive latent bottleneck.

Abstract: Hyperspectral image (HSI) classification faces critical challenges, including high spectral dimensionality, complex spectral-spatial correlations, and limited training samples with severe class imbalance. While CNNs excel at local feature extraction and transformers capture long-range dependencies, their isolated application yields suboptimal results due to quadratic complexity and insufficient inductive biases. We propose CLAReSNet (Convolutional Latent Attention Residual Spectral Network), a hybrid architecture that integrates multi-scale convolutional extraction with transformer-style attention via an adaptive latent bottleneck. The model employs a multi-scale convolutional stem with deep residual blocks and an enhanced Convolutional Block Attention Module for hierarchical spatial features, followed by spectral encoder layers combining bidirectional RNNs (LSTM/GRU) with Multi-Scale Spectral Latent Attention (MSLA). MSLA reduces complexity from $\mathcal{O}(T^2D)$ to $\mathcal{O}(T\log(T)D)$ by adaptive latent token allocation (8-64 tokens) that scales logarithmically with the sequence length. Hierarchical cross-attention fusion dynamically aggregates multi-level representations for robust classification. Experiments conducted on the Indian Pines and Salinas datasets show state-of-the-art performance, achieving overall accuracies of 99.71% and 99.96%, significantly surpassing HybridSN, SSRN, and SpectralFormer. The learned embeddings exhibit superior inter-class separability and compact intra-class clustering, validating CLAReSNet’s effectiveness under limited samples and severe class imbalance.

[274] Explainable AI-Generated Image Detection RewardBench

Michael Yang, Shijian Deng, William T. Doan, Kai Wang, Tianyu Yang, Harsh Singh, Yapeng Tian

Main category: cs.CV

TL;DR: XAIGID-RewardBench is the first benchmark to evaluate MLLMs’ ability to judge explanations for AI-generated image detection, revealing a significant gap between current models and human-level performance.

DetailsMotivation: Current AI-generated image detection methods lack explainable reasoning, and while MLLMs are being used to generate explanations, their ability to evaluate these explanations hasn't been properly studied.

Method: Created a benchmark with ~3,000 annotated triplets from various image generation models and MLLMs as detectors, using MLLMs as reward models to judge explanation quality.

Result: Best reward model scored 88.76% on the benchmark, while human inter-annotator agreement reached 98.30%, showing a significant performance gap between MLLMs and humans.

Conclusion: There remains a visible gap between current MLLMs’ reasoning abilities and human-level performance in judging explanations for AI-generated image detection, with common pitfalls identified.

Abstract: Conventional, classification-based AI-generated image detection methods cannot explain why an image is considered real or AI-generated in a way a human expert would, which reduces the trustworthiness and persuasiveness of these detection tools for real-world applications. Leveraging Multimodal Large Language Models (MLLMs) has recently become a trending solution to this issue. Further, to evaluate the quality of generated explanations, a common approach is to adopt an “MLLM as a judge” methodology to evaluate explanations generated by other MLLMs. However, how well those MLLMs perform when judging explanations for AI-generated image detection generated by themselves or other MLLMs has not been well studied. We therefore propose \textbf{XAIGID-RewardBench}, the first benchmark designed to evaluate the ability of current MLLMs to judge the quality of explanations about whether an image is real or AI-generated. The benchmark consists of approximately 3,000 annotated triplets sourced from various image generation models and MLLMs as policy models (detectors) to assess the capabilities of current MLLMs as reward models (judges). Our results show that the current best reward model scored 88.76% on this benchmark (while human inter-annotator agreement reaches 98.30%), demonstrating that a visible gap remains between the reasoning abilities of today’s MLLMs and human-level performance. In addition, we provide an analysis of common pitfalls that these models frequently encounter. Code and benchmark are available at https://github.com/RewardBench/XAIGID-RewardBench.

[275] Constructing and Interpreting Digital Twin Representations for Visual Reasoning via Reinforcement Learning

Yiqing Shen, Mathias Unberath

Main category: cs.CV

TL;DR: DT-R1 is a reinforcement learning framework that uses digital twin representations for unified visual reasoning across multiple modalities and tasks, outperforming task-specific models.

DetailsMotivation: Existing visual reasoning approaches require task-specific architectures and training, preventing unified solutions and limiting cross-task generalization.

Method: Train large language models using GRPO with a novel reward that validates structural integrity and output accuracy to construct digital twin representations of visual inputs.

Result: DT-R1 achieves consistent improvements over state-of-the-art task-specific models across six visual reasoning benchmarks covering two modalities and four task types.

Conclusion: DT-R1 demonstrates that visual reasoning can emerge from reinforcement learning with digital twin representations, opening a new research direction.

Abstract: Visual reasoning may require models to interpret images and videos and respond to implicit text queries across diverse output formats, from pixel-level segmentation masks to natural language descriptions. Existing approaches rely on supervised fine-tuning with task-specific architectures. For example, reasoning segmentation, grounding, summarization, and visual question answering each demand distinct model designs and training, preventing unified solutions and limiting cross-task and cross-modality generalization. Hence, we propose DT-R1, a reinforcement learning framework that trains large language models to construct digital twin representations of complex multi-modal visual inputs and then reason over these high-level representations as a unified approach to visual reasoning. Specifically, we train DT-R1 using GRPO with a novel reward that validates both structural integrity and output accuracy. Evaluations in six visual reasoning benchmarks, covering two modalities and four task types, demonstrate that DT-R1 consistently achieves improvements over state-of-the-art task-specific models. DT-R1 opens a new direction where visual reasoning emerges from reinforcement learning with digital twin representations.

[276] Fast Reasoning Segmentation for Images and Videos

Yiqing Shen, Mathias Unberath

Main category: cs.CV

TL;DR: FastReasonSeg enables efficient reasoning segmentation by using digital twin representations and a novel distillation approach that transfers multi-step reasoning capabilities to smaller models suitable for edge devices.

DetailsMotivation: Existing reasoning segmentation methods require large multimodal models that exceed computational capabilities of edge devices where embodied AI systems are deployed. Current distillation approaches fail to transfer multi-step reasoning capabilities needed for reasoning segmentation.

Method: Proposes FastReasonSeg using digital twin representations to decouple perception from reasoning. Uses two-stage distillation: supervised fine-tuning on teacher-generated reasoning chains followed by reinforcement fine-tuning with joint rewards for segmentation accuracy and reasoning quality.

Result: Achieves state-of-the-art performance on two video (JiTBench, RVTBench) and two image benchmarks (ReasonSeg, LLM-Seg40K). The distilled 0.6B variant outperforms models with 20x more parameters while achieving 7.79 FPS throughput with only 2.1GB memory consumption.

Conclusion: FastReasonSeg enables efficient deployment in resource-constrained environments for real-time reasoning segmentation, making it suitable for embodied agents operating autonomously in real-world environments.

Abstract: Reasoning segmentation enables open-set object segmentation via implicit text queries, therefore serving as a foundation for embodied agents that should operate autonomously in real-world environments. However, existing methods for reasoning segmentation require multimodal large language models with billions of parameters that exceed the computational capabilities of edge devices that typically deploy the embodied AI systems. Distillation offers a pathway to compress these models while preserving their capabilities. Yet, existing distillation approaches fail to transfer the multi-step reasoning capabilities that reasoning segmentation demands, as they focus on matching output predictions and intermediate features rather than preserving reasoning chains. The emerging paradigm of reasoning over digital twin representations presents an opportunity for more effective distillation by re-framing the problem. Consequently, we propose FastReasonSeg, which employs digital twin representations that decouple perception from reasoning to enable more effective distillation. Our distillation scheme first relies on supervised fine-tuning on teacher-generated reasoning chains. Then it is followed by reinforcement fine-tuning with joint rewards evaluating both segmentation accuracy and reasoning quality alignment. Experiments on two video (JiTBench, RVTBench) and two image benchmarks (ReasonSeg, LLM-Seg40K) demonstrate that our FastReasonSeg achieves state-of-the-art reasoning segmentation performance. Moreover, the distilled 0.6B variant outperforms models with 20 times more parameters while achieving 7.79 FPS throughput with only 2.1GB memory consumption. This efficiency enables deployment in resource-constrained environments to enable real-time reasoning segmentation.

[277] Changes in Real Time: Online Scene Change Detection with Multi-View Fusion

Chamuditha Jayanga Galappaththige, Jason Lai, Lloyd Windrim, Donald Dansereau, Niko Sünderhauf, Dimity Miller

Main category: cs.CV

TL;DR: First online scene change detection method that is pose-agnostic, label-free, and ensures multi-view consistency, achieving SOTA performance at over 10 FPS while surpassing offline approaches.

DetailsMotivation: Existing online SCD methods are significantly less accurate than offline approaches, and there's a need for pose-agnostic, label-free methods that ensure multi-view consistency while operating in real-time.

Method: Self-supervised fusion loss to infer scene changes from multiple cues, PnP-based fast pose estimation against reference scene, and fast change-guided update strategy for 3D Gaussian Splatting scene representation.

Result: Outperforms both online and offline baselines on complex real-world datasets, achieving new state-of-the-art performance while operating at over 10 FPS.

Conclusion: The approach successfully addresses the challenges of online scene change detection by combining efficient pose estimation, self-supervised learning, and optimized scene representation updates.

Abstract: Online Scene Change Detection (SCD) is an extremely challenging problem that requires an agent to detect relevant changes on the fly while observing the scene from unconstrained viewpoints. Existing online SCD methods are significantly less accurate than offline approaches. We present the first online SCD approach that is pose-agnostic, label-free, and ensures multi-view consistency, while operating at over 10 FPS and achieving new state-of-the-art performance, surpassing even the best offline approaches. Our method introduces a new self-supervised fusion loss to infer scene changes from multiple cues and observations, PnP-based fast pose estimation against the reference scene, and a fast change-guided update strategy for the 3D Gaussian Splatting scene representation. Extensive experiments on complex real-world datasets demonstrate that our approach outperforms both online and offline baselines.

[278] Reasoning Text-to-Video Retrieval via Digital Twin Video Representations and Large Language Models

Yiqing Shen, Chenxiao Fan, Chenjia Li, Mathias Unberath

Main category: cs.CV

TL;DR: The paper introduces reasoning text-to-video retrieval to handle implicit queries requiring reasoning, using digital twin representations and large language models to outperform existing methods by over 50 percentage points.

DetailsMotivation: Existing text-to-video retrieval methods fail with implicit queries that require reasoning, as they only handle explicit descriptions of visual content.

Method: Two-stage framework: 1) compositional alignment between sub-queries and digital twin representations for candidate identification, 2) LLM-based reasoning with just-in-time refinement using specialist models to address information gaps.

Result: Achieves 81.2% R@1 on ReasonT2VBench-135 (outperforming strongest baseline by >50 percentage points), 81.7% R@1 on ReasonT2VBench-1000, and state-of-the-art results on MSR-VTT, MSVD, and VATEX benchmarks.

Conclusion: The proposed reasoning text-to-video retrieval paradigm successfully handles implicit queries through structured scene representations and LLM reasoning, significantly outperforming existing methods.

Abstract: The goal of text-to-video retrieval is to search large databases for relevant videos based on text queries. Existing methods have progressed to handling explicit queries where the visual content of interest is described explicitly; however, they fail with implicit queries where identifying videos relevant to the query requires reasoning. We introduce reasoning text-to-video retrieval, a paradigm that extends traditional retrieval to process implicit queries through reasoning while providing object-level grounding masks that identify which entities satisfy the query conditions. Instead of relying on vision-language models directly, we propose representing video content as digital twins, i.e., structured scene representations that decompose salient objects through specialist vision models. This approach is beneficial because it enables large language models to reason directly over long-horizon video content without visual token compression. Specifically, our two-stage framework first performs compositional alignment between decomposed sub-queries and digital twin representations for candidate identification, then applies large language model-based reasoning with just-in-time refinement that invokes additional specialist models to address information gaps. We construct a benchmark of 447 manually created implicit queries with 135 videos (ReasonT2VBench-135) and another more challenging version of 1000 videos (ReasonT2VBench-1000). Our method achieves 81.2% R@1 on ReasonT2VBench-135, outperforming the strongest baseline by greater than 50 percentage points, and maintains 81.7% R@1 on the extended configuration while establishing state-of-the-art results in three conventional benchmarks (MSR-VTT, MSVD, and VATEX).

[279] AGGRNet: Selective Feature Extraction and Aggregation for Enhanced Medical Image Classification

Ansh Makwe, Akansh Agrawal, Prateek Jain, Akshan Agrawal, Priyanka Bagade

Main category: cs.CV

TL;DR: AGGRNet framework extracts informative and non-informative features to improve fine-grained medical image classification by addressing inter-class similarity and intra-class variability challenges.

DetailsMotivation: Existing attention-based models struggle with subtle class distinctions in medical imaging due to complex visual patterns, limited labeled data, and expert interpretation variability, leading to incorrect diagnoses.

Method: Proposed AGGRNet framework that extracts both informative and non-informative features to better understand fine-grained visual patterns in medical images.

Result: Achieves state-of-the-art performance on various medical imaging datasets, with up to 5% improvement over SOTA models on the Kvasir dataset.

Conclusion: AGGRNet effectively addresses challenges in complex medical image analysis by capturing fine-grained visual patterns through dual feature extraction, leading to improved classification accuracy.

Abstract: Medical image analysis for complex tasks such as severity grading and disease subtype classification poses significant challenges due to intricate and similar visual patterns among classes, scarcity of labeled data, and variability in expert interpretations. Despite the usefulness of existing attention-based models in capturing complex visual patterns for medical image classification, underlying architectures often face challenges in effectively distinguishing subtle classes since they struggle to capture inter-class similarity and intra-class variability, resulting in incorrect diagnosis. To address this, we propose AGGRNet framework to extract informative and non-informative features to effectively understand fine-grained visual patterns and improve classification for complex medical image analysis tasks. Experimental results show that our model achieves state-of-the-art performance on various medical imaging datasets, with the best improvement up to 5% over SOTA models on the Kvasir dataset.

[280] Leveraging Quantum-Based Architectures for Robust Diagnostics

Shabnam Sodagari, Tommy Long

Main category: cs.CV

TL;DR: Hybrid quantum-classical framework using ResNet50 encoder and Quantum CNN achieves 0.99 accuracy in diagnosing kidney stones, cysts, and tumors from CT images, with 12-qubit configuration showing superior performance.

DetailsMotivation: To improve medical diagnosis of kidney conditions (stones, cysts, tumors) using CT images by leveraging quantum computing advantages through a hybrid quantum-classical approach.

Method: Combines pretrained ResNet50 encoder with Quantum CNN, using image preprocessing (denoising, CLAHE), data augmentation, and angle encoding to transform features into qubits. Evaluated on 8-qubit and 12-qubit configurations.

Result: Both configurations achieved 0.99 test accuracy with stable convergence. 12-qubit model showed improved recall and precision - perfect recall for cysts and 0.9956 F1-score for tumors. Very few misclassifications across all classes.

Conclusion: Integration of classical preprocessing and deep feature extraction with quantum circuits enhances medical diagnostic performance for kidney condition classification.

Abstract: The objective of this study is to diagnose and differentiate kidney stones, cysts, and tumors using Computed Tomography (CT) images of the kidney. This study leverages a hybrid quantum-classical framework in this regard. We combine a pretrained ResNet50 encoder, with a Quantum Convolutional Neural Network (QCNN) to explore quantum-assisted diagnosis. We pre-process the kidney images using denoising and contrast limited adaptive histogram equalization to enhance feature extraction. We address class imbalance through data augmentation and weighted sampling. Latent features extracted by the encoder are transformed into qubits via angle encoding and processed by a QCNN. The model is evaluated on both 8-qubit and 12-qubit configurations. Both architectures achieved rapid convergence with stable learning curves and high consistency between training and validation performance. The models reached a test accuracy of 0.99, with the 12-qubit configuration providing improvements in overall recall and precision, particularly for Cyst and Tumor detection, where it achieved perfect recall for Cysts and a tumor F1-score of 0.9956. Confusion matrix analysis further confirmed reliable classification behavior across all classes, with very few misclassifications. Results demonstrate that integrating classical pre-processing and deep feature extraction with quantum circuits enhances medical diagnostic performance.

[281] Calibrated Decomposition of Aleatoric and Epistemic Uncertainty in Deep Features for Inference-Time Adaptation

Divake Kumar, Patrick Poggi, Sina Tayebati, Devashri Naik, Nilesh Ahuja, Amit Ranjan Trivedi

Main category: cs.CV

TL;DR: A framework that disentangles aleatoric and epistemic uncertainty in deep feature space to guide inference-time compute allocation and model selection, achieving 60% compute reduction with minimal accuracy loss.

DetailsMotivation: Traditional estimators collapse all uncertainty modes into single confidence scores, preventing reliable reasoning about when to allocate more compute or adjust inference strategies.

Method: Uncertainty-Guided Inference-Time Selection framework using regularized global density model for aleatoric uncertainty, and three orthogonal components (local support deficiency, manifold spectral collapse, cross-layer feature inconsistency) for epistemic uncertainty without sampling, ensembling, or additional forward passes.

Result: 60% compute reduction on MOT17 with negligible accuracy loss; 13.6 percentage point improvement in computational savings over total-uncertainty baseline; significantly tighter prediction intervals at matched coverage through conformal calibration.

Conclusion: The proposed orthogonal uncertainty decomposition enables practical self-regulating visual inference through uncertainty-guided adaptive model selection, providing substantial computational savings while maintaining performance.

Abstract: Most estimators collapse all uncertainty modes into a single confidence score, preventing reliable reasoning about when to allocate more compute or adjust inference. We introduce Uncertainty-Guided Inference-Time Selection, a lightweight inference time framework that disentangles aleatoric (data-driven) and epistemic (model-driven) uncertainty directly in deep feature space. Aleatoric uncertainty is estimated using a regularized global density model, while epistemic uncertainty is formed from three complementary components that capture local support deficiency, manifold spectral collapse, and cross-layer feature inconsistency. These components are empirically orthogonal and require no sampling, no ensembling, and no additional forward passes. We integrate the decomposed uncertainty into a distribution free conformal calibration procedure that yields significantly tighter prediction intervals at matched coverage. Using these components for uncertainty guided adaptive model selection reduces compute by approximately 60 percent on MOT17 with negligible accuracy loss, enabling practical self regulating visual inference. Additionally, our ablation results show that the proposed orthogonal uncertainty decomposition consistently yields higher computational savings across all MOT17 sequences, improving margins by 13.6 percentage points over the total-uncertainty baseline.

[282] MSLoRA: Multi-Scale Low-Rank Adaptation via Attention Reweighting

Xu Yang, Gady Agam

Main category: cs.CV

TL;DR: MSLoRA is a parameter-efficient adapter that reweights feature responses for both CNNs and ViTs using low-rank projections and multi-scale nonlinear transformations, achieving strong performance with <5% parameters.

DetailsMotivation: Existing low-rank adaptation methods are mostly confined to ViTs and struggle to generalize across different architectures like CNNs.

Method: Combines low-rank linear projection with multi-scale nonlinear transformation that jointly modulates spatial and channel attention, fused through pointwise multiplication and residual connection.

Result: Consistently improves transfer performance on classification, detection, and segmentation tasks with <5% backbone parameters, enabling stable optimization and fast convergence.

Conclusion: MSLoRA provides a simple and universal approach for efficient adaptation of frozen vision backbones by reweighting rather than re-tuning features.

Abstract: We introduce MSLoRA, a backbone-agnostic, parameter-efficient adapter that reweights feature responses rather than re-tuning the underlying backbone. Existing low-rank adaptation methods are mostly confined to vision transformers (ViTs) and struggle to generalize across architectures. MSLoRA unifies adaptation for both convolutional neural networks (CNNs) and ViTs by combining a low-rank linear projection with a multi-scale nonlinear transformation that jointly modulates spatial and channel attention. The two components are fused through pointwise multiplication and a residual connection, yielding a lightweight module that shifts feature attention while keeping pretrained weights frozen. Extensive experiments demonstrate that MSLoRA consistently improves transfer performance on classification, detection, and segmentation tasks with roughly less than 5% of backbone parameters. The design further enables stable optimization, fast convergence, and strong cross-architecture generalization. By reweighting rather than re-tuning, MSLoRA provides a simple and universal approach for efficient adaptation of frozen vision backbones.

[283] VLA-R: Vision-Language Action Retrieval toward Open-World End-to-End Autonomous Driving

Hyunki Seong, Seongwoo Moon, Hojin Ahn, Jehun Kang, David Hyunchul Shim

Main category: cs.CV

TL;DR: VLA-R is an open-world end-to-end autonomous driving framework that uses vision-language models for perception and vision-action retrieval for driving in unstructured environments.

DetailsMotivation: End-to-end autonomous driving in unstructured outdoor environments often encounters unfamiliar conditions during deployment, requiring strong generalization capabilities beyond training data.

Method: Integrates open-world perception using frozen vision-language models for detection/segmentation, Q-Former bottleneck for feature aggregation, and vision-action contrastive learning to align vision-language and action embeddings.

Result: Demonstrates strong generalization and exploratory performance in unstructured, unseen environments on real-world robotic platforms, even with limited training data.

Conclusion: The proposed VLA-R framework effectively handles open-world autonomous driving by combining vision-language perception with vision-action retrieval, enabling transferable driving behaviors in unfamiliar environments.

Abstract: Exploring open-world situations in an end-to-end manner is a promising yet challenging task due to the need for strong generalization capabilities. In particular, end-to-end autonomous driving in unstructured outdoor environments often encounters conditions that were unfamiliar during training. In this work, we present Vision-Language Action Retrieval (VLA-R), an open-world end-to-end autonomous driving (OW-E2EAD) framework that integrates open-world perception with a novel vision-action retrieval paradigm. We leverage a frozen vision-language model for open-world detection and segmentation to obtain multi-scale, prompt-guided, and interpretable perception features without domain-specific tuning. A Q-Former bottleneck aggregates fine-grained visual representations with language-aligned visual features, bridging perception and action domains. To learn transferable driving behaviors, we introduce a vision-action contrastive learning scheme that aligns vision-language and action embeddings for effective open-world reasoning and action retrieval. Our experiments on a real-world robotic platform demonstrate strong generalization and exploratory performance in unstructured, unseen environments, even with limited data. Demo videos are provided in the supplementary material.

[284] Self-Supervised Visual Prompting for Cross-Domain Road Damage Detection

Xi Xiao, Zhuxuanzi Wang, Mingqiao Mo, Chen Liu, Chenrui Ma, Yanshu Li, Smita Krishnaswamy, Xiao Wang, Tianyang Wang

Main category: cs.CV

TL;DR: PROBE is a self-supervised framework that uses visual prompting to improve cross-domain pavement defect detection without requiring target domain labels.

DetailsMotivation: Automated pavement defect detection suffers from poor cross-domain generalization, with supervised methods needing expensive re-annotation and standard self-supervised methods being vulnerable to domain shift.

Method: Introduces Self-supervised Prompt Enhancement Module (SPEM) to derive defect-aware prompts from unlabeled target data, and Domain-Aware Prompt Alignment (DAPA) to align prompt-conditioned source and target representations.

Result: Outperforms supervised, self-supervised, and adaptation baselines on four benchmarks, achieving robust zero-shot transfer, improved domain resilience, and high data efficiency in few-shot adaptation.

Conclusion: Self-supervised prompting is a practical direction for building scalable and adaptive visual inspection systems.

Abstract: The deployment of automated pavement defect detection is often hindered by poor cross-domain generalization. Supervised detectors achieve strong in-domain accuracy but require costly re-annotation for new environments, while standard self-supervised methods capture generic features and remain vulnerable to domain shift. We propose \ours, a self-supervised framework that \emph{visually probes} target domains without labels. \ours introduces a Self-supervised Prompt Enhancement Module (SPEM), which derives defect-aware prompts from unlabeled target data to guide a frozen ViT backbone, and a Domain-Aware Prompt Alignment (DAPA) objective, which aligns prompt-conditioned source and target representations. Experiments on four challenging benchmarks show that \ours consistently outperforms strong supervised, self-supervised, and adaptation baselines, achieving robust zero-shot transfer, improved resilience to domain variations, and high data efficiency in few-shot adaptation. These results highlight self-supervised prompting as a practical direction for building scalable and adaptive visual inspection systems. Source code is publicly available: https://github.com/xixiaouab/PROBE/tree/main

[285] Towards Rotation-only Imaging Geometry: Rotation Estimation

Xinrui Li, Qi Cai, Yuanxin Wu

Main category: cs.CV

TL;DR: Rotation-only optimization framework for Structure from Motion that condenses imaging geometry onto rotation manifold, achieving state-of-the-art performance comparable to bundle adjustment.

DetailsMotivation: To explore the critical relationship between scene structures, rotation and translation in SfM, building on pose-only imaging geometry which demonstrated better performance through pose adjustment.

Method: Proposes a rotation-only optimization framework based on reprojection error that expresses translation in terms of rotation, condensing imaging geometry representation onto rotation manifold for both two-view and multi-view scenarios.

Result: Demonstrates superior accuracy and robustness over current state-of-the-art rotation estimation methods, with performance comparable to multiple bundle adjustment iterations.

Conclusion: This work contributes to more accurate, efficient and reliable 3D visual computing by leveraging rotation-only optimization in Structure from Motion.

Abstract: Structure from Motion (SfM) is a critical task in computer vision, aiming to recover the 3D scene structure and camera motion from a sequence of 2D images. The recent pose-only imaging geometry decouples 3D coordinates from camera poses and demonstrates significantly better SfM performance through pose adjustment. Continuing the pose-only perspective, this paper explores the critical relationship between the scene structures, rotation and translation. Notably, the translation can be expressed in terms of rotation, allowing us to condense the imaging geometry representation onto the rotation manifold. A rotation-only optimization framework based on reprojection error is proposed for both two-view and multi-view scenarios. The experiment results demonstrate superior accuracy and robustness performance over the current state-of-the-art rotation estimation methods, even comparable to multiple bundle adjustment iteration results. Hopefully, this work contributes to even more accurate, efficient and reliable 3D visual computing.

[286] Seeing Through the Rain: Resolving High-Frequency Conflicts in Deraining and Super-Resolution via Diffusion Guidance

Wenjie Li, Jinglei Shi, Jin Han, Heng Guo, Zhanyu Ma

Main category: cs.CV

TL;DR: DHGM is a diffusion-based model that integrates weather removal and super-resolution using high-frequency guidance to generate clean, high-resolution images without sacrificing fine details.

DetailsMotivation: Real-world images are often degraded by weather conditions, and existing methods that cascade weather removal and super-resolution struggle with conflicting objectives - removal eliminates high-frequency noise while SR generates high-frequency textures, leading to inconsistent results.

Method: Proposes DHGM, a Diffusion-based High-frequency Guided Model that integrates pre-trained diffusion priors with high-pass filters to simultaneously remove rain artifacts and enhance structural details in a unified framework.

Result: Extensive experiments demonstrate that DHGM achieves superior performance over existing methods with lower computational costs, effectively generating clean and high-resolution images.

Conclusion: DHGM successfully addresses the conflict between weather removal and super-resolution by using diffusion priors with high-frequency guidance, providing an effective solution for generating high-quality images suitable for tasks like small object detection.

Abstract: Clean images are crucial for visual tasks such as small object detection, especially at high resolutions. However, real-world images are often degraded by adverse weather, and weather restoration methods may sacrifice high-frequency details critical for analyzing small objects. A natural solution is to apply super-resolution (SR) after weather removal to recover both clarity and fine structures. However, simply cascading restoration and SR struggle to bridge their inherent conflict: removal aims to remove high-frequency weather-induced noise, while SR aims to hallucinate high-frequency textures from existing details, leading to inconsistent restoration contents. In this paper, we take deraining as a case study and propose DHGM, a Diffusion-based High-frequency Guided Model for generating clean and high-resolution images. DHGM integrates pre-trained diffusion priors with high-pass filters to simultaneously remove rain artifacts and enhance structural details. Extensive experiments demonstrate that DHGM achieves superior performance over existing methods, with lower costs.

[287] MFI-ResNet: Efficient ResNet Architecture Optimization via MeanFlow Compression and Selective Incubation

Nuolin Sun, Linyuan Wang, Haonan Wei, Lei Li, Bin Yan

Main category: cs.CV

TL;DR: MFI-ResNet uses MeanFlow modules to compress ResNet stages, reducing parameters by ~46% while improving accuracy on CIFAR datasets.

DetailsMotivation: To improve ResNet's parameter efficiency and performance by leveraging insights from flow matching models, viewing ResNet as discretized ODEs and using generative flow-fields to characterize feature transformations.

Method: Compression-expansion strategy: compress multi-layer ResNet stages to 1-2 MeanFlow modules, then selectively expand first three stages to baseline configuration while keeping last stage in MeanFlow form, followed by fine-tuning.

Result: Reduced parameters by 46.28% and 45.59% compared to ResNet-50 on CIFAR-10 and CIFAR-100 respectively, while improving accuracy by 0.23% and 0.17%.

Conclusion: Generative flow-fields can effectively characterize ResNet’s feature transformations, providing new insights into the relationship between generative modeling and discriminative learning.

Abstract: ResNet has achieved tremendous success in computer vision through its residual connection mechanism. ResNet can be viewed as a discretized form of ordinary differential equations (ODEs). From this perspective, the multiple residual blocks within a single ResNet stage essentially perform multi-step discrete iterations of the feature transformation for that stage. The recently proposed flow matching model, MeanFlow, enables one-step generative modeling by learning the mean velocity field to transform distributions. Inspired by this, we propose MeanFlow-Incubated ResNet (MFI-ResNet), which employs a compression-expansion strategy to jointly improve parameter efficiency and discriminative performance. In the compression phase, we simplify the multi-layer structure within each ResNet stage to one or two MeanFlow modules to construct a lightweight meta model. In the expansion phase, we apply a selective incubation strategy to the first three stages, expanding them to match the residual block configuration of the baseline ResNet model, while keeping the last stage in MeanFlow form, and fine-tune the incubated model. Experimental results show that on CIFAR-10 and CIFAR-100 datasets, MFI-ResNet achieves remarkable parameter efficiency, reducing parameters by 46.28% and 45.59% compared to ResNet-50, while still improving accuracy by 0.23% and 0.17%, respectively. This demonstrates that generative flow-fields can effectively characterize the feature transformation process in ResNet, providing a new perspective for understanding the relationship between generative modeling and discriminative learning.

[288] RedVTP: Training-Free Acceleration of Diffusion Vision-Language Models Inference via Masked Token-Guided Visual Token Pruning

Jingqi Xu, Jingxi Lu, Chenghao Li, Sreetama Sarkar, Souvik Kundu, Peter A. Beerel

Main category: cs.CV

TL;DR: RedVTP is a response-driven visual token pruning method for Diffusion Vision-Language Models that improves inference efficiency by pruning less important visual tokens after the first inference step, achieving significant speedups without accuracy loss.

DetailsMotivation: Diffusion Vision-Language Models (DVLMs) have high computational demands due to large numbers of visual tokens, and while visual token pruning has been studied for autoregressive VLMs, it remains unexplored for DVLMs.

Method: Proposes RedVTP that estimates visual token importance using attention from masked response tokens and prunes less important tokens after the first inference step, leveraging consistent importance scores across steps.

Result: Improves token generation throughput by up to 186% for LLaDA-V and 28.05% for LaViDa, reduces inference latency by up to 64.97% and 21.87% respectively, without compromising accuracy.

Conclusion: RedVTP effectively addresses the computational inefficiency of DVLMs through response-driven visual token pruning, achieving substantial speed improvements while maintaining or even improving model accuracy.

Abstract: Vision-Language Models (VLMs) have achieved remarkable progress in multimodal reasoning and generation, yet their high computational demands remain a major challenge. Diffusion Vision-Language Models (DVLMs) are particularly attractive because they enable parallel token decoding, but the large number of visual tokens still significantly hinders their inference efficiency. While visual token pruning has been extensively studied for autoregressive VLMs (AVLMs), it remains largely unexplored for DVLMs. In this work, we propose RedVTP, a response-driven visual token pruning strategy that leverages the inference dynamics of DVLMs. Our method estimates visual token importance using attention from the masked response tokens. Based on the observation that these importance scores remain consistent across steps, RedVTP prunes the less important visual tokens from the masked tokens after the first inference step, thereby maximizing inference efficiency. Experiments show that RedVTP improves token generation throughput of LLaDA-V and LaViDa by up to 186% and 28.05%, respectively, and reduces inference latency by up to 64.97% and 21.87%, without compromising-and in some cases improving-accuracy.

[289] Text-Guided Channel Perturbation and Pretrained Knowledge Integration for Unified Multi-Modality Image Fusion

Xilai Li, Xiaosong Li, Weijun Jiang

Main category: cs.CV

TL;DR: UP-Fusion: A unified multi-modality image fusion framework using channel perturbation and pre-trained knowledge to overcome gradient conflicts and improve generalization across fusion tasks.

DetailsMotivation: Existing unified models for multi-modality image fusion suffer from gradient conflicts due to large modality differences, while modality-specific encoders reduce generalization across different fusion tasks.

Method: Proposes UP-Fusion with three modules: Semantic-Aware Channel Pruning Module (SCPM) for filtering redundant features using pre-trained knowledge, Geometric Affine Modulation Module (GAM) for maintaining modal discriminability, and Text-Guided Channel Perturbation Module (TCPM) for reshaping channel distribution during decoding.

Result: Extensive experiments show the proposed algorithm outperforms existing methods on both multi-modality image fusion and downstream tasks.

Conclusion: UP-Fusion effectively addresses gradient conflicts and generalization limitations in multi-modality image fusion through channel perturbation and pre-trained knowledge integration.

Abstract: Multi-modality image fusion enhances scene perception by combining complementary information. Unified models aim to share parameters across modalities for multi-modality image fusion, but large modality differences often cause gradient conflicts, limiting performance. Some methods introduce modality-specific encoders to enhance feature perception and improve fusion quality. However, this strategy reduces generalisation across different fusion tasks. To overcome this limitation, we propose a unified multi-modality image fusion framework based on channel perturbation and pre-trained knowledge integration (UP-Fusion). To suppress redundant modal information and emphasize key features, we propose the Semantic-Aware Channel Pruning Module (SCPM), which leverages the semantic perception capability of a pre-trained model to filter and enhance multi-modality feature channels. Furthermore, we proposed the Geometric Affine Modulation Module (GAM), which uses original modal features to apply affine transformations on initial fusion features to maintain the feature encoder modal discriminability. Finally, we apply a Text-Guided Channel Perturbation Module (TCPM) during decoding to reshape the channel distribution, reducing the dependence on modality-specific channels. Extensive experiments demonstrate that the proposed algorithm outperforms existing methods on both multi-modality image fusion and downstream tasks.

[290] Real-Time Drivers’ Drowsiness Detection and Analysis through Deep Learning

ANK Zaman, Prosenjit Chatterjee, Rajat Sharma

Main category: cs.CV

TL;DR: Real-time driver drowsiness detection system using DCNNs and OpenCV to analyze facial landmarks like eye openings and mouth movements, achieving high accuracy on benchmark datasets.

DetailsMotivation: Long drives can cause driver drowsiness which is life-threatening. Current scenarios force drivers to drive without sufficient rest, necessitating a real-time detection system to prevent accidents.

Method: Uses deep convolutional neural networks (DCNNs) with OpenCV to analyze real-time facial images, detecting facial landmarks including eye openings and yawn-like mouth movements.

Result: Achieved 99.6% accuracy on NTHU-DDD dataset and 97% accuracy on Yawn-Eye-Dataset for drowsiness detection classification.

Conclusion: The proposed system offers a non-invasive, inexpensive, and cost-effective solution for real-time drowsiness detection that can be embedded in smart car technology to enhance road safety.

Abstract: A long road trip is fun for drivers. However, a long drive for days can be tedious for a driver to accommodate stringent deadlines to reach distant destinations. Such a scenario forces drivers to drive extra miles, utilizing extra hours daily without sufficient rest and breaks. Once a driver undergoes such a scenario, it occasionally triggers drowsiness during driving. Drowsiness in driving can be life-threatening to any individual and can affect other drivers’ safety; therefore, a real-time detection system is needed. To identify fatigued facial characteristics in drivers and trigger the alarm immediately, this research develops a real-time driver drowsiness detection system utilizing deep convolutional neural networks (DCNNs) and OpenCV.Our proposed and implemented model takes real- time facial images of a driver using a live camera and utilizes a Python-based library named OpenCV to examine the facial images for facial landmarks like sufficient eye openings and yawn-like mouth movements. The DCNNs framework then gathers the data and utilizes a per-trained model to detect the drowsiness of a driver using facial landmarks. If the driver is identified as drowsy, the system issues a continuous alert in real time, embedded in the Smart Car technology.By potentially saving innocent lives on the roadways, the proposed technique offers a non-invasive, inexpensive, and cost-effective way to identify drowsiness. Our proposed and implemented DCNNs embedded drowsiness detection model successfully react with NTHU-DDD dataset and Yawn-Eye-Dataset with drowsiness detection classification accuracy of 99.6% and 97% respectively.

[291] CoTBox-TTT: Grounding Medical VQA with Visual Chain-of-Thought Boxes During Test-time Training

Jiahe Qian, Yuhao Shen, Zhangtianyi Chen, Juexiao Zhou, Peisong Wang

Main category: cs.CV

TL;DR: CoTBox-TTT is a test-time training method that adapts vision-language models at inference using soft prompts and visual chain-of-thought to improve medical VQA reliability under domain shift.

DetailsMotivation: Current medical VQA systems fail under domain shift and produce weakly grounded answers, creating reliability gaps when models attend to spurious regions and retraining is impractical.

Method: Test-time training approach that updates only continuous soft prompts, identifies question-relevant regions through visual chain-of-thought, and encourages answer consistency across original image and localized crop. Label-free and plug-and-play with frozen backbones.

Result: Improves medical VQA performance significantly - adding CoTBox-TTT to LLaVA increases closed-ended accuracy by 12.3% on pathVQA.

Conclusion: The approach is practical for real deployments in medical VQA, providing reliable adaptation at inference time without requiring retraining or additional labels.

Abstract: Medical visual question answering could support clinical decision making, yet current systems often fail under domain shift and produce answers that are weakly grounded in image evidence. This reliability gap arises when models attend to spurious regions and when retraining or additional labels are impractical at deployment time. We address this setting with CoTBox-TTT, an evidence-first test-time training approach that adapts a vision-language model at inference while keeping all backbones frozen. The method updates only a small set of continuous soft prompts. It identifies question-relevant regions through a visual chain-of-thought signal and encourages answer consistency across the original image and a localized crop. The procedure is label free, and plug and play with diverse backbones. Experiments on medical VQA show that the approach is practical for real deployments. For instance, adding CoTBox-TTT to LLaVA increases closed-ended accuracy by 12.3% on pathVQA.

[292] MOON2.0: Dynamic Modality-balanced Multimodal Representation Learning for E-commerce Product Understanding

Zhanheng Nie, Chenghan Fu, Daoze Zhang, Junxian Wu, Wanxian Guan, Pengjie Wang, Jian Xu, Bo Zheng

Main category: cs.CV

TL;DR: MOON2.0 is a dynamic modality-balanced multimodal framework for e-commerce product understanding that addresses modality imbalance, underutilized alignment relationships, and data noise through Modality-driven MoE, Dual-level Alignment, and Image-text Co-augmentation with Dynamic Sample Filtering.

DetailsMotivation: To overcome three key challenges in e-commerce multimodal models: modality imbalance from mixed training, underutilization of intrinsic visual-textual alignment within products, and limited handling of noise in e-commerce multimodal data.

Method: Proposes three components: (1) Modality-driven Mixture-of-Experts for adaptive processing based on modality composition, (2) Dual-level Alignment to leverage semantic alignment within products, and (3) MLLM-based Image-text Co-augmentation with Dynamic Sample Filtering for data quality improvement.

Result: Achieves state-of-the-art zero-shot performance on MBE2.0 benchmark and multiple public datasets. Attention-based heatmap visualization shows improved multimodal alignment.

Conclusion: MOON2.0 effectively addresses key challenges in e-commerce multimodal representation learning and demonstrates superior performance through balanced modality processing and enhanced alignment strategies.

Abstract: The rapid growth of e-commerce calls for multimodal models that comprehend rich visual and textual product information. Although recent multimodal large language models (MLLMs) for product understanding exhibit strong capability in representation learning for e-commerce, they still face three challenges: (i) the modality imbalance induced by modality mixed training; (ii) underutilization of the intrinsic alignment relationships among visual and textual information within a product; and (iii) limited handling of noise in e-commerce multimodal data. To address these, we propose MOON2.0, a dynamic modality-balanced multimodal representation learning framework for e-commerce product understanding. MOON2.0 comprises: (1) a Modality-driven Mixture-of-Experts (MoE) module that adaptively processes input samples by their modality composition, enabling Multimodal Joint Learning to mitigate the modality imbalance; (2) a Dual-level Alignment method to better leverage semantic alignment properties inside individual products; and (3) an MLLM-based Image-text Co-augmentation strategy that integrates textual enrichment with visual expansion, coupled with Dynamic Sample Filtering to improve training data quality. We further introduce MBE2.0, a co-augmented multimodal representation benchmark for e-commerce representation learning and evaluation. Experiments show that MOON2.0 delivers state-of-the-art zero-shot performance on MBE2.0 and multiple public datasets. Furthermore, attention-based heatmap visualization provides qualitative evidence of improved multimodal alignment of MOON2.0.

[293] MaskAnyNet: Rethinking Masked Image Regions as Valuable Information in Supervised Learning

Jingshan Hong, Haigen Hu, Huihuang Zhang, Qianwei Zhou, Zhao Li

Main category: cs.CV

TL;DR: MaskAnyNet treats masked image regions as auxiliary knowledge rather than discarding them, using a relearning mechanism to exploit both visible and masked information for improved feature diversity and fine-grained detail preservation.

DetailsMotivation: Traditional image masking underutilizes discarded pixels and may remove critical features, while masked image modeling shows masked regions contain valuable contextual information and semantic diversity that can enhance learning.

Method: Proposes MaskAnyNet that combines masking with a relearning mechanism, adding an additional branch to any model to jointly learn from recomposed masked regions, treating masked content as auxiliary knowledge.

Result: Experiments on CNN and Transformer backbones show consistent performance gains across multiple benchmarks, with analysis confirming improved semantic diversity through masked content reuse.

Conclusion: Masked regions should be treated as valuable auxiliary knowledge rather than ignored, and the proposed approach effectively leverages semantic diversity from masked content to enrich features and preserve fine-grained details.

Abstract: In supervised learning, traditional image masking faces two key issues: (i) discarded pixels are underutilized, leading to a loss of valuable contextual information; (ii) masking may remove small or critical features, especially in fine-grained tasks. In contrast, masked image modeling (MIM) has demonstrated that masked regions can be reconstructed from partial input, revealing that even incomplete data can exhibit strong contextual consistency with the original image. This highlights the potential of masked regions as sources of semantic diversity. Motivated by this, we revisit the image masking approach, proposing to treat masked content as auxiliary knowledge rather than ignored. Based on this, we propose MaskAnyNet, which combines masking with a relearning mechanism to exploit both visible and masked information. It can be easily extended to any model with an additional branch to jointly learn from the recomposed masked region. This approach leverages the semantic diversity of the masked regions to enrich features and preserve fine-grained details. Experiments on CNN and Transformer backbones show consistent gains across multiple benchmarks. Further analysis confirms that the proposed method improves semantic diversity through the reuse of masked content.

[294] Towards Temporal Fusion Beyond the Field of View for Camera-based Semantic Scene Completion

Jongseong Bae, Junwoo Ha, Jinnyeong Heo, Yeongin Lee, Ha Young Kim

Main category: cs.CV

TL;DR: C3DFusion module improves 3D semantic scene completion by better reconstructing out-of-frame areas using temporal fusion of current and historical frames with noise suppression and feature enhancement.

DetailsMotivation: Existing camera-based 3D semantic scene completion methods struggle to reconstruct critical out-of-frame areas near ego-vehicle sides, despite previous frames containing valuable contextual information about these unseen regions.

Method: Proposes Current-Centric Contextual 3D Fusion (C3DFusion) module that generates hidden region-aware 3D feature geometry by explicitly aligning 3D-lifted point features from current and historical frames, using historical context blurring and current-centric feature densification.

Result: Significantly outperforms state-of-the-art methods on SemanticKITTI and SSCBench-KITTI-360 datasets, and shows robust generalization with notable performance gains when applied to other baseline models.

Conclusion: C3DFusion effectively addresses the limitation of reconstructing out-of-frame areas in 3D semantic scene completion through enhanced temporal fusion techniques.

Abstract: Recent camera-based 3D semantic scene completion (SSC) methods have increasingly explored leveraging temporal cues to enrich the features of the current frame. However, while these approaches primarily focus on enhancing in-frame regions, they often struggle to reconstruct critical out-of-frame areas near the sides of the ego-vehicle, although previous frames commonly contain valuable contextual information about these unseen regions. To address this limitation, we propose the Current-Centric Contextual 3D Fusion (C3DFusion) module, which generates hidden region-aware 3D feature geometry by explicitly aligning 3D-lifted point features from both current and historical frames. C3DFusion performs enhanced temporal fusion through two complementary techniques-historical context blurring and current-centric feature densification-which suppress noise from inaccurately warped historical point features by attenuating their scale, and enhance current point features by increasing their volumetric contribution. Simply integrated into standard SSC architectures, C3DFusion demonstrates strong effectiveness, significantly outperforming state-of-the-art methods on the SemanticKITTI and SSCBench-KITTI-360 datasets. Furthermore, it exhibits robust generalization, achieving notable performance gains when applied to other baseline models.

[295] Visible Structure Retrieval for Lightweight Image-Based Relocalisation

Fereidoon Zangeneh, Leonard Bruns, Amit Dekel, Alessandro Pieropan, Patric Jensfelt

Main category: cs.CV

TL;DR: Proposes a neural network that directly maps images to visible 3D scene structure, eliminating the need for image retrieval or search heuristics in camera relocalization.

DetailsMotivation: Existing structure-based relocalization methods rely on elaborate search heuristics or image retrieval with growing storage requirements, making them impractical for large scenes.

Method: Learns a direct mapping from images to visible scene structure using a novel visible structure retrieval network that outputs the subset of 3D points visible in a query image.

Result: Achieves localization accuracy comparable to state-of-the-art methods while requiring lower computational and storage footprint.

Conclusion: The proposed neural network paradigm enables tractable structure-based relocalization without the limitations of traditional retrieval-based approaches.

Abstract: Accurate camera pose estimation from an image observation in a previously mapped environment is commonly done through structure-based methods: by finding correspondences between 2D keypoints on the image and 3D structure points in the map. In order to make this correspondence search tractable in large scenes, existing pipelines either rely on search heuristics, or perform image retrieval to reduce the search space by comparing the current image to a database of past observations. However, these approaches result in elaborate pipelines or storage requirements that grow with the number of past observations. In this work, we propose a new paradigm for making structure-based relocalisation tractable. Instead of relying on image retrieval or search heuristics, we learn a direct mapping from image observations to the visible scene structure in a compact neural network. Given a query image, a forward pass through our novel visible structure retrieval network allows obtaining the subset of 3D structure points in the map that the image views, thus reducing the search space of 2D-3D correspondences. We show that our proposed method enables performing localisation with an accuracy comparable to the state of the art, while requiring lower computational and storage footprint.

[296] DINO-Detect: A Simple yet Effective Framework for Blur-Robust AI-Generated Image Detection

Jialiang Shen, Jiyang Zheng, Yunqi Xue, Huajie Chen, Yu Yao, Hui Kang, Ruiqi Liu, Helin Gong, Yang Yang, Dadong Wang, Tongliang Liu

Main category: cs.CV

TL;DR: A blur-robust AI-generated image detection framework using teacher-student knowledge distillation with DINOv3 teacher to maintain detection performance under motion blur degradation.

DetailsMotivation: Address performance degradation of AI-generated image detectors under real-world motion blur conditions that occur in handheld photography, fast motion, and compressed video.

Method: Teacher-student knowledge distillation with frozen DINOv3 teacher trained on clean images providing stable representations, distilled to student trained on blurred images to maintain consistent detection under motion degradation.

Result: Achieves state-of-the-art performance under both motion-blurred and clean conditions, demonstrating improved generalization and real-world applicability.

Conclusion: The proposed framework effectively addresses motion blur challenges in AI-generated image detection through knowledge distillation, enhancing real-world deployment capabilities.

Abstract: With growing concerns over image authenticity and digital safety, the field of AI-generated image (AIGI) detection has progressed rapidly. Yet, most AIGI detectors still struggle under real-world degradations, particularly motion blur, which frequently occurs in handheld photography, fast motion, and compressed video. Such blur distorts fine textures and suppresses high-frequency artifacts, causing severe performance drops in real-world settings. We address this limitation with a blur-robust AIGI detection framework based on teacher-student knowledge distillation. A high-capacity teacher (DINOv3), trained on clean (i.e., sharp) images, provides stable and semantically rich representations that serve as a reference for learning. By freezing the teacher to maintain its generalization ability, we distill its feature and logit responses from sharp images to a student trained on blurred counterparts, enabling the student to produce consistent representations under motion degradation. Extensive experiments benchmarks show that our method achieves state-of-the-art performance under both motion-blurred and clean conditions, demonstrating improved generalization and real-world applicability. Source codes will be released at: https://github.com/JiaLiangShen/Dino-Detect-for-blur-robust-AIGC-Detection.

[297] MdaIF: Robust One-Stop Multi-Degradation-Aware Image Fusion with Language-Driven Semantics

Jing Li, Yifan Wang, Jiafeng Yan, Renlong Zhang, Bin Yang

Main category: cs.CV

TL;DR: Proposes MdaIF, a degradation-aware infrared and visible image fusion framework using LLM-driven MoE system and VLM semantic priors to handle multiple weather degradation scenarios.

DetailsMotivation: Existing methods fail to address visible image degradation in adverse weather conditions and use fixed architectures that limit adaptability to diverse degradation scenarios.

Method: Uses mixture-of-experts (MoE) system for different degradation scenarios, pre-trained VLM for semantic priors, degradation-aware channel attention module (DCAM) with prototype decomposition, and expert routing guided by semantic priors.

Result: Extensive experiments show superior performance over state-of-the-art methods in complex degradation scenarios.

Conclusion: MdaIF framework effectively addresses multi-degradation image fusion by leveraging LLM-driven MoE and semantic priors, demonstrating robust performance across various weather conditions.

Abstract: Infrared and visible image fusion aims to integrate complementary multi-modal information into a single fused result. However, existing methods 1) fail to account for the degradation visible images under adverse weather conditions, thereby compromising fusion performance; and 2) rely on fixed network architectures, limiting their adaptability to diverse degradation scenarios. To address these issues, we propose a one-stop degradation-aware image fusion framework for multi-degradation scenarios driven by a large language model (MdaIF). Given the distinct scattering characteristics of different degradation scenarios (e.g., haze, rain, and snow) in atmospheric transmission, a mixture-of-experts (MoE) system is introduced to tackle image fusion across multiple degradation scenarios. To adaptively extract diverse weather-aware degradation knowledge and scene feature representations, collectively referred to as the semantic prior, we employ a pre-trained vision-language model (VLM) in our framework. Guided by the semantic prior, we propose degradation-aware channel attention module (DCAM), which employ degradation prototype decomposition to facilitate multi-modal feature interaction in channel domain. In addition, to achieve effective expert routing, the semantic prior and channel-domain modulated features are utilized to guide the MoE, enabling robust image fusion in complex degradation scenarios. Extensive experiments validate the effectiveness of our MdaIF, demonstrating superior performance over SOTA methods.

[298] D$^{2}$-VPR: A Parameter-efficient Visual-foundation-model-based Visual Place Recognition Method via Knowledge Distillation and Deformable Aggregation

Zheyuan Zhang, Jiwei Zhang, Boyu Zhou, Linzhimeng Duan, Hong Chen

Main category: cs.CV

TL;DR: D²-VPR is a framework that combines knowledge distillation and deformable attention to reduce the computational cost of Visual Place Recognition while maintaining competitive performance with visual foundation models like DINOv2.

DetailsMotivation: Visual foundation models like DINOv2 improve VPR performance but have high computational complexity that hinders deployment on resource-constrained devices.

Method: Uses two-stage training with knowledge distillation and fine-tuning, plus a Distillation Recovery Module for feature alignment and a Top-Down-attention-based Deformable Aggregator for adaptive region selection.

Result: Achieves competitive performance with state-of-the-art methods while reducing parameters by ~64.2% and FLOPs by ~62.6% compared to CricaVPR.

Conclusion: The proposed framework successfully balances performance and efficiency, making VPR more deployable on resource-constrained devices without sacrificing accuracy.

Abstract: Visual Place Recognition (VPR) aims to determine the geographic location of a query image by retrieving its most visually similar counterpart from a geo-tagged reference database. Recently, the emergence of the powerful visual foundation model, DINOv2, trained in a self-supervised manner on massive datasets, has significantly improved VPR performance. This improvement stems from DINOv2’s exceptional feature generalization capabilities but is often accompanied by increased model complexity and computational overhead that impede deployment on resource-constrained devices. To address this challenge, we propose $D^{2}$-VPR, a $D$istillation- and $D$eformable-based framework that retains the strong feature extraction capabilities of visual foundation models while significantly reducing model parameters and achieving a more favorable performance-efficiency trade-off. Specifically, first, we employ a two-stage training strategy that integrates knowledge distillation and fine-tuning. Additionally, we introduce a Distillation Recovery Module (DRM) to better align the feature spaces between the teacher and student models, thereby minimizing knowledge transfer losses to the greatest extent possible. Second, we design a Top-Down-attention-based Deformable Aggregator (TDDA) that leverages global semantic features to dynamically and adaptively adjust the Regions of Interest (ROI) used for aggregation, thereby improving adaptability to irregular structures. Extensive experiments demonstrate that our method achieves competitive performance compared to state-of-the-art approaches. Meanwhile, it reduces the parameter count by approximately 64.2% and FLOPs by about 62.6% (compared to CricaVPR).Code is available at https://github.com/tony19980810/D2VPR.

[299] ReaSon: Reinforced Causal Search with Information Bottleneck for Video Understanding

Yuan Zhou, Litao Hua, Shilong Jin, Wentao Huang, Haoran Duan

Main category: cs.CV

TL;DR: ReaSon is a framework that selects keyframes for video understanding by optimizing for causal necessity and predictive sufficiency using reinforcement learning and counterfactual interventions.

DetailsMotivation: Keyframe selection is crucial for video understanding with VLMs due to token limitations and temporal sparsity, requiring frames that are both informative and causally decisive.

Method: Uses a learnable policy network to select keyframes from candidate frames, assesses causal necessity via counterfactual interventions, and guides selection through reinforcement learning with a CIB-aligned reward.

Result: Consistently outperforms state-of-the-art methods on NExT-QA, EgoSchema, and Video-MME datasets under limited-frame settings.

Conclusion: ReaSon effectively selects causally necessary and predictively sufficient keyframes, demonstrating strong generalization ability across multiple video understanding benchmarks.

Abstract: Keyframe selection has become essential for video understanding with vision-language models (VLMs) due to limited input tokens and the temporal sparsity of relevant information across video frames. Video understanding often relies on effective keyframes that are not only informative but also causally decisive. To this end, we propose Reinforced Causal Search with Information Bottleneck (ReaSon), a framework that formulates keyframe selection as an optimization problem with the help of a novel Causal Information Bottleneck (CIB), which explicitly defines keyframes as those satisfying both predictive sufficiency and causal necessity. Specifically, ReaSon employs a learnable policy network to select keyframes from a visually relevant pool of candidate frames to capture predictive sufficiency, and then assesses causal necessity via counterfactual interventions. Finally, a composite reward aligned with the CIB principle is designed to guide the selection policy through reinforcement learning. Extensive experiments on NExT-QA, EgoSchema, and Video-MME demonstrate that ReaSon consistently outperforms existing state-of-the-art methods under limited-frame settings, validating its effectiveness and generalization ability.

[300] HiGFA: Hierarchical Guidance for Fine-grained Data Augmentation with Diffusion Models

Zhiguang Lu, Qianqian Xu, Peisong Wen, Siran Da, Qingming Huang

Main category: cs.CV

TL;DR: HiGFA is a hierarchical diffusion-based data augmentation method that uses staged guidance (text/contour early, fine-grained classifier late) with confidence-based dynamic modulation to generate faithful synthetic images for fine-grained visual classification tasks.

DetailsMotivation: Standard diffusion models using text-based CFG lack specificity for fine-grained tasks, potentially generating misleading examples that degrade classifier performance due to failure to capture subtle category-defining features.

Method: Hierarchical guidance: early stages use strong text and contour guidance for scene/style/structure; final stages activate fine-grained classifier guidance with dynamic modulation based on prediction confidence to balance global structure with detail refinement.

Result: Experiments on multiple FGVC datasets demonstrate HiGFA’s effectiveness in generating diverse yet faithful synthetic images for fine-grained classification tasks.

Conclusion: HiGFA’s hierarchical, confidence-driven orchestration of guidance signals enables successful fine-grained data augmentation by intelligently balancing global structure formation with precise detail refinement throughout the diffusion sampling process.

Abstract: Generative diffusion models show promise for data augmentation. However, applying them to fine-grained tasks presents a significant challenge: ensuring synthetic images accurately capture the subtle, category-defining features critical for high fidelity. Standard approaches, such as text-based Classifier-Free Guidance (CFG), often lack the required specificity, potentially generating misleading examples that degrade fine-grained classifier performance. To address this, we propose Hierarchically Guided Fine-grained Augmentation (HiGFA). HiGFA leverages the temporal dynamics of the diffusion sampling process. It employs strong text and transformed contour guidance with fixed strengths in the early-to-mid sampling stages to establish overall scene, style, and structure. In the final sampling stages, HiGFA activates a specialized fine-grained classifier guidance and dynamically modulates the strength of all guidance signals based on prediction confidence. This hierarchical, confidence-driven orchestration enables HiGFA to generate diverse yet faithful synthetic images by intelligently balancing global structure formation with precise detail refinement. Experiments on several FGVC datasets demonstrate the effectiveness of HiGFA.

[301] EmoVerse: A MLLMs-Driven Emotion Representation Dataset for Interpretable Visual Emotion Analysis

Yijie Guo, Dexiang Hong, Weidong Chen, Zihan She, Cheng Ye, Xiaojun Chang, Zhendong Mao

Main category: cs.CV

TL;DR: EmoVerse introduces a large-scale open-source dataset with multi-layered annotations for interpretable visual emotion analysis, decomposing emotions into Background-Attribute-Subject triplets and providing dual categorical and dimensional emotion representations.

DetailsMotivation: To address the limitations in visual emotion analysis caused by lack of open-source datasets and limited interpretability, where most existing approaches assign single discrete emotion labels to entire images without explaining how visual elements contribute to emotions.

Method: Created EmoVerse dataset with 219k images using a multi-stage annotation pipeline that decomposes emotions into Background-Attribute-Subject triplets, grounds elements to visual regions, and provides dual annotations in both categorical and dimensional emotion spaces.

Result: The dataset enables word-level and subject-level emotional reasoning with high annotation reliability achieved through minimal human effort. An interpretable model was also introduced that maps visual cues to dimensional emotion representations with detailed attribution explanations.

Conclusion: EmoVerse provides a comprehensive foundation for advancing explainable high-level emotion understanding in visual content through its dataset, annotation pipeline, and interpretable model.

Abstract: Visual Emotion Analysis (VEA) aims to bridge the affective gap between visual content and human emotional responses. Despite its promise, progress in this field remains limited by the lack of open-source and interpretable datasets. Most existing studies assign a single discrete emotion label to an entire image, offering limited insight into how visual elements contribute to emotion. In this work, we introduce EmoVerse, a large-scale open-source dataset that enables interpretable visual emotion analysis through multi-layered, knowledge-graph-inspired annotations. By decomposing emotions into Background-Attribute-Subject (B-A-S) triplets and grounding each element to visual regions, EmoVerse provides word-level and subject-level emotional reasoning. With over 219k images, the dataset further includes dual annotations in Categorical Emotion States (CES) and Dimensional Emotion Space (DES), facilitating unified discrete and continuous emotion representation. A novel multi-stage pipeline ensures high annotation reliability with minimal human effort. Finally, we introduce an interpretable model that maps visual cues into DES representations and provides detailed attribution explanations. Together, the dataset, pipeline, and model form a comprehensive foundation for advancing explainable high-level emotion understanding.

[302] SEMC: Structure-Enhanced Mixture-of-Experts Contrastive Learning for Ultrasound Standard Plane Recognition

Qing Cai, Guihao Yan, Fan Zhang, Cheng Zhang, Zhi Liu

Main category: cs.CV

TL;DR: SEMC is a Structure-Enhanced Mixture-of-Experts Contrastive learning framework for ultrasound standard plane recognition that combines structure-aware feature fusion with expert-guided contrastive learning to improve recognition of structural and discriminative details.

DetailsMotivation: Existing methods fail to effectively exploit shallow structural information and struggle to capture fine-grained semantic differences through contrastive samples, resulting in suboptimal recognition of both structural and discriminative details in ultrasound standard planes.

Method: Proposes SEMC with two key modules: 1) Semantic-Structure Fusion Module (SSFM) to exploit multi-scale structural information and align shallow/deep features, 2) Mixture-of-Experts Contrastive Recognition Module (MCRM) for hierarchical contrastive learning using MoE mechanism. Also curated a large-scale liver ultrasound dataset with six standard planes.

Result: Extensive experiments on in-house dataset and two public datasets demonstrate that SEMC outperforms recent state-of-the-art methods across various metrics.

Conclusion: SEMC effectively addresses the limitations of existing methods by combining structure-aware feature fusion with expert-guided contrastive learning, achieving superior performance in ultrasound standard plane recognition.

Abstract: Ultrasound standard plane recognition is essential for clinical tasks such as disease screening, organ evaluation, and biometric measurement. However, existing methods fail to effectively exploit shallow structural information and struggle to capture fine-grained semantic differences through contrastive samples generated by image augmentations, ultimately resulting in suboptimal recognition of both structural and discriminative details in ultrasound standard planes. To address these issues, we propose SEMC, a novel Structure-Enhanced Mixture-of-Experts Contrastive learning framework that combines structure-aware feature fusion with expert-guided contrastive learning. Specifically, we first introduce a novel Semantic-Structure Fusion Module (SSFM) to exploit multi-scale structural information and enhance the model’s ability to perceive fine-grained structural details by effectively aligning shallow and deep features. Then, a novel Mixture-of-Experts Contrastive Recognition Module (MCRM) is designed to perform hierarchical contrastive learning and classification across multi-level features using a mixture-of-experts (MoE) mechanism, further improving class separability and recognition performance. More importantly, we also curate a large-scale and meticulously annotated liver ultrasound dataset containing six standard planes. Extensive experimental results on our in-house dataset and two public datasets demonstrate that SEMC outperforms recent state-of-the-art methods across various metrics.

[303] Through-Foliage Surface-Temperature Reconstruction for early Wildfire Detection

Mohamed Youssef, Lukas Brunner, Klaus Rundhammer, Gerald Czech, Oliver Bimber

Main category: cs.CV

TL;DR: A novel method combining signal processing and machine learning to reconstruct surface temperatures through forest vegetation for automated wildfire monitoring using drones, enabling early fire detection before visible signs appear.

DetailsMotivation: To enable early detection of ground fires through forest canopy before smoke or flames are visible, addressing limitations of synthetic aperture sensing which introduces thermal blur that obscures actual surface temperatures.

Method: Train a visual state space model to recover subtle thermal signals from blurred SA data, using a latent diffusion model in vector quantized framework to generate realistic surface temperature simulations from real wildfire recordings, with temperature augmentation and procedural thermal forest simulation.

Result: On simulated data, reduced RMSE by 2-2.5x compared to conventional thermal and uncorrected SA imaging. In field experiments for high-temperature hotspots, achieved 12.8x RMSE gain over conventional thermal and 2.6x gain over uncorrected SA. Successfully reconstructed complete morphology of fire and human signatures.

Conclusion: The method enables accurate temperature reconstruction through occluding vegetation, outperforming conventional approaches significantly, and demonstrates generalization to other thermal signals like human signatures for search and rescue applications.

Abstract: We introduce a novel method for reconstructing surface temperatures through occluding forest vegetation by combining signal processing and machine learning. Our goal is to enable fully automated aerial wildfire monitoring using autonomous drones, allowing for the early detection of ground fires before smoke or flames are visible. While synthetic aperture (SA) sensing mitigates occlusion from the canopy and sunlight, it introduces thermal blur that obscures the actual surface temperatures. To address this, we train a visual state space model to recover the subtle thermal signals of partially occluded soil and fire hotspots from this blurred data. A key challenge was the scarcity of real-world training data. We overcome this by integrating a latent diffusion model into a vector quantized to generated a large volume of realistic surface temperature simulations from real wildfire recordings, which we further expanded through temperature augmentation and procedural thermal forest simulation. On simulated data across varied ambient and surface temperatures, forest densities, and sunlight conditions, our method reduced the RMSE by a factor of 2 to 2.5 compared to conventional thermal and uncorrected SA imaging. In field experiments focused on high-temperature hotspots, the improvement was even more significant, with a 12.8-fold RMSE gain over conventional thermal and a 2.6-fold gain over uncorrected SA images. We also demonstrate our model’s generalization to other thermal signals, such as human signatures for search and rescue. Since simple thresholding is frequently inadequate for detecting subtle thermal signals, the morphological characteristics are equally essential for accurate classification. Our experiments demonstrated another clear advantage: we reconstructed the complete morphology of fire and human signatures, whereas conventional imaging is defeated by partial occlusion.

[304] Beyond Pixels: Semantic-aware Typographic Attack for Geo-Privacy Protection

Jiayi Zhu, Yihao Huang, Yue Cao, Xiaojun Jia, Qing Guo, Felix Juefei-Xu, Geguang Pu, Bin Wang

Main category: cs.CV

TL;DR: Typographical attacks using deceptive text can effectively protect geo-privacy against LVLMs while preserving image quality.

DetailsMotivation: LVLMs pose serious privacy threats by inferring geolocation from images, and existing adversarial perturbations degrade visual quality too much.

Method: Two-stage, semantics-aware typographical attack that generates deceptive text extensions outside visual content.

Result: Significantly reduces geolocation prediction accuracy of five state-of-the-art commercial LVLMs across three datasets.

Conclusion: Provides practical and visually-preserving protection strategy against emerging geo-privacy threats from LVLMs.

Abstract: Large Visual Language Models (LVLMs) now pose a serious yet overlooked privacy threat, as they can infer a social media user’s geolocation directly from shared images, leading to unintended privacy leakage. While adversarial image perturbations provide a potential direction for geo-privacy protection, they require relatively strong distortions to be effective against LVLMs, which noticeably degrade visual quality and diminish an image’s value for sharing. To overcome this limitation, we identify typographical attacks as a promising direction for protecting geo-privacy by adding text extension outside the visual content. We further investigate which textual semantics are effective in disrupting geolocation inference and design a two-stage, semantics-aware typographical attack that generates deceptive text to protect user privacy. Extensive experiments across three datasets demonstrate that our approach significantly reduces geolocation prediction accuracy of five state-of-the-art commercial LVLMs, establishing a practical and visually-preserving protection strategy against emerging geo-privacy threats.

[305] TempoMaster: Efficient Long Video Generation via Next-Frame-Rate Prediction

Yukuo Ma, Cong Liu, Junke Wang, Junqi Liu, Haibin Huang, Zuxuan Wu, Chi Zhang, Xuelong Li

Main category: cs.CV

TL;DR: TempoMaster generates long videos through next-frame-rate prediction, first creating low-frame-rate blueprints then progressively increasing frame rates for refinement.

DetailsMotivation: To achieve long-range temporal coherence in video generation while maintaining efficient and parallel synthesis capabilities.

Method: Formulates video generation as next-frame-rate prediction, using bidirectional attention within frame-rate levels and autoregression across frame rates.

Result: Establishes new state-of-the-art in long video generation with excellent visual and temporal quality.

Conclusion: TempoMaster’s hierarchical frame-rate approach effectively balances temporal coherence and generation efficiency for long videos.

Abstract: We present TempoMaster, a novel framework that formulates long video generation as next-frame-rate prediction. Specifically, we first generate a low-frame-rate clip that serves as a coarse blueprint of the entire video sequence, and then progressively increase the frame rate to refine visual details and motion continuity. During generation, TempoMaster employs bidirectional attention within each frame-rate level while performing autoregression across frame rates, thus achieving long-range temporal coherence while enabling efficient and parallel synthesis. Extensive experiments demonstrate that TempoMaster establishes a new state-of-the-art in long video generation, excelling in both visual and temporal quality.

[306] Rank-Aware Agglomeration of Foundation Models for Immunohistochemistry Image Cell Counting

Zuqi Huang, Mengxin Tian, Huan Liu, Wentao Li, Baobao Liang, Jie Wu, Fang Yan, Zhaoqing Tang, Zhongyu Li

Main category: cs.CV

TL;DR: CountIHC is a novel framework for multi-class cell counting in IHC images that selectively distills knowledge from multiple foundation models using rank-aware teacher selection and vision-language alignment for improved accuracy.

DetailsMotivation: Current regression-based counting methods struggle with multi-class counting and don't effectively leverage foundation models. IHC image analysis faces challenges from chromogen overlap, variable staining, and diverse cellular morphologies.

Method: Proposes rank-aware agglomeration with RATS strategy for sample-wise teacher selection based on global-to-local patch rankings. Uses vision-language alignment with discrete semantic anchors from structured text prompts to guide class-specific density map regression.

Result: Outperforms state-of-the-art methods across 12 IHC biomarkers and 5 tissue types, showing high agreement with pathologists’ assessments. Also effective on H&E-stained data, demonstrating scalability.

Conclusion: CountIHC successfully addresses multi-class cell counting challenges in IHC images through selective knowledge distillation and vision-language alignment, achieving superior performance and clinical relevance.

Abstract: Accurate cell counting in immunohistochemistry (IHC) images is critical for quantifying protein expression and aiding cancer diagnosis. However, the task remains challenging due to the chromogen overlap, variable biomarker staining, and diverse cellular morphologies. Regression-based counting methods offer advantages over detection-based ones in handling overlapped cells, yet rarely support end-to-end multi-class counting. Moreover, the potential of foundation models remains largely underexplored in this paradigm. To address these limitations, we propose a rank-aware agglomeration framework that selectively distills knowledge from multiple strong foundation models, leveraging their complementary representations to handle IHC heterogeneity and obtain a compact yet effective student model, CountIHC. Unlike prior task-agnostic agglomeration strategies that either treat all teachers equally or rely on feature similarity, we design a Rank-Aware Teacher Selecting (RATS) strategy that models global-to-local patch rankings to assess each teacher’s inherent counting capacity and enable sample-wise teacher selection. For multi-class cell counting, we introduce a fine-tuning stage that reformulates the task as vision-language alignment. Discrete semantic anchors derived from structured text prompts encode both category and quantity information, guiding the regression of class-specific density maps and improving counting for overlapping cells. Extensive experiments demonstrate that CountIHC surpasses state-of-the-art methods across 12 IHC biomarkers and 5 tissue types, while exhibiting high agreement with pathologists’ assessments. Its effectiveness on H&E-stained data further confirms the scalability of the proposed method.

[307] Fine-Grained Representation for Lane Topology Reasoning

Guoqing Xu, Yiheng Li, Yang Yang

Main category: cs.CV

TL;DR: TopoFG is a fine-grained lane topology reasoning framework that improves autonomous driving navigation by using hierarchical priors, region-focused decoding, and robust boundary-point topology reasoning to accurately model complex lane structures.

DetailsMotivation: Existing methods struggle to accurately model complex lane structures with single-query representations, leading to unreliable topology predictions that impact autonomous driving decisions.

Method: Divides topology prediction into three phases: Hierarchical Prior Extractor (HPE) for global spatial and local sequential priors, Region-Focused Decoder (RFD) for fine-grained query construction, and Robust Boundary-Point Topology Reasoning (RBTR) with denoising strategy.

Result: Achieves state-of-the-art performance on OpenLane-V2 benchmark with OLS of 48.0% on subsetA and 45.4% on subsetB.

Conclusion: Integrating spatial and sequential priors into fine-grained queries with boundary-point topology denoising enables precise modeling of complex lane structures and trustworthy topology predictions.

Abstract: Precise modeling of lane topology is essential for autonomous driving, as it directly impacts navigation and control decisions.Existing methods typically represent each lane with a single query and infer topological connectivity based on the similarity between lane queries.However, this kind of design struggles to accurately model complex lane structures, leading to unreliable topology prediction.In this view, we propose a Fine-Grained lane topology reasoning framework (TopoFG).It divides the procedure from bird’s-eye-view (BEV) features to topology prediction via fine-grained queries into three phases, i.e., Hierarchical Prior Extractor (HPE), Region-Focused Decoder (RFD), and Robust Boundary-Point Topology Reasoning (RBTR).Specifically, HPE extracts global spatial priors from the BEV mask and local sequential priors from in-lane keypoint sequences to guide subsequent fine-grained query modeling.RFD constructs fine-grained queries by integrating the spatial and sequential priors. It then samples reference points in RoI regions of the mask and applies cross-attention with BEV features to refine the query representations of each lane.RBTR models lane connectivity based on boundary-point query features and further employs a topological denoising strategy to reduce matching ambiguity.By integrating spatial and sequential priors into fine-grained queries and applying a denoising strategy to boundary-point topology reasoning, our method precisely models complex lane structures and delivers trustworthy topology predictions.Extensive experiments on the OpenLane-V2 benchmark demonstrate that TopoFG achieves new state-of-the-art performance, with an OLS of 48.0% on subsetA and 45.4% on subsetB.

[308] Seg-VAR: Image Segmentation with Visual Autoregressive Modeling

Rongkun Zheng, Lu Qi, Xi Chen, Yi Wang, Kun Wang, Hengshuang Zhao

Main category: cs.CV

TL;DR: Seg-VAR rethinks segmentation as a conditional autoregressive mask generation problem using latent learning instead of discriminative learning, achieving state-of-the-art performance across various segmentation tasks.

DetailsMotivation: Visual autoregressive modeling (VAR) strategies have shown success in image generation but their potential for segmentation tasks requiring precise spatial perception remains unexplored. The authors aim to bridge this gap by applying autoregressive reasoning to segmentation.

Method: Three core components: (1) image encoder generating latent priors, (2) spatial-aware seglat encoder mapping masks to discrete latent tokens using location-sensitive color mapping, (3) decoder reconstructing masks from latents. Multi-stage training: first learning seglat representations, then refining latent transformations, finally aligning image-encoder-derived latents with seglat distributions.

Result: Seg-VAR outperforms previous discriminative and generative methods on various segmentation tasks and validation benchmarks.

Conclusion: By framing segmentation as a sequential hierarchical prediction task, Seg-VAR opens new avenues for integrating autoregressive reasoning into spatial-aware vision systems.

Abstract: While visual autoregressive modeling (VAR) strategies have shed light on image generation with the autoregressive models, their potential for segmentation, a task that requires precise low-level spatial perception, remains unexplored. Inspired by the multi-scale modeling of classic Mask2Former-based models, we propose Seg-VAR, a novel framework that rethinks segmentation as a conditional autoregressive mask generation problem. This is achieved by replacing the discriminative learning with the latent learning process. Specifically, our method incorporates three core components: (1) an image encoder generating latent priors from input images, (2) a spatial-aware seglat (a latent expression of segmentation mask) encoder that maps segmentation masks into discrete latent tokens using a location-sensitive color mapping to distinguish instances, and (3) a decoder reconstructing masks from these latents. A multi-stage training strategy is introduced: first learning seglat representations via image-seglat joint training, then refining latent transformations, and finally aligning image-encoder-derived latents with seglat distributions. Experiments show Seg-VAR outperforms previous discriminative and generative methods on various segmentation tasks and validation benchmarks. By framing segmentation as a sequential hierarchical prediction task, Seg-VAR opens new avenues for integrating autoregressive reasoning into spatial-aware vision systems. Code will be available at https://github.com/rkzheng99/Seg-VAR.

[309] LoRA-Enhanced Vision Transformer for Single Image based Morphing Attack Detection via Knowledge Distillation from EfficientNet

Ria Shekhawat, Sushrut Patwardhan, Raghavendra Ramachandra, Praveen Kumar Chandaliya, Kishor P. Upla

Main category: cs.CV

TL;DR: A teacher-student framework combining CNN and ViT with LoRA fine-tuning achieves efficient and accurate single-image morphing attack detection.

DetailsMotivation: Face recognition systems are vulnerable to morphing attacks that blend biometric features from multiple individuals, requiring robust detection methods.

Method: Proposes a teacher-student framework where a CNN teacher refines a ViT student model, integrated with Low-Rank Adaptation (LoRA) for efficient fine-tuning.

Result: Superior detection performance and computational efficiency compared to six state-of-the-art S-MAD techniques on a comprehensive morphing dataset.

Conclusion: The proposed approach effectively addresses morphing attack detection with high accuracy and reduced computational costs.

Abstract: Face Recognition Systems (FRS) are critical for security but remain vulnerable to morphing attacks, where synthetic images blend biometric features from multiple individuals. We propose a novel Single-Image Morphing Attack Detection (S-MAD) approach using a teacher-student framework, where a CNN-based teacher model refines a ViT-based student model. To improve efficiency, we integrate Low-Rank Adaptation (LoRA) for fine-tuning, reducing computational costs while maintaining high detection accuracy. Extensive experiments are conducted on a morphing dataset built from three publicly available face datasets, incorporating ten different morphing generation algorithms to assess robustness. The proposed method is benchmarked against six state-of-the-art S-MAD techniques, demonstrating superior detection performance and computational efficiency.

[310] Pixels or Positions? Benchmarking Modalities in Group Activity Recognition

Drishya Karki, Merey Ramazanova, Anthony Cioppa, Silvio Giancola, Bernard Ghanem

Main category: cs.CV

TL;DR: SoccerNet-GAR is a multimodal dataset for group activity recognition in football, comparing video vs tracking modalities. Tracking-based GAR with role-aware graph networks outperforms video-based methods with significantly better efficiency.

DetailsMotivation: Current GAR research lacks standardized benchmarks comparing video and tracking modalities for the same activities. Tracking data provides compact, agent-centric signals that explicitly encode spatial interactions but remains under-explored compared to video.

Method: Introduced SoccerNet-GAR dataset with synchronized broadcast video and player tracking data from 64 World Cup 2022 matches. Developed unified evaluation protocol and two unimodal approaches: competitive video-based classifiers and novel role-aware graph neural networks for tracking-based GAR.

Result: Tracking model achieved 67.2% balanced accuracy vs 58.1% for best video baseline, while training 4.25x faster with 438x fewer parameters (197K vs 86.3M).

Conclusion: Tracking-based GAR with role-aware modeling significantly outperforms video-based approaches in both accuracy and efficiency, highlighting the importance of modality choice and tactical structure encoding for group activity recognition.

Abstract: Group Activity Recognition (GAR) is well studied on the video modality for surveillance and indoor team sports (e.g., volleyball, basketball). Yet, other modalities such as agent positions and trajectories over time, i.e. tracking, remain comparatively under-explored despite being compact, agent-centric signals that explicitly encode spatial interactions. Understanding whether pixel (video) or position (tracking) modalities leads to better group activity recognition is therefore important to drive further research on the topic. However, no standardized benchmark currently exists that aligns broadcast video and tracking data for the same group activities, leading to a lack of apples-to-apples comparison between these modalities for GAR. In this work, we introduce SoccerNet-GAR, a multimodal dataset built from the $64$ matches of the football World Cup 2022. Specifically, the broadcast videos and player tracking modalities for $94{,}285$ group activities are synchronized and annotated with $10$ categories. Furthermore, we define a unified evaluation protocol to benchmark two strong unimodal approaches: (i) a competitive video-based classifiers and (ii) a tracking-based classifiers leveraging graph neural networks. In particular, our novel role-aware graph architecture for tracking-based GAR directly encodes tactical structure through positional edges and temporal attention. Our tracking model achieves $67.2%$ balanced accuracy compared to $58.1%$ for the best video baseline, while training $4.25 \times$ faster with $438 \times$ fewer parameters ($197K$ \vs $86.3M$). This study provides new insights into the relative strengths of pixels and positions for group activity recognition. Overall, it highlights the importance of modality choice and role-aware modeling for GAR.

[311] Open-World Test-Time Adaptation with Hierarchical Feature Aggregation and Attention Affine

Ziqiong Liu, Yushun Tang, Junyang Ji, Zhihai He

Main category: cs.CV

TL;DR: Proposes Hierarchical Ladder Network for OOD detection and Attention Affine Network for domain adaptation in test-time adaptation, improving robustness against domain shifts and OOD samples.

DetailsMotivation: Existing TTA methods suffer performance drops when encountering OOD samples, which can mislead adaptation and degrade accuracy on subsequent ID samples.

Method: Uses Hierarchical Ladder Network to extract OOD features from Transformer class tokens, Attention Affine Network to refine self-attention for domain drift, and weighted entropy to suppress low-confidence samples.

Result: Significantly improves performance on benchmark classification datasets compared to existing methods.

Conclusion: The proposed hierarchical approach effectively addresses OOD detection and domain adaptation challenges in test-time adaptation, enhancing model robustness and performance.

Abstract: Test-time adaptation (TTA) refers to adjusting the model during the testing phase to cope with changes in sample distribution and enhance the model’s adaptability to new environments. In real-world scenarios, models often encounter samples from unseen (out-of-distribution, OOD) categories. Misclassifying these as known (in-distribution, ID) classes not only degrades predictive accuracy but can also impair the adaptation process, leading to further errors on subsequent ID samples. Many existing TTA methods suffer substantial performance drops under such conditions. To address this challenge, we propose a Hierarchical Ladder Network that extracts OOD features from class tokens aggregated across all Transformer layers. OOD detection performance is enhanced by combining the original model prediction with the output of the Hierarchical Ladder Network (HLN) via weighted probability fusion. To improve robustness under domain shift, we further introduce an Attention Affine Network (AAN) that adaptively refines the self-attention mechanism conditioned on the token information to better adapt to domain drift, thereby improving the classification performance of the model on datasets with domain shift. Additionally, a weighted entropy mechanism is employed to dynamically suppress the influence of low-confidence samples during adaptation. Experimental results on benchmark datasets show that our method significantly improves the performance on the most widely used classification datasets.

[312] OPFormer: Object Pose Estimation leveraging foundation model with geometric encoding

Artem Moroz, Vít Zeman, Martin Mikšík, Elizaveta Isianova, Miroslav David, Pavel Burget, Varun Burde

Main category: cs.CV

TL;DR: A unified framework combining object detection and pose estimation with flexible onboarding from CAD models or neural reconstruction, using CNOS detector and OPFormer transformer for robust 6D pose estimation.

DetailsMotivation: To create an end-to-end system that handles both object detection and pose estimation while supporting flexible object representation onboarding from either traditional CAD models or neural reconstructions when models are unavailable.

Method: Pipeline with onboarding stage generating object representations from CAD or NeRF reconstruction, CNOS detector for object localization, and OPFormer transformer architecture that encodes multiple template views with 3D geometric priors using NOCS for robust 2D-3D correspondences.

Result: Demonstrates strong balance between accuracy and efficiency on BOP benchmarks, showing practical applicability in both model-based and model-free scenarios.

Conclusion: The integrated system provides a versatile solution for 6D pose estimation that works effectively with both traditional CAD models and neural reconstructions, achieving competitive performance on challenging benchmarks.

Abstract: We introduce a unified, end-to-end framework that seamlessly integrates object detection and pose estimation with a versatile onboarding process. Our pipeline begins with an onboarding stage that generates object representations from either traditional 3D CAD models or, in their absence, by rapidly reconstructing a high-fidelity neural representation (NeRF) from multi-view images. Given a test image, our system first employs the CNOS detector to localize target objects. For each detection, our novel pose estimation module, OPFormer, infers the precise 6D pose. The core of OPFormer is a transformer-based architecture that leverages a foundation model for robust feature extraction. It uniquely learns a comprehensive object representation by jointly encoding multiple template views and enriches these features with explicit 3D geometric priors using Normalized Object Coordinate Space (NOCS). A decoder then establishes robust 2D-3D correspondences to determine the final pose. Evaluated on the challenging BOP benchmarks, our integrated system demonstrates a strong balance between accuracy and efficiency, showcasing its practical applicability in both model-based and model-free scenarios.

[313] Multivariate Diffusion Transformer with Decoupled Attention for High-Fidelity Mask-Text Collaborative Facial Generation

Yushe Cao, Dianxi Shi, Xing Fu, Xuechao Zou, Haikuo Peng, Xueqi Li, Chun Yu, Junliang Xing

Main category: cs.CV

TL;DR: MDiTFace is a diffusion transformer framework that uses unified tokenization and multivariate transformer blocks for multimodal facial generation, with a decoupled attention mechanism that reduces computational overhead by 94% while maintaining performance.

DetailsMotivation: Conventional feature fusion approaches fail to enable effective cross-modal interactions between semantic masks and textual descriptions, leading to suboptimal generation outcomes in multimodal facial generation.

Method: Uses unified tokenization for semantic masks and text inputs, stacked multivariate transformer blocks for synchronous condition processing, and a novel decoupled attention mechanism that separates mask tokens from temporal embeddings into dynamic and static pathways.

Result: Significantly outperforms competing methods in facial fidelity and conditional consistency while reducing computational overhead introduced by mask condition by over 94%.

Conclusion: MDiTFace effectively addresses cross-modal interaction challenges in multimodal facial generation through unified tokenization and decoupled attention, achieving superior performance with significantly reduced computational costs.

Abstract: While significant progress has been achieved in multimodal facial generation using semantic masks and textual descriptions, conventional feature fusion approaches often fail to enable effective cross-modal interactions, thereby leading to suboptimal generation outcomes. To address this challenge, we introduce MDiTFace–a customized diffusion transformer framework that employs a unified tokenization strategy to process semantic mask and text inputs, eliminating discrepancies between heterogeneous modality representations. The framework facilitates comprehensive multimodal feature interaction through stacked, newly designed multivariate transformer blocks that process all conditions synchronously. Additionally, we design a novel decoupled attention mechanism by dissociating implicit dependencies between mask tokens and temporal embeddings. This mechanism segregates internal computations into dynamic and static pathways, enabling caching and reuse of features computed in static pathways after initial calculation, thereby reducing additional computational overhead introduced by mask condition by over 94% while maintaining performance. Extensive experiments demonstrate that MDiTFace significantly outperforms other competing methods in terms of both facial fidelity and conditional consistency.

[314] Denoising Vision Transformer Autoencoder with Spectral Self-Regularization

Xunzhi Xiang, Xingye Tian, Guiyu Zhang, Yabo Chen, Shaofeng Zhang, Xuebo Wang, Xin Tao, Qi Fan

Main category: cs.CV

TL;DR: Denoising-VAE addresses the optimization dilemma in high-dimensional latent spaces by suppressing redundant high-frequency noise, enabling faster diffusion model convergence and improved generation quality without relying on external vision foundation models.

DetailsMotivation: VAEs face a trade-off where higher-dimensional latent spaces improve reconstruction but harm generative performance. Current methods use external vision models for regularization, but the fundamental impact of high-dimensional latents on generative model optimization remains unclear.

Method: Proposes spectral self-regularization to suppress redundant high-frequency noise in latent spaces while preserving reconstruction quality. Also introduces spectral alignment strategy to optimize Denoising-VAE-based generative models. Uses ViT-based autoencoder architecture without VFM dependency.

Result: Enables diffusion models to converge 2× faster than SD-VAE. Achieves state-of-the-art reconstruction quality (rFID = 0.28, PSNR = 27.26) and competitive generation performance (gFID = 1.82) on ImageNet 256×256 benchmark.

Conclusion: The analysis reveals that redundant high-frequency components in high-dimensional latent spaces hinder diffusion model convergence. Denoising-VAE effectively addresses this by producing cleaner latents, improving both reconstruction and generation while accelerating training.

Abstract: Variational autoencoders (VAEs) typically encode images into a compact latent space, reducing computational cost but introducing an optimization dilemma: a higher-dimensional latent space improves reconstruction fidelity but often hampers generative performance. Recent methods attempt to address this dilemma by regularizing high-dimensional latent spaces using external vision foundation models (VFMs). However, it remains unclear how high-dimensional VAE latents affect the optimization of generative models. To our knowledge, our analysis is the first to reveal that redundant high-frequency components in high-dimensional latent spaces hinder the training convergence of diffusion models and, consequently, degrade generation quality. To alleviate this problem, we propose a spectral self-regularization strategy to suppress redundant high-frequency noise while simultaneously preserving reconstruction quality. The resulting Denoising-VAE, a ViT-based autoencoder that does not rely on VFMs, produces cleaner, lower-noise latents, leading to improved generative quality and faster optimization convergence. We further introduce a spectral alignment strategy to facilitate the optimization of Denoising-VAE-based generative models. Our complete method enables diffusion models to converge approximately 2$\times$ faster than with SD-VAE, while achieving state-of-the-art reconstruction quality (rFID = 0.28, PSNR = 27.26) and competitive generation performance (gFID = 1.82) on the ImageNet 256$\times$256 benchmark.

[315] Medical Knowledge Intervention Prompt Tuning for Medical Image Classification

Ye Du, Nanxi Yu, Shujun Wang

Main category: cs.CV

TL;DR: CILMP is a novel prompt tuning method that incorporates Large Language Models (LLMs) to enhance medical image classification by generating disease-specific, instance-adaptive prompts for Vision-Language Models (VLMs).

DetailsMotivation: Existing prompt tuning methods cannot precisely distinguish different medical concepts and miss disease-specific features across medical imaging modalities. LLMs possess specialized medical knowledge that can improve VLM performance in medical tasks.

Method: CILMP extracts disease-specific representations from LLMs, intervenes within a low-rank linear subspace, and uses them to create disease-specific prompts with a conditional mechanism for instance-adaptive prompt generation.

Result: Extensive experiments across diverse medical image datasets show that CILMP consistently outperforms state-of-the-art prompt tuning methods.

Conclusion: CILMP effectively bridges LLMs and VLMs to transfer medical knowledge into VLM prompts, demonstrating superior performance in medical image classification tasks.

Abstract: Vision-language foundation models (VLMs) have shown great potential in feature transfer and generalization across a wide spectrum of medical-related downstream tasks. However, fine-tuning these models is resource-intensive due to their large number of parameters. Prompt tuning has emerged as a viable solution to mitigate memory usage and reduce training time while maintaining competitive performance. Nevertheless, the challenge is that existing prompt tuning methods cannot precisely distinguish different kinds of medical concepts, which miss essentially specific disease-related features across various medical imaging modalities in medical image classification tasks. We find that Large Language Models (LLMs), trained on extensive text corpora, are particularly adept at providing this specialized medical knowledge. Motivated by this, we propose incorporating LLMs into the prompt tuning process. Specifically, we introduce the CILMP, Conditional Intervention of Large Language Models for Prompt Tuning, a method that bridges LLMs and VLMs to facilitate the transfer of medical knowledge into VLM prompts. CILMP extracts disease-specific representations from LLMs, intervenes within a low-rank linear subspace, and utilizes them to create disease-specific prompts. Additionally, a conditional mechanism is incorporated to condition the intervention process on each individual medical image, generating instance-adaptive prompts and thus enhancing adaptability. Extensive experiments across diverse medical image datasets demonstrate that CILMP consistently outperforms state-of-the-art prompt tuning methods, demonstrating its effectiveness. Code is available at https://github.com/usr922/cilmp.

[316] DPVO-QAT++: Heterogeneous QAT and CUDA Kernel Fusion for High-Performance Deep Patch Visual Odometry

Cheng Liao

Main category: cs.CV

TL;DR: DPVO-QAT++ is a hierarchical quantization optimization framework that combines heterogeneous precision design and GPU kernel fusion to significantly improve computational efficiency of deep visual SLAM systems while maintaining accuracy.

DetailsMotivation: Deep learning-based Visual SLAM systems have excellent geometric reasoning but suffer from prohibitive computational overhead that limits deployment on resource-constrained autonomous platforms.

Method: Uses learnable scale parameterization, heterogeneous precision design (front-end floating-point fake quantization with FP16/FP32; back-end full precision), and GPU-native kernel fusion for fake quantization with custom CUDA kernels.

Result: On TartanAir: 52.1% FPS increase, 29.1% latency reduction, 64.9% GPU memory reduction. On EuRoC: 30.1% FPS increase, 23.1% latency reduction, 37.7% GPU memory reduction, while maintaining comparable trajectory accuracy to original model.

Conclusion: DPVO-QAT++ effectively bridges the gap between high-precision deep VO and practical deployment efficiency, offering a viable engineering paradigm for real-world embedded platforms.

Abstract: Deep learning-based Visual SLAM (vSLAM) systems exhibit exceptional geometric reasoning capabilities, yet their prohibitive computational overhead severely restricts deployment on resource-constrained autonomous platforms. This paper presents a hierarchical quantization optimization framework, DPVO-QAT++ (DPVO-QAT++: Heterogeneous QAT and CUDA Kernel Fusion for High-Performance Deep Patch Visual Odometry). Through the synergistic integration of learnable scale parameterization, a heterogeneous precision design for the Visual Odometry (VO) front-end and back-end (front-end floating-point fake quantization with FP16/FP32; back-end full precision), and GPU-native kernel fusion for fake quantization (custom CUDA kernels), our framework significantly reduces memory footprint and increases processing speed while preserving the trajectory accuracy of the original model. On the TartanAir dataset, our framework achieves an average FPS increase of 52.1%, a 29.1% reduction in median latency, and a 64.9% reduction in peak GPU memory reservation, while maintaining trajectory accuracy (ATE) comparable to the original DPVO model across 32 validation sequences. On the EuRoC dataset, it realizes an average FPS increase of 30.1%, a 23.1% reduction in median latency, and a 37.7% reduction in peak GPU memory reservation, maintaining comparable trajectory accuracy (ATE) across 11 validation sequences. Experimental results demonstrate that DPVO-QAT++ effectively bridges the gap between high-precision deep VO and the efficiency requirements for practical deployment, offering a viable engineering paradigm for the application of this technology on real-world embedded platforms. Keywords: Visual Odometry, Heterogeneous Precision Architecture, Quantization-Aware Training, CUDA Kernel Fusion, Scale-Only Training, Deep Patch Visual Odometry, GPU-Native Kernel Fusion.

[317] Toward Real-world Text Image Forgery Localization: Structured and Interpretable Data Synthesis

Zeqin Yu, Haotao Xie, Jian Zhang, Jiangqun Ni, Wenkan Su, Jiwu Huang

Main category: cs.CV

TL;DR: FSTS is a Fourier Series-based Tampering Synthesis framework that generates realistic training data for text image forgery localization by modeling real-world tampering behaviors from collected editing traces.

DetailsMotivation: Existing T-IFL methods suffer from poor generalization due to limited real-world datasets and synthetic data that doesn't capture real tampering complexity.

Method: Collects 16,750 real tampering instances, analyzes editing traces, and models tampering parameters hierarchically using Fourier series-inspired basis functions to synthesize realistic training data.

Result: Models trained with FSTS data achieve significantly improved generalization on real-world datasets across four evaluation protocols.

Conclusion: FSTS provides an interpretable framework for generating diverse and realistic tampered text images that better reflect real-world forgery traces, improving model generalization.

Abstract: Existing Text Image Forgery Localization (T-IFL) methods often suffer from poor generalization due to the limited scale of real-world datasets and the distribution gap caused by synthetic data that fails to capture the complexity of real-world tampering. To tackle this issue, we propose Fourier Series-based Tampering Synthesis (FSTS), a structured and interpretable framework for synthesizing tampered text images. FSTS first collects 16,750 real-world tampering instances from five representative tampering types, using a structured pipeline that records human-performed editing traces via multi-format logs (e.g., video, PSD, and editing logs). By analyzing these collected parameters and identifying recurring behavioral patterns at both individual and population levels, we formulate a hierarchical modeling framework. Specifically, each individual tampering parameter is represented as a compact combination of basis operation-parameter configurations, while the population-level distribution is constructed by aggregating these behaviors. Since this formulation draws inspiration from the Fourier series, it enables an interpretable approximation using basis functions and their learned weights. By sampling from this modeled distribution, FSTS synthesizes diverse and realistic training data that better reflect real-world forgery traces. Extensive experiments across four evaluation protocols demonstrate that models trained with FSTS data achieve significantly improved generalization on real-world datasets. Dataset is available at \href{https://github.com/ZeqinYu/FSTS}{Project Page}.

[318] Hi-Reco: High-Fidelity Real-Time Conversational Digital Humans

Hongbin Huang, Junwei Li, Tianxin Xie, Zhuang Li, Cekai Weng, Yaodong Yang, Yue Luo, Li Liu, Jing Tang, Zhijing Shao, Zeyu Wang

Main category: cs.CV

TL;DR: A real-time conversational digital human system combining realistic 3D avatars, expressive speech synthesis, and knowledge-grounded dialogue with low-latency asynchronous pipeline.

DetailsMotivation: Achieving both visual realism and real-time responsiveness for interactive digital humans in communication, education, and entertainment applications.

Method: Asynchronous execution pipeline coordinating multi-modal components, retrieval-augmented methods with history augmentation and intent-based routing, wake word detection, and emotionally expressive prosody.

Result: Integrated system enabling responsive and believable digital humans with minimal latency, supporting natural conversational flow and context-aware response generation.

Conclusion: The system successfully combines visual realism with real-time performance, making high-fidelity digital humans suitable for immersive interactive applications.

Abstract: High-fidelity digital humans are increasingly used in interactive applications, yet achieving both visual realism and real-time responsiveness remains a major challenge. We present a high-fidelity, real-time conversational digital human system that seamlessly combines a visually realistic 3D avatar, persona-driven expressive speech synthesis, and knowledge-grounded dialogue generation. To support natural and timely interaction, we introduce an asynchronous execution pipeline that coordinates multi-modal components with minimal latency. The system supports advanced features such as wake word detection, emotionally expressive prosody, and highly accurate, context-aware response generation. It leverages novel retrieval-augmented methods, including history augmentation to maintain conversational flow and intent-based routing for efficient knowledge access. Together, these components form an integrated system that enables responsive and believable digital humans, suitable for immersive applications in communication, education, and entertainment.

[319] DensePercept-NCSSD: Vision Mamba towards Real-time Dense Visual Perception with Non-Causal State Space Duality

Tushar Anand, Advik Sinha, Abhijit Das

Main category: cs.CV

TL;DR: Proposes a real-time optical flow and disparity estimation model using non-causal selective state space fusion for dense perception tasks.

DetailsMotivation: Need for fast and efficient models that can handle real-time constraints while maintaining high accuracy for optical flow and disparity estimation in 3D dense perception.

Method: Uses non-causal Mamba block-based model with pairwise input image fusion in selective state space for dense perception tasks.

Result: Reduces inference times while maintaining high accuracy and low GPU usage for optical flow and disparity map generation.

Conclusion: The proposed model is suitable for unified real-time and accurate 3D dense perception estimation tasks, validated in real-life scenarios.

Abstract: In this work, we propose an accurate and real-time optical flow and disparity estimation model by fusing pairwise input images in the proposed non-causal selective state space for dense perception tasks. We propose a non-causal Mamba block-based model that is fast and efficient and aptly manages the constraints present in a real-time applications. Our proposed model reduces inference times while maintaining high accuracy and low GPU usage for optical flow and disparity map generation. The results and analysis, and validation in real-life scenario justify that our proposed model can be used for unified real-time and accurate 3D dense perception estimation tasks. The code, along with the models, can be found at https://github.com/vimstereo/DensePerceptNCSSD

[320] Appreciate the View: A Task-Aware Evaluation Framework for Novel View Synthesis

Saar Stern, Ido Sobol, Or Litany

Main category: cs.CV

TL;DR: Proposes a task-aware evaluation framework for Novel View Synthesis (NVS) using features from Zero123 foundation model, with two metrics (D_PRISM and MMD_PRISM) that reliably assess synthesis quality and align with human preferences.

DetailsMotivation: Existing evaluation metrics for NVS struggle to assess whether generated images are both realistic and faithful to source view and viewpoint transformation, often mis-ranking incorrect results due to inability to capture nuanced relationships.

Method: Leverages features from Zero123 foundation model with lightweight tuning, introducing two metrics: reference-based D_PRISM and reference-free MMD_PRISM that use these enhanced features to evaluate synthesis quality.

Result: Both metrics reliably identify incorrect generations and rank models in agreement with human preference studies. MMD_PRISM produces clear and stable rankings across three benchmarks (Toys4K, GSO, OmniObject3D) with lower scores indicating stronger models.

Conclusion: The framework provides a principled and practical approach to assessing NVS quality, addressing a fundamental gap in evaluation and enabling more reliable progress in novel view synthesis.

Abstract: The goal of Novel View Synthesis (NVS) is to generate realistic images of a given content from unseen viewpoints. But how can we trust that a generated image truly reflects the intended transformation? Evaluating its reliability remains a major challenge. While recent generative models, particularly diffusion-based approaches, have significantly improved NVS quality, existing evaluation metrics struggle to assess whether a generated image is both realistic and faithful to the source view and intended viewpoint transformation. Standard metrics, such as pixel-wise similarity and distribution-based measures, often mis-rank incorrect results as they fail to capture the nuanced relationship between the source image, viewpoint change, and generated output. We propose a task-aware evaluation framework that leverages features from a strong NVS foundation model, Zero123, combined with a lightweight tuning step to enhance discrimination. Using these features, we introduce two complementary evaluation metrics: a reference-based score, $D_{\text{PRISM}}$, and a reference-free score, $\text{MMD}{\text{PRISM}}$. Both reliably identify incorrect generations and rank models in agreement with human preference studies, addressing a fundamental gap in NVS evaluation. Our framework provides a principled and practical approach to assessing synthesis quality, paving the way for more reliable progress in novel view synthesis. To further support this goal, we apply our reference-free metric to six NVS methods across three benchmarks: Toys4K, Google Scanned Objects (GSO), and OmniObject3D, where $\text{MMD}{\text{PRISM}}$ produces a clear and stable ranking, with lower scores consistently indicating stronger models.

[321] BridgeEQA: Virtual Embodied Agents for Real Bridge Inspections

Subin Varghese, Joshua Gao, Asad Ur Rahman, Vedhus Hoskere

Main category: cs.CV

TL;DR: BridgeEQA: A benchmark for embodied question answering in infrastructure inspection using real-world bridge scenes with professional inspection reports and NBI condition ratings.

DetailsMotivation: Current embodied agents struggle with realistic real-world settings; infrastructure inspection provides a compelling domain requiring multi-scale reasoning, long-range spatial understanding, and complex semantic relationships.

Method: Proposed Embodied Memory Visual Reasoning (EMVR) that formulates inspection as sequential navigation over an image-based scene graph, treating images as nodes and using Markov decision process for traversal and reasoning.

Result: Evaluations show substantial performance gaps in state-of-the-art models; EMVR demonstrates strong performance over baselines on the BridgeEQA benchmark with 2,200 question-answer pairs across 200 bridge scenes.

Conclusion: BridgeEQA provides a faithful benchmark for open-vocabulary embodied question answering, and EMVR effectively addresses the challenges through sequential navigation and reasoning over visual evidence.

Abstract: Deploying embodied agents that can answer questions about their surroundings in realistic real-world settings remains difficult, partly due to the scarcity of benchmarks that faithfully capture practical operating conditions. We propose infrastructure inspection as a compelling domain for open-vocabulary Embodied Question Answering (EQA): it naturally demands multi-scale reasoning, long-range spatial understanding, and complex semantic relationships, while offering unique evaluation advantages via standardized National Bridge Inventory (NBI) condition ratings (0-9), professional inspection reports, and egocentric imagery. We introduce BridgeEQA, a benchmark of 2,200 open-vocabulary question-answer pairs (in the style of OpenEQA) grounded in professional inspection reports across 200 real-world bridge scenes with 47.93 images on average per scene. Questions require synthesizing visual evidence across multiple images and aligning responses with NBI condition ratings. We further propose a new EQA metric Image Citation Relevance to evaluate the ability of a model to cite relevant images. Evaluations of state-of-the-art vision-language models reveal substantial performance gaps under episodic memory EQA settings. To address this, we propose Embodied Memory Visual Reasoning (EMVR), which formulates inspection as sequential navigation over an image-based scene graph: images are nodes, and an agent takes actions to traverse views, compare evidence, and reason within a Markov decision process. EMVR shows strong performance over the baselines. We publicly release both the dataset and code.

[322] R$^{2}$Seg: Training-Free OOD Medical Tumor Segmentation via Anatomical Reasoning and Statistical Rejection

Shuaike Shen, Ke Liu, Jiaqing Xie, Shangde Gao, Chunhua Shen, Ge Liu, Mireia Crispin-Ortuzar, Shangqi Gao

Main category: cs.CV

TL;DR: R²Seg is a training-free framework for robust out-of-distribution tumor segmentation that uses LLM-guided anatomical reasoning and statistical testing to suppress false positives without parameter updates.

DetailsMotivation: Foundation models for medical image segmentation struggle with out-of-distribution shifts, often producing fragmented false positives on OOD tumors.

Method: Two-stage Reason-and-Reject process: 1) LLM-guided anatomical reasoning planner localizes organ anchors and generates multi-scale ROIs, 2) Statistical rejection filter applies two-sample testing to candidates from frozen foundation model, retaining only those significantly different from normal tissue.

Result: Substantially improves Dice, specificity, and sensitivity on multi-center and multi-modal tumor segmentation benchmarks over strong baselines and original foundation models.

Conclusion: R²Seg provides effective OOD tumor segmentation without parameter updates, avoiding catastrophic forgetting and maintaining compatibility with zero-update test-time augmentation.

Abstract: Foundation models for medical image segmentation struggle under out-of-distribution (OOD) shifts, often producing fragmented false positives on OOD tumors. We introduce R$^{2}$Seg, a training-free framework for robust OOD tumor segmentation that operates via a two-stage Reason-and-Reject process. First, the Reason step employs an LLM-guided anatomical reasoning planner to localize organ anchors and generate multi-scale ROIs. Second, the Reject step applies two-sample statistical testing to candidates generated by a frozen foundation model (BiomedParse) within these ROIs. This statistical rejection filter retains only candidates significantly different from normal tissue, effectively suppressing false positives. Our framework requires no parameter updates, making it compatible with zero-update test-time augmentation and avoiding catastrophic forgetting. On multi-center and multi-modal tumor segmentation benchmarks, R$^{2}$Seg substantially improves Dice, specificity, and sensitivity over strong baselines and the original foundation models. Code are available at https://github.com/Eurekashen/R2Seg.

[323] HEDGE: Hallucination Estimation via Dense Geometric Entropy for VQA with Vision-Language Models

Sushant Gautam, Michael A. Riegler, Pål Halvorsen

Main category: cs.CV

TL;DR: HEDGE is a unified framework for detecting hallucinations in vision-language models that combines visual perturbations, semantic clustering, and uncertainty metrics to evaluate multimodal reliability across different architectures and prompt designs.

DetailsMotivation: Vision-language models are prone to hallucinations despite enabling open-ended visual question answering, requiring systematic detection methods to evaluate their reliability.

Method: Combines controlled visual perturbations, semantic clustering (entailment- and embedding-based), and robust uncertainty metrics in a reproducible pipeline applicable across multimodal architectures.

Result: Hallucination detectability varies by architecture and prompt design - highest for unified-fusion models with dense visual tokenization (Qwen2.5-VL) and lowest for restricted tokenization models (Med-Gemma). VASE metric consistently provides the most robust hallucination signal when paired with embedding clustering.

Conclusion: HEDGE provides a principled, compute-aware foundation for evaluating multimodal reliability by framing hallucination detection as a geometric robustness problem shaped by sampling scale, prompt structure, model architecture, and clustering strategy.

Abstract: Vision-language models (VLMs) enable open-ended visual question answering but remain prone to hallucinations. We present HEDGE, a unified framework for hallucination detection that combines controlled visual perturbations, semantic clustering, and robust uncertainty metrics. HEDGE integrates sampling, distortion synthesis, clustering (entailment- and embedding-based), and metric computation into a reproducible pipeline applicable across multimodal architectures. Evaluations on VQA-RAD and KvasirVQA-x1 with three representative VLMs (LLaVA-Med, Med-Gemma, Qwen2.5-VL) reveal clear architecture- and prompt-dependent trends. Hallucination detectability is highest for unified-fusion models with dense visual tokenization (Qwen2.5-VL) and lowest for architectures with restricted tokenization (Med-Gemma). Embedding-based clustering often yields stronger separation when applied directly to the generated answers, whereas NLI-based clustering remains advantageous for LLaVA-Med and for longer, sentence-level responses. Across configurations, the VASE metric consistently provides the most robust hallucination signal, especially when paired with embedding clustering and a moderate sampling budget (n ~ 10-15). Prompt design also matters: concise, label-style outputs offer clearer semantic structure than syntactically constrained one-sentence responses. By framing hallucination detection as a geometric robustness problem shaped jointly by sampling scale, prompt structure, model architecture, and clustering strategy, HEDGE provides a principled, compute-aware foundation for evaluating multimodal reliability. The hedge-bench PyPI library enables reproducible and extensible benchmarking, with full code and experimental resources available at https://github.com/Simula/HEDGE .

[324] X-VMamba: Explainable Vision Mamba

Mohamed A. Mabrok, Yalda Zafari

Main category: cs.CV

TL;DR: A controllability-based interpretability framework for State Space Models (SSMs) that quantifies how input tokens influence internal state dynamics, enabling transparent analysis of hierarchical feature processing in vision SSMs.

DetailsMotivation: Understanding how Vision SSMs process spatial information is challenging due to lack of transparent attention mechanisms, creating a gap in interpretability for these powerful sequence modeling alternatives to Transformers.

Method: Two complementary formulations: Jacobian-based method for any SSM architecture measuring influence through state propagation chain, and Gramian-based approach for diagonal SSMs with closed-form analytical solutions for superior speed. Both operate in single forward pass with linear complexity.

Result: SSMs naturally implement hierarchical feature refinement from diffuse low-level textures in early layers to focused, clinically meaningful patterns in deeper layers. Revealed domain-specific controllability signatures, progressive spatial selectivity, and influence of scanning strategies on attention patterns.

Conclusion: Controllability analysis establishes a unified interpretability paradigm for SSMs across domains, with applications spanning computer vision, NLP, and cross-domain tasks.

Abstract: State Space Models (SSMs), particularly the Mamba architecture, have recently emerged as powerful alternatives to Transformers for sequence modeling, offering linear computational complexity while achieving competitive performance. Yet, despite their effectiveness, understanding how these Vision SSMs process spatial information remains challenging due to the lack of transparent, attention-like mechanisms. To address this gap, we introduce a controllability-based interpretability framework that quantifies how different parts of the input sequence (tokens or patches) influence the internal state dynamics of SSMs. We propose two complementary formulations: a Jacobian-based method applicable to any SSM architecture that measures influence through the full chain of state propagation, and a Gramian-based approach for diagonal SSMs that achieves superior speed through closed-form analytical solutions. Both methods operate in a single forward pass with linear complexity, requiring no architectural modifications or hyperparameter tuning. We validate our framework through experiments on three diverse medical imaging modalities, demonstrating that SSMs naturally implement hierarchical feature refinement from diffuse low-level textures in early layers to focused, clinically meaningful patterns in deeper layers. Our analysis reveals domain-specific controllability signatures aligned with diagnostic criteria, progressive spatial selectivity across the network hierarchy, and the substantial influence of scanning strategies on attention patterns. Beyond medical imaging, we articulate applications spanning computer vision, natural language processing, and cross-domain tasks. Our framework establishes controllability analysis as a unified, foundational interpretability paradigm for SSMs across all domains. Code and analysis tools will be made available upon publication

[325] Counting Through Occlusion: Framework for Open World Amodal Counting

Safaeid Hossain Arib, Rabeya Akter, Abdul Monaf Chowdhury, Md Jubair Ahmed Sourov, Md Mehedi Hasan

Main category: cs.CV

TL;DR: CountOCC is an amodal counting framework that reconstructs occluded object features using multimodal guidance to address counting failures under occlusion, achieving state-of-the-art performance across multiple datasets.

DetailsMotivation: Current object counting methods fail under occlusion because backbone networks encode occluding surfaces instead of target objects, corrupting feature representations needed for accurate counting.

Method: CountOCC reconstructs occluded object features by integrating spatial context from visible fragments with semantic priors from text and visual embeddings, generating class-discriminative features across pyramid levels. It also uses a visual equivalence objective to enforce attention consistency between occluded and unoccluded views.

Result: CountOCC achieves 26.72% and 20.80% MAE reduction on FSC 147 validation and test sets under occlusion, 49.89% MAE reduction on CARPK, and 28.79% MAE reduction on CAPTUREReal, setting new SOTA across diverse visual domains.

Conclusion: CountOCC effectively addresses occlusion challenges in object counting through hierarchical multimodal feature reconstruction and attention consistency enforcement, demonstrating robust amodal counting performance across various real-world scenarios.

Abstract: Object counting has achieved remarkable success on visible instances, yet state-of-the-art (SOTA) methods fail under occlusion, a pervasive challenge in real world deployment. This failure stems from a fundamental architectural limitation where backbone networks encode occluding surfaces rather than target objects, thereby corrupting the feature representations required for accurate enumeration. To address this, we present CountOCC, an amodal counting framework that explicitly reconstructs occluded object features through hierarchical multimodal guidance. Rather than accepting degraded encodings, we synthesize complete representations by integrating spatial context from visible fragments with semantic priors from text and visual embeddings, generating class-discriminative features at occluded locations across multiple pyramid levels. We further introduce a visual equivalence objective that enforces consistency in attention space, ensuring that both occluded and unoccluded views of the same scene produce spatially aligned gradient-based attention maps. Together, these complementary mechanisms preserve discriminative properties essential for accurate counting under occlusion. For rigorous evaluation, we establish occlusion-augmented versions of FSC 147 and CARPK spanning both structured and unstructured scenes. CountOCC achieves SOTA performance on FSC 147 with 26.72% and 20.80% MAE reduction over prior baselines under occlusion in validation and test, respectively. CountOCC also demonstrates exceptional generalization by setting new SOTA results on CARPK with 49.89% MAE reduction and on CAPTUREReal with 28.79% MAE reduction, validating robust amodal counting across diverse visual domains. Code will be released soon.

[326] FSDAM: Few-Shot Driving Attention Modeling via Vision-Language Coupling

Kaiser Hamid, Can Cui, Khandakar Ashrafi Akbar, Ziran Wang, Nade Liang

Main category: cs.CV

TL;DR: FSDAM enables joint driver attention prediction and caption generation using only ~100 annotated examples, achieving competitive performance with minimal supervision and robust zero-shot generalization.

DetailsMotivation: Existing driver attention models require large-scale gaze datasets that are labor-intensive to collect and curate, limiting practical deployment in data-constrained scenarios.

Method: Dual-pathway architecture with separate modules for spatial attention prediction and caption generation, maintaining semantic consistency through cross-modal alignment.

Result: Achieves competitive attention prediction performance, generates coherent context-aware explanations, and demonstrates robust zero-shot generalization across multiple driving benchmarks.

Conclusion: Effective attention-conditioned generation is achievable with limited supervision, opening possibilities for practical deployment of explainable driver attention systems.

Abstract: Understanding where drivers look and why they shift their attention is essential for autonomous systems that read human intent and justify their actions. Most existing models rely on large-scale gaze datasets to learn these patterns; however, such datasets are labor-intensive to collect and time-consuming to curate. We present FSDAM (Few-Shot Driver Attention Modeling), a framework that achieves joint attention prediction and caption generation with approximately 100 annotated examples, two orders of magnitude fewer than existing approaches. Our approach introduces a dual-pathway architecture where separate modules handle spatial prediction and caption generation while maintaining semantic consistency through cross-modal alignment. Despite minimal supervision, FSDAM achieves competitive performance on attention prediction, generates coherent, and context-aware explanations. The model demonstrates robust zero-shot generalization across multiple driving benchmarks. This work shows that effective attention-conditioned generation is achievable with limited supervision, opening new possibilities for practical deployment of explainable driver attention systems in data-constrained scenarios.

[327] Backdoor Attacks on Open Vocabulary Object Detectors via Multi-Modal Prompt Tuning

Ankita Raj, Chetan Arora

Main category: cs.CV

TL;DR: First study of backdoor attacks on open-vocabulary object detectors using prompt tuning, proposing TrAP method that jointly optimizes image/text prompts with visual triggers for effective backdoor injection without retraining base model weights.

DetailsMotivation: As open-vocabulary object detectors gain traction in high-stakes applications like robotics and autonomous driving, understanding their security risks becomes crucial, particularly the new attack surface introduced by prompt tuning.

Method: Proposed TrAP (Trigger-Aware Prompt tuning) - a multi-modal backdoor injection strategy that jointly optimizes prompt parameters in both image and text modalities along with visual triggers, using curriculum-based training to progressively shrink trigger size.

Result: Achieves high attack success rates for both object misclassification and disappearance attacks across multiple datasets, while improving clean image performance on downstream datasets compared to zero-shot setting.

Conclusion: Demonstrates significant security vulnerability in open-vocabulary object detectors through prompt tuning backdoor attacks, highlighting the need for robust security measures in multi-modal vision-language systems.

Abstract: Open-vocabulary object detectors (OVODs) unify vision and language to detect arbitrary object categories based on text prompts, enabling strong zero-shot generalization to novel concepts. As these models gain traction in high-stakes applications such as robotics, autonomous driving, and surveillance, understanding their security risks becomes crucial. In this work, we conduct the first study of backdoor attacks on OVODs and reveal a new attack surface introduced by prompt tuning. We propose TrAP (Trigger-Aware Prompt tuning), a multi-modal backdoor injection strategy that jointly optimizes prompt parameters in both image and text modalities along with visual triggers. TrAP enables the attacker to implant malicious behavior using lightweight, learnable prompt tokens without retraining the base model weights, thus preserving generalization while embedding a hidden backdoor. We adopt a curriculum-based training strategy that progressively shrinks the trigger size, enabling effective backdoor activation using small trigger patches at inference. Experiments across multiple datasets show that TrAP achieves high attack success rates for both object misclassification and object disappearance attacks, while also improving clean image performance on downstream datasets compared to the zero-shot setting.

[328] Direct Visual Grounding by Directing Attention of Visual Tokens

Parsa Esmaeilkhani, Longin Jan Latecki

Main category: cs.CV

TL;DR: VLMs fail to properly attend to relevant visual tokens when answering questions, leading to wrong answers. A novel KL attention loss is proposed to directly supervise visual attention, improving performance on visual tasks.

DetailsMotivation: VLMs treat visual and language tokens equally in attention layers, causing relevant visual tokens to receive little attention from answer tokens, which results in incorrect answers to visual questions.

Method: Propose a KL attention loss that directly supervises visual token attention by aligning attention distributions to ground truth maps from task geometry or grounding annotations, combined with standard next-token prediction loss.

Result: Significant improvements across geometric tasks, pointing, and referring expression comprehension on both synthetic and real-world data. Also introduced a new line tracing dataset where even commercial VLMs perform poorly.

Conclusion: Direct supervision of visual token attention through KL attention loss effectively grounds answer language tokens in images and improves VLM performance on visual reasoning tasks.

Abstract: Vision Language Models (VLMs) mix visual tokens and text tokens. A puzzling issue is the fact that visual tokens most related to the query receive little to no attention in the final layers of the LLM module of VLMs from the answer tokens, where all tokens are treated equally, in particular, visual and language tokens in the LLM attention layers. This fact may result in wrong answers to visual questions, as our experimental results confirm. It appears that the standard next-token prediction (NTP) loss provides an insufficient signal for directing attention to visual tokens. We hypothesize that a more direct supervision of the attention of visual tokens to corresponding language tokens in the LLM module of VLMs will lead to improved performance on visual tasks. To demonstrate that this is indeed the case, we propose a novel loss function that directly supervises the attention of visual tokens. It directly grounds the answer language tokens in images by directing their attention to the relevant visual tokens. This is achieved by aligning the attention distribution of visual tokens to ground truth attention maps with KL divergence. The ground truth attention maps are obtained from task geometry in synthetic cases or from standard grounding annotations (e.g., bounding boxes or point annotations) in real images, and are used inside the LLM for attention supervision without requiring new labels. The obtained KL attention loss (KLAL) when combined with NTP encourages VLMs to attend to relevant visual tokens while generating answer tokens. This results in notable improvements across geometric tasks, pointing, and referring expression comprehension on both synthetic and real-world data, as demonstrated by our experiments. We also introduce a new dataset to evaluate the line tracing abilities of VLMs. Surprisingly, even commercial VLMs do not perform well on this task.

[329] Deep Imbalanced Multi-Target Regression: 3D Point Cloud Voxel Content Estimation in Simulated Forests

Amirhossein Hassanzadeh, Bartosz Krawczyk, Michael Saunders, Rob Wible, Keith Krause, Dimah Dera, Jan van Aardt

Main category: cs.CV

TL;DR: This study explores inferring low-level voxel content (target occupancy percentages) from high-level voxelized LiDAR data using multi-target regression with KPConv, addressing class imbalance through cost-sensitive learning and analyzing voxel size sensitivity.

DetailsMotivation: Voxelization reduces computational cost for LiDAR data but loses fine-scale structural information. The research aims to recover target occupancy percentages within voxels from voxelized point cloud data.

Method: Proposed multi-target regression using Kernel Point Convolutions (KPConv) with cost-sensitive learning (density-based relevance), weighted MSE, Focal Regression, and regularization. Performed sensitivity analysis on voxel sizes (0.25-2 meters).

Result: Larger voxel sizes (2m) result in lower errors due to reduced variability, while smaller voxel sizes (0.25-0.5m) show higher errors, especially in canopy areas. Bark and leaf targets had significantly higher errors at fine resolutions.

Conclusion: Voxel size choice is application-dependent. The work fills gaps in deep imbalance learning for multi-target regression and simulated datasets for 3D LiDAR forest point clouds.

Abstract: Voxelization is an effective approach to reduce the computational cost of processing Light Detection and Ranging (LiDAR) data, yet it results in a loss of fine-scale structural information. This study explores whether low-level voxel content information, specifically target occupancy percentage within a voxel, can be inferred from high-level voxelized LiDAR point cloud data collected from Digital Imaging and remote Sensing Image Generation (DIRSIG) software. In our study, the targets include bark, leaf, soil, and miscellaneous materials. We propose a multi-target regression approach in the context of imbalanced learning using Kernel Point Convolutions (KPConv). Our research leverages cost-sensitive learning to address class imbalance called density-based relevance (DBR). We employ weighted Mean Saquared Erorr (MSE), Focal Regression (FocalR), and regularization to improve the optimization of KPConv. This study performs a sensitivity analysis on the voxel size (0.25 - 2 meters) to evaluate the effect of various grid representations in capturing the nuances of the forest. This sensitivity analysis reveals that larger voxel sizes (e.g., 2 meters) result in lower errors due to reduced variability, while smaller voxel sizes (e.g., 0.25 or 0.5 meter) exhibit higher errors, particularly within the canopy, where variability is greatest. For bark and leaf targets, error values at smaller voxel size datasets (0.25 and 0.5 meter) were significantly higher than those in larger voxel size datasets (2 meters), highlighting the difficulty in accurately estimating within-canopy voxel content at fine resolutions. This suggests that the choice of voxel size is application-dependent. Our work fills the gap in deep imbalance learning models for multi-target regression and simulated datasets for 3D LiDAR point clouds of forests.

[330] Which Way from B to A: The role of embedding geometry in image interpolation for Stable Diffusion

Nicholas Karris, Luke Durell, Javier Flores, Tegan Emerson

Main category: cs.CV

TL;DR: Stable Diffusion embeddings can be viewed as point clouds in Wasserstein space rather than matrices, enabling optimal transport-based interpolation that produces smoother image transitions.

DetailsMotivation: The permutation-invariance property of Stable Diffusion with respect to CLIP embeddings suggests they should be treated as point clouds rather than matrices, opening new geometric perspectives.

Method: Reframe interpolation as an optimal transport problem to compute geodesics between embeddings, treating them as point clouds in Wasserstein space rather than using standard matrix interpolation.

Result: Optimal transport-based interpolation produces smoother and more coherent intermediate images compared to standard interpolation methods in Stable Diffusion.

Conclusion: Viewing embeddings as point clouds rather than matrices better captures the geometry of embedding space and enables more natural interpolations through optimal transport.

Abstract: It can be shown that Stable Diffusion has a permutation-invariance property with respect to the rows of Contrastive Language-Image Pretraining (CLIP) embedding matrices. This inspired the novel observation that these embeddings can naturally be interpreted as point clouds in a Wasserstein space rather than as matrices in a Euclidean space. This perspective opens up new possibilities for understanding the geometry of embedding space. For example, when interpolating between embeddings of two distinct prompts, we propose reframing the interpolation problem as an optimal transport problem. By solving this optimal transport problem, we compute a shortest path (or geodesic) between embeddings that captures a more natural and geometrically smooth transition through the embedding space. This results in smoother and more coherent intermediate (interpolated) images when rendered by the Stable Diffusion generative model. We conduct experiments to investigate this effect, comparing the quality of interpolated images produced using optimal transport to those generated by other standard interpolation methods. The novel optimal transport–based approach presented indeed gives smoother image interpolations, suggesting that viewing the embeddings as point clouds (rather than as matrices) better reflects and leverages the geometry of the embedding space.

[331] SAGE: Saliency-Guided Contrastive Embeddings

Colton R. Crum, Adam Czajka

Main category: cs.CV

TL;DR: SAGE (Saliency-Guided Contrastive Embeddings) is a novel loss function that integrates human saliency into neural network training using contrastive embeddings in latent space rather than image space, improving classification performance and generalization.

DetailsMotivation: Existing saliency-guided training methods rely on internal model mechanisms that may be unreliable, and placing guidance solely in image space creates challenges. The insight is to use latent space embeddings instead for more effective human guidance.

Method: Proposes SAGE loss function that applies salient-preserving and saliency-degrading signal augmentations to input, captures changes in embeddings and model logits, and uses contrastive triplet loss to guide model toward salient features and away from non-salient features. Includes sanity check on logit distributions.

Result: Demonstrates boost in classification performance across both open- and closed-set scenarios against state-of-the-art saliency-based methods, showing effectiveness across various backbones and wide generalization across tasks.

Conclusion: Moving saliency guidance from image space to latent space embeddings enables more effective integration of human perceptual priors into neural network training, improving generalization and performance across diverse scenarios.

Abstract: Integrating human perceptual priors into the training of neural networks has been shown to raise model generalization, serve as an effective regularizer, and align models with human expertise for applications in high-risk domains. Existing approaches to integrate saliency into model training often rely on internal model mechanisms, which recent research suggests may be unreliable. Our insight is that many challenges associated with saliency-guided training stem from the placement of the guidance approaches solely within the image space. Instead, we move away from the image space, use the model’s latent space embeddings to steer human guidance during training, and we propose SAGE (Saliency-Guided Contrastive Embeddings): a loss function that integrates human saliency into network training using contrastive embeddings. We apply salient-preserving and saliency-degrading signal augmentations to the input and capture the changes in embeddings and model logits. We guide the model towards salient features and away from non-salient features using a contrastive triplet loss. Additionally, we perform a sanity check on the logit distributions to ensure that the model outputs match the saliency-based augmentations. We demonstrate a boost in classification performance across both open- and closed-set scenarios against SOTA saliency-based methods, showing SAGE’s effectiveness across various backbones, and include experiments to suggest its wide generalization across tasks.

[332] RoCoISLR: A Romanian Corpus for Isolated Sign Language Recognition

Cătălin-Alexandru Rîpanu, Andrei-Theodor Hotnog, Giulia-Stefania Imbrea, Dumitru-Clementin Cercel

Main category: cs.CV

TL;DR: Introduces RoCoISLR, the first large-scale Romanian Isolated Sign Language Recognition dataset with 9,000+ videos across 6,000 glosses, and benchmarks 7 state-of-the-art models showing transformer-based architectures perform best.

DetailsMotivation: Address the lack of standardized datasets for Romanian Sign Language recognition, which limits research progress compared to American Sign Language.

Method: Created RoCoISLR dataset with 9,000+ video samples across 6,000 standardized glosses, then evaluated 7 video recognition models (I3D, SlowFast, Swin Transformer, TimeSformer, Uniformer, VideoMAE, PoseConv3D) under consistent experimental setups.

Result: Transformer-based architectures outperformed convolutional baselines, with Swin Transformer achieving 34.1% Top-1 accuracy. Results highlight challenges of long-tail class distributions in low-resource sign languages.

Conclusion: RoCoISLR provides the foundational dataset for systematic Romanian Isolated Sign Language Recognition research, enabling future work in this under-resourced domain.

Abstract: Automatic sign language recognition plays a crucial role in bridging the communication gap between deaf communities and hearing individuals; however, most available datasets focus on American Sign Language. For Romanian Isolated Sign Language Recognition (RoISLR), no large-scale, standardized dataset exists, which limits research progress. In this work, we introduce a new corpus for RoISLR, named RoCoISLR, comprising over 9,000 video samples that span nearly 6,000 standardized glosses from multiple sources. We establish benchmark results by evaluating seven state-of-the-art video recognition models-I3D, SlowFast, Swin Transformer, TimeSformer, Uniformer, VideoMAE, and PoseConv3D-under consistent experimental setups, and compare their performance with that of the widely used WLASL2000 corpus. According to the results, transformer-based architectures outperform convolutional baselines; Swin Transformer achieved a Top-1 accuracy of 34.1%. Our benchmarks highlight the challenges associated with long-tail class distributions in low-resource sign languages, and RoCoISLR provides the initial foundation for systematic RoISLR research.

[333] Lightweight Optimal-Transport Harmonization on Edge Devices

Maria Larchenko, Dmitry Guskov, Alexander Lobashev, Georgy Derevyanko

Main category: cs.CV

TL;DR: A lightweight color harmonization method for AR using optimal transport theory, enabling real-time on-device inference.

DetailsMotivation: Color harmonization is crucial for seamless AR composites but current methods lack real-time performance needed for AR pipelines.

Method: Train a compact encoder to predict Monge-Kantorovich transport map using classical optimal transport theory.

Result: Achieves best aggregated score on real composite AR images compared to state-of-the-art methods.

Conclusion: Proposed MKL-Harmonizer enables real-time color harmonization for AR applications with released dataset and toolkit.

Abstract: Color harmonization adjusts the colors of an inserted object so that it perceptually matches the surrounding image, resulting in a seamless composite. The harmonization problem naturally arises in augmented reality (AR), yet harmonization algorithms are not currently integrated into AR pipelines because real-time solutions are scarce. In this work, we address color harmonization for AR by proposing a lightweight approach that supports on-device inference. For this, we leverage classical optimal transport theory by training a compact encoder to predict the Monge-Kantorovich transport map. We benchmark our MKL-Harmonizer algorithm against state-of-the-art methods and demonstrate that for real composite AR images our method achieves the best aggregated score. We release our dedicated AR dataset of composite images with pixel-accurate masks and data-gathering toolkit to support further data acquisition by researchers.

[334] Enhancing Neuro-Oncology Through Self-Assessing Deep Learning Models for Brain Tumor Unified Model for MRI Segmentation

Andrew Zhou

Main category: cs.CV

TL;DR: Unified uncertainty-aware framework for brain tumor segmentation that provides both tumor localization with anatomical context and voxel-wise uncertainty estimates in a single pass.

DetailsMotivation: Current deep learning methods lack uncertainty estimates for errors and fail to segment healthy brain structures around tumors, limiting clinical use for surgical planning.

Method: Augments nnUNet with a channel for voxel-wise uncertainty and combines normal and cancer datasets in a unified model to segment both tumors and healthy brain structures.

Result: Achieved 0.750 correlation and 0.047 RMSD for uncertainty estimation without compromising tumor accuracy (DSC 0.86), plus 0.81 DSC for brain structures in unified model.

Conclusion: First model providing tumor segmentation in natural surroundings with overlaid uncertainty maps, offering key insights for surgical decision-making and error correction.

Abstract: Accurate segmentation of brain tumors is vital for diagnosis, surgical planning, and treatment monitoring. Deep learning has advanced on benchmarks, but two issues limit clinical use: no uncertainty estimates for errors and no segmentation of healthy brain structures around tumors for surgery. Current methods fail to unify tumor localization with anatomical context and lack confidence scores. This study presents an uncertainty-aware framework augmenting nnUNet with a channel for voxel-wise uncertainty. Trained on BraTS2023, it yields a correlation of 0.750 and RMSD of 0.047 for uncertainty without hurting tumor accuracy. It predicts uncertainty in one pass, with no extra networks or inferences, aiding clinical decisions. For whole-brain context, a unified model combines normal and cancer datasets, achieving a DSC of 0.81 for brain structures and 0.86 for tumor, with robust key-region performance. Combining both innovations gives the first model outputting tumor in natural surroundings plus an overlaid uncertainty map. Visual checks of outputs show uncertainty offers key insights to evaluate predictions and fix errors, helping informed surgical decisions from AI.

[335] SAGA: Source Attribution of Generative AI Videos

Rohit Kundu, Vishal Mohanty, Hao Xiong, Shan Jia, Athula Balachandran, Amit K. Roy-Chowdhury

Main category: cs.CV

TL;DR: SAGA is the first comprehensive framework for AI-generated video source attribution that identifies specific generative models across five granular levels, achieving state-of-the-art performance with minimal labeled data.

DetailsMotivation: The proliferation of hyper-realistic synthetic videos has escalated misuse risks and outstripped traditional binary real/fake detectors, creating an urgent need for precise source attribution at scale.

Method: Uses a novel video transformer architecture with features from a robust vision foundation model, plus a data-efficient pretrain-and-attribute strategy requiring only 0.5% source-labeled data per class, and introduces Temporal Attention Signatures for interpretability.

Result: SAGA achieves state-of-the-art attribution performance matching fully supervised methods while using minimal labeled data, and provides multi-granular attribution across authenticity, generation task, model version, development team, and precise generator.

Conclusion: SAGA sets a new benchmark for synthetic video provenance, offering crucial interpretable insights for forensic and regulatory applications through comprehensive source attribution.

Abstract: The proliferation of generative AI has led to hyper-realistic synthetic videos, escalating misuse risks and outstripping binary real/fake detectors. We introduce SAGA (Source Attribution of Generative AI videos), the first comprehensive framework to address the urgent need for AI-generated video source attribution at a large scale. Unlike traditional detection, SAGA identifies the specific generative model used. It uniquely provides multi-granular attribution across five levels: authenticity, generation task (e.g., T2V/I2V), model version, development team, and the precise generator, offering far richer forensic insights. Our novel video transformer architecture, leveraging features from a robust vision foundation model, effectively captures spatio-temporal artifacts. Critically, we introduce a data-efficient pretrain-and-attribute strategy, enabling SAGA to achieve state-of-the-art attribution using only 0.5% of source-labeled data per class, matching fully supervised performance. Furthermore, we propose Temporal Attention Signatures (T-Sigs), a novel interpretability method that visualizes learned temporal differences, offering the first explanation for why different video generators are distinguishable. Extensive experiments on public datasets, including cross-domain scenarios, demonstrate that SAGA sets a new benchmark for synthetic video provenance, providing crucial, interpretable insights for forensic and regulatory applications.

[336] Video Finetuning Improves Reasoning Between Frames

Ruiqi Yang, Tian Yun, Zihan Wang, Ellie Pavlick

Main category: cs.CV

TL;DR: Video-finetuned multimodal LLMs implicitly capture temporal reasoning, while image-only models benefit from explicit transitional event descriptions (vCoT) for video understanding.

DetailsMotivation: To investigate what video finetuning brings to multimodal LLMs and understand if explicit transitional reasoning (vCoT) can bridge the gap between image-only and video-finetuned models.

Method: Proposed Visual Chain-of-Thought (vCoT) - an explicit reasoning process that generates transitional event descriptions between consecutive frames, and systematically compared image-only LVLMs with video-finetuned counterparts.

Result: vCoT significantly improves image-only models on long-form video QA but yields only marginal gains for video-finetuned models, suggesting video models already capture frame transitions implicitly. Video models also transfer temporal reasoning to static settings.

Conclusion: Video finetuning enables implicit temporal reasoning that benefits both video understanding and static relational visual reasoning tasks.

Abstract: Multimodal large language models (LLMs) have made rapid progress in visual understanding, yet their extension from images to videos often reduces to a naive concatenation of frame tokens. In this work, we investigate what video finetuning brings to multimodal LLMs. We propose Visual Chain-of-Thought (vCoT), an explicit reasoning process that generates transitional event descriptions between consecutive frames. Using vCoT, we systematically compare image-only LVLMs with their video-finetuned counterparts, both with and without access to these transitional cues. Our experiments show that vCoT significantly improves the performance of image-only models on long-form video question answering, while yielding only marginal gains for video-finetuned models. This suggests that the latter already capture frame-to-frame transitions implicitly. Moreover, we find that video models transfer this temporal reasoning ability to purely static settings, outperforming image models’ baselines on relational visual reasoning tasks.

[337] View-aware Cross-modal Distillation for Multi-view Action Recognition

Trung Thanh Nguyen, Yasutomo Kawanishi, Vijay John, Takahiro Komamizu, Ichiro Ide

Main category: cs.CV

TL;DR: ViCoKD is a knowledge distillation framework that transfers knowledge from a multi-modal teacher to a modality-limited student for multi-view action recognition in partially overlapping sensor setups.

DetailsMotivation: Real-world multi-sensor systems often have partial view overlap where actions are only visible in some views, and limited input modalities with sequence-level annotations rather than dense frame-level labels.

Method: Uses cross-modal adapter with cross-modal attention to exploit multi-modal correlations, and View-aware Consistency module with human-detection masks and confidence-weighted Jensen-Shannon divergence to handle view misalignment.

Result: Outperforms competitive distillation methods on MultiSensor-Home dataset across multiple backbones and environments, achieving significant gains and even surpassing the teacher model under limited conditions.

Conclusion: ViCoKD effectively addresses challenges in partially overlapping multi-view action recognition by distilling knowledge from multi-modal teachers to modality-limited students while handling view misalignment.

Abstract: The widespread use of multi-sensor systems has increased research in multi-view action recognition. While existing approaches in multi-view setups with fully overlapping sensors benefit from consistent view coverage, partially overlapping settings where actions are visible in only a subset of views remain underexplored. This challenge becomes more severe in real-world scenarios, as many systems provide only limited input modalities and rely on sequence-level annotations instead of dense frame-level labels. In this study, we propose View-aware Cross-modal Knowledge Distillation (ViCoKD), a framework that distills knowledge from a fully supervised multi-modal teacher to a modality- and annotation-limited student. ViCoKD employs a cross-modal adapter with cross-modal attention, allowing the student to exploit multi-modal correlations while operating with incomplete modalities. Moreover, we propose a View-aware Consistency module to address view misalignment, where the same action may appear differently or only partially across viewpoints. It enforces prediction alignment when the action is co-visible across views, guided by human-detection masks and confidence-weighted Jensen-Shannon divergence between their predicted class distributions. Experiments on the real-world MultiSensor-Home dataset show that ViCoKD consistently outperforms competitive distillation methods across multiple backbones and environments, delivering significant gains and surpassing the teacher model under limited conditions.

[338] Uni-Hand: Universal Hand Motion Forecasting in Egocentric Views

Junyi Ma, Wentao Bao, Jingyi Xu, Guanzhong Sun, Yu Zheng, Erhang Zhang, Xieyuanli Chen, Hesheng Wang

Main category: cs.CV

TL;DR: EgoLoc is a zero-shot method for temporal interaction localization in egocentric videos, identifying precise hand-object contact and separation moments without requiring object masks or verb-noun annotations.

DetailsMotivation: Existing methods focus on 'how to interact' but struggle with 'when to interact' - precisely localizing contact/separation moments, which is crucial for VR/AR applications and robotic policy transfer.

Method: Proposes EgoLoc with hand-dynamics-guided sampling to generate visual prompts, uses vision-language models to identify contact attributes and localize timestamps, and employs closed-loop feedback for refinement.

Result: Achieves plausible temporal interaction localization on public datasets and novel benchmarks, eliminating need for object masks and verb-noun taxonomies while enabling zero-shot implementation.

Conclusion: EgoLoc effectively facilitates downstream applications in egocentric vision and robotic manipulation, demonstrating generalizable zero-shot performance for temporal interaction localization.

Abstract: Analyzing hand-object interaction in egocentric vision facilitates VR/AR applications and human-robot policy transfer. Existing research has mostly focused on modeling the behavior paradigm of interactive actions (i.e., “how to interact”). However, the more challenging and fine-grained problem of capturing the critical moments of contact and separation between the hand and the target object (i.e., “when to interact”) is still underexplored, which is crucial for immersive interactive experiences in mixed reality and robotic motion planning. Therefore, we formulate this problem as temporal interaction localization (TIL). Some recent works extract semantic masks as TIL references, but suffer from inaccurate object grounding and cluttered scenarios. Although current temporal action localization (TAL) methods perform well in detecting verb-noun action segments, they rely on category annotations during training and exhibit limited precision in localizing hand-object contact/separation moments. To address these issues, we propose a novel zero-shot approach dubbed EgoLoc to localize hand-object contact and separation timestamps in egocentric videos. EgoLoc introduces hand-dynamics-guided sampling to generate high-quality visual prompts. It exploits the vision-language model to identify contact/separation attributes, localize specific timestamps, and provide closed-loop feedback for further refinement. EgoLoc eliminates the need for object masks and verb-noun taxonomies, leading to generalizable zero-shot implementation. Comprehensive experiments on the public dataset and our novel benchmarks demonstrate that EgoLoc achieves plausible TIL for egocentric videos. It is also validated to effectively facilitate multiple downstream applications in egocentric vision and robotic manipulation tasks. Code and relevant data will be released at https://github.com/IRMVLab/EgoLoc.

[339] Simple Lines, Big Ideas: Towards Interpretable Assessment of Human Creativity from Drawings

Zihao Lin, Zhenshan Shi, Sasa Zhao, Hanwei Zhu, Lingyu Zhu, Baoliang Chen, Lei Mo

Main category: cs.CV

TL;DR: Proposes an automatic, interpretable framework for assessing creativity in drawings using multi-modal learning that analyzes both content and style dimensions.

DetailsMotivation: Current creativity assessment relies on subjective expert scoring which is labor-intensive and inconsistent. Need for data-driven, objective evaluation methods.

Method: Multi-modal, multi-task learning framework that predicts creativity scores, categorizes content types, and extracts stylistic features using conditional learning mechanism.

Result: Achieves state-of-the-art performance compared to existing regression approaches and provides interpretable visualizations aligned with human judgments.

Conclusion: The framework enables automatic, interpretable creativity assessment from drawings by jointly modeling content and style dimensions, offering a scalable alternative to subjective expert evaluation.

Abstract: Assessing human creativity through visual outputs, such as drawings, plays a critical role in fields including psychology, education, and cognitive science. However, current assessment practices still rely heavily on expert-based subjective scoring, which is both labor-intensive and inherently subjective. In this paper, we propose a data-driven framework for automatic and interpretable creativity assessment from drawings. Motivated by the cognitive understanding that creativity can emerge from both what is drawn (content) and how it is drawn (style), we reinterpret the creativity score as a function of these two complementary dimensions.Specifically, we first augment an existing creativity labeled dataset with additional annotations targeting content categories. Based on the enriched dataset, we further propose a multi-modal, multi-task learning framework that simultaneously predicts creativity scores, categorizes content types, and extracts stylistic features. In particular, we introduce a conditional learning mechanism that enables the model to adapt its visual feature extraction by dynamically tuning it to creativity-relevant signals conditioned on the drawing’s stylistic and semantic cues.Experimental results demonstrate that our model achieves state-of-the-art performance compared to existing regression-based approaches and offers interpretable visualizations that align well with human judgments. The code and annotations will be made publicly available at https://github.com/WonderOfU9/CSCA_PRCV_2025

[340] ActVAR: Activating Mixtures of Weights and Tokens for Efficient Visual Autoregressive Generation

Kaixin Zhang, Ruiqing Yang, Yuan Zhang, Shan You, Tao Huang

Main category: cs.CV

TL;DR: ActVAR introduces dynamic activation with dual sparsity in weights and tokens for VAR models, achieving 21.2% FLOPs reduction with minimal performance loss via expert subnetworks and token selection.

DetailsMotivation: VAR models face escalating computational costs with sequence length growth, and static pruning methods degrade performance by permanently removing weights/tokens and disrupting pretrained dependencies.

Method: Decomposes FFNs into lightweight expert sub-networks with learnable router for token-specific expert selection, plus gated token selector for high-update-potential tokens while reconstructing unselected tokens. Uses two-stage knowledge distillation supervised by original VAR model.

Result: Achieves up to 21.2% FLOPs reduction on ImageNet 256×256 benchmark with minimal performance degradation.

Conclusion: ActVAR enables efficient VAR model inference through dynamic activation framework that preserves model capacity while significantly reducing computational costs.

Abstract: Visual Autoregressive (VAR) models enable efficient image generation via next-scale prediction but face escalating computational costs as sequence length grows. Existing static pruning methods degrade performance by permanently removing weights or tokens, disrupting pretrained dependencies. To address this, we propose ActVAR, a dynamic activation framework that introduces dual sparsity across model weights and token sequences to enhance efficiency without sacrificing capacity. ActVAR decomposes feedforward networks (FFNs) into lightweight expert sub-networks and employs a learnable router to dynamically select token-specific expert subsets based on content. Simultaneously, a gated token selector identifies high-update-potential tokens for computation while reconstructing unselected tokens to preserve global context and sequence alignment. Training employs a two-stage knowledge distillation strategy, where the original VAR model supervises the learning of routing and gating policies to align with pretrained knowledge. Experiments on the ImageNet $256\times 256$ benchmark demonstrate that ActVAR achieves up to $21.2%$ FLOPs reduction with minimal performance degradation.

[341] Reconstructing 3D Scenes in Native High Dynamic Range

Kaixuan Zhang, Minxian Li, Mingwu Ren, Jiankang Deng, Xiatian Zhu

Main category: cs.CV

TL;DR: NH-3DGS is the first method for 3D scene reconstruction that directly models native HDR observations using a novel luminance-chromaticity decomposition, enabling professional-grade reconstruction from single-exposure HDR captures.

DetailsMotivation: Professional digital media workflows require HDR imaging, but existing 3D reconstruction methods primarily focus on LDR data. Current HDR reconstruction approaches rely on multi-exposure fusion or inverse tone-mapping, which increase capture complexity and depend on synthetic supervision.

Method: Proposes Native High dynamic range 3D Gaussian Splatting (NH-3DGS) with a novel luminance-chromaticity decomposition of color representation that preserves full dynamic range throughout the reconstruction pipeline, enabling direct optimization from native HDR camera data.

Result: NH-3DGS significantly outperforms existing methods in reconstruction quality and dynamic range preservation on both synthetic and real multi-view HDR datasets.

Conclusion: The method enables professional-grade 3D reconstruction directly from native HDR captures, advancing the applicability of 3D scene reconstruction to professional digital media workflows.

Abstract: High Dynamic Range (HDR) imaging is essential for professional digital media creation, e.g., filmmaking, virtual production, and photorealistic rendering. However, 3D scene reconstruction has primarily focused on Low Dynamic Range (LDR) data, limiting its applicability to professional workflows. Existing approaches that reconstruct HDR scenes from LDR observations rely on multi-exposure fusion or inverse tone-mapping, which increase capture complexity and depend on synthetic supervision. With the recent emergence of cameras that directly capture native HDR data in a single exposure, we present the first method for 3D scene reconstruction that directly models native HDR observations. We propose {\bf Native High dynamic range 3D Gaussian Splatting (NH-3DGS)}, which preserves the full dynamic range throughout the reconstruction pipeline. Our key technical contribution is a novel luminance-chromaticity decomposition of the color representation that enables direct optimization from native HDR camera data. We demonstrate on both synthetic and real multi-view HDR datasets that NH-3DGS significantly outperforms existing methods in reconstruction quality and dynamic range preservation, enabling professional-grade 3D reconstruction directly from native HDR captures. Code and datasets will be made available.

[342] FDP: A Frequency-Decomposition Preprocessing Pipeline for Unsupervised Anomaly Detection in Brain MRI

Hao Li, Zhenfeng Zhuang, Jingyu Lin, Yu Liu, Yifei Chen, Qiong Peng, Lequan Yu, Liansheng Wang

Main category: cs.CV

TL;DR: FDP is a frequency-decomposition preprocessing framework that enhances unsupervised anomaly detection in brain MRI by leveraging unique frequency patterns of anomalies while preserving normal anatomy.

DetailsMotivation: Current UAD methods use artificial noise perturbations that lack biophysical fidelity and morphological complexity of real clinical lesions. Frequency-domain analysis revealed anomalies have unique frequency patterns distinguishable from normal anatomy.

Method: Frequency-Decomposition Preprocessing (FDP) framework that uses frequency-domain reconstruction for simultaneous pathology suppression and anatomical preservation. It integrates with existing anomaly simulation techniques.

Result: FDP consistently improves anomaly detection performance across diverse architectures, achieving 17.63% increase in DICE score with LDM while maintaining robust improvements across multiple baselines.

Conclusion: FDP effectively enhances unsupervised anomaly detection in brain MRI by leveraging frequency-domain properties, providing a systematic approach that maintains diagnostic fidelity and integrates with existing methods.

Abstract: Due to the diversity of brain anatomy and the scarcity of annotated data, supervised anomaly detection for brain MRI remains challenging, driving the development of unsupervised anomaly detection (UAD) approaches. Current UAD methods typically utilize artificially generated noise perturbations on healthy MRIs to train generative models for normal anatomy reconstruction, enabling anomaly detection via residual mapping. However, such simulated anomalies lack the biophysical fidelity and morphological complexity characteristic of true clinical lesions. To advance UAD in brain MRI, we conduct the first systematic frequency-domain analysis of pathological signatures, revealing two key properties: (1) anomalies exhibit unique frequency patterns distinguishable from normal anatomy, and (2) low-frequency signals maintain consistent representations across healthy scans. These insights motivate our Frequency-Decomposition Preprocessing (FDP) framework, the first UAD method to leverage frequency-domain reconstruction for simultaneous pathology suppression and anatomical preservation. FDP can integrate seamlessly with existing anomaly simulation techniques, consistently enhancing detection performance across diverse architectures while maintaining diagnostic fidelity. Experimental results demonstrate that FDP consistently improves anomaly detection performance when integrated with existing methods. Notably, FDP achieves a 17.63% increase in DICE score with LDM while maintaining robust improvements across multiple baselines. The code is available at https://github.com/ls1rius/MRI_FDP.

[343] DeepSport: A Multimodal Large Language Model for Comprehensive Sports Video Reasoning via Agentic Reinforcement Learning

Junbo Zou, Haotian Xia, Zhen Ye, Shengjie Zhang, Christopher Lai, Vicente Ordonez, Weining Shen, Hanjie Chen

Main category: cs.CV

TL;DR: DeepSport is the first end-to-end trained MLLM framework for multi-task, multi-sport video understanding that enables active, iterative reasoning through dynamic frame interrogation and achieves SOTA performance.

DetailsMotivation: Address the gap in sports video understanding where current approaches are single-sport centric, limited to specific tasks, or rely on training-free paradigms lacking robust reasoning processes.

Method: Proposes an end-to-end MLLM framework with active iterative reasoning using specialized frame-extraction tool, data distillation pipeline synthesizing CoT trajectories from 10 data sources, and two-stage training (SFT + RL with gated tool-use reward).

Result: Achieves state-of-the-art performance on 6.7k question benchmark, significantly outperforming both proprietary and open-source baseline models.

Conclusion: Establishes a new foundation for domain-specific video reasoning to address the complexities of diverse sports through learned reasoning processes.

Abstract: Sports video understanding presents unique challenges, requiring models to perceive high-speed dynamics, comprehend complex rules, and reason over long temporal contexts. While Multimodal Large Language Models (MLLMs) have shown promise in genral domains, the current state of research in sports remains narrowly focused: existing approaches are either single-sport centric, limited to specific tasks, or rely on training-free paradigms that lack robust, learned reasoning process. To address this gap, we introduce DeepSport, the first end-to-end trained MLLM framework designed for multi-task, multi-sport video understanding. DeepSport shifts the paradigm from passive frame processing to active, iterative reasoning, empowering the model to ``think with videos’’ by dynamically interrogating content via a specialized frame-extraction tool. To enable this, we propose a data distillation pipeline that synthesizes high-quality Chain-of-Thought (CoT) trajectories from 10 diverse data source, creating a unified resource of 78k training data. We then employ a two-stage training strategy, Supervised Fine-Tuning (SFT) followed by Reinforcement Learning (RL) with a novel gated tool-use reward, to optimize the model’s reasoning process. Extensive experiments on the testing benchmark of 6.7k questions demonstrate that DeepSport achieves state-of-the-art performance, significantly outperforming baselines of both proprietary model and open-source models. Our work establishes a new foundation for domain-specific video reasoning to address the complexities of diverse sports.

[344] CASL: Curvature-Augmented Self-supervised Learning for 3D Anomaly Detection

Yaohua Zha, Xue Yuerong, Chunlin Fan, Yuansong Wang, Tao Dai, Ke Chen, Shu-Tao Xia

Main category: cs.CV

TL;DR: The paper proposes CASL, a curvature-augmented self-supervised learning framework for 3D anomaly detection that achieves state-of-the-art performance without task-specific designs and generalizes well to other 3D understanding tasks.

DetailsMotivation: Existing 3D anomaly detection methods are task-specific and lack generalizability, while classical self-supervised models perform suboptimally for anomaly detection under unified fine-tuning. The authors aim to develop a more generalizable 3D model that effectively detects anomalies without relying on task-specific designs.

Method: Proposes CASL framework based on reconstruction paradigm using U-Net architecture with multi-scale curvature prompts to guide decoder in predicting spatial coordinates. Uses only curvature as anomaly score and achieves detection through straightforward anomaly classification fine-tuning.

Result: Outperforms classical self-supervised and dedicated anomaly detection models. Achieves leading detection performance and the learned representations generalize well to standard 3D understanding tasks like point cloud classification.

Conclusion: Curvature plays a critical role in 3D anomaly detection. CASL demonstrates that effective anomaly detection can be achieved without dedicated mechanisms, and the framework provides generalizable representations for broader 3D understanding tasks.

Abstract: Deep learning-based 3D anomaly detection methods have demonstrated significant potential in industrial manufacturing. However, many approaches are specifically designed for anomaly detection tasks, which limits their generalizability to other 3D understanding tasks. In contrast, self-supervised point cloud models aim for general-purpose representation learning, yet our investigation reveals that these classical models are suboptimal at anomaly detection under the unified fine-tuning paradigm. This motivates us to develop a more generalizable 3D model that can effectively detect anomalies without relying on task-specific designs. Interestingly, we find that using only the curvature of each point as its anomaly score already outperforms several classical self-supervised and dedicated anomaly detection models, highlighting the critical role of curvature in 3D anomaly detection. In this paper, we propose a Curvature-Augmented Self-supervised Learning (CASL) framework based on a reconstruction paradigm. Built upon the classical U-Net architecture, our approach introduces multi-scale curvature prompts to guide the decoder in predicting the spatial coordinates of each point. Without relying on any dedicated anomaly detection mechanisms, it achieves leading detection performance through straightforward anomaly classification fine-tuning. Moreover, the learned representations generalize well to standard 3D understanding tasks such as point cloud classification. The code is available at https://github.com/zyh16143998882/CASL.

[345] Explore How to Inject Beneficial Noise in MLLMs

Ruishu Zhu, Sida Huang, Ziheng Jiao, Hongyuan Zhang

Main category: cs.CV

TL;DR: Proposes MuNG, a novel fine-tuning method that injects beneficial random noise into frozen MLLMs to improve cross-modal alignment, outperforming full fine-tuning with only 1-2% additional parameters.

DetailsMotivation: Existing fine-tuning methods for Multimodal Large Language Models (MLLMs) often ignore cross-modal heterogeneity, limiting their full potential for multimodal intelligence.

Method: Develops Multimodal Noise Generator (MuNG) that analyzes cross-modal relationships in image-text pairs to generate task-adaptive beneficial noise, which is injected into frozen MLLMs to suppress irrelevant semantic components.

Result: Experiments on QwenVL and LLaVA show the method surpasses full-parameter fine-tuning and other existing approaches while requiring only 1-2% additional parameters.

Conclusion: The proposed noise injection strategy effectively improves cross-modal representation alignment and enhances downstream task performance with minimal parameter overhead.

Abstract: Multimodal Large Language Models (MLLMs) have played an increasingly important role in multimodal intelligence. However, the existing fine-tuning methods often ignore cross-modal heterogeneity, limiting their full potential. In this work, we propose a novel fine-tuning strategy by injecting beneficial random noise, which outperforms previous methods and even surpasses full fine-tuning, with minimal additional parameters. The proposed Multimodal Noise Generator (MuNG) enables efficient modality fine-tuning by injecting customized noise into the frozen MLLMs. Specifically, we reformulate the reasoning process of MLLMs from a variational inference perspective, upon which we design a multimodal noise generator that dynamically analyzes cross-modal relationships in image-text pairs to generate task-adaptive beneficial noise. Injecting this type of noise into the MLLMs effectively suppresses irrelevant semantic components, leading to significantly improved cross-modal representation alignment and enhanced performance on downstream tasks. Experiments on two mainstream MLLMs, QwenVL and LLaVA, demonstrate that our method surpasses full-parameter fine-tuning and other existing fine-tuning approaches, while requiring adjustments to only about $1\sim2%$ additional parameters. The relevant code is uploaded in the supplementary.

[346] CoordAR: One-Reference 6D Pose Estimation of Novel Objects via Autoregressive Coordinate Map Generation

Dexin Zuo, Ang Li, Wei Wang, Wenxian Yu, Danping Zou

Main category: cs.CV

TL;DR: CoordAR is an autoregressive framework for 6D pose estimation of unseen objects using only one reference view, addressing limitations of existing methods through discrete tokenization and probabilistic modeling.

DetailsMotivation: To overcome challenges in 6D pose estimation for novel objects without 3D models, particularly addressing limited global consistency in existing methods and difficulties with symmetric/occluded scenarios due to lack of uncertainty modeling.

Method: Formulates 3D-3D correspondences as discrete tokens using coordinate map tokenization, modality-decoupled encoding for RGB and coordinate cues, and autoregressive transformer decoder conditioned on query features and partial token sequences.

Result: Significantly outperforms existing methods on multiple benchmarks and demonstrates strong robustness to symmetry, occlusion, and other real-world challenges.

Conclusion: CoordAR provides an effective autoregressive approach for one-reference 6D pose estimation that overcomes key limitations of previous methods through probabilistic modeling and discrete tokenization.

Abstract: Object 6D pose estimation, a crucial task for robotics and augmented reality applications, becomes particularly challenging when dealing with novel objects whose 3D models are not readily available. To reduce dependency on 3D models, recent studies have explored one-reference-based pose estimation, which requires only a single reference view instead of a complete 3D model. However, existing methods that rely on real-valued coordinate regression suffer from limited global consistency due to the local nature of convolutional architectures and face challenges in symmetric or occluded scenarios owing to a lack of uncertainty modeling. We present CoordAR, a novel autoregressive framework for one-reference 6D pose estimation of unseen objects. CoordAR formulates 3D-3D correspondences between the reference and query views as a map of discrete tokens, which is obtained in an autoregressive and probabilistic manner. To enable accurate correspondence regression, CoordAR introduces 1) a novel coordinate map tokenization that enables probabilistic prediction over discretized 3D space; 2) a modality-decoupled encoding strategy that separately encodes RGB appearance and coordinate cues; and 3) an autoregressive transformer decoder conditioned on both position-aligned query features and the partially generated token sequence. With these novel mechanisms, CoordAR significantly outperforms existing methods on multiple benchmarks and demonstrates strong robustness to symmetry, occlusion, and other challenges in real-world tests.

[347] Generative Photographic Control for Scene-Consistent Video Cinematic Editing

Huiqiang Sun, Liao Shen, Zhan Peng, Kun Wang, Size Wu, Yuhang Zang, Tianqi Liu, Zihao Huang, Xingyu Zeng, Zhiguo Cao, Wei Li, Chen Change Loy

Main category: cs.CV

TL;DR: CineCtrl is the first video cinematic editing framework that enables fine control over professional camera parameters like bokeh and shutter speed, overcoming limitations of existing methods that only handle camera motion.

DetailsMotivation: Existing generative video models lack control over photographic elements like depth of field and exposure, which are crucial for cinematic storytelling and aesthetic appeal.

Method: Introduces a decoupled cross-attention mechanism to disentangle camera motion from photographic inputs, and develops a comprehensive data generation strategy using simulated effects and real-world collection to build a large-scale training dataset.

Result: The model generates high-fidelity videos with precisely controlled, user-specified photographic camera effects, as demonstrated through extensive experiments.

Conclusion: CineCtrl successfully enables fine-grained, independent control over professional camera parameters in video generation without compromising scene consistency.

Abstract: Cinematic storytelling is profoundly shaped by the artful manipulation of photographic elements such as depth of field and exposure. These effects are crucial in conveying mood and creating aesthetic appeal. However, controlling these effects in generative video models remains highly challenging, as most existing methods are restricted to camera motion control. In this paper, we propose CineCtrl, the first video cinematic editing framework that provides fine control over professional camera parameters (e.g., bokeh, shutter speed). We introduce a decoupled cross-attention mechanism to disentangle camera motion from photographic inputs, allowing fine-grained, independent control without compromising scene consistency. To overcome the shortage of training data, we develop a comprehensive data generation strategy that leverages simulated photographic effects with a dedicated real-world collection pipeline, enabling the construction of a large-scale dataset for robust model training. Extensive experiments demonstrate that our model generates high-fidelity videos with precisely controlled, user-specified photographic camera effects.

[348] Text2Traffic: A Text-to-Image Generation and Editing Method for Traffic Scenes

Feng Lv, Haoxuan Feng, Zilu Zhang, Chunlong Xia, Yanfeng Li

Main category: cs.CV

TL;DR: A unified text-driven framework for traffic scene image generation and editing that addresses semantic richness, viewpoint diversity, visual fidelity, and text-image alignment through controllable masks, multi-view data, and specialized training strategies.

DetailsMotivation: To overcome challenges in traffic scene generation including insufficient semantic richness, limited camera viewpoints, low visual fidelity, and poor text-image alignment for intelligent transportation systems.

Method: Uses a unified framework with controllable mask mechanism, incorporates vehicle-side and roadside multi-view data, employs two-stage training (conceptual learning then fine-tuning), and introduces mask-region-weighted loss for small traffic elements.

Result: Achieves leading performance in text-based image generation and editing within traffic scenes, with enhanced geometric diversity and generation fidelity.

Conclusion: The proposed framework effectively addresses key challenges in traffic scene generation and editing, providing a robust solution for intelligent transportation applications.

Abstract: With the rapid advancement of intelligent transportation systems, text-driven image generation and editing techniques have demonstrated significant potential in providing rich, controllable visual scene data for applications such as traffic monitoring and autonomous driving. However, several challenges remain, including insufficient semantic richness of generated traffic elements, limited camera viewpoints, low visual fidelity of synthesized images, and poor alignment between textual descriptions and generated content. To address these issues, we propose a unified text-driven framework for both image generation and editing, leveraging a controllable mask mechanism to seamlessly integrate the two tasks. Furthermore, we incorporate both vehicle-side and roadside multi-view data to enhance the geometric diversity of traffic scenes. Our training strategy follows a two-stage paradigm: first, we perform conceptual learning using large-scale coarse-grained text-image data; then, we fine-tune with fine-grained descriptive data to enhance text-image alignment and detail quality. Additionally, we introduce a mask-region-weighted loss that dynamically emphasizes small yet critical regions during training, thereby substantially enhancing the generation fidelity of small-scale traffic elements. Extensive experiments demonstrate that our method achieves leading performance in text-based image generation and editing within traffic scenes.

[349] PFAvatar: Pose-Fusion 3D Personalized Avatar Reconstruction from Real-World Outfit-of-the-Day Photos

Dianbing Xi, Guoyuan An, Jingsen Zhu, Zhijian Liu, Yuan Liu, Ruiyuan Zhang, Jiayuan Lu, Rui Wang, Yuchi Huo

Main category: cs.CV

TL;DR: PFAvatar is a two-stage method for reconstructing high-quality 3D avatars from OOTD photos using pose-aware diffusion fine-tuning and NeRF-based distillation, achieving 48× speed-up and superior detail preservation.

DetailsMotivation: To overcome limitations of previous methods that segment images into assets for 3D assembly (prone to inconsistency) and handle challenges in OOTD photos like diverse poses, occlusions, and complex backgrounds.

Method: Two-stage approach: (1) Fine-tune pose-aware diffusion model with ControlNet for pose estimation and Condition Prior Preservation Loss (CPPL) for few-shot training; (2) Distill 3D avatar using NeRF representation with canonical SMPL-X space sampling and Multi-Resolution 3D-SDS.

Result: Achieves 48× speed-up (5 minutes personalization), outperforms state-of-the-art in reconstruction fidelity, detail preservation, and robustness to occlusions/truncations. Preserves high-frequency textures and handles occlusions correctly.

Conclusion: PFAvatar advances practical 3D avatar generation from real-world OOTD albums and supports downstream applications like virtual try-on, animation, and human video reenactment.

Abstract: We propose PFAvatar (Pose-Fusion Avatar), a new method that reconstructs high-quality 3D avatars from ``Outfit of the Day’’ (OOTD) photos, which exhibit diverse poses, occlusions, and complex backgrounds. Our method consists of two stages: (1) fine-tuning a pose-aware diffusion model from few-shot OOTD examples and (2) distilling a 3D avatar represented by a neural radiance field (NeRF). In the first stage, unlike previous methods that segment images into assets (e.g., garments, accessories) for 3D assembly, which is prone to inconsistency, we avoid decomposition and directly model the full-body appearance. By integrating a pre-trained ControlNet for pose estimation and a novel Condition Prior Preservation Loss (CPPL), our method enables end-to-end learning of fine details while mitigating language drift in few-shot training. Our method completes personalization in just 5 minutes, achieving a 48$\times$ speed-up compared to previous approaches. In the second stage, we introduce a NeRF-based avatar representation optimized by canonical SMPL-X space sampling and Multi-Resolution 3D-SDS. Compared to mesh-based representations that suffer from resolution-dependent discretization and erroneous occluded geometry, our continuous radiance field can preserve high-frequency textures (e.g., hair) and handle occlusions correctly through transmittance. Experiments demonstrate that PFAvatar outperforms state-of-the-art methods in terms of reconstruction fidelity, detail preservation, and robustness to occlusions/truncations, advancing practical 3D avatar generation from real-world OOTD albums. In addition, the reconstructed 3D avatar supports downstream applications such as virtual try-on, animation, and human video reenactment, further demonstrating the versatility and practical value of our approach.

[350] ProtoAnomalyNCD: Prototype Learning for Multi-class Novel Anomaly Discovery in Industrial Scenarios

Botong Zhao, Qijun Shi, Shujing Lyu, Yue Lu

Main category: cs.CV

TL;DR: ProtoAnomalyNCD is a prototype-learning framework that discovers and classifies multiple unseen anomaly types in industrial settings using object localization and anomaly-map-guided attention.

DetailsMotivation: Real-world industrial applications require discovering and classifying multiple anomaly types, not just detecting presence. Current methods struggle with semantically subtle anomalies and insufficient image prior exploitation.

Method: Uses Grounded SAM with text prompts to localize objects, introduces Anomaly-Map-Guided Attention with Region Guidance Factor to distinguish regions, and employs prototype learning for clustering unseen anomaly classes.

Result: Outperforms state-of-the-art approaches on MVTec AD, MTD, and Real-IAD datasets.

Conclusion: The framework successfully discovers unseen anomaly classes while enabling multi-type classification and achieves task-level unification for outlier detection.

Abstract: Existing industrial anomaly detection methods mainly determine whether an anomaly is present. However, real-world applications also require discovering and classifying multiple anomaly types. Since industrial anomalies are semantically subtle and current methods do not sufficiently exploit image priors, direct clustering approaches often perform poorly. To address these challenges, we propose ProtoAnomalyNCD, a prototype-learning-based framework for discovering unseen anomaly classes of multiple types that can be integrated with various anomaly detection methods. First, to suppress background clutter, we leverage Grounded SAM with text prompts to localize object regions as priors for the anomaly classification network. Next, because anomalies usually appear as subtle and fine-grained patterns on the product, we introduce an Anomaly-Map-Guided Attention block. Within this block, we design a Region Guidance Factor that helps the attention module distinguish among background, object regions, and anomalous regions. By using both localized product regions and anomaly maps as priors, the module enhances anomalous features while suppressing background noise and preserving normal features for contrastive learning. Finally, under a unified prototype-learning framework, ProtoAnomalyNCD discovers and clusters unseen anomaly classes while simultaneously enabling multi-type anomaly classification. We further extend our method to detect unseen outliers, achieving task-level unification. Our method outperforms state-of-the-art approaches on the MVTec AD, MTD, and Real-IAD datasets.

[351] Semi-Supervised High Dynamic Range Image Reconstructing via Bi-Level Uncertain Area Masking

Wei Jiang, Jiahao Cui, Yizheng Wu, Zhan Peng, Zhiyu Pan, Zhiguo Cao

Main category: cs.CV

TL;DR: Semi-supervised HDR image reconstruction using teacher-student framework with uncertainty-based masking to reduce artifacts from pseudo ground truths, achieving comparable performance to fully-supervised methods with only 6.7% HDR ground truths.

DetailsMotivation: HDR image reconstruction from LDR bursts requires LDR-HDR image pairs which are hard to obtain, motivating the need for annotation-efficient methods that can achieve good performance with limited HDR ground truths.

Method: Uses semi-supervised learning with teacher-student framework where teacher generates pseudo HDR ground truths, then applies uncertainty-based masking at pixel and patch levels to discard unreliable parts of pseudo ground truths before student learns from trusted areas.

Result: Outperforms previous annotation-efficient algorithms and achieves comparable performance with state-of-the-art fully-supervised methods using only 6.7% HDR ground truths.

Conclusion: The proposed uncertainty-based masking process effectively addresses confirmation bias in semi-supervised HDR reconstruction, enabling high performance with significantly reduced annotation requirements.

Abstract: Reconstructing high dynamic range (HDR) images from low dynamic range (LDR) bursts plays an essential role in the computational photography. Impressive progress has been achieved by learning-based algorithms which require LDR-HDR image pairs. However, these pairs are hard to obtain, which motivates researchers to delve into the problem of annotation-efficient HDR image reconstructing: how to achieve comparable performance with limited HDR ground truths (GTs). This work attempts to address this problem from the view of semi-supervised learning where a teacher model generates pseudo HDR GTs for the LDR samples without GTs and a student model learns from pseudo GTs. Nevertheless, the confirmation bias, i.e., the student may learn from the artifacts in pseudo HDR GTs, presents an impediment. To remove this impediment, an uncertainty-based masking process is proposed to discard unreliable parts of pseudo GTs at both pixel and patch levels, then the trusted areas can be learned from by the student. With this novel masking process, our semi-supervised HDR reconstructing method not only outperforms previous annotation-efficient algorithms, but also achieves comparable performance with up-to-date fully-supervised methods by using only 6.7% HDR GTs.

[352] Recurrent Autoregressive Diffusion: Global Memory Meets Local Attention

Taiye Chen, Zihan Ding, Anjian Li, Christina Zhang, Zeqi Xiao, Yisen Wang, Chi Jin

Main category: cs.CV

TL;DR: Introduces RAD framework with LSTM-enhanced diffusion transformers for long video generation, addressing memory limitations and training-inference gaps in existing approaches.

DetailsMotivation: Existing video diffusion models lack effective memory compression for long-term generation beyond window size, causing forgetting and spatiotemporal inconsistencies.

Method: Proposes Recurrent Autoregressive Diffusion (RAD) framework with LSTM integration in diffusion transformers, enabling frame-wise autoregression with consistent memory update across training and inference.

Result: Achieves superior performance on Memory Maze and Minecraft datasets, demonstrating LSTM’s efficiency in sequence modeling for long video generation.

Conclusion: RAD framework effectively enhances historical information retention within fixed memory budget, outperforming existing approaches in long video generation tasks.

Abstract: Recent advancements in video generation have demonstrated the potential of using video diffusion models as world models, with autoregressive generation of infinitely long videos through masked conditioning. However, such models, usually with local full attention, lack effective memory compression and retrieval for long-term generation beyond the window size, leading to issues of forgetting and spatiotemporal inconsistencies. To enhance the retention of historical information within a fixed memory budget, we introduce a recurrent neural network (RNN) into the diffusion transformer framework. Specifically, a diffusion model incorporating LSTM with attention achieves comparable performance to state-of-the-art RNN blocks, such as TTT and Mamba2. Moreover, existing diffusion-RNN approaches often suffer from performance degradation due to training-inference gap or the lack of overlap across windows. To address these limitations, we propose a novel Recurrent Autoregressive Diffusion (RAD) framework, which executes frame-wise autoregression for memory update and retrieval, consistently across training and inference time. Experiments on Memory Maze and Minecraft datasets demonstrate the superiority of RAD for long video generation, highlighting the efficiency of LSTM in sequence modeling.

[353] EndoSight AI: Deep Learning-Driven Real-Time Gastrointestinal Polyp Detection and Segmentation for Enhanced Endoscopic Diagnostics

Daniel Cavadia

Main category: cs.CV

TL;DR: EndoSight AI is a deep learning system for real-time gastrointestinal polyp detection and segmentation during endoscopy, achieving 88.3% mAP for detection and 69% Dice coefficient for segmentation with speeds over 35 FPS.

DetailsMotivation: Precise and real-time detection of gastrointestinal polyps during endoscopic procedures is crucial for early diagnosis and prevention of colorectal cancer.

Method: Deep learning architecture using the Hyper-Kvasir dataset, incorporating clinically relevant metrics and a novel thermal-aware procedure for model robustness and efficiency.

Result: Achieves 88.3% mAP for polyp detection, 69% Dice coefficient for segmentation, and real-time inference speeds exceeding 35 FPS on GPU hardware.

Conclusion: The integrated AI solution is designed for seamless deployment in endoscopy workflows, promising to advance diagnostic accuracy and clinical decision-making in gastrointestinal healthcare.

Abstract: Precise and real-time detection of gastrointestinal polyps during endoscopic procedures is crucial for early diagnosis and prevention of colorectal cancer. This work presents EndoSight AI, a deep learning architecture developed and evaluated independently to enable accurate polyp localization and detailed boundary delineation. Leveraging the publicly available Hyper-Kvasir dataset, the system achieves a mean Average Precision (mAP) of 88.3% for polyp detection and a Dice coefficient of up to 69% for segmentation, alongside real-time inference speeds exceeding 35 frames per second on GPU hardware. The training incorporates clinically relevant performance metrics and a novel thermal-aware procedure to ensure model robustness and efficiency. This integrated AI solution is designed for seamless deployment in endoscopy workflows, promising to advance diagnostic accuracy and clinical decision-making in gastrointestinal healthcare.

[354] T2I-Based Physical-World Appearance Attack against Traffic Sign Recognition Systems in Autonomous Driving

Chen Ma, Ningfei Wang, Junhao Zheng, Qing Guo, Qian Wang, Qi Alfred Chen, Chao Shen

Main category: cs.CV

TL;DR: DiffSign is a novel text-to-image based adversarial attack framework that generates physically robust, effective, transferable, and stealthy appearance attacks against Traffic Sign Recognition systems in autonomous driving.

DetailsMotivation: Existing adversarial appearance attacks on TSR systems have limitations: pixel-level perturbation methods lack stealthiness and transferability, while T2I diffusion model approaches show limited effectiveness and poor generalization to out-of-distribution sign types.

Method: Proposes a T2I-based attack pipeline with CLIP-based loss and masked prompts for improved focus and controllability, plus two novel style customization methods to guide visual appearance and enhance out-of-domain generalization and stealthiness.

Result: Achieves 83.3% average physical-world attack success rate under varied real-world conditions (different distances, angles, light conditions, and sign categories), demonstrating high effectiveness and transferability.

Conclusion: DiffSign overcomes limitations of prior approaches by generating physically robust, highly effective, transferable, practical, and stealthy appearance attacks against TSR systems.

Abstract: Traffic Sign Recognition (TSR) systems play a critical role in Autonomous Driving (AD) systems, enabling real-time detection of road signs, such as STOP and speed limit signs. While these systems are increasingly integrated into commercial vehicles, recent research has exposed their vulnerability to physical-world adversarial appearance attacks. In such attacks, carefully crafted visual patterns are misinterpreted by TSR models as legitimate traffic signs, while remaining inconspicuous or benign to human observers. However, existing adversarial appearance attacks suffer from notable limitations. Pixel-level perturbation-based methods often lack stealthiness and tend to overfit to specific surrogate models, resulting in poor transferability to real-world TSR systems. On the other hand, text-to-image (T2I) diffusion model-based approaches demonstrate limited effectiveness and poor generalization to out-of-distribution sign types. In this paper, we present DiffSign, a novel T2I-based appearance attack framework designed to generate physically robust, highly effective, transferable, practical, and stealthy appearance attacks against TSR systems. To overcome the limitations of prior approaches, we propose a carefully designed attack pipeline that integrates CLIP-based loss and masked prompts to improve attack focus and controllability. We also propose two novel style customization methods to guide visual appearance and improve out-of-domain traffic sign attack generalization and attack stealthiness. We conduct extensive evaluations of DiffSign under varied real-world conditions, including different distances, angles, light conditions, and sign categories. Our method achieves an average physical-world attack success rate of 83.3%, leveraging DiffSign’s high effectiveness in attack transferability.

[355] CalibrateMix: Guided-Mixup Calibration of Image Semi-Supervised Models

Mehrab Mustafy Rahman, Jayanth Mohan, Tiberiu Sosea, Cornelia Caragea

Main category: cs.CV

TL;DR: CalibrateMix improves semi-supervised learning model calibration using targeted mixup of easy and hard samples, achieving better accuracy and lower calibration error than existing SSL methods.

DetailsMotivation: Existing SSL methods suffer from poor calibration with overconfident predictions, and while mixup helps in supervised settings, it's challenging in SSL due to unreliable pseudolabels.

Method: CalibrateMix uses training dynamics to identify easy and hard samples, then performs targeted mixup between these sample types to improve calibration.

Result: Experimental results show CalibrateMix achieves lower expected calibration error (ECE) and superior accuracy across multiple benchmark image datasets.

Conclusion: Targeted mixup based on sample difficulty effectively improves SSL model calibration while maintaining or enhancing classification performance.

Abstract: Semi-supervised learning (SSL) has demonstrated high performance in image classification tasks by effectively utilizing both labeled and unlabeled data. However, existing SSL methods often suffer from poor calibration, with models yielding overconfident predictions that misrepresent actual prediction likelihoods. Recently, neural networks trained with {\tt mixup} that linearly interpolates random examples from the training set have shown better calibration in supervised settings. However, calibration of neural models remains under-explored in semi-supervised settings. Although effective in supervised model calibration, random mixup of pseudolabels in SSL presents challenges due to the overconfidence and unreliability of pseudolabels. In this work, we introduce CalibrateMix, a targeted mixup-based approach that aims to improve the calibration of SSL models while maintaining or even improving their classification accuracy. Our method leverages training dynamics of labeled and unlabeled samples to identify easy-to-learn'' and hard-to-learn’’ samples, which in turn are utilized in a targeted mixup of easy and hard samples. Experimental results across several benchmark image datasets show that our method achieves lower expected calibration error (ECE) and superior accuracy compared to existing SSL approaches.

[356] GrOCE:Graph-Guided Online Concept Erasure for Text-to-Image Diffusion Models

Ning Han, Zhenyu Ge, Feng Han, Yuhua Sun, Chengqing Li, Jingjing Chen

Main category: cs.CV

TL;DR: GrOCE is a training-free framework for precise concept erasure in text-to-image diffusion models using graph-based semantic reasoning, achieving SOTA performance without retraining.

DetailsMotivation: Existing concept erasure methods either require costly fine-tuning or use coarse semantic separation, which degrades unrelated concepts and lacks adaptability to evolving concept sets.

Method: GrOCE models concepts as a dynamic semantic graph with three components: Dynamic Topological Graph Construction, Adaptive Cluster Identification with multi-hop traversal and similarity-decay scoring, and Selective Edge Severing for targeted removal.

Result: Extensive experiments show GrOCE achieves state-of-the-art performance on Concept Similarity (CS) and Fréchet Inception Distance (FID) metrics.

Conclusion: GrOCE offers efficient, accurate, and stable concept erasure without retraining through principled graph-based reasoning over semantic dependencies.

Abstract: Concept erasure aims to remove harmful, inappropriate, or copyrighted content from text-to-image diffusion models while preserving non-target semantics. However, existing methods either rely on costly fine-tuning or apply coarse semantic separation, often degrading unrelated concepts and lacking adaptability to evolving concept sets. To alleviate this issue, we propose Graph-Guided Online Concept Erasure (GrOCE), a training-free framework that performs precise and adaptive concept removal through graph-based semantic reasoning. GrOCE models concepts and their interrelations as a dynamic semantic graph, enabling principled reasoning over dependencies and fine-grained isolation of undesired content. It comprises three components: (1) Dynamic Topological Graph Construction for incremental graph building, (2) Adaptive Cluster Identification for multi-hop traversal with similarity-decay scoring, and (3) Selective Edge Severing for targeted edge removal while preserving global semantics. Extensive experiments demonstrate that GrOCE achieves state-of-the-art performance on Concept Similarity (CS) and Fréchet Inception Distance (FID) metrics, offering efficient, accurate, and stable concept erasure without retraining.

[357] HiFusion: Hierarchical Intra-Spot Alignment and Regional Context Fusion for Spatial Gene Expression Prediction from Histopathology

Ziqiao Weng, Yaoyu Fang, Jiahe Qian, Xinkun Wang, Lee AD Cooper, Weidong Cai, Bo Zhou

Main category: cs.CV

TL;DR: HiFusion is a deep learning framework that predicts spatial transcriptomics gene expression from H&E-stained whole-slide images using hierarchical intra-spot modeling and context-aware cross-scale fusion, achieving state-of-the-art performance.

DetailsMotivation: Existing methods for predicting gene expression from histopathology images fail to capture biological heterogeneity within spots and are susceptible to morphological noise when integrating contextual tissue information, limiting clinical adoption of spatial transcriptomics.

Method: HiFusion integrates two components: Hierarchical Intra-Spot Modeling that extracts fine-grained morphological representations through multi-resolution sub-patch decomposition with feature alignment, and Context-aware Cross-scale Fusion that selectively incorporates relevant regional context using cross-attention.

Result: Extensive experiments on two benchmark ST datasets show HiFusion achieves state-of-the-art performance in both 2D slide-wise cross-validation and more challenging 3D sample-specific scenarios.

Conclusion: HiFusion demonstrates potential as a robust, accurate, and scalable solution for spatial transcriptomics inference from routine histopathology, overcoming limitations of existing approaches.

Abstract: Spatial transcriptomics (ST) bridges gene expression and tissue morphology but faces clinical adoption barriers due to technical complexity and prohibitive costs. While computational methods predict gene expression from H&E-stained whole-slide images (WSIs), existing approaches often fail to capture the intricate biological heterogeneity within spots and are susceptible to morphological noise when integrating contextual information from surrounding tissue. To overcome these limitations, we propose HiFusion, a novel deep learning framework that integrates two complementary components. First, we introduce the Hierarchical Intra-Spot Modeling module that extracts fine-grained morphological representations through multi-resolution sub-patch decomposition, guided by a feature alignment loss to ensure semantic consistency across scales. Concurrently, we present the Context-aware Cross-scale Fusion module, which employs cross-attention to selectively incorporate biologically relevant regional context, thereby enhancing representational capacity. This architecture enables comprehensive modeling of both cellular-level features and tissue microenvironmental cues, which are essential for accurate gene expression prediction. Extensive experiments on two benchmark ST datasets demonstrate that HiFusion achieves state-of-the-art performance across both 2D slide-wise cross-validation and more challenging 3D sample-specific scenarios. These results underscore HiFusion’s potential as a robust, accurate, and scalable solution for ST inference from routine histopathology.

[358] MCAQ-YOLO: Morphological Complexity-Aware Quantization for Efficient Object Detection with Curriculum Learning

Yoonjae Seo, Ermal Elbasani, Jaehong Lee

Main category: cs.CV

TL;DR: MCAQ-YOLO introduces morphology-aware quantization for object detection, using five morphological metrics to guide spatially adaptive bit allocation and achieving better accuracy than uniform quantization with minimal overhead.

DetailsMotivation: Most neural network quantization methods use uniform bit precision across spatial regions, ignoring the heterogeneous structural and textural complexity of visual data, which limits efficiency and performance.

Method: Uses five morphological metrics (fractal dimension, texture entropy, gradient variance, edge density, contour complexity) to characterize local visual morphology and guide spatially adaptive bit allocation, combined with curriculum-based quantization-aware training.

Result: Achieves 85.6% mAP@0.5 with average 4.2 bits and 7.6x compression ratio, outperforming uniform 4-bit quantization by 3.5 percentage points with only 1.8ms additional runtime overhead. Consistent gains validated on COCO and Pascal VOC datasets.

Conclusion: Morphology-driven spatial quantization enhances efficiency and robustness for computationally constrained, safety-critical visual recognition tasks, demonstrating strong correlation between morphological complexity and quantization sensitivity.

Abstract: Most neural network quantization methods apply uniform bit precision across spatial regions, ignoring the heterogeneous structural and textural complexity of visual data. This paper introduces MCAQ-YOLO, a morphological complexity-aware quantization framework for object detection. The framework employs five morphological metrics - fractal dimension, texture entropy, gradient variance, edge density, and contour complexity - to characterize local visual morphology and guide spatially adaptive bit allocation. By correlating these metrics with quantization sensitivity, MCAQ-YOLO dynamically adjusts bit precision according to spatial complexity. In addition, a curriculum-based quantization-aware training scheme progressively increases quantization difficulty to stabilize optimization and accelerate convergence. Experimental results demonstrate a strong correlation between morphological complexity and quantization sensitivity and show that MCAQ-YOLO achieves superior detection accuracy and convergence efficiency compared with uniform quantization. On a safety equipment dataset, MCAQ-YOLO attains 85.6 percent mAP@0.5 with an average of 4.2 bits and a 7.6x compression ratio, yielding 3.5 percentage points higher mAP than uniform 4-bit quantization while introducing only 1.8 ms of additional runtime overhead per image. Cross-dataset validation on COCO and Pascal VOC further confirms consistent performance gains, indicating that morphology-driven spatial quantization can enhance efficiency and robustness for computationally constrained, safety-critical visual recognition tasks.

[359] UNSEEN: Enhancing Dataset Pruning from a Generalization Perspective

Furui Xu, Shaobo Wang, Jiajun Zhang, Chenghao Sun, Haixiang Tang, Linfeng Zhang

Main category: cs.CV

TL;DR: UNSEEN is a plug-and-play framework for dataset pruning that scores samples based on models not exposed to them during training, addressing limitations of fitting-centric approaches and scaling to multi-step scenarios for improved coreset quality.

DetailsMotivation: Existing dataset pruning methods rely on fitting-centric approaches where sample scores become dense and less distinguishable as models achieve near-optimal performance on training data, hindering effective selection.

Method: Propose UNSEEN framework that scores samples from generalization perspective using models not exposed to them during training, with multi-step incremental selection through models trained on varying coresets for dynamic optimization.

Result: Significantly outperforms SOTA methods on CIFAR-10, CIFAR-100, and ImageNet-1K, achieving lossless performance while reducing training data by 30% on ImageNet-1K.

Conclusion: Generalization-based scoring and multi-step incremental selection effectively address limitations of fitting-centric approaches, enabling more effective dataset pruning with comparable performance to full datasets.

Abstract: The growing scale of datasets in deep learning has introduced significant computational challenges. Dataset pruning addresses this challenge by constructing a compact but informative coreset from the full dataset with comparable performance. Previous approaches typically establish scoring metrics based on specific criteria to identify representative samples. However, these methods predominantly rely on sample scores obtained from the model’s performance during the training (i.e., fitting) phase. As scoring models achieve near-optimal performance on training data, such fitting-centric approaches induce a dense distribution of sample scores within a narrow numerical range. This concentration reduces the distinction between samples and hinders effective selection. To address this challenge, we conduct dataset pruning from the perspective of generalization, i.e., scoring samples based on models not exposed to them during training. We propose a plug-and-play framework, UNSEEN, which can be integrated into existing dataset pruning methods. Additionally, conventional score-based methods are single-step and rely on models trained solely on the complete dataset, providing limited perspective on the importance of samples. To address this limitation, we scale UNSEEN to multi-step scenarios and propose an incremental selection technique through scoring models trained on varying coresets, and optimize the quality of the coreset dynamically. Extensive experiments demonstrate that our method significantly outperforms existing state-of-the-art (SOTA) methods on CIFAR-10, CIFAR-100, and ImageNet-1K. Notably, on ImageNet-1K, UNSEEN achieves lossless performance while reducing training data by 30%.

[360] ArtiWorld: LLM-Driven Articulation of 3D Objects in Scenes

Yixuan Yang, Luyang Xie, Zhen Luo, Zixiang Zhao, Mingqi Gao, Feng Zheng

Main category: cs.CV

TL;DR: ArtiWorld automatically converts rigid 3D objects into articulated URDF models using scene descriptions and LLM knowledge, outperforming existing methods while preserving geometry.

DetailsMotivation: Manual conversion of rigid 3D assets to articulated objects is labor-intensive, creating a need for automated methods to build interactive simulators and robot-learning environments.

Method: Uses Arti4URDF pipeline with 3D point clouds, LLM prior knowledge, and URDF-oriented prompts to identify articulable objects and reconstruct executable URDF models.

Result: Outperforms existing approaches across 3D simulated objects, full scenes, and real-world scans, achieving state-of-the-art performance while preserving geometry and interactivity.

Conclusion: Provides practical path for building interactive, robot-ready simulation environments directly from existing 3D assets.

Abstract: Building interactive simulators and scalable robot-learning environments requires a large number of articulated assets. However, most existing 3D assets in simulation are rigid, and manually converting them into articulated objects is extremely labor- and cost-intensive. This raises a natural question: can we automatically identify articulable objects in a scene and convert them into articulated assets directly? In this paper, we present ArtiWorld, a scene-aware pipeline that localizes candidate articulable objects from textual scene descriptions and reconstructs executable URDF models that preserve the original geometry. At the core of this pipeline is Arti4URDF, which leverages 3D point cloud, prior knowledge of a large language model (LLM), and a URDF-oriented prompt design to rapidly convert rigid objects into interactive URDF-based articulated objects while maintaining their 3D shape. We evaluate ArtiWorld at three levels: 3D simulated objects, full 3D simulated scenes, and real-world scan scenes. Across all three settings, our method consistently outperforms existing approaches and achieves state-of-the-art performance, while preserving object geometry and correctly capturing object interactivity to produce usable URDF-based articulated models. This provides a practical path toward building interactive, robot-ready simulation environments directly from existing 3D assets. Code and data will be released.

[361] SAGE: Spuriousness-Aware Guided Prompt Exploration for Mitigating Multimodal Bias

Wenqian Ye, Di Wang, Guangtao Zheng, Bohan Liu, Aidong Zhang

Main category: cs.CV

TL;DR: SAGE is a zero-shot method that mitigates multimodal spurious bias in CLIP models by selecting prompts that maximize semantic separation between classes, improving robustness without training or fine-tuning.

DetailsMotivation: CLIP models develop spurious biases where they rely on background correlations rather than object features, harming out-of-distribution performance. Existing methods require fine-tuning or bias knowledge, compromising CLIP's out-of-the-box usability.

Method: SAGE explores prompt templates and selects those that create the largest semantic separation between classes, guided by theoretical analysis of multimodal spurious bias in zero-shot classification.

Result: Extensive experiments on 4 benchmark datasets and 5 backbone models show SAGE consistently improves zero-shot performance and generalization, outperforming previous zero-shot approaches without external knowledge.

Conclusion: SAGE effectively mitigates multimodal spurious bias in CLIP models through guided prompt selection, enhancing robustness while maintaining zero-shot capabilities without requiring training or fine-tuning.

Abstract: Large vision-language models, such as CLIP, have shown strong zero-shot classification performance by aligning images and text in a shared embedding space. However, CLIP models often develop multimodal spurious biases, which is the undesirable tendency to rely on spurious features. For example, CLIP may infer object types in images based on frequently co-occurring backgrounds rather than the object’s core features. This bias significantly impairs the robustness of pre-trained CLIP models on out-of-distribution data, where such cross-modal associations no longer hold. Existing methods for mitigating multimodal spurious bias typically require fine-tuning on downstream data or prior knowledge of the bias, which undermines the out-of-the-box usability of CLIP. In this paper, we first theoretically analyze the impact of multimodal spurious bias in zero-shot classification. Based on this insight, we propose Spuriousness-Aware Guided Exploration (SAGE), a simple and effective method that mitigates spurious bias through guided prompt selection. SAGE requires no training, fine-tuning, or external annotations. It explores a space of prompt templates and selects the prompts that induce the largest semantic separation between classes, thereby improving worst-group robustness. Extensive experiments on four real-world benchmark datasets and five popular backbone models demonstrate that SAGE consistently improves zero-shot performance and generalization, outperforming previous zero-shot approaches without any external knowledge or model updates.

[362] Concept Regions Matter: Benchmarking CLIP with a New Cluster-Importance Approach

Aishwarya Agarwal, Srikrishna Karanam, Vineet Gandhi

Main category: cs.CV

TL;DR: CCI is a new interpretability method for CLIP that uses patch clustering to identify important concepts, achieving state-of-the-art faithfulness and enabling automatic foreground/background analysis of predictions.

DetailsMotivation: Contrastive VLMs like CLIP suffer from spurious correlations and background over-reliance, but existing interpretability methods are limited in faithfulness and diagnostic capabilities.

Method: CCI clusters CLIP’s patch embeddings into semantic groups, masks them, and evaluates prediction changes. Combined with GroundedSAM, it categorizes predictions as foreground- or background-driven.

Result: CCI achieves more than twofold improvement on deletion-AUC for MS COCO retrieval and sets new SOTA on faithfulness benchmarks. COVAR benchmark reveals that background correlations aren’t the only source of errors - viewpoint, scale, and fine-grained confusions also contribute.

Conclusion: CCI provides superior interpretability for VLMs, and the COVAR benchmark enables comprehensive evaluation of model robustness, charting a path toward more reliable vision-language models.

Abstract: Contrastive vision-language models (VLMs) such as CLIP achieve strong zero-shot recognition yet remain vulnerable to spurious correlations, particularly background over-reliance. We introduce Cluster-based Concept Importance (CCI), a novel interpretability method that uses CLIP’s own patch embeddings to group spatial patches into semantically coherent clusters, mask them, and evaluate relative changes in model predictions. CCI sets a new state of the art on faithfulness benchmarks, surpassing prior methods by large margins; for example, it yields more than a twofold improvement on the deletion-AUC metric for MS COCO retrieval. We further propose that CCI, when combined with GroundedSAM, automatically categorizes predictions as foreground- or background-driven, providing a crucial diagnostic ability. Existing benchmarks such as CounterAnimals, however, rely solely on accuracy and implicitly attribute all performance degradation to background correlations. Our analysis shows this assumption to be incomplete, since many errors arise from viewpoint variation, scale shifts, and fine-grained object confusions. To disentangle these effects, we introduce COVAR, a benchmark that systematically varies object foregrounds and backgrounds. Leveraging CCI with COVAR, we present a comprehensive evaluation of eighteen CLIP variants, offering methodological advances and empirical evidence that chart a path toward more robust VLMs.

[363] MeanFlow Transformers with Representation Autoencoders

Zheyuan Hu, Chieh-Hsin Lai, Ge Wu, Yuki Mitsufuji, Stefano Ermon

Main category: cs.CV

TL;DR: Efficient training and sampling scheme for MeanFlow using Representation Autoencoder latent space, achieving better performance with reduced computational costs.

DetailsMotivation: MeanFlow training is computationally demanding and unstable, with SD-VAE decoder dominating generation cost and complex guidance requirements for class-conditional generation.

Method: Uses RAE latent space with pre-trained vision encoder, implements Consistency Mid-Training for initialization, two-stage training with distillation from pre-trained flow matching teacher and optional bootstrapping with one-point velocity estimator.

Result: Achieves 1-step FID of 2.03 (vs vanilla MF’s 3.43), reduces sampling GFLOPS by 38% and total training cost by 83% on ImageNet 256. Scales to ImageNet 512 with 1-step FID of 3.23.

Conclusion: Proposed method removes need for guidance, simplifies training configurations, and significantly reduces computation while achieving superior performance.

Abstract: MeanFlow (MF) is a diffusion-motivated generative model that enables efficient few-step generation by learning long jumps directly from noise to data. In practice, it is often used as a latent MF by leveraging the pre-trained Stable Diffusion variational autoencoder (SD-VAE) for high-dimensional data modeling. However, MF training remains computationally demanding and is often unstable. During inference, the SD-VAE decoder dominates the generation cost, and MF depends on complex guidance hyperparameters for class-conditional generation. In this work, we develop an efficient training and sampling scheme for MF in the latent space of a Representation Autoencoder (RAE), where a pre-trained vision encoder (e.g., DINO) provides semantically rich latents paired with a lightweight decoder. We observe that naive MF training in the RAE latent space suffers from severe gradient explosion. To stabilize and accelerate training, we adopt Consistency Mid-Training for trajectory-aware initialization and use a two-stage scheme: distillation from a pre-trained flow matching teacher to speed convergence and reduce variance, followed by an optional bootstrapping stage with a one-point velocity estimator to further reduce deviation from the oracle mean flow. This design removes the need for guidance, simplifies training configurations, and reduces computation in both training and sampling. Empirically, our method achieves a 1-step FID of 2.03, outperforming vanilla MF’s 3.43, while reducing sampling GFLOPS by 38% and total training cost by 83% on ImageNet 256. We further scale our approach to ImageNet 512, achieving a competitive 1-step FID of 3.23 with the lowest GFLOPS among all baselines. Code is available at https://github.com/sony/mf-rae.

[364] Semantic Prioritization in Visual Counterfactual Explanations with Weighted Segmentation and Auto-Adaptive Region Selection

Lintong Zhang, Kang Yin, Seong-Whan Lee

Main category: cs.CV

TL;DR: WSAE-Net improves visual counterfactual explanations by using weighted semantic maps and auto-adaptive editing sequences to maintain semantic relevance and optimize computational efficiency.

DetailsMotivation: Traditional visual counterfactual explanation methods often replace image sections without considering semantic relevance to the target object, which reduces interpretability and hinders editing workflows.

Method: Proposes WSAE-Net with two key innovations: weighted semantic maps to reduce non-semantic feature computations, and auto-adaptive candidate editing sequences to determine optimal processing order while maintaining semantic relevance.

Result: The methodology demonstrates superior performance in comprehensive experiments, enabling more efficient generation of counterfactuals while preserving semantic relationships.

Conclusion: WSAE-Net contributes to clearer and deeper understanding of visual counterfactual explanations by addressing semantic relevance and computational efficiency challenges in traditional approaches.

Abstract: In the domain of non-generative visual counterfactual explanations (CE), traditional techniques frequently involve the substitution of sections within a query image with corresponding sections from distractor images. Such methods have historically overlooked the semantic relevance of the replacement regions to the target object, thereby impairing the model’s interpretability and hindering the editing workflow. Addressing these challenges, the present study introduces an innovative methodology named as Weighted Semantic Map with Auto-adaptive Candidate Editing Network (WSAE-Net). Characterized by two significant advancements: the determination of an weighted semantic map and the auto-adaptive candidate editing sequence. First, the generation of the weighted semantic map is designed to maximize the reduction of non-semantic feature units that need to be computed, thereby optimizing computational efficiency. Second, the auto-adaptive candidate editing sequences are designed to determine the optimal computational order among the feature units to be processed, thereby ensuring the efficient generation of counterfactuals while maintaining the semantic relevance of the replacement feature units to the target object. Through comprehensive experimentation, our methodology demonstrates superior performance, contributing to a more lucid and in-depth understanding of visual counterfactual explanations.

[365] SpectralAdapt: Semi-Supervised Domain Adaptation with Spectral Priors for Human-Centered Hyperspectral Image Reconstruction

Yufei Wen, Yuting Zhang, Jingdan Kang, Hao Ren, Weibin Cheng, Jintai Chen, Kaishun Wu

Main category: cs.CV

TL;DR: SpectralAdapt is a semi-supervised domain adaptation framework that bridges the domain gap between general and human-centered hyperspectral imaging datasets, addressing data scarcity in medical HSI applications through spectral density masking and endmember representation alignment.

DetailsMotivation: Hyperspectral imaging has great potential for healthcare but faces challenges with costly data acquisition and scarcity of human HSI data, limiting progress in medical applications despite abundant general domain datasets.

Method: Proposes SpectralAdapt framework with two key components: Spectral Density Masking (SDM) that adaptively masks RGB channels based on spectral complexity, and Spectral Endmember Representation Alignment (SERA) that uses physically interpretable endmembers as domain-invariant anchors with momentum updates.

Result: Experiments show consistent improvements in spectral fidelity, cross-domain generalization, and training stability on benchmark datasets.

Conclusion: SpectralAdapt effectively mitigates domain shift, spectral degradation, and data scarcity in HSI reconstruction, demonstrating SSDA as an efficient solution for hyperspectral imaging in healthcare.

Abstract: Hyperspectral imaging (HSI) holds great potential for healthcare due to its rich spectral information. However, acquiring HSI data remains costly and technically demanding. Hyperspectral image reconstruction offers a practical solution by recovering HSI data from accessible modalities, such as RGB. While general domain datasets are abundant, the scarcity of human HSI data limits progress in medical applications. To tackle this, we propose SpectralAdapt, a semi-supervised domain adaptation (SSDA) framework that bridges the domain gap between general and human-centered HSI datasets. To fully exploit limited labels and abundant unlabeled data, we enhance spectral reasoning by introducing Spectral Density Masking (SDM), which adaptively masks RGB channels based on their spectral complexity, encouraging recovery of informative regions from complementary cues during consistency training. Furthermore, we introduce Spectral Endmember Representation Alignment (SERA), which derives physically interpretable endmembers from valuable labeled pixels and employs them as domain-invariant anchors to guide unlabeled predictions, with momentum updates ensuring adaptability and stability. These components are seamlessly integrated into SpectralAdapt, a spectral prior-guided framework that effectively mitigates domain shift, spectral degradation, and data scarcity in HSI reconstruction. Experiments on benchmark datasets demonstrate consistent improvements in spectral fidelity, cross-domain generalization, and training stability, highlighting the promise of SSDA as an efficient solution for hyperspectral imaging in healthcare.

[366] PerTouch: VLM-Driven Agent for Personalized and Semantic Image Retouching

Zewei Chang, Zheng-Peng Duan, Jianxing Zhang, Chun-Le Guo, Siyu Liu, Hyungju Chun, Hyunhee Park, Zikun Liu, Chongyi Li

Main category: cs.CV

TL;DR: PerTouch is a diffusion-based image retouching framework that balances controllability and subjectivity by using parameter maps for semantic-level editing while maintaining global aesthetics, with VLM-driven agents for natural language instruction handling.

DetailsMotivation: To address the challenge of balancing controllability and subjectivity in image retouching while aligning with users' personalized aesthetic preferences.

Method: Uses parameter maps with attribute values in semantic regions as input to construct explicit parameter-to-image mapping, with semantic replacement and parameter perturbation mechanisms for better boundary perception, and VLM-driven agents with feedback-driven rethinking and scene-aware memory.

Result: Extensive experiments demonstrate each component’s effectiveness and superior performance in personalized image retouching.

Conclusion: PerTouch provides an effective unified framework for personalized image retouching that better aligns with user intent and captures long-term preferences.

Abstract: Image retouching aims to enhance visual quality while aligning with users’ personalized aesthetic preferences. To address the challenge of balancing controllability and subjectivity, we propose a unified diffusion-based image retouching framework called PerTouch. Our method supports semantic-level image retouching while maintaining global aesthetics. Using parameter maps containing attribute values in specific semantic regions as input, PerTouch constructs an explicit parameter-to-image mapping for fine-grained image retouching. To improve semantic boundary perception, we introduce semantic replacement and parameter perturbation mechanisms in the training process. To connect natural language instructions with visual control, we develop a VLM-driven agent that can handle both strong and weak user instructions. Equipped with mechanisms of feedback-driven rethinking and scene-aware memory, PerTouch better aligns with user intent and captures long-term preferences. Extensive experiments demonstrate each component’s effectiveness and the superior performance of PerTouch in personalized image retouching. Code is available at: https://github.com/Auroral703/PerTouch.

[367] Medal S: Spatio-Textual Prompt Model for Medical Segmentation

Pengcheng Shi, Jiawei Chen, Jiaqi Liu, Xinglin Zhang, Tao Chen, Lei Li

Main category: cs.CV

TL;DR: Medal S is a medical segmentation foundation model that supports native-resolution spatial and textual prompts in an end-to-end framework, achieving superior multi-class segmentation across multiple medical imaging modalities with 90% faster inference than sequential methods.

DetailsMotivation: To address limitations of text-only methods lacking spatial awareness and resolution mismatches in medical segmentation, while enabling efficient multi-class segmentation across diverse medical imaging modalities.

Method: Uses channel-wise alignment between volumetric prompts and text embeddings, preserves full 3D context, employs lightweight 3D convolutional module for voxel-space refinement, supports two prompting modes (text-only and hybrid), and implements dynamic resampling for data augmentation.

Result: Outperforms SAT with DSC of 75.44 vs 69.83, NSD of 77.34 vs 71.06, F1 of 38.24 vs 24.88, and DSC TP of 65.46 vs 46.97 on five-modality average, while reducing inference time by over 90% for 24-class segmentation.

Conclusion: Medal S successfully harmonizes spatial precision with semantic textual guidance, demonstrating superior efficiency and accuracy in multi-class medical segmentation tasks compared to sequential prompt-based approaches.

Abstract: We introduce Medal S, a medical segmentation foundation model that supports native-resolution spatial and textual prompts within an end-to-end trainable framework. Unlike text-only methods lacking spatial awareness, Medal S achieves channel-wise alignment between volumetric prompts and text embeddings, mitigating inaccuracies from resolution mismatches. By preserving full 3D context, it efficiently processes multiple native-resolution masks in parallel, enhancing multi-class segmentation performance. A lightweight 3D convolutional module enables precise voxel-space refinement guided by both prompt types, supporting up to 243 classes across CT, MRI, PET, ultrasound, and microscopy modalities in the BiomedSegFM dataset. Medal S offers two prompting modes: a text-only mode, where model predictions serve as spatial prompts for self-refinement without human input, and a hybrid mode, incorporating manual annotations for enhanced flexibility. For 24-class segmentation, parallel spatial prompting reduces inference time by more than 90% compared to sequential prompting. We propose dynamic resampling to address target-patch ratio imbalance, extending SAT and nnU-Net for data augmentation. Furthermore, we develop optimized text preprocessing, a two-stage inference strategy, and post-processing techniques to improve memory efficiency, precision, and inference speed. On the five-modality average on the validation set, Medal S outperforms SAT with a DSC of 75.44 (vs. 69.83), NSD of 77.34 (vs. 71.06), F1 of 38.24 (vs. 24.88), and DSC TP of 65.46 (vs. 46.97). Medal S achieves excellent performance by harmonizing spatial precision with semantic textual guidance, demonstrating superior efficiency and accuracy in multi-class medical segmentation tasks compared to sequential prompt-based approaches. Medal S will be publicly available at https://github.com/yinghemedical/Medal-S.

[368] Infinite-Story: A Training-Free Consistent Text-to-Image Generation

Jihun Park, Kyoungmin Lee, Jongmin Gim, Hyeonseo Jo, Minseok Oh, Wonhyeok Choi, Kyumin Hwang, Jaeyeul Kim, Minwoo Choi, Sunghoon Im

Main category: cs.CV

TL;DR: Infinite-Story is a training-free framework for consistent text-to-image generation in multi-prompt storytelling, achieving state-of-the-art performance with 6x faster inference than existing methods.

DetailsMotivation: Address identity and style inconsistency challenges in consistent text-to-image generation for multi-prompt storytelling scenarios.

Method: Uses scale-wise autoregressive model with three techniques: Identity Prompt Replacement to mitigate context bias, and unified attention guidance with Adaptive Style Injection and Synchronized Guidance Adaptation for global consistency.

Result: Achieves high identity and style consistency across diverse prompts with 1.72 seconds per image inference speed (6x faster than existing methods).

Conclusion: The framework is effective and practical for real-world visual storytelling, operating entirely at test time without fine-tuning.

Abstract: We present Infinite-Story, a training-free framework for consistent text-to-image (T2I) generation tailored for multi-prompt storytelling scenarios. Built upon a scale-wise autoregressive model, our method addresses two key challenges in consistent T2I generation: identity inconsistency and style inconsistency. To overcome these issues, we introduce three complementary techniques: Identity Prompt Replacement, which mitigates context bias in text encoders to align identity attributes across prompts; and a unified attention guidance mechanism comprising Adaptive Style Injection and Synchronized Guidance Adaptation, which jointly enforce global style and identity appearance consistency while preserving prompt fidelity. Unlike prior diffusion-based approaches that require fine-tuning or suffer from slow inference, Infinite-Story operates entirely at test time, delivering high identity and style consistency across diverse prompts. Extensive experiments demonstrate that our method achieves state-of-the-art generation performance, while offering over 6X faster inference (1.72 seconds per image) than the existing fastest consistent T2I models, highlighting its effectiveness and practicality for real-world visual storytelling.

[369] Beyond Darkness: Thermal-Supervised 3D Gaussian Splatting for Low-Light Novel View Synthesis

Qingsen Ma, Chen Zou, Dianyun Wang, Jia Wang, Liuyu Xiang, Zhaofeng He

Main category: cs.CV

TL;DR: DTGS is a unified framework that combines Retinex-inspired illumination decomposition with thermal-guided 3D Gaussian Splatting for robust novel view synthesis under extreme low-light conditions.

DetailsMotivation: Standard 3DGS pipelines fail under extremely low-light conditions due to illumination inconsistencies and geometric distortion when applied to underexposed inputs, requiring a solution that can handle severe illumination degradation.

Method: DTGS performs joint optimization across enhancement, geometry, and thermal supervision through a cyclic enhancement-reconstruction mechanism, using thermal guidance to stabilize color restoration and geometry learning, and embedding Retinex-based decomposition within the 3DGS loop.

Result: Extensive experiments on the new RGBT-LOW dataset show DTGS significantly outperforms existing low-light enhancement and 3D reconstruction baselines, achieving superior radiometric consistency, geometric fidelity, and color stability under extreme illumination.

Conclusion: DTGS provides a unified solution for illumination-invariant reconstruction under extreme low-light conditions by tightly coupling enhancement with 3D reconstruction through thermal guidance and Retinex decomposition.

Abstract: Under extremely low-light conditions, novel view synthesis (NVS) faces severe degradation in terms of geometry, color consistency, and radiometric stability. Standard 3D Gaussian Splatting (3DGS) pipelines fail when applied directly to underexposed inputs, as independent enhancement across views causes illumination inconsistencies and geometric distortion. To address this, we present DTGS, a unified framework that tightly couples Retinex-inspired illumination decomposition with thermal-guided 3D Gaussian Splatting for illumination-invariant reconstruction. Unlike prior approaches that treat enhancement as a pre-processing step, DTGS performs joint optimization across enhancement, geometry, and thermal supervision through a cyclic enhancement-reconstruction mechanism. A thermal supervisory branch stabilizes both color restoration and geometry learning by dynamically balancing enhancement, structural, and thermal losses. Moreover, a Retinex-based decomposition module embedded within the 3DGS loop provides physically interpretable reflectance-illumination separation, ensuring consistent color and texture across viewpoints. To evaluate our method, we construct RGBT-LOW, a new multi-view low-light thermal dataset capturing severe illumination degradation. Extensive experiments show that DTGS significantly outperforms existing low-light enhancement and 3D reconstruction baselines, achieving superior radiometric consistency, geometric fidelity, and color stability under extreme illumination.

[370] You Only Look Omni Gradient Backpropagation for Moving Infrared Small Target Detection

Guoyi Zhang, Guangsheng Xu, Siyang Chen, Han Wang, Xiaohu Zhang

Main category: cs.CV

TL;DR: BP-FPN is a backpropagation-driven feature pyramid network for infrared small target detection that addresses fundamental bottlenecks in per-frame feature representation through gradient-isolated shortcuts and directional gradient regularization.

DetailsMotivation: Existing deep learning methods for infrared small target detection focus on spatio-temporal feature aggregation but have limited gains, revealing that the fundamental bottleneck lies in ambiguous per-frame feature representations rather than spatio-temporal modeling itself.

Method: Proposes BP-FPN with Gradient-Isolated Low-Level Shortcut (GILS) to incorporate fine-grained target details without shortcut learning, and Directional Gradient Regularization (DGR) to enforce hierarchical feature consistency during backpropagation.

Result: Extensive experiments on multiple public datasets show that BP-FPN consistently establishes new state-of-the-art performance.

Conclusion: BP-FPN is the first FPN designed for infrared small target detection entirely from the backpropagation perspective, offering theoretically grounded design with negligible computational overhead that can be seamlessly integrated into existing frameworks.

Abstract: Moving infrared small target detection is a key component of infrared search and tracking systems, yet it remains extremely challenging due to low signal-to-clutter ratios, severe target-background imbalance, and weak discriminative features. Existing deep learning methods primarily focus on spatio-temporal feature aggregation, but their gains are limited, revealing that the fundamental bottleneck lies in ambiguous per-frame feature representations rather than spatio-temporal modeling itself. Motivated by this insight, we propose BP-FPN, a backpropagation-driven feature pyramid architecture that fundamentally rethinks feature learning for small target. BP-FPN introduces Gradient-Isolated Low-Level Shortcut (GILS) to efficiently incorporate fine-grained target details without inducing shortcut learning, and Directional Gradient Regularization (DGR) to enforce hierarchical feature consistency during backpropagation. The design is theoretically grounded, introduces negligible computational overhead, and can be seamlessly integrated into existing frameworks. Extensive experiments on multiple public datasets show that BP-FPN consistently establishes new state-of-the-art performance. To the best of our knowledge, it is the first FPN designed for this task entirely from the backpropagation perspective.

[371] Geometry Meets Light: Leveraging Geometric Priors for Universal Photometric Stereo under Limited Multi-Illumination Cues

King-Man Tam, Satoshi Ikehata, Yuta Asano, Zhaoyi An, Rei Kawakami

Main category: cs.CV

TL;DR: GeoUniPS is a universal photometric stereo network that integrates synthetic supervision with geometric priors from 3D reconstruction models to handle complex in-the-wild scenes where traditional methods fail.

DetailsMotivation: Universal photometric stereo struggles with unreliable multi-illumination cues in biased lighting, shadows, and self-occluded regions of complex real-world scenes.

Method: Uses a Light-Geometry Dual-Branch Encoder to extract multi-illumination cues and geometric priors from frozen 3D reconstruction models, and introduces PS-Perp dataset with perspective projection to address orthographic projection limitations.

Result: State-of-the-art performance across multiple datasets, especially in complex in-the-wild scenes, both quantitatively and qualitatively.

Conclusion: Integrating geometric priors from 3D foundation models significantly improves universal photometric stereo performance in challenging real-world conditions.

Abstract: Universal Photometric Stereo is a promising approach for recovering surface normals without strict lighting assumptions. However, it struggles when multi-illumination cues are unreliable, such as under biased lighting or in shadows or self-occluded regions of complex in-the-wild scenes. We propose GeoUniPS, a universal photometric stereo network that integrates synthetic supervision with high-level geometric priors from large-scale 3D reconstruction models pretrained on massive in-the-wild data. Our key insight is that these 3D reconstruction models serve as visual-geometry foundation models, inherently encoding rich geometric knowledge of real scenes. To leverage this, we design a Light-Geometry Dual-Branch Encoder that extracts both multi-illumination cues and geometric priors from the frozen 3D reconstruction model. We also address the limitations of the conventional orthographic projection assumption by introducing the PS-Perp dataset with realistic perspective projection to enable learning of spatially varying view directions. Extensive experiments demonstrate that GeoUniPS delivers state-of-the-arts performance across multiple datasets, both quantitatively and qualitatively, especially in the complex in-the-wild scenes.

[372] Rethinking Saliency Maps: A Cognitive Human Aligned Taxonomy and Evaluation Framework for Explanations

Yehonatan Elisha, Seffi Cohen, Oren Barkan, Noam Koenigstein

Main category: cs.CV

TL;DR: The paper introduces RFxG taxonomy to organize saliency explanations by reference-frame (pointwise vs contrastive) and granularity (class-level vs group-level), revealing limitations in current evaluation metrics and proposing new faithfulness metrics for comprehensive assessment.

DetailsMotivation: There is fundamental lack of consensus about saliency maps' intended purpose and alignment with user queries, which hinders effective evaluation and practical utility of explanation methods.

Method: Proposed Reference-Frame × Granularity (RFxG) taxonomy framework and four novel faithfulness metrics to systematically evaluate explanation quality across both dimensions, applied to ten saliency methods, four model architectures, and three datasets.

Result: Demonstrated critical limitations in existing evaluation metrics that prioritize pointwise faithfulness while neglecting contrastive reasoning and semantic granularity. Comprehensive evaluation framework provides tools for user-intent-driven assessment.

Conclusion: The work provides conceptual foundation and practical tools to develop visual explanations that are faithful to model behavior and meaningfully aligned with human understanding and inquiry complexity.

Abstract: Saliency maps are widely used for visual explanations in deep learning, but a fundamental lack of consensus persists regarding their intended purpose and alignment with diverse user queries. This ambiguity hinders the effective evaluation and practical utility of explanation methods.We address this gap by introducing the Reference-Frame $\times$ Granularity (RFxG) taxonomy, a principled conceptual framework that organizes saliency explanations along two essential axes:Reference-Frame: Distinguishing between pointwise (“Why this prediction?”) and contrastive (“Why this and not an alternative?”) explanations.Granularity: Ranging from fine-grained class-level (e.g., “Why Husky?”) to coarse-grained group-level (e.g., “Why Dog?”) interpretations.Using the RFxG lens, we demonstrate critical limitations in existing evaluation metrics, which overwhelmingly prioritize pointwise faithfulness while neglecting contrastive reasoning and semantic granularity. To systematically assess explanation quality across both RFxG dimensions, we propose four novel faithfulness metrics. Our comprehensive evaluation framework applies these metrics to ten state-of-the-art saliency methods, four model architectures, and three datasets.By advocating a shift toward user-intent-driven evaluation, our work provides both the conceptual foundation and the practical tools necessary to develop visual explanations that are not only faithful to the underlying model behavior but are also meaningfully aligned with the complexity of human understanding and inquiry.

[373] REVISOR: Beyond Textual Reflection, Towards Multimodal Introspective Reasoning in Long-Form Video Understanding

Jiaze Li, Hao Yin, Wenhui Tan, Jingyang Chen, Boshen Xu, Yuxun Qu, Yijing Chen, Jianzhong Ju, Zhenbo Luo, Jian Luan

Main category: cs.CV

TL;DR: REVISOR is a novel multimodal reflection framework that enhances long-form video understanding by enabling cross-modal collaborative reflection between text and visual information, addressing limitations of purely text-based reflection mechanisms.

DetailsMotivation: Purely text-based reflection mechanisms have limitations in long-form video understanding due to insufficient visual information rethinking and lack of cross-modal interaction capabilities, which prevent full integration of visual information during reflection.

Method: Proposed REVISOR framework with Dual Attribution Decoupled Reward (DADR) mechanism integrated into GRPO training strategy, enabling MLLMs to collaboratively construct introspective reflection processes across textual and visual modalities.

Result: Significantly enhances long-form video understanding capability of MLLMs without requiring supplementary supervised fine-tuning or external models, achieving impressive results on four benchmarks including VideoMME, LongVideoBench, MLVU, and LVBench.

Conclusion: REVISOR framework successfully addresses the limitations of text-only reflection in long-form video understanding by enabling cross-modal collaborative reflection, demonstrating substantial improvements in multimodal reasoning capabilities.

Abstract: Self-reflection mechanisms that rely on purely text-based rethinking processes perform well in most multimodal tasks. However, when directly applied to long-form video understanding scenarios, they exhibit clear limitations. The fundamental reasons for this lie in two points: (1)long-form video understanding involves richer and more dynamic visual input, meaning rethinking only the text information is insufficient and necessitates a further rethinking process specifically targeting visual information; (2) purely text-based reflection mechanisms lack cross-modal interaction capabilities, preventing them from fully integrating visual information during reflection. Motivated by these insights, we propose REVISOR (REflective VIsual Segment Oriented Reasoning), a novel framework for tool-augmented multimodal reflection. REVISOR enables MLLMs to collaboratively construct introspective reflection processes across textual and visual modalities, significantly enhancing their reasoning capability for long-form video understanding. To ensure that REVISOR can learn to accurately review video segments highly relevant to the question during reinforcement learning, we designed the Dual Attribution Decoupled Reward (DADR) mechanism. Integrated into the GRPO training strategy, this mechanism enforces causal alignment between the model’s reasoning and the selected video evidence. Notably, the REVISOR framework significantly enhances long-form video understanding capability of MLLMs without requiring supplementary supervised fine-tuning or external models, achieving impressive results on four benchmarks including VideoMME, LongVideoBench, MLVU, and LVBench.

[374] Towards 3D Object-Centric Feature Learning for Semantic Scene Completion

Weihua Wang, Yubo Cui, Xiangru Lin, Zhiheng Li, Zheng Fang

Main category: cs.CV

TL;DR: Ocean is an object-centric 3D semantic scene completion framework that decomposes scenes into individual object instances using MobileSAM for instance segmentation, then uses specialized attention and diffusion modules to improve semantic occupancy prediction.

DetailsMotivation: Existing vision-based 3D semantic scene completion approaches overlook fine-grained object-level details, leading to semantic and geometric ambiguities in complex environments like autonomous driving scenarios.

Method: 1) Use MobileSAM for instance mask extraction from input images; 2) 3D Semantic Group Attention module with linear attention for object-centric feature aggregation; 3) Global Similarity-Guided Attention module to handle segmentation errors; 4) Instance-aware Local Diffusion module for feature refinement in BEV space.

Result: Achieves state-of-the-art performance on SemanticKITTI (17.40 mIoU) and SSCBench-KITTI360 (20.28 mIoU) benchmarks.

Conclusion: Object-centric decomposition enables more accurate semantic occupancy prediction in complex 3D scenes, demonstrating superior performance over ego-centric approaches.

Abstract: Vision-based 3D Semantic Scene Completion (SSC) has received growing attention due to its potential in autonomous driving. While most existing approaches follow an ego-centric paradigm by aggregating and diffusing features over the entire scene, they often overlook fine-grained object-level details, leading to semantic and geometric ambiguities, especially in complex environments. To address this limitation, we propose Ocean, an object-centric prediction framework that decomposes the scene into individual object instances to enable more accurate semantic occupancy prediction. Specifically, we first employ a lightweight segmentation model, MobileSAM, to extract instance masks from the input image. Then, we introduce a 3D Semantic Group Attention module that leverages linear attention to aggregate object-centric features in 3D space. To handle segmentation errors and missing instances, we further design a Global Similarity-Guided Attention module that leverages segmentation features for global interaction. Finally, we propose an Instance-aware Local Diffusion module that improves instance features through a generative process and subsequently refines the scene representation in the BEV space. Extensive experiments on the SemanticKITTI and SSCBench-KITTI360 benchmarks demonstrate that Ocean achieves state-of-the-art performance, with mIoU scores of 17.40 and 20.28, respectively.

[375] Shedding Light on VLN Robustness: A Black-box Framework for Indoor Lighting-based Adversarial Attack

Chenyang Li, Wenbing Tang, Yihao Huang, Sinong Simon Zhan, Ming Hu, Xiaojun Jia, Yang Liu

Main category: cs.CV

TL;DR: ILA is a black-box adversarial attack framework that manipulates indoor lighting to disrupt Vision-and-Language Navigation agents, revealing their vulnerability to realistic lighting variations.

DetailsMotivation: Existing adversarial evaluations use unrealistic perturbations like unusual textures, while real-world agents are more likely to encounter natural lighting variations in indoor environments.

Method: Proposes two attack modes: SILA (static constant lighting intensity) and DILA (dynamic lighting changes at critical moments), both manipulating global illumination in a black-box manner.

Result: ILA significantly increases failure rates and reduces trajectory efficiency across two state-of-the-art VLN models and three navigation tasks.

Conclusion: VLN agents have previously unrecognized vulnerabilities to realistic indoor lighting variations, highlighting the need for more robust navigation systems.

Abstract: Vision-and-Language Navigation (VLN) agents have made remarkable progress, but their robustness remains insufficiently studied. Existing adversarial evaluations often rely on perturbations that manifest as unusual textures rarely encountered in everyday indoor environments. Errors under such contrived conditions have limited practical relevance, as real-world agents are unlikely to encounter such artificial patterns. In this work, we focus on indoor lighting, an intrinsic yet largely overlooked scene attribute that strongly influences navigation. We propose Indoor Lighting-based Adversarial Attack (ILA), a black-box framework that manipulates global illumination to disrupt VLN agents. Motivated by typical household lighting usage, we design two attack modes: Static Indoor Lighting-based Attack (SILA), where the lighting intensity remains constant throughout an episode, and Dynamic Indoor Lighting-based Attack (DILA), where lights are switched on or off at critical moments to induce abrupt illumination changes. We evaluate ILA on two state-of-the-art VLN models across three navigation tasks. Results show that ILA significantly increases failure rates while reducing trajectory efficiency, revealing previously unrecognized vulnerabilities of VLN agents to realistic indoor lighting variations.

[376] Uni-Inter: Unifying 3D Human Motion Synthesis Across Diverse Interaction Contexts

Sheng Liu, Yuanzhi Liang, Jiepeng Wang, Sidan Du, Chi Zhang, Xuelong Li

Main category: cs.CV

TL;DR: Uni-Inter is a unified framework for human motion generation that handles human-human, human-object, and human-scene interactions using a single architecture with Unified Interactive Volume representation.

DetailsMotivation: Existing methods rely on task-specific designs with limited generalization, creating a need for a unified approach that can handle diverse interaction scenarios.

Method: Introduces Unified Interactive Volume (UIV) - a volumetric representation that encodes heterogeneous interactive entities into a shared spatial field, enabling consistent relational reasoning and joint-wise probabilistic prediction.

Result: Achieves competitive performance across three interaction tasks and demonstrates strong generalization to novel entity combinations.

Conclusion: Unified modeling of compound interactions offers a promising direction for scalable motion synthesis in complex environments.

Abstract: We present Uni-Inter, a unified framework for human motion generation that supports a wide range of interaction scenarios: including human-human, human-object, and human-scene-within a single, task-agnostic architecture. In contrast to existing methods that rely on task-specific designs and exhibit limited generalization, Uni-Inter introduces the Unified Interactive Volume (UIV), a volumetric representation that encodes heterogeneous interactive entities into a shared spatial field. This enables consistent relational reasoning and compound interaction modeling. Motion generation is formulated as joint-wise probabilistic prediction over the UIV, allowing the model to capture fine-grained spatial dependencies and produce coherent, context-aware behaviors. Experiments across three representative interaction tasks demonstrate that Uni-Inter achieves competitive performance and generalizes well to novel combinations of entities. These results suggest that unified modeling of compound interactions offers a promising direction for scalable motion synthesis in complex environments.

[377] uCLIP: Parameter-Efficient Multilingual Extension of Vision-Language Models with Unpaired Data

Dahyun Chung, Donghyun Shin, Yujin Sung, Seunggi Moon, Jinwoo Jeon, Byung-Jun Lee

Main category: cs.CV

TL;DR: A lightweight framework for multilingual vision-language alignment that trains only a small projection module using English as semantic anchors, achieving significant improvements in underrepresented languages without requiring multilingual image-text pairs.

DetailsMotivation: CLIP models perform well in English but struggle with low-resource languages due to scarce multilingual image-text data. Existing multilingual vision-language models show poor retrieval performance in underrepresented languages like Czech, Finnish, Croatian, Hungarian, and Romanian.

Method: Freeze both pretrained image encoder and multilingual text encoder, train only a compact 1.7M-parameter projection module using contrastive loss over English representations as semantic anchors. No image-text pairs or text-text pairs required.

Result: Significant gains in retrieval performance for five underrepresented languages (Czech, Finnish, Croatian, Hungarian, Romanian) on Crossmodal-3600 benchmark. Robust multilingual alignment achieved even with limited supervision.

Conclusion: The pivot-based, parameter-efficient alignment strategy effectively enables inclusive multimodal learning for underrepresented languages, demonstrating that minimal training with English anchors can bridge the multilingual gap in vision-language models.

Abstract: Contrastive Language-Image Pre-training (CLIP) has demonstrated strong generalization across a wide range of visual tasks by leveraging large-scale English-image pairs. However, its extension to low-resource languages remains limited due to the scarcity of high-quality multilingual image-text data. Existing multilingual vision-language models exhibit consistently low retrieval performance in underrepresented languages including Czech, Finnish, Croatian, Hungarian, and Romanian on the Crossmodal-3600 (XM3600) benchmark. To address this, we propose a lightweight and data-efficient framework for multilingual vision-language alignment. Our approach requires no image-text pairs or text-text pairs and freezes both the pretrained image encoder and multilingual text encoder during training. Only a compact 1.7M-parameter projection module is trained, using a contrastive loss over English representations as semantic anchors. This minimal training setup enables robust multilingual alignment even for languages with limited supervision. Extensive evaluation across multiple multilingual retrieval benchmarks confirms the effectiveness of our method, showing significant gains in five underrepresented languages where existing models typically underperform. These findings highlight the effectiveness of our pivot-based, parameter-efficient alignment strategy for inclusive multimodal learning.

[378] MGCA-Net: Multi-Grained Category-Aware Network for Open-Vocabulary Temporal Action Localization

Zhenying Fang, Richang Hong

Main category: cs.CV

TL;DR: MGCA-Net is a multi-grained category-aware network that improves open-vocabulary temporal action localization by recognizing actions at multiple granularities, achieving state-of-the-art performance on THUMOS'14 and ActivityNet-1.3 benchmarks.

DetailsMotivation: Existing methods recognize action categories at a single granularity, which degrades recognition accuracy for both base and novel action categories in open-vocabulary settings.

Method: Proposes MGCA-Net with four components: localizer for action proposals, action presence predictor, conventional classifier for base actions at snippet granularity, and coarse-to-fine classifier for novel actions at video and proposal granularities.

Result: Achieves state-of-the-art performance on THUMOS'14 and ActivityNet-1.3 benchmarks, and also achieves state-of-the-art results under Zero-Shot Temporal Action Localization setting.

Conclusion: Multi-grained category awareness effectively enhances localization performance for both base and novel action categories in open-vocabulary temporal action localization.

Abstract: Open-Vocabulary Temporal Action Localization (OV-TAL) aims to recognize and localize instances of any desired action categories in videos without explicitly curating training data for all categories. Existing methods mostly recognize action categories at a single granularity, which degrades the recognition accuracy of both base and novel action categories. To address these issues, we propose a Multi-Grained Category-Aware Network (MGCA-Net) comprising a localizer, an action presence predictor, a conventional classifier, and a coarse-to-fine classifier. Specifically, the localizer localizes category-agnostic action proposals. For these action proposals, the action presence predictor estimates the probability that they belong to an action instance. At the same time, the conventional classifier predicts the probability of each action proposal over base action categories at the snippet granularity. Novel action categories are recognized by the coarse-to-fine classifier, which first identifies action presence at the video granularity. Finally, it assigns each action proposal to one category from the coarse categories at the proposal granularity. Through coarse-to-fine category awareness for novel actions and the conventional classifier’s awareness of base actions, multi-grained category awareness is achieved, effectively enhancing localization performance. Comprehensive evaluations on the THUMOS'14 and ActivityNet-1.3 benchmarks demonstrate that our method achieves state-of-the-art performance. Furthermore, our MGCA-Net achieves state-of-the-art results under the Zero-Shot Temporal Action Localization setting.

[379] Automated Road Distress Detection Using Vision Transformersand Generative Adversarial Networks

Cesar Portocarrero Rodriguez, Laura Vandeweyen, Yosuke Yamamoto

Main category: cs.CV

TL;DR: The paper explores computer vision methods for road distress segmentation using GAN-generated synthetic data and compares CNN vs transformer-based models, finding that synthetic data improves performance and MaskFormer outperforms CNN.

DetailsMotivation: America's infrastructure is graded poorly (C for infrastructure, D for roads), with inefficient manual inspection methods. Real-time visual data from autonomous vehicles presents an opportunity for computer vision-based road monitoring to guide infrastructure rehabilitation.

Method: Evaluated synthetic data generated with GANs for model training, applied CNNs for road distress segmentation, and examined transformer-based MaskFormer model.

Result: GAN-generated data improves model performance. MaskFormer outperforms CNN model in mAP50 and IoU metrics.

Conclusion: Computer vision methods using synthetic data and transformer-based models show promise for efficient road distress monitoring and infrastructure management.

Abstract: The American Society of Civil Engineers has graded Americas infrastructure condition as a C, with the road system receiving a dismal D. Roads are vital to regional economic viability, yet their management, maintenance, and repair processes remain inefficient, relying on outdated manual or laser-based inspection methods that are both costly and time-consuming. With the increasing availability of real-time visual data from autonomous vehicles, there is an opportunity to apply computer vision (CV) methods for advanced road monitoring, providing insights to guide infrastructure rehabilitation efforts. This project explores the use of state-of-the-art CV techniques for road distress segmentation. It begins by evaluating synthetic data generated with Generative Adversarial Networks (GANs) to assess its usefulness for model training. The study then applies Convolutional Neural Networks (CNNs) for road distress segmentation and subsequently examines the transformer-based model MaskFormer. Results show that GAN-generated data improves model performance and that MaskFormer outperforms the CNN model in two metrics: mAP50 and IoU.

[380] DiffPixelFormer: Differential Pixel-Aware Transformer for RGB-D Indoor Scene Segmentation

Yan Gong, Jianli Lu, Yongsheng Gao, Jie Zhao, Xiaojuan Zhang, Susanto Rahardja

Main category: cs.CV

TL;DR: DiffPixelFormer is a Transformer-based model for RGB-D indoor semantic segmentation that enhances intra-modal representations and inter-modal interactions through a novel Differential-Shared Inter-Modal module and dynamic fusion strategy, achieving state-of-the-art performance on SUN RGB-D and NYUDv2 benchmarks.

DetailsMotivation: Existing RGB-D fusion methods for indoor semantic segmentation rely on computationally intensive cross-attention mechanisms and inadequately model intra- and inter-modal feature relationships, leading to imprecise feature alignment and limited discriminative representation.

Method: Proposes DiffPixelFormer with Intra-Inter Modal Interaction Block (IIMIB) that uses self-attention for intra-modal dependencies and Differential-Shared Inter-Modal (DSIM) module to disentangle modality-specific and shared cues. Includes dynamic fusion strategy to balance modality contributions based on scene characteristics.

Result: Achieves mIoU scores of 54.28% on SUN RGB-D and 59.95% on NYUDv2, outperforming DFormer-L by 1.78% and 2.75% respectively.

Conclusion: DiffPixelFormer effectively addresses limitations in existing RGB-D fusion methods by enabling fine-grained pixel-level cross-modal alignment and dynamic modality balancing, demonstrating superior performance for indoor semantic segmentation tasks.

Abstract: Indoor semantic segmentation is fundamental to computer vision and robotics, supporting applications such as autonomous navigation, augmented reality, and smart environments. Although RGB-D fusion leverages complementary appearance and geometric cues, existing methods often depend on computationally intensive cross-attention mechanisms and insufficiently model intra- and inter-modal feature relationships, resulting in imprecise feature alignment and limited discriminative representation. To address these challenges, we propose DiffPixelFormer, a differential pixel-aware Transformer for RGB-D indoor scene segmentation that simultaneously enhances intra-modal representations and models inter-modal interactions. At its core, the Intra-Inter Modal Interaction Block (IIMIB) captures intra-modal long-range dependencies via self-attention and models inter-modal interactions with the Differential-Shared Inter-Modal (DSIM) module to disentangle modality-specific and shared cues, enabling fine-grained, pixel-level cross-modal alignment. Furthermore, a dynamic fusion strategy balances modality contributions and fully exploits RGB-D information according to scene characteristics. Extensive experiments on the SUN RGB-D and NYUDv2 benchmarks demonstrate that DiffPixelFormer-L achieves mIoU scores of 54.28% and 59.95%, outperforming DFormer-L by 1.78% and 2.75%, respectively. Code is available at https://github.com/gongyan1/DiffPixelFormer.

[381] ViSS-R1: Self-Supervised Reinforcement Video Reasoning

Bo Fang, Yuxin Song, Qiangqiang Wu, Haoyuan Sun, Wenhao Wu, Antoni B. Chan

Main category: cs.CV

TL;DR: The paper proposes Pretext-GRPO and ViSS-R1 frameworks to enhance visual-centric video reasoning in MLLMs by integrating self-supervised learning with pretext tasks that force models to process transformed visual inputs.

DetailsMotivation: Current R1-based MLLMs often underutilize rich visual information in video tasks, leading to shortcut learning and increased hallucination. There's a need for more robust visual-centric video understanding.

Method: 1) Pretext-GRPO: Self-supervised reinforcement learning algorithm that rewards correct solving of pretext tasks on transformed visual inputs. 2) ViSS-R1: Framework integrating pretext-task-based self-supervised learning into MLLM’s R1 post-training, forcing models to reason about transformed visual inputs by processing both pretext questions and user queries.

Result: Comprehensive evaluations on six video reasoning benchmarks demonstrate the effectiveness and superiority of Pretext-GRPO and ViSS-R1 for complex video reasoning.

Conclusion: The proposed frameworks successfully address visual underutilization in video reasoning by compelling models to non-trivially process visual information through pretext tasks, leading to more robust video understanding.

Abstract: Complex video reasoning remains a significant challenge for Multimodal Large Language Models (MLLMs), as current R1-based methodologies often prioritize text-centric reasoning derived from text-based and image-based developments. In video tasks, such strategies frequently underutilize rich visual information, leading to potential shortcut learning and increased susceptibility to hallucination. To foster a more robust, visual-centric video understanding, we start by introducing a novel self-supervised reinforcement learning GRPO algorithm (Pretext-GRPO) within the standard R1 pipeline, in which positive rewards are assigned for correctly solving pretext tasks on transformed visual inputs, which makes the model to non-trivially process the visual information. Building on the effectiveness of Pretext-GRPO, we further propose the ViSS-R1 framework, which streamlines and integrates pretext-task-based self-supervised learning directly into the MLLM’s R1 post-training paradigm. Instead of relying solely on sparse visual cues, our framework compels models to reason about transformed visual input by simultaneously processing both pretext questions (concerning transformations) and true user queries. This necessitates identifying the applied transformation and reconstructing the original video to formulate accurate final answers. Comprehensive evaluations on six widely-used video reasoning and understanding benchmarks demonstrate the effectiveness and superiority of our Pretext-GRPO and ViSS-R1 for complex video reasoning. Our codes and models will be publicly available.

[382] SOMA: Feature Gradient Enhanced Affine-Flow Matching for SAR-Optical Registration

Haodong Wang, Tao Zhuo, Xiuwei Zhang, Hanlin Yin, Wencong Wu, Yanning Zhang

Main category: cs.CV

TL;DR: SOMA is a dense registration framework that integrates structural gradient priors into deep features and uses a hybrid matching strategy for SAR-Optical image registration, achieving significant improvements in precision.

DetailsMotivation: SAR and optical images have different imaging mechanisms and visual characteristics, making pixel-level registration challenging. Deep learning hasn't effectively leveraged gradient cues that were crucial in traditional handcrafted descriptors for this task.

Method: Proposes SOMA with two key components: Feature Gradient Enhancer (FGE) that embeds multi-scale, multi-directional gradient filters into feature space using attention and reconstruction, and Global-Local Affine-Flow Matcher (GLAM) that combines affine transformation and flow-based refinement in a coarse-to-fine architecture.

Result: Significantly improves registration precision, increasing CMR@1px by 12.29% on SEN1-2 dataset and 18.50% on GFGE_SO dataset. Shows strong robustness and generalization across diverse scenes and resolutions.

Conclusion: SOMA effectively bridges the gap between traditional gradient-based approaches and deep learning for SAR-Optical registration, demonstrating that integrating structural gradient priors into deep features substantially improves performance.

Abstract: Achieving pixel-level registration between SAR and optical images remains a challenging task due to their fundamentally different imaging mechanisms and visual characteristics. Although deep learning has achieved great success in many cross-modal tasks, its performance on SAR-Optical registration tasks is still unsatisfactory. Gradient-based information has traditionally played a crucial role in handcrafted descriptors by highlighting structural differences. However, such gradient cues have not been effectively leveraged in deep learning frameworks for SAR-Optical image matching. To address this gap, we propose SOMA, a dense registration framework that integrates structural gradient priors into deep features and refines alignment through a hybrid matching strategy. Specifically, we introduce the Feature Gradient Enhancer (FGE), which embeds multi-scale, multi-directional gradient filters into the feature space using attention and reconstruction mechanisms to boost feature distinctiveness. Furthermore, we propose the Global-Local Affine-Flow Matcher (GLAM), which combines affine transformation and flow-based refinement within a coarse-to-fine architecture to ensure both structural consistency and local accuracy. Experimental results demonstrate that SOMA significantly improves registration precision, increasing the CMR@1px by 12.29% on the SEN1-2 dataset and 18.50% on the GFGE_SO dataset. In addition, SOMA exhibits strong robustness and generalizes well across diverse scenes and resolutions.

[383] Monocular 3D Lane Detection via Structure Uncertainty-Aware Network with Curve-Point Queries

Ruixin Liu, Zejian Yuan

Main category: cs.CV

TL;DR: MonoUnc is a BEV-free monocular 3D lane detector that explicitly models aleatoric uncertainty using local lane structures and 3D Gaussian distributions, achieving state-of-the-art performance on major benchmarks.

DetailsMotivation: Existing monocular 3D lane detection methods fail to capture structural variations and aleatoric uncertainty from inherent observation noise, relying on simplified geometric assumptions like independent point predictions or global planar modeling.

Method: Projects 3D lanes to front-view space as parametric curves, generates curve-point query embeddings for 3D predictions, models each lane segment as a 3D Gaussian parameterized by local structure and uncertainty, and uses a novel 3D Gaussian matching loss.

Result: Outperforms previous state-of-the-art methods across all benchmarks on ONCE-3DLanes and OpenLane datasets under stricter evaluation criteria, with proposed comprehensive metrics for global and local error quantification.

Conclusion: MonoUnc effectively addresses aleatoric uncertainty in monocular 3D lane detection through local structural modeling and uncertainty estimation, demonstrating superior performance and robustness in real-world scenarios.

Abstract: Monocular 3D lane detection is challenged by aleatoric uncertainty arising from inherent observation noise. Existing methods rely on simplified geometric assumptions, such as independent point predictions or global planar modeling, failing to capture structural variations and aleatoric uncertainty in real-world scenarios. In this paper, we propose MonoUnc, a bird’s-eye view (BEV)-free 3D lane detector that explicitly models aleatoric uncertainty informed by local lane structures. Specifically, 3D lanes are projected onto the front-view (FV) space and approximated by parametric curves. Guided by curve predictions, curve-point query embeddings are dynamically generated for lane point predictions in 3D space. Each segment formed by two adjacent points is modeled as a 3D Gaussian, parameterized by the local structure and uncertainty estimations. Accordingly, a novel 3D Gaussian matching loss is designed to constrain these parameters jointly. Experiments on the ONCE-3DLanes and OpenLane datasets demonstrate that MonoUnc outperforms previous state-of-the-art (SoTA) methods across all benchmarks under stricter evaluation criteria. Additionally, we propose two comprehensive evaluation metrics for ONCE-3DLanes, calculating the average and maximum bidirectional Chamfer distances to quantify global and local errors. Codes are released at https://github.com/lrx02/MonoUnc.

[384] FGNet: Leveraging Feature-Guided Attention to Refine SAM2 for 3D EM Neuron Segmentation

Zhenghua Li, Hang Chen, Zihao Sun, Kai Li, Xiaolin Hu

Main category: cs.CV

TL;DR: A novel framework that transfers knowledge from Segment Anything 2 (SAM2) pre-trained on natural images to electron microscopy domain for neural structure segmentation, achieving state-of-the-art performance.

DetailsMotivation: To address challenges in neural structure segmentation from EM images including intricate morphologies, low signal-to-noise ratios, and scarce annotations by leveraging priors from visual foundation models trained on natural images.

Method: Proposes a framework that extracts features from SAM2, uses Feature-Guided Attention module to guide a lightweight Fine-Grained Encoder with semantic cues, and employs a dual-affinity decoder generating coarse and refined affinity maps.

Result: Achieves performance comparable to SOTA with frozen SAM2 weights, and significantly outperforms existing SOTA methods after fine-tuning on EM data.

Conclusion: Transferring representations pre-trained on natural images combined with domain-adaptive guidance effectively addresses specific challenges in neuron segmentation.

Abstract: Accurate segmentation of neural structures in Electron Microscopy (EM) images is paramount for neuroscience. However, this task is challenged by intricate morphologies, low signal-to-noise ratios, and scarce annotations, limiting the accuracy and generalization of existing methods. To address these challenges, we seek to leverage the priors learned by visual foundation models on a vast amount of natural images to better tackle this task. Specifically, we propose a novel framework that can effectively transfer knowledge from Segment Anything 2 (SAM2), which is pre-trained on natural images, to the EM domain. We first use SAM2 to extract powerful, general-purpose features. To bridge the domain gap, we introduce a Feature-Guided Attention module that leverages semantic cues from SAM2 to guide a lightweight encoder, the Fine-Grained Encoder (FGE), in focusing on these challenging regions. Finally, a dual-affinity decoder generates both coarse and refined affinity maps. Experimental results demonstrate that our method achieves performance comparable to state-of-the-art (SOTA) approaches with the SAM2 weights frozen. Upon further fine-tuning on EM data, our method significantly outperforms existing SOTA methods. This study validates that transferring representations pre-trained on natural images, when combined with targeted domain-adaptive guidance, can effectively address the specific challenges in neuron segmentation.

[385] RobustGait: Robustness Analysis for Appearance Based Gait Recognition

Reeshoon Sayera, Akash Kumar, Sirshapan Mitra, Prudvi Kamtam, Yogesh S Rawat

Main category: cs.CV

TL;DR: RobustGait is a framework for evaluating the robustness of appearance-based gait recognition systems against real-world corruptions and silhouette variability across multiple dimensions including perturbation types, silhouette extraction methods, and model architectures.

DetailsMotivation: Current appearance-based gait recognition systems achieve strong performance on controlled datasets but lack systematic evaluation of their robustness to real-world corruptions and silhouette variability, which is crucial for deployment-ready systems.

Method: Developed RobustGait framework with four-dimensional evaluation: perturbation types (digital, environmental, temporal, occlusion), silhouette extraction methods (segmentation/parsing networks), model architectural capacities, and deployment scenarios. Evaluated 15 corruption types at 5 severity levels across multiple datasets including CASIA-B, CCPG, SUSTech1K, and MEVID.

Result: Key findings: 1) RGB-level noise better reflects real-world degradation and shows distortion propagation through silhouette extraction; 2) Gait accuracy is highly sensitive to silhouette extractor biases; 3) Robustness depends on both perturbation type and architectural design; 4) Noise-aware training and knowledge distillation improve performance.

Conclusion: RobustGait provides comprehensive robustness evaluation revealing critical vulnerabilities in gait recognition systems and demonstrates that robustness-enhancing strategies like noise-aware training and knowledge distillation can improve performance toward deployment-ready systems.

Abstract: Appearance-based gait recognition have achieved strong performance on controlled datasets, yet systematic evaluation of its robustness to real-world corruptions and silhouette variability remains lacking. We present RobustGait, a framework for fine-grained robustness evaluation of appearance-based gait recognition systems. RobustGait evaluation spans four dimensions: the type of perturbation (digital, environmental, temporal, occlusion), the silhouette extraction method (segmentation and parsing networks), the architectural capacities of gait recognition models, and various deployment scenarios. The benchmark introduces 15 corruption types at 5 severity levels across CASIA-B, CCPG, and SUSTech1K, with in-the-wild validation on MEVID, and evaluates six state-of-the-art gait systems. We came across several exciting insights. First, applying noise at the RGB level better reflects real-world degradation, and reveal how distortions propagate through silhouette extraction to the downstream gait recognition systems. Second, gait accuracy is highly sensitive to silhouette extractor biases, revealing an overlooked source of benchmark bias. Third, robustness is dependent on both the type of perturbation and the architectural design. Finally, we explore robustness-enhancing strategies, showing that noise-aware training and knowledge distillation improve performance and move toward deployment-ready systems.

[386] Decoupling Scene Perception and Ego Status: A Multi-Context Fusion Approach for Enhanced Generalization in End-to-End Autonomous Driving

Jiacheng Tang, Mingyue Feng, Jiachao Liu, Yaonong Wang, Jian Pu

Main category: cs.CV

TL;DR: AdaptiveAD is a novel autonomous driving architecture that decouples scene perception from ego status using a dual-branch structure to prevent over-reliance on ego information, achieving state-of-the-art planning performance.

DetailsMotivation: Existing modular autonomous driving systems over-rely on ego status as a shortcut, which hinders generalization and robust scene understanding. The premature fusion of ego status in BEV encoders dominates downstream planning.

Method: Proposes a dual-branch architecture: one branch performs scene-driven reasoning without ego status in BEV encoder, another conducts ego-driven planning. Uses scene-aware fusion module to integrate decisions, with path attention mechanism and auxiliary tasks (BEV unidirectional distillation, autoregressive online mapping).

Result: Achieves state-of-the-art open-loop planning performance on nuScenes dataset, significantly mitigates over-reliance on ego status, and exhibits impressive generalization across diverse scenarios.

Conclusion: AdaptiveAD effectively addresses the ego status over-reliance problem through architectural decoupling and adaptive fusion, demonstrating superior generalization and robust planning capabilities.

Abstract: Modular design of planning-oriented autonomous driving has markedly advanced end-to-end systems. However, existing architectures remain constrained by an over-reliance on ego status, hindering generalization and robust scene understanding. We identify the root cause as an inherent design within these architectures that allows ego status to be easily leveraged as a shortcut. Specifically, the premature fusion of ego status in the upstream BEV encoder allows an information flow from this strong prior to dominate the downstream planning module. To address this challenge, we propose AdaptiveAD, an architectural-level solution based on a multi-context fusion strategy. Its core is a dual-branch structure that explicitly decouples scene perception and ego status. One branch performs scene-driven reasoning based on multi-task learning, but with ego status deliberately omitted from the BEV encoder, while the other conducts ego-driven reasoning based solely on the planning task. A scene-aware fusion module then adaptively integrates the complementary decisions from the two branches to form the final planning trajectory. To ensure this decoupling does not compromise multi-task learning, we introduce a path attention mechanism for ego-BEV interaction and add two targeted auxiliary tasks: BEV unidirectional distillation and autoregressive online mapping. Extensive evaluations on the nuScenes dataset demonstrate that AdaptiveAD achieves state-of-the-art open-loop planning performance. Crucially, it significantly mitigates the over-reliance on ego status and exhibits impressive generalization capabilities across diverse scenarios.

[387] MergeSlide: Continual Model Merging and Task-to-Class Prompt-Aligned Inference for Lifelong Learning on Whole Slide Images

Doanh C. Bui, Ba Hung Ngo, Hoai Luan Pham, Khang Nguyen, Maï K. Nguyen, Yasuhiko Nakashima

Main category: cs.CV

TL;DR: MergeSlide treats lifelong learning on whole slide images as a model merging problem using vision-language pathology foundation models, employing orthogonal continual merging and task-to-class prompt-aligned inference to prevent catastrophic forgetting.

DetailsMotivation: To reduce resource requirements for training on gigabyte-scale whole slide images by enabling sequential learning on cancer-related tasks without catastrophic forgetting, eliminating the need for repeated data transfer and processing.

Method: Uses class-aware prompts to define new tasks, fine-tunes with MLP-free backbone for few epochs, applies orthogonal continual merging strategy, and introduces Task-to-Class Prompt-aligned (TCP) inference for class-incremental learning where task identity is unknown.

Result: Outperforms both rehearsal-based continual learning and vision-language zero-shot baselines on experiments with six TCGA datasets.

Conclusion: MergeSlide provides an effective framework for lifelong learning on WSIs by treating it as a model merging problem, demonstrating superior performance while mitigating catastrophic forgetting through orthogonal merging and prompt-aligned inference.

Abstract: Lifelong learning on Whole Slide Images (WSIs) aims to train or fine-tune a unified model sequentially on cancer-related tasks, reducing the resources and effort required for data transfer and processing, especially given the gigabyte-scale size of WSIs. In this paper, we introduce MergeSlide, a simple yet effective framework that treats lifelong learning as a model merging problem by leveraging a vision-language pathology foundation model. When a new task arrives, it is: 1) defined with class-aware prompts, 2) fine-tuned for a few epochs using an MLP-free backbone, and 3) merged into a unified model using an orthogonal continual merging strategy that preserves performance and mitigates catastrophic forgetting. For inference under the class-incremental learning (CLASS-IL) setting, where task identity is unknown, we introduce Task-to-Class Prompt-aligned (TCP) inference. Specifically, TCP first identifies the most relevant task using task-level prompts and then applies the corresponding class-aware prompts to generate predictions. To evaluate MergeSlide, we conduct experiments on a stream of six TCGA datasets. The results show that MergeSlide outperforms both rehearsal-based continual learning and vision-language zero-shot baselines. Code and data are available at https://github.com/caodoanh2001/MergeSlide.

[388] CapeNext: Rethinking and refining dynamic support information for category-agnostic pose estimation

Yu Zhu, Dan Zeng, Shuiwang Li, Qijun Zhao, Qiaomu Shen, Bo Tang

Main category: cs.CV

TL;DR: CapeNext overcomes limitations in Category-Agnostic Pose Estimation by integrating hierarchical cross-modal interaction and dual-stream feature refinement, achieving state-of-the-art performance on MP-100 dataset.

DetailsMotivation: Fixed textual keypoint descriptions in CAPE suffer from polysemy-induced cross-category ambiguity and insufficient discriminability for fine-grained intra-category variations.

Method: Proposes a framework with hierarchical cross-modal interaction and dual-stream feature refinement to enhance joint embeddings with class-level and instance-specific cues from both text and images.

Result: CapeNext consistently outperforms state-of-the-art CAPE methods by a large margin on MP-100 dataset, regardless of network backbone.

Conclusion: The proposed hierarchical cross-modal interaction and dual-stream refinement effectively address the limitations of static joint embeddings in CAPE, leading to significant performance improvements.

Abstract: Recent research in Category-Agnostic Pose Estimation (CAPE) has adopted fixed textual keypoint description as semantic prior for two-stage pose matching frameworks. While this paradigm enhances robustness and flexibility by disentangling the dependency of support images, our critical analysis reveals two inherent limitations of static joint embedding: (1) polysemy-induced cross-category ambiguity during the matching process(e.g., the concept “leg” exhibiting divergent visual manifestations across humans and furniture), and (2) insufficient discriminability for fine-grained intra-category variations (e.g., posture and fur discrepancies between a sleeping white cat and a standing black cat). To overcome these challenges, we propose a new framework that innovatively integrates hierarchical cross-modal interaction with dual-stream feature refinement, enhancing the joint embedding with both class-level and instance-specific cues from textual description and specific images. Experiments on the MP-100 dataset demonstrate that, regardless of the network backbone, CapeNext consistently outperforms state-of-the-art CAPE methods by a large margin.

[389] GeoX-Bench: Benchmarking Cross-View Geo-Localization and Pose Estimation Capabilities of Large Multimodal Models

Yushuo Zheng, Jiangyong Ying, Huiyu Duan, Chunyi Li, Zicheng Zhang, Jing Liu, Xiaohong Liu, Guangtao Zhai

Main category: cs.CV

TL;DR: GeoX-Bench is a comprehensive benchmark for evaluating large multimodal models on cross-view geo-localization and pose estimation tasks using 10,859 panoramic-satellite image pairs and 755,976 QA pairs.

DetailsMotivation: To explore and evaluate LMMs' capabilities in cross-view geo-localization and pose estimation domains, which remain unexplored despite potential benefits for navigation, autonomous driving, and outdoor robotics.

Method: Created GeoX-Bench with 10,859 panoramic-satellite image pairs from 128 cities in 49 countries and 755,976 QA pairs, then evaluated 25 state-of-the-art LMMs and explored instruction-tuning capabilities.

Result: Current LMMs achieve impressive performance in geo-localization tasks but decline significantly on more complex pose estimation tasks. Instruction-tuning on GeoX-Bench data significantly improves cross-view geo-sense abilities.

Conclusion: GeoX-Bench reveals critical gaps in LMMs’ pose estimation capabilities and demonstrates that instruction-tuning can substantially enhance cross-view geo-localization performance, highlighting areas for future improvement.

Abstract: Large multimodal models (LMMs) have demonstrated remarkable capabilities across a wide range of tasks, however their knowledge and abilities in the cross-view geo-localization and pose estimation domains remain unexplored, despite potential benefits for navigation, autonomous driving, outdoor robotics, \textit{etc}. To bridge this gap, we introduce \textbf{GeoX-Bench}, a comprehensive \underline{Bench}mark designed to explore and evaluate the capabilities of LMMs in \underline{cross}-view \underline{Geo}-localization and pose estimation. Specifically, GeoX-Bench contains 10,859 panoramic-satellite image pairs spanning 128 cities in 49 countries, along with corresponding 755,976 question-answering (QA) pairs. Among these, 42,900 QA pairs are designated for benchmarking, while the remaining are intended to enhance the capabilities of LMMs. Based on GeoX-Bench, we evaluate the capabilities of 25 state-of-the-art LMMs on cross-view geo-localization and pose estimation tasks, and further explore the empowered capabilities of instruction-tuning. Our benchmark demonstrate that while current LMMs achieve impressive performance in geo-localization tasks, their effectiveness declines significantly on the more complex pose estimation tasks, highlighting a critical area for future improvement, and instruction-tuning LMMs on the training data of GeoX-Bench can significantly improve the cross-view geo-sense abilities. The GeoX-Bench is available at \textcolor{magenta}{https://github.com/IntMeGroup/GeoX-Bench}.

[390] PlugTrack: Multi-Perceptive Motion Analysis for Adaptive Fusion in Multi-Object Tracking

Seungjae Kim, SeungJoon Lee, MyeongAh Cho

Main category: cs.CV

TL;DR: PlugTrack adaptively fuses Kalman filters and data-driven motion predictors through multi-perceptive motion understanding, achieving state-of-the-art performance in multi-object tracking without modifying existing predictors.

DetailsMotivation: Real-world tracking scenarios involve both linear and non-linear motion patterns, but existing methods either use computationally efficient Kalman filters that fail on non-linear motion or data-driven predictors that suffer from limited generalization and computational overhead.

Method: Proposes PlugTrack framework that uses multi-perceptive motion analysis to generate adaptive blending factors for fusing Kalman filter and data-driven motion predictors, leveraging their complementary strengths.

Result: Achieves significant performance gains on MOT17/MOT20 datasets and state-of-the-art performance on DanceTrack dataset, with Kalman filter outperforming data-driven predictors in up to 34% of cases even in non-linear motion scenarios.

Conclusion: PlugTrack successfully bridges classical and modern motion prediction paradigms through adaptive fusion, demonstrating that combining both approaches is more effective than using either alone in multi-object tracking.

Abstract: Multi-object tracking (MOT) predominantly follows the tracking-by-detection paradigm, where Kalman filters serve as the standard motion predictor due to computational efficiency but inherently fail on non-linear motion patterns. Conversely, recent data-driven motion predictors capture complex non-linear dynamics but suffer from limited domain generalization and computational overhead. Through extensive analysis, we reveal that even in datasets dominated by non-linear motion, Kalman filter outperforms data-driven predictors in up to 34% of cases, demonstrating that real-world tracking scenarios inherently involve both linear and non-linear patterns. To leverage this complementarity, we propose PlugTrack, a novel framework that adaptively fuses Kalman filter and data-driven motion predictors through multi-perceptive motion understanding. Our approach employs multi-perceptive motion analysis to generate adaptive blending factors. PlugTrack achieves significant performance gains on MOT17/MOT20 and state-of-the-art on DanceTrack without modifying existing motion predictors. To the best of our knowledge, PlugTrack is the first framework to bridge classical and modern motion prediction paradigms through adaptive fusion in MOT.

[391] Low-Level Dataset Distillation for Medical Image Enhancement

Fengzhi Xu, Ziyuan Yang, Mengyu Sun, Joey Tianyi Zhou, Yi Zhang

Main category: cs.CV

TL;DR: Proposes the first low-level dataset distillation method for medical image enhancement that addresses the many-to-many mapping challenge by leveraging anatomical priors and preserving pixel-level fidelity while ensuring patient privacy.

DetailsMotivation: Existing dataset distillation methods mainly target high-level tasks with many-to-one mappings, but low-level medical image enhancement requires pixel-level fidelity and faces underdetermined problems with many-to-many mappings. Current methods have high training/storage costs and privacy concerns.

Method: Uses shared anatomical prior from representative patients as initialization, then personalizes with Structure-Preserving Personalized Generation (SPG) module. Constructs task-specific training pairs and aligns gradients between distilled data and raw patient data while preserving privacy.

Result: Develops a method that enables effective low-level dataset distillation for medical image enhancement, maintaining pixel-level fidelity while compressing dataset size and protecting patient privacy.

Conclusion: The proposed approach successfully addresses the challenges of low-level dataset distillation in medical imaging by leveraging anatomical priors and gradient alignment, enabling practical deployment with reduced costs and preserved privacy.

Abstract: Medical image enhancement is clinically valuable, but existing methods require large-scale datasets to learn complex pixel-level mappings. However, the substantial training and storage costs associated with these datasets hinder their practical deployment. While dataset distillation (DD) can alleviate these burdens, existing methods mainly target high-level tasks, where multiple samples share the same label. This many-to-one mapping allows distilled data to capture shared semantics and achieve information compression. In contrast, low-level tasks involve a many-to-many mapping that requires pixel-level fidelity, making low-level DD an underdetermined problem, as a small distilled dataset cannot fully constrain the dense pixel-level mappings. To address this, we propose the first low-level DD method for medical image enhancement. We first leverage anatomical similarities across patients to construct the shared anatomical prior based on a representative patient, which serves as the initialization for the distilled data of different patients. This prior is then personalized for each patient using a Structure-Preserving Personalized Generation (SPG) module, which integrates patient-specific anatomical information into the distilled dataset while preserving pixel-level fidelity. For different low-level tasks, the distilled data is used to construct task-specific high- and low-quality training pairs. Patient-specific knowledge is injected into the distilled data by aligning the gradients computed from networks trained on the distilled pairs with those from the corresponding patient’s raw data. Notably, downstream users cannot access raw patient data. Instead, only a distilled dataset containing abstract training information is shared, which excludes patient-specific details and thus preserves privacy.

[392] DGS-Net: Distillation-Guided Gradient Surgery for CLIP Fine-Tuning in AI-Generated Image Detection

Jiazhen Yan, Ziqiang Li, Fan Wang, Boyu Wang, Zhangjie Fu

Main category: cs.CV

TL;DR: DGS-Net is a novel framework that preserves transferable pre-trained priors while suppressing task-irrelevant components for AI-generated image detection, using gradient-space decomposition to separate harmful and beneficial optimization directions.

DetailsMotivation: Address the issue of catastrophic forgetting in fine-tuning large multimodal models like CLIP for synthetic content detection, which degrades pre-trained priors and limits cross-domain generalization.

Method: Propose Distillation-guided Gradient Surgery Network (DGS-Net) with gradient-space decomposition that separates harmful and beneficial descent directions, projecting task gradients onto orthogonal complement of harmful directions and aligning with beneficial ones distilled from frozen CLIP encoder.

Result: Outperforms state-of-the-art approaches by average margin of 6.6 across 50 generative models, achieving superior detection performance and generalization across diverse generation techniques.

Conclusion: DGS-Net effectively addresses catastrophic forgetting in fine-tuning pre-trained models for AI-generated image detection, enabling unified optimization of prior preservation and irrelevant suppression for improved cross-domain generalization.

Abstract: The rapid progress of generative models such as GANs and diffusion models has led to the widespread proliferation of AI-generated images, raising concerns about misinformation, privacy violations, and trust erosion in digital media. Although large-scale multimodal models like CLIP offer strong transferable representations for detecting synthetic content, fine-tuning them often induces catastrophic forgetting, which degrades pre-trained priors and limits cross-domain generalization. To address this issue, we propose the Distillation-guided Gradient Surgery Network (DGS-Net), a novel framework that preserves transferable pre-trained priors while suppressing task-irrelevant components. Specifically, we introduce a gradient-space decomposition that separates harmful and beneficial descent directions during optimization. By projecting task gradients onto the orthogonal complement of harmful directions and aligning with beneficial ones distilled from a frozen CLIP encoder, DGS-Net achieves unified optimization of prior preservation and irrelevant suppression. Extensive experiments on 50 generative models demonstrate that our method outperforms state-of-the-art approaches by an average margin of 6.6, achieving superior detection performance and generalization across diverse generation techniques.

[393] Computer Vision based group activity detection and action spotting

Narthana Sivalingam, Santhirarajah Sivasthigan, Thamayanthi Mahendranathan, G. M. R. I. Godaliyadda, M. P. B. Ekanayake, H. M. V. R. Herath

Main category: cs.CV

TL;DR: A framework combining Mask R-CNN for actor localization, multiple backbone networks for feature extraction, and Actor Relation Graphs with GCNs for modeling interactions, achieving improved group activity recognition performance.

DetailsMotivation: Group activity detection is challenging due to complex human interactions, occlusions, and appearance variations over time in multi-person scenes.

Method: Uses Mask R-CNN for actor localization with bounding boxes and masks, multiple backbones (Inception V3, MobileNet, VGG16) for feature extraction, RoIAlign for spatial alignment, mask fusion for refined features, Actor Relation Graphs with similarity measures (NCC, SAD, dot product), and Graph Convolutional Networks for relationship reasoning.

Result: Experiments on Collective Activity dataset show improved recognition performance in both crowded and non-crowded scenarios through mask-based feature refinement, robust similarity search, and graph neural network reasoning.

Conclusion: Integration of segmentation, feature extraction, and relational graph reasoning shows strong potential for complex video understanding tasks.

Abstract: Group activity detection in multi-person scenes is challenging due to complex human interactions, occlusions, and variations in appearance over time. This work presents a computer vision based framework for group activity recognition and action spotting using a combination of deep learning models and graph based relational reasoning. The system first applies Mask R-CNN to obtain accurate actor localization through bounding boxes and instance masks. Multiple backbone networks, including Inception V3, MobileNet, and VGG16, are used to extract feature maps, and RoIAlign is applied to preserve spatial alignment when generating actor specific features. The mask information is then fused with the feature maps to obtain refined masked feature representations for each actor. To model interactions between individuals, we construct Actor Relation Graphs that encode appearance similarity and positional relations using methods such as normalized cross correlation, sum of absolute differences, and dot product. Graph Convolutional Networks operate on these graphs to reason about relationships and predict both individual actions and group level activities. Experiments on the Collective Activity dataset demonstrate that the combination of mask based feature refinement, robust similarity search, and graph neural network reasoning leads to improved recognition performance across both crowded and non crowded scenarios. This approach highlights the potential of integrating segmentation, feature extraction, and relational graph reasoning for complex video understanding tasks.

[394] Learning Implicit Neural Degradation Representation for Unpaired Image Dehazing

Shuaibin Fan, Senming Zhong, Wenchao Yan, Minglong Xue

Main category: cs.CV

TL;DR: Proposes an unsupervised image dehazing method using implicit neural degradation representation that balances local feature learning with global consistency, achieving competitive performance on complex scenes.

DetailsMotivation: Existing dehazing methods struggle to balance fine-grained feature representation of inhomogeneous haze distribution with global consistency modeling in complex scenes.

Method: Uses implicit neural representation to model haze degradation as a continuous function, combining channel-independent and channel-dependent mechanisms inspired by Kolmogorov-Arnold theorem, with dense residual enhancement module.

Result: Achieves competitive dehazing performance on various public and real-world datasets with good visual perception in complex scenes.

Conclusion: The proposed unsupervised method effectively models haze degradation through implicit neural representation, eliminating redundant information and dependence on explicit feature extraction while maintaining high-quality image restoration.

Abstract: Image dehazing is an important task in the field of computer vision, aiming at restoring clear and detail-rich visual content from haze-affected images. However, when dealing with complex scenes, existing methods often struggle to strike a balance between fine-grained feature representation of inhomogeneous haze distribution and global consistency modeling. Furthermore, to better learn the common degenerate representation of haze in spatial variations, we propose an unsupervised dehaze method for implicit neural degradation representation. Firstly, inspired by the Kolmogorov-Arnold representation theorem, we propose a mechanism combining the channel-independent and channel-dependent mechanisms, which efficiently enhances the ability to learn from nonlinear dependencies. which in turn achieves good visual perception in complex scenes. Moreover, we design an implicit neural representation to model haze degradation as a continuous function to eliminate redundant information and the dependence on explicit feature extraction and physical models. To further learn the implicit representation of the haze features, we also designed a dense residual enhancement module from it to eliminate redundant information. This achieves high-quality image restoration. Experimental results show that our method achieves competitive dehaze performance on various public and real-world datasets. This project code will be available at https://github.com/Fan-pixel/NeDR-Dehaze.

[395] Semantics and Content Matter: Towards Multi-Prior Hierarchical Mamba for Image Deraining

Zhaocheng Yu, Kui Jiang, Junjun Jiang, Xianming Liu, Guanglu Sun, Yi Xiao

Main category: cs.CV

TL;DR: MPHM network integrates CLIP and DINOv2 priors with hierarchical Mamba architecture for superior image deraining, achieving state-of-the-art performance with 0.57 dB PSNR gain.

DetailsMotivation: Existing deraining methods struggle with preserving semantic and spatial details, which is critical for applications like autonomous driving and video surveillance.

Method: Multi-Prior Hierarchical Mamba network that combines macro-semantic textual priors (CLIP) and micro-structural visual priors (DINOv2) with progressive Priors Fusion Injection and Fourier-enhanced dual-path hierarchical Mamba modules.

Result: Achieves 0.57 dB PSNR gain on Rain200H dataset and demonstrates superior generalization on real-world rainy scenarios.

Conclusion: MPHM network effectively addresses semantic and structural detail preservation in image deraining through synergistic integration of heterogeneous priors and hierarchical feature modeling.

Abstract: Rain significantly degrades the performance of computer vision systems, particularly in applications like autonomous driving and video surveillance. While existing deraining methods have made considerable progress, they often struggle with fidelity of semantic and spatial details. To address these limitations, we propose the Multi-Prior Hierarchical Mamba (MPHM) network for image deraining. This novel architecture synergistically integrates macro-semantic textual priors (CLIP) for task-level semantic guidance and micro-structural visual priors (DINOv2) for scene-aware structural information. To alleviate potential conflicts between heterogeneous priors, we devise a progressive Priors Fusion Injection (PFI) that strategically injects complementary cues at different decoder levels. Meanwhile, we equip the backbone network with an elaborate Hierarchical Mamba Module (HMM) to facilitate robust feature representation, featuring a Fourier-enhanced dual-path design that concurrently addresses global context modeling and local detail recovery. Comprehensive experiments demonstrate MPHM’s state-of-the-art performance, achieving a 0.57 dB PSNR gain on the Rain200H dataset while delivering superior generalization on real-world rainy scenarios.

[396] A Lightweight 3D Anomaly Detection Method with Rotationally Invariant Features

Hanzhe Liang, Jie Zhou, Can Gao, Bingyang Guo, Jinbao Wang, Linlin Shen

Main category: cs.CV

TL;DR: Proposes Rotationally Invariant Features (RIF) framework for 3D anomaly detection using Point Coordinate Mapping and CTF-Net to handle orientation/position variations in point clouds.

DetailsMotivation: Existing 3D anomaly detection methods struggle with point clouds that have changes in orientation and position, as these variations cause significant feature inconsistencies.

Method: Uses Point Coordinate Mapping (PCM) to map points into rotationally invariant space, then employs lightweight Convolutional Transform Feature Network (CTF-Net) with transfer learning pre-training using 3D data augmentation.

Result: Achieves 17.7% average P-AUROC improvement on Anomaly-ShapeNet and 1.6% improvement on Real3D-AD dataset, with strong generalization when combined with traditional methods.

Conclusion: RIF framework demonstrates robust performance and strong generalization ability for 3D anomaly detection, showing great potential for industrial applications.

Abstract: 3D anomaly detection (AD) is a crucial task in computer vision, aiming to identify anomalous points or regions from point cloud data. However, existing methods may encounter challenges when handling point clouds with changes in orientation and position because the resulting features may vary significantly. To address this problem, we propose a novel Rotationally Invariant Features (RIF) framework for 3D AD. Firstly, to remove the adverse effect of variations on point cloud data, we develop a Point Coordinate Mapping (PCM) technique, which maps each point into a rotationally invariant space to maintain consistency of representation. Then, to learn robust and discriminative features, we design a lightweight Convolutional Transform Feature Network (CTF-Net) to extract rotationally invariant features for the memory bank. To improve the ability of the feature extractor, we introduce the idea of transfer learning to pre-train the feature extractor with 3D data augmentation. Experimental results show that the proposed method achieves the advanced performance on the Anomaly-ShapeNet dataset, with an average P-AUROC improvement of 17.7%, and also gains the best performance on the Real3D-AD dataset, with an average P-AUROC improvement of 1.6%. The strong generalization ability of RIF has been verified by combining it with traditional feature extraction methods on anomaly detection tasks, demonstrating great potential for industrial applications.

[397] Semi-Supervised Multi-Task Learning for Interpretable Quality As- sessment of Fundus Images

Lucas Gabriel Telesco, Danila Nejamkin, Estefanía Mata, Francisco Filizzola, Kevin Wignall, Lucía Franco Troilo, María de los Angeles Cenoz, Melissa Thompson, Mercedes Leguía, Ignacio Larrabide, José Ignacio Orlando

Main category: cs.CV

TL;DR: Hybrid semi-supervised learning approach for retinal image quality assessment that combines manual overall quality labels with pseudo-labels for quality details, improving interpretability without extra manual labeling.

DetailsMotivation: Current RIQA tools only classify overall image quality without indicating specific acquisition defects to guide recapture, mainly due to high annotation costs for detailed quality labels.

Method: Multi-task framework using ResNet-18 backbone that combines manual labels for overall quality with pseudo-labels generated by a Teacher model trained on small dataset, then fine-tunes pre-trained model.

Result: Outperforms single-task baselines (F1: 0.875 vs. 0.863 on EyeQ, 0.778 vs. 0.763 on DeepDRiD), matches/surpasses existing methods, achieves performance comparable to Teacher model for detail prediction, and performs similarly to experts on newly annotated subset.

Conclusion: Semi-supervised approach improves overall quality assessment and provides interpretable feedback on capture conditions (illumination, clarity, contrast) at no extra labeling cost, offering clinically actionable outputs to guide image recapture.

Abstract: Retinal image quality assessment (RIQA) supports computer-aided diagnosis of eye diseases. However, most tools classify only overall image quality, without indicating acquisition defects to guide recapture. This gap is mainly due to the high cost of detailed annotations. In this paper, we aim to mitigate this limitation by introducing a hybrid semi-supervised learning approach that combines manual labels for overall quality with pseudo-labels of quality details within a multi-task framework. Our objective is to obtain more interpretable RIQA models without requiring extensive manual labeling. Pseudo-labels are generated by a Teacher model trained on a small dataset and then used to fine-tune a pre-trained model in a multi-task setting. Using a ResNet-18 backbone, we show that these weak annotations improve quality assessment over single-task baselines (F1: 0.875 vs. 0.863 on EyeQ, and 0.778 vs. 0.763 on DeepDRiD), matching or surpassing existing methods. The multi-task model achieved performance statistically comparable to the Teacher for most detail prediction tasks (p > 0.05). In a newly annotated EyeQ subset released with this paper, our model performed similarly to experts, suggesting that pseudo-label noise aligns with expert variability. Our main finding is that the proposed semi-supervised approach not only improves overall quality assessment but also provides interpretable feedback on capture conditions (illumination, clarity, contrast). This enhances interpretability at no extra manual labeling cost and offers clinically actionable outputs to guide image recapture.

[398] CloseUpShot: Close-up Novel View Synthesis from Sparse-views via Point-conditioned Diffusion Model

Yuqi Zhang, Guanying Chen, Jiaxing Chen, Chuanyu Fu, Chuan Huang, Shuguang Cui

Main category: cs.CV

TL;DR: CloseUpShot is a diffusion-based framework for close-up novel view synthesis from sparse inputs using point-conditioned video diffusion, addressing challenges in fine-grained detail capture through hierarchical warping, occlusion-aware noise suppression, and global structure guidance.

DetailsMotivation: Existing approaches for 3D scene reconstruction and novel view synthesis struggle with fine-grained details in close-up scenarios under sparse input views due to limited information and modest viewpoint variations.

Method: Proposes hierarchical warping and occlusion-aware noise suppression to enhance conditioning image quality, and introduces global structure guidance using dense fused point clouds to provide consistent geometric context to the diffusion process.

Result: Extensive experiments on multiple datasets show that CloseUpShot outperforms existing approaches, particularly in close-up novel view synthesis, validating the effectiveness of the proposed design.

Conclusion: The method successfully addresses the limitations of sparse input views in close-up scenarios through improved conditioning techniques and global geometric constraints, demonstrating superior performance in novel view synthesis.

Abstract: Reconstructing 3D scenes and synthesizing novel views from sparse input views is a highly challenging task. Recent advances in video diffusion models have demonstrated strong temporal reasoning capabilities, making them a promising tool for enhancing reconstruction quality under sparse-view settings. However, existing approaches are primarily designed for modest viewpoint variations, which struggle in capturing fine-grained details in close-up scenarios since input information is severely limited. In this paper, we present a diffusion-based framework, called CloseUpShot, for close-up novel view synthesis from sparse inputs via point-conditioned video diffusion. Specifically, we observe that pixel-warping conditioning suffers from severe sparsity and background leakage in close-up settings. To address this, we propose hierarchical warping and occlusion-aware noise suppression, enhancing the quality and completeness of the conditioning images for the video diffusion model. Furthermore, we introduce global structure guidance, which leverages a dense fused point cloud to provide consistent geometric context to the diffusion process, to compensate for the lack of globally consistent 3D constraints in sparse conditioning inputs. Extensive experiments on multiple datasets demonstrate that our method outperforms existing approaches, especially in close-up novel view synthesis, clearly validating the effectiveness of our design.

[399] VIR-Bench: Evaluating Geospatial and Temporal Understanding of MLLMs via Travel Video Itinerary Reconstruction

Hao Wang, Eiki Murata, Lingfang Zhang, Ayako Sato, So Fukuda, Ziqi Yin, Wentao Hu, Keisuke Nakao, Yusuke Nakamura, Sebastian Zwirner, Yi-Chia Chen, Hiroyuki Otomo, Hiroki Ouchi, Daisuke Kawahara

Main category: cs.CV

TL;DR: VIR-Bench is a new benchmark with 200 travel videos that tests multimodal LLMs’ ability to reconstruct itineraries from long-distance travel videos, revealing current models’ limitations in handling extended geospatial-temporal data.

DetailsMotivation: Current video benchmarks focus on indoor scenes or short outdoor activities, leaving long-distance travel challenges unexplored. Mastering extended geospatial-temporal trajectories is crucial for real-world applications like embodied-AI planning and navigation.

Method: Created VIR-Bench with 200 travel videos and framed itinerary reconstruction as a challenging task to evaluate MLLMs’ geospatial-temporal intelligence.

Result: State-of-the-art MLLMs, including proprietary ones, struggle to achieve high scores, showing difficulty in handling videos spanning extended spatial and temporal scales. A prototype travel-planning agent developed using insights from VIR-Bench showed markedly improved itinerary recommendations.

Conclusion: VIR-Bench effectively benchmarks MLLMs’ geospatial-temporal capabilities and translates into concrete performance gains in practical applications like travel planning.

Abstract: Recent advances in multimodal large language models (MLLMs) have significantly enhanced video understanding capabilities, opening new possibilities for practical applications. Yet current video benchmarks focus largely on indoor scenes or short-range outdoor activities, leaving the challenges associated with long-distance travel largely unexplored. Mastering extended geospatial-temporal trajectories is critical for next-generation MLLMs, underpinning real-world tasks such as embodied-AI planning and navigation. To bridge this gap, we present VIR-Bench, a novel benchmark consisting of 200 travel videos that frames itinerary reconstruction as a challenging task designed to evaluate and push forward MLLMs’ geospatial-temporal intelligence. Experimental results reveal that state-of-the-art MLLMs, including proprietary ones, struggle to achieve high scores, underscoring the difficulty of handling videos that span extended spatial and temporal scales. Moreover, we conduct an in-depth case study in which we develop a prototype travel-planning agent that leverages the insights gained from VIR-Bench. The agent’s markedly improved itinerary recommendations verify that our evaluation protocol not only benchmarks models effectively but also translates into concrete performance gains in user-facing applications.

[400] Region-Point Joint Representation for Effective Trajectory Similarity Learning

Hao Long, Silin Zhou, Lisi Chen, Shuo Shang

Main category: cs.CV

TL;DR: RePo is a novel trajectory similarity computation method that jointly encodes region-wise and point-wise features to capture both spatial context and fine-grained movement patterns, achieving 22.2% accuracy improvement over SOTA methods.

DetailsMotivation: Current learning-based methods for trajectory similarity computation fail to leverage the comprehensive spectrum of trajectory information, particularly the combination of spatial context and fine-grained moving patterns.

Method: RePo maps GPS trajectories to grid sequences for region-wise representation (structural and semantic features) and uses three expert networks for point-wise representation (local, correlation, continuous patterns). A router network fuses point-wise features, which are combined with region-wise features via cross-attention to produce final embeddings. Training uses contrastive loss with hard negative samples.

Result: RePo achieves an average accuracy improvement of 22.2% over state-of-the-art baselines across all evaluation metrics.

Conclusion: The joint encoding of region-wise and point-wise features effectively captures comprehensive trajectory information, significantly improving trajectory similarity computation performance.

Abstract: Recent learning-based methods have reduced the computational complexity of traditional trajectory similarity computation, but state-of-the-art (SOTA) methods still fail to leverage the comprehensive spectrum of trajectory information for similarity modeling. To tackle this problem, we propose \textbf{RePo}, a novel method that jointly encodes \textbf{Re}gion-wise and \textbf{Po}int-wise features to capture both spatial context and fine-grained moving patterns. For region-wise representation, the GPS trajectories are first mapped to grid sequences, and spatial context are captured by structural features and semantic context enriched by visual features. For point-wise representation, three lightweight expert networks extract local, correlation, and continuous movement patterns from dense GPS sequences. Then, a router network adaptively fuses the learned point-wise features, which are subsequently combined with region-wise features using cross-attention to produce the final trajectory embedding. To train RePo, we adopt a contrastive loss with hard negative samples to provide similarity ranking supervision. Experiment results show that RePo achieves an average accuracy improvement of 22.2% over SOTA baselines across all evaluation metrics.

[401] VEIL: Jailbreaking Text-to-Video Models via Visual Exploitation from Implicit Language

Zonghao Ying, Moyang Chen, Nizhang Li, Zhiqiang Wang, Wenxin Zhang, Quanchen Zou, Zonglei Jing, Aishan Liu, Xianglong Liu

Main category: cs.CV

TL;DR: VEIL is a jailbreak framework that uses benign-looking prompts with implicit cues to bypass safety filters in text-to-video models, achieving 23% higher attack success rates than previous methods.

DetailsMotivation: Prior jailbreak attacks on T2V models use obvious adversarial perturbations that are easy to detect. This work aims to create stealthy attacks using plausible prompts that exploit models' cross-modal associations.

Method: VEIL uses modular prompts with three components: neutral scene anchors, latent auditory triggers (innocuous audio descriptions), and stylistic modulators (cinematic directives). Attack generation is formulated as constrained optimization solved via guided search.

Result: Extensive experiments on 7 T2V models show VEIL achieves 23% improvement in average attack success rate for commercial models compared to prior methods.

Conclusion: Benign-looking prompts with rich implicit cues can effectively jailbreak T2V models by exploiting their cross-modal associative patterns, revealing critical safety vulnerabilities.

Abstract: Jailbreak attacks can circumvent model safety guardrails and reveal critical blind spots. Prior attacks on text-to-video (T2V) models typically add adversarial perturbations to obviously unsafe prompts, which are often easy to detect and defend. In contrast, we show that benign-looking prompts containing rich, implicit cues can induce T2V models to generate semantically unsafe videos that both violate policy and preserve the original (blocked) intent. To realize this, we propose VEIL, a jailbreak framework that leverages T2V models’ cross-modal associative patterns via a modular prompt design. Specifically, our prompts combine three components: neutral scene anchors, which provide the surface-level scene description extracted from the blocked intent to maintain plausibility; latent auditory triggers, textual descriptions of innocuous-sounding audio events (e.g., creaking, muffled noises) that exploit learned audio-visual co-occurrence priors to bias the model toward particular unsafe visual concepts; and stylistic modulators, cinematic directives (e.g., camera framing, atmosphere) that amplify and stabilize the latent trigger’s effect. We formalize attack generation as a constrained optimization over the above modular prompt space and solve it with a guided search procedure that balances stealth and effectiveness. Extensive experiments over 7 T2V models demonstrate the efficacy of our attack, achieving a 23 percent improvement in average attack success rate in commercial models.

[402] Generalized Denoising Diffusion Codebook Models (gDDCM): Tokenizing images using a pre-trained diffusion model

Fei Kong

Main category: cs.CV

TL;DR: gDDCM extends DDCM to work with various diffusion models including DDPM, Score-Based Models, Consistency Models, and Rectified Flow for image compression.

DetailsMotivation: DDCM was limited to DDPM only and couldn't be applied to other diffusion models, creating a need for a more generalized approach.

Method: Extends DDCM by replacing random noise in backward process with noise from specific sets according to predefined rules, making it compatible with multiple diffusion model types.

Result: Successfully generalized DDCM to various diffusion models and achieved improved performance on CIFAR-10 and LSUN Bedroom datasets.

Conclusion: gDDCM provides a generalized framework that enables image compression across multiple diffusion model variants with enhanced performance.

Abstract: Recently, the Denoising Diffusion Codebook Models (DDCM) was proposed. DDCM leverages the Denoising Diffusion Probabilistic Model (DDPM) and replaces the random noise in the backward process with noise sampled from specific sets according to a predefined rule, thereby enabling image compression. However, DDCM cannot be applied to methods other than DDPM. In this paper, we propose the generalized Denoising Diffusion Compression Model (gDDCM), which extends DDCM to mainstream diffusion models and their variants, including DDPM, Score-Based Models, Consistency Models, and Rectified Flow. We evaluate our method on CIFAR-10 and LSUN Bedroom datasets. Experimental results demonstrate that our approach successfully generalizes DDCM to the aforementioned models and achieves improved performance.

[403] MedGEN-Bench: Contextually entangled benchmark for open-ended multimodal medical generation

Junjie Yang, Yuhao Yan, Gang Wu, Yuxuan Wang, Ruoyu Liang, Xinjie Jiang, Xiang Wan, Fenglei Fan, Yongquan Zhang, Feiwei Qin, Changmiao Wan

Main category: cs.CV

TL;DR: MedGEN-Bench is a comprehensive medical multimodal benchmark addressing limitations of existing benchmarks by focusing on contextually intertwined instructions requiring cross-modal reasoning and open-ended generation across 6,422 expert-validated image-text pairs spanning multiple modalities and clinical tasks.

DetailsMotivation: Address limitations in existing medical visual benchmarks that use ambiguous queries, oversimplify diagnostic reasoning into closed-ended shortcuts, and overlook image generation capabilities in text-centric evaluation paradigms.

Method: Developed MedGEN-Bench with 6,422 expert-validated image-text pairs across 6 imaging modalities, 16 clinical tasks, and 28 subtasks in three formats: Visual Question Answering, Image Editing, and Contextual Multimodal Generation.

Result: Systematically evaluated 10 compositional frameworks, 3 unified models, and 5 VLMs using a novel three-tier assessment framework integrating pixel-level metrics, semantic text analysis, and expert-guided clinical relevance scoring.

Conclusion: MedGEN-Bench advances medical AI research by enabling sophisticated cross-modal reasoning and open-ended generative outputs, moving beyond multiple-choice formats to better align with clinical workflow needs.

Abstract: As Vision-Language Models (VLMs) increasingly gain traction in medical applications, clinicians are progressively expecting AI systems not only to generate textual diagnoses but also to produce corresponding medical images that integrate seamlessly into authentic clinical workflows. Despite the growing interest, existing medical visual benchmarks present notable limitations. They often rely on ambiguous queries that lack sufficient relevance to image content, oversimplify complex diagnostic reasoning into closed-ended shortcuts, and adopt a text-centric evaluation paradigm that overlooks the importance of image generation capabilities. To address these challenges, we introduce \textsc{MedGEN-Bench}, a comprehensive multimodal benchmark designed to advance medical AI research. MedGEN-Bench comprises 6,422 expert-validated image-text pairs spanning six imaging modalities, 16 clinical tasks, and 28 subtasks. It is structured into three distinct formats: Visual Question Answering, Image Editing, and Contextual Multimodal Generation. What sets MedGEN-Bench apart is its focus on contextually intertwined instructions that necessitate sophisticated cross-modal reasoning and open-ended generative outputs, moving beyond the constraints of multiple-choice formats. To evaluate the performance of existing systems, we employ a novel three-tier assessment framework that integrates pixel-level metrics, semantic text analysis, and expert-guided clinical relevance scoring. Using this framework, we systematically assess 10 compositional frameworks, 3 unified models, and 5 VLMs.

[404] Descriptor: Distance-Annotated Traffic Perception Question Answering (DTPQA)

Nikos Theodoridis, Tim Brophy, Reenu Mohandas, Ganesh Sistu, Fiachra Collins, Anthony Scanlan, Ciaran Eising

Main category: cs.CV

TL;DR: DTPQA is a VQA benchmark for evaluating VLMs’ perception in traffic scenes, featuring distance annotations to analyze performance degradation with object distance.

DetailsMotivation: To enable trustworthy VLMs in automated driving by assessing their perception capabilities in complex traffic scenes, especially at long distances (30+ meters).

Method: Created DTPQA benchmark with synthetic (simulator-based) and real-world components, each sample containing image, question, ground truth answer, and object distance annotation.

Result: Provides a specialized dataset and tools to evaluate VLM perception performance degradation as object distance increases in traffic scenarios.

Conclusion: DTPQA enables isolated evaluation of VLM perception capabilities in safety-critical driving contexts, addressing the need for robust long-range scene understanding.

Abstract: The remarkable progress of Vision-Language Models (VLMs) on a variety of tasks has raised interest in their application to automated driving. However, for these models to be trusted in such a safety-critical domain, they must first possess robust perception capabilities, i.e., they must be capable of understanding a traffic scene, which can often be highly complex, with many things happening simultaneously. Moreover, since critical objects and agents in traffic scenes are often at long distances, we require systems with not only strong perception capabilities at close distances (up to 20 meters), but also at long (30+ meters) range. Therefore, it is important to evaluate the perception capabilities of these models in isolation from other skills like reasoning or advanced world knowledge. Distance-Annotated Traffic Perception Question Answering (DTPQA) is a Visual Question Answering (VQA) benchmark designed specifically for this purpose: it can be used to evaluate the perception systems of VLMs in traffic scenarios using trivial yet crucial questions relevant to driving decisions. It consists of two parts: a synthetic benchmark (DTP-Synthetic) created using a simulator, and a real-world benchmark (DTP-Real) built on top of existing images of real traffic scenes. Additionally, DTPQA includes distance annotations, i.e., how far the object in question is from the camera. More specifically, each DTPQA sample consists of (at least): (a) an image, (b) a question, (c) the ground truth answer, and (d) the distance of the object in question, enabling analysis of how VLM performance degrades with increasing object distance. In this article, we provide the dataset itself along with the Python scripts used to create it, which can be used to generate additional data of the same kind.

[405] WinMamba: Multi-Scale Shifted Windows in State Space Model for 3D Object Detection

Longhui Zheng, Qiming Xia, Xiaolu Chen, Zhaoliang Liu, Chenglu Wen

Main category: cs.CV

TL;DR: WinMamba is a novel Mamba-based 3D object detection backbone that uses window-scale-adaptive modules and window-shift strategies to efficiently capture long-range dependencies while maintaining spatial information.

DetailsMotivation: Current 3D object detection methods struggle to balance computational efficiency with capturing long-range spatial dependencies. Mamba-based models offer efficiency but existing approaches lose spatial information through axis-aligned scanning in fixed windows.

Method: Proposed WinMamba backbone with stacked WinMamba blocks featuring: 1) window-scale-adaptive module for multi-scale voxel feature compensation, 2) learnable positional encoding, and 3) window-shift strategy for rich contextual cues in linear state space.

Result: Extensive experiments on KITTI and Waymo datasets show WinMamba significantly outperforms baseline methods. Ablation studies confirm the effectiveness of the window-scale-adaptive and window-shift modules.

Conclusion: WinMamba successfully addresses the efficiency-accuracy trade-off in 3D object detection by preserving spatial information while maintaining computational efficiency through Mamba-based architecture with adaptive window strategies.

Abstract: 3D object detection is critical for autonomous driving, yet it remains fundamentally challenging to simultaneously maximize computational efficiency and capture long-range spatial dependencies. We observed that Mamba-based models, with their linear state-space design, capture long-range dependencies at lower cost, offering a promising balance between efficiency and accuracy. However, existing methods rely on axis-aligned scanning within a fixed window, inevitably discarding spatial information. To address this problem, we propose WinMamba, a novel Mamba-based 3D feature-encoding backbone composed of stacked WinMamba blocks. To enhance the backbone with robust multi-scale representation, the WinMamba block incorporates a window-scale-adaptive module that compensates voxel features across varying resolutions during sampling. Meanwhile, to obtain rich contextual cues within the linear state space, we equip the WinMamba layer with a learnable positional encoding and a window-shift strategy. Extensive experiments on the KITTI and Waymo datasets demonstrate that WinMamba significantly outperforms the baseline. Ablation studies further validate the individual contributions of the WSF and AWF modules in improving detection accuracy. The code will be made publicly available.

[406] TripleFDS: Triple Feature Disentanglement and Synthesis for Scene Text Editing

Yuchen Bao, Yiting Wang, Wenjian Huang, Haowei Wang, Shen Chen, Taiping Yao, Shouhong Ding, Jianguo Zhang

Main category: cs.CV

TL;DR: TripleFDS is a novel Scene Text Editing framework that disentangles text style, content, and background attributes for flexible editing, achieving state-of-the-art performance with 44.54 SSIM and 93.58% accuracy.

DetailsMotivation: Previous STE methods struggled with incomplete attribute disentanglement, typically addressing only one aspect like text content editing, which limited controllability and visual consistency.

Method: Uses SCB Synthesis dataset with SCB Groups for triple feature disentanglement, employs inter-group contrastive regularization and intra-sample multi-feature orthogonality, and performs feature remapping to prevent shortcuts and feature leakage.

Result: Achieves state-of-the-art image fidelity (SSIM 44.54) and text accuracy (93.58%) on mainstream STE benchmarks, trained on 125,000 SCB Groups.

Conclusion: TripleFDS enables more flexible editing including style replacement and background transfer while maintaining superior performance compared to previous methods.

Abstract: Scene Text Editing (STE) aims to naturally modify text in images while preserving visual consistency, the decisive factors of which can be divided into three parts, i.e., text style, text content, and background. Previous methods have struggled with incomplete disentanglement of editable attributes, typically addressing only one aspect - such as editing text content - thus limiting controllability and visual consistency. To overcome these limitations, we propose TripleFDS, a novel framework for STE with disentangled modular attributes, and an accompanying dataset called SCB Synthesis. SCB Synthesis provides robust training data for triple feature disentanglement by utilizing the “SCB Group”, a novel construct that combines three attributes per image to generate diverse, disentangled training groups. Leveraging this construct as a basic training unit, TripleFDS first disentangles triple features, ensuring semantic accuracy through inter-group contrastive regularization and reducing redundancy through intra-sample multi-feature orthogonality. In the synthesis phase, TripleFDS performs feature remapping to prevent “shortcut” phenomena during reconstruction and mitigate potential feature leakage. Trained on 125,000 SCB Groups, TripleFDS achieves state-of-the-art image fidelity (SSIM of 44.54) and text accuracy (ACC of 93.58%) on the mainstream STE benchmarks. Besides superior performance, the more flexible editing of TripleFDS supports new operations such as style replacement and background transfer. Code: https://github.com/yusenbao01/TripleFDS

[407] Skeletons Speak Louder than Text: A Motion-Aware Pretraining Paradigm for Video-Based Person Re-Identification

Rifen Lin, Alex Jinpeng Wang, Jiawei Mo, Min Li

Main category: cs.CV

TL;DR: CSIP-ReID introduces the first skeleton-driven pretraining framework for video person re-identification, using skeleton sequences instead of text to capture fine-grained temporal motion cues through contrastive learning and multimodal fusion.

DetailsMotivation: Existing multimodal pretraining for video ReID relies on text which poorly captures fine-grained temporal motion - an essential cue for distinguishing identities. Current approaches lack genuine multimodal pretraining and text is insufficient for capturing motion details.

Method: Two-stage framework: (1) Contrastive learning to align skeleton and visual features at sequence level, (2) Dynamic Prototype Fusion Updater (PFU) to refine multimodal identity prototypes, and Skeleton Guided Temporal Modeling (SGTM) module to distill temporal cues from skeletons into visual features.

Result: Achieves new state-of-the-art results on standard video ReID benchmarks (MARS, LS-VID, iLIDS-VID) and exhibits strong generalization to skeleton-only ReID tasks (BIWI, IAS), significantly outperforming previous methods.

Conclusion: CSIP-ReID pioneers an annotation-free and motion-aware pretraining paradigm for ReID, opening a new frontier in multimodal representation learning by using skeletons as a spatiotemporally informative modality aligned with video frames.

Abstract: Multimodal pretraining has revolutionized visual understanding, but its impact on video-based person re-identification (ReID) remains underexplored. Existing approaches often rely on video-text pairs, yet suffer from two fundamental limitations: (1) lack of genuine multimodal pretraining, and (2) text poorly captures fine-grained temporal motion-an essential cue for distinguishing identities in video. In this work, we take a bold departure from text-based paradigms by introducing the first skeleton-driven pretraining framework for ReID. To achieve this, we propose Contrastive Skeleton-Image Pretraining for ReID (CSIP-ReID), a novel two-stage method that leverages skeleton sequences as a spatiotemporally informative modality aligned with video frames. In the first stage, we employ contrastive learning to align skeleton and visual features at sequence level. In the second stage, we introduce a dynamic Prototype Fusion Updater (PFU) to refine multimodal identity prototypes, fusing motion and appearance cues. Moreover, we propose a Skeleton Guided Temporal Modeling (SGTM) module that distills temporal cues from skeleton data and integrates them into visual features. Extensive experiments demonstrate that CSIP-ReID achieves new state-of-the-art results on standard video ReID benchmarks (MARS, LS-VID, iLIDS-VID). Moreover, it exhibits strong generalization to skeleton-only ReID tasks (BIWI, IAS), significantly outperforming previous methods. CSIP-ReID pioneers an annotation-free and motion-aware pretraining paradigm for ReID, opening a new frontier in multimodal representation learning.

[408] Unlocking the Forgery Detection Potential of Vanilla MLLMs: A Novel Training-Free Pipeline

Rui Zuo, Qinyue Tong, Zhe-Ming Lu, Ziqian Lu

Main category: cs.CV

TL;DR: Foresee is a training-free MLLM-based pipeline for image forgery detection and localization that eliminates additional training, achieves superior localization accuracy, and provides comprehensive textual explanations across various tampering types.

DetailsMotivation: Existing IFDL methods struggle with generalization across datasets and offer limited interpretability, while current MLLM-based approaches require expensive training and fail to leverage vanilla MLLMs' inherent potential for forensic analysis.

Method: Proposes Foresee pipeline with type-prior-driven strategy and Flexible Feature Detector (FFD) module specifically designed for copy-move manipulations, enabling training-free inference while effectively utilizing vanilla MLLMs.

Result: Extensive experiments show Foresee achieves superior localization accuracy and richer textual explanations compared to existing methods, with strong generalization across copy-move, splicing, removal, local enhancement, deepfake, and AIGC-based editing.

Conclusion: Foresee successfully unleashes vanilla MLLMs’ potential for image forensics without additional training, offering an efficient and interpretable solution that outperforms existing IFDL methods across diverse tampering types.

Abstract: With the rapid advancement of artificial intelligence-generated content (AIGC) technologies, including multimodal large language models (MLLMs) and diffusion models, image generation and manipulation have become remarkably effortless. Existing image forgery detection and localization (IFDL) methods often struggle to generalize across diverse datasets and offer limited interpretability. Nowadays, MLLMs demonstrate strong generalization potential across diverse vision-language tasks, and some studies introduce this capability to IFDL via large-scale training. However, such approaches cost considerable computational resources, while failing to reveal the inherent generalization potential of vanilla MLLMs to address this problem. Inspired by this observation, we propose Foresee, a training-free MLLM-based pipeline tailored for image forgery analysis. It eliminates the need for additional training and enables a lightweight inference process, while surpassing existing MLLM-based methods in both tamper localization accuracy and the richness of textual explanations. Foresee employs a type-prior-driven strategy and utilizes a Flexible Feature Detector (FFD) module to specifically handle copy-move manipulations, thereby effectively unleashing the potential of vanilla MLLMs in the forensic domain. Extensive experiments demonstrate that our approach simultaneously achieves superior localization accuracy and provides more comprehensive textual explanations. Moreover, Foresee exhibits stronger generalization capability, outperforming existing IFDL methods across various tampering types, including copy-move, splicing, removal, local enhancement, deepfake, and AIGC-based editing. The code will be released in the final version.

[409] THIR: Topological Histopathological Image Retrieval

Zahra Tabatabaei, Jon Sporring

Main category: cs.CV

TL;DR: THIR is a novel unsupervised CBMIR framework that uses topological data analysis (Betti numbers from persistent homology) to retrieve histopathological breast cancer images without training, outperforming supervised methods while being fast and scalable on standard CPUs.

DetailsMotivation: Breast cancer caused 685,000 deaths in 2020, highlighting the need for early diagnosis and accurate clinical decision making through efficient image retrieval systems.

Method: Extracts topological fingerprints directly from RGB histopathological images using cubical persistence, encoding loop evolution as compact feature vectors, then performs similarity retrieval by computing distances between these topological descriptors.

Result: Outperforms state-of-the-art supervised and unsupervised methods on BreaKHis dataset, processing entire dataset in under 20 minutes on standard CPU.

Conclusion: THIR provides a fast, scalable, training-free solution for clinical image retrieval that doesn’t require annotated datasets or GPU resources.

Abstract: According to the World Health Organization, breast cancer claimed the lives of approximately 685,000 women in 2020. Early diagnosis and accurate clinical decision making are critical in reducing this global burden. In this study, we propose THIR, a novel Content-Based Medical Image Retrieval (CBMIR) framework that leverages topological data analysis specifically, Betti numbers derived from persistent homology to characterize and retrieve histopathological images based on their intrinsic structural patterns. Unlike conventional deep learning approaches that rely on extensive training, annotated datasets, and powerful GPU resources, THIR operates entirely without supervision. It extracts topological fingerprints directly from RGB histopathological images using cubical persistence, encoding the evolution of loops as compact, interpretable feature vectors. The similarity retrieval is then performed by computing the distances between these topological descriptors, efficiently returning the top-K most relevant matches. Extensive experiments on the BreaKHis dataset demonstrate that THIR outperforms state of the art supervised and unsupervised methods. It processes the entire dataset in under 20 minutes on a standard CPU, offering a fast, scalable, and training free solution for clinical image retrieval.

[410] HDW-SR: High-Frequency Guided Diffusion Model based on Wavelet Decomposition for Image Super-Resolution

Chao Yang, Boqian Zhang, Jinghao Xu, Guang Jiang

Main category: cs.CV

TL;DR: HDW-SR is a diffusion-based super-resolution method that uses wavelet decomposition to better restore high-frequency details, replacing standard U-Net with wavelet-based downsampling and sparse cross-attention for explicit high-frequency guidance.

DetailsMotivation: Existing diffusion-based super-resolution methods often produce blurred fine details due to insufficient guidance in the high-frequency domain, limiting their ability to restore sharp image details.

Method: Proposes HDW-SR with wavelet decomposition, performing diffusion only on residual maps, using wavelet-based downsampling for multi-scale frequency decomposition, sparse cross-attention between frequency subbands, and a Dynamic Thresholding Block for high-frequency refinement.

Result: HDW-SR achieves competitive super-resolution performance on both synthetic and real-world datasets, particularly excelling in recovering fine-grained image details compared to existing methods.

Conclusion: The wavelet-based diffusion approach with explicit high-frequency guidance effectively addresses the blurring issue in diffusion-based super-resolution, enabling superior restoration of fine image details.

Abstract: Diffusion-based methods have shown great promise in single image super-resolution (SISR); however, existing approaches often produce blurred fine details due to insufficient guidance in the high-frequency domain. To address this issue, we propose a High-Frequency Guided Diffusion Network based on Wavelet Decomposition (HDW-SR), which replaces the conventional U-Net backbone in diffusion frameworks. Specifically, we perform diffusion only on the residual map, allowing the network to focus more effectively on high-frequency information restoration. We then introduce wavelet-based downsampling in place of standard CNN downsampling to achieve multi-scale frequency decomposition, enabling sparse cross-attention between the high-frequency subbands of the pre-super-resolved image and the low-frequency subbands of the diffused image for explicit high-frequency guidance. Moreover, a Dynamic Thresholding Block (DTB) is designed to refine high-frequency selection during the sparse attention process. During upsampling, the invertibility of the wavelet transform ensures low-loss feature reconstruction. Experiments on both synthetic and real-world datasets demonstrate that HDW-SR achieves competitive super-resolution performance, excelling particularly in recovering fine-grained image details. The code will be available after acceptance.

[411] GenTract: Generative Global Tractography

Alec Sargood, Lemuel Puglisi, Elinor Thompson, Mirco Musolesi, Daniel C. Alexander

Main category: cs.CV

TL;DR: GenTract is the first generative model for global tractography that learns a direct mapping from diffusion MRI to complete, anatomically plausible streamlines, achieving significantly higher precision than existing methods.

DetailsMotivation: Existing local tractography methods suffer from error accumulation and high false positive rates, while global methods are computationally expensive. There's a need for more efficient and accurate tractography approaches.

Method: Frames tractography as a generative task using both diffusion-based and flow matching paradigms to learn direct mapping from dMRI to complete streamlines.

Result: GenTract achieves precision 2.1x higher than the next-best method (TractOracle), with even greater advantages in challenging low-resolution and noisy settings where it outperforms competitors by an order of magnitude.

Conclusion: GenTract represents a promising solution for global tractography, producing high-precision tractograms on research-grade data while maintaining reliability on imperfect, lower-resolution data.

Abstract: Tractography is the process of inferring the trajectories of white-matter pathways in the brain from diffusion magnetic resonance imaging (dMRI). Local tractography methods, which construct streamlines by following local fiber orientation estimates stepwise through an image, are prone to error accumulation and high false positive rates, particularly on noisy or low-resolution data. In contrast, global methods, which attempt to optimize a collection of streamlines to maximize compatibility with underlying fiber orientation estimates, are computationally expensive. To address these challenges, we introduce GenTract, the first generative model for global tractography. We frame tractography as a generative task, learning a direct mapping from dMRI to complete, anatomically plausible streamlines. We compare both diffusion-based and flow matching paradigms and evaluate GenTract’s performance against state-of-the-art baselines. Notably, GenTract achieves precision 2.1x higher than the next-best method, TractOracle. This advantage becomes even more pronounced in challenging low-resolution and noisy settings, where it outperforms the closest competitor by an order of magnitude. By producing tractograms with high precision on research-grade data while also maintaining reliability on imperfect, lower-resolution data, GenTract represents a promising solution for global tractography.

[412] PAN: A World Model for General, Interactable, and Long-Horizon World Simulation

PAN Team, Jiannan Xiang, Yi Gu, Zihan Liu, Zeyu Feng, Qiyue Gao, Yiyan Hu, Benhao Huang, Guangyi Liu, Yichi Yang, Kun Zhou, Davit Abrahamyan, Arif Ahmad, Ganesh Bannur, Junrong Chen, Kimi Chen, Mingkai Deng, Ruobing Han, Xinqi Huang, Haoqiang Kang, Zheqi Liu, Enze Ma, Hector Ren, Yashowardhan Shinde, Rohan Shingre, Ramsundar Tanikella, Kaiming Tao, Dequan Yang, Xinle Yu, Cong Zeng, Binglin Zhou, Zhengzhong Liu, Zhiting Hu, Eric P. Xing

Main category: cs.CV

TL;DR: PAN is a general world model that predicts future states through high-quality video simulation using language actions and history, combining LLM-based reasoning with video diffusion for coherent long-term dynamics.

DetailsMotivation: Existing world models are limited to restricted domains with poor generalization, while video generators lack causal control and interactivity needed for purposeful reasoning.

Method: Uses Generative Latent Prediction (GLP) architecture with autoregressive latent dynamics based on LLM for text-based reasoning, combined with video diffusion decoder for visual reconstruction.

Result: Achieves strong performance in action-conditioned simulation, long-horizon forecasting, and simulative reasoning across diverse domains compared to other models.

Conclusion: PAN represents progress toward general world models that enable predictive simulation of future states for reasoning and acting.

Abstract: A world model enables an intelligent agent to imagine, predict, and reason about how the world evolves in response to its actions, and accordingly to plan and strategize. While recent video generation models produce realistic visual sequences, they typically operate in the prompt-to-full-video manner without causal control, interactivity, or long-horizon consistency required for purposeful reasoning. Existing world modeling efforts, on the other hand, often focus on restricted domains (e.g., physical, game, or 3D-scene dynamics) with limited depth and controllability, and struggle to generalize across diverse environments and interaction formats. In this work, we introduce PAN, a general, interactable, and long-horizon world model that predicts future world states through high-quality video simulation conditioned on history and natural language actions. PAN employs the Generative Latent Prediction (GLP) architecture that combines an autoregressive latent dynamics backbone based on a large language model (LLM), which grounds simulation in extensive text-based knowledge and enables conditioning on language-specified actions, with a video diffusion decoder that reconstructs perceptually detailed and temporally coherent visual observations, to achieve a unification between latent space reasoning (imagination) and realizable world dynamics (reality). Trained on large-scale video-action pairs spanning diverse domains, PAN supports open-domain, action-conditioned simulation with coherent, long-term dynamics. Extensive experiments show that PAN achieves strong performance in action-conditioned world simulation, long-horizon forecasting, and simulative reasoning compared to other video generators and world models, taking a step towards general world models that enable predictive simulation of future world states for reasoning and acting.

[413] Large Language Models Meet Extreme Multi-label Classification: Scaling and Multi-modal Framework

Diego Ortego, Marlon Rodríguez, Mario Almagro, Kunal Dahiya, David Jiménez, Juan C. SanMiguel

Main category: cs.CV

TL;DR: ViXML is a vision-enhanced extreme multi-label learning framework that efficiently integrates foundation vision models with text-only approaches, achieving state-of-the-art performance while maintaining computational efficiency.

DetailsMotivation: Foundation models have revolutionized AI but remain underutilized in Extreme Multi-label Classification (XMC), which requires balancing efficiency and performance with extremely large label spaces.

Method: ViXML efficiently integrates foundation vision models by pooling a single embedding per image and combines decoder-only models with vision capabilities, limiting computational growth while unlocking multi-modal capabilities.

Result: ViXML with small encoders outperforms text-only decoders in most cases, showing substantial improvements of up to +8.21% in P@1 on the largest dataset, surpassing previous state-of-the-art methods.

Conclusion: Both decoder-only models and visual information play critical roles in XMC, and their combination through ViXML delivers superior performance while maintaining computational efficiency, demonstrating that visual information can compensate for billions of parameters in text-only models.

Abstract: Foundation models have revolutionized artificial intelligence across numerous domains, yet their transformative potential remains largely untapped in Extreme Multi-label Classification (XMC). Queries in XMC are associated with relevant labels from extremely large label spaces, where it is critical to strike a balance between efficiency and performance. Therefore, many recent approaches efficiently pose XMC as a maximum inner product search between embeddings learned from small encoder-only transformer architectures. In this paper, we address two important aspects in XMC: how to effectively harness larger decoder-only models, and how to exploit visual information while maintaining computational efficiency. We demonstrate that both play a critical role in XMC separately and can be combined for improved performance. We show that a few billion-size decoder can deliver substantial improvements while keeping computational overhead manageable. Furthermore, our Vision-enhanced eXtreme Multi-label Learning framework (ViXML) efficiently integrates foundation vision models by pooling a single embedding per image. This limits computational growth while unlocking multi-modal capabilities. Remarkably, ViXML with small encoders outperforms text-only decoder in most cases, showing that an image is worth billions of parameters. Finally, we present an extension of existing text-only datasets to exploit visual metadata and make them available for future benchmarking. Comprehensive experiments across four public text-only datasets and their corresponding image enhanced versions validate our proposals’ effectiveness, surpassing previous state-of-the-art by up to +8.21% in P@1 on the largest dataset. ViXML’s code is available at https://github.com/DiegoOrtego/vixml.

[414] Semantic Document Derendering: SVG Reconstruction via Vision-Language Modeling

Adam Hazimeh, Ke Wang, Mark Collier, Gilles Baechler, Efi Kokiopoulou, Pascal Frossard

Main category: cs.CV

TL;DR: SliDer is a framework that converts raster slide images into editable SVG format using Vision-Language Models, preserving semantic structure and enabling iterative refinement.

DetailsMotivation: Multimedia documents are often distributed as static raster images, limiting editability. Existing geometric raster-vectorization methods fail to preserve high-level semantic structure when applied to complex documents like slides.

Method: Uses Vision-Language Models (VLMs) to detect and extract attributes from individual image and text elements, organizing them into coherent SVG format with iterative refinement during inference.

Result: Achieves reconstruction LPIPS of 0.069 and is preferred by human evaluators in 82.9% of cases compared to strongest zero-shot VLM baseline.

Conclusion: SliDer effectively restores editability to slide documents by converting raster images to structured SVG representations while preserving semantic distinctions between elements.

Abstract: Multimedia documents such as slide presentations and posters are designed to be interactive and easy to modify. Yet, they are often distributed in a static raster format, which limits editing and customization. Restoring their editability requires converting these raster images back into structured vector formats. However, existing geometric raster-vectorization methods, which rely on low-level primitives like curves and polygons, fall short at this task. Specifically, when applied to complex documents like slides, they fail to preserve the high-level structure, resulting in a flat collection of shapes where the semantic distinction between image and text elements is lost. To overcome this limitation, we address the problem of semantic document derendering by introducing SliDer, a novel framework that uses Vision-Language Models (VLMs) to derender slide images as compact and editable Scalable Vector Graphic (SVG) representations. SliDer detects and extracts attributes from individual image and text elements in a raster input and organizes them into a coherent SVG format. Crucially, the model iteratively refines its predictions during inference in a process analogous to human design, generating SVG code that more faithfully reconstructs the original raster upon rendering. Furthermore, we introduce Slide2SVG, a novel dataset comprising raster-SVG pairs of slide documents curated from real-world scientific presentations, to facilitate future research in this domain. Our results demonstrate that SliDer achieves a reconstruction LPIPS of 0.069 and is favored by human evaluators in 82.9% of cases compared to the strongest zero-shot VLM baseline.

[415] Video Spatial Reasoning with Object-Centric 3D Rollout

Haoran Tang, Meng Cao, Ruyang Liu, Xiaoxi Liang, Linglong Li, Ge Li, Xiaodan Liang

Main category: cs.CV

TL;DR: Proposes Object-Centric 3D Rollout (OCR) to improve video spatial reasoning in MLLMs by introducing structured perturbations to object geometry, forcing holistic scene understanding.

DetailsMotivation: Existing MLLMs struggle with robust video spatial reasoning and exhibit query-locked reasoning, focusing only on explicitly mentioned objects while ignoring contextual cues.

Method: OCR introduces structured perturbations to 3D object geometry during training, degrades object-specific visual cues, projects altered geometry to 2D space, and uses a rollout-based training pipeline with vanilla and region-noisy videos.

Result: Achieves 47.5% accuracy on VSI-Bench with a 3B-parameter model, outperforming several 7B baselines and showing superiority over prior rollout strategies.

Conclusion: OCR effectively addresses query-locked reasoning limitations and enables more robust spatial reasoning in video understanding tasks.

Abstract: Recent advances in Multi-modal Large Language Models (MLLMs) have showcased remarkable capabilities in vision-language understanding. However, enabling robust video spatial reasoning-the ability to comprehend object locations, orientations, and inter-object relationships in dynamic 3D scenes-remains a key unsolved challenge. Existing approaches primarily rely on spatially grounded supervised fine-tuning or reinforcement learning, yet we observe that such models often exhibit query-locked reasoning, focusing narrowly on objects explicitly mentioned in the prompt while ignoring critical contextual cues. To address this limitation, we propose Object-Centric 3D Rollout (OCR), a novel strategy that introduces structured perturbations to the 3D geometry of selected objects during training. By degrading object-specific visual cues and projecting the altered geometry into 2D space, OCR compels the model to reason holistically across the entire scene. We further design a rollout-based training pipeline that jointly leverages vanilla and region-noisy videos to optimize spatial reasoning trajectories. Experiments demonstrate state-of-the-art performance: our 3B-parameter model achieves 47.5% accuracy on VSI-Bench, outperforming several 7B baselines. Ablations confirm OCR’s superiority over prior rollout strategies (e.g., T-GRPO, NoisyRollout).

[416] Birth of a Painting: Differentiable Brushstroke Reconstruction

Ying Jiang, Jiayin Lu, Yunuo Chen, Yumeng He, Kui Wu, Yin Yang, Chenfanfu Jiang

Main category: cs.CV

TL;DR: A differentiable stroke reconstruction framework that unifies painting, stylized texturing, and smudging to reproduce the human painting-smudging loop, producing realistic and expressive digital paintings with smooth shading.

DetailsMotivation: Existing generative models focus on final image generation or patch-based process simulation, lacking explicit stroke structure and failing to produce smooth, realistic shading in painting synthesis.

Method: Differentiable framework that optimizes single- and dual-color Bezier strokes through parallel differentiable paint renderer, with style generation for geometry-conditioned textures and differentiable smudge operator for natural blending. Uses coarse-to-fine optimization strategy for joint optimization of stroke geometry, color, and texture.

Result: Produces realistic and expressive stroke reconstructions, smooth tonal transitions, and richly stylized appearances across oil, watercolor, ink, and digital paintings.

Conclusion: Offers a unified model for expressive digital painting creation that faithfully reproduces the human painting-smudging loop with explicit stroke structure and realistic shading.

Abstract: Painting embodies a unique form of visual storytelling, where the creation process is as significant as the final artwork. Although recent advances in generative models have enabled visually compelling painting synthesis, most existing methods focus solely on final image generation or patch-based process simulation, lacking explicit stroke structure and failing to produce smooth, realistic shading. In this work, we present a differentiable stroke reconstruction framework that unifies painting, stylized texturing, and smudging to faithfully reproduce the human painting-smudging loop. Given an input image, our framework first optimizes single- and dual-color Bezier strokes through a parallel differentiable paint renderer, followed by a style generation module that synthesizes geometry-conditioned textures across diverse painting styles. We further introduce a differentiable smudge operator to enable natural color blending and shading. Coupled with a coarse-to-fine optimization strategy, our method jointly optimizes stroke geometry, color, and texture under geometric and semantic guidance. Extensive experiments on oil, watercolor, ink, and digital paintings demonstrate that our approach produces realistic and expressive stroke reconstructions, smooth tonal transitions, and richly stylized appearances, offering a unified model for expressive digital painting creation. See our project page for more demos: https://yingjiang96.github.io/DiffPaintWebsite/.

[417] Difficulty-Aware Label-Guided Denoising for Monocular 3D Object Detection

Soyul Lee, Seungmin Baek, Dongbo Min

Main category: cs.CV

TL;DR: MonoDLGD is a difficulty-aware label-guided denoising framework for monocular 3D object detection that adaptively perturbs and reconstructs ground-truth labels based on detection uncertainty to improve performance across all difficulty levels.

DetailsMotivation: Monocular 3D object detection suffers from inherent depth ambiguity and existing methods struggle with inaccurate depth estimates while ignoring instance-level detection difficulty factors like occlusion, distance, and truncation.

Method: Proposes a Difficulty-Aware Label-Guided Denoising framework that applies adaptive perturbations (stronger for easier instances, weaker for harder cases) and reconstructs ground-truth labels to provide explicit geometric supervision through joint optimization of label reconstruction and 3D detection.

Result: Extensive experiments on KITTI benchmark demonstrate state-of-the-art performance across all difficulty levels.

Conclusion: MonoDLGD effectively addresses depth ambiguity in monocular 3D detection through geometry-aware representation learning and adaptive difficulty handling, achieving superior performance on challenging benchmarks.

Abstract: Monocular 3D object detection is a cost-effective solution for applications like autonomous driving and robotics, but remains fundamentally ill-posed due to inherently ambiguous depth cues. Recent DETR-based methods attempt to mitigate this through global attention and auxiliary depth prediction, yet they still struggle with inaccurate depth estimates. Moreover, these methods often overlook instance-level detection difficulty, such as occlusion, distance, and truncation, leading to suboptimal detection performance. We propose MonoDLGD, a novel Difficulty-Aware Label-Guided Denoising framework that adaptively perturbs and reconstructs ground-truth labels based on detection uncertainty. Specifically, MonoDLGD applies stronger perturbations to easier instances and weaker ones into harder cases, and then reconstructs them to effectively provide explicit geometric supervision. By jointly optimizing label reconstruction and 3D object detection, MonoDLGD encourages geometry-aware representation learning and improves robustness to varying levels of object complexity. Extensive experiments on the KITTI benchmark demonstrate that MonoDLGD achieves state-of-the-art performance across all difficulty levels.

[418] Self-Supervised Ultrasound Screen Detection

Alberto Gomez, Jorge Oliveira, Ramon Casero, Agis Chartsias

Main category: cs.CV

TL;DR: Self-supervised pipeline extracts ultrasound images from monitor photos, bypassing DICOM transfer for rapid algorithm testing.

DetailsMotivation: Ultrasound machines display images on built-in monitors but require DICOM transfer to hospital systems, creating a bottleneck for rapid testing and prototyping of new algorithms.

Method: Proposed a self-supervised pipeline to extract ultrasound images from photographs of the monitor display.

Result: In proof-of-concept study, rectified images retained sufficient visual fidelity to classify cardiac views with balanced accuracy of 0.79 compared to native DICOMs.

Conclusion: The method successfully bypasses the DICOM bottleneck and enables rapid testing and prototyping of ultrasound algorithms using extracted images from monitor photos.

Abstract: Ultrasound (US) machines display images on a built-in monitor, but routine transfer to hospital systems relies on DICOM. We propose a self-supervised pipeline to extract the US image from a photograph of the monitor. This removes the DICOM bottleneck and enables rapid testing and prototyping of new algorithms. In a proof-of-concept study, the rectified images retained enough visual fidelity to classify cardiac views with a balanced accuracy of 0.79 with respect to the native DICOMs.

[419] RefineVAD: Semantic-Guided Feature Recalibration for Weakly Supervised Video Anomaly Detection

Junhee Lee, ChaeBeen Bang, MyoungChul Kim, MyeongAh Cho

Main category: cs.CV

TL;DR: RefineVAD is a weakly-supervised video anomaly detection framework that jointly models temporal motion patterns and semantic structures of different anomaly types, addressing the oversimplification of treating all anomalies as a single category.

DetailsMotivation: Existing methods oversimplify anomaly space by treating all abnormal events as a single category, ignoring the diverse semantic and temporal characteristics of real-world anomalies. Humans perceive anomalies by jointly interpreting temporal motion patterns and semantic structures.

Method: Two core modules: Motion-aware Temporal Attention and Recalibration (MoTAR) estimates motion salience and adjusts temporal focus using shift-based attention and Transformer modeling. Category-Oriented Refinement (CORE) injects soft anomaly category priors by aligning segment-level features with learnable category prototypes via cross-attention.

Result: Extensive experiments on WVAD benchmark validate the effectiveness of RefineVAD and demonstrate the importance of integrating semantic context to guide feature refinement toward anomaly-relevant patterns.

Conclusion: The framework successfully models both “how” motion evolves and “what” semantic category it resembles, providing a more nuanced approach to weakly-supervised video anomaly detection by leveraging temporal dynamics and semantic structure.

Abstract: Weakly-Supervised Video Anomaly Detection aims to identify anomalous events using only video-level labels, balancing annotation efficiency with practical applicability. However, existing methods often oversimplify the anomaly space by treating all abnormal events as a single category, overlooking the diverse semantic and temporal characteristics intrinsic to real-world anomalies. Inspired by how humans perceive anomalies, by jointly interpreting temporal motion patterns and semantic structures underlying different anomaly types, we propose RefineVAD, a novel framework that mimics this dual-process reasoning. Our framework integrates two core modules. The first, Motion-aware Temporal Attention and Recalibration (MoTAR), estimates motion salience and dynamically adjusts temporal focus via shift-based attention and global Transformer-based modeling. The second, Category-Oriented Refinement (CORE), injects soft anomaly category priors into the representation space by aligning segment-level features with learnable category prototypes through cross-attention. By jointly leveraging temporal dynamics and semantic structure, explicitly models both “how” motion evolves and “what” semantic category it resembles. Extensive experiments on WVAD benchmark validate the effectiveness of RefineVAD and highlight the importance of integrating semantic context to guide feature refinement toward anomaly-relevant patterns.

[420] Robust Defense Strategies for Multimodal Contrastive Learning: Efficient Fine-tuning Against Backdoor Attacks

Md. Iqbal Hossain, Afia Sajeeda, Neeresh Kumar Perla, Ming Shao

Main category: cs.CV

TL;DR: Proposes a defense method against backdoor attacks in CLIP models by identifying triggers and victim samples using an image segmentation oracle, then fine-tuning with curated data.

DetailsMotivation: Multimodal models like CLIP are vulnerable to backdoor attacks, and existing defenses require extensive retraining without targeting specific affected labels.

Method: Uses an image segmentation oracle to supervise poisoned CLIP outputs, develops algorithms to identify triggers and victim samples, and creates a compact fine-tuning dataset for model rectification.

Result: Extensive experiments show the strategy effectively defends against backdoor attacks in CLIP-based models on visual recognition benchmarks.

Conclusion: The proposed approach successfully enhances CLIP model robustness against backdoor attacks by efficiently identifying triggers and affected components, enabling targeted rectification.

Abstract: The advent of multimodal deep learning models, such as CLIP, has unlocked new frontiers in a wide range of applications, from image-text understanding to classification tasks. However, these models are not safe for adversarial attacks, particularly backdoor attacks, which can subtly manipulate model behavior. Moreover, existing defense methods typically involve training from scratch or fine-tuning using a large dataset without pinpointing the specific labels that are affected. In this study, we introduce an innovative strategy to enhance the robustness of multimodal contrastive learning models against such attacks. In particular, given a poisoned CLIP model, our approach can identify the backdoor trigger and pinpoint the victim samples and labels in an efficient manner. To that end, an image segmentation ``oracle’’ is introduced as the supervisor for the output of the poisoned CLIP. We develop two algorithms to rectify the poisoned model: (1) differentiating between CLIP and Oracle’s knowledge to identify potential triggers; (2) pinpointing affected labels and victim samples, and curating a compact fine-tuning dataset. With this knowledge, we are allowed to rectify the poisoned CLIP model to negate backdoor effects. Extensive experiments on visual recognition benchmarks demonstrate our strategy is effective in CLIP-based backdoor defense.

[421] End-to-End Multi-Person Pose Estimation with Pose-Aware Video Transformer

Yonghui Yu, Jiahang Cai, Xun Wang, Wenwu Yang

Main category: cs.CV

TL;DR: PAVE-Net is the first fully end-to-end framework for multi-person 2D pose estimation in videos, eliminating heuristic operations like detection and NMS through a novel pose-aware attention mechanism and spatiotemporal modeling.

DetailsMotivation: Existing methods rely on two-stage pipelines with heuristic operations (detection, RoI cropping, NMS) that limit accuracy and efficiency. There's a need for end-to-end solutions that can handle complex temporal associations across frames.

Method: Proposes PAVE-Net with spatial encoder for intra-frame relations and spatiotemporal pose decoder for global dependencies. Uses pose-aware attention mechanism for selective feature aggregation across frames and explicit modeling of spatiotemporal dependencies among keypoints.

Result: Achieves 6.0 mAP improvement on PoseTrack2017 compared to prior image-based end-to-end methods, and delivers competitive accuracy with state-of-the-art two-stage video approaches while offering significant efficiency gains.

Conclusion: PAVE-Net successfully demonstrates the viability of end-to-end multi-person video pose estimation, outperforming existing methods in both accuracy and efficiency through its novel pose-aware attention and spatiotemporal modeling approach.

Abstract: Existing multi-person video pose estimation methods typically adopt a two-stage pipeline: detecting individuals in each frame, followed by temporal modeling for single-person pose estimation. This design relies on heuristic operations such as detection, RoI cropping, and non-maximum suppression (NMS), limiting both accuracy and efficiency. In this paper, we present a fully end-to-end framework for multi-person 2D pose estimation in videos, effectively eliminating heuristic operations. A key challenge is to associate individuals across frames under complex and overlapping temporal trajectories. To address this, we introduce a novel Pose-Aware Video transformEr Network (PAVE-Net), which features a spatial encoder to model intra-frame relations and a spatiotemporal pose decoder to capture global dependencies across frames. To achieve accurate temporal association, we propose a pose-aware attention mechanism that enables each pose query to selectively aggregate features corresponding to the same individual across consecutive frames.Additionally, we explicitly model spatiotemporal dependencies among pose keypoints to improve accuracy. Notably, our approach is the first end-to-end method for multi-frame 2D human pose estimation.Extensive experiments show that PAVE-Net substantially outperforms prior image-based end-to-end methods, achieving a \textbf{6.0} mAP improvement on PoseTrack2017, and delivers accuracy competitive with state-of-the-art two-stage video-based approaches, while offering significant gains in efficiency.Project page: https://github.com/zgspose/PAVENet

[422] Hierarchical Prompt Learning for Image- and Text-Based Person Re-Identification

Linhan Zhou, Shuang Li, Neng Dong, Yonghang Tai, Yafei Zhang, Huafeng Li

Main category: cs.CV

TL;DR: HPL is a unified framework that uses hierarchical prompt learning to jointly optimize image-to-image and text-to-image person re-identification tasks through task-aware prompt modeling and cross-modal alignment.

DetailsMotivation: Existing methods treat image-to-image and text-to-image ReID separately, leading to representation entanglement and suboptimal performance. A unified approach is needed to address the distinct challenges of discriminative identity learning (I2I) and cross-modal semantic alignment (T2I).

Method: Proposed Hierarchical Prompt Learning with Task-Routed Transformer using dual classification tokens, hierarchical prompt generation with identity-level and instance-level pseudo-text tokens, and Cross-Modal Prompt Regularization for semantic alignment.

Result: Achieved state-of-the-art performance on multiple ReID benchmarks for both I2I and T2I tasks, validating the effectiveness of the unified framework.

Conclusion: The HPL framework successfully unifies I2I and T2I person re-identification through hierarchical prompt learning, demonstrating that joint optimization with task-aware prompts and cross-modal regularization leads to superior performance on both tasks.

Abstract: Person re-identification (ReID) aims to retrieve target pedestrian images given either visual queries (image-to-image, I2I) or textual descriptions (text-to-image, T2I). Although both tasks share a common retrieval objective, they pose distinct challenges: I2I emphasizes discriminative identity learning, while T2I requires accurate cross-modal semantic alignment. Existing methods often treat these tasks separately, which may lead to representation entanglement and suboptimal performance. To address this, we propose a unified framework named Hierarchical Prompt Learning (HPL), which leverages task-aware prompt modeling to jointly optimize both tasks. Specifically, we first introduce a Task-Routed Transformer, which incorporates dual classification tokens into a shared visual encoder to route features for I2I and T2I branches respectively. On top of this, we develop a hierarchical prompt generation scheme that integrates identity-level learnable tokens with instance-level pseudo-text tokens. These pseudo-tokens are derived from image or text features via modality-specific inversion networks, injecting fine-grained, instance-specific semantics into the prompts. Furthermore, we propose a Cross-Modal Prompt Regularization strategy to enforce semantic alignment in the prompt token space, ensuring that pseudo-prompts preserve source-modality characteristics while enhancing cross-modal transferability. Extensive experiments on multiple ReID benchmarks validate the effectiveness of our method, achieving state-of-the-art performance on both I2I and T2I tasks.

[423] 3DAlign-DAER: Dynamic Attention Policy and Efficient Retrieval Strategy for Fine-grained 3D-Text Alignment at Scale

Yijia Fan, Jusheng Zhang, Kaitong Cai, Jing Yang, Jian Wang, Keze Wang

Main category: cs.CV

TL;DR: 3DAlign-DAER is a unified framework that improves text-3D cross-modal alignment through dynamic attention policy and efficient retrieval strategy, addressing fine-grained semantic-geometric alignment challenges in large-scale 3D databases.

DetailsMotivation: Existing methods struggle to align fine-grained textual semantics with detailed geometric structures and degrade significantly when scaling to large-scale 3D databases.

Method: Proposes dynamic attention policy (DAP) with Hierarchical Attention Fusion module for token-to-point attentions, optimized via Monte Carlo tree search with hybrid reward signals. Also introduces Efficient Retrieval Strategy (ERS) for hierarchical searching in large-scale embedding spaces.

Result: Extensive experiments demonstrate superior performance on diverse benchmarks, outperforming traditional methods in both accuracy and efficiency.

Conclusion: 3DAlign-DAER effectively addresses fine-grained text-3D alignment challenges and scales well to large databases, with plans to release codes, models, and the constructed Align3D-2M dataset.

Abstract: Despite recent advancements in 3D-text cross-modal alignment, existing state-of-the-art methods still struggle to align fine-grained textual semantics with detailed geometric structures, and their alignment performance degrades significantly when scaling to large-scale 3D databases. To overcome this limitation, we introduce 3DAlign-DAER, a unified framework designed to align text and 3D geometry via the proposed dynamic attention policy and the efficient retrieval strategy, capturing subtle correspondences for diverse cross-modal retrieval and classification tasks. Specifically, during the training, our proposed dynamic attention policy (DAP) employs the Hierarchical Attention Fusion (HAF) module to represent the alignment as learnable fine-grained token-to-point attentions. To optimize these attentions across different tasks and geometric hierarchies, our DAP further exploits the Monte Carlo tree search to dynamically calibrate HAF attention weights via a hybrid reward signal and further enhances the alignment between textual descriptions and local 3D geometry. During the inference, our 3DAlign-DAER introduces an Efficient Retrieval Strategy (ERS) to leverage efficient hierarchical searching in the large-scale embedding spaces, outperforming traditional methods (e.g., KNN) in accuracy and efficiency. Furthermore, to facilitate text-3D alignment research and train our 3DAlign-DAER, we construct Align3D-2M, a large-scale dataset featuring 2M text-3D pairs, to provide sufficient fine-grained cross-modal annotations. Extensive and comprehensive experiments demonstrate the superior performance of our 3DAlign-DAER on diverse benchmarks. We will release our codes, models, and datasets.

[424] VVS: Accelerating Speculative Decoding for Visual Autoregressive Generation via Partial Verification Skipping

Haotian Dong, Ye Li, Rongwei Lu, Chen Tang, Shu-Tao Xia, Zhi Wang

Main category: cs.CV

TL;DR: VVS accelerates visual autoregressive generation by skipping partial verification steps in speculative decoding, reducing target model forward passes by 2.8x while maintaining quality.

DetailsMotivation: Visual AR models have high inference latency due to next-token-prediction paradigm. Existing speculative decoding approaches don't sufficiently reduce forward passes due to their "draft one step, verify one step" paradigm.

Method: Proposes VVS framework with three modules: verification-free token selector with dynamical truncation, token-level feature caching and reuse, and fine-grained skipped step scheduling. Leverages visual token interchangeability to skip verification steps.

Result: Reduces number of target model forward passes by 2.8x compared to vanilla AR decoding while maintaining competitive generation quality.

Conclusion: VVS offers superior speed-quality trade-off over conventional SD frameworks and has strong potential to reshape the speculative decoding paradigm for visual AR generation.

Abstract: Visual autoregressive (AR) generation models have demonstrated strong potential for image generation, yet their next-token-prediction paradigm introduces considerable inference latency. Although speculative decoding (SD) has been proven effective for accelerating visual AR models, its “draft one step, then verify one step” paradigm prevents a direct reduction of the forward passes, thus restricting acceleration potential. Motivated by the visual token interchangeability, we for the first time to explore verification skipping in the SD process of visual AR model generation to explicitly cut the number of target model forward passes, thereby reducing inference latency. Based on an analysis of the drafting stage’s characteristics, we observe that verification redundancy and stale feature reusability are key factors to retain generation quality and speedup for verification-free steps. Inspired by these two observations, we propose a novel SD framework VVS to accelerate visual AR generation via partial verification skipping, which integrates three complementary modules: (1) a verification-free token selector with dynamical truncation, (2) token-level feature caching and reuse, and (3) fine-grained skipped step scheduling. Consequently, VVS reduces the number of target model forward passes by a factor of $2.8\times$ relative to vanilla AR decoding while maintaining competitive generation quality, offering a superior speed-quality trade-off over conventional SD frameworks and revealing strong potential to reshape the SD paradigm.

[425] Hybrid-Domain Adaptative Representation Learning for Gaze Estimation

Qida Tan, Hongyu Yang, Wenchao Du

Main category: cs.CV

TL;DR: HARL is a hybrid-domain adaptive learning framework for robust gaze estimation that disentangles gaze-relevant features from low-quality images by aligning with high-quality near-eye images and uses sparse graph fusion to model head-pose constraints.

DetailsMotivation: Current appearance-based gaze estimation methods suffer significant performance degradation in cross-domain scenarios due to interference from gaze-irrelevant factors like expressions, wearables, and image quality.

Method: Proposes Hybrid-domain Adaptative Representation Learning (HARL) that: 1) Disentangles gaze-relevant representation by aligning features from high-quality near-eye images in unsupervised domain adaptation, 2) Uses sparse graph fusion module to explore geometric constraints between gaze direction and head-pose.

Result: Achieves state-of-the-art accuracy of 5.02° on EyeDiap, 3.36° on MPIIFaceGaze, and 9.26° on Gaze360 datasets, with competitive cross-dataset performance.

Conclusion: HARL framework effectively learns robust gaze representation by combining hybrid-domain adaptation with geometric constraints, significantly improving cross-domain gaze estimation performance without additional computational costs.

Abstract: Appearance-based gaze estimation, aiming to predict accurate 3D gaze direction from a single facial image, has made promising progress in recent years. However, most methods suffer significant performance degradation in cross-domain evaluation due to interference from gaze-irrelevant factors, such as expressions, wearables, and image quality. To alleviate this problem, we present a novel Hybrid-domain Adaptative Representation Learning (shorted by HARL) framework that exploits multi-source hybrid datasets to learn robust gaze representation. More specifically, we propose to disentangle gaze-relevant representation from low-quality facial images by aligning features extracted from high-quality near-eye images in an unsupervised domain-adaptation manner, which hardly requires any computational or inference costs. Additionally, we analyze the effect of head-pose and design a simple yet efficient sparse graph fusion module to explore the geometric constraint between gaze direction and head-pose, leading to a dense and robust gaze representation. Extensive experiments on EyeDiap, MPIIFaceGaze, and Gaze360 datasets demonstrate that our approach achieves state-of-the-art accuracy of $\textbf{5.02}^{\circ}$ and $\textbf{3.36}^{\circ}$, and $\textbf{9.26}^{\circ}$ respectively, and present competitive performances through cross-dataset evaluation. The code is available at https://github.com/da60266/HARL.

[426] MRIQT: Physics-Aware Diffusion Model for Image Quality Transfer in Neonatal Ultra-Low-Field MRI

Malek Al Abed, Sebiha Demir, Anne Groteklaes, Elodie Germani, Shahrooz Faghihroohi, Hemmen Sabir, Shadi Albarqouni

Main category: cs.CV

TL;DR: MRIQT is a 3D diffusion model that enhances portable ultra-low-field MRI (0.064T) to high-field quality using physics-consistent simulation and perceptual losses, achieving 15.3% PSNR improvement over state-of-the-art methods.

DetailsMotivation: Portable ultra-low-field MRI offers accessible neonatal neuroimaging but suffers from poor diagnostic quality compared to high-field MRI, limiting its clinical utility.

Method: 3D conditional diffusion framework combining K-space degradation for physics-consistent simulation, v-prediction with classifier-free guidance, and SNR-weighted 3D perceptual loss using volumetric attention-UNet architecture.

Result: Achieved 15.3% PSNR improvement over state-of-the-art methods, with 85% of outputs rated as good quality by physicians with clear pathology present.

Conclusion: MRIQT enables high-fidelity enhancement of portable ultra-low-field MRI for reliable neonatal brain assessment through diffusion-based image quality transfer.

Abstract: Portable ultra-low-field MRI (uLF-MRI, 0.064 T) offers accessible neuroimaging for neonatal care but suffers from low signal-to-noise ratio and poor diagnostic quality compared to high-field (HF) MRI. We propose MRIQT, a 3D conditional diffusion framework for image quality transfer (IQT) from uLF to HF MRI. MRIQT combines realistic K-space degradation for physics-consistent uLF simulation, v-prediction with classifier-free guidance for stable image-to-image generation, and an SNR-weighted 3D perceptual loss for anatomical fidelity. The model denoises from a noised uLF input conditioned on the same scan, leveraging volumetric attention-UNet architecture for structure-preserving translation. Trained on a neonatal cohort with diverse pathologies, MRIQT surpasses recent GAN and CNN baselines in PSNR 15.3% with 1.78% over the state of the art, while physicians rated 85% of its outputs as good quality with clear pathology present. MRIQT enables high-fidelity, diffusion-based enhancement of portable ultra-low-field (uLF) MRI for deliable neonatal brain assessment.

[427] MMD-Thinker: Adaptive Multi-Dimensional Thinking for Multimodal Misinformation Detection

Junjie Wu, Guohong Fu

Main category: cs.CV

TL;DR: MMD-Thinker is a two-stage framework that uses adaptive multi-dimensional thinking to improve multimodal misinformation detection, addressing limitations of general-purpose MLLMs through task-specific instruction tuning and reinforcement learning.

DetailsMotivation: Multimodal misinformation is evolving rapidly with AIGC, posing serious societal threats. Current MLLM-based detectors suffer from insufficient reasoning (lack of task-specific knowledge) and reasoning biases (single thinking mode struggles with complex misinformation).

Method: 1) Develop tailor-designed thinking modes for misinformation detection; 2) Task-specific instruction tuning to inject thinking modes into MLLMs; 3) Reinforcement learning with mixed advantage function to enhance reasoning; 4) Construct MMR dataset with 8K+ image-text pairs with reasoning processes.

Result: Achieves state-of-the-art performance on both in-domain and out-of-domain benchmark datasets while maintaining flexible inference and token usage.

Conclusion: MMD-Thinker effectively addresses reasoning limitations in multimodal misinformation detection through adaptive multi-dimensional thinking and specialized training approaches.

Abstract: Multimodal misinformation floods on various social media, and continues to evolve in the era of AI-generated content (AIGC). The emerged misinformation with low creation cost and high deception poses significant threats to society. While recent studies leverage general-purpose multimodal large language models (MLLMs) to achieve remarkable results in detection, they encounter two critical limitations: (1) Insufficient reasoning, where general-purpose MLLMs often follow the uniform reasoning paradigm but generate inaccurate explanations and judgments, due to the lack of the task-specific knowledge of multimodal misinformation detection. (2) Reasoning biases, where a single thinking mode make detectors a suboptimal path for judgment, struggling to keep pace with the fast-growing and intricate multimodal misinformation. In this paper, we propose MMD-Thinker, a two-stage framework for multimodal misinformation detection through adaptive multi-dimensional thinking. First, we develop tailor-designed thinking mode for multimodal misinformation detection. Second, we adopt task-specific instruction tuning to inject the tailored thinking mode into general-purpose MLLMs. Third, we further leverage reinforcement learning strategy with a mixed advantage function, which incentivizes the reasoning capabilities in trajectories. Furthermore, we construct the multimodal misinformation reasoning (MMR) dataset, encompasses more than 8K image-text pairs with both reasoning processes and classification labels, to make progress in the relam of multimodal misinformation detection. Experimental results demonstrate that our proposed MMD-Thinker achieves state-of-the-art performance on both in-domain and out-of-domain benchmark datasets, while maintaining flexible inference and token usage. Code will be publicly available at Github.

[428] Referring Camouflaged Object Detection With Multi-Context Overlapped Windows Cross-Attention

Yu Wen, Shuyong Gao, Shuping Zhang, Miao Huang, Lili Tao, Han Yang, Haozhe Xing, Lihe Zhang, Boxue Hou

Main category: cs.CV

TL;DR: RFMNet improves camouflaged object detection by fusing multi-stage salient image features with camouflage features using overlapped windows cross-attention and progressive decoding.

DetailsMotivation: To enhance camouflaged object detection by leveraging rich salient image features and performing better local feature fusion for improved object matching.

Method: Proposes RFMNet with multi-stage feature fusion, overlapped windows cross-attention for local information matching, and Referring Feature Aggregation module for progressive decoding.

Result: Achieves state-of-the-art performance on Ref-COD benchmark through extensive experiments.

Conclusion: The proposed multi-context fusion approach with local attention mechanisms effectively improves camouflaged object detection performance.

Abstract: Referring camouflaged object detection (Ref-COD) aims to identify hidden objects by incorporating reference information such as images and text descriptions. Previous research has transformed reference images with salient objects into one-dimensional prompts, yielding significant results. We explore ways to enhance performance through multi-context fusion of rich salient image features and camouflaged object features. Therefore, we propose RFMNet, which utilizes features from multiple encoding stages of the reference salient images and performs interactive fusion with the camouflage features at the corresponding encoding stages. Given that the features in salient object images contain abundant object-related detail information, performing feature fusion within local areas is more beneficial for detecting camouflaged objects. Therefore, we propose an Overlapped Windows Cross-attention mechanism to enable the model to focus more attention on the local information matching based on reference features. Besides, we propose the Referring Feature Aggregation (RFA) module to decode and segment the camouflaged objects progressively. Extensive experiments on the Ref-COD benchmark demonstrate that our method achieves state-of-the-art performance.

[429] Alpha Divergence Losses for Biometric Verification

Dimitrios Koutsianos, Ladislav Mosner, Yannis Panagakis, Themos Stafylakis

Main category: cs.CV

TL;DR: The paper introduces two novel margin-based α-divergence losses (Q-Margin and A3M) for face and speaker verification that integrate angular margins into α-divergence loss functions, achieving significant performance gains especially at low false acceptance rates.

DetailsMotivation: Current margin-based softmax losses like CosFace and ArcFace drive performance in verification tasks, but α-divergence losses offer compelling alternatives for sparse solutions. However, integrating angular margins crucial for verification into α-divergence is not straightforward.

Method: Proposed two distinct ways to integrate angular margin into α-divergence: Q-Margin (margin in reference measure) and A3M (margin in logits). Addressed A3M training instability with prototype re-initialization strategy.

Result: Achieved significant performance gains on IJB-B and IJB-C face verification benchmarks and strong performance on VoxCeleb speaker verification. Models significantly outperform baselines at low false acceptance rates.

Conclusion: The proposed margin-based α-divergence losses provide crucial capability for practical high-security applications where minimizing false authentications is paramount, demonstrating superior performance especially at low FARs.

Abstract: Performance in face and speaker verification is largely driven by margin based softmax losses like CosFace and ArcFace. Recently introduced $α$-divergence loss functions offer a compelling alternative, particularly for their ability to induce sparse solutions (when $α>1$). However, integrating an angular margin-crucial for verification tasks-is not straightforward. We find this integration can be achieved in at least two distinct ways: via the reference measure (prior probabilities) or via the logits (unnormalized log-likelihoods). In this paper, we explore both pathways, deriving two novel margin-based $α$-divergence losses: Q-Margin (margin in the reference measure) and A3M (margin in the logits). We identify and address a critical training instability in A3M-caused by the interplay of penalized logits and sparsity-with a simple yet effective prototype re-initialization strategy. Our methods achieve significant performance gains on the challenging IJB-B and IJB-C face verification benchmarks. We demonstrate similarly strong performance in speaker verification on VoxCeleb. Crucially, our models significantly outperform strong baselines at low false acceptance rates (FAR). This capability is crucial for practical high-security applications, such as banking authentication, when minimizing false authentications is paramount.

[430] Building Egocentric Procedural AI Assistant: Methods, Benchmarks, and Challenges

Junlong Li, Huaiyuan Xu, Sijie Cheng, Kejun Wu, Kim-Hui Yap, Lap-Pui Chau, Yi Wang

Main category: cs.CV

TL;DR: This paper introduces EgoProceAssist, an egocentric procedural AI assistant for daily tasks, defining three core tasks: error detection, procedural learning, and question answering in first-person view.

DetailsMotivation: To address the gap in AI assistants for step-by-step procedural support in daily tasks from a first-person perspective, leveraging advances in vision language models and egocentric perception.

Method: Comprehensive review of current techniques, datasets, and metrics; novel experiments evaluating representative VLM-based methods; technical analysis and gap identification.

Result: Established a new taxonomy for egocentric procedural assistance, identified limitations of existing VLM-based approaches, and created an active repository for ongoing research collection.

Conclusion: The work highlights current challenges and suggests future research directions for developing effective egocentric procedural AI assistants, with continuous updates through a public repository.

Abstract: Driven by recent advances in vision language models (VLMs) and egocentric perception research, we introduce the concept of an egocentric procedural AI assistant (EgoProceAssist) tailored to step-by-step support daily procedural tasks in a first-person view. In this work, we start by identifying three core tasks: egocentric procedural error detection, egocentric procedural learning, and egocentric procedural question answering. These tasks define the essential functions of EgoProceAssist within a new taxonomy. Specifically, our work encompasses a comprehensive review of current techniques, relevant datasets, and evaluation metrics across these three core areas. To clarify the gap between the proposed EgoProceAssist and existing VLM-based AI assistants, we introduce novel experiments and provide a comprehensive evaluation of representative VLM-based methods. Based on these findings and our technical analysis, we discuss the challenges ahead and suggest future research directions. Furthermore, an exhaustive list of this study is publicly available in an active repository that continuously collects the latest work: https://github.com/z1oong/Building-Egocentric-Procedural-AI-Assistant

[431] SymGS : Leveraging Local Symmetries for 3D Gaussian Splatting Compression

Keshav Gupta, Akshat Sanghvi, Shreyas Reddy Palley, Astitva Srivastava, Charu Sharma, Avinash Sharma

Main category: cs.CV

TL;DR: SymGS introduces symmetry-aware compression for 3D Gaussian Splatting, using learnable mirrors to eliminate redundant primitives and achieve 108× compression while preserving rendering quality.

DetailsMotivation: 3D Gaussian Splatting has high memory footprint that scales with scene complexity, reaching several gigabytes. Existing compression methods exploit primitive-level redundancy but don't leverage symmetry.

Method: Proposes SymGS framework with learnable mirrors to eliminate local and global reflective redundancies. Works as plug-and-play enhancement to existing compression methods like HAC.

Result: Achieves 1.66× compression over HAC across benchmarks (up to 3× on large scenes) and 108× overall compression of 3DGS scenes while maintaining rendering quality.

Conclusion: Symmetry-aware compression through learnable mirrors effectively reduces memory footprint of 3D Gaussian Splatting beyond existing methods.

Abstract: 3D Gaussian Splatting has emerged as a transformative technique in novel view synthesis, primarily due to its high rendering speed and photorealistic fidelity. However, its memory footprint scales rapidly with scene complexity, often reaching several gigabytes. Existing methods address this issue by introducing compression strategies that exploit primitive-level redundancy through similarity detection and quantization. We aim to surpass the compression limits of such methods by incorporating symmetry-aware techniques, specifically targeting mirror symmetries to eliminate redundant primitives. We propose a novel compression framework, \textbf{\textit{SymGS}}, introducing learnable mirrors into the scene, thereby eliminating local and global reflective redundancies for compression. Our framework functions as a plug-and-play enhancement to state-of-the-art compression methods, (e.g. HAC) to achieve further compression. Compared to HAC, we achieve $1.66 \times$ compression across benchmark datasets (upto $3\times$ on large-scale scenes). On an average, SymGS enables $\bf{108\times}$ compression of a 3DGS scene, while preserving rendering quality. The project page and supplementary can be found at \textbf{\color{cyan}{symgs.github.io}}

[432] Is your VLM Sky-Ready? A Comprehensive Spatial Intelligence Benchmark for UAV Navigation

Lingfeng Zhang, Yuchen Zhang, Hongsheng Li, Haoxiang Fu, Yingbo Tang, Hangjun Ye, Long Chen, Xiaojun Liang, Xiaoshuai Hao, Wenbo Ding

Main category: cs.CV

TL;DR: SpatialSky-Bench is a benchmark for evaluating VLMs’ spatial intelligence in UAV navigation, revealing performance gaps. Sky-VLM, trained on the SpatialSky-Dataset, achieves SOTA results.

DetailsMotivation: Existing VLMs' spatial intelligence capabilities in UAV scenarios are unexplored, raising concerns about their effectiveness in dynamic environments.

Method: Created SpatialSky-Bench with 13 subcategories across Environmental Perception and Scene Understanding. Developed SpatialSky-Dataset with 1M samples and trained Sky-VLM for UAV spatial reasoning.

Result: Mainstream VLMs show unsatisfactory performance in complex UAV navigation. Sky-VLM achieves state-of-the-art performance across all benchmark tasks.

Conclusion: Sky-VLM paves the way for developing VLMs suitable for UAV scenarios, addressing spatial intelligence gaps in existing models.

Abstract: Vision-Language Models (VLMs), leveraging their powerful visual perception and reasoning capabilities, have been widely applied in Unmanned Aerial Vehicle (UAV) tasks. However, the spatial intelligence capabilities of existing VLMs in UAV scenarios remain largely unexplored, raising concerns about their effectiveness in navigating and interpreting dynamic environments. To bridge this gap, we introduce SpatialSky-Bench, a comprehensive benchmark specifically designed to evaluate the spatial intelligence capabilities of VLMs in UAV navigation. Our benchmark comprises two categories-Environmental Perception and Scene Understanding-divided into 13 subcategories, including bounding boxes, color, distance, height, and landing safety analysis, among others. Extensive evaluations of various mainstream open-source and closed-source VLMs reveal unsatisfactory performance in complex UAV navigation scenarios, highlighting significant gaps in their spatial capabilities. To address this challenge, we developed the SpatialSky-Dataset, a comprehensive dataset containing 1M samples with diverse annotations across various scenarios. Leveraging this dataset, we introduce Sky-VLM, a specialized VLM designed for UAV spatial reasoning across multiple granularities and contexts. Extensive experimental results demonstrate that Sky-VLM achieves state-of-the-art performance across all benchmark tasks, paving the way for the development of VLMs suitable for UAV scenarios. The source code is available at https://github.com/linglingxiansen/SpatialSKy.

[433] Recognition of Abnormal Events in Surveillance Videos using Weakly Supervised Dual-Encoder Models

Noam Tsfaty, Avishai Weizman, Liav Cohen, Moshe Tshuva, Yehudit Aperstein

Main category: cs.CV

TL;DR: A dual-backbone framework combining CNN and transformer features with top-k pooling achieves 90.7% AUC for video-level anomaly detection in surveillance videos.

DetailsMotivation: To address the challenge of detecting rare and diverse anomalies in surveillance videos using only video-level supervision without frame-level annotations.

Method: Proposes a dual-backbone framework that combines convolutional neural network (CNN) and transformer representations through top-k pooling mechanism.

Result: Achieves 90.7% area under the curve (AUC) on the UCF-Crime dataset for anomaly detection.

Conclusion: The dual-backbone approach with top-k pooling effectively detects diverse anomalies in surveillance videos using only video-level supervision.

Abstract: We address the challenge of detecting rare and diverse anomalies in surveillance videos using only video-level supervision. Our dual-backbone framework combines convolutional and transformer representations through top-k pooling, achieving 90.7% area under the curve (AUC) on the UCF-Crime dataset.

[434] SF-Recon: Simplification-Free Lightweight Building Reconstruction via 3D Gaussian Splatting

Zihan Li, Tengfei Wang, Wentian Gan, Hao Zhan, Xin Wang, Zongqian Zhan

Main category: cs.CV

TL;DR: SF-Recon directly reconstructs lightweight building surfaces from multi-view images using 3D Gaussian Splatting and normal-gradient-guided optimization, eliminating the need for post-hoc mesh simplification.

DetailsMotivation: Conventional multi-view geometry pipelines are cumbersome and quality-sensitive due to their reliance on dense reconstruction, meshing, and subsequent simplification. There's a need for direct lightweight building surface reconstruction.

Method: Train initial 3D Gaussian Splatting field, use normal-gradient-guided Gaussian optimization to select primitives aligned with building structures, apply multi-view edge-consistency pruning, and perform multi-view depth-constrained Delaunay triangulation.

Result: SF-Recon achieves substantially fewer faces and vertices while maintaining computational efficiency, directly reconstructing lightweight building models from multi-view imagery.

Conclusion: The method successfully reconstructs lightweight building surfaces without post-hoc simplification, demonstrating improved efficiency and structural fidelity compared to conventional pipelines.

Abstract: Lightweight building surface models are crucial for digital city, navigation, and fast geospatial analytics, yet conventional multi-view geometry pipelines remain cumbersome and quality-sensitive due to their reliance on dense reconstruction, meshing, and subsequent simplification. This work presents SF-Recon, a method that directly reconstructs lightweight building surfaces from multi-view images without post-hoc mesh simplification. We first train an initial 3D Gaussian Splatting (3DGS) field to obtain a view-consistent representation. Building structure is then distilled by a normal-gradient-guided Gaussian optimization that selects primitives aligned with roof and wall boundaries, followed by multi-view edge-consistency pruning to enhance structural sharpness and suppress non-structural artifacts without external supervision. Finally, a multi-view depth-constrained Delaunay triangulation converts the structured Gaussian field into a lightweight, structurally faithful building mesh. Based on a proposed SF dataset, the experimental results demonstrate that our SF-Recon can directly reconstruct lightweight building models from multi-view imagery, achieving substantially fewer faces and vertices while maintaining computational efficiency. Website:https://lzh282140127-cell.github.io/SF-Recon-project/

[435] Towards Metric-Aware Multi-Person Mesh Recovery by Jointly Optimizing Human Crowd in Camera Space

Kaiwen Wang, Kaili Zheng, Yiming Shi, Chenyi Guo, Ji Wu

Main category: cs.CV

TL;DR: DTO-Humans: A novel method for generating scene-consistent multi-person human mesh pseudo-ground-truth using depth-conditioned optimization and a new metric-aware HMR network.

DetailsMotivation: Current multi-person human mesh recovery methods lack scene-level consistency due to single-person-centric pseudo-ground-truth generation, leading to conflicting depths and scales within the same image.

Method: Introduces Depth-conditioned Translation Optimization (DTO) for joint refinement of camera-space translations using anthropometric priors and monocular depth cues, plus Metric-Aware HMR network with relative metric loss.

Result: Created DTO-Humans dataset with 0.56M high-quality, scene-consistent multi-person images (avg 4.8 persons/image), achieving state-of-the-art performance on relative depth reasoning and mesh recovery.

Conclusion: The proposed approach successfully addresses scene-level consistency in multi-person mesh recovery and enables direct metric-scale estimation through joint optimization and metric-aware learning.

Abstract: Multi-person human mesh recovery from a single image is a challenging task, hindered by the scarcity of in-the-wild training data. Prevailing in-the-wild human mesh pseudo-ground-truth (pGT) generation pipelines are single-person-centric, where each human is processed individually without joint optimization. This oversight leads to a lack of scene-level consistency, producing individuals with conflicting depths and scales within the same image. To address this, we introduce Depth-conditioned Translation Optimization (DTO), a novel optimization-based method that jointly refines the camera-space translations of all individuals in a crowd. By leveraging anthropometric priors on human height and depth cues from a monocular depth estimator, DTO solves for a scene-consistent placement of all subjects within a principled Maximum a posteriori (MAP) framework. Applying DTO to the 4D-Humans dataset, we construct DTO-Humans, a new large-scale pGT dataset of 0.56M high-quality, scene-consistent multi-person images, featuring dense crowds with an average of 4.8 persons per image. Furthermore, we propose Metric-Aware HMR, an end-to-end network that directly estimates human mesh and camera parameters in metric scale. This is enabled by a camera branch and a novel relative metric loss that enforces plausible relative scales. Extensive experiments demonstrate that our method achieves state-of-the-art performance on relative depth reasoning and human mesh recovery. Code and data will be released publicly.

[436] UnSAMv2: Self-Supervised Learning Enables Segment Anything at Any Granularity

Junwei Yu, Trevor Darrell, XuDong Wang

Main category: cs.CV

TL;DR: UnSAMv2 enables granularity control in segmentation without human annotations, improving SAM-2 performance across multiple benchmarks using minimal unlabeled data.

DetailsMotivation: SAM models lack granularity control, requiring manual refinement and dense annotations which are expensive and ambiguous.

Method: Extends divide-and-conquer strategy with mask-granularity pair discovery and granularity control embedding for continuous scale control.

Result: Achieves significant improvements: NoC90 (5.69→4.75), 1-IoU (58.0→73.1), AR1000 (49.6→68.3) across 11 benchmarks with only 6K unlabeled images.

Conclusion: Small unlabeled datasets with granularity-aware self-supervised learning can unlock vision foundation model potential.

Abstract: The Segment Anything Model (SAM) family has become a widely adopted vision foundation model, but its ability to control segmentation granularity remains limited. Users often need to refine results manually - by adding more prompts or selecting from pre-generated masks - to achieve the desired level of detail. This process can be ambiguous, as the same prompt may correspond to several plausible masks, and collecting dense annotations across all granularities is prohibitively expensive, making supervised solutions infeasible. To address this limitation, we introduce UnSAMv2, which enables segment anything at any granularity without human annotations. UnSAMv2 extends the divide-and-conquer strategy of UnSAM by discovering abundant mask-granularity pairs and introducing a novel granularity control embedding that enables precise, continuous control over segmentation scale. Remarkably, with only $6$K unlabeled images and $0.02%$ additional parameters, UnSAMv2 substantially enhances SAM-2, achieving segment anything at any granularity across interactive, whole-image, and video segmentation tasks. Evaluated on over $11$ benchmarks, UnSAMv2 improves $\text{NoC}{90}$ (5.69 $\rightarrow$ 4.75), 1-IoU (58.0 $\rightarrow$ 73.1), and $\text{AR}{1000}$ (49.6 $\rightarrow$ 68.3), showing that small amounts of unlabeled data with a granularity-aware self-supervised learning method can unlock the potential of vision foundation models.

[437] TabFlash: Efficient Table Understanding with Progressive Question Conditioning and Token Focusing

Jongha Kim, Minseong Bae, Sanghyeok Lee, Jinsung Yoon, Hyunwoo J. Kim

Main category: cs.CV

TL;DR: TabFlash is an efficient MLLM for table understanding that uses progressive question conditioning, token pruning, and token focusing to create compact yet informative visual features, achieving SOTA performance with reduced computational costs.

DetailsMotivation: Table images have unique challenges including redundant background regions and the need for question-specific focus. Existing MLLMs overlook these characteristics, resulting in uninformative and redundant visual representations.

Method: Three key approaches: 1) Progressive question conditioning - injects questions into Vision Transformer layers with increasing frequency; 2) Token pruning - discards background tokens to reduce redundancy; 3) Token focusing - training strategy to concentrate essential information in retained tokens.

Result: TabFlash achieves state-of-the-art performance, outperforming both open-source and proprietary MLLMs, while requiring 27% less FLOPs and 30% less memory usage compared to the second-best MLLM.

Conclusion: The proposed TabFlash model effectively addresses table understanding challenges by generating informative and compact visual features through progressive conditioning, pruning, and focusing strategies, achieving superior efficiency and performance.

Abstract: Table images present unique challenges for effective and efficient understanding due to the need for question-specific focus and the presence of redundant background regions. Existing Multimodal Large Language Model (MLLM) approaches often overlook these characteristics, resulting in uninformative and redundant visual representations. To address these issues, we aim to generate visual features that are both informative and compact to improve table understanding. We first propose progressive question conditioning, which injects the question into Vision Transformer layers with gradually increasing frequency, considering each layer’s capacity to handle additional information, to generate question-aware visual features. To reduce redundancy, we introduce a pruning strategy that discards background tokens, thereby improving efficiency. To mitigate information loss from pruning, we further propose token focusing, a training strategy that encourages the model to concentrate essential information in the retained tokens. By combining these approaches, we present TabFlash, an efficient and effective MLLM for table understanding. TabFlash achieves state-of-the-art performance, outperforming both open-source and proprietary MLLMs, while requiring 27% less FLOPs and 30% less memory usage compared to the second-best MLLM.

[438] SkyReels-Text: Fine-grained Font-Controllable Text Editing for Poster Design

Yunjie Yu, Jingchen Wu, Junchen Zhu, Chunze Lin, Guibin Chen

Main category: cs.CV

TL;DR: SkyReels-Text is a font-controllable framework for precise poster text editing that enables simultaneous editing of multiple text regions with different typographic styles while preserving non-edited areas, without requiring font labels or fine-tuning.

DetailsMotivation: Current image editing models lack fine-grained, font-aware text manipulation capabilities needed for professional design workflows like poster editing, where precise text modification while preserving visual harmony is crucial.

Method: Users provide cropped glyph patches of desired typography as input. The framework enables simultaneous editing of multiple text regions with distinct fonts, preserving non-edited regions, without requiring font labels or fine-tuning during inference.

Result: Achieves state-of-the-art performance on multiple datasets including handwritten text benchmarks, with superior text fidelity and visual realism, offering unprecedented control over font families and stylistic nuances.

Conclusion: Bridges the gap between general-purpose image editing and professional-grade typographic design by providing precise font-controllable text editing capabilities.

Abstract: Artistic design such as poster design often demands rapid yet precise modification of textual content while preserving visual harmony and typographic intent, especially across diverse font styles. Although modern image editing models have grown increasingly powerful, they still fall short in fine-grained, font-aware text manipulation, limiting their utility in professional design workflows such as poster editing. To address this issue, we present SkyReels-Text, a novel font-controllable framework for precise poster text editing. Our method enables simultaneous editing of multiple text regions, each rendered in distinct typographic styles, while preserving the visual appearance of non-edited regions. Notably, our model requires neither font labels nor fine-tuning during inference: users can simply provide cropped glyph patches corresponding to their desired typography, even if the font is not included in any standard library. Extensive experiments on multiple datasets, including handwrittent text benchmarks, SkyReels-Text achieves state-of-the-art performance in both text fidelity and visual realism, offering unprecedented control over font families, and stylistic nuances. This work bridges the gap between general-purpose image editing and professional-grade typographic design.

[439] CorrectAD: A Self-Correcting Agentic System to Improve End-to-end Planning in Autonomous Driving

Enhui Ma, Lijun Zhou, Tao Tang, Jiahuan Zhang, Junpeng Jiang, Zhan Zhang, Dong Han, Kun Zhan, Xueyang Zhang, XianPeng Lang, Haiyang Sun, Xia Zhou, Di Lin, Kaicheng Yu

Main category: cs.CV

TL;DR: DriveSora enables automated self-correction of autonomous driving failures by generating synthetic training data aligned with 3D layouts, reducing collisions by 27-39%.

DetailsMotivation: Address the robustness issues in end-to-end autonomous driving systems caused by rare but safety-critical failure cases (long-tail problem).

Method: Propose CorrectAD system with PM-Agent for data requirements, DriveSora for generating spatiotemporally consistent videos aligned with 3D layouts, and automated annotation.

Result: Corrects 62.5% and 49.8% of failure cases on nuScenes and in-house datasets, reducing collision rates by 39% and 27% respectively across multiple planners.

Conclusion: The proposed end-to-end model-agnostic pipeline effectively self-corrects autonomous driving failures using generative world models and structured 3D layouts.

Abstract: End-to-end planning methods are the de facto standard of the current autonomous driving system, while the robustness of the data-driven approaches suffers due to the notorious long-tail problem (i.e., rare but safety-critical failure cases). In this work, we explore whether recent diffusion-based video generation methods (a.k.a. world models), paired with structured 3D layouts, can enable a fully automated pipeline to self-correct such failure cases. We first introduce an agent to simulate the role of product manager, dubbed PM-Agent, which formulates data requirements to collect data similar to the failure cases. Then, we use a generative model that can simulate both data collection and annotation. However, existing generative models struggle to generate high-fidelity data conditioned on 3D layouts. To address this, we propose DriveSora, which can generate spatiotemporally consistent videos aligned with the 3D annotations requested by PM-Agent. We integrate these components into our self-correcting agentic system, CorrectAD. Importantly, our pipeline is an end-to-end model-agnostic and can be applied to improve any end-to-end planner. Evaluated on both nuScenes and a more challenging in-house dataset across multiple end-to-end planners, CorrectAD corrects 62.5% and 49.8% of failure cases, reducing collision rates by 39% and 27%, respectively.

[440] DriveLiDAR4D: Sequential and Controllable LiDAR Scene Generation for Autonomous Driving

Kaiwen Cai, Xinze Liu, Xia Zhou, Hengtong Hu, Jie Xiang, Luyao Zhang, Xueyang Zhang, Kun Zhan, Yifei Zhan, Xianpeng Lang

Main category: cs.CV

TL;DR: DriveLiDAR4D is a novel LiDAR generation pipeline that produces temporally consistent LiDAR scenes with controllable foreground objects and realistic backgrounds, achieving state-of-the-art performance on nuScenes and KITTI datasets.

DetailsMotivation: Existing 3D LiDAR point cloud generation methods lack sequential generation capabilities and cannot produce accurately positioned foreground objects with realistic backgrounds, limiting their practical applicability in autonomous driving systems.

Method: Proposes DriveLiDAR4D pipeline with multimodal conditions and a novel sequential noise prediction model called LiDAR4DNet, enabling end-to-end sequential generation of LiDAR scenes with full scene manipulation capability.

Result: Achieved FRD score of 743.13 and FVD score of 16.96 on nuScenes dataset, outperforming SOTA method UniScene by 37.2% in FRD and 24.1% in FVD respectively.

Conclusion: DriveLiDAR4D is the first work to address sequential generation of LiDAR scenes with full scene manipulation capability, demonstrating significant improvements over existing methods for autonomous driving applications.

Abstract: The generation of realistic LiDAR point clouds plays a crucial role in the development and evaluation of autonomous driving systems. Although recent methods for 3D LiDAR point cloud generation have shown significant improvements, they still face notable limitations, including the lack of sequential generation capabilities and the inability to produce accurately positioned foreground objects and realistic backgrounds. These shortcomings hinder their practical applicability. In this paper, we introduce DriveLiDAR4D, a novel LiDAR generation pipeline consisting of multimodal conditions and a novel sequential noise prediction model LiDAR4DNet, capable of producing temporally consistent LiDAR scenes with highly controllable foreground objects and realistic backgrounds. To the best of our knowledge, this is the first work to address the sequential generation of LiDAR scenes with full scene manipulation capability in an end-to-end manner. We evaluated DriveLiDAR4D on the nuScenes and KITTI datasets, where we achieved an FRD score of 743.13 and an FVD score of 16.96 on the nuScenes dataset, surpassing the current state-of-the-art (SOTA) method, UniScene, with an performance boost of 37.2% in FRD and 24.1% in FVD, respectively.

[441] YOLO Meets Mixture-of-Experts: Adaptive Expert Routing for Robust Object Detection

Ori Meiraz, Sharon Shalev, Avishai Weizman

Main category: cs.CV

TL;DR: Novel Mixture-of-Experts framework for object detection using adaptive routing among multiple YOLOv9-T experts to improve performance.

DetailsMotivation: To enhance object detection performance by enabling dynamic feature specialization through multiple specialized experts rather than relying on a single model.

Method: Mixture-of-Experts framework with adaptive routing mechanism that dynamically selects among multiple YOLOv9-T experts for different input features.

Result: Achieved higher mean Average Precision (mAP) and Average Recall (AR) compared to a single YOLOv9-T model.

Conclusion: The Mixture-of-Experts approach with adaptive routing effectively improves object detection performance by leveraging specialized feature processing from multiple experts.

Abstract: This paper presents a novel Mixture-of-Experts framework for object detection, incorporating adaptive routing among multiple YOLOv9-T experts to enable dynamic feature specialization and achieve higher mean Average Precision (mAP) and Average Recall (AR) compared to a single YOLOv9-T model.

[442] What Color Is It? A Text-Interference Multimodal Hallucination Benchmark

Jinkun Zhao, Lei Huang, Wenjun Wu

Main category: cs.CV

TL;DR: MLMs suffer from visual perception interference, especially in color perception, leading to hallucination risks. A ‘What Color Is It’ dataset is created to trigger and study this issue, with solutions proposed for robustness.

DetailsMotivation: To address the susceptibility of Multimodal Large Models (MLMs) to informational interference in visual perception, particularly color perception, which increases hallucination risks.

Method: Introduce the ‘What Color Is It’ dataset using a simple method to trigger single-modality visual hallucination in MLMs, and investigate underlying causes.

Result: Validation of the hypothesis that MLMs are prone to visual perception interference, with the dataset successfully triggering hallucinations.

Conclusion: Proposed potential solutions to enhance the robustness of MLMs against visual modality hallucinations.

Abstract: With the rapid advancement of Large Models, numerous text-and-vision-fused Multimodal Large Models (MLMs) have emerged. However, these MLMs remain susceptible to informational interference in visual perception, particularly in color perception, which introduces an additional risk of hallucination. To validate this hypothesis, we introduce the “What Color Is It” dataset, a novel benchmark constructed using a simple method to trigger single-modality visual hallucination in MLMs. Based on this dataset, we further investigate the underlying causes of hallucination in the visual modality of MLMs and propose potential solutions to enhance their robustness.

[443] Delineate Anything Flow: Fast, Country-Level Field Boundary Detection from Any Source

Mykola Lavreniuk, Nataliia Kussul, Andrii Shelestov, Yevhenii Salii, Volodymyr Kuzin, Sergii Skakun, Zoltan Szantoi

Main category: cs.CV

TL;DR: DelAnyFlow is a resolution-agnostic method for large-scale agricultural field boundary mapping that combines the DelAny instance segmentation model with post-processing to generate topologically consistent vector boundaries, achieving state-of-the-art accuracy and scalability.

DetailsMotivation: Existing methods for agricultural field boundary delineation from satellite imagery often produce incomplete boundaries, merge adjacent fields, and struggle to scale effectively for large-scale applications.

Method: DelAnyFlow combines the DelAny instance segmentation model (based on YOLOv11 backbone) trained on the FBIS 22M dataset with structured post-processing, merging, and vectorization sequence. FBIS 22M contains 672,909 multi-resolution image patches and 22.9 million validated field instances.

Result: DelAny model achieves over 100% higher mAP and 400x faster inference than SAM2. DelAnyFlow generated a complete field boundary layer for Ukraine (603,000km²) in under 6 hours, delineating 3.75M fields at 5m and 5.15M at 2.5m resolution - significantly more than existing operational products.

Conclusion: DelAnyFlow provides a scalable, cost-effective methodology for field delineation in regions lacking digital cadastral data, with strong zero-shot generalization capabilities and support for national-scale applications.

Abstract: Accurate delineation of agricultural field boundaries from satellite imagery is essential for land management and crop monitoring, yet existing methods often produce incomplete boundaries, merge adjacent fields, and struggle to scale. We present the Delineate Anything Flow (DelAnyFlow) methodology, a resolution-agnostic approach for large-scale field boundary mapping. DelAnyFlow combines the DelAny instance segmentation model, based on a YOLOv11 backbone and trained on the large-scale Field Boundary Instance Segmentation-22M (FBIS 22M) dataset, with a structured post-processing, merging, and vectorization sequence to generate topologically consistent vector boundaries. FBIS 22M, the largest dataset of its kind, contains 672,909 multi-resolution image patches (0.25-10m) and 22.9million validated field instances. The DelAny model delivers state-of-the-art accuracy with over 100% higher mAP and 400x faster inference than SAM2. DelAny demonstrates strong zero-shot generalization and supports national-scale applications: using Sentinel 2 data for 2024, DelAnyFlow generated a complete field boundary layer for Ukraine (603,000km2) in under six hours on a single workstation. DelAnyFlow outputs significantly improve boundary completeness relative to operational products from Sinergise Solutions and NASA Harvest, particularly in smallholder and fragmented systems (0.25-1ha). For Ukraine, DelAnyFlow delineated 3.75M fields at 5m and 5.15M at 2.5m, compared to 2.66M detected by Sinergise Solutions and 1.69M by NASA Harvest. This work delivers a scalable, cost-effective methodology for field delineation in regions lacking digital cadastral data. A project landing page with links to model weights, code, national-scale vector outputs, and dataset is available at https://lavreniuk.github.io/Delineate-Anything/.

[444] VOPE: Revisiting Hallucination of Vision-Language Models in Voluntary Imagination Task

Xingming Long, Jie Zhang, Shiguang Shan, Xilin Chen

Main category: cs.CV

TL;DR: VOPE introduces a new method to evaluate hallucinations in LVLMs during voluntary imagination tasks (like story writing) by checking if models correctly interpret the presence of imagined objects in their own responses.

DetailsMotivation: Current hallucination research focuses on factual tasks where any content not in the image is considered hallucination, but this approach is inappropriate for voluntary imagination tasks where novel content generation is expected.

Method: VOPE uses recheck-based questions to evaluate how LVLMs interpret the presence of imagined objects in their own responses, measuring consistency between model interpretation and actual object presence in images.

Result: Most LVLMs hallucinate heavily during voluntary imagination and perform poorly on presence evaluation for imagined objects; existing hallucination mitigation methods show limited effectiveness.

Conclusion: Voluntary imagination tasks present significant hallucination challenges that current methods cannot adequately address, highlighting an important research direction for future work.

Abstract: Most research on hallucinations in Large Vision-Language Models (LVLMs) focuses on factual description tasks that prohibit any output absent from the image. However, little attention has been paid to hallucinations in voluntary imagination tasks, e.g., story writing, where the models are expected to generate novel content beyond the given image. In these tasks, it is inappropriate to simply regard such imagined novel content as hallucinations. To address this limitation, we introduce Voluntary-imagined Object Presence Evaluation (VOPE)-a novel method to assess LVLMs’ hallucinations in voluntary imagination tasks via presence evaluation. Specifically, VOPE poses recheck-based questions to evaluate how an LVLM interprets the presence of the imagined objects in its own response. The consistency between the model’s interpretation and the object’s presence in the image is then used to determine whether the model hallucinates when generating the response. We apply VOPE to several mainstream LVLMs and hallucination mitigation methods, revealing two key findings: (1) most LVLMs hallucinate heavily during voluntary imagination, and their performance in presence evaluation is notably poor on imagined objects; (2) existing hallucination mitigation methods show limited effect in voluntary imagination tasks, making this an important direction for future research.

[445] FUSE: A Flow-based Mapping Between Shapes

Lorenzo Olearo, Giulio Viganò, Daniele Baieri, Filippo Maggioli, Simone Melzi

Main category: cs.CV

TL;DR: A neural flow-matching framework for 3D shape mapping that is efficient, invertible, and works across different shape representations without requiring large training datasets.

DetailsMotivation: To create a computationally efficient and versatile shape matching method that works across different 3D representations (point clouds, meshes, SDFs, volumes) without extensive data-driven training.

Method: Represents 3D shapes as probability distributions via continuous invertible flows from a fixed anchor distribution, then composes inverse and forward flows to map between source and target shapes using task-tailored embeddings.

Result: Achieves high coverage and accuracy in shape matching benchmarks, and also shows promising results in UV mapping and point cloud registration tasks.

Conclusion: The flow-matching framework provides an effective, modality-agnostic approach for 3D shape mapping that works well across diverse representations and challenging scenarios.

Abstract: We introduce a novel neural representation for maps between 3D shapes based on flow-matching models, which is computationally efficient and supports cross-representation shape matching without large-scale training or data-driven procedures. 3D shapes are represented as the probability distribution induced by a continuous and invertible flow mapping from a fixed anchor distribution. Given a source and a target shape, the composition of the inverse flow (source to anchor) with the forward flow (anchor to target), we continuously map points between the two surfaces. By encoding the shapes with a pointwise task-tailored embedding, this construction provides an invertible and modality-agnostic representation of maps between shapes across point clouds, meshes, signed distance fields (SDFs), and volumetric data. The resulting representation consistently achieves high coverage and accuracy across diverse benchmarks and challenging settings in shape matching. Beyond shape matching, our framework shows promising results in other tasks, including UV mapping and registration of raw point cloud scans of human bodies.

[446] InterMoE: Individual-Specific 3D Human Interaction Generation via Dynamic Temporal-Selective MoE

Lipeng Wang, Hongxing Fan, Haohua Chen, Zehuan Huang, Lu Sheng

Main category: cs.CV

TL;DR: InterMoE is a novel framework using Dynamic Temporal-Selective Mixture of Experts for high-fidelity 3D human interaction generation, achieving state-of-the-art performance by preserving individual characteristics while ensuring semantic fidelity.

DetailsMotivation: Existing methods fail to preserve unique individual characteristics or fully adhere to textual descriptions in human interaction generation, which is valuable for virtual reality and robotics applications.

Method: Built on Dynamic Temporal-Selective Mixture of Experts with a routing mechanism that uses both high-level text semantics and low-level motion context to dispatch temporal motion features to specialized experts, allowing dynamic capacity determination and focus on critical temporal features.

Result: Achieves state-of-the-art performance, reducing FID scores by 9% on InterHuman dataset and 22% on InterX dataset for individual-specific high-fidelity 3D human interaction generation.

Conclusion: InterMoE effectively addresses the challenges of preserving individual characteristics and semantic fidelity in human interaction generation through its specialized expert routing mechanism.

Abstract: Generating high-quality human interactions holds significant value for applications like virtual reality and robotics. However, existing methods often fail to preserve unique individual characteristics or fully adhere to textual descriptions. To address these challenges, we introduce InterMoE, a novel framework built on a Dynamic Temporal-Selective Mixture of Experts. The core of InterMoE is a routing mechanism that synergistically uses both high-level text semantics and low-level motion context to dispatch temporal motion features to specialized experts. This allows experts to dynamically determine the selection capacity and focus on critical temporal features, thereby preserving specific individual characteristic identities while ensuring high semantic fidelity. Extensive experiments show that InterMoE achieves state-of-the-art performance in individual-specific high-fidelity 3D human interaction generation, reducing FID scores by 9% on the InterHuman dataset and 22% on InterX.

[447] Language-Guided Invariance Probing of Vision-Language Models

Jae Joong Lee

Main category: cs.CV

TL;DR: LGIP benchmark evaluates VLMs’ linguistic robustness by testing invariance to paraphrases and sensitivity to semantic flips, revealing performance gaps not captured by standard metrics.

DetailsMotivation: Current VLMs show strong zero-shot performance but their reliability under controlled linguistic perturbations is unclear, requiring systematic evaluation beyond conventional accuracy scores.

Method: Language-Guided Invariance Probing (LGIP) uses 40k MS COCO images with human captions to automatically generate paraphrases and rule-based semantic flips (object category, color, count changes), measuring invariance error, semantic sensitivity gap, and positive-rate statistics.

Result: EVA02-CLIP and large OpenCLIP variants show favorable invariance-sensitivity balance, while SigLIP/SigLIP2 exhibit high invariance error and often prefer flipped captions over human descriptions, especially for object and color edits.

Conclusion: LGIP provides crucial diagnostic for VLM linguistic robustness, revealing failures invisible to standard retrieval metrics and highlighting the need for systematic evaluation of linguistic sensitivity.

Abstract: Recent vision-language models (VLMs) such as CLIP, OpenCLIP, EVA02-CLIP and SigLIP achieve strong zero-shot performance, but it is unclear how reliably they respond to controlled linguistic perturbations. We introduce Language-Guided Invariance Probing (LGIP), a benchmark that measures (i) invariance to meaning-preserving paraphrases and (ii) sensitivity to meaning-changing semantic flips in image-text matching. Using 40k MS COCO images with five human captions each, we automatically generate paraphrases and rule-based flips that alter object category, color or count, and summarize model behavior with an invariance error, a semantic sensitivity gap and a positive-rate statistic. Across nine VLMs, EVA02-CLIP and large OpenCLIP variants lie on a favorable invariance-sensitivity frontier, combining low paraphrase-induced variance with consistently higher scores for original captions than for their flipped counterparts. In contrast, SigLIP and SigLIP2 show much larger invariance error and often prefer flipped captions to the human descriptions, especially for object and color edits. These failures are largely invisible to standard retrieval metrics, indicating that LGIP provides a model-agnostic diagnostic for the linguistic robustness of VLMs beyond conventional accuracy scores.

[448] Mapping the Vanishing and Transformation of Urban Villages in China

Wenyu Zhang, Yao Tong, Yiqiu Liu, Rui Cao

Main category: cs.CV

TL;DR: Deep learning framework monitors urban village redevelopment in China using remote sensing imagery, revealing prolonged processes and three transformation pathways.

DetailsMotivation: Lack of systematic evaluation of whether demolished urban villages have been effectively reused, raising concerns about efficacy and sustainability of current redevelopment practices.

Method: Semantic segmentation of multi-temporal remote sensing imagery to map UV boundaries, then classifying post-demolition land use into six categories across four representative Chinese cities.

Result: UV redevelopment processes were frequently prolonged; transitions primarily in peripheral areas; three spatiotemporal transformation pathways identified: synchronized, delayed, and gradual optimization.

Conclusion: UV redevelopment is fragmented, complex and nonlinear, requiring tiered and context-sensitive planning strategies for more inclusive, efficient, and sustainable urban renewal.

Abstract: Urban villages (UVs), informal settlements embedded within China’s urban fabric, have undergone widespread demolition and redevelopment in recent decades. However, there remains a lack of systematic evaluation of whether the demolished land has been effectively reused, raising concerns about the efficacy and sustainability of current redevelopment practices. To address the gap, this study proposes a deep learning-based framework to monitor the spatiotemporal changes of UVs in China. Specifically, semantic segmentation of multi-temporal remote sensing imagery is first used to map evolving UV boundaries, and then post-demolition land use is classified into six categories based on the “remained-demolished-redeveloped” phase: incomplete demolition, vacant land, construction sites, buildings, green spaces, and others. Four representative cities from China’s four economic regions were selected as the study areas, i.e., Guangzhou (East), Zhengzhou (Central), Xi’an (West), and Harbin (Northeast). The results indicate: 1) UV redevelopment processes were frequently prolonged; 2) redevelopment transitions primarily occurred in peripheral areas, whereas urban cores remained relatively stable; and 3) three spatiotemporal transformation pathways, i.e., synchronized redevelopment, delayed redevelopment, and gradual optimization, were revealed. This study highlights the fragmented, complex and nonlinear nature of UV redevelopment, underscoring the need for tiered and context-sensitive planning strategies. By linking spatial dynamics with the context of redevelopment policies, the findings offer valuable empirical insights that support more inclusive, efficient, and sustainable urban renewal, while also contributing to a broader global understanding of informal settlement transformations.

[449] Minimax Multi-Target Conformal Prediction with Applications to Imaging Inverse Problems

Jeffrey Wen, Rizwan Ahmad, Philip Schniter

Main category: cs.CV

TL;DR: Proposes an asymptotically minimax approach for multi-target conformal prediction in ill-posed imaging inverse problems, providing tight prediction intervals with joint marginal coverage.

DetailsMotivation: Existing conformal prediction methods only handle scalar estimation targets, but practical applications often involve multiple targets, creating a need for multi-target uncertainty quantification.

Method: Developed an asymptotically minimax approach to multi-target conformal prediction that ensures joint marginal coverage while providing tight prediction intervals.

Result: Numerical demonstrations using synthetic and MRI data show benefits over existing multi-target conformal prediction methods.

Conclusion: The proposed minimax approach effectively addresses multi-target uncertainty quantification in imaging inverse problems and can be applied to various applications including multi-metric image quality assessment and multi-task uncertainty quantification.

Abstract: In ill-posed imaging inverse problems, uncertainty quantification remains a fundamental challenge, especially in safety-critical applications. Recently, conformal prediction has been used to quantify the uncertainty that the inverse problem contributes to downstream tasks like image classification, image quality assessment, fat mass quantification, etc. While existing works handle only a scalar estimation target, practical applications often involve multiple targets. In response, we propose an asymptotically minimax approach to multi-target conformal prediction that provides tight prediction intervals while ensuring joint marginal coverage. We then outline how our minimax approach can be applied to multi-metric blind image quality assessment, multi-task uncertainty quantification, and multi-round measurement acquisition. Finally, we numerically demonstrate the benefits of our minimax method, relative to existing multi-target conformal prediction methods, using both synthetic and magnetic resonance imaging (MRI) data.

[450] Accuracy is Not Enough: Poisoning Interpretability in Federated Learning via Color Skew

Farhin Farhad Riya, Shahinul Hoque, Jinyuan Stella Sun, Olivera Kotevska

Main category: cs.CV

TL;DR: Adversarial color perturbations in federated learning can manipulate model saliency maps without affecting accuracy, compromising interpretability while maintaining correct predictions.

DetailsMotivation: To reveal vulnerabilities in model interpretability systems, showing that correct predictions don't guarantee faithful explanations, and that interpretability itself can be an attack surface in safety-critical applications.

Method: Proposed Chromatic Perturbation Module that systematically crafts adversarial examples by altering color contrast between foreground and background to disrupt explanation fidelity, accumulating perturbations across federated learning training rounds.

Result: Attack reduces peak activation overlap in Grad-CAM explanations by up to 35% while preserving classification accuracy above 96% across multiple datasets, showing standard training pipelines fail to detect explanation degradation.

Conclusion: Interpretability can be compromised independently of accuracy, challenging the assumption that correct predictions imply faithful explanations, especially in federated learning where subtle color perturbations are hard to detect.

Abstract: As machine learning models are increasingly deployed in safety-critical domains, visual explanation techniques have become essential tools for supporting transparency. In this work, we reveal a new class of attacks that compromise model interpretability without affecting accuracy. Specifically, we show that small color perturbations applied by adversarial clients in a federated learning setting can shift a model’s saliency maps away from semantically meaningful regions while keeping the prediction unchanged. The proposed saliency-aware attack framework, called Chromatic Perturbation Module, systematically crafts adversarial examples by altering the color contrast between foreground and background in a way that disrupts explanation fidelity. These perturbations accumulate across training rounds, poisoning the global model’s internal feature attributions in a stealthy and persistent manner. Our findings challenge a common assumption in model auditing that correct predictions imply faithful explanations and demonstrate that interpretability itself can be an attack surface. We evaluate this vulnerability across multiple datasets and show that standard training pipelines are insufficient to detect or mitigate explanation degradation, especially in the federated learning setting, where subtle color perturbations are harder to discern. Our attack reduces peak activation overlap in Grad-CAM explanations by up to 35% while preserving classification accuracy above 96% on all evaluated datasets.

[451] BootOOD: Self-Supervised Out-of-Distribution Detection via Synthetic Sample Exposure under Neural Collapse

Yuanchao Wang, Tian Qin, Eduardo Valle, Bruno Abrahao

Main category: cs.CV

TL;DR: BootOOD is a self-supervised OOD detection framework that synthesizes pseudo-OOD features from ID data and uses radius-based classification on feature norms to handle semantically similar OOD samples.

DetailsMotivation: Existing OOD detectors struggle when OOD samples are semantically similar to in-distribution classes, which is critical for safety-sensitive applications.

Method: BootOOD synthesizes pseudo-OOD features through transformations of ID representations and uses a lightweight auxiliary head for radius-based classification on feature norms, leveraging Neural Collapse properties.

Result: BootOOD outperforms prior post-hoc methods, surpasses training-based methods without outlier exposure, and is competitive with state-of-the-art outlier-exposure approaches while maintaining or improving ID accuracy on CIFAR-10, CIFAR-100, and ImageNet-200.

Conclusion: BootOOD provides an effective self-supervised framework for OOD detection that handles semantically challenging cases by focusing on feature norm differences rather than orthogonal subspaces.

Abstract: Out-of-distribution (OOD) detection is critical for deploying image classifiers in safety-sensitive environments, yet existing detectors often struggle when OOD samples are semantically similar to the in-distribution (ID) classes. We present BootOOD, a fully self-supervised OOD detection framework that bootstraps exclusively from ID data and is explicitly designed to handle semantically challenging OOD samples. BootOOD synthesizes pseudo-OOD features through simple transformations of ID representations and leverages Neural Collapse (NC), where ID features cluster tightly around class means with consistent feature norms. Unlike prior approaches that aim to constrain OOD features into subspaces orthogonal to the collapsed ID means, BootOOD introduces a lightweight auxiliary head that performs radius-based classification on feature norms. This design decouples OOD detection from the primary classifier and imposes a relaxed requirement: OOD samples are learned to have smaller feature norms than ID features, which is easier to satisfy when ID and OOD are semantically close. Experiments on CIFAR-10, CIFAR-100, and ImageNet-200 show that BootOOD outperforms prior post-hoc methods, surpasses training-based methods without outlier exposure, and is competitive with state-of-the-art outlier-exposure approaches while maintaining or improving ID accuracy.

[452] TSE-Net: Semi-supervised Monocular Height Estimation from Single Remote Sensing Images

Sining Chen, Xiao Xiang Zhu

Main category: cs.CV

TL;DR: TSE-Net is a semi-supervised learning framework for monocular height estimation that uses teacher-student-exam networks to leverage unlabeled data, addressing data scarcity through pseudo-label generation and filtering.

DetailsMotivation: Monocular height estimation is limited by expensive labeled data acquisition. The scarcity of high-quality annotations hinders model generalization and performance, motivating the use of unlabeled data through semi-supervised learning.

Method: TSE-Net integrates teacher, student, and exam networks. The teacher generates pseudo-labels through joint regression and classification with hierarchical bi-cut strategy for long-tailed height distribution. The student learns from pseudo-labels, while the exam stabilizes performance as a temporal ensemble.

Result: The proposed pipeline was evaluated on three datasets spanning different resolutions and imaging modalities, demonstrating improved performance through semi-supervised learning.

Conclusion: TSE-Net effectively addresses data scarcity in monocular height estimation by leveraging unlabeled data through a semi-supervised framework, improving model generalization and performance while reducing dependency on expensive labeled data.

Abstract: Monocular height estimation plays a critical role in 3D perception for remote sensing, offering a cost-effective alternative to multi-view or LiDAR-based methods. While deep learning has significantly advanced the capabilities of monocular height estimation, these methods remain fundamentally limited by the availability of labeled data, which are expensive and labor-intensive to obtain at scale. The scarcity of high-quality annotations hinders the generalization and performance of existing models. To overcome this limitation, we propose leveraging large volumes of unlabeled data through a semi-supervised learning framework, enabling the model to extract informative cues from unlabeled samples and improve its predictive performance. In this work, we introduce TSE-Net, a self-training pipeline for semi-supervised monocular height estimation. The pipeline integrates teacher, student, and exam networks. The student network is trained on unlabeled data using pseudo-labels generated by the teacher network, while the exam network functions as a temporal ensemble of the student network to stabilize performance. The teacher network is formulated as a joint regression and classification model: the regression branch predicts height values that serve as pseudo-labels, and the classification branch predicts height value classes along with class probabilities, which are used to filter pseudo-labels. Height value classes are defined using a hierarchical bi-cut strategy to address the inherent long-tailed distribution of heights, and the predicted class probabilities are calibrated with a Plackett-Luce model to reflect the expected accuracy of pseudo-labels. We evaluate the proposed pipeline on three datasets spanning different resolutions and imaging modalities. Codes are available at https://github.com/zhu-xlab/tse-net.

[453] Opt3DGS: Optimizing 3D Gaussian Splatting with Adaptive Exploration and Curvature-Aware Exploitation

Ziyang Huang, Jiagang Chen, Jin Liu, Shunping Ji

Main category: cs.CV

TL;DR: Opt3DGS enhances 3D Gaussian Splatting optimization with a two-stage process using adaptive exploration and curvature-guided exploitation to overcome local optima entrapment and improve convergence quality.

DetailsMotivation: 3D Gaussian Splatting faces optimization challenges including entrapment in suboptimal local optima and insufficient convergence quality, limiting its rendering performance.

Method: Two-stage optimization: 1) Exploration phase with Adaptive Weighted Stochastic Gradient Langevin Dynamics for global search, 2) Exploitation phase with Local Quasi-Newton Direction-guided Adam optimizer using curvature information.

Result: Extensive experiments show Opt3DGS achieves state-of-the-art rendering quality on diverse benchmark datasets without modifying 3DGS’s underlying representation.

Conclusion: Opt3DGS provides a robust framework that significantly improves 3DGS optimization through enhanced global search and precise convergence, achieving superior rendering performance.

Abstract: 3D Gaussian Splatting (3DGS) has emerged as a leading framework for novel view synthesis, yet its core optimization challenges remain underexplored. We identify two key issues in 3DGS optimization: entrapment in suboptimal local optima and insufficient convergence quality. To address these, we propose Opt3DGS, a robust framework that enhances 3DGS through a two-stage optimization process of adaptive exploration and curvature-guided exploitation. In the exploration phase, an Adaptive Weighted Stochastic Gradient Langevin Dynamics (SGLD) method enhances global search to escape local optima. In the exploitation phase, a Local Quasi-Newton Direction-guided Adam optimizer leverages curvature information for precise and efficient convergence. Extensive experiments on diverse benchmark datasets demonstrate that Opt3DGS achieves state-of-the-art rendering quality by refining the 3DGS optimization process without modifying its underlying representation.

[454] Adaptive Multi-Scale Integration Unlocks Robust Cell Annotation in Histopathology Images

Yinuo Xu, Yan Cui, Mingyao Li, Zhi Huang

Main category: cs.CV

TL;DR: NuClass is a multi-scale framework that integrates nuclear morphology and tissue context for cell type classification, using uncertainty-guided fusion and spatial transcriptomics-derived labels to overcome annotation limitations.

DetailsMotivation: Existing tile-based models fail to incorporate broader tissue context that influences cell identity, and available annotations are coarse-grained and unevenly distributed, making fine-grained subtype-level supervision difficult.

Method: NuClass uses two components: Path local (224×224 pixel nuclear morphology) and Path global (1024×1024 pixel neighborhood context), with a learnable gating module and uncertainty-guided objective that directs global path to prioritize uncertain regions. Uses spatial transcriptomics to create marker-guided dataset.

Result: Achieves up to 96% F1 for best-performing class on three fully held-out cohorts, outperforming strong baselines. Provides calibrated confidence estimates and Grad-CAM visualizations.

Conclusion: Multi-scale, uncertainty-aware fusion can bridge the gap between slide-level pathological foundation models and reliable, cell-level phenotype prediction.

Abstract: Identifying cell types and subtypes from routine histopathology images is essential for improving the computational understanding of human disease. Existing tile-based models can capture detailed nuclear morphology but often fail to incorporate the broader tissue context that influences a cell’s function and identity. In addition, available human annotations are typically coarse-grained and unevenly distributed across studies, making fine-grained subtype-level supervision difficult to obtain. To address these limitations, we introduce NuClass, a pathologist workflow inspired framework for cell-wise multi-scale integration of nuclear morphology and microenvironmental context. NuClass includes two main components: Path local, which focuses on nuclear morphology from 224-by-224 pixel crops, and Path global, which models the surrounding 1024-by-1024 pixel neighborhood. A learnable gating module adaptively balances local detail and contextual cues. To encourage complementary learning, we incorporate an uncertainty-guided objective that directs the global path to prioritize regions where the local path is uncertain. We also provide calibrated confidence estimates and Grad-CAM visualizations to enhance interpretability. To overcome the lack of high-quality annotations, we construct a marker-guided dataset from Xenium spatial transcriptomics assays, yielding single-cell resolution labels for more than two million cells across eight organs and 16 classes. Evaluated on three fully held-out cohorts, NuClass achieves up to 96 percent F1 for its best-performing class, outperforming strong baselines. Our results show that multi-scale, uncertainty-aware fusion can bridge the gap between slide-level pathological foundation models and reliable, cell-level phenotype prediction.

[455] ICLR: Inter-Chrominance and Luminance Interaction for Natural Color Restoration in Low-Light Image Enhancement

Xin Xu, Hao Liu, Wei Liu, Wei Wang, Jiayi Wu, Kui Jiang

Main category: cs.CV

TL;DR: Proposes ICLR framework with DIEM and CCL for low-light image enhancement, addressing chrominance-luminance interaction issues and outperforming SOTA methods.

DetailsMotivation: Address limitations in HVI color space for LLIE: distributional differences between chrominance/luminance branches limit complementary feature extraction, luminance errors propagate to chrominance, and weak correlation in homogeneous-color regions causes gradient conflicts with traditional losses.

Method: ICLR framework with Dual-stream Interaction Enhancement Module (DIEM) for complementary information extraction from fusion and enhancement dimensions, and Covariance Correction Loss (CCL) using luminance residual statistics to penalize chrominance errors and balance gradient conflicts via chrominance covariance constraints.

Result: Experimental results on multiple datasets demonstrate that the proposed ICLR framework outperforms state-of-the-art methods in low-light image enhancement.

Conclusion: The ICLR framework effectively addresses chrominance-luminance interaction challenges in HVI color space for LLIE, achieving superior performance through improved complementary feature extraction and gradient conflict resolution.

Abstract: Low-Light Image Enhancement (LLIE) task aims at improving contrast while restoring details and textures for images captured in low-light conditions. HVI color space has made significant progress in this task by enabling precise decoupling of chrominance and luminance. However, for the interaction of chrominance and luminance branches, substantial distributional differences between the two branches prevalent in natural images limit complementary feature extraction, and luminance errors are propagated to chrominance channels through the nonlinear parameter. Furthermore, for interaction between different chrominance branches, images with large homogeneous-color regions usually exhibit weak correlation between chrominance branches due to concentrated distributions. Traditional pixel-wise losses exploit strong inter-branch correlations for co-optimization, causing gradient conflicts in weakly correlated regions. Therefore, we propose an Inter-Chrominance and Luminance Interaction (ICLR) framework including a Dual-stream Interaction Enhancement Module (DIEM) and a Covariance Correction Loss (CCL). The DIEM improves the extraction of complementary information from two dimensions, fusion and enhancement, respectively. The CCL utilizes luminance residual statistics to penalize chrominance errors and balances gradient conflicts by constraining chrominance branches covariance. Experimental results on multiple datasets show that the proposed ICLR framework outperforms state-of-the-art methods.

[456] AtlasMorph: Learning conditional deformable templates for brain MRI

Marianne Rakic, Andrew Hoopes, S. Mazdak Abulnaga, Mert R. Sabuncu, John V. Guttag, Adrian V. Dalca

Main category: cs.CV

TL;DR: A machine learning framework that uses convolutional neural networks to create population-specific brain MRI templates conditioned on attributes like age and sex, with optional anatomical segmentation maps.

DetailsMotivation: Existing medical image templates are computationally expensive to create and often not representative of specific study populations, especially when there are large variations within the population.

Method: Convolutional registration neural networks are used to learn a function that outputs templates conditioned on subject-specific attributes (age, sex), leveraging segmentations when available to produce anatomical segmentation maps for the templates.

Result: The method learns high-quality templates that are representative of populations, and annotated conditional templates enable better registration than unlabeled unconditional templates, outperforming other template construction methods.

Conclusion: The proposed framework efficiently creates population-specific templates that improve registration performance and are more representative than existing templates.

Abstract: Deformable templates, or atlases, are images that represent a prototypical anatomy for a population, and are often enhanced with probabilistic anatomical label maps. They are commonly used in medical image analysis for population studies and computational anatomy tasks such as registration and segmentation. Because developing a template is a computationally expensive process, relatively few templates are available. As a result, analysis is often conducted with sub-optimal templates that are not truly representative of the study population, especially when there are large variations within this population. We propose a machine learning framework that uses convolutional registration neural networks to efficiently learn a function that outputs templates conditioned on subject-specific attributes, such as age and sex. We also leverage segmentations, when available, to produce anatomical segmentation maps for the resulting templates. The learned network can also be used to register subject images to the templates. We demonstrate our method on a compilation of 3D brain MRI datasets, and show that it can learn high-quality templates that are representative of populations. We find that annotated conditional templates enable better registration than their unlabeled unconditional counterparts, and outperform other templates construction methods.

[457] Tissue Aware Nuclei Detection and Classification Model for Histopathology Images

Kesi Xu, Eleni Chiou, Ali Varamesh, Laura Acqualagna, Nasir Rajpoot

Main category: cs.CV

TL;DR: TAND is a novel framework for joint nuclei detection and classification in computational pathology that uses point-level supervision enhanced by tissue mask conditioning, achieving state-of-the-art performance on the PUMA benchmark.

DetailsMotivation: Existing approaches for nuclei detection and classification are hindered by reliance on detailed expert annotations and insufficient use of tissue context, creating a need for methods that reduce annotation burden while improving accuracy.

Method: TAND couples a ConvNeXt-based encoder-decoder with a frozen Virchow-2 tissue segmentation branch, using semantic tissue probabilities to selectively modulate the classification stream through novel multi-scale Spatial Feature-wise Linear Modulation (Spatial-FiLM).

Result: On the PUMA benchmark, TAND achieves state-of-the-art performance, surpassing both tissue-agnostic baselines and mask-supervised methods, with remarkable improvements in tissue-dependent cell types such as epithelium, endothelium, and stroma.

Conclusion: This is the first method to condition per-cell classification on learned tissue masks, offering a practical pathway to reduce annotation burden in computational pathology.

Abstract: Accurate nuclei detection and classification are fundamental to computational pathology, yet existing approaches are hindered by reliance on detailed expert annotations and insufficient use of tissue context. We present Tissue-Aware Nuclei Detection (TAND), a novel framework achieving joint nuclei detection and classification using point-level supervision enhanced by tissue mask conditioning. TAND couples a ConvNeXt-based encoder-decoder with a frozen Virchow-2 tissue segmentation branch, where semantic tissue probabilities selectively modulate the classification stream through a novel multi-scale Spatial Feature-wise Linear Modulation (Spatial-FiLM). On the PUMA benchmark, TAND achieves state-of-the-art performance, surpassing both tissue-agnostic baselines and mask-supervised methods. Notably, our approach demonstrates remarkable improvements in tissue-dependent cell types such as epithelium, endothelium, and stroma. To the best of our knowledge, this is the first method to condition per-cell classification on learned tissue masks, offering a practical pathway to reduce annotation burden.

[458] A Real-Time Driver Drowsiness Detection System Using MediaPipe and Eye Aspect Ratio

Ashlesha G. Sawant, Shreyash S. Kamble, Raj S. Kanade, Raunak N. Kanugo, Tanishq A. Kapse, Karan A. Bhapse

Main category: cs.CV

TL;DR: Development of a real-time driver drowsiness detection system using webcam, facial feature tracking, and Eye Aspect Ratio (EAR) to monitor eye movements and alert drowsy drivers.

DetailsMotivation: Driver fatigue causes thousands of road accidents, fatalities, and injuries annually, necessitating systems to improve road safety by detecting drowsiness signs.

Method: Uses standard webcam with MediaPipe Face Mesh for facial landmark detection and Eye Aspect Ratio (EAR) method to monitor eye closure duration and blink rate for drowsiness detection.

Result: System achieves high accuracy and quick response times in experimental tests, providing a low-cost, high-performance driver monitoring solution.

Conclusion: The system can be effectively integrated into Advanced Driving Assistance Systems (ADAS) to enhance road safety by preventing fatigue-related accidents.

Abstract: One of the major causes of road accidents is driver fatigue that causes thousands of fatalities and injuries every year. This study shows development of a Driver Drowsiness Detection System meant to improve the safety of the road by alerting drivers who are showing signs of being drowsy. The system is based on a standard webcam that tracks the facial features of the driver with the main emphasis on the examination of eye movements that can be conducted with the help of the Eye Aspect Ratio (EAR) method. The Face Mesh by MediaPipe is a lightweight framework that can identify facial landmarks with high accuracy and efficiency, which is considered to be important in real time use. The system detects the moments of long eye shutdowns or a very low rate of blinking which are manifestations of drowsiness and alerts the driver through sound to get her attention back. This system achieves a high-performance and low-cost driver monitoring solution with the help of the computational power of OpenCV to process the image and the MediaPipe to identify faces. Test data experimental analyses indicate that the system is very accurate and responds quicker; this confirms that it can be a component of the current Advanced Driving Assistance System (ADAS).

[459] CacheFlow: Compressive Streaming Memory for Efficient Long-Form Video Understanding

Shrenik Patel, Daivik Patel

Main category: cs.CV

TL;DR: CacheFlow is a training-free pipeline for long-form video QA that combines dynamic token dropping with compressive long-term memory to reduce token processing by up to 87% while maintaining answer fidelity.

DetailsMotivation: Current vision-language models struggle with long-form video QA due to growing attention and KV caches, forcing expensive inference or limited sliding windows.

Method: Pairs Dynamic Token Dropping (prunes tokens via cosine similarity to previous frame) with compressive long-term memory (summarizes keys for retrieval, offloads full KV pairs, rehydrates for generation). Uses consensus-based retrieval of Top-K relevant blocks.

Result: Outperforms current baselines on offline and streaming VQA benchmarks while processing up to 87% less tokens. Enables efficient and context-aware long-form video understanding.

Conclusion: CacheFlow provides a practical solution for long-form video understanding that is drop-in, architecture-agnostic, and requires no fine-tuning, making VLMs both efficient and context-aware.

Abstract: Long-form video question answering (VQA) overwhelms current vision-language models (VLMs) because attention and key-value (KV) caches grow with runtime, forcing either expensive inference or near-sighted sliding windows. We introduce CacheFlow, a training-free pipeline that pairs Dynamic Token Dropping (DTD) with a compressive long-term memory. DTD prunes per-patch tokens online via cosine similarity to the previous frame, and surviving tokens are packed into fixed-size blocks. This online, per-frame processing makes our approach fundamentally suited for live streaming VQA. As blocks are processed, each one’s keys are summarized by a tiny recurrent encoder to form a retrieval index, while the block’s full KV pairs are offloaded and later rehydrated for generation, preserving answer fidelity. At inference, a consensus-based retrieval mechanism retrieves only the Top-K most relevant blocks and attends over both the retrieved and local context for precise, long-range reasoning. CacheFlow is drop-in, architecture-agnostic, and requires no fine-tuning. Experiments on both offline and streaming VQA benchmarks demonstrate that CacheFlow outperforms current strong baselines, while processing up to 87% less tokens. Our dual approach enables VLMs to be both efficient and context-aware, paving the way for practical long-form video understanding.

[460] Part-X-MLLM: Part-aware 3D Multimodal Large Language Model

Chunshi Wang, Junliang Ye, Yunhan Yang, Yang Li, Zizhuo Lin, Jun Zhu, Zhuo Chen, Yawei Luo, Chunchao Guo

Main category: cs.CV

TL;DR: Part-X-MLLM is a 3D multimodal LLM that unifies diverse 3D tasks by generating structured programs for part-level bounding boxes, semantic descriptions, and edit commands from RGB point clouds and language prompts.

DetailsMotivation: To create a unified interface for diverse 3D tasks by decoupling symbolic planning from geometric synthesis, allowing any compatible geometry engine to be controlled through language.

Method: Uses a dual-encoder architecture pre-trained to disentangle structure from semantics, instruction-tuned on a large-scale part-centric dataset to autoregressively generate structured token sequences encoding part-level information.

Result: Achieves state-of-the-art performance in grounded Q&A, compositional generation, and localized editing through a single unified interface, producing high-quality structured plans.

Conclusion: The approach successfully enables versatile 3D task execution by treating them as programs in a structured grammar, with the structured output driving downstream geometry-aware modules for part-based operations.

Abstract: We introduce Part-X-MLLM, a native 3D multimodal large language model that unifies diverse 3D tasks by formulating them as programs in a structured, executable grammar. Given an RGB point cloud and a natural language prompt, our model autoregressively generates a single, coherent token sequence encoding part-level bounding boxes, semantic descriptions, and edit commands. This structured output serves as a versatile interface to drive downstream geometry-aware modules for part-based generation and editing. By decoupling the symbolic planning from the geometric synthesis, our approach allows any compatible geometry engine to be controlled through a single, language-native frontend. We pre-train a dual-encoder architecture to disentangle structure from semantics and instruction-tune the model on a large-scale, part-centric dataset. Experiments demonstrate that our model excels at producing high-quality, structured plans, enabling state-of-the-art performance in grounded Q&A, compositional generation, and localized editing through one unified interface. Project page: https://chunshi.wang/Part-X-MLLM/

[461] PhysX-Anything: Simulation-Ready Physical 3D Assets from Single Image

Ziang Cao, Fangzhou Hong, Zhaoxi Chen, Liang Pan, Ziwei Liu

Main category: cs.CV

TL;DR: PhysX-Anything is a simulation-ready physical 3D generative framework that creates high-quality 3D assets with geometry, articulation, and physical attributes from single images, enabling direct use in robotics and embodied AI applications.

DetailsMotivation: Existing 3D generation methods overlook physical and articulation properties, limiting their utility in embodied AI. The gap between static visual representations and physically interactive assets needs to be bridged for practical simulation applications.

Method: Proposes a VLM-based physical 3D generative model with a new 3D representation that tokenizes geometry efficiently (193x reduction in tokens). Creates PhysX-Mobility dataset with 2K+ real-world objects and rich physical annotations, expanding object categories by 2x.

Result: Strong generative performance and robust generalization demonstrated on PhysX-Mobility and in-the-wild images. Simulation experiments in MuJoCo-style environment validate that generated assets can be directly used for contact-rich robotic policy learning.

Conclusion: PhysX-Anything substantially empowers embodied AI and physics-based simulation applications by providing sim-ready 3D assets with explicit physical properties, bridging the gap between visual representation and physical interaction.

Abstract: 3D modeling is shifting from static visual representations toward physical, articulated assets that can be directly used in simulation and interaction. However, most existing 3D generation methods overlook key physical and articulation properties, thereby limiting their utility in embodied AI. To bridge this gap, we introduce PhysX-Anything, the first simulation-ready physical 3D generative framework that, given a single in-the-wild image, produces high-quality sim-ready 3D assets with explicit geometry, articulation, and physical attributes. Specifically, we propose the first VLM-based physical 3D generative model, along with a new 3D representation that efficiently tokenizes geometry. It reduces the number of tokens by 193x, enabling explicit geometry learning within standard VLM token budgets without introducing any special tokens during fine-tuning and significantly improving generative quality. In addition, to overcome the limited diversity of existing physical 3D datasets, we construct a new dataset, PhysX-Mobility, which expands the object categories in prior physical 3D datasets by over 2x and includes more than 2K common real-world objects with rich physical annotations. Extensive experiments on PhysX-Mobility and in-the-wild images demonstrate that PhysX-Anything delivers strong generative performance and robust generalization. Furthermore, simulation-based experiments in a MuJoCo-style environment validate that our sim-ready assets can be directly used for contact-rich robotic policy learning. We believe PhysX-Anything can substantially empower a broad range of downstream applications, especially in embodied AI and physics-based simulation.

[462] Distribution Matching Distillation Meets Reinforcement Learning

Dengyang Jiang, Dongyang Liu, Zanyi Wang, Qilong Wu, Xin Jin, David Liu, Zhen Li, Mengmeng Wang, Peng Gao, Harry Yang

Main category: cs.CV

TL;DR: DMDR combines Reinforcement Learning with Distribution Matching Distillation to improve few-step diffusion models, achieving better performance than multi-step teachers.

DetailsMotivation: To overcome the performance limitation where few-step distilled models are capped by their multi-step teachers, enabling the student model to exceed teacher performance.

Method: Integrates RL into distillation process using DMD loss as regularization, with dynamic distribution guidance and dynamic renoise sampling strategies.

Result: Achieves leading visual quality and prompt coherence among few-step methods, even exceeding multi-step teacher performance.

Conclusion: DMDR successfully unlocks the capacity of few-step generators through simultaneous distillation and RL, demonstrating superior performance over existing methods.

Abstract: Distribution Matching Distillation (DMD) distills a pre-trained multi-step diffusion model to a few-step one to improve inference efficiency. However, the performance of the latter is often capped by the former. To circumvent this dilemma, we propose DMDR, a novel framework that combines Reinforcement Learning (RL) techniques into the distillation process. We show that for the RL of the few-step generator, the DMD loss itself is a more effective regularization compared to the traditional ones. In turn, RL can help to guide the mode coverage process in DMD more effectively. These allow us to unlock the capacity of the few-step generator by conducting distillation and RL simultaneously. Meanwhile, we design the dynamic distribution guidance and dynamic renoise sampling training strategies to improve the initial distillation process. The experiments demonstrate that DMDR can achieve leading visual quality, prompt coherence among few-step methods, and even exhibit performance that exceeds the multi-step teacher.

[463] OlmoEarth: Stable Latent Image Modeling for Multimodal Earth Observation

Henry Herzog, Favyen Bastani, Yawen Zhang, Gabriel Tseng, Joseph Redmon, Hadrien Sablon, Ryan Park, Jacob Morrison, Alexandra Buraczynski, Karen Farley, Joshua Hansen, Andrew Howe, Patrick Alan Johnson, Mark Otterlee, Ted Schmitt, Hunter Pitelka, Stephen Daspit, Rachel Ratner, Christopher Wilhelm, Sebastian Wood, Mike Jacobi, Hannah Kerner, Evan Shelhamer, Ali Farhadi, Ranjay Krishna, Patrick Beukema

Main category: cs.CV

TL;DR: OlmoEarth is a multimodal, spatio-temporal foundation model for Earth observation that achieves state-of-the-art performance across various benchmarks and real-world tasks.

DetailsMotivation: Earth observation data is unique as it combines spatial, sequential, and multimodal characteristics, requiring specialized foundation models to handle its complexity.

Method: Uses novel self-supervised learning formulation, masking strategy, and loss specifically designed for Earth observation data.

Result: Achieves best performance on 15/24 tasks with embeddings and 19/29 tasks with full fine-tuning, outperforming 12 other foundation models.

Conclusion: OlmoEarth provides an effective foundation model for Earth observation and is deployed as an end-to-end platform for non-profits and NGOs, with open-source code and pre-trained weights available.

Abstract: Earth observation data presents a unique challenge: it is spatial like images, sequential like video or text, and highly multimodal. We present OlmoEarth: a multimodal, spatio-temporal foundation model that employs a novel self-supervised learning formulation, masking strategy, and loss all designed for the Earth observation domain. OlmoEarth achieves state-of-the-art performance compared to 12 other foundation models across a variety of research benchmarks and real-world tasks from external partners. When evaluating embeddings OlmoEarth achieves the best performance on 15 out of 24 tasks, and with full fine-tuning it is the best on 19 of 29 tasks. We deploy OlmoEarth as the backbone of an end-to-end platform for data collection, labeling, training, and inference of Earth observation models. The OlmoEarth Platform puts frontier foundation models and powerful data management tools into the hands of non-profits and NGOs working to solve the world’s biggest problems. OlmoEarth source code, training data, and pre-trained weights are available at $\href{https://github.com/allenai/olmoearth_pretrain}{\text{https://github.com/allenai/olmoearth_pretrain}}$.

[464] Training-Free Multi-View Extension of IC-Light for Textual Position-Aware Scene Relighting

Jiangnan Ye, Jiedong Zhuang, Lianrui Mu, Wenjie Zheng, Jiaqi Hu, Xingze Zou, Jing Wang, Haoji Hu

Main category: cs.CV

TL;DR: GS-Light is an efficient pipeline for text-guided relighting of 3D Gaussian Splatting scenes using training-free diffusion models and lighting priors from vision-language models.

DetailsMotivation: To enable efficient and accurate text-guided relighting of 3D scenes that can handle complex lighting specifications including direction, color, intensity, and reference objects.

Method: Uses LVLM to parse text prompts into lighting priors, fuses with geometry/semantic constraints to compute illumination maps, generates initial latent codes for multi-view diffusion model, and fine-tunes 3DGS scene with relit appearance.

Result: Demonstrates consistent improvements over state-of-the-art baselines in multi-view consistency, imaging quality, aesthetic score, and semantic similarity across indoor and outdoor scenes.

Conclusion: GS-Light provides an effective training-free solution for high-fidelity text-guided 3D scene relighting with accurate lighting direction control and improved user expectation alignment.

Abstract: We introduce GS-Light, an efficient, textual position-aware pipeline for text-guided relighting of 3D scenes represented via Gaussian Splatting (3DGS). GS-Light implements a training-free extension of a single-input diffusion model to handle multi-view inputs. Given a user prompt that may specify lighting direction, color, intensity, or reference objects, we employ a large vision-language model (LVLM) to parse the prompt into lighting priors. Using off-the-shelf estimators for geometry and semantics (depth, surface normals, and semantic segmentation), we fuse these lighting priors with view-geometry constraints to compute illumination maps and generate initial latent codes for each view. These meticulously derived init latents guide the diffusion model to generate relighting outputs that more accurately reflect user expectations, especially in terms of lighting direction. By feeding multi-view rendered images, along with the init latents, into our multi-view relighting model, we produce high-fidelity, artistically relit images. Finally, we fine-tune the 3DGS scene with the relit appearance to obtain a fully relit 3D scene. We evaluate GS-Light on both indoor and outdoor scenes, comparing it to state-of-the-art baselines including per-view relighting, video relighting, and scene editing methods. Using quantitative metrics (multi-view consistency, imaging quality, aesthetic score, semantic similarity, etc.) and qualitative assessment (user studies), GS-Light demonstrates consistent improvements over baselines. Code and assets will be made available upon publication.

[465] TiViBench: Benchmarking Think-in-Video Reasoning for Video Generative Models

Harold Haodong Chen, Disen Lan, Wen-Jie Shu, Qingyang Liu, Zihan Wang, Sirui Chen, Wenkai Cheng, Kanghao Chen, Hongfei Zhang, Zixin Zhang, Rongjin Guo, Yu Cheng, Ying-Cong Chen

Main category: cs.CV

TL;DR: TiViBench is a hierarchical benchmark for evaluating reasoning capabilities in image-to-video generation models, covering structural, spatial, symbolic, and action reasoning across 24 tasks. VideoTPO is a test-time optimization strategy that uses LLM self-analysis to improve reasoning performance without additional training.

DetailsMotivation: Current video generation benchmarks focus on visual fidelity and temporal coherence but fail to assess higher-order reasoning abilities similar to LLMs, creating a gap in evaluating whether video models can exhibit sophisticated reasoning capabilities.

Method: Proposed TiViBench with 4 reasoning dimensions (structural, spatial, symbolic, action) across 24 tasks and 3 difficulty levels. Also introduced VideoTPO, a test-time strategy using LLM self-analysis on generated candidates to identify strengths/weaknesses for preference optimization.

Result: Commercial models (Sora 2, Veo 3.1) show stronger reasoning potential, while open-source models have untapped potential limited by training scale and data diversity. VideoTPO significantly enhances reasoning performance without additional training, data, or reward models.

Conclusion: TiViBench and VideoTPO provide foundations for evaluating and advancing reasoning in video generation models, setting the stage for future research in this emerging field of reasoning-capable video generation.

Abstract: The rapid evolution of video generative models has shifted their focus from producing visually plausible outputs to tackling tasks requiring physical plausibility and logical consistency. However, despite recent breakthroughs such as Veo 3’s chain-of-frames reasoning, it remains unclear whether these models can exhibit reasoning capabilities similar to large language models (LLMs). Existing benchmarks predominantly evaluate visual fidelity and temporal coherence, failing to capture higher-order reasoning abilities. To bridge this gap, we propose TiViBench, a hierarchical benchmark specifically designed to evaluate the reasoning capabilities of image-to-video (I2V) generation models. TiViBench systematically assesses reasoning across four dimensions: i) Structural Reasoning & Search, ii) Spatial & Visual Pattern Reasoning, iii) Symbolic & Logical Reasoning, and iv) Action Planning & Task Execution, spanning 24 diverse task scenarios across 3 difficulty levels. Through extensive evaluations, we show that commercial models (e.g., Sora 2, Veo 3.1) demonstrate stronger reasoning potential, while open-source models reveal untapped potential that remains hindered by limited training scale and data diversity. To further unlock this potential, we introduce VideoTPO, a simple yet effective test-time strategy inspired by preference optimization. By performing LLM self-analysis on generated candidates to identify strengths and weaknesses, VideoTPO significantly enhances reasoning performance without requiring additional training, data, or reward models. Together, TiViBench and VideoTPO pave the way for evaluating and advancing reasoning in video generation models, setting a foundation for future research in this emerging field.

[466] Free-Form Scene Editor: Enabling Multi-Round Object Manipulation like in a 3D Engine

Xincheng Shuai, Zhenyuan Qin, Henghui Ding, Dacheng Tao

Main category: cs.CV

TL;DR: FFSE is a 3D-aware autoregressive framework that enables intuitive, physically-consistent object editing on real-world images by modeling editing as sequences of learned 3D transformations.

DetailsMotivation: Current text-to-image diffusion models excel at semantic image editing but lack 3D-aware object manipulation capabilities, often producing physically inconsistent results or requiring slow 3D reconstruction.

Method: FFSE models editing as sequences of learned 3D transformations, using a 3DObjectEditor dataset constructed from simulated editing sequences across diverse objects and scenes for training under multi-round dynamic conditions.

Result: Extensive experiments show FFSE significantly outperforms existing methods in both single-round and multi-round 3D-aware editing scenarios, maintaining realistic background effects and global scene consistency.

Conclusion: FFSE provides an effective framework for 3D-aware object manipulation that enables arbitrary transformations while preserving physical realism and scene consistency across multiple editing rounds.

Abstract: Recent advances in text-to-image (T2I) diffusion models have significantly improved semantic image editing, yet most methods fall short in performing 3D-aware object manipulation. In this work, we present FFSE, a 3D-aware autoregressive framework designed to enable intuitive, physically-consistent object editing directly on real-world images. Unlike previous approaches that either operate in image space or require slow and error-prone 3D reconstruction, FFSE models editing as a sequence of learned 3D transformations, allowing users to perform arbitrary manipulations, such as translation, scaling, and rotation, while preserving realistic background effects (e.g., shadows, reflections) and maintaining global scene consistency across multiple editing rounds. To support learning of multi-round 3D-aware object manipulation, we introduce 3DObjectEditor, a hybrid dataset constructed from simulated editing sequences across diverse objects and scenes, enabling effective training under multi-round and dynamic conditions. Extensive experiments show that the proposed FFSE significantly outperforms existing methods in both single-round and multi-round 3D-aware editing scenarios.

[467] Segment Anything Across Shots: A Method and Benchmark

Hengrui Hu, Kaining Ying, Henghui Ding

Main category: cs.CV

TL;DR: The paper proposes SAAS model for multi-shot video object segmentation, addressing shot transition challenges through transition mimicking augmentation and achieves SOTA performance on new benchmarks.

DetailsMotivation: Existing VOS methods struggle with shot discontinuities in multi-shot videos, limiting real-world applicability due to severe annotated multi-shot data sparsity.

Method: Proposes transition mimicking data augmentation (TMA) for cross-shot generalization with single-shot data, and SAAS model that detects and comprehends shot transitions effectively.

Result: SAAS achieves state-of-the-art performance on YouMVOS and Cut-VOS benchmarks by effectively handling complex transitions across shots.

Conclusion: The work enables effective multi-shot VOS through novel data augmentation and model design, supported by new benchmark datasets for future research.

Abstract: This work focuses on multi-shot semi-supervised video object segmentation (MVOS), which aims at segmenting the target object indicated by an initial mask throughout a video with multiple shots. The existing VOS methods mainly focus on single-shot videos and struggle with shot discontinuities, thereby limiting their real-world applicability. We propose a transition mimicking data augmentation strategy (TMA) which enables cross-shot generalization with single-shot data to alleviate the severe annotated multi-shot data sparsity, and the Segment Anything Across Shots (SAAS) model, which can detect and comprehend shot transitions effectively. To support evaluation and future study in MVOS, we introduce Cut-VOS, a new MVOS benchmark with dense mask annotations, diverse object categories, and high-frequency transitions. Extensive experiments on YouMVOS and Cut-VOS demonstrate that the proposed SAAS achieves state-of-the-art performance by effectively mimicking, understanding, and segmenting across complex transitions. The code and datasets are released at https://henghuiding.com/SAAS/.

[468] Back to Basics: Let Denoising Generative Models Denoise

Tianhong Li, Kaiming He

Main category: cs.CV

TL;DR: The paper proposes JiT (Just image Transformers), a diffusion model that directly predicts clean images rather than noise, leveraging the manifold assumption that natural data lies on low-dimensional manifolds.

DetailsMotivation: Current diffusion models predict noise/noised quantities rather than clean data, which contradicts the manifold assumption that natural data occupies low-dimensional manifolds while noised data does not.

Method: Use simple large-patch Transformers on pixels that directly predict clean images, without tokenizers, pre-training, or extra losses. Operates with large patch sizes (16, 32) on ImageNet at 256×256 and 512×512 resolutions.

Result: Competitive results on ImageNet at 256×256 and 512×512 resolutions, showing that predicting clean data allows apparently under-capacity networks to work effectively in high-dimensional spaces.

Conclusion: Directly predicting clean data according to the manifold assumption enables effective generative modeling with simple Transformers, providing a self-contained paradigm for Transformer-based diffusion on raw natural data.

Abstract: Today’s denoising diffusion models do not “denoise” in the classical sense, i.e., they do not directly predict clean images. Rather, the neural networks predict noise or a noised quantity. In this paper, we suggest that predicting clean data and predicting noised quantities are fundamentally different. According to the manifold assumption, natural data should lie on a low-dimensional manifold, whereas noised quantities do not. With this assumption, we advocate for models that directly predict clean data, which allows apparently under-capacity networks to operate effectively in very high-dimensional spaces. We show that simple, large-patch Transformers on pixels can be strong generative models: using no tokenizer, no pre-training, and no extra loss. Our approach is conceptually nothing more than “$\textbf{Just image Transformers}$”, or $\textbf{JiT}$, as we call it. We report competitive results using JiT with large patch sizes of 16 and 32 on ImageNet at resolutions of 256 and 512, where predicting high-dimensional noised quantities can fail catastrophically. With our networks mapping back to the basics of the manifold, our research goes back to basics and pursues a self-contained paradigm for Transformer-based diffusion on raw natural data.

[469] Using Self-Supervised Auxiliary Tasks to Improve Fine-Grained Facial Representation

Mahdi Pourmirzaei, Gholam Ali Montazer, Farzaneh Esmaili

Main category: cs.CV

TL;DR: Training from scratch with strong augmentation matches ImageNet fine-tuning for FER. Proposed Hybrid Multi-Task Learning (HMTL) combines supervised learning with self-supervised objectives (puzzling and inpainting) to achieve state-of-the-art results without extra pretraining data.

DetailsMotivation: Challenge the assumption that transfer learning is always valuable for FER, showing that training from random initialization with strong augmentation can match or surpass ImageNet fine-tuning.

Method: Hybrid Multi-Task Learning (HMTL) that augments supervised learning with self-supervised learning objectives (puzzling and inpainting with perceptual loss) during training while keeping inference model unchanged.

Result: Achieved state-of-the-art accuracy on AffectNet in eight-emotion setting without additional pretraining data, with larger gains in low-data regimes. Also improved performance on other fine-grained facial analysis tasks like head pose estimation and gender recognition.

Conclusion: Aligned SSL auxiliaries are an effective and simple way to strengthen supervised fine-grained facial representation without adding extra computation cost during inference time.

Abstract: Facial emotion recognition (FER) is a fine-grained problem where the value of transfer learning is often assumed. We first quantify this assumption and show that, on AffectNet, training from random initialization with sufficiently strong augmentation consistently matches or surpasses fine-tuning from ImageNet. Motivated by this result, we propose Hybrid Multi-Task Learning (HMTL) for FER in the wild. HMTL augments supervised learning (SL) with self-supervised learning (SSL) objectives during training, while keeping the inference-time model unchanged. We instantiate HMTL with two tailored pretext tasks, puzzling and inpainting with a perceptual loss, that encourage part-aware and expression-relevant features. On AffectNet, both HMTL variants achieve state-of-the-art accuracy in the eight-emotion setting without any additional pretraining data, and they provide larger gains under low-data regimes. Compared with conventional SSL pretraining, HMTL yields stronger downstream performance. Beyond FER, the same strategy improves fine-grained facial analysis tasks, including head pose estimation and gender recognition. These results suggest that aligned SSL auxiliaries are an effective and simple way to strengthen supervised fine-grained facial representation without adding extra computation cost during inference time.

[470] HIBMatch: Hypergraph Information Bottleneck for Semi-supervised Alzheimer’s Progression

Zhongying Deng, Shujun Wang, Angelica I Aviles-Rivero, Zoe Kourtzi, Carola-Bibiane Schönlieb

Main category: cs.CV

TL;DR: HIBMatch is a semi-supervised multimodal hypergraph framework that uses information bottleneck and consistency regularization for Alzheimer’s disease progression prediction in MCI patients, outperforming state-of-the-art methods.

DetailsMotivation: Existing Alzheimer's progression prediction methods rely heavily on labeled data and fail to distinguish which current features are relevant for predicting future progression years later.

Method: Uses hypergraphs for multimodal data representation, Hypergraph Information Bottleneck (HIB) to filter irrelevant information, consistency regularization between HIB and classifier, and cross-modal contrastive loss for unlabeled data utilization.

Result: Extensive experiments on ADNI dataset show HIBMatch surpasses existing state-of-the-art methods in Alzheimer’s disease prognosis.

Conclusion: HIBMatch effectively addresses limitations of current methods by focusing on relevant future-progression information and leveraging unlabeled data through semi-supervised learning.

Abstract: Alzheimer’s disease progression prediction is critical for patients with early Mild Cognitive Impairment (MCI) to enable timely intervention and improve their quality of life. While existing progression prediction techniques demonstrate potential with multimodal data, they are highly limited by their reliance on labelled data and fail to account for a key element of future progression prediction: not all features extracted at the current moment may be relevant for predicting progression several years later. To address these limitations in the literature, we design a novel semi-supervised multimodal learning hypergraph architecture, termed HIBMatch, by harnessing hypergraph knowledge based on information bottleneck and consistency regularisation strategies. Firstly, our framework utilises hypergraphs to represent multimodal data, encompassing both imaging and non-imaging modalities. Secondly, to harmonise relevant information from the currently captured data for future MCI conversion prediction, we propose a Hypergraph Information Bottleneck (HIB) that discriminates against irrelevant information, thereby focusing exclusively on harmonising relevant information for future MCI conversion prediction. Thirdly, our method enforces consistency regularisation between the HIB and a discriminative classifier to enhance the robustness and generalisation capabilities of HIBMatch under both topological and feature perturbations. Finally, to fully exploit the unlabeled data, HIBMatch incorporates a cross-modal contrastive loss for data efficiency. Extensive experiments on the Alzheimer’s Disease Neuroimaging Initiative (ADNI) dataset demonstrate that our proposed HIBMatch framework surpasses existing state-of-the-art methods in Alzheimer’s disease prognosis.

[471] DiffProtect: Generate Adversarial Examples with Diffusion Models for Facial Privacy Protection

Jiang Liu, Chun Pong Lau, Zhongliang Guo, Yuxiang Guo, Zhaoyang Wang, Rama Chellappa

Main category: cs.CV

TL;DR: DiffProtect uses diffusion models to generate adversarial face images that protect privacy by fooling facial recognition systems while maintaining high visual quality.

DetailsMotivation: Address privacy concerns from facial recognition systems on social media by improving adversarial attack methods that currently suffer from poor visual quality or low success rates.

Method: Utilizes a diffusion autoencoder to generate semantically meaningful perturbations for creating adversarial examples against facial recognition systems.

Result: Achieves 24.5% and 25.1% absolute improvements in attack success rates on CelebA-HQ and FFHQ datasets while producing more natural-looking encrypted images than state-of-the-art methods.

Conclusion: Diffusion models can effectively generate high-quality adversarial examples for facial recognition protection, balancing both visual quality and attack performance.

Abstract: The increasingly pervasive facial recognition (FR) systems raise serious concerns about personal privacy, especially for billions of users who have publicly shared their photos on social media. Several attempts have been made to protect individuals from being identified by unauthorized FR systems utilizing adversarial attacks to generate encrypted face images. However, existing methods suffer from poor visual quality or low attack success rates, which limit their utility. Recently, diffusion models have achieved tremendous success in image generation. In this work, we ask: can diffusion models be used to generate adversarial examples to improve both visual quality and attack performance? We propose DiffProtect, which utilizes a diffusion autoencoder to generate semantically meaningful perturbations on FR systems. Extensive experiments demonstrate that DiffProtect produces more natural-looking encrypted images than state-of-the-art methods while achieving significantly higher attack success rates, e.g., 24.5% and 25.1% absolute improvements on the CelebA-HQ and FFHQ datasets.

[472] A comprehensive and easy-to-use multi-domain multi-task medical imaging meta-dataset

Stefano Woerner, Arthur Jaques, Christian F. Baumgartner

Main category: cs.CV

TL;DR: MedIMeta is a standardized multi-domain medical imaging meta-dataset with 19 datasets across 10 domains and 54 medical tasks, designed to address data scarcity and preprocessing challenges in medical AI.

DetailsMotivation: Medical image analysis faces challenges with scarce, diverse datasets requiring extensive preprocessing due to varying formats and sizes, hindering machine learning applications.

Method: Created MedIMeta - a standardized meta-dataset containing 19 medical imaging datasets spanning 10 domains and 54 tasks, formatted for direct use in PyTorch and other ML frameworks.

Result: Technical validation showed MedIMeta’s utility through fully supervised and cross-domain few-shot learning baselines, demonstrating its effectiveness as a standardized resource.

Conclusion: MedIMeta provides a comprehensive, standardized solution to overcome data scarcity and preprocessing challenges in medical imaging, enabling more efficient machine learning research and applications.

Abstract: While the field of medical image analysis has undergone a transformative shift with the integration of machine learning techniques, the main challenge of these techniques is often the scarcity of large, diverse, and well-annotated datasets. Medical images vary in format, size, and other parameters and therefore require extensive preprocessing and standardization, for usage in machine learning. Addressing these challenges, we introduce the Medical Imaging Meta-Dataset (MedIMeta), a novel multi-domain, multi-task meta-dataset. MedIMeta contains 19 medical imaging datasets spanning 10 different domains and encompassing 54 distinct medical tasks, all of which are standardized to the same format and readily usable in PyTorch or other ML frameworks. We perform a technical validation of MedIMeta, demonstrating its utility through fully supervised and cross-domain few-shot learning baselines.

[473] Lane Graph Extraction from Aerial Imagery via Lane Segmentation Refinement with Diffusion Models

Antonio Ruiz, Andrew Melnik, Nicolo Savioli, Dong Wang, Yanfeng Zhang, Helge Ritter

Main category: cs.CV

TL;DR: A novel approach that refines lane masks from CNNs using diffusion models to improve lane graph extraction from aerial imagery, achieving better connectivity and performance metrics.

DetailsMotivation: Previous CNN-based methods for lane graph extraction often produce incomplete and inaccurate lane masks due to occlusions, lighting variations, and road texture changes, leading to poor-quality lane graphs.

Method: Proposes refining CNN-generated lane masks using diffusion models to enhance mask quality before applying segmentation-to-graph algorithms.

Result: Outperforms existing CNN-only and diffusion-only methods, with gains of 1.5% in GEO F1 and 3.5% in TOPO F1 over best CNN method, and 28%/34% improvements over prior diffusion approach.

Conclusion: The diffusion-based refinement approach significantly enhances lane graph quality, particularly improving connectivity metrics, with ablation studies validating the effectiveness of individual components.

Abstract: The lane graph is critical for applications such as autonomous driving and lane-level route planning. While previous research has focused on extracting lane-level graphs from aerial imagery using convolutional neural networks (CNNs) followed by post-processing segmentation-to-graph algorithms, these methods often face challenges in producing sharp and complete segmentation masks. Challenges such as occlusions, variations in lighting, and changes in road texture can lead to incomplete and inaccurate lane masks, resulting in poor-quality lane graphs. To address these challenges, we propose a novel approach that refines the lane masks, output by a CNN, using diffusion models. Experimental results on a publicly available dataset demonstrate that our method outperforms existing methods based solely on CNNs or diffusion models, particularly in terms of graph connectivity. Our lane mask refinement approach enhances the quality of the extracted lane graph, yielding gains of approximately 1.5% in GEO F1 and 3.5% in TOPO F1 scores over the best-performing CNN-based method, and improvements of 28% and 34%, respectively, compared to a prior diffusion-based approach. Both GEO F1 and TOPO F1 scores are critical metrics for evaluating lane graph quality. Additionally, ablation studies are conducted to evaluate the individual components of our approach, providing insights into their respective contributions and effectiveness.

[474] 3D-free meets 3D priors: Novel View Synthesis from a Single Image with Pretrained Diffusion Guidance

Taewon Kang, Divya Kothandaraman, Dinesh Manocha, Ming C. Lin

Main category: cs.CV

TL;DR: A method that combines 3D-free and 3D-based approaches to generate camera-controlled novel views from a single image, handling complex scenes without extensive 3D training data.

DetailsMotivation: Existing 3D NVS methods require extensive 3D training data and lack generalization, while 3D-free methods lack camera control. The goal is to combine benefits of both approaches for camera-controlled view synthesis from single images.

Method: Leverages pretrained NVS models for weak guidance, integrates this into 3D-free view synthesis approach, and enriches CLIP vision-language space with 3D camera angle information.

Result: Outperforms existing models in qualitative and quantitative evaluations, achieving high-fidelity, consistent novel view synthesis at desired camera angles across diverse scenes while maintaining image clarity.

Conclusion: The method successfully combines 3D-free and 3D-based approaches to enable camera-controlled novel view synthesis from single images without extensive 3D training data, demonstrating superior performance across various scenes.

Abstract: Recent 3D novel view synthesis (NVS) methods often require extensive 3D data for training, and also typically lack generalization beyond the training distribution. Moreover, they tend to be object centric and struggle with complex and intricate scenes. Conversely, 3D-free methods can generate text-controlled views of complex, in-the-wild scenes using a pretrained stable diffusion model without the need for a large amount of 3D-based training data, but lack camera control. In this paper, we introduce a method capable of generating camera-controlled viewpoints from a single input image, by combining the benefits of 3D-free and 3D-based approaches. Our method excels in handling complex and diverse scenes without extensive training or additional 3D and multiview data. It leverages widely available pretrained NVS models for weak guidance, integrating this knowledge into a 3D-free view synthesis style approach, along with enriching the CLIP vision-language space with 3D camera angle information, to achieve the desired results. Experimental results demonstrate that our method outperforms existing models in both qualitative and quantitative evaluations, achieving high-fidelity, consistent novel view synthesis at desired camera angles across a wide variety of scenes while maintaining accurate, natural detail representation and image clarity across various viewpoints. We also support our method with a comprehensive analysis of 2D image generation models and the 3D space, providing a solid foundation and rationale for our solution.

[475] BadVim: Unveiling Backdoor Threats in Visual State Space Model

Cheng-Yi Lee, Yu-Hsuan Chiang, Zhong-You Wu, Chia-Mu Yu, Chun-Shien Lu

Main category: cs.CV

TL;DR: BadVim is a novel backdoor attack framework targeting Visual State Space Models (VSSMs) that achieves over 97% attack success rate by poisoning only 0.3% of training data with low-rank perturbations on state-wise transitions.

DetailsMotivation: To investigate the robustness of Visual State Space Models against backdoor attacks, as these models show remarkable performance but their state-space representation properties may create vulnerabilities.

Method: Developed BadVim framework that uses low-rank perturbations on state-wise transitions during training to create backdoors, requiring only minimal data poisoning (0.3% of training data).

Result: Achieved over 97% attack success rate across three datasets, bypassing state-of-the-art defenses. VSSMs showed comparable backdoor robustness to Vision Transformers and superior robustness to CNNs.

Conclusion: The state-space representation that enhances VSSM capability also contributes to vulnerability against backdoor attacks, highlighting the trade-off between performance and robustness in model design.

Abstract: Visual State Space Models (VSSM) have shown remarkable performance in various computer vision tasks. However, backdoor attacks pose significant security challenges, causing compromised models to predict target labels when specific triggers are present while maintaining normal behavior on benign samples. In this paper, we investigate the robustness of VSSMs against backdoor attacks. Specifically, we delicately design a novel framework for VSSMs, dubbed BadVim, which utilizes low-rank perturbations on state-wise to uncover their impact on state transitions during training. By poisoning only $0.3%$ of the training data, our attacks cause any trigger-embedded input to be misclassified to the targeted class with a high attack success rate (over 97%) at inference time. Our findings suggest that the state-space representation property of VSSMs, which enhances model capability, may also contribute to its vulnerability to backdoor attacks. Our attack exhibits effectiveness across three datasets, even bypassing state-of-the-art defenses against such attacks. Extensive experiments show that the backdoor robustness of VSSMs is comparable to that of Transformers (ViTs) and superior to that of Convolutional Neural Networks (CNNs). We believe our findings will prompt the community to reconsider the trade-offs between performance and robustness in model design.

[476] Segmentation and Smoothing Affect Explanation Quality More Than the Choice of Perturbation-based XAI Method for Image Explanations

Gustav Grund Pihlgren, Kary Främling

Main category: cs.CV

TL;DR: This paper analyzes perturbation-based image explanation methods, finding that attribution calculation has little impact while segmentation and per-pixel attribution significantly affect performance.

DetailsMotivation: To understand which parameters of perturbation-based image explanation methods are responsible for their varying performance, as current understanding is limited despite many methods existing.

Method: Used RISE method as baseline to evaluate combinations of mask sampling, segmentation techniques, smoothing, attribution calculation, and per-segment vs per-pixel attribution using a proxy metric.

Result: Attribution calculation has little impact on results, while segmentation and per-pixel attribution (rarely examined parameters) have significant impact.

Conclusion: Future work on perturbation-based explanation methods should focus more on segmentation and per-pixel attribution rather than attribution calculation.

Abstract: Perturbation-based post-hoc image explanation methods are commonly used to explain image prediction models. These methods perturb parts of the input to measure how those parts affect the output. Since the methods only require the input and output, they can be applied to any model, making them a popular choice to explain black-box models. While many different methods exist and have been compared with one another, it remains poorly understood which parameters of the different methods are responsible for their varying performance. This work uses the Randomized Input Sampling for Explanations (RISE) method as a baseline to evaluate many combinations of mask sampling, segmentation techniques, smoothing, attribution calculation, and per-segment or per-pixel attribution, using a proxy metric. The results show that attribution calculation, which is frequently the focus of other works, has little impact on the results. Conversely, segmentation and per-pixel attribution, rarely examined parameters, have a significant impact. The implementation of and data gathered in this work are available online: https://github.com/guspih/post-hoc-image-perturbation and https://bit.ly/smooth-mask-perturbation.

[477] An Efficient Watermarking Method for Latent Diffusion Models via Low-Rank Adaptation and Dynamic Loss Weighting

Dongdong Lin, Yue Li, Benedetta Tondi, Kaiqing Lin, Bin Li, Mauro Barni

Main category: cs.CV

TL;DR: Efficient watermarking method for Latent Diffusion Models using Low-Rank Adaptation (LoRA) that embeds watermarks with minimal impact on image quality while maintaining robustness.

DetailsMotivation: With the proliferation of large models, efficient watermark embedding is essential to manage computational demands and prevent model performance degradation while protecting intellectual property.

Method: Uses Low-Rank Adaptation (LoRA) to introduce trainable low-rank parameters into frozen LDMs, preserving original weights. Includes dynamic loss weight scheduler to balance generative quality and watermark fidelity.

Result: Method ensures fast and accurate watermark embedding with high-quality generated images, maintaining robustness comparable or superior to state-of-the-art approaches. Generalizes well across datasets and base LDMs.

Conclusion: The proposed EW-LoRA method provides an efficient and effective watermarking solution for large diffusion models, balancing computational efficiency with robust watermark protection.

Abstract: The rapid proliferation of Deep Neural Networks (DNNs) is driving a surge in model watermarking technologies, as the trained models themselves constitute valuable intellectual property. Existing watermarking approaches primarily focus on modifying model parameters or altering sampling behaviors. However, with the emergence of increasingly large models, improving the efficiency of watermark embedding becomes essential to manage increasing computational demands. Prioritizing efficiency not only optimizes resource utilization, making the watermarking process more applicable for large models, but also mitigates potential degradation of model performance. In this paper, we propose an efficient watermarking method for Latent Diffusion Models (LDMs) based on Low-Rank Adaptation (LoRA). The core idea is to introduce trainable low-rank parameters into the frozen LDM to embed watermark, thereby preserving the integrity of the original model weights. Furthermore, a dynamic loss weight scheduler is designed to adaptively balance the objectives of generative quality and watermark fidelity, enabling the model to achieve effective watermark embedding with minimal impact on quality of the generated images. Experimental results show that the proposed method ensures fast and accurate watermark embedding and a high quality of the generated images, at the same time maintaining a level of robustness aligned - in some cases superior - with state-of-the-art approaches. Moreover, the method generalizes well across different datasets and base LDMs. Codes are available at: https://github.com/MrDongdongLin/EW-LoRA.

[478] Revisiting Long-Tailed Learning: Insights from an Architectural Perspective

Yuhan Pan, Yanan Sun, Wei Gong

Main category: cs.CV

TL;DR: This paper analyzes how neural network architectures affect Long-Tailed recognition performance and proposes LT-DARTS, a NAS method with optimized convolutional operations and search strategies for imbalanced data.

DetailsMotivation: To bridge the gap between Long-Tailed recognition challenges and neural network design, as architecture choices significantly impact performance but have received limited attention in LT settings.

Method: Systematic analysis of network components (topology, convolutions, activation functions), proposal of two optimized convolutional operations, and development of LT-DARTS - a NAS method with novel search space and strategy for LT data.

Result: The approach consistently outperforms existing architectures across multiple LT datasets, achieving parameter-efficient state-of-the-art results when integrated with current LT methods.

Conclusion: Neural architecture design is crucial for Long-Tailed recognition, and the proposed LT-DARTS method effectively addresses LT challenges through optimized operations and NAS-based exploration.

Abstract: Long-Tailed (LT) recognition has been widely studied to tackle the challenge of imbalanced data distributions in real-world applications. However, the design of neural architectures for LT settings has received limited attention, despite evidence showing that architecture choices can substantially affect performance. This paper aims to bridge the gap between LT challenges and neural network design by providing an in-depth analysis of how various architectures influence LT performance. Specifically, we systematically examine the effects of key network components on LT handling, such as topology, convolutions, and activation functions. Based on these observations, we propose two convolutional operations optimized for improved performance. Recognizing that operation interactions are also crucial to network effectiveness, we apply Neural Architecture Search (NAS) to facilitate efficient exploration. We propose LT-DARTS, a NAS method with a novel search space and search strategy specifically designed for LT data. Experimental results demonstrate that our approach consistently outperforms existing architectures across multiple LT datasets, achieving parameter-efficient, state-of-the-art results when integrated with current LT methods.

[479] A Framework for Real-Time Volcano-Seismic Event Recognition Based on Multi-Station Seismograms and Semantic Segmentation Models

Camilo Espinosa-Curilem, Millaray Curilem, Daniel Basualto

Main category: cs.CV

TL;DR: The paper introduces a semantic segmentation approach for automated seismic event recognition in volcano monitoring, using 2D representations of multi-channel 1D seismic signals to perform simultaneous detection and classification of five seismic event types.

DetailsMotivation: Traditional manual analysis of seismic events is subjective and labor-intensive, while current automatic methods often separate detection and classification, rely on single stations, and require extensive preprocessing, limiting real-time monitoring applications across different volcanoes.

Method: Proposes using semantic segmentation models (UNet, UNet++, DeepLabV3+, SwinUNet) on 2D representations of multi-channel seismic data with minimal preprocessing, enabling end-to-end simultaneous detection and classification of five seismic event classes.

Result: Evaluated on ~25,000 events from four Chilean volcanoes, UNet performed best with mean F1 score of 0.91 and IoU of 0.88, showing superior noise robustness and generalization to unseen volcano datasets.

Conclusion: The semantic segmentation approach provides an effective, data-driven solution for automated seismic event recognition that integrates multi-station data with minimal preprocessing, achieving high performance and demonstrating strong generalization capabilities across different volcanic environments.

Abstract: In volcano monitoring, effective recognition of seismic events is essential for understanding volcanic activity and raising timely warning alerts. Traditional methods rely on manual analysis, which can be subjective and labor-intensive. Furthermore, current automatic approaches often tackle detection and classification separately, mostly rely on single station information and generally require tailored preprocessing and representations to perform predictions. These limitations often hinder their application to real-time monitoring and utilization across different volcano conditions. This study introduces a novel approach that utilizes Semantic Segmentation models to automate seismic event recognition by applying a straight forward transformation of multi-channel 1D signals into 2D representations, enabling their use as images. Our framework employs a data-driven, end-to-end design that integrates multi-station seismic data with minimal preprocessing, performing both detection and classification simultaneously for five seismic event classes. We evaluated four state-of-the-art segmentation models (UNet, UNet++, DeepLabV3+ and SwinUNet) on approximately 25.000 seismic events recorded at four different Chilean volcanoes: Nevados del Chillán Volcanic Complex, Laguna del Maule, Villarrica and Puyehue-Cordón Caulle. Among these models, the UNet architecture was identified as the most effective model, achieving mean F1 and Intersection over Union (IoU) scores of up to 0.91 and 0.88, respectively, and demonstrating superior noise robustness and model flexibility to unseen volcano datasets.

[480] Efficient Feature Aggregation and Scale-Aware Regression for Monocular 3D Object Detection

Yifan Wang, Xiaochen Yang, Fanqi Pu, Qingmin Liao, Wenming Yang

Main category: cs.CV

TL;DR: MonoASRH is a novel monocular 3D object detection framework that addresses limitations in existing methods by combining global semantic feature extraction with adaptive scale-aware regression.

DetailsMotivation: Existing monocular 3D detection methods rely on progressive cross-scale feature aggregation and local information, leading to lack of global awareness, omission of small objects, and inaccurate receptive fields due to scale variation across scenes and depths.

Method: Proposes MonoASRH with two key components: Efficient Hybrid Feature Aggregation Module (EH-FAM) using multi-head attention for global semantic features and lightweight convolution for cross-scale aggregation, and Adaptive Scale-Aware 3D Regression Head (ASRH) that fuses scale features with semantic features and learns dynamic receptive field offsets.

Result: Extensive experiments on KITTI and Waymo datasets demonstrate that MonoASRH achieves state-of-the-art performance in monocular 3D object detection.

Conclusion: The proposed framework effectively addresses scale variation issues and improves detection performance by combining global semantic awareness with adaptive scale-aware regression.

Abstract: Monocular 3D object detection has attracted great attention due to simplicity and low cost. Existing methods typically follow conventional 2D detection paradigms, first locating object centers and then predicting 3D attributes via neighboring features. However, these methods predominantly rely on progressive cross-scale feature aggregation and focus solely on local information, which may result in a lack of global awareness and the omission of small-scale objects. In addition, due to large variation in object scales across different scenes and depths, inaccurate receptive fields often lead to background noise and degraded feature representation. To address these issues, we introduces MonoASRH, a novel monocular 3D detection framework composed of Efficient Hybrid Feature Aggregation Module (EH-FAM) and Adaptive Scale-Aware 3D Regression Head (ASRH). Specifically, EH-FAM employs multi-head attention with a global receptive field to extract semantic features for small-scale objects and leverages lightweight convolutional modules to efficiently aggregate visual features across different scales. The ASRH encodes 2D bounding box dimensions and then fuses scale features with the semantic features aggregated by EH-FAM through a scale-semantic feature fusion module. The scale-semantic feature fusion module guides ASRH in learning dynamic receptive field offsets, incorporating scale priors into 3D position prediction for better scale-awareness. Extensive experiments on the KITTI and Waymo datasets demonstrate that MonoASRH achieves state-of-the-art performance.

[481] Preserving Angles Improves Feature Distillation

Evelyn J. Mannix, Liam Hodgkinson, Howard Bondell

Main category: cs.CV

TL;DR: CosPress is a feature distillation method that compresses teacher model’s latent space into student’s smaller space while preserving cosine similarities between embeddings, improving accuracy, robustness, and OOD detection performance.

DetailsMotivation: Standard knowledge distillation fails to effectively transfer properties like robustness and OOD detection from foundation models, while existing feature distillation methods also fall short in matching these critical properties.

Method: Introduces Cosine-similarity Preserving Compression (CosPress) - a feature distillation technique that learns a mapping to compress teacher’s latent space into student’s smaller space while preserving cosine similarities between image embeddings.

Result: Produces more accurate models with better performance on generalizability, robustness and OOD detection benchmarks across various datasets including ImageNet, and enables training highly performant lightweight models on small datasets.

Conclusion: CosPress provides a competitive pathway for compressing foundation models while faithfully reproducing teacher properties, outperforming existing distillation methods in critical areas like robustness and OOD detection.

Abstract: Knowledge distillation methods compress models by training a student network using the classification outputs of a high quality teacher model, but can fail to effectively transfer the properties of computer vision foundation models from the teacher to the student. While it has been recently shown that feature distillation$\unicode{x2013}$where a teacher model’s output features are replicated instead$\unicode{x2013}$can reproduce performance for foundation models across numerous downstream tasks, they fall short in matching critical properties such as robustness and out-of-distribution (OOD) detection performance. This paper overcomes this shortcoming by introducing Cosine-similarity Preserving Compression (CosPress), a feature distillation technique that learns a mapping to compress the latent space of the teacher model into the smaller latent space of the student, by preserving the cosine similarities between image embeddings. This enables direct optimisation of the student network and produces a more faithful reproduction of the teacher’s properties. It is shown that distillation with CosPress on a variety of datasets, including ImageNet, produces more accurate models with greater performance on generalisability, robustness and OOD detection benchmarks, and that this technique provides a competitive pathway for training highly performant lightweight models on small datasets. Code is available at github.com/emannix/cospress.

[482] Filter, Correlate, Compress: Training-Free Token Reduction for MLLM Acceleration

Yuhang Han, Xuyang Liu, Zihan Zhang, Pengxiang Ding, Junjie Chen, Donglin Wang, Honggang Chen, Qingsen Yan, Siteng Huang

Main category: cs.CV

TL;DR: FiCoCo framework accelerates MLLMs by optimizing multimodal context length through a ‘filter-correlate-compress’ approach, achieving up to 14.7x FLOPs reduction with 93.6% performance retention without retraining.

DetailsMotivation: Quadratic complexity of Multimodal Large Language Models (MLLMs) with respect to context length creates computational and memory challenges that hinder real-world deployment.

Method: Filter-correlate-compress framework with FiCoCo-V (training-free vision encoder optimization using redundancy-based token discard and correlation-based information recycling) and FiCoCo-L (task-aware textual priors in LLM decoder).

Result: Achieves up to 14.7x FLOPs reduction with 93.6% performance retention, consistently outperforms state-of-the-art training-free approaches across various model architectures, sizes, and tasks.

Conclusion: FiCoCo series effectively accelerates MLLMs while maintaining performance, demonstrating effectiveness and generalizability without requiring retraining.

Abstract: The quadratic complexity of Multimodal Large Language Models (MLLMs) with respect to context length poses significant computational and memory challenges, hindering their real-world deployment. In the paper, we devise a ‘‘filter-correlate-compress’’ framework to accelerate the MLLM by systematically optimizing multimodal context length during prefilling. The framework first implements FiCoCo-V, a training-free method operating within the vision encoder. It employs a redundancy-based token discard mechanism that uses a novel integrated metric to accurately filter out redundant visual tokens. To mitigate information loss, the framework introduces a correlation-based information recycling mechanism that allows preserved tokens to selectively recycle information from correlated discarded tokens with a self-preserving compression, thereby preventing the dilution of their own core content. The framework’s FiCoCo-L variant further leverages task-aware textual priors to perform token reduction directly within the LLM decoder. Extensive experiments demonstrate that the FiCoCo series effectively accelerates a range of MLLMs, achieves up to 14.7x FLOPs reduction with 93.6% performance retention. Our methods consistently outperform state-of-the-art training-free approaches, showcasing effectiveness and generalizability across model architectures, sizes, and tasks without requiring retraining. Code: https://github.com/kawhiiiileo/FiCoCo

[483] Density-aware global-local attention network for point cloud segmentation

Chade Li, Pengju Zhang, Jiaming Zhang, Yihong Wu

Main category: cs.CV

TL;DR: A point cloud segmentation network that fuses local attention based on density perception with global attention to better handle small objects and categories with small sample sizes.

DetailsMotivation: Existing point cloud segmentation networks struggle with small objects and categories with small sample sizes in real-world scenes, leading to information loss in dense areas.

Method: Proposes a network that divides different sized windows for local areas with different densities to compute attention, treats local areas as tokens for global attention, and introduces category-response loss to balance different categories and object sizes.

Result: Achieves competitive results in semantic segmentation and part segmentation tasks on multiple public datasets, and demonstrates strong segmentation capability for small objects and small sample categories in complex real-world scenes.

Conclusion: The proposed density-aware local attention fusion with global attention effectively handles small objects and small sample categories in point cloud segmentation, providing improved performance in real-world applications.

Abstract: 3D point cloud segmentation has a wide range of applications in areas such as autonomous driving, augmented reality, virtual reality and digital twins. The point cloud data collected in real scenes often contain small objects and categories with small sample sizes, which are difficult to handle by existing networks. In this regard, we propose a point cloud segmentation network that fuses local attention based on density perception with global attention. The core idea is to increase the effective receptive field of each point while reducing the loss of information about small objects in dense areas. Specifically, we divide different sized windows for local areas with different densities to compute attention within the window. Furthermore, we consider each local area as an independent token for the global attention of the entire input. A category-response loss is also proposed to balance the processing of different categories and sizes of objects. In particular, we set up an additional fully connected layer in the middle of the network for prediction of the presence of object categories, and construct a binary cross-entropy loss to respond to the presence of categories in the scene. In experiments, our method achieves competitive results in semantic segmentation and part segmentation tasks on several publicly available datasets. Experiments on point cloud data obtained from complex real-world scenes filled with tiny objects also validate the strong segmentation capability of our method for small objects as well as small sample categories.

[484] RAC3: Retrieval-Augmented Corner Case Comprehension for Autonomous Driving with Vision-Language Models

Yujin Wang, Quanfeng Liu, Jiaqi Fan, Jinlong Hong, Hongqing Chu, Mengjian Tian, Bingzhao Gao, Hong Chen

Main category: cs.CV

TL;DR: RAC3 is a framework that enhances vision-language models for autonomous driving corner case comprehension using retrieval-augmented strategies, cross-modal alignment, and multimodal chain-of-thought prompting.

DetailsMotivation: Vision-language models face challenges like hallucination and insufficient real-world grounding in critical driving scenarios, compromising safety and reliability in autonomous driving systems.

Method: Integrates frequency-spatial fusion image encoder, cross-modal alignment training with hard/semi-hard negative mining, K-Means clustering with HNSW indexing for fast retrieval, multimodal chain-of-thought prompting, and continual learning mechanism.

Result: Achieves highest score of 74.46 on CODA-LM benchmark and shows consistent performance gains when integrated with end-to-end frameworks like DriveLM, significantly improving corner case comprehension across multiple downstream tasks.

Conclusion: Demonstrates effectiveness of retrieval-augmented strategies and cross-modal alignment for safer and more interpretable autonomous driving.

Abstract: Understanding and addressing corner cases is essential for ensuring the safety and reliability of autonomous driving systems. Vision-language models (VLMs) play a crucial role in enhancing scenario comprehension, yet they face significant challenges, such as hallucination and insufficient real-world grounding, which compromise their performance in critical driving scenarios. In this work, RAC3, a novel framework designed to enhance the performance of VLMs in corner case comprehension, is proposed. RAC3 integrates a frequency-spatial fusion (FSF) image encoder, a cross-modal alignment training method for embedding models with hard and semi-hard negative mining, and a fast querying and retrieval pipeline based on K-Means clustering and hierarchical navigable small world (HNSW) indexing. A multimodal chain-of-thought (CoT) prompting strategy to guide analogical reasoning and reduce hallucinations during inference is introduced. Moreover, an update mechanism is integrated into RAC3 to ensure continual learning within the framework. Extensive experiments on the CODA and nuScenes datasets demonstrate that RAC3 significantly improves corner case comprehension across multiple downstream tasks. Compared to prior state-of-the-art methods, RAC3 achieves the highest final score of 74.46 on the CODA-LM benchmark and shows consistent performance gains when integrated with end-to-end frameworks like DriveLM. These results demonstrate the effectiveness of retrieval-augmented strategies and cross-modal alignment for safer and more interpretable autonomous driving.

[485] TopoBDA: Towards Bezier Deformable Attention for Road Topology Understanding

Muhammet Esat Kalfaoglu, Halil Ibrahim Ozturk, Ozsel Kilinc, Alptekin Temizel

Main category: cs.CV

TL;DR: TopoBDA uses Bezier Deformable Attention to enhance road topology understanding from multi-camera imagery, achieving state-of-the-art centerline detection and topology reasoning on autonomous driving datasets.

DetailsMotivation: Understanding road topology is crucial for autonomous driving, and existing methods need improvement in detecting elongated polyline structures like lane centerlines.

Method: Processes multi-camera 360-degree imagery to generate BEV features, refined through transformer decoder with Bezier Deformable Attention using Bezier control points, plus auxiliary instance mask loss and one-to-many set prediction loss.

Result: Outperforms existing methods on OpenLane-V2 dataset for centerline detection and topology reasoning, and achieves best results on OpenLane-V1 for 3D lane detection. Multimodal inputs further enhance performance.

Conclusion: TopoBDA effectively improves road topology comprehension through Bezier Deformable Attention and auxiliary components, with potential for further enhancement through multimodal data integration.

Abstract: Understanding road topology is crucial for autonomous driving. This paper introduces TopoBDA (Topology with Bezier Deformable Attention), a novel approach that enhances road topology comprehension by leveraging Bezier Deformable Attention (BDA). TopoBDA processes multi-camera 360-degree imagery to generate Bird’s Eye View (BEV) features, which are refined through a transformer decoder employing BDA. BDA utilizes Bezier control points to drive the deformable attention mechanism, improving the detection and representation of elongated and thin polyline structures, such as lane centerlines. Additionally, TopoBDA integrates two auxiliary components: an instance mask formulation loss and a one-to-many set prediction loss strategy, to further refine centerline detection and enhance road topology understanding. Experimental evaluations on the OpenLane-V2 dataset demonstrate that TopoBDA outperforms existing methods, achieving state-of-the-art results in centerline detection and topology reasoning. TopoBDA also achieves the best results on the OpenLane-V1 dataset in 3D lane detection. Further experiments on integrating multi-modal data – such as LiDAR, radar, and SDMap – show that multimodal inputs can further enhance performance in road topology understanding.

[486] Transferability of Adversarial Attacks in Video-based MLLMs: A Cross-modal Image-to-Video Approach

Linhao Huang, Xue Jiang, Zhiqiang Wang, Wentao Mo, Xi Xiao, Bo Han, Yongjie Yin, Feng Zheng

Main category: cs.CV

TL;DR: The paper introduces I2V-MLLM attack, a black-box adversarial attack method that uses image-based MLLMs as surrogates to craft transferable adversarial videos against video-based MLLMs, addressing limitations of existing methods.

DetailsMotivation: Video-based MLLMs are vulnerable to adversarial examples, but transferability of adversarial videos to unseen models remains unexplored. Existing methods have limitations in black-box settings: poor feature generalization, sparse frame focus, and lack of multimodal integration.

Method: I2V-MLLM attack uses an image-based MLLM as surrogate to craft adversarial videos. It integrates multimodal interactions and spatiotemporal information to disrupt video representations, and employs perturbation propagation to handle different frame sampling strategies.

Result: The method achieves strong transferability across different V-MLLMs on multiple video-text tasks. Black-box attacks using BLIP-2 as surrogate achieve average attack success rates of 57.98% on MSVD-QA and 58.26% on MSRVTT-QA for Zero-Shot VideoQA tasks.

Conclusion: I2V-MLLM effectively addresses limitations of existing adversarial attacks for V-MLLMs in black-box scenarios, demonstrating competitive performance compared to white-box attacks and highlighting vulnerabilities in video-text multimodal systems.

Abstract: Video-based multimodal large language models (V-MLLMs) have shown vulnerability to adversarial examples in video-text multimodal tasks. However, the transferability of adversarial videos to unseen models - a common and practical real-world scenario - remains unexplored. In this paper, we pioneer an investigation into the transferability of adversarial video samples across V-MLLMs. We find that existing adversarial attack methods face significant limitations when applied in black-box settings for V-MLLMs, which we attribute to the following shortcomings: (1) lacking generalization in perturbing video features, (2) focusing only on sparse key-frames, and (3) failing to integrate multimodal information. To address these limitations and deepen the understanding of V-MLLM vulnerabilities in black-box scenarios, we introduce the Image-to-Video MLLM (I2V-MLLM) attack. In I2V-MLLM, we utilize an image-based multimodal large language model (I-MLLM) as a surrogate model to craft adversarial video samples. Multimodal interactions and spatiotemporal information are integrated to disrupt video representations within the latent space, improving adversarial transferability. Additionally, a perturbation propagation technique is introduced to handle different unknown frame sampling strategies. Experimental results demonstrate that our method can generate adversarial examples that exhibit strong transferability across different V-MLLMs on multiple video-text multimodal tasks. Compared to white-box attacks on these models, our black-box attacks (using BLIP-2 as a surrogate model) achieve competitive performance, with average attack success rate (AASR) of 57.98% on MSVD-QA and 58.26% on MSRVTT-QA for Zero-Shot VideoQA tasks, respectively.

[487] DehazeGS: Seeing Through Fog with 3D Gaussian Splatting

Jinze Yu, Yiqun Wang, Aiheng Jiang, Zhengda Lu, Jianwei Guo, Yong Li, Hongxing Qin, Xiaopeng Zhang

Main category: cs.CV

TL;DR: DehazeGS uses explicit Gaussian representations with physical scattering models to remove fog from multi-view images, achieving state-of-the-art performance with faster rendering than NeRF-based methods.

DetailsMotivation: Current NeRF-based dehazing methods suffer from high computational costs due to deep neural networks and per-ray sampling, and their implicit representations limit fine-grained detail recovery from hazy scenes.

Method: Learn explicit Gaussian representation using atmospheric scattering model, establish transmission function directly on Gaussian primitives via depth-to-transmission mapping, jointly learn atmospheric light and scattering coefficients during training, and remove scattering effects at inference.

Result: Achieves state-of-the-art performance on both real-world and synthetic foggy datasets, with improved computational efficiency compared to NeRF-based approaches.

Conclusion: Explicit Gaussian representation with physical forward rendering effectively handles fog removal from multi-view images while recovering fine details and maintaining computational efficiency.

Abstract: Current novel view synthesis methods are typically designed for high-quality and clean input images. However, in foggy scenes, scattering and attenuation can significantly degrade the quality of rendering. Although NeRF-based dehazing approaches have been developed, their reliance on deep fully connected neural networks and per-ray sampling strategies leads to high computational costs. Furthermore, NeRF’s implicit representation limits its ability to recover fine-grained details from hazy scenes. To overcome these limitations, we propose learning an explicit Gaussian representation to explain the formation mechanism of foggy images through a physically forward rendering process. Our method, DehazeGS, reconstructs and renders fog-free scenes using only multi-view foggy images as input. Specifically, based on the atmospheric scattering model, we simulate the formation of fog by establishing the transmission function directly onto Gaussian primitives via depth-to-transmission mapping. During training, we jointly learn the atmospheric light and scattering coefficients while optimizing the Gaussian representation of foggy scenes. At inference time, we remove the effects of scattering and attenuation in Gaussian distributions and directly render the scene to obtain dehazed views. Experiments on both real-world and synthetic foggy datasets demonstrate that DehazeGS achieves state-of-the-art performance. visualizations are available at https://dehazegs.github.io/

[488] GUSLO: General and Unified Structured Light Optimization

Tinglei Wan, Tonghua Su, Zhongjie Wang

Main category: cs.CV

TL;DR: GUSLO is a unified framework for structured light 3D reconstruction that eliminates scene-specific calibration and works across different SL patterns through single-shot calibration and artifact-aware photometric adaptation.

DetailsMotivation: Existing structured light 3D reconstruction methods require scene-specific calibration with manual parameter tuning and are optimized for specific SL patterns, limiting their generalizability across varied industrial and cultural heritage scenarios.

Method: GUSLO uses two coordinated innovations: (1) single-shot calibration via 2D triangulation-based interpolation to convert sparse matches into dense correspondence fields, and (2) artifact-aware photometric adaptation via explicit transfer functions to balance generalization and color fidelity.

Result: Experiments across binary, speckle, and color-coded settings show GUSLO consistently improves accuracy and cross-encoding robustness over conventional methods in challenging industrial and cultural scenarios.

Conclusion: GUSLO provides a general and unified solution for structured light 3D reconstruction that overcomes the limitations of existing methods by eliminating scene-specific calibration requirements and working effectively across different SL patterns.

Abstract: Structured light (SL) 3D reconstruction captures the precise surface shape of objects, providing high-accuracy 3D data essential for industrial inspection and cultural heritage digitization. However, existing methods suffer from two key limitations: reliance on scene-specific calibration with manual parameter tuning, and optimization frameworks tailored to specific SL patterns, limiting their generalizability across varied scenarios. We propose General and Unified Structured Light Optimization (GUSLO), a novel framework addressing these issues through two coordinated innovations: (1) single-shot calibration via 2D triangulation-based interpolation that converts sparse matches into dense correspondence fields, and (2) artifact-aware photometric adaptation via explicit transfer functions, balancing generalization and color fidelity. We conduct diverse experiments covering binary, speckle, and color-coded settings. Results show that GUSLO consistently improves accuracy and cross-encoding robustness over conventional methods in challenging industrial and cultural scenarios.

[489] SVBench: A Benchmark with Temporal Multi-Turn Dialogues for Streaming Video Understanding

Zhenyu Yang, Yuhang Hu, Zemin Du, Dizhan Xue, Shengsheng Qian, Jiahong Wu, Fan Yang, Weiming Dong, Changsheng Xu

Main category: cs.CV

TL;DR: SVBench is a new benchmark for evaluating Large Vision-Language Models’ ability to understand long-context streaming videos through temporal multi-turn question-answering chains, revealing most open-source models struggle with this task.

DetailsMotivation: Current video understanding benchmarks focus on single-instance text inputs and fail to evaluate temporal reasoning across entire video streams, creating a gap in assessing LVLMs' streaming video understanding capabilities.

Method: Created SVBench with 49,979 QA pairs from 1,353 streaming videos using a semi-automated annotation pipeline that generates QA chains representing consecutive multi-turn dialogues over video segments and constructs temporal linkages between successive chains.

Result: GPT-4o outperforms others, but most open-source LVLMs struggle with long-context streaming video understanding. The authors’ StreamingChat model significantly outperforms open-source LVLMs on SVBench while maintaining comparable performance on other vision-language benchmarks.

Conclusion: SVBench advances streaming video understanding research by providing comprehensive analysis of current LVLMs’ capabilities, highlighting the need for improved temporal reasoning in long-context video streams.

Abstract: Despite the significant advancements of Large Vision-Language Models (LVLMs) on established benchmarks, there remains a notable gap in suitable evaluation regarding their applicability in the emerging domain of long-context streaming video understanding. Current benchmarks for video understanding typically emphasize isolated single-instance text inputs and fail to evaluate the capacity to sustain temporal reasoning throughout the entire duration of video streams. To address these limitations, we introduce SVBench, a pioneering benchmark with temporal multi-turn question-answering chains specifically designed to thoroughly assess the capabilities of streaming video understanding of current LVLMs. We design a semi-automated annotation pipeline to obtain 49,979 Question-Answer (QA) pairs of 1,353 streaming videos, which includes generating QA chains that represent a series of consecutive multi-turn dialogues over video segments and constructing temporal linkages between successive QA chains. Our experimental results, obtained from 14 models in dialogue and streaming evaluations, reveal that while the closed-source GPT-4o outperforms others, most open-source LVLMs struggle with long-context streaming video understanding. We also construct a StreamingChat model, which significantly outperforms open-source LVLMs on our SVBench and achieves comparable performance on diverse vision-language benchmarks. We expect SVBench to advance the research of streaming video understanding by providing a comprehensive and in-depth analysis of current LVLMs. Our benchmark and model can be accessed at https://github.com/sotayang/SVBench.

[490] Safeguarding AI in Medical Imaging: Post-Hoc Out-of-Distribution Detection with Normalizing Flows

Dariush Lotfi, Mohammad-Ali Nikouei Mahani, Mohamad Koohi-Moghadam, Kyongtae Ty Bae

Main category: cs.CV

TL;DR: Proposes a post-hoc normalizing flow-based OOD detection method that integrates with pre-trained models without retraining, achieving state-of-the-art performance on medical imaging datasets.

DetailsMotivation: Current OOD detection methods require impractical retraining or model modifications, hindering adoption in regulated clinical environments where AI reliability is critical for preventing diagnostic errors.

Method: Post-hoc normalizing flow-based approach that seamlessly integrates with existing pre-trained models without altering their weights, evaluated on MedOOD and MedMNIST datasets.

Result: Achieved 84.61% AUROC on MedOOD (outperforming ViM and MDS) and 93.8% AUROC on MedMNIST (surpassing ViM and ReAct), demonstrating strong OOD detection performance.

Conclusion: The method provides a practical and effective safeguard for clinical imaging workflows due to its strong performance and post-hoc integration capability without model modifications.

Abstract: In AI-driven medical imaging, the failure to detect out-of-distribution (OOD) data poses a severe risk to clinical reliability, potentially leading to critical diagnostic errors. Current OOD detection methods often demand impractical retraining or modifications to pre-trained models, hindering their adoption in regulated clinical environments. To address this challenge, we propose a post-hoc normalizing flow-based approach that seamlessly integrates with existing pre-trained models without altering their weights. We evaluate the approach on our in-house-curated MedOOD dataset, designed to capture clinically relevant distribution shifts, and on the MedMNIST benchmark. The proposed method achieves an AUROC of 84.61% on MedOOD, outperforming ViM (80.65%) and MDS (80.87%), and reaches 93.8% AUROC on MedMNIST, surpassing ViM (88.08%) and ReAct (87.05%). This combination of strong performance and post-hoc integration capability makes our approach a practical and effective safeguard for clinical imaging workflows. The model and code to build OOD datasets are publicly accessible at https://github.com/dlotfi/MedOODFlow.

[491] RTGen: Real-Time Generative Detection Transformer

Chi Ruan, Jiying Zhao, Wenhu Chen

Main category: cs.CV

TL;DR: RTGen is a real-time generative object detector that uses a novel Region-Language Decoder to jointly decode visual and textual representations, achieving 131.3 FPS on T4 GPUs without relying on external supervision.

DetailsMotivation: Existing generative object detectors suffer from structural redundancy and substantial latency despite enabling direct category name generation, creating a need for real-time performance.

Method: Proposes RTGen with a succinct encoder-decoder architecture featuring a Region-Language Decoder that organizes textual side as a Directed Acyclic Graph for non-autoregressive category naming.

Result: RTGen-R34 achieves 131.3 FPS on T4 GPUs (270x faster than GenerateU) and learns to generate category names directly from detection labels without external supervision.

Conclusion: RTGen enables efficient and flexible open-ended detection with real-time performance while eliminating dependency on external models like CLIP or pretrained language models.

Abstract: Although open-vocabulary object detectors can generalize to unseen categories, they still rely on predefined textual prompts or classifier heads during inference. Recent generative object detectors address this limitation by coupling an autoregressive language model with a detector backbone, enabling direct category name generation for each detected object. However, this straightforward design introduces structural redundancy and substantial latency. In this paper, we propose a Real-Time Generative Detection Transformer (RTGen), a real-time generative object detector with a succinct encoder-decoder architecture. Specifically, we introduce a novel Region-Language Decoder (RL-Decoder) that jointly decodes visual and textual representations within a unified framework. The textual side is organized as a Directed Acyclic Graph (DAG), enabling non-autoregressive category naming. Benefiting from these designs, RTGen-R34 achieves 131.3 FPS on T4 GPUs, over 270x faster than GenerateU. Moreover, our models learn to generate category names directly from detection labels, without relying on external supervision such as CLIP or pretrained language models, achieving efficient and flexible open-ended detection.

[492] Open-Insect: Benchmarking Open-Set Recognition of Novel Species in Biodiversity Monitoring

Yuyan Chen, Nico Lang, B. Christian Schmidt, Aditya Jain, Yves Basset, Sara Beery, Maxim Larrivée, David Rolnick

Main category: cs.CV

TL;DR: Open-Insect dataset introduced for evaluating unknown species detection in biodiversity monitoring, benchmarking 38 OSR algorithms with simple post-hoc methods performing well.

DetailsMotivation: Global biodiversity decline with 90% of species unknown, current ML models can't detect unseen species (open-set recognition problem), limiting applicability for diverse taxa like insects.

Method: Created Open-Insect dataset for fine-grained unknown species detection across geographic regions, benchmarked 38 OSR algorithms in three categories: post-hoc, training-time regularization, and training with auxiliary data.

Result: Simple post-hoc approaches remain strong baselines, auxiliary data can improve species discovery in data-limited regions.

Conclusion: Provides insights for developing computer vision methods for biodiversity monitoring and species discovery, addressing the open-set recognition challenge in highly diverse taxa.

Abstract: Global biodiversity is declining at an unprecedented rate, yet little information is known about most species and how their populations are changing. Indeed, some 90% of Earth’s species are estimated to be completely unknown. Machine learning has recently emerged as a promising tool to facilitate long-term, large-scale biodiversity monitoring, including algorithms for fine-grained classification of species from images. However, such algorithms typically are not designed to detect examples from categories unseen during training – the problem of open-set recognition (OSR) – limiting their applicability for highly diverse, poorly studied taxa such as insects. To address this gap, we introduce Open-Insect, a large-scale, fine-grained dataset to evaluate unknown species detection across different geographic regions with varying difficulty. We benchmark 38 OSR algorithms across three categories: post-hoc, training-time regularization, and training with auxiliary data, finding that simple post-hoc approaches remain a strong baseline. We also demonstrate how to leverage auxiliary data to improve species discovery in regions with limited data. Our results provide insights to guide the development of computer vision methods for biodiversity monitoring and species discovery.

[493] BAT: Learning Event-based Optical Flow with Bidirectional Adaptive Temporal Correlation

Gangwei Xu, Haotong Lin, Zhaoxing Zhang, Hongcheng Luo, Haiyang Sun, Xin Yang

Main category: cs.CV

TL;DR: BAT is a novel framework for event-based optical flow estimation using bidirectional adaptive temporal correlation, achieving state-of-the-art performance on DSEC-Flow benchmark with accurate future flow prediction.

DetailsMotivation: Event cameras offer high dynamic range and temporal resolution advantages for optical flow estimation, but current methods suffer from spatial sparsity limitations of event data.

Method: Three novel designs: 1) bidirectional temporal correlation transforming dense motion cues, 2) adaptive temporal sampling for consistency, 3) spatially adaptive temporal motion aggregation to efficiently aggregate consistent motion features.

Result: Ranked 1st on DSEC-Flow benchmark, outperforming state-of-the-art methods by large margin with sharp edges and high-quality details. Accurately predicts future optical flow using only past events.

Conclusion: BAT framework effectively addresses spatial sparsity in event data and enables superior optical flow estimation with future prediction capabilities, significantly advancing event-based vision methods.

Abstract: Event cameras deliver visual information characterized by a high dynamic range and high temporal resolution, offering significant advantages in estimating optical flow for complex lighting conditions and fast-moving objects. Current advanced optical flow methods for event cameras largely adopt established image-based frameworks. However, the spatial sparsity of event data limits their performance. In this paper, we present BAT, an innovative framework that estimates event-based optical flow using bidirectional adaptive temporal correlation. BAT includes three novel designs: 1) a bidirectional temporal correlation that transforms bidirectional temporally dense motion cues into spatially dense ones, enabling accurate and spatially dense optical flow estimation; 2) an adaptive temporal sampling strategy for maintaining temporal consistency in correlation; 3) spatially adaptive temporal motion aggregation to efficiently and adaptively aggregate consistent target motion features into adjacent motion features while suppressing inconsistent ones. Our results rank $1^{st}$ on the DSEC-Flow benchmark, outperforming existing state-of-the-art methods by a large margin while also exhibiting sharp edges and high-quality details. Notably, our BAT can accurately predict future optical flow using only past events, significantly outperforming E-RAFT’s warm-start approach. Code: \textcolor{magenta}{https://github.com/gangweiX/BAT}.

[494] S4M: 4-points to Segment Anything

Adrien Meyer, Lorenzo Arboit, Giuseppe Massimiani, Shih-Min Yin, Didier Mutter, Nicolas Padoy

Main category: cs.CV

TL;DR: S4M enhances SAM for medical segmentation by using structured 4-point prompts with role-specific embeddings and a Canvas pretext task, improving performance and reducing annotation effort.

DetailsMotivation: SAM's point prompts are ambiguous for medical segmentation due to overlapping anatomy and blurred boundaries, requiring manual refinement cycles. Better prompting strategies are needed to reduce annotation effort.

Method: Proposed S4M with structured 4-point prompts (extreme points or major/minor axis endpoints), role-specific embeddings, and Canvas pretext task for geometry-aware reasoning.

Result: +3.42 mIoU improvement over SAM baseline across 8 ultrasound/surgical endoscopy datasets; major/minor prompts enable faster clinician annotation.

Conclusion: S4M increases performance, reduces annotation effort, and aligns with clinical practice, enabling more scalable medical imaging dataset development.

Abstract: Purpose: The Segment Anything Model (SAM) promises to ease the annotation bottleneck in medical segmentation, but overlapping anatomy and blurred boundaries make its point prompts ambiguous, leading to cycles of manual refinement to achieve precise masks. Better prompting strategies are needed. Methods: We propose a structured prompting strategy using 4 points as a compact instance-level shape description. We study two 4-point variants: extreme points and the proposed major/minor axis endpoints, inspired by ultrasound measurement practice. SAM cannot fully exploit such structured prompts because it treats all points identically and lacks geometry-aware reasoning. To address this, we introduce S4M (4-points to Segment Anything), which augments SAM to interpret 4 points as relational cues rather than isolated clicks. S4M expands the prompt space with role-specific embeddings and adds an auxiliary “Canvas” pretext task that sketches coarse masks directly from prompts, fostering geometry-aware reasoning. Results: Across eight datasets in ultrasound and surgical endoscopy, S4M improves segmentation by +3.42 mIoU over a strong SAM baseline at equal prompt budget. An annotation study with three clinicians further shows that major/minor prompts enable faster annotation. Conclusion: S4M increases performance, reduces annotation effort, and aligns prompting with clinical practice, enabling more scalable dataset development in medical imaging.

[495] SAQ-SAM: Semantically-Aligned Quantization for Segment Anything Model

Jing Zhang, Zhikai Li, Chengzhi Hu, Xuewen Liu, Qingyi Gu

Main category: cs.CV

TL;DR: SAQ-SAM is a post-training quantization method for Segment Anything Model that addresses computational cost challenges through semantic-aware clipping and prompt-aware reconstruction, achieving significant performance improvements in 4-bit quantization.

DetailsMotivation: SAM has prohibitive computational costs making edge deployment challenging, and existing PTQ methods yield unsatisfactory results due to SAM's specialized components and promptable workflow, particularly the mask decoder's extreme activation outliers and neglected semantic interactivity.

Method: Proposes Perceptual-Consistency Clipping using attention focus overlap for aggressive outlier suppression, and Prompt-Aware Reconstruction incorporating image-prompt interactions via cross-attention in mask decoder. Also includes layer-skipping strategy for efficient image token processing in encoder.

Result: Extensive experiments show consistent advantages across various SAM sizes and tasks. When quantizing SAM-B to 4-bit, achieves 11.7% higher mAP than baseline in instance segmentation task.

Conclusion: SAQ-SAM effectively boosts PTQ for SAM through semantic alignment perspective, addressing both outlier suppression and semantic preservation, making SAM more suitable for edge deployment while maintaining performance.

Abstract: Segment Anything Model (SAM) exhibits remarkable zero-shot segmentation capability; however, its prohibitive computational costs make edge deployment challenging. Although post-training quantization (PTQ) offers a promising compression solution, existing methods yield unsatisfactory results when applied to SAM, owing to its specialized model components and promptable workflow: (i) The mask decoder’s attention exhibits extreme activation outliers, and we find that aggressive clipping (even 100x), without smoothing or isolation, is effective in suppressing outliers while maintaining performance. Unfortunately, traditional distribution-based metrics (e.g., MSE) fail to provide such large-scale clipping. (ii) Existing quantization reconstruction methods neglect semantic interactivity of SAM, leading to misalignment between image feature and prompt intention. To address the above issues, we propose SAQ-SAM in this paper, which boosts PTQ for SAM from the perspective of semantic alignment. Specifically, we propose Perceptual-Consistency Clipping, which exploits attention focus overlap to promote aggressive clipping while preserving semantic capabilities. Furthermore, we propose Prompt-Aware Reconstruction, which incorporates image-prompt interactions by leveraging cross-attention in mask decoder, thus facilitating alignment in both distribution and semantic. Moreover, to ensure the interaction efficiency, we design a layer-skipping strategy for image tokens in encoder. Extensive experiments are conducted on various SAM sizes and tasks, including instance segmentation, oriented object detection, and semantic segmentation, and the results show that our method consistently exhibits advantages. For example, when quantizing SAM-B to 4-bit, SAQ-SAM achieves 11.7% higher mAP than the baseline in instance segmentation task.

[496] TEMPLE: Incentivizing Temporal Understanding of Video Large Language Models via Progressive Pre-SFT Alignment

Shicheng Li, Lei Li, Kun Ouyang, Shuhuai Ren, Yuanxin Liu, Yuanxing Zhang, Fuzheng Zhang, Lingpeng Kong, Qi Liu, Xu Sun

Main category: cs.CV

TL;DR: TEMPLE enhances Video LLM temporal reasoning through Direct Preference Optimization using automated construction of temporality-intensive preference pairs and Progressive Pre-SFT Alignment.

DetailsMotivation: Existing Video LLMs struggle with temporal reasoning due to weak temporal correspondence in data and over-reliance on next-token prediction, lacking proper temporal supervision.

Method: Proposes TEMPLE framework with automated pipeline for constructing temporality-intensive preference pairs (selecting rich videos, designing perturbations, evaluating responses) and Progressive Pre-SFT Alignment with curriculum learning and preference optimization before instruction tuning.

Result: Consistently improves Video LLM performance across multiple benchmarks using relatively small self-generated DPO data.

Conclusion: TEMPLE serves as a scalable and efficient complement to SFT-based methods, enabling development of more reliable Video LLMs with enhanced temporal reasoning capabilities.

Abstract: Video Large Language Models (Video LLMs) have achieved significant success by adopting the paradigm of large-scale pre-training followed by supervised fine-tuning (SFT). However, existing approaches struggle with temporal reasoning due to weak temporal correspondence in the data and over-reliance on the next-token prediction paradigm}, which collectively result in the absence temporal supervision. To address these limitations, we propose TEMPLE (TEMporal Preference LEarning), a systematic framework that enhances temporal reasoning capabilities through Direct Preference Optimization (DPO). To address temporal information scarcity in data, we introduce an automated pipeline for systematically constructing temporality-intensive preference pairs comprising three steps: selecting temporally rich videos, designing video-specific perturbation strategies, and evaluating model responses on clean and perturbed inputs. Complementing this data pipeline, we provide additional supervision signals via preference learning and propose a novel Progressive Pre-SFT Alignment strategy featuring two key innovations: a curriculum learning strategy which progressively increases perturbation difficulty to maximize data efficiency; and applying preference optimization before instruction tuning to incentivize fundamental temporal alignment. Extensive experiments demonstrate that our approach consistently improves Video LLM performance across multiple benchmarks with a relatively small set of self-generated DPO data. Our findings highlight TEMPLE as a scalable and efficient complement to SFT-based methods, paving the way for developing reliable Video LLMs.

[497] SimROD: A Simple Baseline for Raw Object Detection with Global and Local Enhancements

Haiyang Xie, Xi Shen, Shihua Huang, Qirui Wang, Zheng Wang

Main category: cs.CV

TL;DR: SimROD is a lightweight approach for RAW object detection that uses Global Gamma Enhancement and green channel enhancement to improve accuracy while maintaining efficiency, outperforming state-of-the-art methods.

DetailsMotivation: RAW data preserves sensor information before ISP processing, offering advantages for object detection including improved accuracy and more efficient hardware designs by bypassing the ISP, but faces challenges like limited training data, unbalanced pixel distributions, and sensor noise.

Method: Proposes SimROD with Global Gamma Enhancement (GGE) module using learnable global gamma transformation with only four parameters, and leverages the green channel’s richer signal to enhance local details based on human eye sensitivity and Bayer filter design.

Result: Extensive experiments on multiple RAW object detection datasets show SimROD outperforms state-of-the-art methods like RAW-Adapter and DIAP while maintaining efficiency.

Conclusion: The work highlights the potential of RAW data for real-world object detection, providing an effective and efficient solution for RAW-based detection tasks.

Abstract: Most visual models are designed for sRGB images, yet RAW data offers significant advantages for object detection by preserving sensor information before ISP processing. This enables improved detection accuracy and more efficient hardware designs by bypassing the ISP. However, RAW object detection is challenging due to limited training data, unbalanced pixel distributions, and sensor noise. To address this, we propose SimROD, a lightweight and effective approach for RAW object detection. We introduce a Global Gamma Enhancement (GGE) module, which applies a learnable global gamma transformation with only four parameters, improving feature representation while keeping the model efficient. Additionally, we leverage the green channel’s richer signal to enhance local details, aligning with the human eye’s sensitivity and Bayer filter design. Extensive experiments on multiple RAW object detection datasets and detectors demonstrate that SimROD outperforms state-of-the-art methods like RAW-Adapter and DIAP while maintaining efficiency. Our work highlights the potential of RAW data for real-world object detection. Code is available at https://ocean146.github.io/SimROD2025/.

[498] GaussianFocus: Constrained Attention Focus for 3D Gaussian Splatting

Zexu Huang, Min Xu, Stuart Perry

Main category: cs.CV

TL;DR: GaussianFocus improves 3D Gaussian Splatting by reducing redundant Gaussians and enabling large-scale scene reconstruction through patch attention, Gaussian constraints, and subdivision strategies.

DetailsMotivation: Address limitations of 3D Gaussian Splatting including excessive redundant noisy Gaussians, poor scaling to large scenes due to memory constraints, long optimization times, and variable appearance across views.

Method: Uses patch attention algorithm for rendering quality refinement, Gaussian constraints strategy to minimize redundancy, and subdivision reconstruction strategy that divides large scenes into manageable blocks for individual training.

Result: Significantly reduces unnecessary Gaussians and enhances rendering quality, surpassing State-of-The-Art methods. Successfully manages and renders large urban scenes while maintaining high visual fidelity.

Conclusion: GaussianFocus effectively addresses key limitations of 3D Gaussian Splatting, enabling high-quality reconstruction and rendering of both small and large-scale scenes with improved efficiency and reduced redundancy.

Abstract: Recent developments in 3D reconstruction and neural rendering have significantly propelled the capabilities of photo-realistic 3D scene rendering across various academic and industrial fields. The 3D Gaussian Splatting technique, alongside its derivatives, integrates the advantages of primitive-based and volumetric representations to deliver top-tier rendering quality and efficiency. Despite these advancements, the method tends to generate excessive redundant noisy Gaussians overfitted to every training view, which degrades the rendering quality. Additionally, while 3D Gaussian Splatting excels in small-scale and object-centric scenes, its application to larger scenes is hindered by constraints such as limited video memory, excessive optimization duration, and variable appearance across views. To address these challenges, we introduce GaussianFocus, an innovative approach that incorporates a patch attention algorithm to refine rendering quality and implements a Gaussian constraints strategy to minimize redundancy. Moreover, we propose a subdivision reconstruction strategy for large-scale scenes, dividing them into smaller, manageable blocks for individual training. Our results indicate that GaussianFocus significantly reduces unnecessary Gaussians and enhances rendering quality, surpassing existing State-of-The-Art (SoTA) methods. Furthermore, we demonstrate the capability of our approach to effectively manage and render large scenes, such as urban environments, whilst maintaining high fidelity in the visual output.

[499] Regression-based Pelvic Pose Initialization for Fast and Robust 2D/3D Pelvis Registration

Yehyun Suh, J. Ryan Martin, Daniel Moyer

Main category: cs.CV

TL;DR: Learned initialization function improves 2D/3D pelvis registration by providing better starting points for optimization-based pose estimators, enhancing accuracy and efficiency.

DetailsMotivation: Current optimization-based pose estimators often fail to converge to optimal solutions when initialized naively, limiting their reliability for clinical applications.

Method: Use a learned initialization function to provide coarse but effective starting points for optimization-based pose estimators in 2D/3D pelvis registration.

Result: Experimental validation shows the method consistently achieves robust and accurate registration, even in challenging cases with extreme pose variation, while improving computational efficiency.

Conclusion: A learned initialization function significantly enhances the reliability and performance of 2D/3D pelvis registration for clinical applications.

Abstract: This paper presents an approach for improving 2D/3D pelvis registration in optimization-based pose estimators using a learned initialization function. Current methods often fail to converge to the optimal solution when initialized naively. We find that even a coarse initializer greatly improves pose estimator accuracy, and improves overall computational efficiency. This approach proves to be effective also in challenging cases under more extreme pose variation. Experimental validation demonstrates that our method consistently achieves robust and accurate registration, enhancing the reliability of 2D/3D registration for clinical applications.

[500] CamSAM2: Segment Anything Accurately in Camouflaged Videos

Yuli Zhou, Yawei Li, Yuqian Fu, Luca Benini, Ender Konukoglu, Guolei Sun

Main category: cs.CV

TL;DR: CamSAM2 enhances SAM2’s ability to segment camouflaged objects in videos by adding decamouflaged tokens and object-aware fusion modules without modifying SAM2’s parameters, achieving significant performance gains on VCOS datasets.

DetailsMotivation: SAM2 performs suboptimally on camouflaged video object segmentation (VCOS) with simple prompts like points and boxes, requiring enhancement for better handling of camouflaged scenes.

Method: Introduces decamouflaged tokens for feature adjustment, implicit/explicit object-aware fusion modules for utilizing fine-grained features, and object prototype generation to memorize object details from previous frames.

Result: Achieves substantial improvements over SAM2: 12.2 mDice gains with click prompt on MoCA-Mask and 19.6 mDice gains with mask prompt on SUN-SEG-Hard using Hiera-T backbone, while adding negligible parameters.

Conclusion: CamSAM2 effectively enhances SAM2’s VCOS capability through lightweight architectural additions, demonstrating significant performance improvements on camouflaged object segmentation tasks.

Abstract: Video camouflaged object segmentation (VCOS), aiming at segmenting camouflaged objects that seamlessly blend into their environment, is a fundamental vision task with various real-world applications. With the release of SAM2, video segmentation has witnessed significant progress. However, SAM2’s capability of segmenting camouflaged videos is suboptimal, especially when given simple prompts such as point and box. To address the problem, we propose Camouflaged SAM2 (CamSAM2), which enhances SAM2’s ability to handle camouflaged scenes without modifying SAM2’s parameters. Specifically, we introduce a decamouflaged token to provide the flexibility of feature adjustment for VCOS. To make full use of fine-grained and high-resolution features from the current frame and previous frames, we propose implicit object-aware fusion (IOF) and explicit object-aware fusion (EOF) modules, respectively. Object prototype generation (OPG) is introduced to abstract and memorize object prototypes with informative details using high-quality features from previous frames. Extensive experiments are conducted to validate the effectiveness of our approach. While CamSAM2 only adds negligible learnable parameters to SAM2, it substantially outperforms SAM2 on three VCOS datasets, especially achieving 12.2 mDice gains with click prompt on MoCA-Mask and 19.6 mDice gains with mask prompt on SUN-SEG-Hard, with Hiera-T as the backbone. The code is available at https://github.com/zhoustan/CamSAM2.

[501] Revealing the Implicit Noise-based Imprint of Generative Models

Xinghan Li, Yue Yu, Xue Song, Haijun Shan, Jingjing Chen

Main category: cs.CV

TL;DR: NIRNet is a novel AI-generated image detection framework that uses noise-based imprint patterns to improve generalization across different generative models, achieving state-of-the-art performance on multiple benchmarks.

DetailsMotivation: Existing AI-generated image detection methods lack generalization capabilities and perform poorly on emerging generative models, creating security risks from synthetic visual content.

Method: Proposes NIRNet with a Noise-based Imprint Simulator to capture intrinsic patterns from different models, and a pipeline using noise patterns alongside visual features for detection.

Result: Achieves state-of-the-art performance across seven diverse benchmarks, including five public datasets and two new generalization tests.

Conclusion: The noise-based imprint approach significantly enhances generalization and robustness in AI-generated image detection, demonstrating superior effectiveness compared to existing methods.

Abstract: With the rapid advancement of vision generation models, the potential security risks stemming from synthetic visual content have garnered increasing attention, posing significant challenges for AI-generated image detection. Existing methods suffer from inadequate generalization capabilities, resulting in unsatisfactory performance on emerging generative models. To address this issue, this paper presents NIRNet (Noise-based Imprint Revealing Network), a novel framework that leverages noise-based imprint for the detection task. Specifically, we propose a novel Noise-based Imprint Simulator to capture intrinsic patterns imprinted in images generated by different models. By aggregating imprint from various generative models, imprint of future models can be extrapolated to expand training data, thereby enhancing generalization and robustness. Furthermore, we design a new pipeline that pioneers the use of noise patterns, derived from a Noise-based Imprint Extractor, alongside other visual features for AI-generated image detection, significantly improving detection performance. Our approach achieves state-of-the-art performance across seven diverse benchmarks, including five public datasets and two newly proposed generalization tests, demonstrating its superior generalization and effectiveness. Paper Submission: pdf

[502] KernelDNA: Dynamic Kernel Sharing via Decoupled Naive Adapters

Haiduo Huang, Yadong Zhang, Yinghui Xu, Pengju Ren

Main category: cs.CV

TL;DR: KernelDNA is a lightweight convolution kernel plug-in that enables dynamic kernel adaptation through cross-layer weight sharing and adapter-based modulation, achieving superior accuracy-efficiency balance without altering standard convolution structure.

DetailsMotivation: To address the limitations of existing dynamic convolutions that either incur significant parameter overhead, compromise inference speed, or struggle to jointly optimize dynamic attention and static kernels, while leveraging the observation that pre-trained CNNs exhibit inter-layer redundancy similar to LLMs.

Method: Decouples kernel adaptation into input-dependent dynamic routing and pre-trained static modulation using cross-layer weight sharing and adapter-based modulation, where dense convolutional layers are replaced by derived ‘child’ layers generated from shared ‘parent’ kernels.

Result: Achieves state-of-the-art accuracy-efficiency balance on image classification and dense prediction tasks, preserving native computational efficiency of standard convolutions while enhancing representation power through input-adaptive kernel adjustments.

Conclusion: KernelDNA provides an effective solution for dynamic convolution that maintains parameter efficiency and hardware-friendly inference while enabling dynamic kernel specialization through innovative weight-sharing mechanisms.

Abstract: Dynamic convolution enhances model capacity by adaptively combining multiple kernels, yet faces critical trade-offs: prior works either (1) incur significant parameter overhead by scaling kernel numbers linearly, (2) compromise inference speed through complex kernel interactions, or (3) struggle to jointly optimize dynamic attention and static kernels. We observe that pre-trained Convolutional Neural Networks (CNNs) exhibit inter-layer redundancy akin to that in Large Language Models (LLMs). Specifically, dense convolutional layers can be efficiently replaced by derived “child” layers generated from a shared “parent” convolutional kernel through an adapter. To address these limitations and implement the weight-sharing mechanism, we propose a lightweight convolution kernel plug-in, named KernelDNA. It decouples kernel adaptation into input-dependent dynamic routing and pre-trained static modulation, ensuring both parameter efficiency and hardware-friendly inference. Unlike existing dynamic convolutions that expand parameters via multi-kernel ensembles, our method leverages cross-layer weight sharing and adapter-based modulation, enabling dynamic kernel specialization without altering the standard convolution structure. This design preserves the native computational efficiency of standard convolutions while enhancing representation power through input-adaptive kernel adjustments. Experiments on image classification and dense prediction tasks demonstrate that KernelDNA achieves a state-of-the-art accuracy-efficiency balance among dynamic convolution variants.

[503] QDM: Quadtree-Based Region-Adaptive Sparse Diffusion Models for Efficient Image Super-Resolution

Donglin Yang, Paul Vicol, Xiaojuan Qi, Renjie Liao, Xiaofan Zhang

Main category: cs.CV

TL;DR: QDM is a region-adaptive diffusion model for super-resolution that uses quadtree structure to selectively enhance detail-rich regions while reducing computations in homogeneous areas, achieving high-fidelity outputs with lower computational costs.

DetailsMotivation: Traditional deep learning SR methods perform uniform pixel-wise computations across entire images, including homogeneous regions where high-resolution refinement is redundant and computationally wasteful.

Method: Proposes Quadtree Diffusion Model (QDM) - a mask-guided, two-stream architecture that uses quadtree structure derived from low-quality input to identify key regions needing refinement and applies minimal computation elsewhere.

Result: QDM outperforms or is comparable to state-of-the-art SR methods on standard benchmarks while significantly reducing computational costs, particularly effective in medical imaging with large homogeneous regions.

Conclusion: QDM provides an efficient and adaptive approach to super-resolution that balances quality and computational efficiency, making it suitable for resource-limited environments.

Abstract: Deep learning-based super-resolution (SR) methods often perform pixel-wise computations uniformly across entire images, even in homogeneous regions where high-resolution refinement is redundant. We propose the Quadtree Diffusion Model (QDM), a region-adaptive diffusion framework that leverages a quadtree structure to selectively enhance detail-rich regions while reducing computations in homogeneous areas. By guiding the diffusion with a quadtree derived from the low-quality input, QDM identifies key regions-represented by leaf nodes-where fine detail is essential and applies minimal refinement elsewhere. This mask-guided, two-stream architecture adaptively balances quality and efficiency, producing high-fidelity outputs with low computational redundancy. Experiments demonstrate QDM’s effectiveness in high-resolution SR tasks across diverse image types, particularly in medical imaging (e.g., CT scans), where large homogeneous regions are prevalent. Furthermore, QDM outperforms or is comparable to state-of-the-art SR methods on standard benchmarks while significantly reducing computational costs, highlighting its efficiency and suitability for resource-limited environments. Our code is available at https://github.com/linYDTHU/QDM.

[504] SFMNet: Sparse Focal Modulation for 3D Object Detection

Oren Shrout, Ayellet Tal

Main category: cs.CV

TL;DR: SFMNet is a 3D sparse detector that combines sparse convolution efficiency with long-range dependency modeling using a novel Sparse Focal Modulation module, achieving state-of-the-art performance on autonomous driving datasets.

DetailsMotivation: Traditional sparse convolutions efficiently capture local structures but struggle with long-range dependencies, while transformers can model long-range relationships but have high computational costs due to quadratic complexity.

Method: Built on a novel Sparse Focal Modulation (SFM) module that integrates short- and long-range contexts with linear complexity using hierarchical sparse convolution design.

Result: Achieves state-of-the-art performance on autonomous driving datasets with improved efficiency, making it suitable for large-scale LiDAR scenes.

Conclusion: SFMNet successfully combines the efficiency of sparse convolutions with the ability to model long-range dependencies, providing an effective solution for 3D object detection in large-scale scenes.

Abstract: We propose SFMNet, a novel 3D sparse detector that combines the efficiency of sparse convolutions with the ability to model long-range dependencies. While traditional sparse convolution techniques efficiently capture local structures, they struggle with modeling long-range relationships. However, capturing long-range dependencies is fundamental for 3D object detection. In contrast, transformers are designed to capture these long-range dependencies through attention mechanisms. But, they come with high computational costs, due to their quadratic query-key-value interactions. Furthermore, directly applying attention to non-empty voxels is inefficient due to the sparse nature of 3D scenes. Our SFMNet is built on a novel Sparse Focal Modulation (SFM) module, which integrates short- and long-range contexts with linear complexity by leveraging a new hierarchical sparse convolution design. This approach enables SFMNet to achieve high detection performance with improved efficiency, making it well-suited for large-scale LiDAR scenes. We show that our detector achieves state-of-the-art performance on autonomous driving datasets.

[505] Is clustering enough for LiDAR instance segmentation? A state-of-the-art training-free baseline

Corentin Sautier, Gilles Puy, Alexandre Boulch, Renaud Marlet, Vincent Lepetit

Main category: cs.CV

TL;DR: Alpine achieves competitive panoptic segmentation using only semantic labels without instance annotations, outperforming supervised methods on standard benchmarks while running in real-time on CPU.

DetailsMotivation: The high cost and time required for manual instance labeling of large-scale point cloud datasets is a major bottleneck in LiDAR panoptic segmentation for autonomous driving applications.

Method: Uses only semantic labels to predict instances without training or annotations, running in real-time on single-threaded CPU with no learning or parameter tuning required.

Result: Outperforms most state-of-the-art supervised methods on SemanticKITTI and nuScenes, ranks first on SemanticKITTI panoptic leaderboard when combined with state-of-the-art semantic segmentation.

Conclusion: Competitive panoptic segmentation can be achieved without instance annotations, offering a fully explainable, real-time solution that eliminates the need for costly manual labeling.

Abstract: Panoptic segmentation of LiDAR point clouds is fundamental to outdoor scene understanding, with autonomous driving being a primary application. While state-of-the-art approaches typically rely on end-to-end deep learning architectures and extensive manual annotations of instances, the significant cost and time investment required for labeling large-scale point cloud datasets remains a major bottleneck in this field. In this work, we demonstrate that competitive panoptic segmentation can be achieved using only semantic labels, with instances predicted without any training or annotations. Our method outperforms {most} state-of-the-art supervised methods on standard benchmarks including SemanticKITTI and nuScenes, and outperforms every publicly available method on SemanticKITTI as a drop-in instance head replacement, while running in real-time on a single-threaded CPU and requiring no instance labels. It is fully explainable, and requires no learning or parameter tuning. Alpine combined with state-of-the-art semantic segmentation ranks first on the official panoptic segmentation leaderboard of SemanticKITTI. Code is available at https://github.com/valeoai/Alpine/

[506] Dereflection Any Image with Diffusion Priors and Diversified Data

Jichen Hu, Chen Yang, Zanwei Zhou, Jiemin Fang, Xiaokang Yang, Qi Tian, Wei Shen

Main category: cs.CV

TL;DR: Proposes Dereflection Any Image with DRR dataset and diffusion-based framework for robust single-image reflection removal, achieving SOTA performance across diverse real-world scenarios.

DetailsMotivation: Existing methods struggle with limited generalization due to scarce high-quality data and insufficient restoration priors for complex reflection removal tasks.

Method: Creates Diverse Reflection Removal (DRR) dataset via random rotation of reflective mediums, and develops diffusion-based framework with one-step diffusion and three-stage progressive training including reflection-invariant finetuning.

Result: Achieves state-of-the-art performance on common benchmarks and challenging in-the-wild images, demonstrating superior generalization across diverse real-world scenes.

Conclusion: The proposed comprehensive solution with novel dataset and training strategy enables robust reflection removal with excellent generalization capabilities.

Abstract: Reflection removal of a single image remains a highly challenging task due to the complex entanglement between target scenes and unwanted reflections. Despite significant progress, existing methods are hindered by the scarcity of high-quality, diverse data and insufficient restoration priors, resulting in limited generalization across various real-world scenarios. In this paper, we propose Dereflection Any Image, a comprehensive solution with an efficient data preparation pipeline and a generalizable model for robust reflection removal. First, we introduce a dataset named Diverse Reflection Removal (DRR) created by randomly rotating reflective mediums in target scenes, enabling variation of reflection angles and intensities, and setting a new benchmark in scale, quality, and diversity. Second, we propose a diffusion-based framework with one-step diffusion for deterministic outputs and fast inference. To ensure stable learning, we design a three-stage progressive training strategy, including reflection-invariant finetuning to encourage consistent outputs across varying reflection patterns that characterize our dataset. Extensive experiments show that our method achieves SOTA performance on both common benchmarks and challenging in-the-wild images, showing superior generalization across diverse real-world scenes.

[507] Emergence of Fixational and Saccadic Movements in a Multi-Level Recurrent Attention Model for Vision

Pengcheng Pan, Yonekura Shogo, Yasuo Kuniyoshi

Main category: cs.CV

TL;DR: MRAM is a novel hard attention model that models human visual hierarchy, achieving more human-like attention dynamics and better performance than existing models.

DetailsMotivation: Existing hard attention models like RAM and DRAM fail to model the hierarchy of human vision system, resulting in attention patterns that are either overly fixational or excessively saccadic, diverging from human eye movement behavior.

Method: Proposed Multi-Level Recurrent Attention Model (MRAM) that explicitly models neural hierarchy by decoupling glimpse location generation and task execution in two recurrent layers.

Result: MRAM achieves more human-like attention dynamics and consistently outperforms CNN, RAM and DRAM baselines on standard image classification benchmarks.

Conclusion: The hierarchical modeling approach in MRAM successfully produces balanced fixation-saccadic behavior while improving performance over existing attention models.

Abstract: Inspired by foveal vision, hard attention models promise interpretability and parameter economy. However, existing models like the Recurrent Model of Visual Attention (RAM) and Deep Recurrent Attention Model (DRAM) failed to model the hierarchy of human vision system, that compromise on the visual exploration dynamics. As a result, they tend to produce attention that are either overly fixational or excessively saccadic, diverging from human eye movement behavior. In this paper, we propose a Multi-Level Recurrent Attention Model (MRAM), a novel hard attention framework that explicitly models the neural hierarchy of human visual processing. By decoupling the function of glimpse location generation and task execution in two recurrent layers, MRAM emergent a balanced behavior between fixation and saccadic movement. Our results show that MRAM not only achieves more human-like attention dynamics, but also consistently outperforms CNN, RAM and DRAM baselines on standard image classification benchmarks.

[508] Improved tissue sodium concentration quantification in breast cancer by reducing partial volume effects: a preliminary study

Olgica Zaric, Carmen Leser, Vladimir Juras, Alex Farr, Pavol Szomolanyi, Malina Gologan, Stanislas Rapacchi, Laura Villazan Garcia, Haider Ali, Christian Singer, Siegfried Trattnig, Christian Licht, Ramona Woitek

Main category: cs.CV

TL;DR: Compressed sensing MRI methods can reduce partial volume effects in sodium MRI, improving tumor delineation and sodium concentration quantification in breast cancer patients.

DetailsMotivation: Partial volume effects cause errors in tissue sodium concentration quantification in sodium MRI, and compressed sensing reconstruction methods may help reduce these effects.

Method: Examined 15 participants (3 healthy, 12 breast cancer patients) using 7T MRI with sodium imaging reconstructed using weighted TV, directional TV, anatomically guided TV, and adaptive combine methods.

Result: All methods preserved sodium signal and tissue structures. Anatomically guided TV had highest Dice score (75%). TSC values varied by method (61-88 mmol/L). Strong correlations found between different reconstruction methods.

Conclusion: Tumor appearance and TSC estimates depend on reconstruction method type and parameters, likely due to differences in reducing partial volume effects.

Abstract: Introduction: In sodium (23Na) magnetic resonance imaging (MRI), partial volume effects (PVE) are one of the most common causes of errors in the in vivo quantification of tissue sodium concentration (TSC). Advanced image reconstruction algorithms, such as compressed sensing (CS), have the potential to reduce PVE. Therefore, we investigated the feasibility of using CS-based methods to improve image quality and TSC quantification accuracy in patients with breast cancer. Subjects and methods: In this study, three healthy participants and 12 female participants with breast cancer were examined on a 7T MRI scanner. 23Na-MRI images were reconstructed using weighted total variation (wTV), directional total variation (dTV), anatomically guided total variation (AG-TV) and adaptive combine (ADC) methods. The consistency of tumor volume delineations based on sodium data was assessed using the Dice score, and TSC quantification was performed for various image reconstruction methods. Pearsons correlation coefficients were calculated to assess the relationships between wTV, dTV, AG-TV, and ADC values. Results: All methods provided breast MRI images with well-preserved sodium signal and tissue structures. The mean Dice scores for wTV, dTV, and AG-TV were 65%, 72%, and 75%, respectively. Average TSC values in breast tumors were 61.0, 72.0, 73.0, and 88.0 mmol/L for wTV, dTV, AG-TV, and ADC, respectively. A strong negative correlation was observed between wTV and dTV (r = -0.78, 95% CI [-0.94, -0.31], p = 0.0076) and a strong positive correlation between dTV and AG-TV (r = 0.71, 95% CI [0.16, 0.92], p = 0.0207) was found. Conclusion: The results of this study showed that differences in tumor appearance and TSC estimations may depend on the type of image reconstruction and the parameters used. This is most likely due to differences in their ability to reduce PVE.

[509] FALCONEye: Finding Answers and Localizing Content in ONE-hour-long videos with multi-modal LLMs

Carlos Plou, Cesar Borja, Ruben Martinez-Cantin, Ana C. Murillo

Main category: cs.CV

TL;DR: FALCONEye is a training-free video agent that uses a VLM and LLM to answer open-ended questions in hour-long videos through exploration-based search with confidence calibration, outperforming comparable models while being cost-effective.

DetailsMotivation: Current Vision Language Models struggle with hour-long videos due to context window limitations, making information retrieval challenging.

Method: Uses a model-agnostic meta-architecture combining VLM and LLM with exploration-based search guided by calibrated confidence scores.

Result: Outperforms all open-source 7B VLMs and comparable agents on FALCON-Bench, and surpasses GPT-4o on single-detail tasks in MLVU while reducing inference cost by ~10x.

Conclusion: FALCONEye effectively addresses long-video QA challenges with a cost-efficient, training-free approach that generalizes well across different benchmarks.

Abstract: Finding information in hour-long videos is a challenging task even for top-performing Vision Language Models (VLMs), as encoding visual content quickly exceeds available context windows. To tackle this challenge, we present FALCONEye, a novel video agent based on a training-free, model-agnostic meta-architecture composed of a VLM and a Large Language Model (LLM). FALCONEye answers open-ended questions using an exploration-based search algorithm guided by calibrated confidence from the VLM’s answers. We also introduce the FALCON-Bench benchmark, extending Question Answering problem to Video Answer Search-requiring models to return both the answer and its supporting temporal window for open-ended questions in hour-long videos. With just a 7B VLM and a lightweight LLM, FALCONEye outscores all open-source 7B VLMs and comparable agents in FALCON-Bench. It further demonstrates its generalization capability in MLVU benchmark with shorter videos and different tasks, surpassing GPT-4o on single-detail tasks while slashing inference cost by roughly an order of magnitude.

[510] vGamba: Attentive State Space Bottleneck for efficient Long-range Dependencies in Visual Recognition

Yunusa Haruna, Adamu Lawan

Main category: cs.CV

TL;DR: vGamba is a hybrid vision backbone that combines state-space models with attention mechanisms to efficiently capture long-range dependencies in visual recognition tasks, achieving better accuracy-efficiency trade-offs than existing models.

DetailsMotivation: Existing methods have limitations: CNNs struggle with restricted receptive fields, while Vision Transformers achieve global context at high computational cost. State-space models offer an alternative but are underexplored in vision tasks.

Method: vGamba integrates SSMs with attention mechanisms through a Gamba bottleneck block containing Gamba Cell (Mamba adaptation for 2D), Multi-Head Self-Attention, and a Gated Fusion Module for effective feature representation.

Result: Extensive experiments on classification, detection, and segmentation tasks show vGamba achieves superior trade-off between accuracy and computational efficiency, outperforming several existing models.

Conclusion: The hybrid approach of combining SSMs with attention mechanisms enables efficient long-range dependency modeling in vision tasks while maintaining accuracy, demonstrating vGamba’s effectiveness across multiple computer vision applications.

Abstract: Capturing long-range dependencies efficiently is essential for visual recognition tasks, yet existing methods face limitations. Convolutional neural networks (CNNs) struggle with restricted receptive fields, while Vision Transformers (ViTs) achieve global context and long-range modeling at a high computational cost. State-space models (SSMs) offer an alternative, but their application in vision remains underexplored. This work introduces vGamba, a hybrid vision backbone that integrates SSMs with attention mechanisms to enhance efficiency and expressiveness. At its core, the Gamba bottleneck block that includes, Gamba Cell, an adaptation of Mamba for 2D spatial structures, alongside a Multi-Head Self-Attention (MHSA) mechanism and a Gated Fusion Module for effective feature representation. The interplay of these components ensures that vGamba leverages the low computational demands of SSMs while maintaining the accuracy of attention mechanisms for modeling long-range dependencies in vision tasks. Additionally, the Fusion module enables seamless interaction between these components. Extensive experiments on classification, detection, and segmentation tasks demonstrate that vGamba achieves a superior trade-off between accuracy and computational efficiency, outperforming several existing models.

[511] CA^2ST: Cross-Attention in Audio, Space, and Time for Holistic Video Recognition

Jongseo Lee, Joohyun Chang, Dongho Lee, Jinwoo Choi

Main category: cs.CV

TL;DR: CA^2ST is a transformer-based method using cross-attention between spatial, temporal, and audio experts for balanced holistic video recognition.

DetailsMotivation: Most existing video recognition models lack balanced spatio-temporal understanding and don't effectively integrate audio information for holistic video understanding.

Method: Two-stream architecture with Cross-Attention in Space and Time (CAST) using RGB input, extended to Cross-Attention in Visual and Audio (CAVA) by adding audio expert. Uses Bottleneck Cross-Attention (B-CA) for information exchange between experts.

Result: Consistent balanced performance on EPIC-KITCHENS-100, Something-Something-V2, Kinetics-400; favorable performance on audio-visual benchmarks UCF-101, VGG-Sound, KineticsSound, EPIC-SOUNDS.

Conclusion: CA^2ST effectively combines spatial, temporal, and audio experts through cross-attention to achieve balanced and holistic video understanding.

Abstract: We propose Cross-Attention in Audio, Space, and Time (CA^2ST), a transformer-based method for holistic video recognition. Recognizing actions in videos requires both spatial and temporal understanding, yet most existing models lack a balanced spatio-temporal understanding of videos. To address this, we propose a novel two-stream architecture, called Cross-Attention in Space and Time (CAST), using only RGB input. In each layer of CAST, Bottleneck Cross-Attention (B-CA) enables spatial and temporal experts to exchange information and make synergistic predictions. For holistic video understanding, we extend CAST by integrating an audio expert, forming Cross-Attention in Visual and Audio (CAVA). We validate the CAST on benchmarks with different characteristics, EPIC-KITCHENS-100, Something-Something-V2, and Kinetics-400, consistently showing balanced performance. We also validate the CAVA on audio-visual action recognition benchmarks, including UCF-101, VGG-Sound, KineticsSound, and EPIC-SOUNDS. With a favorable performance of CAVA across these datasets, we demonstrate the effective information exchange among multiple experts within the B-CA module. In summary, CA^2ST combines CAST and CAVA by employing spatial, temporal, and audio experts through cross-attention, achieving balanced and holistic video understanding.

[512] PRISM-0: A Predicate-Rich Scene Graph Generation Framework for Zero-Shot Open-Vocabulary Tasks

Abdelrahman Elskhawy, Mengze Li, Nassir Navab, Benjamin Busam

Main category: cs.CV

TL;DR: PRISM-0 is a zero-shot open-vocabulary scene graph generation framework that uses foundation models to generate diverse predicates without training bias, achieving performance comparable to state-of-the-art supervised methods.

DetailsMotivation: Address training bias in supervised SGG caused by limited data and long-tail predicate distributions, which leads to poor predicate diversity and degraded downstream performance.

Method: Bottom-up pipeline using foundation models: object pair detection, VLM description, LLM predicate generation (fine- and coarse-grained), and VQA validation. Modular, dataset-independent design.

Result: Achieves performance on par with state-of-the-art weakly-supervised models on SGG benchmarks and matches supervised methods in tasks like Sentence-to-Graph Retrieval. Enriches existing datasets with diverse, unbiased graphs.

Conclusion: PRISM-0 demonstrates that zero-shot open-vocabulary SGG using foundation models can overcome training bias and generate diverse predicates, performing competitively without requiring training data.

Abstract: In Scene Graph Generation (SGG), structured representations are extracted from visual inputs as object nodes and connecting predicates, enabling image-based reasoning for diverse downstream tasks. While fully supervised SGG has improved steadily, it suffers from training bias due to limited curated data and long-tail predicate distributions, leading to poor predicate diversity and degraded downstream performance. We present PRISM-0, a zero-shot open-vocabulary SGG framework that leverages foundation models in a bottom-up pipeline to capture a broad spectrum of predicates. Detected object pairs are filtered, described via a Vision-Language Model (VLM), and processed by a Large Language Model (LLM) to generate fine- and coarse-grained predicates, which are then validated by a Visual Question Answering (VQA) model. PRISM-0 modular, dataset-independent design enriches existing SGG datasets such as Visual Genome and produces diverse, unbiased graphs. While operating entirely in a zero-shot setting, PRISM-0 achieves performance on par with state-of-the-art weakly-supervised models on SGG benchmarks and even state-of-the-art supervised methods in tasks such as Sentence-to-Graph Retrieval.

[513] 3D-Aware Vision-Language Models Fine-Tuning with Geometric Distillation

Seonho Lee, Jiho Choi, Inha Kang, Jiwook Kim, Junsung Park, Hyunjung Shim

Main category: cs.CV

TL;DR: Geometric Distillation is a lightweight framework that injects 3D geometric cues into pretrained Vision-Language Models without architectural changes, improving 3D spatial reasoning with low computational cost.

DetailsMotivation: VLMs have strong 2D understanding but fundamental limitations in 3D spatial structure comprehension, which restricts their application in spatially grounded multimodal tasks.

Method: Annotation-free fine-tuning that distills three types of geometric cues from 3D foundation models: sparse correspondences, relative depth relations, and dense cost volumes, while maintaining compatibility with natural image-text inputs.

Result: Consistently outperforms prior approaches on 3D vision-language reasoning and 3D perception benchmarks, achieving improved 3D spatial reasoning with significantly lower computational cost.

Conclusion: Demonstrates a scalable and efficient path to bridge 2D-trained VLMs with 3D understanding, enabling wider use in spatially grounded multimodal tasks.

Abstract: Vision-Language Models (VLMs) have shown remarkable performance on diverse visual and linguistic tasks, yet they remain fundamentally limited in their understanding of 3D spatial structures. We propose Geometric Distillation, a lightweight, annotation-free fine-tuning framework that injects human-inspired geometric cues into pretrained VLMs without modifying their architecture. By distilling (1) sparse correspondences, (2) relative depth relations, and (3) dense cost volumes from off-the-shelf 3D foundation models (e.g., MASt3R, VGGT), our method shapes representations to be geometry-aware while remaining compatible with natural image-text inputs. Through extensive evaluations on 3D vision-language reasoning and 3D perception benchmarks, our method consistently outperforms prior approaches, achieving improved 3D spatial reasoning with significantly lower computational cost. Our work demonstrates a scalable and efficient path to bridge 2D-trained VLMs with 3D understanding, opening up wider use in spatially grounded multimodal tasks.

[514] SAM2MOT: A Novel Paradigm of Multi-Object Tracking by Segmentation

Junjie Jiang, Zelin Wang, Manqi Zhao, Yin Li, DongSheng Jiang

Main category: cs.CV

TL;DR: SAM2MOT is a novel segmentation-driven multi-object tracking paradigm that replaces conventional detection-association frameworks, achieving state-of-the-art performance without fine-tuning.

DetailsMotivation: To break away from conventional detection-association frameworks in multi-object tracking and address challenges like false positives and occlusions by placing segmentation at the core of tracking.

Method: Integrates pre-trained detector, pre-trained segmentor with tracking logic into a zero-shot MOT system that requires no fine-tuning, using segmentation as the primary tracking mechanism.

Result: Achieves state-of-the-art results on DanceTrack (+2.1 HOTA, +4.5 IDF1), UAVDT, and BDD100K benchmarks, demonstrating effectiveness in handling complex tracking scenarios.

Conclusion: SAM2MOT significantly reduces dependence on labeled data and paves the way for transitioning MOT research from task-specific solutions to general-purpose systems.

Abstract: Inspired by Segment Anything 2, which generalizes segmentation from images to videos, we propose SAM2MOT–a novel segmentation-driven paradigm for multi-object tracking that breaks away from the conventional detection-association framework. In contrast to previous approaches that treat segmentation as auxiliary information, SAM2MOT places it at the heart of the tracking process, systematically tackling challenges like false positives and occlusions. Its effectiveness has been thoroughly validated on major MOT benchmarks. Furthermore, SAM2MOT integrates pre-trained detector, pre-trained segmentor with tracking logic into a zero-shot MOT system that requires no fine-tuning. This significantly reduces dependence on labeled data and paves the way for transitioning MOT research from task-specific solutions to general-purpose systems. Experiments on DanceTrack, UAVDT, and BDD100K show state-of-the-art results. Notably, SAM2MOT outperforms existing methods on DanceTrack by +2.1 HOTA and +4.5 IDF1, highlighting its effectiveness in MOT. Code is available at https://github.com/TripleJoy/SAM2MOT.

[515] Harnessing the Computation Redundancy in ViTs to Boost Adversarial Transferability

Jiani Liu, Zhiyuan Wang, Zeliang Zhang, Chao Huang, Susan Liang, Yunlong Tang, Chenliang Xu

Main category: cs.CV

TL;DR: The paper investigates how computational redundancy in Vision Transformers can be exploited to create more transferable adversarial examples, proposing several techniques that outperform existing methods.

DetailsMotivation: Vision Transformers have unique architectural properties that make adversarial examples more transferable than those from CNNs, suggesting structural characteristics favorable for transferable attacks. The authors aim to exploit computational redundancy in ViTs to improve adversarial attack effectiveness.

Method: The authors identify two forms of redundancy (data-level and model-level) and design techniques including attention sparsity manipulation, attention head permutation, clean token regularization, ghost MoE diversification, and test-time adversarial training.

Result: Extensive experiments on ImageNet-1k show the proposed methods significantly outperform existing baselines in both transferability and generality across diverse model architectures.

Conclusion: Computational redundancy in Vision Transformers can be effectively harnessed to create more transferable and general adversarial examples, demonstrating superior performance compared to existing attack methods.

Abstract: Vision Transformers (ViTs) have demonstrated impressive performance across a range of applications, including many safety-critical tasks. However, their unique architectural properties raise new challenges and opportunities in adversarial robustness. In particular, we observe that adversarial examples crafted on ViTs exhibit higher transferability compared to those crafted on CNNs, suggesting that ViTs contain structural characteristics favorable for transferable attacks. In this work, we investigate the role of computational redundancy in ViTs and its impact on adversarial transferability. Unlike prior studies that aim to reduce computation for efficiency, we propose to exploit this redundancy to improve the quality and transferability of adversarial examples. Through a detailed analysis, we identify two forms of redundancy, including the data-level and model-level, that can be harnessed to amplify attack effectiveness. Building on this insight, we design a suite of techniques, including attention sparsity manipulation, attention head permutation, clean token regularization, ghost MoE diversification, and test-time adversarial training. Extensive experiments on the ImageNet-1k dataset validate the effectiveness of our approach, showing that our methods significantly outperform existing baselines in both transferability and generality across diverse model architectures.

[516] OmniVDiff: Omni Controllable Video Diffusion for Generation and Understanding

Dianbing Xi, Jiepeng Wang, Yuanzhi Liang, Xi Qiu, Yuchi Huo, Rui Wang, Chi Zhang, Xuelong Li

Main category: cs.CV

TL;DR: OmniVDiff is a unified framework for controllable video diffusion that handles multiple visual modalities in a single model, supporting text-to-video generation, video understanding, and conditional video synthesis.

DetailsMotivation: To create a single diffusion model capable of synthesizing and comprehending multiple video visual content types, addressing the need for unified video generation and understanding systems.

Method: Treats all video visual modalities in color space to learn joint distributions, employs adaptive control strategy that dynamically adjusts each modality’s role (generation or conditioning) during diffusion process.

Result: Achieves state-of-the-art performance in video generation tasks and competitive results in video understanding, demonstrating flexibility for downstream applications like video-to-video translation and scene reconstruction.

Conclusion: OmniVDiff provides a scalable and flexible framework that effectively handles multiple video modalities in a unified manner, showing strong performance in both generation and understanding tasks.

Abstract: In this paper, we propose a novel framework for controllable video diffusion, OmniVDiff , aiming to synthesize and comprehend multiple video visual content in a single diffusion model. To achieve this, OmniVDiff treats all video visual modalities in the color space to learn a joint distribution, while employing an adaptive control strategy that dynamically adjusts the role of each visual modality during the diffusion process, either as a generation modality or a conditioning modality. Our framework supports three key capabilities: (1) Text-conditioned video generation, where all modalities are jointly synthesized from a textual prompt; (2) Video understanding, where structural modalities are predicted from rgb inputs in a coherent manner; and (3) X-conditioned video generation, where video synthesis is guided by finegrained inputs such as depth, canny and segmentation. Extensive experiments demonstrate that OmniVDiff achieves state-of-the-art performance in video generation tasks and competitive results in video understanding. Its flexibility and scalability make it well-suited for downstream applications such as video-to-video translation, modality adaptation for visual tasks, and scene reconstruction.

[517] Beyond Patches: Mining Interpretable Part-Prototypes for Explainable AI

Mahdi Alehdaghi, Rajarshi Bhattacharya, Pourya Shamsolmoali, Rafael M. O. Cruz, Eric Granger

Main category: cs.CV

TL;DR: PCMNet is a part-prototypical concept mining network that learns human-comprehensible prototypes from image regions without extra supervision, providing structured concept-level explanations and improved robustness for AI interpretability.

DetailsMotivation: Address limitations of existing interpretability methods - GradCAM provides limited conceptual insight, while prototype-based approaches have rigid region selection and lack semantic consistency. Need for AI systems to remain understandable and aligned with human expectations.

Method: Proposes PCMNet that learns human-comprehensible prototypes from meaningful image regions without additional supervision, clusters prototypes into concept groups, and extracts concept activation vectors for structured explanations.

Result: Outperforms state-of-the-art methods in interpretability, stability, and robustness across multiple image classification benchmarks. Enhances robustness to occlusion and challenging conditions.

Conclusion: Contributes to AI alignment by enhancing transparency, controllability, and trustworthiness in AI systems through structured concept-level explanations.

Abstract: As AI systems grow more capable, it becomes increasingly important that their decisions remain understandable and aligned with human expectations. A key challenge is the limited interpretability of deep models. Post-hoc methods like GradCAM offer heatmaps but provide limited conceptual insight, while prototype-based approaches offer example-based explanations but often rely on rigid region selection and lack semantic consistency. To address these limitations, we propose PCMNet, a part-prototypical concept mining network that learns human-comprehensible prototypes from meaningful image regions without additional supervision. By clustering these prototypes into concept groups and extracting concept activation vectors, PCMNet provides structured, concept-level explanations and enhances robustness to occlusion and challenging conditions, which are both critical for building reliable and aligned AI systems. Experiments across multiple image classification benchmarks show that PCMNet outperforms state-of-the-art methods in interpretability, stability, and robustness. This work contributes to AI alignment by enhancing transparency, controllability, and trustworthiness in AI systems. Our code is available at: https://github.com/alehdaghi/PCMNet.

[518] The Path to Reconciling Quality and Safety in Text-to-Image Generation: Dataset, Method, and Evaluation

Shouwei Ruan, Zhenyu Wu, Yao Huang, Ruochen Zhang, Yitong Sun, Caixin Kang, Shiji Zhao, Xingxing Wei

Main category: cs.CV

TL;DR: T2I-SPO framework addresses safety-quality trade-off in text-to-image models through dual-annotated dataset, synergistic optimization algorithm, and unified evaluation metric.

DetailsMotivation: Current text-to-image models face fundamental safety challenges with debilitating trade-offs between safety and generation quality, requiring systemic improvements across data, methods, and evaluation.

Method: Developed LibraAlign-100K dataset with dual safety-quality annotations; proposed Synergistic Preference Optimization (T2I-SPO) algorithm extending DPO with composite reward function; introduced Unified Alignment Score metric.

Result: Achieves state-of-the-art safety alignment against NSFW concepts while better maintaining generation quality and general model capabilities compared to existing methods.

Conclusion: The unified framework successfully mitigates the safety-quality trade-off through synergistic alignment across data, optimization methods, and evaluation protocols.

Abstract: Content safety is a fundamental challenge for text-to-image (T2I) models, yet prevailing methods enforce a debilitating trade-off between safety and generation quality. We argue that mitigating this trade-off hinges on addressing systemic challenges in current T2I safety alignment across data, methods, and evaluation protocols. To this end, we introduce a unified framework for synergistic safety alignment. First, to overcome the flawed data paradigm that provides biased optimization signals, we develop LibraAlign-100K, the first large-scale dataset with dual annotations for safety and quality. Second, to address the myopic optimization of existing methods focus solely on safety reward, we propose Synergistic Preference Optimization (T2I-SPO), a novel alignment algorithm that extends the DPO paradigm with a composite reward function that integrates generation safety and quality to holistically model user preferences. Finally, to overcome the limitations of quality-agnostic and binary evaluation in current protocols, we introduce the Unified Alignment Score, a holistic, fine-grained metric that fairly quantifies the balance between safety and generative capability. Extensive experiments demonstrate that T2I-SPO achieves state-of-the-art safety alignment against a wide range of NSFW concepts, while better maintaining the model’s generation quality and general capability

[519] Not All Attention Heads Are What You Need: Refining CLIP’s Image Representation with Attention Ablation

Feng Lin, Marco Chen, Haokui Zhang, Xiaotian Yu, Guangming Lu, Rong Xiao

Main category: cs.CV

TL;DR: AAT (Attention Ablation Technique) identifies and suppresses detrimental attention heads in CLIP’s image encoder to improve cross-modal retrieval performance by up to 11.1% with no extra inference cost.

DetailsMotivation: To investigate the role of attention heads in CLIP's image encoder and address the finding that certain distributed attention heads are detrimental to representation quality.

Method: Proposed Attention Ablation Technique (AAT) that systematically identifies harmful attention heads and suppresses them by directly manipulating their attention weights using two complementary strategies for different application scenarios.

Result: AAT consistently improves downstream performance across diverse domains, boosting recall by up to 11.1% on cross-modal retrieval benchmarks with minimal overhead.

Conclusion: AAT effectively refines large-scale vision-language models with virtually no extra inference cost while yielding semantically meaningful patterns that align with existing interpretability findings.

Abstract: This paper investigates the role of attention heads in CLIP’s image encoder. Building on interpretability studies, we conduct an exhaustive analysis and find that certain heads, distributed across layers, are detrimental to the resulting representations. To mitigate their impact, we propose a simple yet effective Attention Ablation Technique (AAT) that suppresses selected heads by directly manipulating their attention weights. By incorporating two complementary strategies tailored to different application scenarios, AAT enables the systematic identification and ablation of harmful heads with minimal overhead. Experiments show that AAT consistently improves downstream performance across diverse domains, boosting recall by up to 11.1% on cross-modal retrieval benchmarks. These results highlight that AAT can effectively refine large-scale VLMs with virtually no extra inference cost, while yielding semantically meaningful patterns that align with existing interpretability findings.

[520] TAPIP3D: Tracking Any Point in Persistent 3D Geometry

Bowei Zhang, Lei Ke, Adam W. Harley, Katerina Fragkiadaki

Main category: cs.CV

TL;DR: TAPIP3D introduces a novel 3D point tracking method that uses camera-stabilized spatio-temporal feature clouds to track points in monocular RGB and RGB-D videos over long time horizons, outperforming both 3D and 2D tracking methods.

DetailsMotivation: Existing 2D and 3D point tracking methods struggle with long-term tracking due to camera motion and irregular 3D point distributions. The goal is to develop a more robust tracking approach that effectively handles camera movement and leverages 3D spatial information.

Method: Lifts 2D video features into 3D world space using depth and camera motion information to create camera-stabilized spatio-temporal feature clouds. Uses iterative multi-frame motion estimation and a novel 3D Neighborhood-to-Neighborhood (N2N) attention mechanism for spatially coherent feature neighborhoods.

Result: Significantly outperforms existing 3D point tracking methods and surpasses state-of-the-art 2D pixel trackers in accuracy when reliable depth is available. Shows substantial gains in tracking robustness by compensating for camera motion.

Conclusion: TAPIP3D’s 3D-centric formulation with camera motion compensation and 3D attention mechanism provides strong and consistent performance across multiple 3D point tracking benchmarks, demonstrating the advantages of spatially grounded 3D representations over traditional 2D approaches.

Abstract: We introduce TAPIP3D, a novel approach for long-term 3D point tracking in monocular RGB and RGB-D videos. TAPIP3D represents videos as camera-stabilized spatio-temporal feature clouds, leveraging depth and camera motion information to lift 2D video features into a 3D world space where camera movement is effectively canceled out. Within this stabilized 3D representation, TAPIP3D iteratively refines multi-frame motion estimates, enabling robust point tracking over long time horizons. To handle the irregular structure of 3D point distributions, we propose a 3D Neighborhood-to-Neighborhood (N2N) attention mechanism - a 3D-aware contextualization strategy that builds informative, spatially coherent feature neighborhoods to support precise trajectory estimation. Our 3D-centric formulation significantly improves performance over existing 3D point tracking methods and even surpasses state-of-the-art 2D pixel trackers in accuracy when reliable depth is available. The model supports inference in both camera-centric (unstabilized) and world-centric (stabilized) coordinates, with experiments showing that compensating for camera motion leads to substantial gains in tracking robustness. By replacing the conventional 2D square correlation windows used in prior 2D and 3D trackers with a spatially grounded 3D attention mechanism, TAPIP3D achieves strong and consistent results across multiple 3D point tracking benchmarks. Project Page: https://tapip3d.github.io

[521] VistaDepth: Improving far-range Depth Estimation with Spectral Modulation and Adaptive Reweighting

Mingxia Zhan, Li Zhang, Yingjie Wang, Xiaomeng Chu, Beibei Wang, Yanyong Zhang

Main category: cs.CV

TL;DR: VistaDepth is a novel diffusion framework for monocular depth estimation that addresses limitations in far-range depth reconstruction through Latent Frequency Modulation and BiasMap mechanisms.

DetailsMotivation: Standard diffusion models for MDE struggle with far-range regions due to insufficient multi-scale processing for high-frequency details and training bias from the long-tail distribution of depth data favoring near-range regions.

Method: Proposes VistaDepth with two key innovations: 1) Latent Frequency Modulation (LFM) module that uses a lightweight network to predict dynamic spectral filters for refining latent features, and 2) BiasMap mechanism that adaptively reweights diffusion loss across timesteps to mitigate data bias.

Result: VistaDepth achieves state-of-the-art performance for diffusion-based MDE, particularly excelling in reconstructing detailed and accurate depth in far-range regions.

Conclusion: The proposed VistaDepth framework effectively addresses far-range depth estimation challenges through frequency-aware processing and adaptive loss reweighting, demonstrating superior performance in detailed depth reconstruction.

Abstract: Monocular depth estimation (MDE) aims to infer per-pixel depth from a single RGB image. While diffusion models have advanced MDE with impressive generalization, they often exhibit limitations in accurately reconstructing far-range regions. This difficulty arises from two key challenges. First, the implicit multi-scale processing in standard spatial-domain models can be insufficient for preserving the fine-grained, high-frequency details crucial for distant structures. Second, the intrinsic long-tail distribution of depth data imposes a strong training bias towards more prevalent near-range regions. To address these, we propose VistaDepth, a novel diffusion framework designed for balanced and accurate depth perception. We introduce two key innovations. First, the Latent Frequency Modulation (LFM) module enhances the model’s ability to represent high-frequency details. It operates by having a lightweight network predict a dynamic, content-aware spectral filter to refine latent features, thereby improving the reconstruction of distant structures. Second, our BiasMap mechanism introduces an adaptive reweighting of the diffusion loss strategically scaled across diffusion timesteps. It further aligns the supervision with the progressive denoising process, establishing a more consistent learning signal. As a result, it mitigates data bias without sacrificing training stability. Experiments show that VistaDepth achieves state-of-the-art performance for diffusion-based MDE, particularly excelling in reconstructing detailed and accurate depth in far-range regions.

[522] Almost Right: Making First-Layer Kernels Nearly Orthogonal Improves Model Generalization

Colton R. Crum, Adam Czajka

Main category: cs.CV

TL;DR: A lightweight regularization method that makes first-layer convolutional filters pairwise-orthogonal to reduce feature redundancy and improve generalization in open-set visual tasks, outperforming existing kernel orthogonalization approaches.

DetailsMotivation: CNNs have poor generalization in open-set tasks like biometric and medical applications, while humans excel at generalizing to unknown stimuli. Inspired by the efficient coding hypothesis from neuroscience, which suggests early visual structures minimize redundancy for information efficiency.

Method: Proposes a flexible approach that regularizes a subset of first-layer convolutional filters by making them pairwise-orthogonal, reducing feature redundancy without excessively constraining the network or increasing computational load.

Result: Evaluated on three open-set visual tasks (chest X-ray anomaly detection, synthetic face detection, iris presentation attack detection) and showed increased generalization capabilities compared to state-of-the-art kernel orthogonalization methods.

Conclusion: The proposed lightweight orthogonal regularization method effectively improves CNN generalization in open-set tasks while avoiding excessive constraints and computational overhead of existing approaches.

Abstract: Despite several algorithmic advances in the training of convolutional neural networks (CNNs) over the years, their generalization capabilities are still subpar across several pertinent domains, particularly within open-set tasks often found in biometric and medical contexts. On the contrary, humans have an uncanny ability to generalize to unknown visual stimuli. The efficient coding hypothesis posits that early visual structures (retina, Lateral Geniculate Nucleus, and primary visual cortex) transform inputs to reduce redundancy and maximize information efficiency. This mechanism of redundancy minimization in early vision was the inspiration for CNN regularization techniques that force convolutional kernels to be orthogonal. However, the existing works rely upon matrix projections, architectural modifications, or specific weight initializations, which frequently overtly constrain the network’s learning process and excessively increase the computational load during loss function calculation. In this paper, we introduce a flexible and lightweight approach that regularizes a subset of first-layer convolutional filters by making them pairwise-orthogonal, which reduces the redundancy of the extracted features but at the same time prevents putting excessive constraints on the network. We evaluate the proposed method on three open-set visual tasks (anomaly detection in chest X-ray images, synthetic face detection, and iris presentation attack detection) and observe an increase in the generalization capabilities of models trained with the proposed regularizer compared to state-of-the-art kernel orthogonalization approaches. We offer source codes along with the paper.

[523] FlexPara: Flexible Neural Surface Parameterization

Yuming Zhao, Qijian Zhang, Junhui Hou, Jiazhi Xia, Wenping Wang, Ying He

Main category: cs.CV

TL;DR: FlexPara is an unsupervised neural framework for flexible surface parameterization that establishes point-wise mappings between 3D surfaces and 2D UV coordinates without requiring manual cutting seams or high-quality mesh triangulation.

DetailsMotivation: Traditional parameterization methods need high-quality meshes, are limited to simple topologies, and require manual surface cutting. Optimal parameterization configurations vary with surface structures and tasks, requiring more flexible and controllable pipelines.

Method: Uses geometrically-interpretable sub-networks for cutting, deforming, unwrapping, and wrapping to create a bi-directional cycle mapping framework. Also includes multi-chart parameterization with adaptively-learned chart assignment.

Result: Extensive experiments show universality, superiority, and inspiring potential of the neural surface parameterization paradigm.

Conclusion: FlexPara provides a flexible neural approach for both global and multi-chart surface parameterizations that adapts to different surface structures without manual intervention.

Abstract: Surface parameterization is a fundamental geometry processing task, laying the foundations for the visual presentation of 3D assets and numerous downstream shape analysis scenarios. Conventional parameterization approaches demand high-quality mesh triangulation and are restricted to certain simple topologies unless additional surface cutting and decomposition are provided. In practice, the optimal configurations (e.g., type of parameterization domains, distribution of cutting seams, number of mapping charts) may vary drastically with different surface structures and task characteristics, thus requiring more flexible and controllable processing pipelines. To this end, this paper introduces FlexPara, an unsupervised neural optimization framework to achieve both global and multi-chart surface parameterizations by establishing point-wise mappings between 3D surface points and adaptively-deformed 2D UV coordinates. We ingeniously design and combine a series of geometrically-interpretable sub-networks, with specific functionalities of cutting, deforming, unwrapping, and wrapping, to construct a bi-directional cycle mapping framework for global parameterization without the need for manually specified cutting seams. Furthermore, we construct a multi-chart parameterization framework with adaptively-learned chart assignment. Extensive experiments demonstrate the universality, superiority, and inspiring potential of our neural surface parameterization paradigm. The code will be publicly available at https://github.com/AidenZhao/FlexPara

[524] Improving Small Drone Detection Through Multi-Scale Processing and Data Augmentation

Rayson Laroca, Marcelo dos Santos, David Menotti

Main category: cs.CV

TL;DR: A drone detection method using YOLOv11 with multi-scale processing, copy-paste data augmentation, and temporal consistency post-processing won 1st place in WOSDETC 2025 challenge.

DetailsMotivation: Detecting small drones that are visually similar to birds is critical for modern surveillance systems.

Method: Used YOLOv11 with multi-scale processing (whole image + segmented parts), copy-paste data augmentation for diverse drone/bird examples, and frame-to-frame consistency post-processing.

Result: Achieved first place in the 8th WOSDETC Drone-vs-Bird Detection Grand Challenge at IJCNN 2025.

Conclusion: The proposed approach effectively detects drones in complex environments by combining multi-scale processing, data augmentation, and temporal consistency techniques.

Abstract: Detecting small drones, often indistinguishable from birds, is crucial for modern surveillance. This work introduces a drone detection methodology built upon the medium-sized YOLOv11 object detection model. To enhance its performance on small targets, we implemented a multi-scale approach in which the input image is processed both as a whole and in segmented parts, with subsequent prediction aggregation. We also utilized a copy-paste data augmentation technique to enrich the training dataset with diverse drone and bird examples. Finally, we implemented a post-processing technique that leverages frame-to-frame consistency to mitigate missed detections. The proposed approach attained first place in the 8th WOSDETC Drone-vs-Bird Detection Grand Challenge, held at the 2025 International Joint Conference on Neural Networks (IJCNN), showcasing its capability to detect drones in complex environments effectively.

[525] Enhanced Partially Relevant Video Retrieval through Inter- and Intra-Sample Analysis with Coherence Prediction

Junlong Ren, Gangjian Zhang, Yu Hu, Jian Shu, Hui Xiong, Hao Wang

Main category: cs.CV

TL;DR: A novel PRVR framework that addresses semantic asymmetry by exploiting inter-sample correlation and intra-sample redundancy through three modules: ICE for pseudo-positive pairs, IRM for redundancy mitigation, and TCP for temporal coherence.

DetailsMotivation: Existing PRVR methods coarsely align videos and text, neglecting the critical cross-modal dual nature of inter-sample correlation and intra-sample redundancy that arises from semantic asymmetry between textual and visual modalities.

Method: Three core modules: 1) ICE captures inter-sample correlation using semantically similar unpaired text-video combinations; 2) IRM mitigates intra-sample redundancy by distinguishing relevant from redundant moments; 3) TCP enhances moment-level semantics by predicting temporal order of shuffled frames.

Result: Extensive experiments demonstrate state-of-the-art performance in partially relevant video retrieval tasks.

Conclusion: The proposed framework effectively addresses semantic asymmetry in PRVR by systematically exploiting inter-sample correlation and intra-sample redundancy, achieving superior retrieval performance.

Abstract: Partially Relevant Video Retrieval (PRVR) aims to retrieve the target video that is partially relevant to the text query. The primary challenge in PRVR arises from the semantic asymmetry between textual and visual modalities, as videos often contain substantial content irrelevant to the query. Existing methods coarsely align paired videos and text queries to construct the semantic space, neglecting the critical cross-modal dual nature inherent in this task: inter-sample correlation and intra-sample redundancy. To this end, we propose a novel PRVR framework to systematically exploit these two characteristics. Our framework consists of three core modules. First, the Inter Correlation Enhancement (ICE) module captures inter-sample correlation by identifying semantically similar yet unpaired text queries and video moments, combining them to form pseudo-positive pairs for more robust semantic space construction. Second, the Intra Redundancy Mining (IRM) module mitigates intra-sample redundancy by mining redundant moment features and distinguishing them from query-relevant moments, encouraging the model to learn more discriminative representations. Finally, to reinforce these modules, we introduce the Temporal Coherence Prediction (TCP) module, which enhances discrimination of fine-grained moment-level semantics by training the model to predict the original temporal order of randomly shuffled video frames and moments. Extensive experiments demonstrate the superiority of our method, achieving state-of-the-art results.

[526] TransPrune: Token Transition Pruning for Efficient Large Vision-Language Model

Ao Li, Yuxiang Duan, Jinghui Zhang, Congbo Ma, Yutong Xie, Gustavo Carneiro, Mohammad Yaqub, Hu Wang

Main category: cs.CV

TL;DR: TransPrune is a training-free token pruning method for Large Vision-Language Models that uses Token Transition Variation and Instruction-Guided Attention to efficiently reduce visual tokens while maintaining performance.

DetailsMotivation: Large Vision-Language Models face high computational costs due to excessive visual tokens, and existing attention-based pruning methods suffer from limitations like positional bias.

Method: Proposes TransPrune which progressively prunes tokens using Token Transition Variation (measuring changes in token representation magnitude/direction) and Instruction-Guided Attention (measuring instruction attention to image tokens).

Result: Achieves comparable multimodal performance to original LVLMs across eight benchmarks while reducing inference TFLOPs by more than half. TTV alone performs comparably to attention-based methods.

Conclusion: Token transition variation provides an effective alternative to attention-based importance criteria, enabling efficient token pruning without training while maintaining model performance.

Abstract: Large Vision-Language Models (LVLMs) have advanced multimodal learning but face high computational costs due to the large number of visual tokens, motivating token pruning to improve inference efficiency. The key challenge lies in identifying which tokens are truly important. Most existing approaches rely on attention-based criteria to estimate token importance. However, they inherently suffer from certain limitations, such as positional bias. In this work, we explore a new perspective on token importance based on token transitions in LVLMs. We observe that the transition of token representations provides a meaningful signal of semantic information. Based on this insight, we propose TransPrune, a training-free and efficient token pruning method. Specifically, TransPrune progressively prunes tokens by assessing their importance through a combination of Token Transition Variation (TTV)-which measures changes in both the magnitude and direction of token representations-and Instruction-Guided Attention (IGA), which measures how strongly the instruction attends to image tokens via attention. Extensive experiments demonstrate that TransPrune achieves comparable multimodal performance to original LVLMs, such as LLaVA-v1.5 and LLaVA-Next, across eight benchmarks, while reducing inference TFLOPs by more than half. Moreover, TTV alone can serve as an effective criterion without relying on attention, achieving performance comparable to attention-based methods. The code will be made publicly available upon acceptance of the paper at https://github.com/liaolea/TransPrune.

[527] Physics-Guided Image Dehazing Diffusion

Shijun Zhou, Xing Xie, Baojie Fan, Jiandong Tian

Main category: cs.CV

TL;DR: IDDM is a diffusion model that incorporates atmospheric scattering physics to bridge the domain gap between synthetic and real-world hazy images, enabling effective dehazing of real-world images despite training only on synthetic data.

DetailsMotivation: Current dehazing algorithms trained on synthetic datasets struggle to generalize to real-world scenarios due to the domain gap between synthetic and real hazy images.

Method: Propose Image Dehazing Diffusion Models (IDDM) that incorporates atmospheric scattering model into noise diffusion, using gradual haze formation to help denoising Unet learn clear image distribution from hazy inputs. Includes specialized training strategy with simultaneous haze and noise introduction in forward process.

Result: IDDM shows domain generalization ability and effectively restores real-world hazy images despite being trained only on synthetic datasets, outperforming state-of-the-art approaches in quantitative and qualitative comparisons.

Conclusion: The proposed physics-guided diffusion model successfully bridges the synthetic-to-real domain gap for image dehazing, demonstrating robust performance on real-world images through incorporation of atmospheric scattering principles.

Abstract: Due to the domain gap between real-world and synthetic hazy images, current data-driven dehazing algorithms trained on synthetic datasets perform well on synthetic data but struggle to generalize to real-world scenarios. To address this challenge, we propose \textbf{I}mage \textbf{D}ehazing \textbf{D}iffusion \textbf{M}odels (IDDM), a novel diffusion process that incorporates the atmospheric scattering model into noise diffusion. IDDM aims to use the gradual haze formation process to help the denoising Unet robustly learn the distribution of clear images from the conditional input hazy images. We design a specialized training strategy centered around IDDM. Diffusion models are leveraged to bridge the domain gap from synthetic to real-world, while the atmospheric scattering model provides physical guidance for haze formation. During the forward process, IDDM simultaneously introduces haze and noise into clear images, and then robustly separates them during the sampling process. By training with physics-guided information, IDDM shows the ability of domain generalization, and effectively restores the real-world hazy images despite being trained on synthetic datasets. Extensive experiments demonstrate the effectiveness of our method through both quantitative and qualitative comparisons with state-of-the-art approaches.

[528] MoCHA: Advanced Vision-Language Reasoning with MoE Connector and Hierarchical Group Attention

Yuqi Pang, Bowen Yang, Yun Cao, Rong Fan, Xiaoyu Li, Chen He

Main category: cs.CV

TL;DR: MoCHA is a novel VLLM framework that integrates multiple vision backbones with sparse Mixture of Experts Connectors and Hierarchical Group Attention to efficiently handle complex visual information while reducing training/inference costs.

DetailsMotivation: Current VLLMs face high training/inference costs and challenges in extracting visual details and bridging modalities effectively.

Method: Integrates CLIP, SigLIP, DINOv2 and ConvNeXt vision backbones with sparse MoECs module for dynamic expert selection, plus Hierarchical Group Attention with intra/inter-group operations and adaptive gating.

Result: Outperforms state-of-the-art models, showing 3.25% improvement in POPE for hallucination mitigation and 153-point increase on MME for visual instruction following.

Conclusion: MoCHA effectively addresses VLLM limitations through multi-backbone integration and specialized attention mechanisms, with ablation studies confirming the robustness of MoECs and HGA.

Abstract: Vision large language models (VLLMs) are focusing primarily on handling complex and fine-grained visual information by incorporating advanced vision encoders and scaling up visual models. However, these approaches face high training and inference costs, as well as challenges in extracting visual details, effectively bridging across modalities. In this work, we propose a novel visual framework, MoCHA, to address these issues. Our framework integrates four vision backbones (i.e., CLIP, SigLIP, DINOv2 and ConvNeXt) to extract complementary visual features and is equipped with a sparse Mixture of Experts Connectors (MoECs) module to dynamically select experts tailored to different visual dimensions. To mitigate redundant or insufficient use of the visual information encoded by the MoECs module, we further design a Hierarchical Group Attention (HGA) with intra- and inter-group operations and an adaptive gating strategy for encoded visual features. We train MoCHA on two mainstream LLMs (e.g., Phi2-2.7B and Vicuna-7B) and evaluate their performance across various benchmarks. Notably, MoCHA outperforms state-of-the-art open-weight models on various tasks. For example, compared to CuMo (Mistral-7B), our MoCHA (Phi2-2.7B) presents outstanding abilities to mitigate hallucination by showing improvements of 3.25% in POPE and to follow visual instructions by raising 153 points on MME. Finally, ablation studies further confirm the effectiveness and robustness of the proposed MoECs and HGA in improving the overall performance of MoCHA.

[529] Point2Primitive: CAD Reconstruction from Point Cloud by Direct Primitive Prediction

Xinzhu Ma, Cheng Wang, Chen Tang, Bin Wang, Shixiang Tang, Yuan Meng, Yunhong Wang, Di Huang

Main category: cs.CV

TL;DR: Point2Primitive directly predicts parametric CAD primitives from point clouds using transformer-based explicit position queries, avoiding the precision issues of implicit neural representations.

DetailsMotivation: Implicit neural representations like SDFs struggle with precision in CAD model recovery, leading to curved edges and non-editable models.

Method: Treats sketch reconstruction as set prediction with improved transformer decoder using explicit position queries to directly detect and predict sketch curves from point clouds.

Result: Significantly outperforms implicit methods in both primitive accuracy and overall geometric fidelity.

Conclusion: Direct prediction paradigm is superior to implicit methods for CAD model recovery from point clouds.

Abstract: Recovering CAD models from point clouds requires reconstructing their topology and sketch-based extrusion primitives. A dominant paradigm for representing sketches involves implicit neural representations such as Signed Distance Fields (SDFs). However, this indirect approach inherently struggles with precision, leading to unintended curved edges and models that are difficult to edit. In this paper, we propose Point2Primitive, a framework that learns to directly predict the explicit, parametric primitives of CAD models. Our method treats sketch reconstruction as a set prediction problem, employing a improved transformer-based decoder with explicit position queries to directly detect and predict the fundamental sketch curves (i.e., type and parameter) from the point cloud. Instead of approximating a continuous field, we formulate curve parameters as explicit position queries, which are optimized autoregressively to achieve high accuracy. The overall topology is rebuilt via extrusion segmentation. Extensive experiments demonstrate that this direct prediction paradigm significantly outperforms implicit methods in both primitive accuracy and overall geometric fidelity.

[530] Intention-Guided Cognitive Reasoning for Egocentric Long-Term Action Anticipation

Qiaohui Chu, Haoyu Zhang, Meng Liu, Yisen Feng, Haoxiang Shi, Liqiang Nie

Main category: cs.CV

TL;DR: INSIGHT is a two-stage framework for long-term action anticipation from egocentric video that addresses limitations in existing methods by leveraging hand-object interactions, verb-noun dependencies, and explicit cognitive reasoning.

DetailsMotivation: Existing approaches for long-term action anticipation from egocentric video underutilize fine-grained visual cues, neglect semantic dependencies between verbs and nouns, and lack explicit cognitive reasoning, limiting generalization and forecasting ability.

Method: A unified two-stage framework: 1) extracts semantically rich features from hand-object interactions and enhances action representations using verb-noun co-occurrence matrix; 2) uses reinforcement learning-based module simulating cognitive reasoning through visual perception -> intention inference -> action anticipation.

Result: Achieves state-of-the-art performance on Ego4D, EPIC-Kitchens-55, and EGTEA Gaze+ benchmarks, demonstrating effectiveness and strong generalization capability.

Conclusion: INSIGHT successfully addresses key limitations in egocentric action anticipation by integrating hand-object interaction features, semantic dependencies, and explicit cognitive reasoning, leading to improved performance and generalization.

Abstract: Long-term action anticipation from egocentric video is critical for applications such as human-computer interaction and assistive technologies, where anticipating user intent enables proactive and context-aware AI assistance. However, existing approaches suffer from three key limitations: 1) underutilization of fine-grained visual cues from hand-object interactions, 2) neglect of semantic dependencies between verbs and nouns, and 3) lack of explicit cognitive reasoning, limiting generalization and long-term forecasting ability. To overcome these challenges, we propose INSIGHT, a unified two-stage framework for egocentric action anticipation. In the first stage, INSIGHT focuses on extracting semantically rich features from hand-object interaction regions and enhances action representations using a verb-noun co-occurrence matrix. In the second stage, it introduces a reinforcement learning-based module that simulates explicit cognitive reasoning through a structured process: visual perception (think) -> intention inference (reason) -> action anticipation (answer). Extensive experiments on Ego4D, EPIC-Kitchens-55, and EGTEA Gaze+ benchmarks show that INSIGHT achieves state-of-the-art performance, demonstrating its effectiveness and strong generalization capability.

[531] SITE: towards Spatial Intelligence Thorough Evaluation

Wenqi Wang, Reuben Tan, Pengyue Zhu, Jianwei Yang, Zhengyuan Yang, Lijuan Wang, Andrey Kolobov, Jianfeng Gao, Boqing Gong

Main category: cs.CV

TL;DR: SITE is a benchmark dataset for evaluating spatial intelligence in vision-language models across diverse visual modalities and spatial factors, revealing that current models lag behind humans in spatial orientation and showing correlation with embodied AI performance.

DetailsMotivation: To address the need for standardized evaluation of spatial intelligence in AI systems, which is crucial for applications from neuroscience to robotics, as existing benchmarks lack comprehensive coverage of spatial intelligence factors.

Method: Combined bottom-up survey of 31 existing datasets with top-down strategy using cognitive science classification systems to design novel tasks about view-taking and dynamic scenes, creating a multi-choice visual question-answering benchmark.

Result: Leading vision-language models significantly underperform human experts, particularly in spatial orientation. A positive correlation was found between spatial reasoning proficiency and performance on embodied AI tasks.

Conclusion: The SITE benchmark effectively reveals gaps in current AI spatial intelligence capabilities and demonstrates the importance of spatial reasoning for embodied AI applications, highlighting areas for future model improvement.

Abstract: Spatial intelligence (SI) represents a cognitive ability encompassing the visualization, manipulation, and reasoning about spatial relationships, underpinning disciplines from neuroscience to robotics. We introduce SITE, a benchmark dataset towards SI Thorough Evaluation in a standardized format of multi-choice visual question-answering, designed to assess large vision-language models’ spatial intelligence across diverse visual modalities (single-image, multi-image, and video) and SI factors (figural to environmental scales, spatial visualization and orientation, intrinsic and extrinsic, static and dynamic). Our approach to curating the benchmark combines a bottom-up survey about 31 existing datasets and a top-down strategy drawing upon three classification systems in cognitive science, which prompt us to design two novel types of tasks about view-taking and dynamic scenes. Extensive experiments reveal that leading models fall behind human experts especially in spatial orientation, a fundamental SI factor. Moreover, we demonstrate a positive correlation between a model’s spatial reasoning proficiency and its performance on an embodied AI task.

[532] Landsat30-AU: A Vision-Language Dataset for Australian Landsat Imagery

Sai Ma, Zhuang Li, John A Taylor

Main category: cs.CV

TL;DR: Landsat30-AU is a large-scale vision-language dataset for satellite imagery, addressing gaps in existing datasets by including low-resolution, multi-satellite, long-term Landsat data spanning 36 years over Australia.

DetailsMotivation: To democratize Earth observation by making satellite data accessible to non-specialists and enabling planet-scale automation, as existing datasets focus mainly on short-term, high-resolution imagery from limited satellites.

Method: Created Landsat30-AU dataset with two components: Landsat30-AU-Cap (196,262 image-caption pairs) and Landsat30-AU-VQA (17,725 human-verified VQA samples) across eight remote sensing domains, using a bootstrapped pipeline with generic VLMs and iterative refinement.

Result: Off-the-shelf VLMs struggle with satellite imagery (EarthDial: 0.07 SPIDEr captioning, 0.48 VQA accuracy). Fine-tuning Qwen2.5-VL-7B on Landsat30-AU improves captioning from 0.11 to 0.31 SPIDEr and VQA accuracy from 0.74 to 0.87.

Conclusion: Landsat30-AU addresses critical gaps in satellite vision-language datasets and enables significant performance improvements through domain-specific fine-tuning, advancing accessible Earth observation capabilities.

Abstract: Vision language models (VLMs) that enable natural language interaction with satellite imagery can democratize Earth observation by accelerating expert workflows, making data accessible to non-specialists, and enabling planet-scale automation. However, existing datasets focus mainly on short-term, high-resolution imagery from a limited number of satellites, overlooking low-resolution, multi-satellite, long-term archives, such as Landsat, that are essential for affordable and bias-robust global monitoring. We address this gap with Landsat30-AU, a large-scale vision-language dataset built from 30-meter resolution imagery collected by four Landsat satellites (5, 7, 8, and 9) over Australia, spanning more than 36 years. The dataset includes two components: Landsat30-AU-Cap, containing $196,262$ image-caption pairs, and Landsat30-AU-VQA, comprising 17,725 human-verified visual question answering (VQA) samples across eight remote sensing domains. Both datasets are curated through a bootstrapped pipeline that leverages generic VLMs with iterative refinement and human verification to ensure quality. Our evaluation of eight VLMs on our benchmark reveals that off-the-shelf models struggle to understand satellite imagery. The open-source remote-sensing VLM EarthDial achieves only 0.07 SPIDEr in captioning and a VQA accuracy of 0.48, highlighting the limitations of current approaches. Encouragingly, lightweight fine-tuning of Qwen2.5-VL-7B on Landsat30-AU improves captioning performance from 0.11 to 0.31 SPIDEr and boosts VQA accuracy from 0.74 to 0.87. Code and data are available at https://github.com/papersubmit1/landsat30-au.

[533] FaceShield: Explainable Face Anti-Spoofing with Multimodal Large Language Models

Hongyang Wang, Yichen Shi, Zhuofu Tao, Yuhao Gao, Liepiao Zhang, Xun Lin, Jun Feng, Xiaochen Yuan, Zitong Yu, Xiaochun Cao

Main category: cs.CV

TL;DR: FaceShield is a multimodal large language model designed for face anti-spoofing that can detect fake faces, identify attack types, provide reasoning, and locate attack areas, outperforming previous methods on multiple benchmarks.

DetailsMotivation: Previous face anti-spoofing methods lacked interpretability and reasoning capabilities. Multimodal large language models show strong potential for visual tasks but no specialized model existed for FAS tasks.

Method: Uses spoof-aware vision perception with original images and auxiliary information, plus prompt-guided vision token masking to improve generalization. Pre-trained on FaceShield-pre10K and fine-tuned on FaceShield-sft45K datasets.

Result: Significantly outperforms previous deep learning models and general MLLMs on four FAS tasks: coarse-grained classification, fine-grained classification, reasoning, and attack localization across three benchmark datasets.

Conclusion: FaceShield provides a comprehensive MLLM solution for face anti-spoofing with strong performance and interpretability, addressing the gap in specialized FAS models.

Abstract: Face anti-spoofing (FAS) is crucial for protecting facial recognition systems from presentation attacks. Previous methods approached this task as a classification problem, lacking interpretability and reasoning behind the predicted results. Recently, multimodal large language models (MLLMs) have shown strong capabilities in perception, reasoning, and decision-making in visual tasks. However, there is currently no universal and comprehensive MLLM and dataset specifically designed for FAS task. To address this gap, we propose FaceShield, a MLLM for FAS, along with the corresponding pre-training and supervised fine-tuning (SFT) datasets, FaceShield-pre10K and FaceShield-sft45K. FaceShield is capable of determining the authenticity of faces, identifying types of spoofing attacks, providing reasoning for its judgments, and detecting attack areas. Specifically, we employ spoof-aware vision perception (SAVP) that incorporates both the original image and auxiliary information based on prior knowledge. We then use an prompt-guided vision token masking (PVTM) strategy to random mask vision tokens, thereby improving the model’s generalization ability. We conducted extensive experiments on three benchmark datasets, demonstrating that FaceShield significantly outperforms previous deep learning models and general MLLMs on four FAS tasks, i.e., coarse-grained classification, fine-grained classification, reasoning, and attack localization. Our instruction datasets, protocols, and codes will be released at https://github.com/Why0912/FaceShield.

[534] HierarchicalPrune: Position-Aware Compression for Large-Scale Diffusion Models

Young D. Kwon, Rui Li, Sijia Li, Da Li, Sourav Bhattacharya, Stylianos I. Venieris

Main category: cs.CV

TL;DR: HierarchicalPrune is a compression framework for billion-scale text-to-image diffusion models that reduces memory usage by 77.5-80.4% and latency by 27.9-38.0% while maintaining output quality through hierarchical pruning and sensitivity-guided distillation.

DetailsMotivation: Large diffusion models (8-11B parameters) are challenging to deploy on resource-constrained devices due to their massive parameter scale and computational demands.

Method: Three synergistic techniques: Hierarchical Position Pruning (removes less essential later blocks), Positional Weight Preservation (protects early semantic blocks), and Sensitivity-Guided Distillation (adjusts knowledge transfer based on block sensitivity).

Result: Achieves 77.5-80.4% memory reduction (15.8GB to 3.2GB), 27.9-38.0% latency reduction, with minimal quality drop (2.6% GenEval, 7% HPSv2). User study shows perceptual quality comparable to original model.

Conclusion: HierarchicalPrune successfully compresses billion-scale diffusion models for on-device inference while preserving output quality, significantly outperforming prior compression methods.

Abstract: State-of-the-art text-to-image diffusion models (DMs) achieve remarkable quality, yet their massive parameter scale (8-11B) poses significant challenges for inferences on resource-constrained devices. In this paper, we present HierarchicalPrune, a novel compression framework grounded in a key observation: DM blocks exhibit distinct functional hierarchies, where early blocks establish semantic structures while later blocks handle texture refinements. HierarchicalPrune synergistically combines three techniques: (1) Hierarchical Position Pruning, which identifies and removes less essential later blocks based on position hierarchy; (2) Positional Weight Preservation, which systematically protects early model portions that are essential for semantic structural integrity; and (3) Sensitivity-Guided Distillation, which adjusts knowledge-transfer intensity based on our discovery of block-wise sensitivity variations. As a result, our framework brings billion-scale diffusion models into a range more suitable for on-device inference, while preserving the quality of the output images. Specifically, combined with INT4 weight quantisation, HierarchicalPrune achieves 77.5-80.4% memory footprint reduction (e.g., from 15.8 GB to 3.2 GB) and 27.9-38.0% latency reduction, measured on server and consumer grade GPUs, with the minimum drop of 2.6% in GenEval score and 7% in HPSv2 score compared to the original model. Finally, our comprehensive user study with 85 participants demonstrates that HierarchicalPrune maintains perceptual quality comparable to the original model while significantly outperforming prior works.

[535] Self-NPO: Data-Free Diffusion Model Enhancement via Truncated Diffusion Fine-Tuning

Fu-Yun Wang, Keqiang Sun, Yao Teng, Xihui Liu, Jiale Yuan, Jiaming Song, Hongsheng Li

Main category: cs.CV

TL;DR: Self-NPO introduces a data-free negative preference optimization method for diffusion models that learns from the model itself without manual labeling or reward models, achieving comparable performance to Diffusion-NPO at less than 1% training cost.

DetailsMotivation: Existing preference optimization methods focus on producing favorable outputs but overlook classifier-free guidance's role in mitigating undesirable results. Prior negative preference optimization approaches require costly explicit preference annotations, limiting practicality in data-scarce domains.

Method: Self-NPO uses truncated diffusion fine-tuning, a data-free approach that directly learns from the model itself to perform negative preference optimization, eliminating the need for manual data labeling or reward model training.

Result: The method is highly efficient (less than 1% training cost of Diffusion-NPO) and achieves comparable performance. It integrates seamlessly with SD1.5, SDXL, CogVideoX, and preference-optimized models, enhancing both generation quality and human preference alignment.

Conclusion: Self-NPO provides an efficient, data-free alternative to existing negative preference optimization methods, making preference alignment more practical and accessible across various diffusion models without requiring costly annotation procedures.

Abstract: Diffusion models have demonstrated remarkable success in various visual generation tasks, including image, video, and 3D content generation. Preference optimization (PO) is a prominent and growing area of research that aims to align these models with human preferences. While existing PO methods primarily concentrate on producing favorable outputs, they often overlook the significance of classifier-free guidance (CFG) in mitigating undesirable results. Diffusion-NPO addresses this gap by introducing negative preference optimization (NPO), training models to generate outputs opposite to human preferences and thereby steering them away from unfavorable outcomes through CFG. However, prior NPO approaches rely on costly and fragile procedures for obtaining explicit preference annotations (e.g., manual pairwise labeling or reward model training), limiting their practicality in domains where such data are scarce or difficult to acquire. In this work, we propose Self-NPO, specifically truncated diffusion fine-tuning, a data-free approach of negative preference optimization by directly learning from the model itself, eliminating the need for manual data labeling or reward model training. This data-free approach is highly efficient (less than 1% training cost of Diffusion-NPO) and achieves comparable performance to Diffusion-NPO in a data-free manner. We demonstrate that Self-NPO integrates seamlessly into widely used diffusion models, including SD1.5, SDXL, and CogVideoX, as well as models already optimized for human preferences, consistently enhancing both their generation quality and alignment with human preferences. Code is available at https://github.com/G-U-N/Diffusion-NPO.

[536] Robust Drone-View Geo-Localization via Content-Viewpoint Disentanglement

Ke Li, Di Wang, Xiaowei Wang, Zhihong Wu, Yiming Zhang, Yifeng Wang, Quan Wang

Main category: cs.CV

TL;DR: CVD is a drone-view geo-localization framework that explicitly disentangles content and viewpoint factors using mutual information minimization and cross-view reconstruction constraints, improving robustness across different scenarios and viewpoints.

DetailsMotivation: Existing DVGL methods assume direct alignment of drone and satellite images in shared feature space, but overlook viewpoint-induced conflicts that cause inconsistent features and hinder precise localization.

Method: Models cross-view feature space as a composite manifold of content and viewpoint factors. Uses intra-view independence constraint (minimizing mutual information) and inter-view reconstruction constraint (cross-combining factors from paired images) for effective disentanglement.

Result: Extensive experiments on University-1652 and SUES-200 show strong robustness and generalization across scenarios, viewpoints and altitudes. Further evaluations on CVUSA and CVACT confirm consistent improvements.

Conclusion: CVD effectively addresses viewpoint discrepancies in DVGL through explicit factor disentanglement, integrates seamlessly as plug-and-play module, reduces inference latency, and demonstrates superior performance across multiple benchmarks.

Abstract: Drone-view geo-localization (DVGL) aims to match images of the same geographic location captured from drone and satellite perspectives. Despite recent advances, DVGL remains challenging due to significant appearance changes and spatial distortions caused by viewpoint variations. Existing methods typically assume that drone and satellite images can be directly aligned in a shared feature space via contrastive learning. Nonetheless, this assumption overlooks the inherent conflicts induced by viewpoint discrepancies, resulting in extracted features containing inconsistent information that hinders precise localization. In this study, we take a manifold learning perspective and model $\textit{the feature space of cross-view images as a composite manifold jointly governed by content and viewpoint}$. Building upon this insight, we propose $\textbf{CVD}$, a new DVGL framework that explicitly disentangles $\textit{content}$ and $\textit{viewpoint}$ factors. To promote effective disentanglement, we introduce two constraints: $\textit{(i)}$ an intra-view independence constraint that encourages statistical independence between the two factors by minimizing their mutual information; and $\textit{(ii)}$ an inter-view reconstruction constraint that reconstructs each view by cross-combining $\textit{content}$ and $\textit{viewpoint}$ from paired images, ensuring factor-specific semantics are preserved. As a plug-and-play module, CVD integrates seamlessly into existing DVGL pipelines and reduces inference latency. Extensive experiments on University-1652 and SUES-200 show that CVD exhibits strong robustness and generalization across various scenarios, viewpoints and altitudes, with further evaluations on CVUSA and CVACT confirming consistent improvements.

[537] edgeVLM: Cloud-edge Collaborative Real-time VLM based on Context Transfer

Chen Qian, Xinran Yu, Zewen Huang, Danyang Li, Qiang Ma, Fan Dang, Xuan Ding, Guangyong Shang, Zheng Yang

Main category: cs.CV

TL;DR: Proposes Context Transfer paradigm for cloud-edge VLM collaboration, treating delayed LVLM outputs as historical context to guide real-time SVLM inference, with edgeVLM implementation showing effectiveness across multiple tasks.

DetailsMotivation: Existing cloud-edge collaborative architectures for VLMs fail to handle cloud latency fluctuations and don't leverage delayed but accurate LVLM responses for real-time guidance.

Method: Context Transfer paradigm that uses delayed LVLM outputs as historical context for SVLM inference, with edgeVLM implementation featuring context replacement and visual focus modules to refine text input and enhance visual grounding.

Result: Extensive experiments on three real-time vision-language reasoning tasks across four datasets demonstrate the framework’s effectiveness.

Conclusion: The new paradigm provides groundwork for more effective and latency-aware collaboration strategies in future VLM systems.

Abstract: Vision-Language Models (VLMs) are increasingly deployed in real-time applications such as autonomous driving and human-computer interaction, which demand fast and reliable responses based on accurate perception. To meet these requirements, existing systems commonly employ cloud-edge collaborative architectures, such as partitioned Large Vision-Language Models (LVLMs) or task offloading strategies between Large and Small Vision-Language Models (SVLMs). However, these methods fail to accommodate cloud latency fluctuations and overlook the full potential of delayed but accurate LVLM responses. In this work, we propose a novel cloud-edge collaborative paradigm for VLMs, termed Context Transfer, which treats the delayed outputs of LVLMs as historical context to provide real-time guidance for SVLMs inference. Based on this paradigm, we design edgeVLM, which incorporates both context replacement and visual focus modules to refine historical textual input and enhance visual grounding consistency. Extensive experiments on three real-time vision-lanuage reasoning tasks across four datasets demonstrate the effectiveness of the proposed framework. The new paradigm lays the groundwork for more effective and latency-aware collaboration strategies in future VLM systems.

[538] Use as Many Surrogates as You Want: Selective Ensemble Attack to Unleash Transferability without Sacrificing Resource Efficiency

Bo Yang, Hengwei Zhang, Jindong Wang, Yuchen Ren, Chenhao Lin, Chao Shen, Zhengyu Zhao

Main category: cs.CV

TL;DR: SEA is a selective ensemble attack that dynamically chooses diverse surrogate models across iterations to boost transferability without sacrificing efficiency, overcoming the traditional trade-off between these two factors.

DetailsMotivation: Existing surrogate ensemble attacks face a trade-off where using more models improves transferability but reduces efficiency. This limitation persists despite the availability of many pre-trained models online.

Method: SEA dynamically selects diverse models from accessible pre-trained models across iterations, decoupling within-iteration and cross-iteration model diversity. It maintains a fixed number of within-iteration models for efficiency while increasing cross-iteration diversity for transferability.

Result: Experiments on ImageNet show SEA achieves 8.5% higher transferability than existing attacks under the same efficiency when selecting 4 from 20 models. The method also generalizes well to commercial vision APIs and large vision-language models.

Conclusion: SEA enables adaptive balancing of transferability and efficiency according to specific resource requirements, opening new possibilities for efficient ensemble attacks.

Abstract: In surrogate ensemble attacks, using more surrogate models yields higher transferability but lower resource efficiency. This practical trade-off between transferability and efficiency has largely limited existing attacks despite many pre-trained models are easily accessible online. In this paper, we argue that such a trade-off is caused by an unnecessary common assumption, i.e., all models should be \textit{identical} across iterations. By lifting this assumption, we can use as many surrogates as we want to unleash transferability without sacrificing efficiency. Concretely, we propose Selective Ensemble Attack (SEA), which dynamically selects diverse models (from easily accessible pre-trained models) across iterations based on our new interpretation of decoupling within-iteration and cross-iteration model diversity. In this way, the number of within-iteration models is fixed for maintaining efficiency, while only cross-iteration model diversity is increased for higher transferability. Experiments on ImageNet demonstrate the superiority of SEA in various scenarios. For example, when dynamically selecting 4 from 20 accessible models, SEA yields 8.5% higher transferability than existing attacks under the same efficiency. The superiority of SEA also generalizes to real-world systems, such as commercial vision APIs and large vision-language models. Overall, SEA opens up the possibility of adaptively balancing transferability and efficiency according to specific resource requirements.

[539] Conditional Panoramic Image Generation via Masked Autoregressive Modeling

Chaoyang Wang, Xiangtai Li, Lu Qi, Xiaofan Lin, Jinbin Bai, Qianyu Zhou, Yunhai Tong

Main category: cs.CV

TL;DR: PAR is a unified autoregressive framework for panoramic image generation that addresses limitations of diffusion models and task separation in existing approaches.

DetailsMotivation: Existing panoramic generation methods have two key limitations: diffusion models violate i.i.d. assumptions in equirectangular projections, and text/image conditioning are treated as separate tasks with distinct architectures.

Method: Proposed Panoramic AutoRegressive model (PAR) using masked autoregressive modeling, circular padding for spatial coherence, and consistency alignment strategy for improved quality.

Result: Extensive experiments show competitive performance in text-to-image generation and panorama outpainting, with promising scalability and generalization capabilities.

Conclusion: PAR provides a unified solution that overcomes fundamental limitations of diffusion models for panoramic generation while integrating text and image conditioning in a cohesive architecture.

Abstract: Recent progress in panoramic image generation has underscored two critical limitations in existing approaches. First, most methods are built upon diffusion models, which are inherently ill-suited for equirectangular projection (ERP) panoramas due to the violation of the identically and independently distributed (i.i.d.) Gaussian noise assumption caused by their spherical mapping. Second, these methods often treat text-conditioned generation (text-to-panorama) and image-conditioned generation (panorama outpainting) as separate tasks, relying on distinct architectures and task-specific data. In this work, we propose a unified framework, Panoramic AutoRegressive model (PAR), which leverages masked autoregressive modeling to address these challenges. PAR avoids the i.i.d. assumption constraint and integrates text and image conditioning into a cohesive architecture, enabling seamless generation across tasks. To address the inherent discontinuity in existing generative models, we introduce circular padding to enhance spatial coherence and propose a consistency alignment strategy to improve generation quality. Extensive experiments demonstrate competitive performance in text-to-image generation and panorama outpainting tasks while showcasing promising scalability and generalization capabilities.

[540] Fast Kernel-Space Diffusion for Remote Sensing Pansharpening

Hancong Jin, Zihan Cao, Liang-jian Deng, Jingjing Li

Main category: cs.CV

TL;DR: KSDiff is a fast kernel-space diffusion framework that generates convolutional kernels with global context for pansharpening, achieving superior performance with 500x faster inference than diffusion-based baselines.

DetailsMotivation: Existing deep learning methods for pansharpening fail to capture global priors in remote sensing data, while diffusion models suffer from heavy inference latency despite their powerful distribution mapping capabilities.

Method: KSDiff constructs convolutional kernels through integration of a low-rank core tensor generator and unified factor generator, orchestrated by structure-aware multi-head attention, with a two-stage training strategy for pansharpening.

Result: KSDiff achieves superior performance compared to recent methods and demonstrates over 500x faster inference than diffusion-based pansharpening baselines.

Conclusion: The proposed KSDiff framework effectively enhances pansharpening quality while significantly accelerating inference, with ablation studies and evaluations confirming its effectiveness.

Abstract: Pansharpening seeks to fuse high-resolution panchromatic (PAN) and low-resolution multispectral (LRMS) images into a single image with both fine spatial and rich spectral detail. Despite progress in deep learning-based approaches, existing methods often fail to capture global priors inherent in remote sensing data distributions. Diffusion-based models have recently emerged as promising solutions due to their powerful distribution mapping capabilities, however, they suffer from heavy inference latency. We introduce KSDiff, a fast kernel-space diffusion framework that generates convolutional kernels enriched with global context to enhance pansharpening quality and accelerate inference. Specifically, KSDiff constructs these kernels through the integration of a low-rank core tensor generator and a unified factor generator, orchestrated by a structure-aware multi-head attention mechanism. We further introduce a two-stage training strategy tailored for pansharpening, facilitating integration into existing pansharpening architectures. Experiments show that KSDiff achieves superior performance compared to recent promising methods, and with over $500 \times$ faster inference than diffusion-based pansharpening baselines. Ablation studies, visualizations and further evaluations substantiate the effectiveness of our approach. Code will be released upon possible acceptance.

[541] Towards Cross-Domain Multi-Targeted Adversarial Attacks

Taïga Gonçalves, Tomo Miyazaki, Shinichiro Omachi

Main category: cs.CV

TL;DR: CD-MTA enables cross-domain multi-targeted adversarial attacks using only a single target image, without requiring access to victim model’s training data or predefined target classes.

DetailsMotivation: Existing multi-targeted attacks are limited to predefined target classes and require access to victim model's training data, raising privacy concerns in black-box scenarios.

Method: Uses Feature Injection Module (FIM) and class-agnostic objectives to extract transferable features from a single target image, enabling attacks on arbitrary unseen classes across different datasets.

Result: Outperforms existing methods on ImageNet and seven additional datasets for unseen target classes in black-box and cross-domain scenarios.

Conclusion: CD-MTA provides a practical solution for cross-domain targeted attacks without data leakage risks, using only visual target representations.

Abstract: Multi-targeted adversarial attacks aim to mislead classifiers toward specific target classes using a single perturbation generator with a conditional input specifying the desired target class. Existing methods face two key limitations: (1) a single generator supports only a limited number of predefined target classes, and (2) it requires access to the victim model’s training data to learn target class semantics. This dependency raises data leakage concerns in practical black-box scenarios where the training data is typically private. To address these limitations, we propose a novel Cross-Domain Multi-Targeted Attack (CD-MTA) that can generate perturbations toward arbitrary target classes, even those that do not exist in the attacker’s training data. CD-MTA is trained on a single public dataset but can perform targeted attacks on black-box models trained on different datasets with disjoint and unknown class sets. Our method requires only a single example image that visually represents the desired target class, without relying its label, class distribution or pretrained embeddings. We achieve this through a Feature Injection Module (FIM) and class-agnostic objectives which guide the generator to extract transferable, fine-grained features from the target image without inferring class semantics. Experiments on ImageNet and seven additional datasets show that CD-MTA outperforms existing multi-targeted attack methods on unseen target classes in black-box and cross-domain scenarios. The code is available at https://github.com/tgoncalv/CD-MTA.

[542] Vision Transformers with Self-Distilled Registers

Yinjie Chen, Zipeng Yan, Chong Zhou, Bo Dai, Andrew F. Luo

Main category: cs.CV

TL;DR: PH-Reg is an efficient self-distillation method that adds register tokens to existing pre-trained Vision Transformers without full retraining, reducing artifact tokens and improving performance on segmentation and depth prediction tasks.

DetailsMotivation: Existing large-scale pre-trained ViTs suffer from artifact tokens that degrade performance in fine-grained localization tasks, but full retraining is infeasible due to model size.

Method: Self-distillation approach where teacher (frozen pre-trained ViT) and student (same ViT + register tokens) are initialized from same model. Teacher generates denoised embeddings via test-time augmentation, which optimize small subset of student weights.

Result: Effectively reduces artifact tokens and improves segmentation and depth prediction performance under zero-shot and linear probing settings.

Conclusion: PH-Reg provides an efficient way to integrate registers into existing ViTs without full retraining, addressing artifact token issues and enhancing model performance on localization tasks.

Abstract: Vision Transformers (ViTs) have emerged as the dominant architecture for visual processing tasks, demonstrating excellent scalability with increased training data and model size. However, recent work has identified the emergence of artifact tokens in ViTs that are incongruous with local semantics. These anomalous tokens degrade ViT performance in tasks that require fine-grained localization or structural coherence. An effective mitigation of this issue is the addition of register tokens to ViTs, which implicitly “absorb” the artifact term during training.Given the availability of existing large-scale pre-trained ViTs, in this paper we seek add register tokens to existing models without needing to re-train from scratch, which is infeasible considering their size. Specifically, we propose Post Hoc Registers (PH-Reg), an efficient self-distillation method that integrates registers into an existing ViT without requiring additional labeled data and full retraining. PH-Reg initializes both teacher and student networks from the same pre-trained ViT. The teacher remains frozen and unmodified, while the student is augmented with randomly initialized register tokens. By applying test-time augmentation to the teacher’s inputs, we generate denoised dense embeddings free of artifacts, which are then used to optimize only a small subset of unlocked student weights. We show that our approach can effectively reduce the number of artifact tokens, improving the segmentation and depth prediction of the student ViT under zero-shot and linear probing.

[543] Realism Control One-step Diffusion for Real-World Image Super-Resolution

Zongliang Wu, Siming Zheng, Peng-Tao Jiang, Xin Yuan

Main category: cs.CV

TL;DR: RCOD is a one-step diffusion framework for real-world image super-resolution that enables explicit control over fidelity-realism trade-offs through latent domain grouping and degradation-aware sampling.

DetailsMotivation: One-step diffusion methods for super-resolution lack flexible control mechanisms to balance fidelity and realism, which are inherently manageable in multi-step methods through step adjustment.

Method: Proposes Realism Controlled One-step Diffusion (RCOD) with latent domain grouping strategy, degradation-aware sampling, and visual prompt injection module to replace text prompts with degradation-aware visual tokens.

Result: Achieves superior fidelity and perceptual quality while maintaining computational efficiency, outperforming state-of-the-art OSD methods in both quantitative metrics and visual quality.

Conclusion: RCOD provides flexible realism control capabilities during inference while maintaining the efficiency of one-step diffusion methods for real-world image super-resolution.

Abstract: Pre-trained diffusion models have shown great potential in real-world image super-resolution (Real-ISR) tasks by enabling high-resolution reconstructions. While one-step diffusion (OSD) methods significantly improve efficiency compared to traditional multi-step approaches, they still have limitations in balancing fidelity and realism across diverse scenarios. Since the OSDs for SR are usually trained or distilled by a single timestep, they lack flexible control mechanisms to adaptively prioritize these competing objectives, which are inherently manageable in multi-step methods through adjusting sampling steps. To address this challenge, we propose a Realism Controlled One-step Diffusion (RCOD) framework for Real-ISR. RCOD provides a latent domain grouping strategy that enables explicit control over fidelity-realism trade-offs during the noise prediction phase with minimal training paradigm modifications and original training data. A degradation-aware sampling strategy is also introduced to align distillation regularization with the grouping strategy and enhance the controlling of trade-offs. Moreover, a visual prompt injection module is used to replace conventional text prompts with degradation-aware visual tokens, enhancing both restoration accuracy and semantic consistency. Our method achieves superior fidelity and perceptual quality while maintaining computational efficiency. Extensive experiments demonstrate that RCOD outperforms state-of-the-art OSD methods in both quantitative metrics and visual qualities, with flexible realism control capabilities in the inference stage.

[544] SANSA: Unleashing the Hidden Semantics in SAM2 for Few-Shot Segmentation

Claudia Cuttano, Gabriele Trivigno, Giuseppe Averta, Carlo Masone

Main category: cs.CV

TL;DR: SANSA repurposes SAM2 for few-shot segmentation by making its latent semantic structure explicit, achieving state-of-the-art performance with minimal modifications while maintaining speed and flexibility.

DetailsMotivation: SAM2 has strong segmentation capabilities and feature matching but its representations are entangled with task-specific cues optimized for object tracking, which impairs semantic understanding needed for few-shot segmentation.

Method: Proposes SANSA framework that makes SAM2’s latent semantic structure explicit through minimal task-specific modifications, repurposing it for few-shot segmentation while supporting various prompts (points, boxes, scribbles).

Result: Achieves state-of-the-art performance on few-shot segmentation benchmarks for generalization assessment, outperforms generalist methods in in-context setting, remains significantly faster and more compact than prior approaches.

Conclusion: SAM2 already encodes rich semantic structure that can be effectively leveraged for few-shot segmentation with minimal modifications, demonstrating strong performance while maintaining efficiency and flexibility.

Abstract: Few-shot segmentation aims to segment unseen object categories from just a handful of annotated examples. This requires mechanisms that can both identify semantically related objects across images and accurately produce segmentation masks. We note that Segment Anything 2 (SAM2), with its prompt-and-propagate mechanism, offers both strong segmentation capabilities and a built-in feature matching process. However, we show that its representations are entangled with task-specific cues optimized for object tracking, which impairs its use for tasks requiring higher level semantic understanding. Our key insight is that, despite its class-agnostic pretraining, SAM2 already encodes rich semantic structure in its features. We propose SANSA (Semantically AligNed Segment Anything 2), a framework that makes this latent structure explicit, and repurposes SAM2 for few-shot segmentation through minimal task-specific modifications. SANSA achieves state-of-the-art performance on few-shot segmentation benchmarks specifically designed to assess generalization, outperforms generalist methods in the popular in-context setting, supports various prompts flexible interaction via points, boxes, or scribbles, and remains significantly faster and more compact than prior approaches. Code is available at https://github.com/ClaudiaCuttano/SANSA.

[545] Probabilistic Robustness Analysis in High Dimensional Space: Application to Semantic Segmentation Network

Navid Hashemi, Samuel Sasaki, Diego Manzanas Lopez, Lars Lindemann, Ipek Oguz, Meiyi Ma, Taylor T. Johnson

Main category: cs.CV

TL;DR: A scalable probabilistic verification framework using conformal inference with a novel clipping block technique to provide reliable safety guarantees for semantic segmentation networks while reducing conservatism.

DetailsMotivation: Existing probabilistic verification methods fail to scale with modern segmentation tasks and produce overly conservative guarantees of limited practical value for safety-critical applications like medical imaging and autonomous driving.

Method: Architecture-agnostic framework using conformal inference enhanced by a novel clipping block technique to provide provable guarantees while mitigating excessive conservatism.

Result: Experiments on large-scale segmentation models across multiple datasets (CamVid, OCTA-500, Lung Segmentation, Cityscapes) show reliable safety guarantees with substantially reduced conservatism compared to state-of-the-art approaches.

Conclusion: The proposed framework delivers scalable, reliable probabilistic verification for semantic segmentation networks while addressing the conservatism limitations of prior methods, with code available for reproducibility.

Abstract: Semantic segmentation networks (SSNs) are central to safety-critical applications such as medical imaging and autonomous driving, where robustness under uncertainty is essential. However, existing probabilistic verification methods often fail to scale with the complexity and dimensionality of modern segmentation tasks, producing guarantees that are overly conservative and of limited practical value. We propose a probabilistic verification framework that is architecture-agnostic and scalable to high-dimensional input-output spaces. Our approach employs conformal inference (CI), enhanced by a novel technique that we call the \textbf{clipping block}, to provide provable guarantees while mitigating the excessive conservatism of prior methods. Experiments on large-scale segmentation models across CamVid, OCTA-500, Lung Segmentation, and Cityscapes demonstrate that our framework delivers reliable safety guarantees while substantially reducing conservatism compared to state-of-the-art approaches on segmentation tasks. We also provide a public GitHub repository (https://github.com/Navidhashemicodes/SSN_Reach_CLP_Surrogate) for this approach, to support reproducibility.

[546] Task-Driven Implicit Representations for Automated Design of LiDAR Systems

Nikhil Behari, Aaron Young, Tzofi Klinghoffer, Akshat Dave, Ramesh Raskar

Main category: cs.CV

TL;DR: Automated framework for task-driven LiDAR system design using continuous 6D design space and flow-based generative modeling

DetailsMotivation: LiDAR design is complex, time-consuming, and manual, requiring unique spatial and temporal sampling considerations

Method: Represent LiDAR configurations in 6D design space, learn task-specific implicit densities via flow-based generative modeling, and synthesize systems using parametric distributions fitted via expectation-maximization

Result: Validated on diverse 3D vision tasks including face scanning, robotic tracking, and object detection

Conclusion: Enables automated, constraint-aware LiDAR system design for real-world applications

Abstract: Imaging system design is a complex, time-consuming, and largely manual process; LiDAR design, ubiquitous in mobile devices, autonomous vehicles, and aerial imaging platforms, adds further complexity through unique spatial and temporal sampling requirements. In this work, we propose a framework for automated, task-driven LiDAR system design under arbitrary constraints. To achieve this, we represent LiDAR configurations in a continuous six-dimensional design space and learn task-specific implicit densities in this space via flow-based generative modeling. We then synthesize new LiDAR systems by modeling sensors as parametric distributions in 6D space and fitting these distributions to our learned implicit density using expectation-maximization, enabling efficient, constraint-aware LiDAR system design. We validate our method on diverse tasks in 3D vision, enabling automated LiDAR system design across real-world-inspired applications in face scanning, robotic tracking, and object detection.

[547] ZPressor: Bottleneck-Aware Compression for Scalable Feed-Forward 3DGS

Weijie Wang, Donny Y. Chen, Zeyu Zhang, Duochao Shi, Akide Liu, Bohan Zhuang

Main category: cs.CV

TL;DR: ZPressor is a lightweight module that enables feed-forward 3D Gaussian Splatting models to scale to over 100 input views by compressing multi-view inputs into compact latent states, improving performance and robustness.

DetailsMotivation: Feed-forward 3DGS models face scalability constraints with limited model capacity, leading to degraded performance or excessive memory consumption as input views increase.

Method: ZPressor partitions views into anchor and support sets, using cross attention to compress information from support views into anchor views, forming a compact latent state Z that retains essential scene information while discarding redundancy.

Result: Enables scaling to over 100 input views at 480P resolution on 80GB GPU, consistently improves performance under moderate input views and enhances robustness under dense view settings on DL3DV-10K and RealEstate10K benchmarks.

Conclusion: ZPressor provides an effective solution for scaling feed-forward 3DGS models to handle large numbers of input views while maintaining performance and efficiency.

Abstract: Feed-forward 3D Gaussian Splatting (3DGS) models have recently emerged as a promising solution for novel view synthesis, enabling one-pass inference without the need for per-scene 3DGS optimization. However, their scalability is fundamentally constrained by the limited capacity of their models, leading to degraded performance or excessive memory consumption as the number of input views increases. In this work, we analyze feed-forward 3DGS frameworks through the lens of the Information Bottleneck principle and introduce ZPressor, a lightweight architecture-agnostic module that enables efficient compression of multi-view inputs into a compact latent state $Z$ that retains essential scene information while discarding redundancy. Concretely, ZPressor enables existing feed-forward 3DGS models to scale to over 100 input views at 480P resolution on an 80GB GPU, by partitioning the views into anchor and support sets and using cross attention to compress the information from the support views into anchor views, forming the compressed latent state $Z$. We show that integrating ZPressor into several state-of-the-art feed-forward 3DGS models consistently improves performance under moderate input views and enhances robustness under dense view settings on two large-scale benchmarks DL3DV-10K and RealEstate10K. The video results, code and trained models are available on our project page: https://lhmd.top/zpressor.

[548] Exploring Efficient Open-Vocabulary Segmentation in the Remote Sensing

Bingyu Li, Haocheng Dong, Da Zhang, Zhiyuan Zhao, Junyu Gao, Xuelong Li

Main category: cs.CV

TL;DR: RSKT-Seg is a novel open-vocabulary segmentation framework for remote sensing that addresses domain gaps through multi-directional cost map aggregation, efficient fusion transformers, and knowledge transfer, achieving significant performance improvements over existing methods.

DetailsMotivation: Open-Vocabulary Remote Sensing Image Segmentation (OVRSIS) is underexplored due to lack of standardized benchmarks and domain gaps between natural and remote sensing images. Existing OVS models perform poorly when directly applied to remote sensing scenarios.

Method: Proposed RSKT-Seg framework with three key components: 1) Multi-Directional Cost Map Aggregation (RS-CMA) for rotation-invariant visual cues, 2) Efficient Cost Map Fusion (RS-Fusion) transformer for spatial-semantic modeling, and 3) Remote Sensing Knowledge Transfer (RS-Transfer) for domain adaptation.

Result: Extensive experiments show RSKT-Seg outperforms strong OVS baselines by +3.8 mIoU and +5.9 mACC, while achieving 2x faster inference through efficient aggregation.

Conclusion: RSKT-Seg effectively bridges the domain gap in open-vocabulary remote sensing segmentation and establishes a new state-of-the-art performance with improved efficiency.

Abstract: Open-Vocabulary Remote Sensing Image Segmentation (OVRSIS), an emerging task that adapts Open-Vocabulary Segmentation (OVS) to the remote sensing (RS) domain, remains underexplored due to the absence of a unified evaluation benchmark and the domain gap between natural and RS images. To bridge these gaps, we first establish a standardized OVRSIS benchmark (\textbf{OVRSISBench}) based on widely-used RS segmentation datasets, enabling consistent evaluation across methods. Using this benchmark, we comprehensively evaluate several representative OVS/OVRSIS models and reveal their limitations when directly applied to remote sensing scenarios. Building on these insights, we propose \textbf{RSKT-Seg}, a novel open-vocabulary segmentation framework tailored for remote sensing. RSKT-Seg integrates three key components: (1) a Multi-Directional Cost Map Aggregation (RS-CMA) module that captures rotation-invariant visual cues by computing vision-language cosine similarities across multiple directions; (2) an Efficient Cost Map Fusion (RS-Fusion) transformer, which jointly models spatial and semantic dependencies with a lightweight dimensionality reduction strategy; and (3) a Remote Sensing Knowledge Transfer (RS-Transfer) module that injects pre-trained knowledge and facilitates domain adaptation via enhanced upsampling. Extensive experiments on the benchmark show that RSKT-Seg consistently outperforms strong OVS baselines by +3.8 mIoU and +5.9 mACC, while achieving 2x faster inference through efficient aggregation. Our code is \href{https://github.com/LiBingyu01/RSKT-Seg}{\textcolor{blue}{here}}.

[549] Video Signature: Implicit Watermarking for Video Diffusion Models

Yu Huang, Junhao Chen, Shuliang Liu, Hanqian Li, Jungang Li, Qi Zheng, Aiwei Liu, Yi R. Fung, Xuming Hu

Main category: cs.CV

TL;DR: VidSig is an implicit watermarking method for video diffusion models that embeds watermarks during generation with minimal latency, achieving high extraction accuracy while maintaining video quality and temporal consistency.

DetailsMotivation: Address the limitations of existing video watermarking methods - post-generation approaches struggle with quality/extraction trade-offs, while in-generation methods using Gaussian noise embedding incur high computational costs.

Method: Partially fine-tunes the latent decoder with Perturbation-Aware Suppression (PAS) to freeze sensitive layers, and adds a Temporal Alignment module for frame coherence. Enables adaptive watermark integration during generation.

Result: Achieves best trade-off among watermark extraction accuracy, video quality, and latency. Shows strong robustness against spatial/temporal tampering, and stability across different video lengths and resolutions.

Conclusion: VidSig provides a practical solution for AIGC video protection with imperceptible watermarking, minimal computational overhead, and reliable tracing capabilities suitable for real-world applications.

Abstract: The rapid development of Artificial Intelligence Generated Content (AIGC) has led to significant progress in video generation, but also raises serious concerns about intellectual property protection and reliable content tracing. Watermarking is a widely adopted solution to this issue, yet existing methods for video generation mainly follow a post-generation paradigm, which often fails to effectively balance the trade-off between video quality and watermark extraction. Meanwhile, current in-generation methods that embed the watermark into the initial Gaussian noise usually incur substantial additional computation. To address these issues, we propose \textbf{Video Signature} (\textsc{VidSig}), an implicit watermarking method for video diffusion models that enables imperceptible and adaptive watermark integration during video generation with almost no extra latency. Specifically, we partially fine-tune the latent decoder, where \textbf{Perturbation-Aware Suppression} (PAS) pre-identifies and freezes perceptually sensitive layers to preserve visual quality. Beyond spatial fidelity, we further enhance temporal consistency by introducing a lightweight \textbf{Temporal Alignment} module that guides the decoder to generate coherent frame sequences during fine-tuning. Experimental results show that \textsc{VidSig} achieves the best trade-off among watermark extraction accuracy, video quality, and watermark latency. It also demonstrates strong robustness against both spatial and temporal tamper, and remains stable across different video lengths and resolutions, highlighting its practicality in real-world scenarios.

[550] Generative Perception of Shape and Material from Differential Motion

Xinran Nicole Han, Ko Nishino, Todd Zickler

Main category: cs.CV

TL;DR: A conditional denoising-diffusion model that generates shape-and-material maps from short object motion videos, capturing visual ambiguities and improving with motion.

DetailsMotivation: Humans resolve shape-material ambiguities by moving their head or rotating objects, inspiring the use of differential motion videos to improve visual perception beyond single-view limitations.

Method: Parameter-efficient denoising-diffusion architecture trained directly in pixel-space on synthetic object-motion videos with shape and material supervision, generating multiple disentangled attributes simultaneously.

Result: For static observations, produces diverse multimodal predictions capturing ambiguities; with object motion, distributions converge to more accurate explanations; achieves high-quality estimates for real-world objects.

Conclusion: Moving beyond single-view to continuous motion observations with generative perception effectively captures visual ambiguities and improves visual reasoning in physically-embodied systems.

Abstract: Perceiving the shape and material of an object from a single image is inherently ambiguous, especially when lighting is unknown and unconstrained. Despite this, humans can often disentangle shape and material, and when they are uncertain, they often move their head slightly or rotate the object to help resolve the ambiguities. Inspired by this behavior, we introduce a novel conditional denoising-diffusion model that generates samples of shape-and-material maps from a short video of an object undergoing differential motions. Our parameter-efficient architecture allows training directly in pixel-space, and it generates many disentangled attributes of an object simultaneously. Trained on a modest number of synthetic object-motion videos with supervision on shape and material, the model exhibits compelling emergent behavior: For static observations, it produces diverse, multimodal predictions of plausible shape-and-material maps that capture the inherent ambiguities; and when objects move, the distributions converge to more accurate explanations. The model also produces high-quality shape-and-material estimates for less ambiguous, real-world objects. By moving beyond single-view to continuous motion observations, and by using generative perception to capture visual ambiguities, our work suggests ways to improve visual reasoning in physically-embodied systems.

[551] Enhancing Monocular Height Estimation via Weak Supervision from Imperfect Labels

Sining Chen, Yilei Shi, Xiao Xiang Zhu

Main category: cs.CV

TL;DR: Proposes an ensemble-based pipeline for monocular height estimation that leverages imperfect out-of-domain labels through weak supervision, achieving significant improvements in cross-domain performance.

DetailsMotivation: Training deep neural networks for monocular height estimation requires abundant annotated data, but high-quality labels are scarce and typically limited to developed regions, which constrains model generalization and large-scale applicability.

Method: Introduces an ensemble-based pipeline compatible with any monocular height estimation network, featuring architecture and loss functions designed for weak supervision using balanced soft losses and ordinal constraints to leverage information from noisy labels.

Result: Experiments on DFC23 (0.5-1 m) and GBH (3 m) datasets show the method achieves more consistent cross-domain performance, reducing average RMSE by up to 22.94% on DFC23 and 18.62% on GBH compared to baselines.

Conclusion: The proposed approach effectively addresses the challenge of limited high-quality annotations by leveraging imperfect out-of-domain labels, enabling improved cross-domain generalization for monocular height estimation tasks.

Abstract: Monocular height estimation provides an efficient and cost-effective solution for three-dimensional perception in remote sensing. However, training deep neural networks for this task demands abundant annotated data, while high-quality labels are scarce and typically available only in developed regions, which limits model generalization and constrains their applicability at large scales. This work addresses the problem by leveraging imperfect labels from out-of-domain regions to train pixel-wise height estimation networks, which may be incomplete, inexact, or inaccurate compared to high-quality annotations. We introduce an ensemble-based pipeline compatible with any monocular height estimation network, featuring architecture and loss functions specifically designed to leverage information in noisy labels through weak supervision, utilizing balanced soft losses and ordinal constraints. Experiments on two datasets – DFC23 (0.5–1 m) and GBH (3 m) – show that our method achieves more consistent cross-domain performance, reducing average RMSE by up to 22.94% on DFC23 and 18.62% on GBH compared with baselines. Ablation studies confirm the contribution of each design component.

[552] SRD: Reinforcement-Learned Semantic Perturbation for Backdoor Defense in VLMs

Shuhan Xu, Siyuan Liang, Hongling Zheng, Aishan Liu, Xinbiao Wang, Yong Luo, Fu Lin, Leszek Rutkowski, Dacheng Tao

Main category: cs.CV

TL;DR: SRD is a reinforcement learning defense framework that mitigates backdoor attacks in VLMs by applying discrete perturbations to confuse attention and disrupt malicious path activation, using semantic fidelity scores for policy optimization.

DetailsMotivation: VLMs are vulnerable to stealthy backdoor attacks during inference that trigger malicious captions, which are hard to detect due to cross-modal trigger propagation and lack of prior knowledge about attack patterns.

Method: Proposes Semantic Reward Defense (SRD) using deep Q-network policy to apply discrete perturbations to sensitive image regions, guided by semantic fidelity scores that assess caption consistency and fluency.

Result: Effectively mitigates TrojVLM and Shadowcast backdoor attacks, reducing ASR to 3.6% and 5.6% respectively with less than 15% average CIDEr drop on clean inputs.

Conclusion: SRD provides a trigger-agnostic, interpretable defense paradigm that successfully protects VLMs from backdoor attacks without requiring prior knowledge of attack patterns.

Abstract: Visual language models (VLMs) have made significant progress in image captioning tasks, yet recent studies have found they are vulnerable to backdoor attacks. Attackers can inject undetectable perturbations into the data during inference, triggering abnormal behavior and generating malicious captions. These attacks are particularly challenging to detect and defend against due to the stealthiness and cross-modal propagation of the trigger signals. In this paper, we identify two key vulnerabilities by analyzing existing attack patterns: (1) the model exhibits abnormal attention concentration on certain regions of the input image, and (2) backdoor attacks often induce semantic drift and sentence incoherence. Based on these insights, we propose Semantic Reward Defense (SRD), a reinforcement learning framework that mitigates backdoor behavior without requiring any prior knowledge of trigger patterns. SRD learns to apply discrete perturbations to sensitive contextual regions of image inputs via a deep Q-network policy, aiming to confuse attention and disrupt the activation of malicious paths. To guide policy optimization, we design a reward signal named semantic fidelity score, which jointly assesses the semantic consistency and linguistic fluency of the generated captions, encouraging the agent to achieve a robust yet faithful output. SRD offers a trigger-agnostic, policy-interpretable defense paradigm that effectively mitigates local (TrojVLM) and global (Shadowcast) backdoor attacks, reducing ASR to 3.6% and 5.6% respectively, with less than 15% average CIDEr drop on the clean inputs. Our codes can be found at https://github.com/Ciconey/SRD.git.

[553] Does Bigger Mean Better? Comparitive Analysis of CNNs and Biomedical Vision Language Modles in Medical Diagnosis

Ran Tong, Jiaqi Liu, Tong Wang, Xin Hu, Su Liu, Lanruo Wang, Jiexi Xu

Main category: cs.CV

TL;DR: Zero-shot medical VLM (BiomedCLIP) can match or outperform supervised CNNs in chest X-ray diagnosis after simple decision threshold calibration, achieving superior pneumonia detection and near-equal tuberculosis detection performance.

DetailsMotivation: To compare automated chest radiograph interpretation methods and determine if zero-shot vision-language models can compete with supervised CNNs in medical diagnosis tasks.

Method: Comparative analysis between supervised lightweight CNN and zero-shot BiomedCLIP VLM on pneumonia detection (PneumoniaMNIST) and tuberculosis detection (Shenzhen TB dataset), with decision threshold calibration applied to the VLM.

Result: After calibration, BiomedCLIP achieved F1-score of 0.8841 for pneumonia (surpassing CNN’s 0.8803) and 0.7684 for tuberculosis (close to CNN’s 0.7834), significantly improving from uncalibrated performance of 0.4812.

Conclusion: Proper calibration is essential for unlocking the full diagnostic potential of zero-shot VLMs, enabling them to compete with or outperform task-specific supervised models in medical imaging.

Abstract: The accurate interpretation of chest radiographs using automated methods is a critical task in medical imaging. This paper presents a comparative analysis between a supervised lightweight Convolutional Neural Network (CNN) and a state-of-the-art, zero-shot medical Vision-Language Model (VLM), BiomedCLIP, across two distinct diagnostic tasks: pneumonia detection on the PneumoniaMNIST benchmark and tuberculosis detection on the Shenzhen TB dataset. Our experiments show that supervised CNNs serve as highly competitive baselines in both cases. While the default zero-shot performance of the VLM is lower, we demonstrate that its potential can be unlocked via a simple yet crucial remedy: decision threshold calibration. By optimizing the classification threshold on a validation set, the performance of BiomedCLIP is significantly boosted across both datasets. For pneumonia detection, calibration enables the zero-shot VLM to achieve a superior F1-score of 0.8841, surpassing the supervised CNN’s 0.8803. For tuberculosis detection, calibration dramatically improves the F1-score from 0.4812 to 0.7684, bringing it close to the supervised baseline’s 0.7834. This work highlights a key insight: proper calibration is essential for leveraging the full diagnostic power of zero-shot VLMs, enabling them to match or even outperform efficient, task-specific supervised models.

[554] APVR: Hour-Level Long Video Understanding with Adaptive Pivot Visual Information Retrieval

Hong Gao, Yiming Bao, Xuezhen Tu, Bin Zhong, Linan Yue, Minling Zhang

Main category: cs.CV

TL;DR: APVR is a training-free framework that hierarchically retrieves visual information from long videos using pivot frame retrieval and pivot token retrieval to overcome memory limitations in multimodal large language models.

DetailsMotivation: Current MLLMs struggle with hour-level video understanding due to memory constraints and information overload. Existing training-free approaches compress visual features but lose important information, limiting performance.

Method: APVR uses two complementary components: Pivot Frame Retrieval with query expansion and iterative spatio-semantic confidence scoring to identify relevant frames, and Pivot Token Retrieval that performs query-aware attention-driven token selection within up to 1024 pivot frames.

Result: APVR achieves significant performance improvements of up to 9.5%, 4.6% and 9.7% on LongVideoBench, VideoMME and MLVU benchmarks respectively, and achieves state-of-the-art results for both training-free and training-based approaches.

Conclusion: APVR effectively addresses the memory wall limitation in long video understanding through hierarchical visual information retrieval, enabling processing of hour-long videos while maintaining semantic fidelity without requiring additional training.

Abstract: Current multimodal large language models (MLLMs) struggle with hour-level video understanding, facing significant challenges not only in modeling the substantial information volume of long videos but also in overcoming the memory wall and resource constraints during both training and inference. Although recent training-free approaches have alleviated resource demands by compressing visual features, their reliance on incomplete visual information limits the performance potential. To address these limitations, we propose Adaptive Pivot Visual information Retrieval (APVR), a training-free framework that hierarchically retrieves and retains sufficient and important visual information. It breakthroughs the memory wall limitation via two complementary components: Pivot Frame Retrieval employs query expansion and iterative spatio-semantic confidence scoring to identify relevant video frames, and Pivot Token Retrieval performs query-aware attention-driven token selection within up to 1024 pivot frames. This dual granularity approach enables the processing of hour-long videos while maintaining semantic fidelity. Experimental validations on three different baseline MLLMs demonstrate significant performance improvements up to 9.5%, 4.6% and 9.7% on LongVideoBench, VideoMME and MLVU, respectively. APVR achieves state-of-the-art results for both training-free and training-based approaches.

[555] Hierarchical Generalized Category Discovery for Brain Tumor Classification in Digital Pathology

Matthias Perkonigg, Patrick Rockenschaub, Georg Göbel, Adelheid Wöhrer

Main category: cs.CV

TL;DR: HGCD-BT is a novel hierarchical generalized category discovery method for brain tumor classification that identifies both known and unknown tumor types using hierarchical clustering with contrastive learning.

DetailsMotivation: Existing brain tumor classification methods are limited to predefined classes and cannot identify new tumor types, while unsupervised learning lacks labeled data integration and semi-supervised methods assume all classes are represented in labeled data.

Method: Integrates hierarchical clustering with contrastive learning, extending contrastive learning-based GCD with a novel semi-supervised hierarchical clustering loss to capture hierarchical tumor taxonomy structures.

Result: Achieves +28% improvement in accuracy over state-of-the-art GCD methods for patch-level classification on OpenSRH dataset, particularly in identifying unseen tumor categories, and demonstrates generalizability on slide-level classification across imaging modalities.

Conclusion: HGCD-BT effectively bridges the gap between supervised and unsupervised learning for brain tumor classification, enabling discovery of both known and unknown tumor types while capturing hierarchical relationships in tumor taxonomies.

Abstract: Accurate brain tumor classification is critical for intra-operative decision making in neuro-oncological surgery. However, existing approaches are restricted to a fixed set of predefined classes and are therefore unable to capture patterns of tumor types not available during training. Unsupervised learning can extract general-purpose features, but it lacks the ability to incorporate prior knowledge from labelled data, and semi-supervised methods often assume that all potential classes are represented in the labelled data. Generalized Category Discovery (GCD) aims to bridge this gap by categorizing both known and unknown classes within unlabelled data. To reflect the hierarchical structure of brain tumor taxonomies, in this work, we introduce Hierarchical Generalized Category Discovery for Brain Tumor Classification (HGCD-BT), a novel approach that integrates hierarchical clustering with contrastive learning. Our method extends contrastive learning based GCD by incorporating a novel semi-supervised hierarchical clustering loss. We evaluate HGCD-BT on OpenSRH, a dataset of stimulated Raman histology brain tumor images, achieving a +28% improvement in accuracy over state-of-the-art GCD methods for patch-level classification, particularly in identifying previously unseen tumor categories. Furthermore, we demonstrate the generalizability of HGCD-BT on slide-level classification of hematoxylin and eosin stained whole-slide images from the Digital Brain Tumor Atlas, confirming its utility across imaging modalities.

[556] MR-COSMO: Visual-Text Memory Recall and Direct CrOSs-MOdal Alignment Method for Query-Driven 3D Segmentation

Chade Li, Pengju Zhang, Yihong Wu

Main category: cs.CV

TL;DR: MR-COSMO is a novel method for text-query-guided 3D point cloud segmentation that addresses inadequate 3D-text alignment through direct cross-modal alignment and visual-text memory recall.

DetailsMotivation: Existing vision-language models underperform in point-level 3D segmentation due to poor 3D-text alignment that limits local feature-text context linking.

Method: Proposes MR-COSMO with two key components: (1) direct cross-modal alignment module for explicit 3D point cloud-text/2D image alignment, and (2) visual-text memory module with specialized feature banks for dynamic knowledge recall via attention mechanisms.

Result: Achieves state-of-the-art performance across 3D instruction, reference, and semantic segmentation benchmarks.

Conclusion: The proposed direct alignment and memory recall approach effectively bridges the 3D-text gap and enables precise fusion of geometric and semantic features for improved query-driven 3D segmentation.

Abstract: The rapid advancement of vision-language models (VLMs) in 3D domains has accelerated research in text-query-guided point cloud processing, though existing methods underperform in point-level segmentation due to inadequate 3D-text alignment that limits local feature-text context linking. To address this limitation, we propose MR-COSMO, a Visual-Text Memory Recall and Direct CrOSs-MOdal Alignment Method for Query-Driven 3D Segmentation, establishing explicit alignment between 3D point clouds and text/2D image data through a dedicated direct cross-modal alignment module while implementing a visual-text memory module with specialized feature banks. This direct alignment mechanism enables precise fusion of geometric and semantic features, while the memory module employs specialized banks storing text features, visual features, and their correspondence mappings to dynamically enhance scene-specific representations via attention-based knowledge recall. Comprehensive experiments across 3D instruction, reference, and semantic segmentation benchmarks confirm state-of-the-art performance.

[557] Efficient SAR Vessel Detection for FPGA-Based On-Satellite Sensing

Colin Laganier, Liam Fletcher, Elim Kwan, Richard Walters, Victoria Nockles

Main category: cs.CV

TL;DR: Developed a novel YOLOv8 architecture optimized for SAR vessel detection on low-power FPGAs, achieving near state-of-the-art performance with 50-2500x better computational efficiency for on-satellite deployment.

DetailsMotivation: Enable rapid satellite imagery analysis by overcoming ground station latency through on-satellite ML, specifically for time-critical SAR vessel detection in maritime security applications.

Method: Systematic exploration of architectural adaptations to create a novel YOLOv8 architecture optimized for FPGA-based processing on Kria KV260 MPSoC hardware.

Result: Model analyzes ~700 megapixel SAR images in <1 minute using <10W power, with detection and classification performance only 2-3% lower than GPU-based models while being 50-2500x more computationally efficient.

Conclusion: This work enables on-satellite ML for time-critical SAR analysis and contributes to more autonomous, scalable satellite systems.

Abstract: Rapid analysis of satellite imagery within minutes-to-hours of acquisition is increasingly vital for many remote sensing applications, and is an essential component for developing next-generation autonomous and distributed satellite systems. On-satellite machine learning (ML) has the potential for such rapid analysis, by overcoming latency associated with intermittent satellite connectivity to ground stations or relay satellites, but state-of-the-art models are often too large or power-hungry for on-board deployment. Vessel detection using Synthetic Aperture Radar (SAR) is a critical time-sensitive application in maritime security that exemplifies this challenge. SAR vessel detection has previously been demonstrated only by ML models that either are too large for satellite deployment, have not been developed for sufficiently low-power hardware, or have only been tested on small SAR datasets that do not sufficiently represent the difficulty of the real-world task. Here we systematically explore a suite of architectural adaptations to develop a novel YOLOv8 architecture optimized for this task and FPGA-based processing. We deploy our model on a Kria KV260 MPSoC, and show it can analyze a ~700 megapixel SAR image in less than a minute, within common satellite power constraints (<10W). Our model has detection and classification performance only ~2% and 3% lower than values from state-of-the-art GPU-based models on the largest and most diverse open SAR vessel dataset, xView3-SAR, despite being ~50 and ~2500 times more computationally efficient. This work represents a key contribution towards on-satellite ML for time-critical SAR analysis, and more autonomous, scalable satellites.

[558] Reasoning under Vision: Understanding Visual-Spatial Cognition in Vision-Language Models for CAPTCHA

Python Song, Luke Tenyi Chang, Yun-Yun Tsai, Penghui Li, Junfeng Yang

Main category: cs.CV

TL;DR: CAPTCHA-X is introduced as the first real-world CAPTCHA benchmark with reasoning annotations, showing that step-by-step reasoning significantly improves solving accuracy from 21.9% to 83.9% for vision-language models.

DetailsMotivation: Current commercial vision-language models struggle with CAPTCHAs as high-difficulty spatial reasoning tasks, highlighting the need to study and improve their reasoning capabilities.

Method: Introduces CAPTCHA-X benchmark with 7 CAPTCHA categories and reasoning annotations, plus a VLM-based framework that incorporates step-by-step reasoning before generating final coordinates.

Result: The proposed method achieves 83.9% average solving accuracy across five high-difficulty CAPTCHA types, significantly surpassing the baseline of 21.9% from commercial models.

Conclusion: Step-by-step reasoning is crucial for solving spatial reasoning tasks like CAPTCHAs, revealing limitations in current models and highlighting the importance of reasoning for advancing visual-spatial challenges.

Abstract: CAPTCHA, originally designed to distinguish humans from robots, has evolved into a real-world benchmark for assessing the spatial reasoning capabilities of vision-language models. In this work, we first show that step-by-step reasoning is crucial for vision-language models (VLMs) to solve CAPTCHAs, which represent high-difficulty spatial reasoning tasks, and that current commercial vision-language models still struggle with such reasoning. In particular, we observe that most commercial VLMs (e.g., Gemini, Claude, GPT, etc.) fail to effectively solve CAPTCHAs and thus achieve low accuracy (around 21.9 percent). However, our findings indicate that requiring the model to perform step-by-step reasoning before generating the final coordinates can significantly enhance its solving accuracy, underscoring the severity of the gap. To systematically study this issue, we introduce CAPTCHA-X, the first real-world CAPTCHA benchmark with reasoning, covering seven categories of CAPTCHAs (such as Gobang, hCaptcha, etc.) with step-by-step action solutions and grounding annotations. We further define five reasoning-oriented metrics that enable a comprehensive evaluation of models reasoning capabilities. To validate the effectiveness of reasoning, we also propose a general agentic VLM-based framework that incorporates the models inherent reasoning abilities. Our method achieves state-of-the-art performance across five high-difficulty CAPTCHA types, with an average solving accuracy of 83.9 percent, substantially surpassing existing baselines. These results reveal the limitations of current models and highlight the importance of reasoning in advancing visual-spatial challenges in the future.

[559] Generalizable 7T T1-map Synthesis from 1.5T and 3T T1 MRI with an Efficient Transformer Model

Zach Eidex, Mojtaba Safari, Tonghe Wang, Vanessa Wildman, David S. Yu, Hui Mao, Erik Middlebrooks, Aparna Kesarwala, Xiaofeng Yang

Main category: cs.CV

TL;DR: 7T-Restormer synthesizes 7T-quality T1 maps from 1.5T/3T T1-weighted MRI using transformer architecture, outperforming existing methods with fewer parameters.

DetailsMotivation: 7T MRI offers superior resolution but is expensive and scarce with susceptibility artifacts. Need accessible alternative for clinical workflows.

Method: Transformer-based model trained on 35 1.5T and 108 3T T1w MRI paired with 7T T1 maps from MS patients, using mixed field strength training strategy.

Result: Achieved PSNR 26.0 dB, SSIM 0.861, NMSE 0.019 for 1.5T; 64% NMSE reduction vs ResShift, 41% vs ResViT with only 10.5M parameters. Mixed training outperformed single-field strategies.

Conclusion: Novel method successfully predicts 7T MP2RAGE maps from standard clinical scanners, making 7T benefits more accessible to routine clinical practice.

Abstract: Purpose: Ultra-high-field 7T MRI offers improved resolution and contrast over standard clinical field strengths (1.5T, 3T). However, 7T scanners are costly, scarce, and introduce additional challenges such as susceptibility artifacts. We propose an efficient transformer-based model (7T-Restormer) to synthesize 7T-quality T1-maps from routine 1.5T or 3T T1-weighted (T1W) images. Methods: Our model was validated on 35 1.5T and 108 3T T1w MRI paired with corresponding 7T T1 maps of patients with confirmed MS. A total of 141 patient cases (32,128 slices) were randomly divided into 105 (25; 80) training cases (19,204 slices), 19 (5; 14) validation cases (3,476 slices), and 17 (5; 14) test cases (3,145 slices) where (X; Y) denotes the patients with 1.5T and 3T T1W scans, respectively. The synthetic 7T T1 maps were compared against the ResViT and ResShift models. Results: The 7T-Restormer model achieved a PSNR of 26.0 +/- 4.6 dB, SSIM of 0.861 +/- 0.072, and NMSE of 0.019 +/- 0.011 for 1.5T inputs, and 25.9 +/- 4.9 dB, and 0.866 +/- 0.077 for 3T inputs, respectively. Using 10.5 M parameters, our model reduced NMSE by 64 % relative to 56.7M parameter ResShift (0.019 vs 0.052, p = <.001 and by 41 % relative to 70.4M parameter ResViT (0.019 vs 0.032, p = <.001) at 1.5T, with similar advantages at 3T (0.021 vs 0.060 and 0.033; p < .001). Training with a mixed 1.5 T + 3 T corpus was superior to single-field strategies. Restricting the model to 1.5T increased the 1.5T NMSE from 0.019 to 0.021 (p = 1.1E-3) while training solely on 3T resulted in lower performance on input 1.5T T1W MRI. Conclusion: We propose a novel method for predicting quantitative 7T MP2RAGE maps from 1.5T and 3T T1W scans with higher quality than existing state-of-the-art methods. Our approach makes the benefits of 7T MRI more accessible to standard clinical workflows.

[560] Implicit-Knowledge Visual Question Answering with Structured Reasoning Traces

Zhihao Wen, Wenkang Wei, Yuan Fang, Xingtong Yu, Hui Zhang, Weicheng Zhu, Xin Zhang

Main category: cs.CV

TL;DR: MODELNAME improves implicit-knowledge VQA by using dual-path structured reasoning traces instead of answer-only supervision, achieving better accuracy and transparency without external knowledge sources.

DetailsMotivation: Existing IK-KVQA approaches suffer from weak reasoning, inconsistent justifications, and brittle generalization due to answer-only supervision during training.

Method: Uses dual-path structured reasoning traces (symbolic relation paths over text/vision with natural-language explanations) to provide stronger inductive bias, builds trace-enriched dataset via structure-aware self-distillation with a single MLLM.

Result: Achieves up to 11.3% higher answer accuracy on OK-VQA over strongest baseline, with improved transparency of intermediate reasoning.

Conclusion: Structured reasoning traces provide effective modality-aware scaffolds for IK-KVQA, enabling better accuracy and interpretability without external knowledge resources.

Abstract: Knowledge-based Visual Question Answering (KVQA) requires models to ground entities in images and reason over factual knowledge. Recent work has introduced its implicit-knowledge variant, IK-KVQA, where a multimodal large language model (MLLM) is the sole knowledge source and answers are produced without external retrieval. Existing IK-KVQA approaches, however, are typically trained with answer-only supervision: reasoning remains implicit, justifications are often weak or inconsistent, and generalization after standard supervised fine-tuning (SFT) can be brittle. We propose MODELNAME, a framework that equips IK-KVQA with dual-path structured reasoning traces (symbolic relation paths over text and vision together with path-grounded natural-language explanations) to provide a stronger inductive bias than generic answer-only supervision. These traces act as modality-aware scaffolds that guide the model toward relevant entities and attributes, offering more structure than generic chain-of-thought supervision while not constraining reasoning to any single fixed path. Using a single open-source MLLM, MODELNAME constructs and selects traces to build an offline trace-enriched dataset and then performs structure-aware self-distillation; no external retrievers, verifiers, or curated knowledge bases are used, and inference is a single autoregressive pass. Across benchmarks, MODELNAME consistently improves both answer accuracy and the transparency of intermediate reasoning, achieving up to 11.3% higher answer accuracy on OK-VQA over the strongest baseline.

[561] ThinkingViT: Matryoshka Thinking Vision Transformer for Elastic Inference

Ali Hojjat, Janek Haberer, Soren Pirk, Olaf Landsiedel

Main category: cs.CV

TL;DR: ThinkingViT is a nested Vision Transformer architecture that dynamically adjusts computation based on input difficulty using progressive thinking stages and early termination when confidence thresholds are met.

DetailsMotivation: Current nested Transformer architectures allocate the same compute to all inputs regardless of complexity, leading to inefficiencies. ThinkingViT addresses this by enabling adaptive computation based on input difficulty.

Method: ThinkingViT uses progressive thinking stages that start with a small subset of attention heads, then iteratively activates larger subsets if confidence thresholds aren’t met. It employs Token Recycling to fuse previous embeddings with current inputs for better subsequent rounds.

Result: ThinkingViT outperforms nested baselines by up to 2.0 p.p. in accuracy at same throughput and up to 2.9 p.p. at equal GMACs on ImageNet-1K. It also transfers effectively to other architectures like Swin and serves as plug-in upgrade for downstream tasks.

Conclusion: ThinkingViT provides an efficient backbone-preserving design for adaptive computation in Vision Transformers, enabling scalable deployment across heterogeneous hardware while maintaining performance across various tasks and architectures.

Abstract: ViTs deliver SOTA performance, yet their fixed computational budget prevents scalable deployment across heterogeneous hardware. Recent Matryoshka-style Transformer architectures mitigate this by embedding nested subnetworks within a single model to enable scalable inference. However, these models allocate the same amount of compute to all inputs, regardless of their complexity, which leads to inefficiencies. To address this, we introduce ThinkingViT, a nested ViT architecture that employs progressive thinking stages to dynamically adjust inference computation based on input difficulty. ThinkingViT first activates a small subset of the most important attention heads to produce an initial prediction. If the prediction confidence exceeds a predefined threshold, inference terminates early. Otherwise, within the same backbone, it activates a larger subset of attention heads and conducts a new forward pass. This process continues iteratively until the model reaches the predefined confidence level or exhausts its maximum capacity. To boost the performance of subsequent rounds, we introduce a Token Recycling approach that fuses the input embeddings with the embeddings from the previous stage. Experiments show that ThinkingViT surpasses nested baselines by up to 2.0 percentage points (p.p.) in accuracy at the same throughput and by up to 2.9 p.p. at equal GMACs on ImageNet-1K. We show that the backbone-preserving design of ThinkingViT allows it to serve as a plug-in upgrade for ViTs in downstream tasks such as semantic segmentation. We also demonstrate that ThinkingViT transfers effectively to other architectures such as Swin. The source code is available at https://github.com/ds-kiel/ThinkingViT.

[562] SCALAR: Scale-wise Controllable Visual Autoregressive Learning

Ryan Xu, Dongyang Jin, Yancheng Bai, Rui Lan, Xu Duan, Lei Sun, Xiangxiang Chu

Main category: cs.CV

TL;DR: SCALAR is a controllable image generation method for Visual Autoregressive models that introduces scale-wise conditional decoding to address inefficient control encoding and disruptive injection issues in existing VAR-based approaches.

DetailsMotivation: Controllable generation remains challenging for Visual Autoregressive models due to their hierarchical next-scale prediction style, with existing methods suffering from inefficient control encoding and disruptive injection mechanisms that compromise fidelity and efficiency.

Method: SCALAR uses a pretrained image encoder to extract semantic control signal encodings, projects them into scale-specific representations, and injects them into corresponding layers of the VAR backbone through a Scale-wise Conditional Decoding mechanism. SCALAR-Uni extends this to align multiple control modalities in a shared latent space.

Result: Extensive experiments show that SCALAR achieves superior generation quality and control precision across various tasks compared to existing methods.

Conclusion: SCALAR provides persistent and structurally aligned guidance throughout the generation process, enabling efficient and precise controllable image synthesis with VAR models, with SCALAR-Uni supporting flexible multi-conditional guidance in a single model.

Abstract: Controllable image synthesis, which enables fine-grained control over generated outputs, has emerged as a key focus in visual generative modeling. However, controllable generation remains challenging for Visual Autoregressive (VAR) models due to their hierarchical, next-scale prediction style. Existing VAR-based methods often suffer from inefficient control encoding and disruptive injection mechanisms that compromise both fidelity and efficiency. In this work, we present SCALAR, a controllable generation method based on VAR, incorporating a novel Scale-wise Conditional Decoding mechanism. SCALAR leverages a pretrained image encoder to extract semantic control signal encodings, which are projected into scale-specific representations and injected into the corresponding layers of the VAR backbone. This design provides persistent and structurally aligned guidance throughout the generation process. Building on SCALAR, we develop SCALAR-Uni, a unified extension that aligns multiple control modalities into a shared latent space, supporting flexible multi-conditional guidance in a single model. Extensive experiments show that SCALAR achieves superior generation quality and control precision across various tasks. The code is released at https://github.com/AMAP-ML/SCALAR.

[563] DEXTER: Diffusion-Guided EXplanations with TExtual Reasoning for Vision Models

Simone Carnemolla, Matteo Pennisi, Sarinda Samarasinghe, Giovanni Bellitto, Simone Palazzo, Daniela Giordano, Mubarak Shah, Concetto Spampinato

Main category: cs.CV

TL;DR: DEXTER is a data-free framework that uses diffusion models and LLMs to generate global, textual explanations of visual classifiers by optimizing text prompts to create class-conditional images that activate the classifier, then producing natural language reports about decision patterns and biases.

DetailsMotivation: To build transparent and trustworthy AI systems by understanding and explaining machine learning model behavior, especially for visual classifiers, without needing access to training data or ground-truth labels.

Method: Uses diffusion models and large language models to optimize text prompts that synthesize class-conditional images strongly activating target classifiers, then generates natural language explanations from these synthetic samples to describe decision patterns and biases.

Result: Outperforms existing approaches in global model explanation and class-level bias reporting on datasets including ImageNet, Waterbirds, CelebA, and FairFaces. Produces accurate, interpretable outputs validated through quantitative/qualitative evaluations and user study.

Conclusion: DEXTER provides an effective data-free framework for generating natural language explanations of visual classifiers’ decision processes, enabling better model transparency and bias understanding without requiring original training data.

Abstract: Understanding and explaining the behavior of machine learning models is essential for building transparent and trustworthy AI systems. We introduce DEXTER, a data-free framework that employs diffusion models and large language models to generate global, textual explanations of visual classifiers. DEXTER operates by optimizing text prompts to synthesize class-conditional images that strongly activate a target classifier. These synthetic samples are then used to elicit detailed natural language reports that describe class-specific decision patterns and biases. Unlike prior work, DEXTER enables natural language explanation about a classifier’s decision process without access to training data or ground-truth labels. We demonstrate DEXTER’s flexibility across three tasks-activation maximization, slice discovery and debiasing, and bias explanation-each illustrating its ability to uncover the internal mechanisms of visual classifiers. Quantitative and qualitative evaluations, including a user study, show that DEXTER produces accurate, interpretable outputs. Experiments on ImageNet, Waterbirds, CelebA, and FairFaces confirm that DEXTER outperforms existing approaches in global model explanation and class-level bias reporting. Code is available at https://github.com/perceivelab/dexter.

[564] IMC-Net: A Lightweight Content-Conditioned Encoder with Multi-Pass Processing for Image Classification

YiZhou Li

Main category: cs.CV

TL;DR: A compact image categorization encoder using content-conditioned multi-pass processing with a lightweight core block and score-based selector for input-dependent depth, achieving competitive accuracy with reduced parameters and faster inference.

DetailsMotivation: To create a computation-efficient image categorization model that provides input-conditioned depth without heavy auxiliary modules or specialized pretraining, focusing on economy through multi-pass processing.

Method: Uses a single lightweight core block that can be re-applied multiple times with a score-based selector determining whether further passes benefit each feature map region, implementing module reuse and mild regularization on selection scores.

Result: Attains competitive accuracy on standard benchmarks with reduced parameters, lower floating-point operations, and faster inference compared to similarly sized baselines, while transferring well to multiple datasets without task-specific customization.

Conclusion: The multi-pass strategy with content-conditioned processing provides an effective approach for computation-efficient image categorization while maintaining competitive performance and architectural minimalism.

Abstract: We present a compact encoder for image categorization that emphasizes computation economy through content-conditioned multi-pass processing. The model employs a single lightweight core block that can be re-applied a small number of times, while a simple score-based selector decides whether further passes are beneficial for each region unit in the feature map. This design provides input-conditioned depth without introducing heavy auxiliary modules or specialized pretraining. On standard benchmarks, the approach attains competitive accuracy with reduced parameters, lower floating-point operations, and faster inference compared to similarly sized baselines. The method keeps the architecture minimal, implements module reuse to control footprint, and preserves stable training via mild regularization on selection scores. We discuss implementation choices for efficient masking, pass control, and representation caching, and show that the multi-pass strategy transfers well to several datasets without requiring task-specific customization.

[565] SparseWorld: A Flexible, Adaptive, and Efficient 4D Occupancy World Model Powered by Sparse and Dynamic Queries

Chenxu Dang, Haiyan Liu, Jason Bao, Pei An, Xinyue Tang, PanAn, Jie Ma, Bingchuan Sun, Yan Wang

Main category: cs.CV

TL;DR: SparseWorld is a novel 4D occupancy world model that uses sparse dynamic queries instead of static embeddings, enabling flexible and adaptive perception and forecasting in autonomous driving scenarios.

DetailsMotivation: Existing occupancy world models use static embeddings/grids that limit perception flexibility and misalign with dynamic real-world scenarios due to their 'in-place classification' approach.

Method: Proposes Range-Adaptive Perception module with learnable queries modulated by ego vehicle states, State-Conditioned Forecasting module using regression instead of classification, and Temporal-Aware Self-Scheduling training strategy.

Result: Achieves state-of-the-art performance in perception, forecasting, and planning tasks, with advantages in flexibility, adaptability, and efficiency demonstrated through extensive experiments and visualizations.

Conclusion: SparseWorld successfully addresses limitations of static occupancy models by introducing dynamic query-based approach that better aligns with continuous 4D environments, enabling more effective world modeling for autonomous systems.

Abstract: Semantic occupancy has emerged as a powerful representation in world models for its ability to capture rich spatial semantics. However, most existing occupancy world models rely on static and fixed embeddings or grids, which inherently limit the flexibility of perception. Moreover, their ``in-place classification" over grids exhibits a potential misalignment with the dynamic and continuous nature of real scenarios. In this paper, we propose SparseWorld, a novel 4D occupancy world model that is flexible, adaptive, and efficient, powered by sparse and dynamic queries. We propose a Range-Adaptive Perception module, in which learnable queries are modulated by the ego vehicle states and enriched with temporal-spatial associations to enable extended-range perception. To effectively capture the dynamics of the scene, we design a State-Conditioned Forecasting module, which replaces classification-based forecasting with regression-guided formulation, precisely aligning the dynamic queries with the continuity of the 4D environment. In addition, We specifically devise a Temporal-Aware Self-Scheduling training strategy to enable smooth and efficient training. Extensive experiments demonstrate that SparseWorld achieves state-of-the-art performance across perception, forecasting, and planning tasks. Comprehensive visualizations and ablation studies further validate the advantages of SparseWorld in terms of flexibility, adaptability, and efficiency.

[566] DA-Occ: Direction-Aware 2D Convolution for Efficient and Geometry-Preserving 3D Occupancy Prediction

Yuchen Zhou, Yan Luo, Xiaogang Wang, Xingjian Gu, Mingzhou Lu

Main category: cs.CV

TL;DR: A pure 2D framework for efficient 3D occupancy prediction that introduces height-score projection and direction-aware convolution to preserve geometric integrity while maintaining real-time performance.

DetailsMotivation: Existing methods trade off accuracy for efficiency - some are accurate but slow, while BEV-based approaches sacrifice vertical cues and geometric integrity for speed.

Method: Proposes a pure 2D framework with height-score projection to encode vertical structure, and direction-aware convolution to extract geometric features along vertical and horizontal orientations.

Result: Achieves 39.3% mIoU on Occ3D-nuScenes with 27.7 FPS inference speed, and 14.8 FPS on edge devices, demonstrating real-time capability.

Conclusion: The method effectively balances accuracy and efficiency, making it suitable for real-time deployment in resource-constrained autonomous driving systems.

Abstract: Efficient and high-accuracy 3D occupancy prediction is crucial for ensuring the performance of autonomous driving (AD) systems. However, many existing methods involve trade-offs between accuracy and efficiency. Some achieve high precision but with slow inference speed, while others adopt purely bird’s-eye-view (BEV)-based 2D representations to accelerate processing, inevitably sacrificing vertical cues and compromising geometric integrity. To overcome these limitations, we propose a pure 2D framework that achieves efficient 3D occupancy prediction while preserving geometric integrity. Unlike conventional Lift-Splat-Shoot (LSS) methods that rely solely on depth scores to lift 2D features into 3D space, our approach additionally introduces a height-score projection to encode vertical geometric structure. We further employ direction-aware convolution to extract geometric features along both vertical and horizontal orientations, effectively balancing accuracy and computational efficiency. On the Occ3D-nuScenes, the proposed method achieves an mIoU of 39.3% and an inference speed of 27.7 FPS, effectively balancing accuracy and efficiency. In simulations on edge devices, the inference speed reaches 14.8 FPS, further demonstrating the method’s applicability for real-time deployment in resource-constrained environments.

[567] AniMer+: Unified Pose and Shape Estimation Across Mammalia and Aves via Family-Aware Transformer

Liang An, Jin Lyu, Li Lin, Pujin Cheng, Yebin Liu, Xiaoying Tang

Main category: cs.CV

TL;DR: AniMer+ is a unified framework for reconstructing mammal and bird pose/shape using a high-capacity family-aware ViT with Mixture-of-Experts design, trained on combined real and synthetic data including the first large-scale 3D bird dataset.

DetailsMotivation: To achieve unified understanding of dynamic objects through a single network and enable accurate animal pose/shape estimation across diverse species for biological research, addressing limitations of previous methods' network capacity and multi-species dataset scarcity.

Method: Developed AniMer+ with family-aware Vision Transformer using Mixture-of-Experts design that partitions layers into taxa-specific and shared components. Created diffusion-based synthetic datasets (CtrlAni3D for quadrupeds, CtrlAVES3D for birds) to overcome 3D data shortage.

Result: Superior performance over existing approaches across benchmarks including challenging Animal Kingdom dataset, trained on 41.3k mammalian and 12.4k avian images (real + synthetic). Ablation studies confirm effectiveness of architecture and synthetic data.

Conclusion: AniMer+ successfully enables unified reconstruction of mammals and birds through innovative network design and synthetic data generation, demonstrating strong performance on diverse benchmarks and real-world applications.

Abstract: In the era of foundation models, achieving a unified understanding of different dynamic objects through a single network has the potential to empower stronger spatial intelligence. Moreover, accurate estimation of animal pose and shape across diverse species is essential for quantitative analysis in biological research. However, this topic remains underexplored due to the limited network capacity of previous methods and the scarcity of comprehensive multi-species datasets. To address these limitations, we introduce AniMer+, an extended version of our scalable AniMer framework. In this paper, we focus on a unified approach for reconstructing mammals (mammalia) and birds (aves). A key innovation of AniMer+ is its high-capacity, family-aware Vision Transformer (ViT) incorporating a Mixture-of-Experts (MoE) design. Its architecture partitions network layers into taxa-specific components (for mammalia and aves) and taxa-shared components, enabling efficient learning of both distinct and common anatomical features within a single model. To overcome the critical shortage of 3D training data, especially for birds, we introduce a diffusion-based conditional image generation pipeline. This pipeline produces two large-scale synthetic datasets: CtrlAni3D for quadrupeds and CtrlAVES3D for birds. To note, CtrlAVES3D is the first large-scale, 3D-annotated dataset for birds, which is crucial for resolving single-view depth ambiguities. Trained on an aggregated collection of 41.3k mammalian and 12.4k avian images (combining real and synthetic data), our method demonstrates superior performance over existing approaches across a wide range of benchmarks, including the challenging out-of-domain Animal Kingdom dataset. Ablation studies confirm the effectiveness of both our novel network architecture and the generated synthetic datasets in enhancing real-world application performance.

[568] NS-Net: Decoupling CLIP Semantic Information through NULL-Space for Generalizable AI-Generated Image Detection

Jiazhen Yan, Fan Wang, Weiwei Jiang, Ziqiang Li, Zhangjie Fu

Main category: cs.CV

TL;DR: NS-Net is a novel AI-generated image detection framework that uses NULL-Space projection to remove semantic information from CLIP features and contrastive learning to capture distributional differences between real and fake images, achieving 7.4% better accuracy than state-of-the-art methods.

DetailsMotivation: Existing AI-generated image detectors fail to generalize to unknown generative models, especially when real and fake images have similar semantic content, due to the high-level semantic information in CLIP features hindering effective discrimination.

Method: Proposed NS-Net framework with NULL-Space projection to decouple semantic information from CLIP’s visual features, contrastive learning to capture intrinsic distributional differences, and Patch Selection strategy to preserve fine-grained artifacts by mitigating semantic bias from global image structures.

Result: Extensive experiments on 40 diverse generative models show NS-Net outperforms state-of-the-art methods with 7.4% improvement in detection accuracy, demonstrating strong generalization across both GAN- and diffusion-based image generation techniques.

Conclusion: NS-Net effectively addresses the generalization problem in AI-generated image detection by removing semantic bias from CLIP features and capturing intrinsic distributional differences, achieving superior performance across diverse generative models.

Abstract: The rapid progress of generative models, such as GANs and diffusion models, has facilitated the creation of highly realistic images, raising growing concerns over their misuse in security-sensitive domains. While existing detectors perform well under known generative settings, they often fail to generalize to unknown generative models, especially when semantic content between real and fake images is closely aligned. In this paper, we revisit the use of CLIP features for AI-generated image detection and uncover a critical limitation: the high-level semantic information embedded in CLIP’s visual features hinders effective discrimination. To address this, we propose NS-Net, a novel detection framework that leverages NULL-Space projection to decouple semantic information from CLIP’s visual features, followed by contrastive learning to capture intrinsic distributional differences between real and generated images. Furthermore, we design a Patch Selection strategy to preserve fine-grained artifacts by mitigating semantic bias caused by global image structures. Extensive experiments on an open-world benchmark comprising images generated by 40 diverse generative models show that NS-Net outperforms existing state-of-the-art methods, achieving a 7.4% improvement in detection accuracy, thereby demonstrating strong generalization across both GAN- and diffusion-based image generation techniques.

[569] VITRIX-CLIPIN: Enhancing Fine-Grained Visual Understanding in CLIP via Instruction Editing Data and Long Captions

Ziteng Wang, Siqi Yang, Limeng Qiao, Lin Ma

Main category: cs.CV

TL;DR: CLIP-IN enhances CLIP’s fine-grained visual understanding through instruction-editing datasets as hard negative pairs and long descriptive captions with rotary encodings, improving performance on fine-grained tasks while maintaining zero-shot capabilities.

DetailsMotivation: Vision-Language Models like CLIP struggle with detailed, fine-grained visual comprehension despite their success in aligning vision and language.

Method: Uses instruction-editing datasets as hard negative image-text pairs with symmetric contrastive loss, and incorporates long descriptive captions with rotary positional encodings to capture rich semantic context.

Result: Achieves substantial gains on MMVP benchmark and fine-grained visual recognition tasks without compromising zero-shot performance, and reduces visual hallucinations in Multimodal Large Language Models.

Conclusion: Synergizing targeted instruction-based contrastive learning with comprehensive descriptive information significantly elevates fine-grained understanding in Vision-Language Models.

Abstract: Despite the success of Vision-Language Models (VLMs) like CLIP in aligning vision and language, their proficiency in detailed, fine-grained visual comprehension remains a key challenge. We present CLIP-IN, a novel framework that bolsters CLIP’s fine-grained perception through two core innovations. Firstly, we leverage instruction-editing datasets, originally designed for image manipulation, as a unique source of hard negative image-text pairs. Coupled with a symmetric hard negative contrastive loss, this enables the model to effectively distinguish subtle visual-semantic differences. Secondly, CLIP-IN incorporates long descriptive captions, utilizing rotary positional encodings to capture rich semantic context often missed by standard CLIP. Our experiments demonstrate that CLIP-IN achieves substantial gains on the MMVP benchmark and various fine-grained visual recognition tasks, without compromising robust zero-shot performance on broader classification and retrieval tasks. Critically, integrating CLIP-IN’s visual representations into Multimodal Large Language Models significantly reduces visual hallucinations and enhances reasoning abilities. This work underscores the considerable potential of synergizing targeted, instruction-based contrastive learning with comprehensive descriptive information to elevate the fine-grained understanding of VLMs.

[570] MonoDream: Monocular Vision-Language Navigation with Panoramic Dreaming

Shuo Wang, Yongcai Wang, Zhaoxin Fan, Yucheng Wang, Maiyue Chen, Kaihui Wang, Zhizhong Su, Wanting Li, Xudong Cai, Yeying Jin, Deying Li

Main category: cs.CV

TL;DR: MonoDream enables monocular VLN agents to learn a unified navigation representation that aligns visual semantics with language-grounded action intent, narrowing the performance gap with panoramic RGB-D methods.

DetailsMotivation: Panoramic RGB-D sensors are costly and less accessible for real-world VLN deployments, while existing monocular approaches still lag behind panoramic methods.

Method: Proposes MonoDream framework with Unified Navigation Representation (UNR) and Latent Panoramic Dreaming (LPD) tasks that predict latent features of panoramic RGB-D observations from monocular input.

Result: Consistently improves monocular navigation performance across multiple VLN benchmarks and significantly narrows the gap with panoramic-based agents.

Conclusion: MonoDream demonstrates that lightweight VLA frameworks can effectively enable monocular agents to achieve competitive navigation performance without requiring panoramic RGB-D sensors.

Abstract: Vision-Language Navigation (VLN) tasks often leverage panoramic RGB and depth inputs to provide rich spatial cues for action planning, but these sensors can be costly or less accessible in real-world deployments. Recent approaches based on Vision-Language Action (VLA) models achieve strong results with monocular input, yet they still lag behind methods using panoramic RGB-D information. We present MonoDream, a lightweight VLA framework that enables monocular agents to learn a Unified Navigation Representation (UNR). This shared feature representation jointly aligns navigation-relevant visual semantics (e.g., global layout, depth, and future cues) and language-grounded action intent, enabling more reliable action prediction. MonoDream further introduces Latent Panoramic Dreaming (LPD) tasks to supervise the UNR, which train the model to predict latent features of panoramic RGB and depth observations at both current and future steps based on only monocular input. Experiments on multiple VLN benchmarks show that MonoDream consistently improves monocular navigation performance and significantly narrows the gap with panoramic-based agents.

[571] MonoCloth: Reconstruction and Animation of Cloth-Decoupled Human Avatars from Monocular Videos

Daisheng Jin, Ying He

Main category: cs.CV

TL;DR: MonoCloth reconstructs and animates clothed human avatars from monocular videos using part-based decomposition and cloth simulation for improved realism.

DetailsMotivation: Monocular video reconstruction of human avatars is challenging due to limited geometric information and complex non-rigid motion, requiring specialized approaches for different body components.

Method: Part-based decomposition separates avatar into body, face, hands, and clothing components, with dedicated cloth simulation module using temporal motion cues and geometric constraints for garment deformation.

Result: MonoCloth improves both visual reconstruction quality and animation realism compared to existing methods, and supports additional tasks like clothing transfer.

Conclusion: The part-based design enables versatile and practical avatar reconstruction from monocular videos with enhanced realism and additional functionality.

Abstract: Reconstructing realistic 3D human avatars from monocular videos is a challenging task due to the limited geometric information and complex non-rigid motion involved. We present MonoCloth, a new method for reconstructing and animating clothed human avatars from monocular videos. To overcome the limitations of monocular input, we introduce a part-based decomposition strategy that separates the avatar into body, face, hands, and clothing. This design reflects the varying levels of reconstruction difficulty and deformation complexity across these components. Specifically, we focus on detailed geometry recovery for the face and hands. For clothing, we propose a dedicated cloth simulation module that captures garment deformation using temporal motion cues and geometric constraints. Experimental results demonstrate that MonoCloth improves both visual reconstruction quality and animation realism compared to existing methods. Furthermore, thanks to its part-based design, MonoCloth also supports additional tasks such as clothing transfer, underscoring its versatility and practical utility.

[572] MMEdge: Accelerating On-device Multimodal Inference via Pipelined Sensing and Encoding

Runxi Huang, Mingxuan Yu, Mingyu Tsoi, Xiaomin Ouyang

Main category: cs.CV

TL;DR: MMEdge is a real-time multimodal inference framework for edge devices that uses pipelined sensing and encoding to reduce latency while maintaining accuracy through temporal aggregation and adaptive optimization.

DetailsMotivation: Real-time multimodal inference on resource-constrained edge devices is essential for applications like autonomous driving and human-computer interaction, but prior work overlooks the coupling between sensing dynamics and model execution, as well as inter-modality dependencies.

Method: MMEdge decomposes inference into fine-grained sensing/encoding units for incremental computation, uses temporal aggregation to capture dynamics, incorporates adaptive configuration optimizer for resource constraints, and employs cross-modal speculative skipping for early decision-making.

Result: Evaluation on multimodal datasets and UAV testbed shows MMEdge significantly reduces end-to-end latency while maintaining high task accuracy across various system and data dynamics.

Conclusion: MMEdge provides an effective pipelined approach for real-time multimodal inference on edge devices, achieving both low latency and high accuracy through its incremental processing and adaptive optimization techniques.

Abstract: Real-time multimodal inference on resource-constrained edge devices is essential for applications such as autonomous driving, human-computer interaction, and mobile health. However, prior work often overlooks the tight coupling between sensing dynamics and model execution, as well as the complex inter-modality dependencies. In this paper, we propose MMEdge, an new on-device multi-modal inference framework based on pipelined sensing and encoding. Instead of waiting for complete sensor inputs, MMEdge decomposes the entire inference process into a sequence of fine-grained sensing and encoding units, allowing computation to proceed incrementally as data arrive. MMEdge also introduces a lightweight but effective temporal aggregation module that captures rich temporal dynamics across different pipelined units to maintain accuracy performance. Such pipelined design also opens up opportunities for fine-grained cross-modal optimization and early decision-making during inference. To further enhance system performance under resource variability and input data complexity, MMEdge incorporates an adaptive multimodal configuration optimizer that dynamically selects optimal sensing and model configurations for each modality under latency constraints, and a cross-modal speculative skipping mechanism that bypasses future units of slower modalities when early predictions reach sufficient confidence. We evaluate MMEdge using two public multimodal datasets and deploy it on a real-world unmanned aerial vehicle (UAV)-based multimodal testbed. The results show that MMEdge significantly reduces end-to-end latency while maintaining high task accuracy across various system and data dynamics.

[573] X-MoGen: Unified Motion Generation across Humans and Animals

Xuan Wang, Kai Ruan, Liyang Qian, Zhizhi Guo, Chang Su, Gaoang Wang

Main category: cs.CV

TL;DR: X-MoGen is a unified framework for cross-species text-driven motion generation that handles both humans and animals using a two-stage architecture with morphological consistency constraints.

DetailsMotivation: Existing methods model human and animal motion separately, but a joint cross-species approach offers advantages like unified representation and improved generalization, though morphological differences remain challenging.

Method: Two-stage architecture: 1) Conditional graph VAE learns canonical T-pose priors, autoencoder encodes motion into shared latent space with morphological loss; 2) Masked motion modeling generates motion embeddings conditioned on text descriptions with morphological consistency module.

Result: Outperforms state-of-the-art methods on both seen and unseen species, as demonstrated through extensive experiments on the UniMo4D dataset.

Conclusion: X-MoGen successfully addresses cross-species motion generation challenges and provides a unified framework that generalizes well across different species.

Abstract: Text-driven motion generation has attracted increasing attention due to its broad applications in virtual reality, animation, and robotics. While existing methods typically model human and animal motion separately, a joint cross-species approach offers key advantages, such as a unified representation and improved generalization. However, morphological differences across species remain a key challenge, often compromising motion plausibility. To address this, we propose X-MoGen, the first unified framework for cross-species text-driven motion generation covering both humans and animals. X-MoGen adopts a two-stage architecture. First, a conditional graph variational autoencoder learns canonical T-pose priors, while an autoencoder encodes motion into a shared latent space regularized by morphological loss. In the second stage, we perform masked motion modeling to generate motion embeddings conditioned on textual descriptions. During training, a morphological consistency module is employed to promote skeletal plausibility across species. To support unified modeling, we construct UniMo4D, a large-scale dataset of 115 species and 119k motion sequences, which integrates human and animal motions under a shared skeletal topology for joint training. Extensive experiments on UniMo4D demonstrate that X-MoGen outperforms state-of-the-art methods on both seen and unseen species.

[574] MAISI-v2: Accelerated 3D High-Resolution Medical Image Synthesis with Rectified Flow and Region-specific Contrastive Loss

Can Zhao, Pengfei Guo, Dong Yang, Yucheng Tang, Yufan He, Benjamin Simon, Mason Belue, Stephanie Harmon, Baris Turkbey, Daguang Xu

Main category: cs.CV

TL;DR: MAISI-v2 is an accelerated 3D medical image synthesis framework that uses rectified flow for fast generation and introduces region-specific contrastive loss for better condition fidelity, achieving 33x speedup and state-of-the-art image quality.

DetailsMotivation: Existing medical image synthesis methods have limited generalizability across body regions, slow inference times, and weak alignment with input conditions, which are critical issues for clinical applications.

Method: Integrates rectified flow for accelerated generation and introduces a novel region-specific contrastive loss to enhance sensitivity to regions of interest, improving condition fidelity.

Result: Achieves state-of-the-art image quality with 33x acceleration for latent diffusion models, and synthetic images can be effectively used for data augmentation in downstream segmentation tasks.

Conclusion: MAISI-v2 provides a fast, high-quality medical image synthesis solution with improved condition consistency, and the released code and resources will facilitate further development in the community.

Abstract: Medical image synthesis is an important topic for both clinical and research applications. Recently, diffusion models have become a leading approach in this area. Despite their strengths, many existing methods struggle with (1) limited generalizability that only work for specific body regions or voxel spacings, (2) slow inference, which is a common issue for diffusion models, and (3) weak alignment with input conditions, which is a critical issue for medical imaging. MAISI, a previously proposed framework, addresses generalizability issues but still suffers from slow inference and limited condition consistency. In this work, we present MAISI-v2, the first accelerated 3D medical image synthesis framework that integrates rectified flow to enable fast and high quality generation. To further enhance condition fidelity, we introduce a novel region-specific contrastive loss to enhance the sensitivity to region of interest. Our experiments show that MAISI-v2 can achieve SOTA image quality with $33 \times$ acceleration for latent diffusion model. We also conducted a downstream segmentation experiment to show that the synthetic images can be used for data augmentation. We release our code, training details, model weights, and a GUI demo to facilitate reproducibility and promote further development within the community.

[575] Comparative Study of UNet-based Architectures for Liver Tumor Segmentation in Multi-Phase Contrast-Enhanced Computed Tomography

Doan-Van-Anh Ly, Thi-Thu-Hien Pham, Thanh-Hai Le

Main category: cs.CV

TL;DR: UNet-based architectures with ResNet backbones outperform Transformer and Mamba alternatives for liver tumor segmentation in CECT, with ResNetUNet3+ + CBAM achieving best performance metrics including Dice 0.755 and IoU 0.662.

DetailsMotivation: Liver structure segmentation in multi-phase CECT is crucial for computer-aided diagnosis and treatment planning of liver diseases including tumor detection.

Method: Evaluated UNet-based architectures with various backbones (ResNet, Transformer, Mamba) pretrained, then incorporated attention mechanisms including CBAM, and used Grad-CAM for interpretability.

Result: ResNet-based models consistently outperformed alternatives. ResNetUNet3+ with CBAM achieved best metrics: Dice 0.755, IoU 0.662, HD95 77.911, accuracy 0.925, specificity 0.926.

Conclusion: Classical ResNet architecture combined with modern attention modules remains highly competitive for medical image segmentation, offering promising direction for liver tumor detection in clinical practice.

Abstract: Segmentation of liver structures in multi-phase contrast-enhanced computed tomography (CECT) plays a crucial role in computer-aided diagnosis and treatment planning for liver diseases, including tumor detection. In this study, we investigate the performance of UNet-based architectures for liver tumor segmentation, starting from the original UNet and extending to UNet3+ with various backbone networks. We evaluate ResNet, Transformer-based, and State-space (Mamba) backbones, all initialized with pretrained weights. Surprisingly, despite the advances in modern architecture, ResNet-based models consistently outperform Transformer- and Mamba-based alternatives across multiple evaluation metrics. To further improve segmentation quality, we introduce attention mechanisms into the backbone and observe that incorporating the Convolutional Block Attention Module (CBAM) yields the best performance. ResNetUNet3+ with CBAM module not only produced the best overlap metrics with a Dice score of 0.755 and IoU of 0.662, but also achieved the most precise boundary delineation, evidenced by the lowest HD95 distance of 77.911. The model’s superiority was further cemented by its leading overall accuracy of 0.925 and specificity of 0.926, showcasing its robust capability in accurately identifying both lesion and healthy tissue. To further enhance interpretability, Grad-CAM visualizations were employed to highlight the region’s most influential predictions, providing insights into its decision-making process. These findings demonstrate that classical ResNet architecture, when combined with modern attention modules, remain highly competitive for medical image segmentation tasks, offering a promising direction for liver tumor detection in clinical practice.

[576] SynSeg: Feature Synergy for Multi-Category Contrastive Learning in End-to-End Open-Vocabulary Semantic Segmentation

Weichen Zhang, Kebin Liu, Fan Dang, Zhui Zhu, Xikai Sun, Yunhao Liu

Main category: cs.CV

TL;DR: SynSeg is a novel weakly-supervised semantic segmentation method that uses Multi-Category Contrastive Learning and Feature Synergy Structure to address semantic misalignment in open-vocabulary scenarios, achieving state-of-the-art performance with 6.9-26.2% higher accuracy.

DetailsMotivation: Existing weakly-supervised methods suffer from semantic misalignment and poor performance due to category-specific supervision and ill-suited feature construction for contrastive learning in open-vocabulary semantic segmentation.

Method: Proposes Multi-Category Contrastive Learning (MCCL) for intra- and inter-category alignment and separation, combined with Feature Synergy Structure (FSS) for discriminative feature reconstruction through prior fusion and semantic-activation-map enhancement.

Result: Extensive experiments show SynSeg outperforms state-of-the-art methods with 6.9-26.2% higher accuracy, improving semantic localization and discrimination abilities under weak supervision while enabling real-time inference.

Conclusion: SynSeg provides an effective lightweight end-to-end solution for open-vocabulary semantic segmentation that avoids foreground bias and achieves superior performance without relying on large-scale pretrained models.

Abstract: Semantic segmentation in open-vocabulary scenarios presents significant challenges due to the wide range and granularity of semantic categories. Existing weakly-supervised methods often rely on category-specific supervision and ill-suited feature construction methods for contrastive learning, leading to semantic misalignment and poor performance. In this work, we propose a novel weakly-supervised approach, SynSeg, to address the challenges. SynSeg performs Multi-Category Contrastive Learning (MCCL) as a stronger training signal with a new feature reconstruction framework named Feature Synergy Structure (FSS). Specifically, MCCL strategy robustly combines both intra- and inter-category alignment and separation in order to make the model learn the knowledge of correlations from different categories within the same image. Moreover, FSS reconstructs discriminative features for contrastive learning through prior fusion and semantic-activation-map enhancement, effectively avoiding the foreground bias introduced by the visual encoder. Furthermore, SynSeg is a lightweight end-to-end solution without using any mid-term output from large-scale pretrained models and capable for real-time inference. In general, SynSeg effectively improves the abilities in semantic localization and discrimination under weak supervision in an efficient manner. Extensive experiments on benchmarks demonstrate that our method outperforms state-of-the-art (SOTA) performance. Particularly, SynSeg achieves higher accuracy than SOTA baselines with a ratio from 6.9% up to 26.2%.

[577] LeMiCa: Lexicographic Minimax Path Caching for Efficient Diffusion-Based Video Generation

Huanlin Gao, Ping Chen, Fuyuan Shi, Chao Tan, Zhaoxiang Liu, Fang Zhao, Kai Wang, Shiguo Lian

Main category: cs.CV

TL;DR: LeMiCa is a training-free acceleration framework for diffusion-based video generation that uses lexicographic minimax path optimization to bound global errors, achieving 2.9x speedup on Latte model with minimal quality degradation.

DetailsMotivation: Existing caching strategies for video generation focus on reducing local errors but overlook global error accumulation, leading to content degradation between accelerated and original videos.

Method: Formulates cache scheduling as a directed graph with error-weighted edges and introduces Lexicographic Minimax Path Optimization to explicitly bound worst-case path error, improving global content and style consistency.

Result: Achieves 2.9x speedup on Latte model and LPIPS score of 0.05 on Open-Sora, outperforming prior caching techniques with minimal perceptual quality degradation.

Conclusion: LeMiCa provides a robust and generalizable paradigm for accelerating diffusion-based video generation while maintaining quality, serving as a foundation for future efficient video synthesis research.

Abstract: We present LeMiCa, a training-free and efficient acceleration framework for diffusion-based video generation. While existing caching strategies primarily focus on reducing local heuristic errors, they often overlook the accumulation of global errors, leading to noticeable content degradation between accelerated and original videos. To address this issue, we formulate cache scheduling as a directed graph with error-weighted edges and introduce a Lexicographic Minimax Path Optimization strategy that explicitly bounds the worst-case path error. This approach substantially improves the consistency of global content and style across generated frames. Extensive experiments on multiple text-to-video benchmarks demonstrate that LeMiCa delivers dual improvements in both inference speed and generation quality. Notably, our method achieves a 2.9x speedup on the Latte model and reaches an LPIPS score of 0.05 on Open-Sora, outperforming prior caching techniques. Importantly, these gains come with minimal perceptual quality degradation, making LeMiCa a robust and generalizable paradigm for accelerating diffusion-based video generation. We believe this approach can serve as a strong foundation for future research on efficient and reliable video synthesis. Our code is available at :https://github.com/UnicomAI/LeMiCa

[578] Deepfake Detection that Generalizes Across Benchmarks

Andrii Yermakov, Jan Cech, Jiri Matas, Mario Fritz

Main category: cs.CV

TL;DR: GenD achieves state-of-the-art deepfake detection generalization by fine-tuning only 0.03% of Layer Normalization parameters in pre-trained vision encoders, using L2 normalization and metric learning on a hyperspherical feature manifold.

DetailsMotivation: Deepfake detectors struggle with generalization to unseen manipulation techniques, and existing approaches often introduce significant architectural complexity without solving the core generalization problem.

Method: Parameter-efficient adaptation of pre-trained vision encoders by fine-tuning only Layer Normalization parameters, enforcing hyperspherical feature manifold using L2 normalization and metric learning.

Result: Achieves state-of-the-art performance across 14 benchmark datasets (2019-2025), outperforming more complex approaches in average cross-dataset AUROC. Key findings: training on paired real-fake data is essential, and detection difficulty hasn’t strictly increased over time.

Conclusion: State-of-the-art generalization in deepfake detection is achievable through minimal, targeted changes to pre-trained foundational models, providing a computationally efficient and reproducible solution.

Abstract: The generalization of deepfake detectors to unseen manipulation techniques remains a challenge for practical deployment. Although many approaches adapt foundation models by introducing significant architectural complexity, this work demonstrates that robust generalization is achievable through a parameter-efficient adaptation of one of the foundational pre-trained vision encoders. The proposed method, GenD, fine-tunes only the Layer Normalization parameters (0.03% of the total) and enhances generalization by enforcing a hyperspherical feature manifold using L2 normalization and metric learning on it. We conducted an extensive evaluation on 14 benchmark datasets spanning from 2019 to 2025. The proposed method achieves state-of-the-art performance, outperforming more complex, recent approaches in average cross-dataset AUROC. Our analysis yields two primary findings for the field: 1) training on paired real-fake data from the same source video is essential for mitigating shortcut learning and improving generalization, and 2) detection difficulty on academic datasets has not strictly increased over time, with models trained on older, diverse datasets showing strong generalization capabilities. This work delivers a computationally efficient and reproducible method, proving that state-of-the-art generalization is attainable by making targeted, minimal changes to a pre-trained foundational image encoder model. The code is at: https://github.com/yermandy/GenD

[579] DMSORT: An efficient parallel maritime multi-object tracking architecture for unmanned vessel platforms

Shengyu Tang, Zeyuan Lu, Jiazhi Dong, Changdong Yu, Xiaoyu Wang, Yaohui Lyu, Weihao Xia

Main category: cs.CV

TL;DR: DMSORT is an efficient maritime multi-object tracking method that uses parallel tracking with affine compensation, combining object detection/ReID with camera motion estimation to handle challenging marine environments.

DetailsMotivation: Maritime MOT faces challenges from camera motion and visual degradation in complex marine environments, which affect tracking accuracy and vessel navigation safety.

Method: Uses dual-branch parallel tracker with affine compensation: detection/ReID branch with RCDN for object detection and Li-TAE for appearance features, plus camera motion estimation branch that decouples platform and target motion using projective transformation and Kalman filter compensation.

Result: Achieves state-of-the-art performance on Singapore Maritime Dataset with fastest runtime among ReID-based MOT frameworks, maintaining high identity consistency and robustness to jitter/occlusion.

Conclusion: DMSORT effectively addresses maritime MOT challenges through its dual-branch architecture and motion compensation, providing efficient and robust tracking suitable for marine applications.

Abstract: Accurate perception of the marine environment through robust multi-object tracking (MOT) is essential for ensuring safe vessel navigation and effective maritime surveillance. However, the complicated maritime environment often causes camera motion and subsequent visual degradation, posing significant challenges to MOT. To address this challenge, we propose an efficient Dual-branch Maritime SORT (DMSORT) method for maritime MOT. The core of the framework is a parallel tracker with affine compensation, which incorporates an object detection and re-identification (ReID) branch, along with a dedicated branch for dynamic camera motion estimation. Specifically, a Reversible Columnar Detection Network (RCDN) is integrated into the detection module to leverage multi-level visual features for robust object detection. Furthermore, a lightweight Transformer-based appearance extractor (Li-TAE) is designed to capture global contextual information and generate robust appearance features. Another branch decouples platform-induced and target-intrinsic motion by constructing a projective transformation, applying platform-motion compensation within the Kalman filter, and thereby stabilizing true object trajectories. Finally, a clustering-optimized feature fusion module effectively combines motion and appearance cues to ensure identity consistency under noise, occlusion, and drift. Extensive evaluations on the Singapore Maritime Dataset demonstrate that DMSORT achieves state-of-the-art performance. Notably, DMSORT attains the fastest runtime among existing ReID-based MOT frameworks while maintaining high identity consistency and robustness to jitter and occlusion. Code is available at: https://github.com/BiscuitsLzy/DMSORT-An-efficient-parallel-maritime-multi-object-tracking-architecture-.

[580] Understanding Dynamic Scenes in Ego Centric 4D Point Clouds

Junsheng Huang, Shengyu Hao, Bocheng Hu, Hongwei Wang, Gaoang Wang

Main category: cs.CV

TL;DR: EgoDynamic4D is a new QA benchmark for dynamic 4D scene understanding from egocentric views, featuring 927K QA pairs with Chain-of-Thought annotations and 12 reasoning tasks, with a proposed framework that outperforms baselines.

DetailsMotivation: Existing egocentric datasets lack unified 4D annotations and task-driven evaluation protocols for fine-grained spatio-temporal reasoning about object/human motion and interactions in dynamic scenes.

Method: Proposed end-to-end spatio-temporal reasoning framework with instance-aware feature encoding, time/camera encoding, and spatially adaptive down-sampling to compress 4D scenes into LLM-manageable token sequences.

Result: Experiments show the method consistently outperforms baselines on the EgoDynamic4D benchmark, validating effectiveness of multimodal temporal modeling.

Conclusion: The work addresses the gap in dynamic 4D scene understanding through a comprehensive benchmark and effective reasoning framework, advancing egocentric scene analysis capabilities.

Abstract: Understanding dynamic 4D scenes from an egocentric perspective-modeling changes in 3D spatial structure over time-is crucial for human-machine interaction, autonomous navigation, and embodied intelligence. While existing egocentric datasets contain dynamic scenes, they lack unified 4D annotations and task-driven evaluation protocols for fine-grained spatio-temporal reasoning, especially on motion of objects and human, together with their interactions. To address this gap, we introduce EgoDynamic4D, a novel QA benchmark on highly dynamic scenes, comprising RGB-D video, camera poses, globally unique instance masks, and 4D bounding boxes. We construct 927K QA pairs accompanied by explicit Chain-of-Thought (CoT), enabling verifiable, step-by-step spatio-temporal reasoning. We design 12 dynamic QA tasks covering agent motion, human-object interaction, trajectory prediction, relation understanding, and temporal-causal reasoning, with fine-grained, multidimensional metrics. To tackle these tasks, we propose an end-to-end spatio-temporal reasoning framework that unifies dynamic and static scene information, using instance-aware feature encoding, time and camera encoding, and spatially adaptive down-sampling to compress large 4D scenes into token sequences manageable by LLMs. Experiments on EgoDynamic4D show that our method consistently outperforms baselines, validating the effectiveness of multimodal temporal modeling for egocentric dynamic scene understanding.

[581] Proto-LeakNet: Towards Signal-Leak Aware Attribution in Synthetic Human Face Imagery

Claudio Giusti, Luca Guarnera, Sebastiano Battiato

Main category: cs.CV

TL;DR: Proto-LeakNet is an interpretable attribution framework that detects diffusion model fingerprints in latent representations using signal-leak analysis and prototype-based classification.

DetailsMotivation: The increasing sophistication of synthetic image and deepfake generation models makes source attribution and authenticity verification crucial for computer vision systems, as diffusion models unintentionally leave statistical traces in their outputs.

Method: The framework operates in the latent domain of diffusion models, re-simulating partial forward diffusion to expose generator-specific cues. It uses a temporal attention encoder for multi-step latent feature aggregation and a feature-weighted prototype head for transparent attribution with closed-set classification and open-set evaluation.

Result: Proto-LeakNet achieves 98.13% Macro AUC, learns robust latent geometry that withstands post-processing, outperforms state-of-the-art methods, and shows strong separability between real images, known generators, and unseen generators.

Conclusion: The proposed framework effectively addresses source attribution challenges for diffusion models by leveraging signal-leak analysis and interpretable prototype-based learning, providing robust performance without requiring retraining for unseen generators.

Abstract: The growing sophistication of synthetic image and deepfake generation models has turned source attribution and authenticity verification into a critical challenge for modern computer vision systems. Recent studies suggest that diffusion pipelines unintentionally imprint persistent statistical traces, known as signal-leaks, within their outputs, particularly in latent representations. Building on this observation, we propose Proto-LeakNet, a signal-leak-aware and interpretable attribution framework that integrates closed-set classification with a density-based open-set evaluation on the learned embeddings, enabling analysis of unseen generators without retraining. Acting in the latent domain of diffusion models, our method re-simulates partial forward diffusion to expose residual generator-specific cues. A temporal attention encoder aggregates multi-step latent features, while a feature-weighted prototype head structures the embedding space and enables transparent attribution. Trained solely on closed data and achieving a Macro AUC of 98.13%, Proto-LeakNet learns a latent geometry that remains robust under post-processing, surpassing state-of-the-art methods, and achieves strong separability both between real images and known generators, and between known and unseen ones. The codebase will be available after acceptance.

[582] The Brain Resection Multimodal Image Registration (ReMIND2Reg) 2025 Challenge

Reuben Dorent, Laura Rigolo, Colin P. Galvin, Junyu Chen, Mattias P. Heinrich, Aaron Carass, Olivier Colliot, Demian Wassermann, Alexandra Golby, Tina Kapur, William Wells

Main category: cs.CV

TL;DR: The ReMIND2Reg 2025 Challenge provides a large public benchmark for aligning post-resection intraoperative ultrasound with preoperative MRI to address brain shift in neurosurgery, featuring 114 cases and standardized evaluation metrics.

DetailsMotivation: Neuronavigation systems lose accuracy during brain tumor surgery due to brain shift, making accurate intraoperative image guidance critical for maximal safe resection.

Method: The challenge provides paired 3D ceT1 MRI, T2 MRI, and post-resection 3D iUS volumes from 114 cases (99 training, 5 validation, 10 test) without annotations for training, with validation and test evaluated on manually annotated landmarks.

Result: Performance is evaluated using target registration error (TRE), robustness to worst-case landmark misalignment (TRE30), and runtime metrics.

Conclusion: The ReMIND2Reg benchmark aims to accelerate development of robust, generalizable multimodal registration algorithms for clinically deployable image-guided neurosurgery.

Abstract: Accurate intraoperative image guidance is critical for achieving maximal safe resection in brain tumor surgery, yet neuronavigation systems based on preoperative MRI lose accuracy during the procedure due to brain shift. Aligning post-resection intraoperative ultrasound (iUS) with preoperative MRI can restore spatial accuracy by estimating brain shift deformations, but it remains a challenging problem given the large anatomical and topological changes and substantial modality intensity gap. The ReMIND2Reg 2025 Challenge provides the largest public benchmark for this task, built upon the ReMIND dataset. It offers 99 training cases, 5 validation cases, and 10 private test cases comprising paired 3D ceT1 MRI, T2 MRI, and post-resection 3D iUS volumes. Data are provided without annotations for training, while validation and test performance are evaluated on manually annotated anatomical landmarks. Metrics include target registration error (TRE), robustness to worst-case landmark misalignment (TRE30), and runtime. By establishing a standardized evaluation framework for this clinically critical and technically complex problem, ReMIND2Reg aims to accelerate the development of robust, generalizable, and clinically deployable multimodal registration algorithms for image-guided neurosurgery.

[583] ViMoNet: A Multimodal Vision-Language Framework for Human Behavior Understanding from Motion and Video

Rajan Das Gupta, Md Yeasin Rahat, Nafiz Fahad, Abir Ahmed, Liew Tze Hui

Main category: cs.CV

TL;DR: ViMoNet is a framework that combines video and motion data to better understand human behavior, outperforming existing methods in caption generation and behavior interpretation.

DetailsMotivation: Current models focus only on motion data or videos, but combining both is essential to fully capture the nuances of human actions and movements.

Method: ViMoNet uses joint training with both detailed motion-text data and comprehensive video-text data, and introduces the VIMOS dataset with films, motion sequences, instructions, and subtitles.

Result: ViMoNet outperforms existing methods in caption generation, motion understanding, and behavior interpretation, as shown through tests on the ViMoNet-Bench benchmark.

Conclusion: Combining video and motion data through joint training enables better understanding of human behavior, and ViMoNet provides an effective framework for this purpose.

Abstract: This study investigates how large language models (LLMs) can be used to understand human behavior using motion and video data. We think that mixing both types is essential to completely capture the nuanced movements and meanings of human actions, in contrast to recent models that simply concentrate on motion data or films. To address this, we provide ViMoNet, a straightforward yet effective framework for comprehending, characterizing, and deducing human action. ViMoNet employs a joint training strategy that leverages the advantages of two data types: detailed motion-text data, which is more exact, and generic video-text data, which is more comprehensive but less detailed. This aids in the model’s acquisition of rich data regarding time and space in human behavior. Additionally, we provide a brand new dataset named VIMOS that contains a variety of films, motion sequences, instructions, and subtitles. We developed ViMoNet-Bench, a standardized benchmark with carefully labeled samples, to evaluate how well models understand human behavior. Our tests show that ViMoNet outperforms existing methods in caption generation, motion understanding, and behavior interpretation.

[584] LLMC+: Benchmarking Vision-Language Model Compression with a Plug-and-play Toolkit

Chengtao Lv, Bilang Zhang, Yang Yong, Ruihao Gong, Yushi Huang, Shiqiao Gu, Jiajun Wu, Yumeng Shi, Jinyang Guo, Wenya Wang

Main category: cs.CV

TL;DR: LLMC+ is a comprehensive benchmark and toolkit for compressing Large Vision-Language Models that addresses limitations in current training-free compression methods by enabling systematic evaluation of spatial and temporal redundancy reduction techniques.

DetailsMotivation: Current VLM compression methods suffer from three major limitations: lack of comparable module decomposition, evaluation confined to simple single-turn tasks, and isolated use of individual compression techniques without exploring joint potential.

Method: Introduces LLMC+ benchmark with over 20 algorithms across five VLM families, supporting systematic study of token-level and model-level compression through a plug-and-play toolkit.

Result: Reveals that spatial and temporal redundancies require distinct strategies, token reduction methods degrade in multi-turn dialogue and detail-sensitive tasks, and combining token and model compression achieves extreme compression with minimal performance loss.

Conclusion: LLMC+ facilitates fair evaluation and inspires future research in efficient VLM compression, providing a comprehensive framework for studying compression techniques.

Abstract: Large Vision-Language Models (VLMs) exhibit impressive multi-modal capabilities but suffer from prohibitive computational and memory demands, due to their long visual token sequences and massive parameter sizes. To address these issues, recent works have proposed training-free compression methods. However, existing efforts often suffer from three major limitations: (1) Current approaches do not decompose techniques into comparable modules, hindering fair evaluation across spatial and temporal redundancy. (2) Evaluation confined to simple single-turn tasks, failing to reflect performance in realistic scenarios. (3) Isolated use of individual compression techniques, without exploring their joint potential. To overcome these gaps, we introduce LLMC+, a comprehensive VLM compression benchmark with a versatile, plug-and-play toolkit. LLMC+ supports over 20 algorithms across five representative VLM families and enables systematic study of token-level and model-level compression. Our benchmark reveals that: (1) Spatial and temporal redundancies demand distinct technical strategies. (2) Token reduction methods degrade significantly in multi-turn dialogue and detail-sensitive tasks. (3) Combining token and model compression achieves extreme compression with minimal performance loss. We believe LLMC+ will facilitate fair evaluation and inspire future research in efficient VLM. Our code is available at https://github.com/ModelTC/LightCompress.

[585] A Segmentation-driven Editing Method for Bolt Defect Augmentation and Detection

Yangjie Xiao, Ke Zhang, Jiacun Wang, Xin Sheng, Yurong Guo, Meijuan Chen, Zehua Ren, Zhaoye Zheng, Zhenbing Zhao

Main category: cs.CV

TL;DR: SBDE is a segmentation-driven bolt defect editing method that generates realistic defect images to augment imbalanced datasets for improved bolt defect detection in transmission lines.

DetailsMotivation: Bolt defect detection suffers from scarcity of defect images and imbalanced data distributions, which limits detection performance in transmission line safety applications.

Method: Proposes Bolt-SAM for bolt attribute segmentation with CLAHE-FFT Adapter and Multipart-Aware Mask Decoder, then uses MOD-LaMa for defect attribute editing, and ERA strategy to place edited defects back into original scenes.

Result: Generated bolt defect images significantly outperform state-of-the-art image editing models and effectively improve bolt defect detection performance in extensive experiments.

Conclusion: SBDE method proves effective for dataset augmentation and has strong application potential for improving bolt defect detection in transmission line safety systems.

Abstract: Bolt defect detection is critical to ensure the safety of transmission lines. However, the scarcity of defect images and imbalanced data distributions significantly limit detection performance. To address this problem, we propose a segmentationdriven bolt defect editing method (SBDE) to augment the dataset. First, a bolt attribute segmentation model (Bolt-SAM) is proposed, which enhances the segmentation of complex bolt attributes through the CLAHE-FFT Adapter (CFA) and Multipart- Aware Mask Decoder (MAMD), generating high-quality masks for subsequent editing tasks. Second, a mask optimization module (MOD) is designed and integrated with the image inpainting model (LaMa) to construct the bolt defect attribute editing model (MOD-LaMa), which converts normal bolts into defective ones through attribute editing. Finally, an editing recovery augmentation (ERA) strategy is proposed to recover and put the edited defect bolts back into the original inspection scenes and expand the defect detection dataset. We constructed multiple bolt datasets and conducted extensive experiments. Experimental results demonstrate that the bolt defect images generated by SBDE significantly outperform state-of-the-art image editing models, and effectively improve the performance of bolt defect detection, which fully verifies the effectiveness and application potential of the proposed method. The code of the project is available at https://github.com/Jay-xyj/SBDE.

[586] HumanSense: From Multimodal Perception to Empathetic Context-Aware Responses through Reasoning MLLMs

Zheng Qin, Ruobing Zheng, Yabing Wang, Tianqi Li, Yi Yuan, Jingdong Chen, Le Wang

Main category: cs.CV

TL;DR: HumanSense is a comprehensive benchmark for evaluating MLLMs’ human-centered perception and interaction capabilities, focusing on understanding multimodal contexts and providing rational feedback. The study shows current MLLMs need improvement, especially for interaction tasks, and proposes a multi-stage reinforcement learning approach that enhances performance.

DetailsMotivation: Progress in MLLMs is hindered by the lack of fine-grained evaluation frameworks for human-centered scenarios that assess understanding of complex human intentions and provision of empathetic, context-aware responses.

Method: Introduces HumanSense benchmark and develops a multi-stage, modality-progressive reinforcement learning approach (HumanSense-Omni-Reasoning) to enhance reasoning abilities. Also designs prompts to improve non-reasoning models training-free.

Result: Leading MLLMs show considerable room for improvement, especially for interaction tasks. Supplementing visual with audio/text improves performance, and omni-modal models have advantages. The proposed approach substantially enhances higher-level understanding and interactive tasks.

Conclusion: Reasoning ability is key to providing appropriate feedback, and consistent thought patterns in reasoning processes can be leveraged through prompt design to enhance model performance without additional training.

Abstract: While Multimodal Large Language Models (MLLMs) show immense promise for achieving truly human-like interactions, progress is hindered by the lack of fine-grained evaluation frameworks for human-centered scenarios, encompassing both the understanding of complex human intentions and the provision of empathetic, context-aware responses. Here we introduce HumanSense, a comprehensive benchmark designed to evaluate the human-centered perception and interaction capabilities of MLLMs, with a particular focus on deep understanding of extended multimodal contexts and the formulation of rational feedback. Our evaluation reveals that leading MLLMs still have considerable room for improvement, particularly for advanced interaction-oriented tasks. Supplementing visual input with audio and text information yields substantial improvements, and Omni-modal models show advantages on these tasks.Furthermore, grounded in the observation that appropriate feedback stems from a contextual analysis of the interlocutor’s needs and emotions, we posit that reasoning ability serves as the key to unlocking it. We devise a multi-stage, modality-progressive reinforcement learning approach, resulting in HumanSense-Omni-Reasoning, which substantially enhances performance on higher-level understanding and interactive tasks. Additionally, we observe that successful reasoning processes appear to exhibit consistent thought patterns. By designing corresponding prompts, we also enhance the performance of non-reasoning models in a training-free manner.Project page: \textcolor{brightpink}{https://digital-avatar.github.io/ai/HumanSense/}

[587] GANDiff FR: Hybrid GAN Diffusion Synthesis for Causal Bias Attribution in Face Recognition

Md Asgor Hossain Reaj, Rajan Das Gupta, Md Yeasin Rahat, Nafiz Fahad, Md Jawadul Hasan, Tze Hui Liew

Main category: cs.CV

TL;DR: GANDiff FR is a synthetic framework that combines StyleGAN3 and diffusion models to generate demographically balanced faces with controlled attributes for bias measurement and reduction in face recognition systems.

DetailsMotivation: To create a reproducible framework for precisely measuring and reducing bias in face recognition systems by controlling demographic and environmental factors under ceteris paribus conditions.

Method: Unifies StyleGAN3-based identity-preserving generation with diffusion-based attribute control to manipulate pose (30°), illumination (4 directions), and expression (5 levels). Synthesizes 10,000 demographically balanced faces across five cohorts.

Result: AdaFace reduces inter-group TPR disparity by 60% (2.5% vs 6.3%), with illumination accounting for 42% of residual bias. Strong synthetic-to-real transfer (r=0.85) confirmed on RFW, BUPT, and CASIA WebFace datasets.

Conclusion: GANDiff FR establishes a reproducible standard for fairness auditing with 3x more attribute-conditioned variants than pure GANs, despite 20% computational overhead, supporting transparent and scalable bias evaluation aligned with EU AI Act regulations.

Abstract: We introduce GANDiff FR, the first synthetic framework that precisely controls demographic and environmental factors to measure, explain, and reduce bias with reproducible rigor. GANDiff FR unifies StyleGAN3-based identity-preserving generation with diffusion-based attribute control, enabling fine-grained manipulation of pose around 30 degrees, illumination (four directions), and expression (five levels) under ceteris paribus conditions. We synthesize 10,000 demographically balanced faces across five cohorts validated for realism via automated detection (98.2%) and human review (89%) to isolate and quantify bias drivers. Benchmarking ArcFace, CosFace, and AdaFace under matched operating points shows AdaFace reduces inter-group TPR disparity by 60% (2.5% vs. 6.3%), with illumination accounting for 42% of residual bias. Cross-dataset evaluation on RFW, BUPT, and CASIA WebFace confirms strong synthetic-to-real transfer (r 0.85). Despite around 20% computational overhead relative to pure GANs, GANDiff FR yields three times more attribute-conditioned variants, establishing a reproducible, regulation-aligned (EU AI Act) standard for fairness auditing. Code and data are released to support transparent, scalable bias evaluation.

[588] Causality Matters: How Temporal Information Emerges in Video Language Models

Yumeng Shi, Quanyu Long, Yin Wu, Wenya Wang

Main category: cs.CV

TL;DR: VideoLMs don’t rely on positional encodings for temporal understanding; instead, temporal reasoning emerges from inter-frame attention and causal mechanisms. The study reveals temporal cues are synthesized through attention and proposes efficient strategies like staged cross-modal attention and temporal exit mechanisms.

DetailsMotivation: To understand how VideoLMs actually perform temporal reasoning, given that positional encodings surprisingly have minimal impact on temporal understanding performance.

Method: Conducted analysis experiments to trace temporal information integration, revealing a causal pathway where temporal cues are synthesized through inter-frame attention and aggregated in the final frame. Proposed staged cross-modal attention and temporal exit mechanisms for efficiency.

Result: Found that removing/modifying positional encodings causes minimal degradation, while reversing frame sequence with original PEs causes substantial performance drop. Temporal reasoning emerges from inter-visual token interactions under causal attention constraints.

Conclusion: Temporal understanding in VideoLMs emerges from implicit temporal structure in causal attention mechanisms rather than explicit positional encodings, enabling more efficient model designs through the proposed strategies.

Abstract: Video language models (VideoLMs) have made significant progress in multimodal understanding. However, temporal understanding, which involves identifying event order, duration, and relationships across time, still remains a core challenge. Prior works emphasize positional encodings (PEs) as a key mechanism for encoding temporal structure. Surprisingly, we find that removing or modifying PEs in video inputs yields minimal degradation in the performance of temporal understanding. In contrast, reversing the frame sequence while preserving the original PEs causes a substantial drop. To explain this behavior, we conduct substantial analysis experiments to trace how temporal information is integrated within the model. We uncover a causal information pathway: temporal cues are progressively synthesized through inter-frame attention, aggregated in the final frame, and subsequently integrated into the query tokens. This emergent mechanism shows that temporal reasoning emerges from inter-visual token interactions under the constraints of causal attention, which implicitly encodes temporal structure. Based on these insights, we propose two efficiency-oriented strategies: staged cross-modal attention and a temporal exit mechanism for early token truncation. Experiments on two benchmarks validate the effectiveness of both approaches. To the best of our knowledge, this is the first systematic study of video temporal understanding in VideoLMs, offering insights for future model improvement. Our code is available at https://github.com/ANDgate99/Causality-Matters .

[589] S5: Scalable Semi-Supervised Semantic Segmentation in Remote Sensing

Liang Lv, Di Wang, Jing Zhang, Lefei Zhang

Main category: cs.CV

TL;DR: S5 is a scalable semi-supervised semantic segmentation framework for remote sensing that leverages large-scale unlabeled data through data selection strategies and foundation model pre-training, achieving state-of-the-art performance on multiple benchmarks.

DetailsMotivation: Existing semi-supervised semantic segmentation methods in remote sensing rely on small datasets and models, limiting practical applicability. There's vast unlabeled Earth observation data that remains underutilized due to expensive pixel-level annotations.

Method: Proposed S5 framework with: 1) Data selection strategy combining entropy-based filtering and diversity expansion to create RS4P-1M dataset; 2) Pre-training RS foundation models of varying sizes on this dataset; 3) Mixture-of-Experts-based multi-dataset fine-tuning for efficient adaptation to multiple benchmarks.

Result: The framework significantly boosts performance on land cover segmentation and object detection tasks. RSFMs achieve state-of-the-art performance across all benchmarks, demonstrating the viability of scaling semi-supervised learning for remote sensing applications.

Conclusion: S5 successfully unlocks the potential of vast unlabeled Earth observation data through scalable semi-supervised learning, providing a practical solution for remote sensing analysis with improved generalization and versatility across diverse benchmarks.

Abstract: Semi-supervised semantic segmentation (S4) has advanced remote sensing (RS) analysis by leveraging unlabeled data through pseudo-labeling and consistency learning. However, existing S4 studies often rely on small-scale datasets and models, limiting their practical applicability. To address this, we propose S5, the first scalable framework for semi-supervised semantic segmentation in RS, which unlocks the potential of vast unlabeled Earth observation data typically underutilized due to costly pixel-level annotations. Built upon existing large-scale RS datasets, S5 introduces a data selection strategy that integrates entropy-based filtering and diversity expansion, resulting in the RS4P-1M dataset. Using this dataset, we systematically scale up S4 methods by pre-training RS foundation models (RSFMs) of varying sizes on this extensive corpus, significantly boosting their performance on land cover segmentation and object detection tasks. Furthermore, during fine-tuning, we incorporate a Mixture-of-Experts (MoE)-based multi-dataset fine-tuning approach, which enables efficient adaptation to multiple RS benchmarks with fewer parameters. This approach improves the generalization and versatility of RSFMs across diverse RS benchmarks. The resulting RSFMs achieve state-of-the-art performance across all benchmarks, underscoring the viability of scaling semi-supervised learning for RS applications. All datasets, code, and models will be released at https://github.com/MiliLab/S5

[590] Evaluating Multiple Instance Learning Strategies for Automated Sebocyte Droplet Counting

Maryam Adelipour, Gustavo Carneiro, Jeongkwon Kim

Main category: cs.CV

TL;DR: Simple attention-based MIL framework for automated sebocyte lipid droplet counting, where baseline MLP outperformed attention-based MIL in stability.

DetailsMotivation: Manual counting of sebocyte lipid droplets is labor-intensive and subjective, motivating automated solutions for sebocyte image analysis.

Method: Used Nile Red-stained sebocyte images annotated into 14 classes by droplet count, benchmarked baseline MLP vs attention-based MIL with ResNet-50 features and instance weighting.

Result: Baseline MLP achieved more stable performance (mean MAE = 5.6) vs attention-based MIL (mean MAE = 10.7), though MIL occasionally superior in specific folds.

Conclusion: Simple bag-level aggregation provides robust baseline for slide-level droplet counting, while attention-based MIL requires task-aligned pooling and regularization to reach full potential.

Abstract: Sebocytes are lipid-secreting cells whose differentiation is marked by the accumulation of intracellular lipid droplets, making their quantification a key readout in sebocyte biology. Manual counting is labor-intensive and subjective, motivating automated solutions. Here, we introduce a simple attention-based multiple instance learning (MIL) framework for sebocyte image analysis. Nile Red-stained sebocyte images were annotated into 14 classes according to droplet counts, expanded via data augmentation to about 50,000 cells. Two models were benchmarked: a baseline multi-layer perceptron (MLP) trained on aggregated patch-level counts, and an attention-based MIL model leveraging ResNet-50 features with instance weighting. Experiments using five-fold cross-validation showed that the baseline MLP achieved more stable performance (mean MAE = 5.6) compared with the attention-based MIL, which was less consistent (mean MAE = 10.7) but occasionally superior in specific folds. These findings indicate that simple bag-level aggregation provides a robust baseline for slide-level droplet counting, while attention-based MIL requires task-aligned pooling and regularization to fully realize its potential in sebocyte image analysis.

[591] DermAI: Clinical dermatology acquisition through quality-driven image collection for AI classification in mobile

Thales Bezerra, Emanoel Thyago, Kelvin Cunha, Rodrigo Abreu, Fábio Papais, Francisco Mauro, Natália Lopes, Érico Medeiros, Jéssica Guido, Shirley Cruz, Paulo Borba, Tsang Ing Ren

Main category: cs.CV

TL;DR: DermAI is a smartphone app for real-time skin lesion analysis that addresses AI dermatology limitations through on-device quality checks and local model adaptation using diverse clinical data.

DetailsMotivation: AI dermatology faces adoption barriers due to biased datasets, variable image quality, and limited validation, which DermAI aims to overcome.

Method: Lightweight smartphone application with real-time capture, annotation, classification, on-device quality checks, and local model adaptation using diverse clinical data.

Result: Models trained on public datasets failed to generalize to DermAI samples, but fine-tuning with local data improved performance significantly.

Conclusion: Standardized, diverse data collection aligned with healthcare needs is crucial for effective machine learning development in dermatology.

Abstract: AI-based dermatology adoption remains limited by biased datasets, variable image quality, and limited validation. We introduce DermAI, a lightweight, smartphone-based application that enables real-time capture, annotation, and classification of skin lesions during routine consultations. Unlike prior dermoscopy-focused tools, DermAI performs on-device quality checks, and local model adaptation. The DermAI clinical dataset, encompasses a wide range of skin tones, ethinicity and source devices. In preliminary experiments, models trained on public datasets failed to generalize to our samples, while fine-tuning with local data improved performance. These results highlight the importance of standardized, diverse data collection aligned with healthcare needs and oriented to machine learning development.

[592] Nearest Neighbor Projection Removal Adversarial Training

Himanshu Singh, A. V. Subramanyam, Shivank Rajput, Mohan Kankanhalli

Main category: cs.CV

TL;DR: A novel adversarial training framework that reduces inter-class feature overlap by projecting out inter-class dependencies, improving both robustness and clean accuracy.

DetailsMotivation: Standard adversarial training fails to explicitly address inter-class feature overlap, which is a significant contributor to adversarial vulnerability in deep neural networks.

Method: The framework identifies nearest inter-class neighbors for each adversarial sample and removes their projections in feature space to enforce stronger feature separability, with theoretical analysis showing reduced Lipschitz constant and Rademacher complexity.

Result: Extensive experiments on CIFAR-10, CIFAR-100, and SVHN show competitive performance with leading adversarial training techniques, achieving improvements in both robust and clean accuracy.

Conclusion: Explicitly addressing inter-class feature proximity is crucial for enhancing adversarial robustness in deep neural networks.

Abstract: Deep neural networks have exhibited impressive performance in image classification tasks but remain vulnerable to adversarial examples. Standard adversarial training enhances robustness but typically fails to explicitly address inter-class feature overlap, a significant contributor to adversarial susceptibility. In this work, we introduce a novel adversarial training framework that actively mitigates inter-class proximity by projecting out inter-class dependencies from adversarial and clean samples in the feature space. Specifically, our approach first identifies the nearest inter-class neighbors for each adversarial sample and subsequently removes projections onto these neighbors to enforce stronger feature separability. Theoretically, we demonstrate that our proposed logits correction reduces the Lipschitz constant of neural networks, thereby lowering the Rademacher complexity, which directly contributes to improved generalization and robustness. Extensive experiments across standard benchmarks including CIFAR-10, CIFAR-100, and SVHN show that our method demonstrates strong performance that is competitive with leading adversarial training techniques, highlighting significant achievements in both robust and clean accuracy. Our findings reveal the importance of addressing inter-class feature proximity explicitly to bolster adversarial robustness in DNNs.

[593] MonkeyOCR v1.5 Technical Report: Unlocking Robust Document Parsing for Complex Patterns

Jiarui Zhang, Yuliang Liu, Zijun Wu, Guosheng Pang, Zhili Ye, Yupei Zhong, Junteng Ma, Tao Wei, Haiyang Xu, Weikai Chen, Zeen Wang, Qiangjun Ji, Fanxi Zhou, Qi Zhang, Yuanrui Hu, Jiahao Liu, Zhang Li, Ziyang Zhang, Qiang Liu, Xiang Bai

Main category: cs.CV

TL;DR: MonkeyOCR v1.5 is a unified vision-language framework that improves document parsing through a two-stage pipeline for layout understanding and content recognition, with specialized modules for complex tables and cross-page structures.

DetailsMotivation: Real-world documents often have complex layouts with multi-level tables, embedded images/formulas, and cross-page structures that challenge existing OCR systems, requiring better unified parsing solutions.

Method: Two-stage pipeline: first stage uses large multimodal model for joint layout and reading order prediction; second stage performs localized recognition of text, formulas, and tables. Includes visual consistency-based reinforcement learning for tables and specialized modules for image-decoupled table parsing and type-guided table merging.

Result: Achieves state-of-the-art performance on OmniDocBench v1.5, outperforming PPOCR-VL and MinerU 2.5, with exceptional robustness in visually complex document scenarios.

Conclusion: MonkeyOCR v1.5 provides an effective unified framework for complex document parsing, demonstrating superior performance and robustness through its innovative two-stage approach and specialized table handling modules.

Abstract: Document parsing is a core task in document intelligence, supporting applications such as information extraction, retrieval-augmented generation, and automated document analysis. However, real-world documents often feature complex layouts with multi-level tables, embedded images or formulas, and cross-page structures, which remain challenging for existing OCR systems. We introduce MonkeyOCR v1.5, a unified vision-language framework that enhances both layout understanding and content recognition through a two-stage pipeline. The first stage employs a large multimodal model to jointly predict layout and reading order, leveraging visual information to ensure sequential consistency. The second stage performs localized recognition of text, formulas, and tables within detected regions, maintaining high visual fidelity while reducing error propagation. To address complex table structures, we propose a visual consistency-based reinforcement learning scheme that evaluates recognition quality via render-and-compare alignment, improving structural accuracy without manual annotations. Additionally, two specialized modules, Image-Decoupled Table Parsing and Type-Guided Table Merging, are introduced to enable reliable parsing of tables containing embedded images and reconstruction of tables crossing pages or columns. Comprehensive experiments on OmniDocBench v1.5 demonstrate that MonkeyOCR v1.5 achieves state-of-the-art performance, outperforming PPOCR-VL and MinerU 2.5 while showing exceptional robustness in visually complex document scenarios. A trial link can be found at https://github.com/Yuliang-Liu/MonkeyOCR .

[594] Tracing and Mitigating Hallucinations in Multimodal LLMs via Dynamic Attention Localization

Tiancheng Yang, Lin Zhang, Jiaye Lin, Guimin Hu, Di Wang, Lijie Hu

Main category: cs.CV

TL;DR: D-LEAF is a dynamic attention-guided method that identifies and corrects hallucinations in MLLMs by localizing problematic layers and attention heads, achieving significant improvements in captioning and VQA tasks with minimal overhead.

DetailsMotivation: MLLMs suffer from hallucinations where generated text conflicts with visual input, and existing methods fail to accurately localize where these errors originate in the model architecture.

Method: Proposes Layer Image Attention Entropy (LIAE) to flag anomalous layers and Image Attention Focus (IAF) to score attention heads, then uses Dynamic Layer-wise Entropy and Attention Fusion (D-LEAF) to dynamically correct errors during inference.

Result: 53% relative improvement on captioning benchmarks, and approximately 4% improvement in both accuracy and F1-score on VQA tasks, substantially suppressing hallucinations while preserving efficiency.

Conclusion: D-LEAF effectively localizes and corrects hallucinations in MLLMs through attention-based diagnostics, with theoretical justification connecting it to DPO, achieving state-of-the-art performance with negligible computational overhead.

Abstract: Multimodal Large Language Models (MLLMs) achieve strong performance on tasks like image captioning and visual question answering, but remain prone to hallucinations, where generated text conflicts with the visual input. Prior work links this partly to insufficient visual attention, but existing attention-based detectors and mitigation typically apply uniform adjustments across layers and heads, obscuring where errors originate. In this paper, we first show these methods fail to accurately localize problematic layers. Then, we introduce two diagnostics: Layer Image Attention Entropy (LIAE) which flags anomalous layers, and Image Attention Focus (IAF) which scores attention heads within those layers. Analysis shows that LIAE pinpoints faulty layers and IAF reliably ranks heads that warrant correction. Guided by these signals, we propose Dynamic Layer-wise Entropy and Attention Fusion (D-LEAF), a task-agnostic, attention-guided method that dynamically localizes and corrects errors during inference with negligible overhead. Furthermore, by establishing a connection between D-LEAF and DPO, we provide theoretical justification for the effectiveness of D-LEAF. Results show our D-LEAF delivers a 53% relative improvement on standard captioning benchmarks, and on VQA both accuracy and F1-score improve by approximately 4%, substantially suppressing hallucinations while preserving efficiency.

[595] A Style is Worth One Code: Unlocking Code-to-Style Image Generation with Discrete Style Space

Huijie Liu, Shuhao Cui, Haoxiang Cao, Shuai Ma, Kai Wu, Guoliang Kang

Main category: cs.CV

TL;DR: CoTyle enables generating novel visual styles from numerical codes by training a style codebook and autoregressive style generator, then using these embeddings to guide text-to-image diffusion models.

DetailsMotivation: Existing methods struggle with style consistency, limited creativity, and complex style representations. The paper aims to simplify style generation by proving that a single numerical code can represent and control visual styles.

Method: Train a discrete style codebook from image collections to extract style embeddings, then train an autoregressive style generator on these embeddings to model their distribution and synthesize novel style embeddings. Use these embeddings to condition a text-to-image diffusion model.

Result: CoTyle effectively turns numerical codes into style controllers, demonstrating that a style can be represented by one code. The method offers unparalleled simplicity and diversity in style generation.

Conclusion: A style is worth one numerical code. CoTyle unlocks a vast space of reproducible styles from minimal input, providing a novel approach to code-to-style image generation.

Abstract: Innovative visual stylization is a cornerstone of artistic creation, yet generating novel and consistent visual styles remains a significant challenge. Existing generative approaches typically rely on lengthy textual prompts, reference images, or parameter-efficient fine-tuning to guide style-aware image generation, but often struggle with style consistency, limited creativity, and complex style representations. In this paper, we affirm that a style is worth one numerical code by introducing the novel task, code-to-style image generation, which produces images with novel, consistent visual styles conditioned solely on a numerical style code. To date, this field has only been primarily explored by the industry (e.g., Midjourney), with no open-source research from the academic community. To fill this gap, we propose CoTyle, the first open-source method for this task. Specifically, we first train a discrete style codebook from a collection of images to extract style embeddings. These embeddings serve as conditions for a text-to-image diffusion model (T2I-DM) to generate stylistic images. Subsequently, we train an autoregressive style generator on the discrete style embeddings to model their distribution, allowing the synthesis of novel style embeddings. During inference, a numerical style code is mapped to a unique style embedding by the style generator, and this embedding guides the T2I-DM to generate images in the corresponding style. Unlike existing methods, our method offers unparalleled simplicity and diversity, unlocking a vast space of reproducible styles from minimal input. Extensive experiments validate that CoTyle effectively turns a numerical code into a style controller, demonstrating a style is worth one code.

[596] InsFusion: Rethink Instance-level LiDAR-Camera Fusion for 3D Object Detection

Zhongyu Xia, Hansong Yang, Yongtao Wang

Main category: cs.CV

TL;DR: InsFusion is a 3D object detection method that extracts proposals from both raw and fused features to query raw features, reducing accumulated errors in multi-view camera and LiDAR fusion through attention mechanisms.

DetailsMotivation: To address the problem of noise and error accumulation that occurs during feature extraction, perspective transformation, and feature fusion in 3D object detection from multi-view cameras and LiDAR.

Method: Proposes InsFusion which extracts proposals from both raw and fused features, uses these proposals to query raw features, and incorporates attention mechanisms on raw features to mitigate accumulated errors.

Result: Experiments on nuScenes dataset show InsFusion is compatible with various baseline methods and achieves new state-of-the-art performance for 3D object detection.

Conclusion: InsFusion effectively reduces accumulated errors in multi-sensor 3D object detection and delivers superior performance, making it a promising approach for autonomous driving applications.

Abstract: Three-dimensional Object Detection from multi-view cameras and LiDAR is a crucial component for autonomous driving and smart transportation. However, in the process of basic feature extraction, perspective transformation, and feature fusion, noise and error will gradually accumulate. To address this issue, we propose InsFusion, which can extract proposals from both raw and fused features and utilizes these proposals to query the raw features, thereby mitigating the impact of accumulated errors. Additionally, by incorporating attention mechanisms applied to the raw features, it thereby mitigates the impact of accumulated errors. Experiments on the nuScenes dataset demonstrate that InsFusion is compatible with various advanced baseline methods and delivers new state-of-the-art performance for 3D object detection.

[597] Algorithms Trained on Normal Chest X-rays Can Predict Health Insurance Types

Chi-Yu Chen, Rawan Abulibdeh, Arash Asgari, Leo Anthony Celi, Deirdre Goode, Hassan Hamidi, Laleh Seyyed-Kalantari, Ned McCague, Thomas Sounack, Po-Chih Kuo

Main category: cs.CV

TL;DR: Deep learning models can predict patients’ health insurance type (proxy for socioeconomic status) from normal chest X-rays with significant accuracy, revealing embedded social inequality signals in medical imaging data.

DetailsMotivation: To investigate whether AI models can detect invisible traces of social inequality and socioeconomic status from medical images, challenging the assumption that medical images are neutral biological data.

Method: Used state-of-the-art architectures (DenseNet121, SwinV2-B, MedMamba) trained on chest X-rays from MIMIC-CXR-JPG and CheXpert datasets, with patch-based occlusion analysis to localize signals.

Result: Models achieved AUC around 0.67-0.68 predicting health insurance type, with signal persisting after controlling for age, race, and sex, and remaining detectable when trained on single racial groups. Signal was diffuse in upper and mid-thoracic regions.

Conclusion: Medical AI models learn socioeconomic segregation from clinical data, requiring reframing of fairness approaches to interrogate and disentangle embedded social fingerprints rather than just balancing datasets.

Abstract: Artificial intelligence is revealing what medicine never intended to encode. Deep vision models, trained on chest X-rays, can now detect not only disease but also invisible traces of social inequality. In this study, we show that state-of-the-art architectures (DenseNet121, SwinV2-B, MedMamba) can predict a patient’s health insurance type, a strong proxy for socioeconomic status, from normal chest X-rays with significant accuracy (AUC around 0.67 on MIMIC-CXR-JPG, 0.68 on CheXpert). The signal persists even when age, race, and sex are controlled for, and remains detectable when the model is trained exclusively on a single racial group. Patch-based occlusion reveals that the signal is diffuse rather than localized, embedded in the upper and mid-thoracic regions. This suggests that deep networks may be internalizing subtle traces of clinical environments, equipment differences, or care pathways; learning socioeconomic segregation itself. These findings challenge the assumption that medical images are neutral biological data. By uncovering how models perceive and exploit these hidden social signatures, this work reframes fairness in medical AI: the goal is no longer only to balance datasets or adjust thresholds, but to interrogate and disentangle the social fingerprints embedded in clinical data itself.

[598] Motion-Aware Transformer for Multi-Object Tracking

Xu Yang, Gady Agam

Main category: cs.CV

TL;DR: MATR introduces a motion-aware Transformer for multi-object tracking that explicitly predicts object movements to update track queries, reducing query conflicts and improving both detection and association performance.

DetailsMotivation: Existing DETR-based MOT frameworks process detection and tracking queries jointly in a single Transformer Decoder layer, causing conflicts and degraded association accuracy due to complex object motions in crowded scenes.

Method: MATR uses a Motion-Aware Transformer that explicitly predicts object movements across frames to update track queries in advance, reducing query collisions and enabling more consistent training.

Result: MATR achieves state-of-the-art results: 71.3 HOTA on DanceTrack (9+ point improvement over MOTR), 72.2 HOTA on SportsMOT, and 54.7 mTETA/41.6 mHOTA on BDD100k without external datasets.

Conclusion: Explicitly modeling motion within end-to-end Transformers provides a simple yet highly effective approach to advancing multi-object tracking performance.

Abstract: Multi-object tracking (MOT) in videos remains challenging due to complex object motions and crowded scenes. Recent DETR-based frameworks offer end-to-end solutions but typically process detection and tracking queries jointly within a single Transformer Decoder layer, leading to conflicts and degraded association accuracy. We introduce the Motion-Aware Transformer (MATR), which explicitly predicts object movements across frames to update track queries in advance. By reducing query collisions, MATR enables more consistent training and improves both detection and association. Extensive experiments on DanceTrack, SportsMOT, and BDD100k show that MATR delivers significant gains across standard metrics. On DanceTrack, MATR improves HOTA by more than 9 points over MOTR without additional data and reaches a new state-of-the-art score of 71.3 with supplementary data. MATR also achieves state-of-the-art results on SportsMOT (72.2 HOTA) and BDD100k (54.7 mTETA, 41.6 mHOTA) without relying on external datasets. These results demonstrate that explicitly modeling motion within end-to-end Transformers offers a simple yet highly effective approach to advancing multi-object tracking.

[599] Attention Surgery: An Efficient Recipe to Linearize Your Video Diffusion Transformer

Mohsen Ghafoorian, Denis Korzhenkov, Amirhossein Habibian

Main category: cs.CV

TL;DR: Attention Surgery enables efficient linear or hybrid attention in pretrained video diffusion models without retraining from scratch, achieving competitive quality while improving computational efficiency for longer videos.

DetailsMotivation: Transformer-based video diffusion models suffer from quadratic self-attention costs that make long sequences and high resolutions computationally expensive, while previous linear attention approaches fail to match softmax attention expressiveness without costly retraining.

Method: Combines hybrid attention (mixing softmax and linear tokens) with lightweight distillation and fine-tuning, plus a cost-aware block-rate strategy to balance expressiveness and efficiency across layers.

Result: Applied to Wan2.1 1.3B VDM, achieves competitive results on VBench, VBench2.0 and human preference studies, with notable improvements in on-mobile latency, memory usage, and FLOPs for longer videos.

Conclusion: Attention Surgery provides an efficient framework for enabling linear/hybrid attention in pretrained VDMs, eliminating training from scratch while maintaining quality and improving scaling behavior.

Abstract: Transformer-based video diffusion models (VDMs) deliver state-of-the-art video generation quality but are constrained by the quadratic cost of self-attention, making long sequences and high resolutions computationally expensive. While linear attention offers sub-quadratic complexity, previous approaches have failed to match the expressiveness of softmax attention unless retrained at significant computational cost. We introduce Attention Surgery, an efficient framework that enables linear or hybrid attention in pretrained VDMs, eliminating the need for training from scratch. Inspired by recent advances in language models, our method combines a novel hybrid attention mechanism-mixing softmax and linear tokens-with a lightweight distillation and fine-tuning pipeline requiring only a few GPU-days. Additionally, we incorporate a cost-aware block-rate strategy to balance expressiveness and efficiency across layers. Applied to Wan2.1 1.3B, a state-of-the-art efficient transformer VDM and evaluated on VBench, VBench2.0 and a human preference study, Attention Surgery achieves competitive results. Furthermore, measurements of on-mobile latency, memory usage, and FLOPs demonstrate notable improvements in scaling behavior for longer videos. Project page is available at: https://qualcomm-ai-research.github.io/attention-surgery.

[600] Physics Knowledge in Frontier Models: A Diagnostic Study of Failure Modes

Ieva Bagdonaviciute, Vibhav Vineet

Main category: cs.CV

TL;DR: VLMs achieve benchmark success for wrong reasons - weak correlation between subtest mastery and accuracy shows models answer correctly without proper grounding in perception or physics.

DetailsMotivation: Traditional benchmarks only evaluate what models answer correctly, not why they succeed or fail on complex reasoning tasks, making it difficult to understand VLM capabilities.

Method: Introduced custom subtests for Physion/Physion++ and integrated existing categories for CLEVRER to isolate perception (object, color, occlusion) and physics understanding (motion prediction, spatial reasoning) capabilities.

Result: Counterintuitive weak correlation between subtest mastery and benchmark accuracy - models often answer correctly without proper grounding in perception or physics.

Conclusion: Current VLMs sometimes achieve benchmark scores for wrong reasons, highlighting need for diagnostics that expose hidden failure modes beyond aggregate metrics.

Abstract: While recent Vision-Language Models (VLMs) have achieved impressive progress, it remains difficult to determine why they succeed or fail on complex reasoning tasks. Traditional benchmarks evaluate what models can answer correctly, not why they succeed or fail. In this work, we perform a failure-mode analysis of six frontier VLMs on three physics-based benchmarks - Physion, Physion++, and CLEVRER - by introducing custom subtests (for Physion and Physion++) and an integration of existing benchmark categories (for CLEVRER) to factor benchmark performance into distinct, testable capabilities. These subtests isolate perception (object, color, and occlusion recognition) and physics understanding (motion prediction and spatial reasoning), enabling us to test whether models attend to the correct entities and dynamics underlying their answers. Counterintuitively, subtest mastery correlates only weakly with benchmark accuracy: models often answer correctly without grounding in perception or physics. This suggests that current VLMs sometimes achieve benchmark scores for the wrong reasons, underscoring the need for diagnostics that expose hidden failure modes beyond aggregate metrics.

[601] MONKEY: Masking ON KEY-Value Activation Adapter for Personalization

James Baker

Main category: cs.CV

TL;DR: A method that uses automatically generated masks from IP-Adapter to restrict image tokens to the subject area, allowing text prompts to better control the background generation in personalized diffusion models.

DetailsMotivation: Personalized diffusion models often recreate the subject image while ignoring the text prompt, limiting control over background and scene composition.

Method: Uses IP-Adapter’s automatically generated masks in a second pass to mask image tokens, restricting them to the subject area so text prompts can attend to the background.

Result: Produces images that accurately depict the subject while definitively matching the prompt, with high prompt and source image alignment validated through user study.

Conclusion: The proposed masking approach effectively balances subject preservation and prompt adherence in personalized image generation.

Abstract: Personalizing diffusion models allows users to generate new images that incorporate a given subject, allowing more control than a text prompt. These models often suffer somewhat when they end up just recreating the subject image and ignoring the text prompt. We observe that one popular method for personalization, IP-Adapter, automatically generates masks that segment the subject from the background during inference. We propose to use this automatically generated mask on a second pass to mask the image tokens, thus restricting them to the subject, not the background, allowing the text prompt to attend to the rest of the image. For text prompts describing locations and places, this produces images that accurately depict the subject while definitively matching the prompt. We compare our method to a few other test time personalization methods, and find our method displays high prompt and source image alignment. We also perform a user study to validate whether end users would appreciate our method. Code available at https://github.com/jamesBaker361/monkey

[602] On the Use of Hierarchical Vision Foundation Models for Low-Cost Human Mesh Recovery and Pose Estimation

Shuhei Tarashima, Yushan Wang, Norio Tagawa

Main category: cs.CV

TL;DR: Proposes truncated hierarchical vision foundation models for efficient human mesh recovery and pose estimation, achieving comparable performance to full models with better computational efficiency.

DetailsMotivation: To develop simple and efficient models for human mesh recovery and pose estimation by leveraging early stages of hierarchical vision foundation models, addressing the computational inefficiency of large non-hierarchical transformers.

Method: Constructs lightweight variants by adapting ViTPose models and proposes using only first 2-3 stages of hierarchical VFMs (Swin Transformer, GroupMixFormer, VMamba) as encoders, based on observation that intermediate stages produce feature maps with comparable resolutions.

Result: Comprehensive evaluation of 27 hierarchical-VFM-based models shows truncated models achieve performance on par with full-stage models while exhibiting better accuracy-computation trade-offs than existing lightweight alternatives.

Conclusion: Truncated hierarchical vision foundation models provide an effective approach for efficient human mesh recovery and pose estimation, offering superior computational efficiency without sacrificing performance.

Abstract: In this work, we aim to develop simple and efficient models for human mesh recovery (HMR) and its predecessor task, human pose estimation (HPE). State-of-the-art HMR methods, such as HMR2.0 and its successors, rely on large, non-hierarchical vision transformers as encoders, which are inherited from the corresponding HPE models like ViTPose. To establish baselines across varying computational budgets, we first construct three lightweight HMR2.0 variants by adapting the corresponding ViTPose models. In addition, we propose leveraging the early stages of hierarchical vision foundation models (VFMs), including Swin Transformer, GroupMixFormer, and VMamba, as encoders. This design is motivated by the observation that intermediate stages of hierarchical VFMs produce feature maps with resolutions comparable to or higher than those of non-hierarchical counterparts. We conduct a comprehensive evaluation of 27 hierarchical-VFM-based HMR and HPE models, demonstrating that using only the first two or three stages achieves performance on par with full-stage models. Moreover, we show that the resulting truncated models exhibit better trade-offs between accuracy and computational efficiency compared to existing lightweight alternatives. The source code is available at https://github.com/nttcom/TruncHierVFM.

[603] REALM: An MLLM-Agent Framework for Open World 3D Reasoning Segmentation and Editing on Gaussian Splatting

Changyue Shi, Minghao Chen, Yiping Mao, Chuxiao Yang, Xinyuan Hu, Jiajun Ding, Zhou Yu

Main category: cs.CV

TL;DR: REALM is an MLLM-agent framework that bridges 2D vision-language reasoning with 3D spatial understanding for open-world reasoning-based segmentation using 3D Gaussian Splatting representations.

DetailsMotivation: Existing 3D segmentation methods struggle with ambiguous, reasoning-based instructions, while 2D vision-language models lack 3D spatial understanding, creating a gap between complex human instructions and precise 3D object grounding.

Method: Uses 3D Gaussian Splatting representations to render photorealistic views for MLLM comprehension, with a Global-to-Local Spatial Grounding strategy: multiple global views for coarse localization followed by close-up views for fine-grained segmentation.

Result: Achieves remarkable performance in interpreting both explicit and implicit instructions across LERF, 3D-OVS, and REALM3D benchmarks, and supports various 3D interaction tasks including object removal, replacement, and style transfer.

Conclusion: REALM effectively bridges the gap between complex instructions and 3D object grounding without extensive 3D-specific training, demonstrating practical utility and versatility in 3D interaction tasks.

Abstract: Bridging the gap between complex human instructions and precise 3D object grounding remains a significant challenge in vision and robotics. Existing 3D segmentation methods often struggle to interpret ambiguous, reasoning-based instructions, while 2D vision-language models that excel at such reasoning lack intrinsic 3D spatial understanding. In this paper, we introduce REALM, an innovative MLLM-agent framework that enables open-world reasoning-based segmentation without requiring extensive 3D-specific post-training. We perform segmentation directly on 3D Gaussian Splatting representations, capitalizing on their ability to render photorealistic novel views that are highly suitable for MLLM comprehension. As directly feeding one or more rendered views to the MLLM can lead to high sensitivity to viewpoint selection, we propose a novel Global-to-Local Spatial Grounding strategy. Specifically, multiple global views are first fed into the MLLM agent in parallel for coarse-level localization, aggregating responses to robustly identify the target object. Then, several close-up novel views of the object are synthesized to perform fine-grained local segmentation, yielding accurate and consistent 3D masks. Extensive experiments show that REALM achieves remarkable performance in interpreting both explicit and implicit instructions across LERF, 3D-OVS, and our newly introduced REALM3D benchmarks. Furthermore, our agent framework seamlessly supports a range of 3D interaction tasks, including object removal, replacement, and style transfer, demonstrating its practical utility and versatility. Project page: https://ChangyueShi.github.io/REALM.

[604] Towards Imperceptible Watermarking Via Environment Illumination for Consumer Cameras

Hodaka Kawachi, Tomoya Nakamura, Hiroaki Santo, SaiKiran Kumar Tedla, Trevor Dalton Canham, Yasushi Yagi, Michael S. Brown

Main category: cs.CV

TL;DR: A method for creating invisible watermarks using LED lighting that are imperceptible to humans but detectable by cameras, using spectral modulation to embed metadata for privacy and content verification.

DetailsMotivation: To develop a privacy-preserving watermarking system that can embed metadata in videos without being visible to human observers, enabling content verification and privacy protection.

Method: Optimizes LED spectral profiles to be minimally visible to humans but highly detectable by cameras, using spectral modulation instead of intensity modulation, and works with standard frame rates (30-60 fps).

Result: Successfully embeds 128 bits of data within 10-second video clips, providing sufficient capacity for essential metadata while maintaining visual imperceptibility.

Conclusion: The approach enables practical invisible watermarking for consumer applications, supporting privacy protection and content verification without requiring high-speed cameras or visible light modifications.

Abstract: This paper introduces a method for using LED-based environmental lighting to produce visually imperceptible watermarks for consumer cameras. Our approach optimizes an LED light source’s spectral profile to be minimally visible to the human eye while remaining highly detectable by typical consumer cameras. The method jointly considers the human visual system’s sensitivity to visible spectra, modern consumer camera sensors’ spectral sensitivity, and narrowband LEDs’ ability to generate broadband spectra perceived as “white light” (specifically, D65 illumination). To ensure imperceptibility, we employ spectral modulation rather than intensity modulation. Unlike conventional visible light communication, our approach enables watermark extraction at standard low frame rates (30-60 fps). While the information transfer rate is modest-embedding 128 bits within a 10-second video clip-this capacity is sufficient for essential metadata supporting privacy protection and content verification.

[605] OmniNWM: Omniscient Driving Navigation World Models

Bohan Li, Zhuang Ma, Dalong Du, Baorui Peng, Zhujin Liang, Zhenqiang Liu, Chao Ma, Yueming Jin, Hao Zhao, Wenjun Zeng, Xin Jin

Main category: cs.CV

TL;DR: OmniNWM is a panoramic navigation world model that addresses state, action, and reward dimensions in autonomous driving through unified panoramic video generation, precise trajectory control, and 3D occupancy-based rewards.

DetailsMotivation: Existing autonomous driving world models are limited in state modalities, sequence length, action precision, and reward awareness, creating a need for a comprehensive solution that addresses all three core dimensions.

Method: Uses panoramic Plucker ray-map representation for precise trajectory control, generates multiple modalities (RGB, semantics, depth, 3D occupancy) with flexible forcing strategy for long-horizon generation, and leverages 3D occupancy for rule-based dense rewards.

Result: Achieves state-of-the-art performance in video generation, control accuracy, and long-horizon stability while providing reliable closed-loop evaluation through occupancy-grounded rewards.

Conclusion: OmniNWM successfully addresses the three core dimensions of autonomous driving world models in a unified framework, demonstrating superior performance across multiple metrics and enabling effective closed-loop evaluation.

Abstract: Autonomous driving world models are expected to work effectively across three core dimensions: state, action, and reward. Existing models, however, are typically restricted to limited state modalities, short video sequences, imprecise action control, and a lack of reward awareness. In this paper, we introduce OmniNWM, an omniscient panoramic navigation world model that addresses all three dimensions within a unified framework. For state, OmniNWM jointly generates panoramic videos of RGB, semantics, metric depth, and 3D occupancy. A flexible forcing strategy enables high-quality long-horizon auto-regressive generation. For action, we introduce a normalized panoramic Plucker ray-map representation that encodes input trajectories into pixel-level signals, enabling highly precise and generalizable control over panoramic video generation. Regarding reward, we move beyond learning reward functions with external image-based models: instead, we leverage the generated 3D occupancy to directly define rule-based dense rewards for driving compliance and safety. Extensive experiments demonstrate that OmniNWM achieves state-of-the-art performance in video generation, control accuracy, and long-horizon stability, while providing a reliable closed-loop evaluation framework through occupancy-grounded rewards. Project page is available at https://arlo0o.github.io/OmniNWM/.

[606] Toward A Better Understanding of Monocular Depth Evaluation

Siyang Wu, Jack Nugent, Willow Yang, Jia Deng

Main category: cs.CV

TL;DR: This paper analyzes evaluation metrics for monocular depth estimation, revealing their insensitivity to curvature perturbations and proposing a new metric based on relative surface normals with better human alignment.

DetailsMotivation: Current monocular depth evaluation lacks standardization and existing metrics have unclear trade-offs and behaviors, particularly their sensitivity to different types of ground truth perturbations compared to human judgment.

Method: Conducted quantitative analysis of existing metrics’ sensitivity to ground truth perturbations, introduced a new metric based on relative surface normals, developed new depth visualization tools, and created a principled method for composite metrics.

Result: Analysis revealed existing metrics are severely under-sensitive to curvature perturbations (making smooth surfaces bumpy), and the proposed relative surface normals metric shows better human alignment.

Conclusion: The paper provides improved evaluation methodology for monocular depth estimation through new metrics, visualization tools, and composite metric construction that better align with human perception.

Abstract: Monocular depth estimation is an important task with rapid progress, but how to evaluate it is not fully resolved, as evidenced by a lack of standardization in existing literature and a large selection of evaluation metrics whose trade-offs and behaviors are not fully understood. This paper contributes a novel, quantitative analysis of existing metrics in terms of their sensitivity to various types of perturbations of ground truth, emphasizing comparison to human judgment. Our analysis reveals that existing metrics are severely under-sensitive to curvature perturbation such as making smooth surfaces bumpy. To remedy this, we introduce a new metric based on relative surface normals, along with new depth visualization tools and a principled method to create composite metrics with better human alignment. Code and data are available at: https://github.com/princeton-vl/evalmde.

[607] LightFusion: A Light-weighted, Double Fusion Framework for Unified Multimodal Understanding and Generation

Zeyu Wang, Zilong Chen, Chenhui Gou, Feng Li, Chaorui Deng, Deyao Zhu, Kunchang Li, Weihao Yu, Haoqin Tu, Haoqi Fan, Cihang Xie

Main category: cs.CV

TL;DR: Efficient multimodal model fusion using pre-trained specialized models with interleaved multimodal self-attention blocks, achieving strong performance with minimal training.

DetailsMotivation: To create competitive multimodal models more efficiently by leveraging existing specialized models rather than training from scratch, reducing computational requirements.

Method: Strategic fusion of public generation and understanding models by retaining original blocks while interleaving multimodal self-attention blocks throughout networks, enabling double fusion mechanism.

Result: Achieved strong results with only ~35B tokens: 0.91 on GenEval, 82.16 on DPG-Bench, 6.06 on GEditBench, and 3.77 on ImgEdit-Bench across text-to-image generation and image editing tasks.

Conclusion: Proposed fusion approach enables efficient multimodal modeling with competitive performance while preserving original model strengths, with full code and model release to support future research.

Abstract: Unified multimodal models have recently shown remarkable gains in both capability and versatility, yet most leading systems are still trained from scratch and require substantial computational resources. In this paper, we show that competitive performance can be obtained far more efficiently by strategically fusing publicly available models specialized for either generation or understanding. Our key design is to retain the original blocks while additionally interleaving multimodal self-attention blocks throughout the networks. This double fusion mechanism (1) effectively enables rich multi-modal fusion while largely preserving the original strengths of the base models, and (2) catalyzes synergistic fusion of high-level semantic representations from the understanding encoder with low-level spatial signals from the generation encoder. By training with only ~ 35B tokens, this approach achieves strong results across multiple benchmarks: 0.91 on GenEval for compositional text-to-image generation, 82.16 on DPG-Bench for complex text-to-image generation, 6.06 on GEditBench, and 3.77 on ImgEdit-Bench for image editing. By fully releasing the entire suite of code, model weights, and datasets, we hope to support future research on unified multimodal modeling.

[608] Sketch2PoseNet: Efficient and Generalized Sketch to 3D Human Pose Prediction

Li Wang, Yiyu Zhuang, Yanwen Wang, Xun Cao, Chuan Guo, Xinxin Zuo, Hao Zhu

Main category: cs.CV

TL;DR: A novel approach for 3D human pose estimation from sketches using synthetic data generation and diffusion models, achieving superior accuracy and speed compared to previous methods.

DetailsMotivation: Traditional sketch-to-pose methods are limited by lack of large-scale sketch-3D pose annotations, relying on time-consuming optimization with heuristic rules that have poor generalizability.

Method: Uses ’learn from synthesis’ strategy: trains diffusion model to synthesize sketches from 2D poses, creates SKEP-120K synthetic dataset, combines 2D pose detectors with diffusion priors and feed-forward network, incorporates heuristic loss functions for geometric coherence.

Result: Model substantially surpasses previous methods in both estimation accuracy and speed for sketch-to-pose tasks, as shown by qualitative, quantitative, and subjective evaluations.

Conclusion: The proposed framework effectively addresses the challenges of sketch-based 3D pose estimation through synthetic data generation and data-driven approach, enabling accurate and efficient pose estimation from diverse sketch styles.

Abstract: 3D human pose estimation from sketches has broad applications in computer animation and film production. Unlike traditional human pose estimation, this task presents unique challenges due to the abstract and disproportionate nature of sketches. Previous sketch-to-pose methods, constrained by the lack of large-scale sketch-3D pose annotations, primarily relied on optimization with heuristic rules-an approach that is both time-consuming and limited in generalizability. To address these challenges, we propose a novel approach leveraging a “learn from synthesis” strategy. First, a diffusion model is trained to synthesize sketch images from 2D poses projected from 3D human poses, mimicking disproportionate human structures in sketches. This process enables the creation of a synthetic dataset, SKEP-120K, consisting of 120k accurate sketch-3D pose annotation pairs across various sketch styles. Building on this synthetic dataset, we introduce an end-to-end data-driven framework for estimating human poses and shapes from diverse sketch styles. Our framework combines existing 2D pose detectors and generative diffusion priors for sketch feature extraction with a feed-forward neural network for efficient 2D pose estimation. Multiple heuristic loss functions are incorporated to guarantee geometric coherence between the derived 3D poses and the detected 2D poses while preserving accurate self-contacts. Qualitative, quantitative, and subjective evaluations collectively show that our model substantially surpasses previous ones in both estimation accuracy and speed for sketch-to-pose tasks.

[609] RefVTON: person-to-person Try on with Additional Unpaired Visual Reference

Liuzhuozheng Li, Yue Gong, Shanyuan Liu, Bo Cheng, Yuhang Ma, Liebucha Wu, Dengyang Jiang, Zanyi Wang, Dawei Leng, Yuhui Yin

Main category: cs.CV

TL;DR: RefTON is a flux-based virtual try-on framework that uses unpaired reference images to enhance garment realism without needing complex auxiliary inputs like body parsing or warped masks.

DetailsMotivation: To simplify virtual try-on by eliminating the need for complex auxiliary inputs and structural guidance, while improving garment realism through reference-based learning inspired by human clothing selection behavior.

Method: Uses flux-based generation with unpaired visual references, directly generating try-on results from source image and target garment without structural guidance or auxiliary components.

Result: Achieves competitive or superior performance compared to state-of-the-art methods on public benchmarks while maintaining simple and efficient person-to-person design.

Conclusion: RefTON provides an effective virtual try-on solution that enhances garment realism through reference guidance while simplifying the overall framework architecture.

Abstract: We introduce RefTON, a flux-based person-to-person virtual try-on framework that enhances garment realism through unpaired visual references. Unlike conventional approaches that rely on complex auxiliary inputs such as body parsing and warped mask or require finely designed extract branches to process various input conditions, RefTON streamlines the process by directly generating try-on results from a source image and a target garment, without the need for structural guidance or auxiliary components to handle diverse inputs. Moreover, inspired by human clothing selection behavior, RefTON leverages additional reference images (the target garment worn on different individuals) to provide powerful guidance for refining texture alignment and maintaining the garment details. To enable this capability, we built a dataset containing unpaired reference images for training. Extensive experiments on public benchmarks demonstrate that RefTON achieves competitive or superior performance compared to state-of-the-art methods, while maintaining a simple and efficient person-to-person design.

[610] A Generative Adversarial Approach to Adversarial Attacks Guided by Contrastive Language-Image Pre-trained Model

Sampriti Soor, Alik Pramanick, Jothiprakash K, Arijit Sur

Main category: cs.CV

TL;DR: A generative adversarial attack method using CLIP model to create effective yet visually imperceptible perturbations that deceive multilabel classifiers while maintaining high structural similarity to original images.

DetailsMotivation: Deep learning models are vulnerable to adversarial attacks where subtle alterations can cause inaccurate predictions. Existing methods need improvement in creating effective perturbations that remain visually imperceptible.

Method: Integrates CLIP model’s text-image alignment with guided loss to incorporate natural language semantics. Combines concentrated perturbation strategy from SSAE with dissimilar text embeddings from GAMA for multi-object scene manipulation.

Result: The method achieves competitive or superior performance compared to existing techniques across various black-box victim models while preserving greater visual fidelity.

Conclusion: The proposed approach successfully generates effective adversarial examples that deceive classification models while maintaining high visual similarity to original inputs, demonstrating the power of integrating CLIP’s cross-modal capabilities with perturbation strategies.

Abstract: The rapid growth of deep learning has brought about powerful models that can handle various tasks, like identifying images and understanding language. However, adversarial attacks, an unnoticed alteration, can deceive models, leading to inaccurate predictions. In this paper, a generative adversarial attack method is proposed that uses the CLIP model to create highly effective and visually imperceptible adversarial perturbations. The CLIP model’s ability to align text and image representation helps incorporate natural language semantics with a guided loss to generate effective adversarial examples that look identical to the original inputs. This integration allows extensive scene manipulation, creating perturbations in multi-object environments specifically designed to deceive multilabel classifiers. Our approach integrates the concentrated perturbation strategy from Saliency-based Auto-Encoder (SSAE) with the dissimilar text embeddings similar to Generative Adversarial Multi-Object Scene Attacks (GAMA), resulting in perturbations that both deceive classification models and maintain high structural similarity to the original images. The model was tested on various tasks across diverse black-box victim models. The experimental results show that our method performs competitively, achieving comparable or superior results to existing techniques, while preserving greater visual fidelity.

[611] Language-Enhanced Generative Modeling for Amyloid PET Synthesis from MRI and Blood Biomarkers

Zhengjie Zhang, Xiaoxie Mao, Qihao Guo, Shaoting Zhang, Qi Huang, Mu Zhou, Fang Xie, Mianxin Liu

Main category: cs.CV

TL;DR: A language-enhanced generative model synthesizes amyloid-beta PET images from blood biomarkers and MRI, achieving high similarity to real PET scans and improving Alzheimer’s diagnosis accuracy.

DetailsMotivation: Amyloid-beta PET is expensive and inaccessible for Alzheimer's diagnosis, creating need for alternative methods using more accessible blood biomarkers and MRI.

Method: Developed LLM-driven multimodal fusion model to synthesize PET images from blood biomarkers and T1-weighted MRI scans from 566 participants, with automated diagnostic pipeline.

Result: Synthetic PET closely matches real PET (SSIM=0.920, r=0.955), achieves 80% diagnostic accuracy, and outperforms MRI-only (AUC=0.68) and biomarker-only (AUC=0.73) models with AUC=0.78.

Conclusion: Language-enhanced generative modeling enables realistic PET synthesis from accessible data, improving Alzheimer’s diagnostic workflow and spatial pattern assessment.

Abstract: Background: Alzheimer’s disease (AD) diagnosis heavily relies on amyloid-beta positron emission tomography (Abeta-PET), which is limited by high cost and limited accessibility. This study explores whether Abeta-PET spatial patterns can be predicted from blood-based biomarkers (BBMs) and MRI scans. Methods: We collected Abeta-PET images, T1-weighted MRI scans, and BBMs from 566 participants. A language-enhanced generative model, driven by a large language model (LLM) and multimodal information fusion, was developed to synthesize PET images. Synthesized images were evaluated for image quality, diagnostic consistency, and clinical applicability within a fully automated diagnostic pipeline. Findings: The synthetic PET images closely resemble real PET scans in both structural details (SSIM = 0.920 +/- 0.003) and regional patterns (Pearson’s r = 0.955 +/- 0.007). Diagnostic outcomes using synthetic PET show high agreement with real PET-based diagnoses (accuracy = 0.80). Using synthetic PET, we developed a fully automatic AD diagnostic pipeline integrating PET synthesis and classification. The synthetic PET-based model (AUC = 0.78) outperforms T1-based (AUC = 0.68) and BBM-based (AUC = 0.73) models, while combining synthetic PET and BBMs further improved performance (AUC = 0.79). Ablation analysis supports the advantages of LLM integration and prompt engineering. Interpretation: Our language-enhanced generative model synthesizes realistic PET images, enhancing the utility of MRI and BBMs for Abeta spatial pattern assessment and improving the diagnostic workflow for Alzheimer’s disease.

[612] Can Visual Input Be Compressed? A Visual Token Compression Benchmark for Large Multimodal Models

Tianfan Peng, Yuntao Du, Pengzhou Ji, Shijie Dong, Kailin Jiang, Mingchuan Ma, Yijun Tian, Jinhe Bi, Qian Li, Wei Du, Feng Xiao, Lizhen Cui

Main category: cs.CV

TL;DR: UniPruneBench is a unified benchmark for evaluating visual token pruning methods in multimodal LLMs, covering 6 ability dimensions, 10 datasets, 10 compression algorithms, and 3 LMM families with standardized protocols.

DetailsMotivation: Current token compression methods for multimodal models suffer from fragmented and inconsistent evaluation, making it difficult to compare different approaches and understand their true effectiveness.

Method: Developed UniPruneBench with standardized protocols across multiple dimensions, incorporating both task accuracy and system-level metrics (runtime, prefilling latency) to evaluate various pruning algorithms on different LMM families.

Result: Key findings: random pruning is surprisingly strong, no single method consistently outperforms others, pruning sensitivity varies by task (OCR most vulnerable), and pruning ratio is the dominant factor in performance degradation.

Conclusion: UniPruneBench provides a reliable foundation for future research on efficient multimodal modeling by offering standardized evaluation and revealing important insights about token pruning behavior.

Abstract: Large multimodal models (LMMs) often suffer from severe inference inefficiency due to the large number of visual tokens introduced by image encoders. While recent token compression methods, such as pruning and merging, have shown promise in reducing redundancy, their evaluation remains fragmented and inconsistent. In this work, we present UniPruneBench, a unified and extensible benchmark for visual token pruning in multimodal LLMs. UniPruneBench provides standardized protocols across six ability dimensions and ten datasets, covering ten representative compression algorithms and three families of LMMs (LLaVA-v1.5, Intern-VL3, and Qwen2.5-VL). Beyond task accuracy, it incorporates system-level metrics such as runtime and prefilling latency to provide a holistic view. Our experiments uncover several key findings: (1) random pruning is a surprisingly strong baseline, (2) no single method consistently outperforms others across scenarios, (3) pruning sensitivity varies significantly across tasks, with OCR being most vulnerable, and (4) pruning ratio is the dominant factor governing performance degradation. We believe UniPruneBench will serve as a reliable foundation for future research on efficient multimodal modeling.

[613] Global 3D Reconstruction of Clouds & Tropical Cyclones

Shirin Ermis, Cesar Aybar, Lilli Freischem, Stella Girtsou, Kyriaki-Margarita Bintsi, Emiliano Diaz Salas-Porras, Michael Eisinger, William Jones, Anna Jungbluth, Benoit Tremblay

Main category: cs.CV

TL;DR: A new framework using pre-training and fine-tuning translates 2D satellite imagery into 3D cloud maps for tropical cyclones, enabling global 3D cloud reconstruction even when observations are missing.

DetailsMotivation: Tropical cyclone forecasting is challenging due to limited satellite observations of TC structure and difficulties resolving cloud properties involved in intensification.

Method: Pre-training–fine-tuning pipeline that learns from multiple satellites with global coverage to translate 2D satellite imagery into 3D cloud maps of relevant cloud properties.

Result: First global instantaneous 3D cloud maps created, accurately reconstructing 3D structure of intense storms, extending available observations and providing estimates when observations are missing.

Conclusion: This framework is crucial for advancing understanding of TC intensification and improving forecasts by providing comprehensive 3D cloud structure data.

Abstract: Accurate forecasting of tropical cyclones (TCs) remains challenging due to limited satellite observations probing TC structure and difficulties in resolving cloud properties involved in TC intensification. Recent research has demonstrated the capabilities of machine learning methods for 3D cloud reconstruction from satellite observations. However, existing approaches have been restricted to regions where TCs are uncommon, and are poorly validated for intense storms. We introduce a new framework, based on a pre-training–fine-tuning pipeline, that learns from multiple satellites with global coverage to translate 2D satellite imagery into 3D cloud maps of relevant cloud properties. We apply our model to a custom-built TC dataset to evaluate performance in the most challenging and relevant conditions. We show that we can - for the first time - create global instantaneous 3D cloud maps and accurately reconstruct the 3D structure of intense storms. Our model not only extends available satellite observations but also provides estimates when observations are missing entirely. This is crucial for advancing our understanding of TC intensification and improving forecasts.

[614] Hybrid second-order gradient histogram based global low-rank sparse regression for robust face recognition

Hongxia Li, Ying Ji, Yongxin Dong, Yuehua Feng

Main category: cs.CV

TL;DR: Proposes H2H-GLRSR model combining hybrid second-order gradient histograms with global low-rank sparse regression for improved face recognition under occlusion and illumination variations.

DetailsMotivation: Existing low-rank sparse regression methods suffer from insufficient feature representation and limited modeling of structured corruption across samples in face recognition.

Method: Develops Histogram of Oriented Hessian (HOH) for second-order geometric features, fuses with first-order gradients to create H2H descriptor, and incorporates into extended SR_NMR model with global low-rank constraint on residual matrix.

Result: Significantly outperforms state-of-the-art regression-based classifiers in both recognition accuracy and computational efficiency on benchmark datasets.

Conclusion: The H2H-GLRSR model achieves superior discrimination and robustness by effectively capturing structural features and exploiting cross-sample correlations in structured noise.

Abstract: Low-rank sparse regression models have been widely adopted in face recognition due to their robustness against occlusion and illumination variations. However, existing methods often suffer from insufficient feature representation and limited modeling of structured corruption across samples. To address these issues, this paper proposes a Hybrid second-order gradient Histogram based Global Low-Rank Sparse Regression (H2H-GLRSR) model. First, we propose the Histogram of Oriented Hessian (HOH) to capture second-order geometric characteristics such as curvature and ridge patterns. By fusing HOH and first-order gradient histograms, we construct a unified local descriptor, termed the Hybrid second-order gradient Histogram (H2H), which enhances structural discriminability under challenging conditions. Subsequently, the H2H features are incorporated into an extended version of the Sparse Regularized Nuclear Norm based Matrix Regression (SR_NMR) model, where a global low-rank constraint is imposed on the residual matrix to exploit cross-sample correlations in structured noise. The resulting H2H-GLRSR model achieves superior discrimination and robustness. Experimental results on benchmark datasets demonstrate that the proposed method significantly outperforms state-of-the-art regression-based classifiers in both recognition accuracy and computational efficiency.

[615] Commonality in Few: Few-Shot Multimodal Anomaly Detection via Hypergraph-Enhanced Memory

Yuxuan Lin, Hanjing Yan, Xuan Tong, Yang Chang, Huanzhen Wang, Ziheng Zhou, Shuyong Gao, Yan Wang, Wenqiang Zhang

Main category: cs.CV

TL;DR: CIF is a few-shot multimodal industrial anomaly detection method that uses hypergraphs to extract structural commonality from limited training samples and employs memory banks for detection.

DetailsMotivation: Few-shot settings in industrial anomaly detection face challenges due to insufficient training samples that fail to cover diverse test patterns. Extracting structural commonality from limited samples can help mitigate this issue.

Method: Proposes CIF method with three modules: 1) Semantic-aware hypergraph construction for intra-class structural information, 2) Training-free hypergraph message passing to update test features, 3) Hyperedge-guided memory search using structural information to reduce false positives.

Result: Experimental results on MVTec 3D-AD and Eyecandies datasets show CIF outperforms state-of-the-art methods in few-shot settings.

Conclusion: CIF effectively addresses few-shot multimodal industrial anomaly detection by leveraging structural commonality through hypergraphs and memory banks, achieving superior performance compared to existing methods.

Abstract: Few-shot multimodal industrial anomaly detection is a critical yet underexplored task, offering the ability to quickly adapt to complex industrial scenarios. In few-shot settings, insufficient training samples often fail to cover the diverse patterns present in test samples. This challenge can be mitigated by extracting structural commonality from a small number of training samples. In this paper, we propose a novel few-shot unsupervised multimodal industrial anomaly detection method based on structural commonality, CIF (Commonality In Few). To extract intra-class structural information, we employ hypergraphs, which are capable of modeling higher-order correlations, to capture the structural commonality within training samples, and use a memory bank to store this intra-class structural prior. Firstly, we design a semantic-aware hypergraph construction module tailored for single-semantic industrial images, from which we extract common structures to guide the construction of the memory bank. Secondly, we use a training-free hypergraph message passing module to update the visual features of test samples, reducing the distribution gap between test features and features in the memory bank. We further propose a hyperedge-guided memory search module, which utilizes structural information to assist the memory search process and reduce the false positive rate. Experimental results on the MVTec 3D-AD dataset and the Eyecandies dataset show that our method outperforms the state-of-the-art (SOTA) methods in few-shot settings. Code is available at https://github.com/Sunny5250/CIF.

[616] NURBGen: High-Fidelity Text-to-CAD Generation through LLM-Driven NURBS Modeling

Muhammad Usama, Mohammad Sadil Khan, Didier Stricker, Muhammad Zeshan Afzal

Main category: cs.CV

TL;DR: NURBGen is the first framework that generates editable 3D CAD models directly from text using NURBS representations, outperforming previous methods in geometric fidelity and accuracy.

DetailsMotivation: Existing text-to-CAD systems either produce non-editable meshes or rely on scarce design-history data, creating a need for direct generation of editable CAD models from natural language.

Method: Fine-tune a large language model to translate text into JSON representations of NURBS surface parameters, using a hybrid representation combining untrimmed NURBS with analytic primitives to handle complex surfaces and reduce token complexity.

Result: NURBGen demonstrates strong performance on diverse prompts, surpassing prior methods in geometric fidelity and dimensional accuracy as confirmed by expert evaluations.

Conclusion: The framework successfully generates high-fidelity 3D CAD models directly from text using NURBS, with the code and partABC dataset to be released publicly.

Abstract: Generating editable 3D CAD models from natural language remains challenging, as existing text-to-CAD systems either produce meshes or rely on scarce design-history data. We present NURBGen, the first framework to generate high-fidelity 3D CAD models directly from text using Non-Uniform Rational B-Splines (NURBS). To achieve this, we fine-tune a large language model (LLM) to translate free-form texts into JSON representations containing NURBS surface parameters (\textit{i.e}, control points, knot vectors, degrees, and rational weights) which can be directly converted into BRep format using Python. We further propose a hybrid representation that combines untrimmed NURBS with analytic primitives to handle trimmed surfaces and degenerate regions more robustly, while reducing token complexity. Additionally, we introduce partABC, a curated subset of the ABC dataset consisting of individual CAD components, annotated with detailed captions using an automated annotation pipeline. NURBGen demonstrates strong performance on diverse prompts, surpassing prior methods in geometric fidelity and dimensional accuracy, as confirmed by expert evaluations. Code and dataset will be released publicly.

[617] Spatially-Aware Mixture of Experts with Log-Logistic Survival Modeling for Whole-Slide Images

Ardhendu Sekhar, Vasu Soni, Keshav Aske, Shivam Madnoorkar, Pranav Jeevan, Amit Sethi

Main category: cs.CV

TL;DR: A comprehensive computational pathology framework for survival prediction from whole-slide images using quantile-gated patch selection, graph-guided clustering, hierarchical context attention, and expert-driven mixture modeling.

DetailsMotivation: Accurate survival prediction from histopathology whole-slide images is challenging due to gigapixel resolution, spatial heterogeneity, and complex survival distributions.

Method: Four innovations: Quantile-Gated Patch Selection for relevant regions, Graph-Guided Clustering for patch grouping, Hierarchical Context Attention for local/global modeling, and Expert-Driven Mixture of Log-Logistics for survival distributions.

Result: State-of-the-art performance on TCGA cohorts with time-dependent concordance indices: 0.644 on LUAD, 0.751 on KIRC, and 0.752 on BRCA, outperforming histology-only and multimodal baselines.

Conclusion: The framework provides improved calibration and interpretability, advancing the use of whole-slide images for personalized cancer prognosis.

Abstract: Accurate survival prediction from histopathology whole-slide images (WSIs) remains challenging due to their gigapixel resolution, strong spatial heterogeneity, and complex survival distributions. We introduce a comprehensive computational pathology framework that addresses these limitations through four complementary innovations: (1) Quantile-Gated Patch Selection for dynamically identifying prognostically relevant regions, (2) Graph-Guided Clustering to group patches by spatial and morphological similarity, (3) Hierarchical Context Attention to model both local tissue interactions and global slide-level context, and (4) an Expert-Driven Mixture of Log-Logistics module that flexibly models complex survival distributions. Across large TCGA cohorts, our method achieves state-of-the-art performance, yielding time-dependent concordance indices of 0.644 on LUAD, 0.751 on KIRC, and 0.752 on BRCA, consistently outperforming both histology-only and multimodal baselines. The framework further provides improved calibration and interpretability, advancing the use of WSIs for personalized cancer prognosis.

[618] SFFR: Spatial-Frequency Feature Reconstruction for Multispectral Aerial Object Detection

Xin Zuo, Chenyu Qu, Haibo Zhan, Jifeng Shen, Wankou Yang

Main category: cs.CV

TL;DR: SFFR method uses KAN networks for spatial-frequency feature reconstruction in multispectral object detection, with FCEKAN for frequency component exchange and MSGKAN for multi-scale spatial feature modeling, achieving superior performance on UAV datasets.

DetailsMotivation: Current multispectral object detection methods focus mainly on spatial-domain feature fusion using CNNs or Transformers, while frequency-domain features remain underexplored despite their potential for complementary representations.

Method: Proposed SFFR method with two core modules: FCEKAN for selective frequency component exchange between RGB and IR images, and MSGKAN using multi-scale Gaussian basis functions to capture spatial feature variations at different UAV flight altitudes.

Result: Extensive experiments on SeaDroneSee, DroneVehicle and DVTOD datasets demonstrate superior performance and significant advantages in UAV multispectral object perception tasks.

Conclusion: The FCEKAN and MSGKAN modules are complementary, effectively capturing frequency and spatial semantic features respectively for better feature fusion in multispectral object detection.

Abstract: Recent multispectral object detection methods have primarily focused on spatial-domain feature fusion based on CNNs or Transformers, while the potential of frequency-domain feature remains underexplored. In this work, we propose a novel Spatial and Frequency Feature Reconstruction method (SFFR) method, which leverages the spatial-frequency feature representation mechanisms of the Kolmogorov-Arnold Network (KAN) to reconstruct complementary representations in both spatial and frequency domains prior to feature fusion. The core components of SFFR are the proposed Frequency Component Exchange KAN (FCEKAN) module and Multi-Scale Gaussian KAN (MSGKAN) module. The FCEKAN introduces an innovative selective frequency component exchange strategy that effectively enhances the complementarity and consistency of cross-modal features based on the frequency feature of RGB and IR images. The MSGKAN module demonstrates excellent nonlinear feature modeling capability in the spatial domain. By leveraging multi-scale Gaussian basis functions, it effectively captures the feature variations caused by scale changes at different UAV flight altitudes, significantly enhancing the model’s adaptability and robustness to scale variations. It is experimentally validated that our proposed FCEKAN and MSGKAN modules are complementary and can effectively capture the frequency and spatial semantic features respectively for better feature fusion. Extensive experiments on the SeaDroneSee, DroneVehicle and DVTOD datasets demonstrate the superior performance and significant advantages of the proposed method in UAV multispectral object perception task. Code will be available at https://github.com/qchenyu1027/SFFR.

[619] SportR: A Benchmark for Multimodal Large Language Model Reasoning in Sports

Haotian Xia, Haonan Ge, Junbo Zou, Hyun Woo Choi, Xuebin Zhang, Danny Suradja, Botao Rui, Ethan Tran, Wendy Jin, Zhen Ye, Xiyang Lin, Christopher Lai, Shengjie Zhang, Junwen Miao, Shichao Chen, Rhys Tracy, Vicente Ordonez, Weining Shen, Hanjie Chen

Main category: cs.CV

TL;DR: SportR is a multi-sports benchmark with 5,017 images and 2,101 videos, featuring progressive QA pairs and 7,118 human-authored Chain of Thought annotations to evaluate multimodal models’ sports reasoning capabilities.

DetailsMotivation: Current sports benchmarks lack detailed reasoning chains and precise visual grounding needed to evaluate core capabilities like nuanced visual perception, rule knowledge application, and visual evidence grounding in a multi-sport context.

Method: Created a large-scale benchmark with hierarchical QA pairs progressing from simple infraction identification to complex penalty prediction, incorporating both image and video modalities with manual bounding box annotations for visual grounding.

Result: State-of-the-art baseline models perform poorly on challenging tasks, and while training on SportR data improves scores, they remain relatively low, highlighting significant capability gaps in current multimodal models.

Conclusion: SportR presents a challenging benchmark that exposes limitations in current multimodal reasoning capabilities and provides a critical resource to drive future research in sports intelligence and multimodal reasoning.

Abstract: Deeply understanding sports requires an intricate blend of fine-grained visual perception and rule-based reasoning - a challenge that pushes the limits of current multimodal models. To succeed, models must master three critical capabilities: perceiving nuanced visual details, applying abstract sport rule knowledge, and grounding that knowledge in specific visual evidence. Current sports benchmarks either cover single sports or lack the detailed reasoning chains and precise visual grounding needed to robustly evaluate these core capabilities in a multi-sport context. To address this gap, we introduce SportR, the first multi-sports large-scale benchmark designed to train and evaluate MLLMs on the fundamental reasoning required for sports intelligence. Our benchmark provides a dataset of 5,017 images and 2,101 videos. To enable granular evaluation, we structure our benchmark around a progressive hierarchy of question-answer (QA) pairs designed to probe reasoning at increasing depths - from simple infraction identification to complex penalty prediction. For the most advanced tasks requiring multi-step reasoning, such as determining penalties or explaining tactics, we provide 7,118 high-quality, human-authored Chain of Thought (CoT) annotations. In addition, our benchmark incorporates both image and video modalities and provides manual bounding box annotations to test visual grounding in the image part directly. Extensive experiments demonstrate the profound difficulty of our benchmark. State-of-the-art baseline models perform poorly on our most challenging tasks. While training on our data via Supervised Fine-Tuning and Reinforcement Learning improves these scores, they remain relatively low, highlighting a significant gap in current model capabilities. SportR presents a new challenge for the community, providing a critical resource to drive future research in multimodal sports reasoning.

[620] Distillation Dynamics: Towards Understanding Feature-Based Distillation in Vision Transformers

Huiyuan Tian, Bonan Xu, Shijian Li

Main category: cs.CV

TL;DR: Feature distillation fails for ViTs due to representational mismatch between teacher and student models, unlike CNNs where it works well.

DetailsMotivation: To understand why feature-based knowledge distillation works for CNNs but fails for Vision Transformers, and identify the root causes of this negative transfer phenomenon.

Method: Developed ‘distillation dynamics’ analytical framework combining frequency spectrum analysis, information entropy metrics, and activation magnitude tracking to study ViT behavior during distillation.

Result: Found ViTs exhibit U-shaped information processing pattern and identified representational paradigm mismatch as root cause - teacher models use distributed high-dimensional encoding that students can’t replicate due to limited channel capacity.

Conclusion: Successful ViT distillation requires moving beyond naive feature mimicry to methods respecting representational constraints, providing theoretical guidance for effective ViT compression strategies.

Abstract: While feature-based knowledge distillation has proven highly effective for compressing CNNs, these techniques unexpectedly fail when applied to Vision Transformers (ViTs), often performing worse than simple logit-based distillation. We provide the first comprehensive analysis of this phenomenon through a novel analytical framework termed as “distillation dynamics”, combining frequency spectrum analysis, information entropy metrics, and activation magnitude tracking. Our investigation reveals that ViTs exhibit a distinctive U-shaped information processing pattern: initial compression followed by expansion. We identify the root cause of negative transfer in feature distillation: a fundamental representational paradigm mismatch between teacher and student models. Through frequency-domain analysis, we show that teacher models employ distributed, high-dimensional encoding strategies in later layers that smaller student models cannot replicate due to limited channel capacity. This mismatch causes late-layer feature alignment to actively harm student performance. Our findings reveal that successful knowledge transfer in ViTs requires moving beyond naive feature mimicry to methods that respect these fundamental representational constraints, providing essential theoretical guidance for designing effective ViTs compression strategies. All source code and experimental logs are provided at https://github.com/thy960112/Distillation-Dynamics.

[621] DANCE: Density-agnostic and Class-aware Network for Point Cloud Completion

Da-Yeong Kim, Yeong-Jun Cho

Main category: cs.CV

TL;DR: DANCE is a density-agnostic point cloud completion network that preserves observed geometry while completing missing regions using ray-based sampling and transformer refinement with semantic guidance.

DetailsMotivation: Existing point cloud completion methods assume fixed input/output densities or rely on image-based representations, making them unsuitable for real-world scenarios with variable sparsity and limited supervision.

Method: Uses ray-based sampling from multiple viewpoints to generate candidate points, then refines positions and predicts opacity scores via transformer decoder. Includes lightweight classification head for semantic guidance using geometric features.

Result: Outperforms state-of-the-art methods on PCN and MVP benchmarks in accuracy and structural consistency, while remaining robust to varying input densities and noise levels.

Conclusion: DANCE provides effective point cloud completion that preserves observed geometry and works well in real-world conditions with variable sparsity, without requiring external image supervision.

Abstract: Point cloud completion aims to recover missing geometric structures from incomplete 3D scans, which often suffer from occlusions or limited sensor viewpoints. Existing methods typically assume fixed input/output densities or rely on image-based representations, making them less suitable for real-world scenarios with variable sparsity and limited supervision. In this paper, we introduce Density-agnostic and Class-aware Network (DANCE), a novel framework that completes only the missing regions while preserving the observed geometry. DANCE generates candidate points via ray-based sampling from multiple viewpoints. A transformer decoder then refines their positions and predicts opacity scores, which determine the validity of each point for inclusion in the final surface. To incorporate semantic guidance, a lightweight classification head is trained directly on geometric features, enabling category-consistent completion without external image supervision. Extensive experiments on the PCN and MVP benchmarks show that DANCE outperforms state-of-the-art methods in accuracy and structural consistency, while remaining robust to varying input densities and noise levels.

[622] SynWeather: Weather Observation Data Synthesis across Multiple Regions and Variables via a General Diffusion Transformer

Kaiyi Xu, Junchao Gong, Zhiwang Zhou, Zhangrui Li, Yuandong Pu, Yihao Liu, Ben Fei, Fenghua Ling, Wenlong Zhang, Lei Bai

Main category: cs.CV

TL;DR: SynWeather is the first unified dataset for multi-region, multi-variable weather data synthesis, and SynWeatherDiff is a diffusion transformer model that addresses over-smoothing in weather synthesis.

DetailsMotivation: Current weather data synthesis approaches are limited to single-variable, single-region tasks using deterministic modeling, which restricts unified synthesis across variables and regions, overlooks cross-variable complementarity, and causes over-smoothed results.

Method: Introduced SynWeather dataset covering four regions (Continental US, Europe, East Asia, Tropical Cyclone regions) with high-resolution observations of key weather variables. Developed SynWeatherDiff, a probabilistic weather synthesis model based on Diffusion Transformer framework to address over-smoothing.

Result: Experiments on the SynWeather dataset demonstrated the effectiveness of SynWeatherDiff compared to both task-specific and general models.

Conclusion: SynWeather enables unified multi-region and multi-variable weather observation data synthesis, and SynWeatherDiff provides an effective probabilistic solution that overcomes limitations of current deterministic approaches.

Abstract: With the advancement of meteorological instruments, abundant data has become available. Current approaches are typically focus on single-variable, single-region tasks and primarily rely on deterministic modeling. This limits unified synthesis across variables and regions, overlooks cross-variable complementarity and often leads to over-smoothed results. To address above challenges, we introduce SynWeather, the first dataset designed for Unified Multi-region and Multi-variable Weather Observation Data Synthesis. SynWeather covers four representative regions: the Continental United States, Europe, East Asia, and Tropical Cyclone regions, as well as provides high-resolution observations of key weather variables, including Composite Radar Reflectivity, Hourly Precipitation, Visible Light, and Microwave Brightness Temperature. In addition, we introduce SynWeatherDiff, a general and probabilistic weather synthesis model built upon the Diffusion Transformer framework to address the over-smoothed problem. Experiments on the SynWeather dataset demonstrate the effectiveness of our network compared with both task-specific and general models.

[623] Mitigating Negative Flips via Margin Preserving Training

Simone Ricci, Niccolò Biondi, Federico Pernici, Alberto Del Bimbo

Main category: cs.CV

TL;DR: Proposes a method to reduce negative flips (inconsistencies) in AI model updates by preserving original model margins while learning improved models, using margin-calibration and double-source focal distillation.

DetailsMotivation: Minimizing inconsistencies across successive AI versions is crucial, especially negative flips where updated models misclassify previously correct samples. This worsens with more training classes over time as new categories reduce class margins and introduce conflicting patterns.

Method: Preserves original model margins while learning improved model. Uses explicit margin-calibration term on logits and integrates double-source focal distillation loss with previous model and new independently trained model to learn appropriate decision margins from both old and new data.

Result: Extensive experiments on image classification benchmarks show the approach consistently reduces negative flip rate while maintaining high overall accuracy.

Conclusion: The proposed method effectively mitigates negative flips in model updates by balancing margin preservation and new class learning through margin-calibration and dual-source distillation.

Abstract: Minimizing inconsistencies across successive versions of an AI system is as crucial as reducing the overall error. In image classification, such inconsistencies manifest as negative flips, where an updated model misclassifies test samples that were previously classified correctly. This issue becomes increasingly pronounced as the number of training classes grows over time, since adding new categories reduces the margin of each class and may introduce conflicting patterns that undermine their learning process, thereby degrading performance on the original subset. To mitigate negative flips, we propose a novel approach that preserves the margins of the original model while learning an improved one. Our method encourages a larger relative margin between the previously learned and newly introduced classes by introducing an explicit margin-calibration term on the logits. However, overly constraining the logit margin for the new classes can significantly degrade their accuracy compared to a new independently trained model. To address this, we integrate a double-source focal distillation loss with the previous model and a new independently trained model, learning an appropriate decision margin from both old and new data, even under a logit margin calibration. Extensive experiments on image classification benchmarks demonstrate that our approach consistently reduces the negative flip rate with high overall accuracy.

[624] WDT-MD: Wavelet Diffusion Transformers for Microaneurysm Detection in Fundus Images

Yifei Sun, Yuzhi He, Junhao Jia, Jinhong Wang, Ruiquan Ge, Changmiao Wang, Hongxia Xu

Main category: cs.CV

TL;DR: WDT-MD is a Wavelet Diffusion Transformer framework that addresses limitations in diffusion-based microaneurysm detection by preventing identity mapping, distinguishing MAs from other anomalies, and improving normal feature reconstruction.

DetailsMotivation: Current diffusion-based anomaly detection methods for microaneurysm screening suffer from identity mapping, inability to distinguish MAs from other anomalies, and poor reconstruction of normal features, limiting clinical application.

Method: Proposes WDT-MD with three innovations: noise-encoded image conditioning to avoid identity mapping, pseudo-normal pattern synthesis via inpainting for pixel-level supervision, and wavelet diffusion Transformer combining global modeling with multi-scale analysis.

Result: Outperforms state-of-the-art methods on IDRiD and e-ophtha MA datasets in both pixel-level and image-level microaneurysm detection.

Conclusion: The framework shows significant promise for improving early diabetic retinopathy screening by addressing fundamental limitations of existing diffusion-based approaches.

Abstract: Microaneurysms (MAs), the earliest pathognomonic signs of Diabetic Retinopathy (DR), present as sub-60 $μm$ lesions in fundus images with highly variable photometric and morphological characteristics, rendering manual screening not only labor-intensive but inherently error-prone. While diffusion-based anomaly detection has emerged as a promising approach for automated MA screening, its clinical application is hindered by three fundamental limitations. First, these models often fall prey to “identity mapping”, where they inadvertently replicate the input image. Second, they struggle to distinguish MAs from other anomalies, leading to high false positives. Third, their suboptimal reconstruction of normal features hampers overall performance. To address these challenges, we propose a Wavelet Diffusion Transformer framework for MA Detection (WDT-MD), which features three key innovations: a noise-encoded image conditioning mechanism to avoid “identity mapping” by perturbing image conditions during training; pseudo-normal pattern synthesis via inpainting to introduce pixel-level supervision, enabling discrimination between MAs and other anomalies; and a wavelet diffusion Transformer architecture that combines the global modeling capability of diffusion Transformers with multi-scale wavelet analysis to enhance reconstruction of normal retinal features. Comprehensive experiments on the IDRiD and e-ophtha MA datasets demonstrate that WDT-MD outperforms state-of-the-art methods in both pixel-level and image-level MA detection. This advancement holds significant promise for improving early DR screening.

[625] Spatio-Temporal Context Learning with Temporal Difference Convolution for Moving Infrared Small Target Detection

Houzhang Fang, Shukai Guo, Qiuhuan Chen, Yi Chang, Luxin Yan

Main category: cs.CV

TL;DR: TDCNet is a novel moving infrared small target detection network that combines temporal difference and 3D convolution through re-parameterized TDC blocks to capture multi-scale motion contextual features while suppressing background clutter.

DetailsMotivation: Moving IRSTD is challenging due to weak target features and complex background interference. Existing methods using temporal differences lack spatial feature extraction, while 3D convolutions lack explicit motion awareness.

Method: Proposes TDCNet with temporal difference convolution re-parameterization module containing three parallel TDC blocks that fuse temporal difference and 3D convolution. Also includes TDC-guided spatio-temporal attention mechanism for cross-attention between TDC-based and 3D backbone features.

Result: Extensive experiments on IRSTD-UAV and public infrared datasets show state-of-the-art detection performance in moving target detection.

Conclusion: TDCNet effectively captures spatio-temporal features for accurate moving infrared small target detection by combining temporal difference and 3D convolution approaches.

Abstract: Moving infrared small target detection (IRSTD) plays a critical role in practical applications, such as surveillance of unmanned aerial vehicles (UAVs) and UAV-based search system. Moving IRSTD still remains highly challenging due to weak target features and complex background interference. Accurate spatio-temporal feature modeling is crucial for moving target detection, typically achieved through either temporal differences or spatio-temporal (3D) convolutions. Temporal difference can explicitly leverage motion cues but exhibits limited capability in extracting spatial features, whereas 3D convolution effectively represents spatio-temporal features yet lacks explicit awareness of motion dynamics along the temporal dimension. In this paper, we propose a novel moving IRSTD network (TDCNet), which effectively extracts and enhances spatio-temporal features for accurate target detection. Specifically, we introduce a novel temporal difference convolution (TDC) re-parameterization module that comprises three parallel TDC blocks designed to capture contextual dependencies across different temporal ranges. Each TDC block fuses temporal difference and 3D convolution into a unified spatio-temporal convolution representation. This re-parameterized module can effectively capture multi-scale motion contextual features while suppressing pseudo-motion clutter in complex backgrounds, significantly improving detection performance. Moreover, we propose a TDC-guided spatio-temporal attention mechanism that performs cross-attention between the spatio-temporal features from the TDC-based backbone and a parallel 3D backbone. This mechanism models their global semantic dependencies to refine the current frame’s features. Extensive experiments on IRSTD-UAV and public infrared datasets demonstrate that our TDCNet achieves state-of-the-art detection performance in moving target detection.

[626] MMaDA-Parallel: Multimodal Large Diffusion Language Models for Thinking-Aware Editing and Generation

Ye Tian, Ling Yang, Jiongfan Yang, Anran Wang, Yu Tian, Jiani Zheng, Haochen Wang, Zhiyang Teng, Zhuochen Wang, Yinjie Wang, Yunhai Tong, Mengdi Wang, Xiangtai Li

Main category: cs.CV

TL;DR: Proposes MMaDA-Parallel, a parallel multimodal diffusion framework that addresses performance degradation in thinking-aware generation by enabling continuous bidirectional interaction between text and images, achieving 6.9% improvement in cross-modal alignment.

DetailsMotivation: Existing sequential, autoregressive approaches for thinking-aware generation can paradoxically degrade performance due to error propagation, particularly showing poor alignment between generated reasoning and final image outputs.

Method: Introduces MMaDA-Parallel framework with parallel multimodal diffusion for continuous bidirectional text-image interaction, trained with supervised finetuning and optimized via Parallel Reinforcement Learning (ParaRL) that applies semantic rewards along the trajectory.

Result: Achieves 6.9% improvement in Output Alignment on ParaBench benchmark compared to state-of-the-art model Bagel, significantly improving cross-modal alignment and semantic consistency.

Conclusion: Establishes a more robust paradigm for thinking-aware image synthesis by addressing error propagation through parallel multimodal interaction and trajectory-based reinforcement learning.

Abstract: While thinking-aware generation aims to improve performance on complex tasks, we identify a critical failure mode where existing sequential, autoregressive approaches can paradoxically degrade performance due to error propagation. To systematically analyze this issue, we propose ParaBench, a new benchmark designed to evaluate both text and image output modalities. Our analysis using ParaBench reveals that this performance degradation is strongly correlated with poor alignment between the generated reasoning and the final image. To resolve this, we propose a parallel multimodal diffusion framework, MMaDA-Parallel, that enables continuous, bidirectional interaction between text and images throughout the entire denoising trajectory. MMaDA-Parallel is trained with supervised finetuning and then further optimized by Parallel Reinforcement Learning (ParaRL), a novel strategy that applies semantic rewards along the trajectory to enforce cross-modal consistency. Experiments validate that our model significantly improves cross-modal alignment and semantic consistency, achieving a 6.9% improvement in Output Alignment on ParaBench compared to the state-of-the-art model, Bagel, establishing a more robust paradigm for thinking-aware image synthesis. Our code is open-sourced at https://github.com/tyfeld/MMaDA-Parallel

[627] PriVi: Towards A General-Purpose Video Model For Primate Behavior In The Wild

Felix B. Mueller, Jan F. Meier, Timo Lueddecke, Richard Vogg, Roger L. Freixanet, Valentin Hassler, Tiffany Bosshard, Elif Karakoc, William J. O’Hearn, Sofia M. Pereira, Sandro Sehner, Kaja Wierucka, Judith Burkart, Claudia Fichtel, Julia Fischer, Alexander Gail, Catherine Hobaiter, Julia Ostner, Liran Samuni, Oliver Schülke, Neda Shahidi, Erin G. Wessling, Alexander S. Ecker

Main category: cs.CV

TL;DR: PriVi is a large-scale primate-centric video dataset for pretraining computer vision models, enabling better generalization and data efficiency in primate behavior analysis compared to human-centric models.

DetailsMotivation: Existing computer vision methods for primate behavior analysis rely on human-centric pretrained models and single datasets, limiting generalization across different primate species and research settings.

Method: Created PriVi dataset with 424 hours of curated primate videos, pretrained V-JEPA model on this data, and evaluated using lightweight frozen classifier across four benchmark datasets.

Result: Outperformed prior work including fully finetuned baselines across all four datasets (ChimpACT, BaboonLand, PanAf500, ChimpBehave), with better scaling using fewer labels.

Conclusion: Primate-centric pretraining substantially improves data efficiency and generalization, making it promising for low-label applications in primate behavior research.

Abstract: Non-human primates are our closest living relatives, and analyzing their behavior is central to research in cognition, evolution, and conservation. Computer vision could greatly aid this research, but existing methods often rely on human-centric pretrained models and focus on single datasets, which limits generalization. We address this limitation by shifting from a model-centric to a data-centric approach and introduce PriVi, a large-scale primate-centric video pretraining dataset. PriVi contains 424 hours of curated video, combining 174 hours from behavioral research across 11 settings with 250 hours of diverse web-sourced footage, assembled through a scalable data curation pipeline. We pretrain V-JEPA, a large-scale video model, on PriVi to learn primate-specific representations and evaluate it using a lightweight frozen classifier. Across four benchmark datasets, ChimpACT, BaboonLand, PanAf500, and ChimpBehave, our approach consistently outperforms prior work, including fully finetuned baselines, and scales favorably with fewer labels. These results demonstrate that primate-centric pretraining substantially improves data efficiency and generalization, making it a promising approach for low-label applications. Code, models, and the majority of the dataset will be made available.

[628] Decoupling Bias, Aligning Distributions: Synergistic Fairness Optimization for Deepfake Detection

Feng Ding, Wenhui Yi, Yunpeng Zhou, Xinan He, Hong Rao, Shu Hu

Main category: cs.CV

TL;DR: A dual-mechanism framework that improves fairness in deepfake detection without sacrificing accuracy through structural fairness decoupling and global distribution alignment.

DetailsMotivation: Current fairness-enhanced deepfake detectors often improve fairness at the cost of detection accuracy, which is problematic for digital identity security applications where both fairness and accuracy are crucial.

Method: Proposes a dual-mechanism collaborative optimization framework that integrates structural fairness decoupling (decoupling demographic-sensitive channels at model architecture level) and global distribution alignment (reducing distance between overall sample distribution and demographic group distributions at feature level).

Result: Experimental results show the framework improves both inter-group and intra-group fairness while maintaining overall detection accuracy across domains, outperforming other methods.

Conclusion: The proposed framework successfully addresses the fairness-accuracy trade-off in deepfake detection, enabling more equitable deployment of detection models in digital identity security applications.

Abstract: Fairness is a core element in the trustworthy deployment of deepfake detection models, especially in the field of digital identity security. Biases in detection models toward different demographic groups, such as gender and race, may lead to systemic misjudgments, exacerbating the digital divide and social inequities. However, current fairness-enhanced detectors often improve fairness at the cost of detection accuracy. To address this challenge, we propose a dual-mechanism collaborative optimization framework. Our proposed method innovatively integrates structural fairness decoupling and global distribution alignment: decoupling channels sensitive to demographic groups at the model architectural level, and subsequently reducing the distance between the overall sample distribution and the distributions corresponding to each demographic group at the feature level. Experimental results demonstrate that, compared with other methods, our framework improves both inter-group and intra-group fairness while maintaining overall detection accuracy across domains.

[629] Physics informed Transformer-VAE for biophysical parameter estimation: PROSAIL model inversion in Sentinel-2 imagery

Prince Mensah, Pelumi Victor Aderinto, Ibrahim Salihu Yusuf, Arnu Pretorius

Main category: cs.CV

TL;DR: A physics-informed Transformer-VAE architecture that uses simulated data to invert the PROSAIL radiative transfer model for vegetation parameter estimation from Sentinel-2 data, achieving performance comparable to methods using real satellite imagery.

DetailsMotivation: To develop a cost-effective solution for vegetation biophysical variable retrieval that doesn't require real satellite images or in-situ labels for training, enabling global ecosystem monitoring without expensive data collection.

Method: Transformer-VAE architecture incorporating PROSAIL radiative transfer model as a differentiable physical decoder, trained exclusively on simulated data to ensure physically plausible leaf and canopy property inference.

Result: Achieved comparable accuracy to state-of-the-art methods using real imagery on field datasets (FRM4Veg and BelSAR) for retrieving LAI and canopy chlorophyll content, without requiring real satellite data or calibration.

Conclusion: The approach demonstrates that integrating physical models with deep networks enables effective RTM inversion using only simulated data, offering a self-supervised solution for large-scale vegetation monitoring.

Abstract: Accurate retrieval of vegetation biophysical variables from satellite imagery is crucial for ecosystem monitoring and agricultural management. In this work, we propose a physics-informed Transformer-VAE architecture to invert the PROSAIL radiative transfer model for simultaneous estimation of key canopy parameters from Sentinel-2 data. Unlike previous hybrid approaches that require real satellite images for self-supevised training. Our model is trained exclusively on simulated data, yet achieves performance on par with state-of-the-art methods that utilize real imagery. The Transformer-VAE incorporates the PROSAIL model as a differentiable physical decoder, ensuring that inferred latent variables correspond to physically plausible leaf and canopy properties. We demonstrate retrieval of leaf area index (LAI) and canopy chlorophyll content (CCC) on real-world field datasets (FRM4Veg and BelSAR) with accuracy comparable to models trained with real Sentinel-2 data. Our method requires no in-situ labels or calibration on real images, offering a cost-effective and self-supervised solution for global vegetation monitoring. The proposed approach illustrates how integrating physical models with advanced deep networks can improve the inversion of RTMs, opening new prospects for large-scale, physically-constrained remote sensing of vegetation traits.

[630] Viper-F1: Fast and Fine-Grained Multimodal Understanding with Cross-Modal State-Space Modulation

Quoc-Huy Trinh, Mustapha Abdullahi, Do Duy Hung Trinh, Bo Zhao, Debesh Jha

Main category: cs.CV

TL;DR: Viper-F1 is an efficient multimodal language model that replaces attention with state-space dynamics for linear-time inference while maintaining fine-grained visual understanding through token-grid correlation.

DetailsMotivation: Address the high computational cost of existing MLLMs and their poor performance on fine-grained reasoning tasks in resource-constrained scenarios like robotics and smart devices.

Method: Hybrid State-Space Vision-Language Model using Liquid State-Space Dynamics instead of attention, with Token-Grid Correlation Module for visual grounding via FiLM conditioning.

Result: Achieves accurate fine-grained understanding with significantly improved efficiency across multiple benchmarks while maintaining linear-time inference.

Conclusion: Viper-F1 provides an efficient alternative to attention-based MLLMs, enabling deployment in resource-constrained scenarios without sacrificing fine-grained visual reasoning capabilities.

Abstract: Recent advances in multimodal large language models (MLLMs) have enabled impressive progress in vision-language understanding, yet their high computational cost limits deployment in resource-constrained scenarios such as robotic manipulation, personal assistants, and smart cameras. Most existing methods rely on Transformer-based cross-attention, whose quadratic complexity hinders efficiency. Moreover, small vision-language models often struggle to precisely capture fine-grained, task-relevant visual regions, leading to degraded performance on fine-grained reasoning tasks that limit their effectiveness in the real world. To address these issues, we introduce Viper-F1, a Hybrid State-Space Vision-Language Model that replaces attention with efficient Liquid State-Space Dynamics. To further enhance visual grounding, we propose a Token-Grid Correlation Module, which computes lightweight correlations between text tokens and image patches and modulates the state-space dynamics via FiLM conditioning. This enables the model to selectively emphasize visual regions relevant to the textual prompt while maintaining linear-time inference. Experimental results across multiple benchmarks demonstrate that Viper-F1 achieves accurate, fine-grained understanding with significantly improved efficiency.

[631] Arcee: Differentiable Recurrent State Chain for Generative Vision Modeling with Mamba SSMs

Jitesh Chavan, Rohit Lal, Anand Kamat, Mengjia Xu

Main category: cs.CV

TL;DR: Arcee introduces cross-block recurrent state chains in Mamba models, reusing terminal state-space representations between blocks to improve performance in vision tasks without adding parameters or significant computational cost.

DetailsMotivation: Current Mamba models for vision reinitialize state-space dynamics from zero at each block, discarding valuable terminal state representations from previous blocks, which limits their modeling capacity.

Method: Arcee creates a differentiable boundary map that passes each block’s terminal state-space representation as the initial condition to the next block, enabling gradient flow across blocks while maintaining compatibility with existing vision-mamba variants.

Result: On CelebA-HQ 256x256 unconditional generation with Flow Matching, Arcee reduces FID from 82.81 to 15.33 (5.4x improvement) on a Zigzag Mamba baseline.

Conclusion: Arcee provides a parameter-free, efficient method to improve Mamba models for vision by preserving cross-block memory through recurrent state chains, significantly enhancing generation quality.

Abstract: State-space models (SSMs), Mamba in particular, are increasingly adopted for long-context sequence modeling, providing linear-time aggregation via an input-dependent, causal selective-scan operation. Along this line, recent “Mamba-for-vision” variants largely explore multiple scan orders to relax strict causality for non-sequential signals (e.g., images). Rather than preserving cross-block memory, the conventional formulation of the selective-scan operation in Mamba reinitializes each block’s state-space dynamics from zero, discarding the terminal state-space representation (SSR) from the previous block. Arcee, a cross-block recurrent state chain, reuses each block’s terminal state-space representation as the initial condition for the next block. Handoff across blocks is constructed as a differentiable boundary map whose Jacobian enables end-to-end gradient flow across terminal boundaries. Key to practicality, Arcee is compatible with all prior “vision-mamba” variants, parameter-free, and incurs constant, negligible cost. As a modeling perspective, we view terminal SSR as a mild directional prior induced by a causal pass over the input, rather than an estimator of the non-sequential signal itself. To quantify the impact, for unconditional generation on CelebA-HQ (256$\times$256) with Flow Matching, Arcee reduces FID$\downarrow$ from $82.81$ to $15.33$ ($5.4\times$ lower) on a single scan-order Zigzag Mamba baseline. Efficient CUDA kernels and training code will be released to support rigorous and reproducible research.

[632] CVChess: A Deep Learning Framework for Converting Chessboard Images to Forsyth-Edwards Notation

Luthira Abeykoon, Ved Patel, Gawthaman Senthilvelan, Darshan Kasundra

Main category: cs.CV

TL;DR: CVChess is a deep learning framework that converts physical chessboard images to Forsyth-Edwards Notation (FEN) using a CNN with residual layers for piece recognition, enabling chess engine assistance for physical games.

DetailsMotivation: To bridge the gap between digital and physical chess experiences by providing real-time assistance for physical chess games, similar to what's available in online platforms.

Method: Uses a CNN with residual layers for piece recognition from smartphone images, involving Hough Line Transform for edge detection, projective transform for board alignment, segmentation into 64 squares, and classification into 13 piece classes.

Result: The system successfully processes chessboard images and generates FEN notation that can be fed into chess engines to determine optimal moves.

Conclusion: CVChess provides an effective solution for bringing digital chess assistance to physical games, using computer vision and deep learning to bridge the analog-digital divide in chess.

Abstract: Chess has experienced a large increase in viewership since the pandemic, driven largely by the accessibility of online learning platforms. However, no equivalent assistance exists for physical chess games, creating a divide between analog and digital chess experiences. This paper presents CVChess, a deep learning framework for converting chessboard images to Forsyth-Edwards Notation (FEN), which is later input into online chess engines to provide you with the best next move. Our approach employs a convolutional neural network (CNN) with residual layers to perform piece recognition from smartphone camera images. The system processes RGB images of a physical chess board through a multistep process: image preprocessing using the Hough Line Transform for edge detection, projective transform to achieve a top-down board alignment, segmentation into 64 individual squares, and piece classification into 13 classes (6 unique white pieces, 6 unique black pieces and an empty square) using the residual CNN. Residual connections help retain low-level visual features while enabling deeper feature extraction, improving accuracy and stability during training. We train and evaluate our model using the Chess Recognition Dataset (ChessReD), containing 10,800 annotated smartphone images captured under diverse lighting conditions and angles. The resulting classifications are encoded as an FEN string, which can be fed into a chess engine to generate the most optimal move

cs.AI

[633] LLM-Generated Negative News Headlines Dataset: Creation and Benchmarking Against Real Journalism

Olusola Babalola, Bolanle Ojokoh, Olutayo Boyinbode

Main category: cs.AI

TL;DR: LLM-generated synthetic news headlines can effectively replace real-world data for NLP tasks, particularly for negative sentiment analysis, with minimal differences from real headlines.

DetailsMotivation: To overcome data acquisition challenges and privacy concerns associated with real-world data by using LLM-generated synthetic datasets for NLP tasks.

Method: Created synthetic negative news headlines using tailored LLM prompts, validated through expert review, and analyzed in embedding space. Compared with real headlines using multiple metrics including perplexity, readability, POS profiling, BERTScore, and semantic similarity.

Result: Synthetic headlines closely match real headlines across most metrics, with the only significant difference being in proper noun usage in POS profiling.

Conclusion: LLM-generated synthetic datasets are a viable alternative to real-world data for sentiment analysis tasks, addressing privacy and data acquisition issues while maintaining quality.

Abstract: This research examines the potential of datasets generated by Large Language Models (LLMs) to support Natural Language Processing (NLP) tasks, aiming to overcome challenges related to data acquisition and privacy concerns associated with real-world data. Focusing on negative valence text, a critical component of sentiment analysis, we explore the use of LLM-generated synthetic news headlines as an alternative to real-world data. A specialized corpus of negative news headlines was created using tailored prompts to capture diverse negative sentiments across various societal domains. The synthetic headlines were validated by expert review and further analyzed in embedding space to assess their alignment with real-world negative news in terms of content, tone, length, and style. Key metrics such as correlation with real headlines, perplexity, coherence, and realism were evaluated. The synthetic dataset was benchmarked against two sets of real news headlines using evaluations including the Comparative Perplexity Test, Comparative Readability Test, Comparative POS Profiling, BERTScore, and Comparative Semantic Similarity. Results show the generated headlines match real headlines with the only marked divergence being in the proper noun score of the POS profile test.

[634] CLINB: A Climate Intelligence Benchmark for Foundational Models

Michelle Chen Huebscher, Katharine Mach, Aleksandar Stanić, Markus Leippold, Ben Gaiarin, Zeke Hausfather, Elisa Rawat, Erich Fischer, Massimiliano Ciaramita, Joeri Rogelj, Christian Buck, Lierni Sestorain Saralegui, Reto Knutti

Main category: cs.AI

TL;DR: CLINB benchmark evaluates LLMs on climate change knowledge, revealing strong knowledge synthesis but poor grounding with high hallucination rates.

DetailsMotivation: To assess how LLMs handle complex, specialized knowledge in climate science through rigorous evaluation of knowledge quality and evidential support.

Method: Created CLINB benchmark using real users’ questions and expert-curated rubrics, implementing model-based evaluation of frontier models on multimodal question answering.

Result: Frontier models show PhD-level knowledge synthesis and outperform expert-assisted hybrid answers, but suffer from poor grounding with substantial hallucination in references and images.

Conclusion: Bridging the gap between knowledge synthesis and verifiable attribution is crucial for AI deployment in science, requiring reliable benchmarks like CLINB for trustworthy AI systems.

Abstract: Evaluating how Large Language Models (LLMs) handle complex, specialized knowledge remains a critical challenge. We address this through the lens of climate change by introducing CLINB, a benchmark that assesses models on open-ended, grounded, multimodal question answering tasks with clear requirements for knowledge quality and evidential support. CLINB relies on a dataset of real users’ questions and evaluation rubrics curated by leading climate scientists. We implement and validate a model-based evaluation process and evaluate several frontier models. Our findings reveal a critical dichotomy. Frontier models demonstrate remarkable knowledge synthesis capabilities, often exhibiting PhD-level understanding and presentation quality. They outperform “hybrid” answers curated by domain experts assisted by weaker models. However, this performance is countered by failures in grounding. The quality of evidence varies, with substantial hallucination rates for references and images. We argue that bridging this gap between knowledge synthesis and verifiable attribution is essential for the deployment of AI in scientific workflows and that reliable, interpretable benchmarks like CLINB are needed to progress towards building trustworthy AI systems.

[635] SynBullying: A Multi LLM Synthetic Conversational Dataset for Cyberbullying Detectio

Arefeh Kazemi, Hamza Qadeer, Joachim Wagner, Hossein Hosseini, Sri Balaaji Natarajan Kalaivendan, Brian Davis

Main category: cs.AI

TL;DR: SynBullying is a synthetic multi-LLM conversational dataset for cyberbullying detection that provides realistic bullying interactions through scalable and ethically safe data generation.

DetailsMotivation: To create a scalable and ethically safe alternative to human data collection for studying cyberbullying, addressing limitations of isolated post analysis and providing context-aware annotations.

Method: Leverage large language models (LLMs) to simulate realistic bullying interactions with conversational structure, context-aware annotations, and fine-grained labeling across various CB categories.

Result: The dataset was evaluated across five dimensions including conversational structure, lexical patterns, sentiment/toxicity, role dynamics, harm intensity, and CB-type distribution, and showed utility as both standalone training data and augmentation source for CB classification.

Conclusion: SynBullying provides an effective synthetic dataset for cyberbullying research that captures realistic conversational dynamics and enables comprehensive analysis of bullying behaviors.

Abstract: We introduce SynBullying, a synthetic multi-LLM conversational dataset for studying and detecting cyberbullying (CB). SynBullying provides a scalable and ethically safe alternative to human data collection by leveraging large language models (LLMs) to simulate realistic bullying interactions. The dataset offers (i) conversational structure, capturing multi-turn exchanges rather than isolated posts; (ii) context-aware annotations, where harmfulness is assessed within the conversational flow considering context, intent, and discourse dynamics; and (iii) fine-grained labeling, covering various CB categories for detailed linguistic and behavioral analysis. We evaluate SynBullying across five dimensions, including conversational structure, lexical patterns, sentiment/toxicity, role dynamics, harm intensity, and CB-type distribution. We further examine its utility by testing its performance as standalone training data and as an augmentation source for CB classification.

[636] CausalGuard: A Smart System for Detecting and Preventing False Information in Large Language Models

Piyushkumar Patel

Main category: cs.AI

TL;DR: CausalGuard is a new approach that combines causal reasoning with symbolic logic to detect and prevent hallucinations in large language models by understanding the causal chain leading to false statements and intervening early in the generation process.

DetailsMotivation: Large language models have a critical weakness of confidently stating false information (hallucinations), which has become a major barrier to using these models in accuracy-critical applications. Existing solutions require retraining, add significant computational costs, or miss the root causes of why hallucinations occur.

Method: CausalGuard works through two complementary paths: tracing causal relationships between what the model knows and what it generates, and checking logical consistency using automated reasoning. Unlike previous methods that only check outputs after generation, it understands the causal chain leading to false statements and intervenes early.

Result: Testing across twelve benchmarks showed CausalGuard correctly identifies hallucinations 89.3% of the time with only 8.3% missed hallucinations. It reduces false claims by nearly 80% while keeping responses natural and helpful, performing especially well on complex reasoning tasks requiring multiple logic steps.

Conclusion: CausalGuard effectively addresses the hallucination problem in LLMs by combining causal reasoning with symbolic logic, providing transparent reasoning processes that make it suitable for sensitive applications like medical diagnosis or financial analysis where understanding decision rationale is crucial.

Abstract: While large language models have transformed how we interact with AI systems, they have a critical weakness: they confidently state false information that sounds entirely plausible. This “hallucination” problem has become a major barrier to using these models where accuracy matters most. Existing solutions either require retraining the entire model, add significant computational costs, or miss the root causes of why these hallucinations occur in the first place. We present CausalGuard, a new approach that combines causal reasoning with symbolic logic to catch and prevent hallucinations as they happen. Unlike previous methods that only check outputs after generation, our system understands the causal chain that leads to false statements and intervenes early in the process. CausalGuard works through two complementary paths: one that traces causal relationships between what the model knows and what it generates, and another that checks logical consistency using automated reasoning. Testing across twelve different benchmarks, we found that CausalGuard correctly identifies hallucinations 89.3% of the time while missing only 8.3% of actual hallucinations. More importantly, it reduces false claims by nearly 80% while keeping responses natural and helpful. The system performs especially well on complex reasoning tasks where multiple steps of logic are required. Because CausalGuard shows its reasoning process, it works well in sensitive areas like medical diagnosis or financial analysis where understanding why a decision was made matters as much as the decision itself.

[637] Quantifying Skill and Chance: A Unified Framework for the Geometry of Games

David H. Silver

Main category: cs.AI

TL;DR: A quantitative framework that separates skill and chance in games by modeling them as complementary sources of control over stochastic decision trees, with applications to game design and AI evaluation.

DetailsMotivation: To provide a principled method for quantifying the relative contributions of skill and luck in games, enabling objective comparisons across different game types and applications in game design and AI development.

Method: Define Skill-Luck Index S(G) in [-1, 1] by decomposing game outcomes into skill leverage K and luck leverage L, and introduce volatility Sigma to quantify outcome uncertainty over successive turns.

Result: Applied to 30 games revealing a continuum from pure chance (coin toss, S = -1) to pure skill (chess, S = +1), with poker showing moderate skill dominance (S = 0.33). Backgammon shows balanced skill and chance (S = 0).

Conclusion: The framework enables principled comparisons of player influence, game balance, and predictive stability, with broad applications in game design, AI evaluation, and risk assessment for general stochastic decision systems.

Abstract: We introduce a quantitative framework for separating skill and chance in games by modeling them as complementary sources of control over stochastic decision trees. We define the Skill-Luck Index S(G) in [-1, 1] by decomposing game outcomes into skill leverage K and luck leverage L. Applying this to 30 games reveals a continuum from pure chance (coin toss, S = -1) through mixed domains such as backgammon (S = 0, Sigma = 1.20) to pure skill (chess, S = +1, Sigma = 0). Poker exhibits moderate skill dominance (S = 0.33) with K = 0.40 +/- 0.03 and Sigma = 0.80. We further introduce volatility Sigma to quantify outcome uncertainty over successive turns. The framework extends to general stochastic decision systems, enabling principled comparisons of player influence, game balance, and predictive stability, with applications to game design, AI evaluation, and risk assessment.

[638] Value-Aligned Prompt Moderation via Zero-Shot Agentic Rewriting for Safe Image Generation

Xin Zhao, Xiaojun Chen, Bingshan Liu, Zeyao Liu, Zhendong Zhao, Xiaoyan Gu

Main category: cs.AI

TL;DR: VALOR is a zero-shot agentic framework that uses layered prompt analysis and LLM-based rewriting to make text-to-image generation safer while preserving creativity and intent.

DetailsMotivation: Current generative vision-language models can produce unsafe, offensive, or culturally inappropriate content when prompted adversarially, and existing defenses struggle to align outputs with human values without sacrificing quality or incurring high costs.

Method: VALOR integrates layered prompt analysis with human-aligned value reasoning: multi-level NSFW detection, cultural value alignment, and intention disambiguation. When unsafe content is detected, prompts are selectively rewritten by an LLM under dynamic instructions, with optional stylistic regeneration if needed.

Result: Experiments show VALOR reduces unsafe outputs by up to 100.00% while preserving prompt usefulness and creativity across adversarial, ambiguous, and value-sensitive prompts.

Conclusion: VALOR provides a scalable and effective approach for deploying safe, aligned, and helpful image generation systems in open-world settings.

Abstract: Generative vision-language models like Stable Diffusion demonstrate remarkable capabilities in creative media synthesis, but they also pose substantial risks of producing unsafe, offensive, or culturally inappropriate content when prompted adversarially. Current defenses struggle to align outputs with human values without sacrificing generation quality or incurring high costs. To address these challenges, we introduce VALOR (Value-Aligned LLM-Overseen Rewriter), a modular, zero-shot agentic framework for safer and more helpful text-to-image generation. VALOR integrates layered prompt analysis with human-aligned value reasoning: a multi-level NSFW detector filters lexical and semantic risks; a cultural value alignment module identifies violations of social norms, legality, and representational ethics; and an intention disambiguator detects subtle or indirect unsafe implications. When unsafe content is detected, prompts are selectively rewritten by a large language model under dynamic, role-specific instructions designed to preserve user intent while enforcing alignment. If the generated image still fails a safety check, VALOR optionally performs a stylistic regeneration to steer the output toward a safer visual domain without altering core semantics. Experiments across adversarial, ambiguous, and value-sensitive prompts show that VALOR significantly reduces unsafe outputs by up to 100.00% while preserving prompt usefulness and creativity. These results highlight VALOR as a scalable and effective approach for deploying safe, aligned, and helpful image generation systems in open-world settings.

[639] Towards autonomous quantum physics research using LLM agents with access to intelligent tools

Sören Arlt, Xuemei Gu, Mario Krenn

Main category: cs.AI

TL;DR: AI-Mandel is an LLM agent that autonomously generates and implements creative scientific ideas in quantum physics, producing concrete experiment designs ready for laboratory implementation.

DetailsMotivation: To automate scientific idea generation and implementation in one coherent system, shifting the role of humans in the scientific process and accelerating scientific discovery.

Method: AI-Mandel formulates ideas from literature and uses domain-specific AI tools to convert them into concrete experiment designs that can be directly implemented in laboratories.

Result: The system generates scientifically interesting ideas including new quantum teleportation variations, quantum network primitives in indefinite causal orders, and new geometric phase concepts. Two ideas have already led to independent scientific follow-up papers.

Conclusion: AI-Mandel demonstrates a prototypical AI physicist capable of generating actionable ideas, accelerating science while revealing concrete challenges toward achieving human-level artificial scientists.

Abstract: Artificial intelligence (AI) is used in numerous fields of science, yet the initial research questions and targets are still almost always provided by human researchers. AI-generated creative ideas in science are rare and often vague, so that it remains a human task to execute them. Automating idea generation and implementation in one coherent system would significantly shift the role of humans in the scientific process. Here we present AI-Mandel, an LLM agent that can generate and implement ideas in quantum physics. AI-Mandel formulates ideas from the literature and uses a domain-specific AI tool to turn them into concrete experiment designs that can readily be implemented in laboratories. The generated ideas by AI-Mandel are often scientifically interesting - for two of them we have already written independent scientific follow-up papers. The ideas include new variations of quantum teleportation, primitives of quantum networks in indefinite causal orders, and new concepts of geometric phases based on closed loops of quantum information transfer. AI-Mandel is a prototypical demonstration of an AI physicist that can generate and implement concrete, actionable ideas. Building such a system is not only useful to accelerate science, but it also reveals concrete open challenges on the path to human-level artificial scientists.

[640] Learning to Refine: An Agentic RL Approach for Iterative SPARQL Query Construction

Floris Vossebeld, Shenghui Wang

Main category: cs.AI

TL;DR: A novel agentic framework using a compact 3B-parameter LLM trained via Reinforcement Learning (GRPO) to iteratively construct and debug SPARQL queries for multi-hop questions on Knowledge Graphs, achieving 49.7% accuracy on LC-QuAD 2.0.

DetailsMotivation: Current methods for SPARQL query generation lack adaptive policies to dynamically debug queries based on real-time execution feedback, hindering reliable interaction with structured data in Knowledge Graph Question Answering.

Method: An LLM agent learns a resilient policy through outcome-driven Reinforcement Learning (GRPO) without supervised fine-tuning, iteratively constructing SPARQL queries and systematically recovering from execution errors using real-time feedback.

Result: Achieved 49.7% accuracy post-entity-linking on LC-QuAD 2.0, representing a 17.5 percentage point improvement over the strongest iterative zero-shot baseline. Performance is enhanced by an explicit deliberative reasoning step.

Conclusion: The framework provides a generalizable blueprint for teaching agents to master formal, symbolic tools through interaction, bridging the gap between probabilistic LLMs and the structured world of Knowledge Graphs.

Abstract: Generating complex, logically-sound SPARQL queries for multi-hop questions remains a critical bottleneck for Knowledge Graph Question Answering, as the brittle nature of one-shot generation by Large Language Models (LLMs) hinders reliable interaction with structured data. Current methods lack the adaptive policies needed to dynamically debug queries based on real-time execution feedback. This paper introduces a novel agentic framework where an LLM learns a resilient policy for the sequential process of iterative SPARQL construction. We show that a compact 3B-parameter model, trained exclusively via outcome-driven Reinforcement Learning (GRPO) without supervised fine-tuning, can learn effective policies for this task, discovering how to systematically recover from execution errors and refine its queries toward a correct answer. On a curated, executable single-answer subset of LC-QuAD 2.0, our agent achieves 49.7% accuracy post-entity-linking, a significant 17.5 percentage point improvement over the strongest iterative zero-shot baseline. Further analysis reveals that while the agent’s capability is driven by RL, its performance is enhanced by an explicit deliberative reasoning step that acts as a cognitive scaffold to improve policy precision. This work presents a generalizable blueprint for teaching agents to master formal, symbolic tools through interaction, bridging the gap between probabilistic LLMs and the structured world of Knowledge Graphs.

[641] On the Measure of a Model: From Intelligence to Generality

Ruchira Dhar, Ninell Oldenburg, Anders Soegaard

Main category: cs.AI

TL;DR: The paper argues that AI evaluation should focus on generality rather than abstract intelligence benchmarks, showing that only generality withstands conceptual and empirical scrutiny as a stable foundation for assessing AI capabilities.

DetailsMotivation: Current intelligence benchmarks like ARC and Raven tests are misaligned with real-world utility, lack stable definitions, and fail to predict performance on practical tasks, risking misdirected AI development.

Method: Conceptual and formal analysis of three assumptions underlying intelligence-focused evaluation: generality, stability, and realism, examining which withstands scrutiny.

Result: Only generality proves conceptually and empirically sound; intelligence is not what enables generality, but rather generality should be understood as a multitask learning problem directly linking evaluation to performance breadth and reliability.

Conclusion: Evaluation should be grounded in generality rather than abstract intelligence, reframing AI progress assessment and establishing generality as a more stable foundation for evaluating capabilities across diverse and evolving tasks.

Abstract: Benchmarks such as ARC, Raven-inspired tests, and the Blackbird Task are widely used to evaluate the intelligence of large language models (LLMs). Yet, the concept of intelligence remains elusive- lacking a stable definition and failing to predict performance on practical tasks such as question answering, summarization, or coding. Optimizing for such benchmarks risks misaligning evaluation with real-world utility. Our perspective is that evaluation should be grounded in generality rather than abstract notions of intelligence. We identify three assumptions that often underpin intelligence-focused evaluation: generality, stability, and realism. Through conceptual and formal analysis, we show that only generality withstands conceptual and empirical scrutiny. Intelligence is not what enables generality; generality is best understood as a multitask learning problem that directly links evaluation to measurable performance breadth and reliability. This perspective reframes how progress in AI should be assessed and proposes generality as a more stable foundation for evaluating capability across diverse and evolving tasks.

[642] Do LLMs Really Struggle at NL-FOL Translation? Revealing their Strengths via a Novel Benchmarking Strategy

Andrea Brunello, Luca Geatti, Michele Mignani, Angelo Montanari, Nicola Saccomanno

Main category: cs.AI

TL;DR: LLMs show strong NL-FOL translation skills when evaluated properly, with dialogue-oriented models outperforming embedding-centric ones.

DetailsMotivation: First-Order Logic (FOL) is powerful for representing natural language concepts, but converting NL to FOL has been challenging. Recent LLMs promised breakthroughs but evaluation methods may misrepresent their actual capabilities.

Method: Proposed a novel evaluation protocol to distinguish genuine semantic-level logical understanding from superficial pattern recognition, memorization, and dataset contamination. Critically examined existing datasets and evaluation approaches.

Result: State-of-the-art dialogue-oriented LLMs demonstrate strong NL-FOL translation skills and genuine grasp of sentence-level logic, while embedding-centric models perform markedly worse.

Conclusion: Proper evaluation reveals that modern LLMs have strong capabilities in NL-FOL translation, with dialogue-oriented models showing particularly good performance in understanding sentence-level logic semantics.

Abstract: Due to its expressiveness and unambiguous nature, First-Order Logic (FOL) is a powerful formalism for representing concepts expressed in natural language (NL). This is useful, e.g., for specifying and verifying desired system properties. While translating FOL into human-readable English is relatively straightforward, the inverse problem, converting NL to FOL (NL-FOL translation), has remained a longstanding challenge, for both humans and machines. Although the emergence of Large Language Models (LLMs) promised a breakthrough, recent literature provides contrasting results on their ability to perform NL-FOL translation. In this work, we provide a threefold contribution. First, we critically examine existing datasets and protocols for evaluating NL-FOL translation performance, revealing key limitations that may cause a misrepresentation of LLMs’ actual capabilities. Second, to overcome these shortcomings, we propose a novel evaluation protocol explicitly designed to distinguish genuine semantic-level logical understanding from superficial pattern recognition, memorization, and dataset contamination. Third, using this new approach, we show that state-of-the-art, dialogue-oriented LLMs demonstrate strong NL-FOL translation skills and a genuine grasp of sentence-level logic, whereas embedding-centric models perform markedly worse.

[643] TopoPerception: A Shortcut-Free Evaluation of Global Visual Perception in Large Vision-Language Models

Wenhao Zhou, Hao Zheng, Rong Zhao

Main category: cs.AI

TL;DR: TopoPerception is a benchmark that uses topological properties to evaluate global visual perception in LVLMs, revealing that current models perform no better than random chance and more powerful models actually show worse performance.

DetailsMotivation: Current LVLMs have visual perception bottlenecks, and conventional benchmarks contain local shortcuts that overestimate models' perceptual abilities. There's a need to rigorously assess global visual perception capabilities.

Method: Developed TopoPerception benchmark leveraging topological properties that depend on global image structure and are invariant to local features, enabling shortcut-free assessment of global perception across various granularities.

Result: All evaluated state-of-the-art models performed no better than random chance at the coarsest perceptual granularity. More powerful models with stronger reasoning capabilities exhibited lower accuracy, suggesting scaling up models may exacerbate the deficit.

Conclusion: Current LVLMs have a profound inability to perceive global visual features. Merely scaling up models is insufficient and may worsen performance. New training paradigms or architectures are needed, and TopoPerception provides direction for improving global visual perception.

Abstract: Large Vision-Language Models (LVLMs) typically align visual features from an encoder with a pre-trained Large Language Model (LLM). However, this makes the visual perception module a bottleneck, which constrains the overall capabilities of LVLMs. Conventional evaluation benchmarks, while rich in visual semantics, often contain unavoidable local shortcuts that can lead to an overestimation of models’ perceptual abilities. Here, we introduce TopoPerception, a benchmark that leverages topological properties to rigorously evaluate the global visual perception capabilities of LVLMs across various granularities. Since topology depends on the global structure of an image and is invariant to local features, TopoPerception enables a shortcut-free assessment of global perception, fundamentally distinguishing it from semantically rich tasks. We evaluate state-of-the-art models on TopoPerception and find that even at the coarsest perceptual granularity, all models perform no better than random chance, indicating a profound inability to perceive global visual features. Notably, a consistent trend emerge within model families: more powerful models with stronger reasoning capabilities exhibit lower accuracy. This suggests that merely scaling up models is insufficient to address this deficit and may even exacerbate it. Progress may require new training paradigms or architectures. TopoPerception not only exposes a critical bottleneck in current LVLMs but also offers a lens and direction for improving their global visual perception. The data and code are publicly available at: https://github.com/Wenhao-Zhou/TopoPerception.

[644] End to End AI System for Surgical Gesture Sequence Recognition and Clinical Outcome Prediction

Xi Li, Nicholas Matsumoto, Ujjwal Pasupulety, Atharva Deo, Cherine Yang, Jay Moran, Miguel E. Hernandez, Peter Wager, Jasmine Lin, Jeanine Kim, Alvin C. Goh, Christian Wagner, Geoffrey A. Sonn, Andrew J. Hung

Main category: cs.AI

TL;DR: F2O is an end-to-end system that analyzes surgical videos to detect gestures and predict postoperative outcomes, achieving performance comparable to human annotations.

DetailsMotivation: Fine-grained analysis of intraoperative behavior and its impact on patient outcomes remains challenging, requiring automated systems to uncover patterns in surgical procedures.

Method: Leverages transformer-based spatial and temporal modeling with frame-wise classification to detect consecutive short (~2 seconds) gestures in robot-assisted radical prostatectomy videos.

Result: Achieved AUC of 0.80 frame-level and 0.81 video-level for gesture detection. F2O-derived features predicted postoperative outcomes with 0.79 accuracy (vs 0.75 human annotations), with strong correlation (r=0.96) and captured key patterns linked to erectile function recovery.

Conclusion: F2O enables automatic interpretable assessment and establishes a foundation for data-driven surgical feedback and prospective clinical decision support.

Abstract: Fine-grained analysis of intraoperative behavior and its impact on patient outcomes remain a longstanding challenge. We present Frame-to-Outcome (F2O), an end-to-end system that translates tissue dissection videos into gesture sequences and uncovers patterns associated with postoperative outcomes. Leveraging transformer-based spatial and temporal modeling and frame-wise classification, F2O robustly detects consecutive short (2 seconds) gestures in the nerve-sparing step of robot-assisted radical prostatectomy (AUC: 0.80 frame-level; 0.81 video-level). F2O-derived features (gesture frequency, duration, and transitions) predicted postoperative outcomes with accuracy comparable to human annotations (0.79 vs. 0.75; overlapping 95% CI). Across 25 shared features, effect size directions were concordant with small differences ( 0.07), and strong correlation (r = 0.96, p < 1e-14). F2O also captured key patterns linked to erectile function recovery, including prolonged tissue peeling and reduced energy use. By enabling automatic interpretable assessment, F2O establishes a foundation for data-driven surgical feedback and prospective clinical decision support.

[645] Forgetting-MarI: LLM Unlearning via Marginal Information Regularization

Shizhou Xu, Yuan Ni, Stefan Broecker, Thomas Strohmer

Main category: cs.AI

TL;DR: Forgetting-MarI is an LLM unlearning framework that provably removes only marginal information from data to be forgotten while preserving retained data information, ensuring better model performance preservation than existing methods.

DetailsMotivation: Address the need for selective data removal from trained AI models for privacy protection and regulatory compliance, especially for resource-intensive LLMs, without degrading overall model performance.

Method: Penalizes marginal information contributed by data to be unlearned, providing explicit upper bounds on residual influence and provable undetectability of unlearned data.

Result: Outperforms state-of-the-art unlearning methods with reliable forgetting and better preserved general model performance across diverse benchmarks.

Conclusion: Represents an important step toward making AI systems more controllable and compliant with privacy/copyright regulations without compromising effectiveness.

Abstract: As AI models are trained on ever-expanding datasets, the ability to remove the influence of specific data from trained models has become essential for privacy protection and regulatory compliance. Unlearning addresses this challenge by selectively removing parametric knowledge from the trained models without retraining from scratch, which is critical for resource-intensive models such as Large Language Models (LLMs). Existing unlearning methods often degrade model performance by removing more information than necessary when attempting to ‘‘forget’’ specific data. We introduce Forgetting-MarI, an LLM unlearning framework that provably removes only the additional (marginal) information contributed by the data to be unlearned, while preserving the information supported by the data to be retained. By penalizing marginal information, our method yields an explicit upper bound on the unlearn dataset’s residual influence in the trained models, providing provable undetectability. Extensive experiments confirm that our approach outperforms current state-of-the-art unlearning methods, delivering reliable forgetting and better preserved general model performance across diverse benchmarks. This advancement represents an important step toward making AI systems more controllable and compliant with privacy and copyright regulations without compromising their effectiveness.

[646] An Analysis of Architectural Impact on LLM-based Abstract Visual Reasoning: A Systematic Benchmark on RAVEN-FAIR

Sinan Urgun, Seçkin Arı

Main category: cs.AI

TL;DR: Systematic evaluation of four LLMs (GPT-4.1-Mini, Claude-3.5-Haiku, Gemini-1.5-Flash, Llama-3.3-70b) on abstract visual reasoning using RAVEN-FAIR dataset with four reasoning architectures, showing GPT-4.1-Mini consistently performs best.

DetailsMotivation: To systematically evaluate LLM performance in abstract visual reasoning problems and understand how different reasoning architectures affect model capabilities.

Method: Used four LLM models with four reasoning architectures (single-shot, embedding-controlled repetition, self-reflection, multi-agent) on RAVEN-FAIR dataset. Visual responses generated via three-stage process (JSON extraction, LLM reasoning, Tool Function) and evaluated using SSIM and LPIPS metrics. Analyzed Chain-of-Thought scores and error types.

Result: GPT-4.1-Mini consistently achieved highest overall accuracy across all architectures. Multi-agent architecture occasionally altered semantic/numeric balance but effects weren’t uniformly beneficial. Each model showed distinct sensitivity to architectural design. Response coverage variations complicated cross-architecture comparisons.

Conclusion: Reasoning effectiveness remains model-specific, with GPT-4.1-Mini demonstrating strongest capabilities. Multi-run evaluations are essential as single-run assessments are fragile and unreliable for drawing conclusions about LLM performance.

Abstract: This study aims to systematically evaluate the performance of large language models (LLMs) in abstract visual reasoning problems. We examined four LLM models (GPT-4.1-Mini, Claude-3.5-Haiku, Gemini-1.5-Flash, Llama-3.3-70b) utilizing four different reasoning architectures (single-shot, embedding-controlled repetition, self-reflection, and multi-agent) on the RAVEN-FAIR dataset. Visual responses generated through a three-stage process (JSON extraction, LLM reasoning, and Tool Function) were evaluated using SSIM and LPIPS metrics; Chain-of-Thought scores and error types (semantic hallucination, numeric misperception) were analyzed. Results demonstrate that GPT-4.1-Mini consistently achieved the highest overall accuracy across all architectures, indicating a strong reasoning capability. While the multi-agent architecture occasionally altered semantic and numeric balance across models, these effects were not uniformly beneficial. Instead, each model exhibited distinct sensitivity patterns to architectural design, underscoring that reasoning effectiveness remains model-specific. Variations in response coverage further emerged as a confounding factor that complicates direct cross-architecture comparison. To estimate the upper-bound performance of each configuration, we report the best of five independent runs, representing a best-case scenario rather than an averaged outcome. This multi-run strategy aligns with recent recommendations, which emphasize that single-run evaluations are fragile and may lead to unreliable conclusions.

[647] Looking Forward: Challenges and Opportunities in Agentic AI Reliability

Liudong Xing, Janet, Lin

Main category: cs.AI

TL;DR: This chapter discusses challenges and future directions for building reliable agentic AI systems, focusing on mitigating cascading failures, dynamic environments, inconsistent task execution, unpredictable behaviors, and resource-intensive reliability mechanisms.

DetailsMotivation: To address the growing need for reliable AI systems, particularly agentic AI systems that can operate autonomously in complex environments, by identifying key research challenges and opportunities.

Method: The chapter presents perspectives and discusses open research problems through analysis of various aspects including cascading failures, dynamic environments, task execution consistency, emergent behaviors, and reliability mechanisms.

Result: Identification of multiple research challenges and opportunities in building reliable agentic AI systems, with specific focus on testing and evaluation methods for reliability assessment.

Conclusion: There are significant research gaps and opportunities in developing reliable agentic AI systems, particularly in areas of failure mitigation, dynamic adaptation, and comprehensive testing frameworks.

Abstract: This chapter presents perspectives for challenges and future development in building reliable AI systems, particularly, agentic AI systems. Several open research problems related to mitigating the risks of cascading failures are discussed. The chapter also sheds lights on research challenges and opportunities in aspects including dynamic environments, inconsistent task execution, unpredictable emergent behaviors, as well as resource-intensive reliability mechanisms. In addition, several research directions along the line of testing and evaluating reliability of agentic AI systems are also discussed.

[648] A Neuromorphic Architecture for Scalable Event-Based Control

Yongkang Huo, Fulvio Forni, Rodolphe Sepulchre

Main category: cs.AI

TL;DR: The paper introduces the “rebound Winner-Take-All (RWTA)” motif as a scalable neuromorphic control architecture that combines discrete computation reliability with continuous regulation tunability.

DetailsMotivation: To create a unified neuromorphic control architecture that addresses both continuous rhythmic generation and discrete decision-making in a single framework, bridging the gap between discrete state machines and continuous biophysical circuits.

Method: Developed the RWTA motif as the basic building block, creating an event-based framework that inherits discrete computation capabilities from winner-take-all state machines and continuous tuning capabilities from excitable biophysical circuits.

Result: The architecture demonstrates versatility, robustness, and modularity through its application in designing the nervous system of a snake robot, successfully handling both rhythmic generation and decision-making tasks.

Conclusion: The RWTA-based neuromorphic architecture provides a scalable solution that unifies discrete and continuous control paradigms, offering a promising approach for neuromorphic system design with applications in robotics and beyond.

Abstract: This paper introduces the ``rebound Winner-Take-All (RWTA)" motif as the basic element of a scalable neuromorphic control architecture. From the cellular level to the system level, the resulting architecture combines the reliability of discrete computation and the tunability of continuous regulation: it inherits the discrete computation capabilities of winner-take-all state machines and the continuous tuning capabilities of excitable biophysical circuits. The proposed event-based framework addresses continuous rhythmic generation and discrete decision-making in a unified physical modeling language. We illustrate the versatility, robustness, and modularity of the architecture through the nervous system design of a snake robot.

[649] Multi-agent Self-triage System with Medical Flowcharts

Yujia Liu, Sophia Yu, Hongyue Jin, Jessica Wen, Alexander Qian, Terrence Lee, Mattheus Ramsis, Gi Won Choi, Lianhui Qin, Xin Liu, Edward J. Wang

Main category: cs.AI

TL;DR: A conversational self-triage system that guides LLMs with 100 clinically validated AMA flowcharts achieves high accuracy in medical decision support through a multi-agent framework.

DetailsMotivation: Online health resources and LLMs have limited reliability in healthcare due to low accuracy, lack of transparency, and susceptibility to unverified information.

Method: Multi-agent framework with retrieval agent, decision agent, and chat agent that uses 100 AMA flowcharts to guide LLMs through structured, auditable patient decision support.

Result: 95.29% top-3 accuracy in flowchart retrieval (N=2,000) and 99.10% accuracy in flowchart navigation across varied conversational styles (N=37,200).

Conclusion: Combining free-text interaction with standardized clinical protocols enables transparent, accurate, and generalizable AI-assisted self-triage for informed patient decision-making.

Abstract: Online health resources and large language models (LLMs) are increasingly used as a first point of contact for medical decision-making, yet their reliability in healthcare remains limited by low accuracy, lack of transparency, and susceptibility to unverified information. We introduce a proof-of-concept conversational self-triage system that guides LLMs with 100 clinically validated flowcharts from the American Medical Association, providing a structured and auditable framework for patient decision support. The system leverages a multi-agent framework consisting of a retrieval agent, a decision agent, and a chat agent to identify the most relevant flowchart, interpret patient responses, and deliver personalized, patient-friendly recommendations, respectively. Performance was evaluated at scale using synthetic datasets of simulated conversations. The system achieved 95.29% top-3 accuracy in flowchart retrieval (N=2,000) and 99.10% accuracy in flowchart navigation across varied conversational styles and conditions (N=37,200). By combining the flexibility of free-text interaction with the rigor of standardized clinical protocols, this approach demonstrates the feasibility of transparent, accurate, and generalizable AI-assisted self-triage, with potential to support informed patient decision-making while improving healthcare resource utilization.

[650] Augmenting The Weather: A Hybrid Counterfactual-SMOTE Algorithm for Improving Crop Growth Prediction When Climate Changes

Mohammed Temraz, Mark T Keane

Main category: cs.AI

TL;DR: CFA-SMOTE is a novel data augmentation method that combines counterfactual generation from XAI with SMOTE to handle climate outlier events by treating them as class-imbalance problems.

DetailsMotivation: AI struggles with climate-disrupted data because traditional ML methods rely on historical distributions and fail to handle out-of-distribution outlier events, which are crucial for climate change adaptation.

Method: CFA-SMOTE combines instance-based counterfactual methods from Explainable AI with SMOTE class-imbalance technique to create synthetic data-points representing climate outlier events.

Result: Comparative experiments show CFA-SMOTE improves predictive performance for grass growth prediction during the 2018 European drought crisis, outperforming benchmark methods under different class-imbalance ratios.

Conclusion: Treating climate change prediction as a class-imbalance problem and using counterfactual-based data augmentation can significantly improve AI’s ability to handle climate-disrupted data and extreme weather events.

Abstract: In recent years, humanity has begun to experience the catastrophic effects of climate change as economic sectors (such as agriculture) struggle with unpredictable and extreme weather events. Artificial Intelligence (AI) should help us handle these climate challenges but its most promising solutions are not good at dealing with climate-disrupted data; specifically, machine learning methods that work from historical data-distributions, are not good at handling out-of-distribution, outlier events. In this paper, we propose a novel data augmentation method, that treats the predictive problems around climate change as being, in part, due to class-imbalance issues; that is, prediction from historical datasets is difficult because, by definition, they lack sufficient minority-class instances of “climate outlier events”. This novel data augmentation method – called Counterfactual-Based SMOTE (CFA-SMOTE) – combines an instance-based counterfactual method from Explainable AI (XAI) with the well-known class-imbalance method, SMOTE. CFA-SMOTE creates synthetic data-points representing outlier, climate-events that augment the dataset to improve predictive performance. We report comparative experiments using this CFA-SMOTE method, comparing it to benchmark counterfactual and class-imbalance methods under different conditions (i.e., class-imbalance ratios). The focal climate-change domain used relies on predicting grass growth on Irish dairy farms, during Europe-wide drought and forage crisis of 2018.

[651] Adaptively Coordinating with Novel Partners via Learned Latent Strategies

Benjamin Li, Shuyang Shi, Lucia Romero, Huao Li, Yaqi Xie, Woojun Kim, Stefanos Nikolaidis, Michael Lewis, Katia Sycara, Simon Stepputtis

Main category: cs.AI

TL;DR: A strategy-conditioned cooperator framework for real-time adaptation to diverse partner strategies in human-agent teams, using variational autoencoders for strategy representation and clustering for strategy categorization.

DetailsMotivation: Artificial agents need to adapt to human partners' unique and dynamically changing preferences in time-pressured collaborative tasks with complex strategic spaces.

Method: Learn latent strategy space with variational autoencoder from trajectory data, cluster strategies, train cooperator conditioned on clusters, and use fixed-share regret minimization for online adaptation to novel partners.

Result: Achieves state-of-the-art performance with novel human and agent teammates in Overcooked domain, validated through experiments and user study.

Conclusion: The proposed framework effectively enables real-time adaptation to diverse partner strategies in complex collaborative environments.

Abstract: Adaptation is the cornerstone of effective collaboration among heterogeneous team members. In human-agent teams, artificial agents need to adapt to their human partners in real time, as individuals often have unique preferences and policies that may change dynamically throughout interactions. This becomes particularly challenging in tasks with time pressure and complex strategic spaces, where identifying partner behaviors and selecting suitable responses is difficult. In this work, we introduce a strategy-conditioned cooperator framework that learns to represent, categorize, and adapt to a broad range of potential partner strategies in real-time. Our approach encodes strategies with a variational autoencoder to learn a latent strategy space from agent trajectory data, identifies distinct strategy types through clustering, and trains a cooperator agent conditioned on these clusters by generating partners of each strategy type. For online adaptation to novel partners, we leverage a fixed-share regret minimization algorithm that dynamically infers and adjusts the partner’s strategy estimation during interaction. We evaluate our method in a modified version of the Overcooked domain, a complex collaborative cooking environment that requires effective coordination among two players with a diverse potential strategy space. Through these experiments and an online user study, we demonstrate that our proposed agent achieves state of the art performance compared to existing baselines when paired with novel human, and agent teammates.

[652] LLM-Assisted Formalization Enables Deterministic Detection of Statutory Inconsistency in the Internal Revenue Code

Borchuluun Yadamsuren, Steven Keith Platt, Miguel Diaz

Main category: cs.AI

TL;DR: A hybrid neuro-symbolic framework combining LLMs with Prolog logic programming achieves deterministic detection of statutory inconsistencies in complex tax law, outperforming LLM-only approaches.

DetailsMotivation: To address the limitations of LLMs in handling hierarchical processing and deep structured reasoning for statutory inconsistency detection in complex legal domains like tax law.

Method: Combined GPT-4o and GPT-5 with Prolog symbolic logic - used LLMs to translate tax code sections into Prolog rules, then employed Prolog for deterministic reasoning and inconsistency detection.

Result: LLM-only approaches achieved only 33% accuracy, while the hybrid Prolog model produced deterministic, reproducible results and successfully identified inconsistency zones with validation confirming accuracy and internal consistency.

Conclusion: LLM-assisted formalization anchored in symbolic logic enables transparent and reliable statutory inconsistency detection, demonstrating the value of hybrid neuro-symbolic approaches for complex legal reasoning.

Abstract: This study introduces a hybrid neuro-symbolic framework that achieves deterministic detection of statutory inconsistency in complex law. We use the U.S. Internal Revenue Code (IRC) as a case study because its complexity makes it a fertile domain for identifying conflicts. Our research offers a solution for detecting inconsistent provisions by combining Large Language Models (LLMs) with symbolic logic. LLM-based methods can support compliance, fairness, and statutory drafting, yet tax-specific applications remain sparse. A key challenge is that such models struggle with hierarchical processing and deep structured reasoning, especially over long text. This research addresses these gaps through experiments using GPT-4o, GPT-5, and Prolog. GPT-4o was first used to translate Section 121 into Prolog rules and refine them in SWISH. These rules were then incorporated into prompts to test whether Prolog-augmented prompting improved GPT-4o’s inconsistency detection. GPT-4o, whether prompted with natural language alone or with Prolog augmentation, detected the inconsistency in only one of three strategies (33 percent accuracy), but its reasoning quality differed: natural-language prompting achieved 100 percent rule coverage, while Prolog-augmented prompting achieved 66 percent, indicating more incomplete statutory analysis. In contrast to probabilistic prompting, the hybrid Prolog model produced deterministic and reproducible results. Guided by GPT-5 for refinement, the model formalized the IRC section’s competing interpretations and successfully detected an inconsistency zone. Validation tests confirm that the Prolog implementation is accurate, internally consistent, deterministic, and capable of autonomously identifying inconsistencies. These findings show that LLM-assisted formalization, anchored in symbolic logic, enables transparent and reliable statutory inconsistency detection.

[653] Improving Autoformalization Using Direct Dependency Retrieval

Shaoqi Wang, Lu Yu, Chunjie Yang

Main category: cs.AI

TL;DR: Proposes DDR (Direct Dependency Retrieval) framework for statement autoformalization that directly generates and verifies formal library dependencies from natural language math descriptions, achieving superior performance over existing methods.

DetailsMotivation: Address limitations in existing autoformalization methods including lack of contextual awareness causing hallucinations, poor precision/recall in dependency retrieval, and scalability issues with growing datasets.

Method: DDR framework that generates candidate library dependencies from natural language, verifies them via efficient suffix array checks, constructs large retrieval dataset (500k+ samples), and fine-tunes high-precision DDR model.

Result: DDR model significantly outperforms SOTA methods in retrieval precision and recall. Autoformalizer with DDR shows consistent advantages in single-attempt accuracy and multi-attempt stability over traditional RAG methods.

Conclusion: DDR framework effectively bridges the gap in statement autoformalization by providing scalable, high-precision dependency retrieval that enhances autoformalization performance and stability.

Abstract: The convergence of deep learning and formal mathematics has spurred research in formal verification. Statement autoformalization, a crucial first step in this process, aims to translate informal descriptions into machine-verifiable representations but remains a significant challenge. The core difficulty lies in the fact that existing methods often suffer from a lack of contextual awareness, leading to hallucination of formal definitions and theorems. Furthermore, current retrieval-augmented approaches exhibit poor precision and recall for formal library dependency retrieval, and lack the scalability to effectively leverage ever-growing public datasets. To bridge this gap, we propose a novel retrieval-augmented framework based on DDR (\textit{Direct Dependency Retrieval}) for statement autoformalization. Our DDR method directly generates candidate library dependencies from natural language mathematical descriptions and subsequently verifies their existence within the formal library via an efficient suffix array check. Leveraging this efficient search mechanism, we constructed a dependency retrieval dataset of over 500,000 samples and fine-tuned a high-precision DDR model. Experimental results demonstrate that our DDR model significantly outperforms SOTA methods in both retrieval precision and recall. Consequently, an autoformalizer equipped with DDR shows consistent performance advantages in both single-attempt accuracy and multi-attempt stability compared to models using traditional selection-based RAG methods.

[654] Look As You Think: Unifying Reasoning and Visual Evidence Attribution for Verifiable Document RAG via Reinforcement Learning

Shuochen Liu, Pengfei Luo, Chao Zhang, Yuhao Chen, Haotian Zhang, Qi Liu, Xin Kou, Tong Xu, Enhong Chen

Main category: cs.AI

TL;DR: Chain-of-Evidence paradigm for visual document RAG that combines reasoning with visual evidence attribution using bounding boxes and page indexes, trained via reinforcement learning for verifiable reasoning paths.

DetailsMotivation: Existing methods for visual evidence attribution in VD-RAG lack fine-grained supervision and progressive traceability throughout the reasoning process, making answer verification challenging.

Method: Proposed Look As You Think (LAT) framework using reinforcement learning to train models to produce evidence-grounded reasoning paths with consistent attribution, evaluating attribution consistency and rewarding only correct CoE trajectories.

Result: LAT improved vanilla Qwen2.5-VL-7B-Instruct by 8.23% in soft EM and 47.0% in IoU@0.5, outperforming supervised fine-tuning baselines and showing stronger cross-domain generalization.

Conclusion: The Chain-of-Evidence paradigm with LAT training enables verifiable reasoning in visual document RAG, achieving better performance and generalization than existing methods.

Abstract: Aiming to identify precise evidence sources from visual documents, visual evidence attribution for visual document retrieval-augmented generation (VD-RAG) ensures reliable and verifiable predictions from vision-language models (VLMs) in multimodal question answering. Most existing methods adopt end-to-end training to facilitate intuitive answer verification. However, they lack fine-grained supervision and progressive traceability throughout the reasoning process. In this paper, we introduce the Chain-of-Evidence (CoE) paradigm for VD-RAG. CoE unifies Chain-of-Thought (CoT) reasoning and visual evidence attribution by grounding reference elements in reasoning steps to specific regions with bounding boxes and page indexes. To enable VLMs to generate such evidence-grounded reasoning, we propose Look As You Think (LAT), a reinforcement learning framework that trains models to produce verifiable reasoning paths with consistent attribution. During training, LAT evaluates the attribution consistency of each evidence region and provides rewards only when the CoE trajectory yields correct answers, encouraging process-level self-verification. Experiments on vanilla Qwen2.5-VL-7B-Instruct with Paper- and Wiki-VISA benchmarks show that LAT consistently improves the vanilla model in both single- and multi-image settings, yielding average gains of 8.23% in soft exact match (EM) and 47.0% in IoU@0.5. Meanwhile, LAT not only outperforms the supervised fine-tuning baseline, which is trained to directly produce answers with attribution, but also exhibits stronger generalization across domains.

[655] Adaptive Diagnostic Reasoning Framework for Pathology with Multimodal Large Language Models

Yunqi Hong, Johnson Kao, Liam Edwards, Nein-Tzu Liu, Chung-Yen Huang, Alex Oliveira-Kowaleski, Cho-Jui Hsieh, Neil Y. C. Lin

Main category: cs.AI

TL;DR: RECAP-PATH is an interpretable AI framework that enables multimodal large language models to perform evidence-linked diagnostic reasoning in pathology, achieving improved accuracy and clinically trustworthy rationales without requiring white-box access or weight updates.

DetailsMotivation: Current AI tools in pathology lack human-readable reasoning needed to audit decisions and prevent errors, limiting clinical adoption despite improvements in screening throughput and prognostic pattern recognition.

Method: A two-phase self-learning process: diversification expands pathology-style explanations, and optimization refines them for accuracy. Uses only small labeled sets without white-box access or weight updates to generate cancer diagnoses.

Result: Evaluated on breast and prostate datasets, RECAP-PATH produced rationales aligned with expert assessment and delivered substantial gains in diagnostic accuracy over baseline methods.

Conclusion: RECAP-PATH provides clinically trustworthy AI by uniting visual understanding with reasoning, demonstrating a generalizable path toward evidence-linked interpretation in pathology.

Abstract: AI tools in pathology have improved screening throughput, standardized quantification, and revealed prognostic patterns that inform treatment. However, adoption remains limited because most systems still lack the human-readable reasoning needed to audit decisions and prevent errors. We present RECAP-PATH, an interpretable framework that establishes a self-learning paradigm, shifting off-the-shelf multimodal large language models from passive pattern recognition to evidence-linked diagnostic reasoning. At its core is a two-phase learning process that autonomously derives diagnostic criteria: diversification expands pathology-style explanations, while optimization refines them for accuracy. This self-learning approach requires only small labeled sets and no white-box access or weight updates to generate cancer diagnoses. Evaluated on breast and prostate datasets, RECAP-PATH produced rationales aligned with expert assessment and delivered substantial gains in diagnostic accuracy over baselines. By uniting visual understanding with reasoning, RECAP-PATH provides clinically trustworthy AI and demonstrates a generalizable path toward evidence-linked interpretation.

[656] Intelligent Collaborative Optimization for Rubber Tyre Film Production Based on Multi-path Differentiated Clipping Proximal Policy Optimization

Yinghao Ruan, Wei Pang, Shuaihao Liu, Huili Yang, Leyi Han, Xinghui Dong

Main category: cs.AI

TL;DR: Introduces MPD-PPO, a deep reinforcement learning algorithm with multi-branch policy architecture for high-dimensional optimization in rubber tyre manufacturing, showing improved accuracy and efficiency in film production control.

DetailsMotivation: Address limitations of traditional centralized scheduling and inflexible production lines in rubber tyre industry, particularly for dynamic production demands and complex subsystem coordination.

Method: Multi-path Differentiated Clipping Proximal Policy Optimization (MPD-PPO) algorithm with multi-branch policy architecture and differentiated gradient clipping constraints for stable high-dimensional policy updates.

Result: Demonstrated substantial improvements in tuning accuracy and operational efficiency for width and thickness control in rubber tyre film production, successfully handling high dimensionality and multi-objective trade-offs.

Conclusion: The framework effectively tackles key challenges in tyre manufacturing including high dimensionality, multi-objective optimization, and dynamic adaptation, providing enhanced performance and production stability for real-time industrial deployment.

Abstract: The advent of smart manufacturing is addressing the limitations of traditional centralized scheduling and inflexible production line configurations in the rubber tyre industry, especially in terms of coping with dynamic production demands. Contemporary tyre manufacturing systems form complex networks of tightly coupled subsystems pronounced nonlinear interactions and emergent dynamics. This complexity renders the effective coordination of multiple subsystems, posing an essential yet formidable task. For high-dimensional, multi-objective optimization problems in this domain, we introduce a deep reinforcement learning algorithm: Multi-path Differentiated Clipping Proximal Policy Optimization (MPD-PPO). This algorithm employs a multi-branch policy architecture with differentiated gradient clipping constraints to ensure stable and efficient high-dimensional policy updates. Validated through experiments on width and thickness control in rubber tyre film production, MPD-PPO demonstrates substantial improvements in both tuning accuracy and operational efficiency. The framework successfully tackles key challenges, including high dimensionality, multi-objective trade-offs, and dynamic adaptation, thus delivering enhanced performance and production stability for real-time industrial deployment in tyre manufacturing.

[657] MedDCR: Learning to Design Agentic Workflows for Medical Coding

Jiyang Zheng, Islam Nassar, Thanh Vu, Xu Zhong, Yang Lin, Tongliang Liu, Long Duong, Yuan-Fang Li

Main category: cs.AI

TL;DR: MedDCR is a closed-loop framework that learns effective workflows for medical coding through iterative refinement, outperforming state-of-the-art methods by treating workflow design as a learning problem.

DetailsMotivation: Existing medical coding systems rely on rigid, manually crafted workflows that fail to capture the nuance and variability of real-world clinical documentation, creating a need for systematic learning of effective workflows.

Method: A closed-loop framework with three components: Designer proposes workflows, Coder executes them, and Reflector evaluates predictions and provides feedback, with a memory archive preserving prior designs for reuse and iterative refinement.

Result: MedDCR outperforms state-of-the-art baselines on benchmark datasets and produces interpretable, adaptable workflows that better reflect real coding practice.

Conclusion: The framework improves both reliability and trustworthiness of automated medical coding systems by enabling systematic learning of effective workflows through iterative refinement.

Abstract: Medical coding converts free-text clinical notes into standardized diagnostic and procedural codes, which are essential for billing, hospital operations, and medical research. Unlike ordinary text classification, it requires multi-step reasoning: extracting diagnostic concepts, applying guideline constraints, mapping to hierarchical codebooks, and ensuring cross-document consistency. Recent advances leverage agentic LLMs, but most rely on rigid, manually crafted workflows that fail to capture the nuance and variability of real-world documentation, leaving open the question of how to systematically learn effective workflows. We present MedDCR, a closed-loop framework that treats workflow design as a learning problem. A Designer proposes workflows, a Coder executes them, and a Reflector evaluates predictions and provides constructive feedback, while a memory archive preserves prior designs for reuse and iterative refinement. On benchmark datasets, MedDCR outperforms state-of-the-art baselines and produces interpretable, adaptable workflows that better reflect real coding practice, improving both the reliability and trustworthiness of automated systems.

[658] Bayesian Optimization in Language Space: An Eval-Efficient AI Self-Improvement Framework

Enoch Hyunwook Kang, Hema Yoganarasimhan

Main category: cs.AI

TL;DR: The paper proposes T-BoN BO, a text-based Bayesian optimization framework that combines Best-of-N selection with textual gradients to achieve evaluation-efficient AI self-improvement, particularly for applications where evaluation is more costly than generation.

DetailsMotivation: Current self-improving AI focuses on query efficiency, but many real-world applications face evaluation bottlenecks where human feedback is expensive and time-consuming. The goal is to optimize for evaluation efficiency rather than just generation efficiency.

Method: The paper proves that Best-of-N selection with textual gradients statistically emulates UCB acquisition function gradients, enabling optimal exploration. Based on this, they propose T-BoN BO - TextGrad-Best-of-N Bayesian Optimization for language-space optimization.

Result: Empirical validation on automated ad alignment tasks for persona distribution shows T-BoN BO outperforms state-of-the-art baselines in evaluation efficiency.

Conclusion: T-BoN BO provides a simple yet effective framework for evaluation-efficient AI self-improvement by bridging Bayesian optimization principles with language model capabilities through textual gradients and Best-of-N selection.

Abstract: Large Language Models (LLMs) have recently enabled self-improving AI, i.e., AI that iteratively generates, evaluates, and refines its own outcomes. Recent studies have shown that self-improving AI focusing on prompt optimization can outperform state-of-the-art reinforcement-learning fine-tuned LLMs. Here, their `performance’ is typically measured by query efficiency - the number of LLM-generated solution samples required to meet a certain performance threshold. However, in many societal applications, the primary limitation is not generating new solutions but evaluating them. For instance, evaluating an ad’s effectiveness requires significant human feedback, which is far more costly and time-consuming than generating a candidate ad. To optimize for the evaluation efficiency objective, a natural approach is to extend Bayesian Optimization (BO), a framework proven optimal for evaluation efficiency, to the language domain. However, the difficulty of directly estimating suitable acquisition functions in LLMs’ minds makes this extension challenging. This paper overcomes this challenge by proving that the combination of the simple and widely used Best-of-N selection strategy and simple textual gradients (i.e., textual edits from a critic model) statistically emulates the behavior of the gradients on the canonical UCB acquisition function, which induces optimal exploration in terms of evaluation efficiency. Based on this result, we propose TextGrad-Best-of-N Bayesian Optimization (T-BoN BO), a simple and eval-efficient language-space Bayesian optimization framework for AI self-improvement. We also empirically validate T-BoN BO by applying it to automated ad alignment tasks for persona distribution, demonstrating its superior performance compared to popular state-of-the-art baselines.

[659] No-Regret Strategy Solving in Imperfect-Information Games via Pre-Trained Embedding

Yanchang Fu, Shengda Liu, Pei Xu, Kaiqi Huang

Main category: cs.AI

TL;DR: Embedding CFR algorithm uses low-dimensional continuous embedding space for information set abstraction in imperfect-information games, achieving faster exploitability convergence than cluster-based methods.

DetailsMotivation: Traditional discrete clustering methods for game abstraction lose critical subtle differences between information sets, compromising strategy solving quality in large-scale imperfect-information games like poker.

Method: Pre-trains information set features into low-dimensional continuous embedding space using word embedding paradigm, then performs CFR with regret accumulation and strategy updates in this embedding space.

Result: Experiments show significantly faster exploitability convergence compared to cluster-based abstraction algorithms with same spatial overhead, particularly effective in poker games.

Conclusion: Embedding CFR is the first algorithm to pre-train information set abstractions through low-dimensional embedding for strategy solving, providing more precise capture of distinctions and connections between information sets.

Abstract: High-quality information set abstraction remains a core challenge in solving large-scale imperfect-information extensive-form games (IIEFGs)-such as no-limit Texas Hold’em-where the finite nature of spatial resources hinders strategy solving over the full game. State-of-the-art AI methods rely on pre-trained discrete clustering for abstraction, yet their hard classification irreversibly loses critical information: specifically, the quantifiable subtle differences between information sets-vital for strategy solving-thereby compromising the quality of such solving. Inspired by the word embedding paradigm in natural language processing, this paper proposes the Embedding CFR algorithm, a novel approach for solving strategies in IIEFGs within an embedding space. The algorithm pre-trains and embeds features of isolated information sets into an interconnected low-dimensional continuous space, where the resulting vectors more precisely capture both the distinctions and connections between information sets. Embedding CFR presents a strategy-solving process driven by regret accumulation and strategy updates within this embedding space, with accompanying theoretical analysis verifying its capacity to reduce cumulative regret. Experiments on poker show that with the same spatial overhead, Embedding CFR achieves significantly faster exploitability convergence compared to cluster-based abstraction algorithms, confirming its effectiveness. Furthermore, to our knowledge, it is the first algorithm in poker AI that pre-trains information set abstractions through low-dimensional embedding for strategy solving.

[660] KrwEmd: Revising the Imperfect-Recall Abstraction from Forgetting Everything

Yanchang Fu, Qiyue Yin, Shengda Liu, Pei Xu, Kaiqi Huang

Main category: cs.AI

TL;DR: KrwEmd is a novel algorithm that addresses excessive abstraction in imperfect-information games by using k-recall winrate features and earth mover’s distance clustering to improve AI performance.

DetailsMotivation: Excessive abstraction in hand abstraction tasks (like Texas hold'em) impairs AI performance due to extreme imperfect-recall abstraction that discards historical information.

Method: Developed KrwEmd algorithm using k-recall winrate features that leverage both future and historical game information, then clustering signal observation infosets using earth mover’s distance.

Result: Experimental results show KrwEmd significantly improves AI gameplay performance compared to existing algorithms.

Conclusion: KrwEmd effectively addresses the excessive abstraction problem in imperfect-information games and enhances AI performance through better information retention and clustering.

Abstract: Excessive abstraction is a critical challenge in hand abstraction-a task specific to games like Texas hold’em-when solving large-scale imperfect-information games, as it impairs AI performance. This issue arises from extreme implementations of imperfect-recall abstraction, which entirely discard historical information. This paper presents KrwEmd, the first practical algorithm designed to address this problem. We first introduce the k-recall winrate feature, which not only qualitatively distinguishes signal observation infosets by leveraging both future and, crucially, historical game information, but also quantitatively captures their similarity. We then develop the KrwEmd algorithm, which clusters signal observation infosets using earth mover’s distance to measure discrepancies between their features. Experimental results demonstrate that KrwEmd significantly improves AI gameplay performance compared to existing algorithms.

[661] MetaGDPO: Alleviating Catastrophic Forgetting with Metacognitive Knowledge through Group Direct Preference Optimization

Lanxue Zhang, Yuqiang Xie, Fang Fang, Fanglong Dong, Rui Liu, Yanan Cao

Main category: cs.AI

TL;DR: The paper proposes a comprehensive solution to address catastrophic forgetting in smaller language models during knowledge distillation from large models, using a specially constructed dataset with metacognitive knowledge and a novel training method called GDPO.

DetailsMotivation: Existing datasets and fine-tuning approaches for compressing large language models into smaller ones face catastrophic forgetting issues, especially for models smaller than 8B, due to ignoring the relationship between training data knowledge and model's inherent abilities, and conventional training objectives failing to constrain knowledge preservation.

Method: Constructed a 5K-instance dataset covering multiple reasoning tasks with metacognitive knowledge annotations, filtered based on task knowledge and model skills. Introduced GDPO (Group Direction Preference Optimization) that efficiently approximates GRPO performance for resource-limited scenarios, using large model guidance and implicit optimization path constraints through a reference model.

Result: Extensive experiments demonstrate that the proposed approach significantly alleviates catastrophic forgetting and improves reasoning performance on smaller models.

Conclusion: The comprehensive solution combining carefully constructed dataset with metacognitive knowledge and GDPO training method effectively addresses catastrophic forgetting in knowledge distillation to smaller language models.

Abstract: Large Language Models demonstrate strong reasoning capabilities, which can be effectively compressed into smaller models. However, existing datasets and fine-tuning approaches still face challenges that lead to catastrophic forgetting, particularly for models smaller than 8B. First, most datasets typically ignore the relationship between training data knowledge and the model’s inherent abilities, making it difficult to preserve prior knowledge. Second, conventional training objectives often fail to constrain inherent knowledge preservation, which can result in forgetting of previously learned skills. To address these issues, we propose a comprehensive solution that alleviates catastrophic forgetting from both the data and fine-tuning approach perspectives. On the data side, we construct a dataset of 5K instances that covers multiple reasoning tasks and incorporates metacognitive knowledge, making it more tolerant and effective for distillation into smaller models. We annotate the metacognitive knowledge required to solve each question and filter the data based on task knowledge and the model’s inherent skills. On the training side, we introduce GDPO (Group Direction Preference Optimization), which is better suited for resource-limited scenarios and can efficiently approximate the performance of GRPO. Guided by the large model and by implicitly constraining the optimization path through a reference model, GDPO enables more effective knowledge transfer from the large model and constrains excessive parameter drift. Extensive experiments demonstrate that our approach significantly alleviates catastrophic forgetting and improves reasoning performance on smaller models.

[662] RTMol: Rethinking Molecule-text Alignment in a Round-trip View

Letian Chen, Runhan Shi, Gufeng Yu, Yang Yang

Main category: cs.AI

TL;DR: RTMol is a bidirectional alignment framework that unifies molecular captioning and text-to-SMILES generation through self-supervised round-trip learning, addressing limitations of existing methods and improving alignment performance by up to 47%.

DetailsMotivation: Existing molecular sequence representation methods treat molecular captioning and text-based molecular design as separate tasks, facing limitations in chemical accuracy, ambiguous training data, and bidirectional inconsistency between generation directions.

Method: Proposes RTMol framework using self-supervised round-trip learning that unifies molecular captioning and text-to-SMILES generation, with novel round-trip evaluation metrics and unsupervised training for molecular captioning without paired molecule-text corpora.

Result: RTMol enhances bidirectional alignment performance by up to 47% across various LLMs, establishing an effective paradigm for joint molecule-text understanding and generation.

Conclusion: RTMol provides a unified bidirectional alignment framework that overcomes key limitations of existing methods and significantly improves molecule-text alignment performance through self-supervised round-trip learning.

Abstract: Aligning molecular sequence representations (e.g., SMILES notations) with textual descriptions is critical for applications spanning drug discovery, materials design, and automated chemical literature analysis. Existing methodologies typically treat molecular captioning (molecule-to-text) and text-based molecular design (text-to-molecule) as separate tasks, relying on supervised fine-tuning or contrastive learning pipelines. These approaches face three key limitations: (i) conventional metrics like BLEU prioritize linguistic fluency over chemical accuracy, (ii) training datasets frequently contain chemically ambiguous narratives with incomplete specifications, and (iii) independent optimization of generation directions leads to bidirectional inconsistency. To address these issues, we propose RTMol, a bidirectional alignment framework that unifies molecular captioning and text-to-SMILES generation through self-supervised round-trip learning. The framework introduces novel round-trip evaluation metrics and enables unsupervised training for molecular captioning without requiring paired molecule-text corpora. Experiments demonstrate that RTMol enhances bidirectional alignment performance by up to 47% across various LLMs, establishing an effective paradigm for joint molecule-text understanding and generation.

[663] Incremental Maintenance of DatalogMTL Materialisations

Kaiyue Zhao, Dingqi Chen, Shaoyu Wang, Pan Hu

Main category: cs.AI

TL;DR: DRedMTL is an incremental reasoning algorithm for DatalogMTL that efficiently handles dynamic updates by extending the classical DRed algorithm to work with periodic interval representations.

DetailsMotivation: Existing DatalogMTL reasoning approaches lack support for efficient dynamic updates, which is crucial for real-world applications with frequent data updates.

Method: Extends the classical DRed algorithm with specifically designed operators to handle periodic representations of DatalogMTL materialisations with bounded intervals.

Result: Experimental results show DRedMTL often significantly outperforms rematerialisation, sometimes by orders of magnitude on publicly available datasets.

Conclusion: DRedMTL provides an efficient incremental reasoning solution for DatalogMTL that addresses the dynamic update requirements of real-world temporal data applications.

Abstract: DatalogMTL extends the classical Datalog language with metric temporal logic (MTL), enabling expressive reasoning over temporal data. While existing reasoning approaches, such as materialisation based and automata based methods, offer soundness and completeness, they lack support for handling efficient dynamic updates, a crucial requirement for real-world applications that involve frequent data updates. In this work, we propose DRedMTL, an incremental reasoning algorithm for DatalogMTL with bounded intervals. Our algorithm builds upon the classical DRed algorithm, which incrementally updates the materialisation of a Datalog program. Unlike a Datalog materialisation which is in essence a finite set of facts, a DatalogMTL materialisation has to be represented as a finite set of facts plus periodic intervals indicating how the full materialisation can be constructed through unfolding. To cope with this, our algorithm is equipped with specifically designed operators to efficiently handle such periodic representations of DatalogMTL materialisations. We have implemented this approach and tested it on several publicly available datasets. Experimental results show that DRedMTL often significantly outperforms rematerialisation, sometimes by orders of magnitude.

[664] Debate over Mixed-knowledge: A Robust Multi-Agent Framework for Incomplete Knowledge Graph Question Answering

Jilong Liu, Pengyang Shao, Wei Qin, Fei Liu, Yonghui Yang, Richang Hong

Main category: cs.AI

TL;DR: Proposes DoM framework for incomplete KGQA using multi-agent debate to dynamically integrate structured and unstructured knowledge, plus a new realistic dataset based on real-world knowledge updates.

DetailsMotivation: Real-world KGs are often incomplete, and existing IKGQA methods lack adaptive fusion of multiple knowledge sources, failing to exploit their complementary strengths. Current datasets also don't realistically simulate knowledge incompleteness.

Method: DoM framework using multi-agent debate with specialized agents for KG and RAG inference, coordinated through iterative interaction. Decomposes questions, retrieves evidence via dual agents, and uses judge agent to evaluate and aggregate answers.

Result: DoM consistently outperforms state-of-the-art baselines in extensive experiments, demonstrating improved robustness to KG incompleteness through effective knowledge complementarity.

Conclusion: The proposed DoM framework effectively addresses IKGQA by dynamically integrating multiple knowledge sources through multi-agent collaboration, and the new dataset provides a more realistic benchmark for evaluating IKGQA methods.

Abstract: Knowledge Graph Question Answering (KGQA) aims to improve factual accuracy by leveraging structured knowledge. However, real-world Knowledge Graphs (KGs) are often incomplete, leading to the problem of Incomplete KGQA (IKGQA). A common solution is to incorporate external data to fill knowledge gaps, but existing methods lack the capacity to adaptively and contextually fuse multiple sources, failing to fully exploit their complementary strengths. To this end, we propose Debate over Mixed-knowledge (DoM), a novel framework that enables dynamic integration of structured and unstructured knowledge for IKGQA. Built upon the Multi-Agent Debate paradigm, DoM assigns specialized agents to perform inference over knowledge graphs and external texts separately, and coordinates their outputs through iterative interaction. It decomposes the input question into sub-questions, retrieves evidence via dual agents (KG and Retrieval-Augmented Generation, RAG), and employs a judge agent to evaluate and aggregate intermediate answers. This collaboration exploits knowledge complementarity and enhances robustness to KG incompleteness. In addition, existing IKGQA datasets simulate incompleteness by randomly removing triples, failing to capture the irregular and unpredictable nature of real-world knowledge incompleteness. To address this, we introduce a new dataset, Incomplete Knowledge Graph WebQuestions, constructed by leveraging real-world knowledge updates. These updates reflect knowledge beyond the static scope of KGs, yielding a more realistic and challenging benchmark. Through extensive experiments, we show that DoM consistently outperforms state-of-the-art baselines.

[665] Harnessing Diverse Perspectives: A Multi-Agent Framework for Enhanced Error Detection in Knowledge Graphs

Yu Li, Yi Huang, Guilin Qi, Junlan Feng, Nan Hu, Songlin Zhai, Haohan Xue, Yongrui Chen, Ruoyan Shen, Tongtong Wu

Main category: cs.AI

TL;DR: MAKGED is a multi-agent framework using multiple LLMs for knowledge graph error detection, combining fine-grained subgraph embeddings with LLM queries to improve accuracy and transparency.

DetailsMotivation: Existing error detection methods fail to utilize fine-grained subgraph information effectively, rely on fixed graph structures, and lack transparency in decision-making, leading to suboptimal performance.

Method: Proposes MAKGED framework that concatenates fine-grained bidirectional subgraph embeddings with LLM-based query embeddings to create four specialized agents that engage in multi-round discussions using subgraph information from different dimensions.

Result: Extensive experiments on FB15K and WN18RR show MAKGED outperforms state-of-the-art methods, enhancing accuracy and robustness of KG evaluation.

Conclusion: The framework enables training specialized agents using domain-specific knowledge graphs for error detection, demonstrating significant industrial application potential.

Abstract: Knowledge graphs are widely used in industrial applications, making error detection crucial for ensuring the reliability of downstream applications. Existing error detection methods often fail to effectively utilize fine-grained subgraph information and rely solely on fixed graph structures, while also lacking transparency in their decision-making processes, which results in suboptimal detection performance. In this paper, we propose a novel Multi-Agent framework for Knowledge Graph Error Detection (MAKGED) that utilizes multiple large language models (LLMs) in a collaborative setting. By concatenating fine-grained, bidirectional subgraph embeddings with LLM-based query embeddings during training, our framework integrates these representations to produce four specialized agents. These agents utilize subgraph information from different dimensions to engage in multi-round discussions, thereby improving error detection accuracy and ensuring a transparent decision-making process. Extensive experiments on FB15K and WN18RR demonstrate that MAKGED outperforms state-of-the-art methods, enhancing the accuracy and robustness of KG evaluation. For specific industrial scenarios, our framework can facilitate the training of specialized agents using domain-specific knowledge graphs for error detection, which highlights the potential industrial application value of our framework. Our code and datasets are available at https://github.com/kse-ElEvEn/MAKGED.

[666] ViTE: Virtual Graph Trajectory Expert Router for Pedestrian Trajectory Prediction

Ruochen Li, Zhanxing Zhu, Tanqiu Qiao, Hubert P. H. Shum

Main category: cs.AI

TL;DR: ViTE is a novel framework for pedestrian trajectory prediction that uses virtual graphs and expert routing to model both explicit one-hop and implicit high-order interactions without deep GNN stacks, achieving state-of-the-art performance.

DetailsMotivation: Existing approaches face a fundamental trade-off between insufficient layers causing under-reaching problems and excessive depth leading to prohibitive computational costs. Current methods rely too heavily on architectural depth rather than adaptively modeling both explicit and implicit interactions.

Method: ViTE consists of two key modules: a Virtual Graph that introduces dynamic virtual nodes to model long-range and high-order interactions without deep GNN stacks, and an Expert Router that adaptively selects interaction experts based on social context using a Mixture-of-Experts design.

Result: Experiments on three benchmarks (ETH/UCY, NBA, and SDD) demonstrate that ViTE consistently achieves state-of-the-art performance, validating both its effectiveness and practical efficiency.

Conclusion: The proposed ViTE framework successfully addresses the limitations of deep GNN approaches by providing flexible and scalable reasoning across varying interaction patterns while maintaining computational efficiency.

Abstract: Pedestrian trajectory prediction is critical for ensuring safety in autonomous driving, surveillance systems, and urban planning applications. While early approaches primarily focus on one-hop pairwise relationships, recent studies attempt to capture high-order interactions by stacking multiple Graph Neural Network (GNN) layers. However, these approaches face a fundamental trade-off: insufficient layers may lead to under-reaching problems that limit the model’s receptive field, while excessive depth can result in prohibitive computational costs. We argue that an effective model should be capable of adaptively modeling both explicit one-hop interactions and implicit high-order dependencies, rather than relying solely on architectural depth. To this end, we propose ViTE (Virtual graph Trajectory Expert router), a novel framework for pedestrian trajectory prediction. ViTE consists of two key modules: a Virtual Graph that introduces dynamic virtual nodes to model long-range and high-order interactions without deep GNN stacks, and an Expert Router that adaptively selects interaction experts based on social context using a Mixture-of-Experts design. This combination enables flexible and scalable reasoning across varying interaction patterns. Experiments on three benchmarks (ETH/UCY, NBA, and SDD) demonstrate that our method consistently achieves state-of-the-art performance, validating both its effectiveness and practical efficiency.

[667] Beyond World Models: Rethinking Understanding in AI Models

Tarun Gupta, Danish Pruthi

Main category: cs.AI

TL;DR: The paper critically examines whether AI world models truly achieve human-level understanding by using philosophical case studies to highlight differences between world model capabilities and genuine understanding.

DetailsMotivation: To investigate if AI world models, which simulate external world aspects and enable prediction, actually demonstrate human-like understanding rather than just statistical correlations.

Method: Using case studies from philosophy of science literature to analyze where world model capabilities differ from human understanding, focusing on specific philosophical analyses.

Result: The philosophical case studies reveal limitations in how world models characterize understanding, showing distinctions between simulation capabilities and genuine human-level comprehension.

Conclusion: World models may not adequately capture human-level understanding, as philosophical analyses demonstrate important differences between simulation/prediction capabilities and true comprehension.

Abstract: World models have garnered substantial interest in the AI community. These are internal representations that simulate aspects of the external world, track entities and states, capture causal relationships, and enable prediction of consequences. This contrasts with representations based solely on statistical correlations. A key motivation behind this research direction is that humans possess such mental world models, and finding evidence of similar representations in AI models might indicate that these models “understand” the world in a human-like way. In this paper, we use case studies from the philosophy of science literature to critically examine whether the world model framework adequately characterizes human-level understanding. We focus on specific philosophical analyses where the distinction between world model capabilities and human understanding is most pronounced. While these represent particular views of understanding rather than universal definitions, they help us explore the limits of world models.

[668] CAMAR: Continuous Actions Multi-Agent Routing

Artem Pshenitsyn, Aleksandr Panov, Alexey Skrynnik

Main category: cs.AI

TL;DR: CAMAR is a new MARL benchmark for multi-agent pathfinding with continuous actions, supporting cooperative/competitive interactions and efficient execution at 100K steps/second, with integrated classical planning methods.

DetailsMotivation: Existing MARL benchmarks lack combination of continuous state/action spaces with challenging coordination and planning tasks, creating a need for more realistic testbeds.

Method: Developed CAMAR benchmark with continuous actions, three-tier evaluation protocol, and integration of classical planning methods (RRT, RRT*) into MARL pipelines as standalone baselines and hybrid approaches.

Result: CAMAR presents a challenging and realistic testbed that runs efficiently at 100,000 environment steps per second, with provided test scenarios and benchmarking tools for reproducibility.

Conclusion: CAMAR successfully addresses the gap in MARL benchmarks by providing a continuous-action pathfinding environment that enables deeper performance analysis and supports hybrid planning-MARL approaches.

Abstract: Multi-agent reinforcement learning (MARL) is a powerful paradigm for solving cooperative and competitive decision-making problems. While many MARL benchmarks have been proposed, few combine continuous state and action spaces with challenging coordination and planning tasks. We introduce CAMAR, a new MARL benchmark designed explicitly for multi-agent pathfinding in environments with continuous actions. CAMAR supports cooperative and competitive interactions between agents and runs efficiently at up to 100,000 environment steps per second. We also propose a three-tier evaluation protocol to better track algorithmic progress and enable deeper analysis of performance. In addition, CAMAR allows the integration of classical planning methods such as RRT and RRT* into MARL pipelines. We use them as standalone baselines and combine RRT* with popular MARL algorithms to create hybrid approaches. We provide a suite of test scenarios and benchmarking tools to ensure reproducibility and fair comparison. Experiments show that CAMAR presents a challenging and realistic testbed for the MARL community.

[669] AURA: Development and Validation of an Augmented Unplanned Removal Alert System using Synthetic ICU Videos

Junhyuk Seo, Hyeyoon Moon, Kyu-Hwan Jung, Namkee Oh, Taerim Kim

Main category: cs.AI

TL;DR: AURA is a vision-based system that detects unplanned extubation risk in ICUs using synthetic video data generated via text-to-video diffusion, analyzing pose patterns for collision and agitation.

DetailsMotivation: Unplanned extubation is a critical safety issue in ICUs, but real-time detection has been limited due to ethical/privacy challenges with obtaining annotated ICU video data.

Method: Used text-to-video diffusion to generate synthetic ICU scenarios, applied pose estimation to detect two high-risk patterns: collision (hand entry near airway tubes) and agitation (velocity of anatomical keypoints).

Result: Expert assessments confirmed synthetic data realism; system showed high accuracy for collision detection and moderate performance for agitation recognition.

Conclusion: Demonstrates a novel pathway for developing privacy-preserving, reproducible patient safety monitoring systems deployable in intensive care settings.

Abstract: Unplanned extubation (UE) remains a critical patient safety concern in intensive care units (ICUs), often leading to severe complications or death. Real-time UE detection has been limited, largely due to the ethical and privacy challenges of obtaining annotated ICU video data. We propose Augmented Unplanned Removal Alert (AURA), a vision-based risk detection system developed and validated entirely on a fully synthetic video dataset. By leveraging text-to-video diffusion, we generated diverse and clinically realistic ICU scenarios capturing a range of patient behaviors and care contexts. The system applies pose estimation to identify two high-risk movement patterns: collision, defined as hand entry into spatial zones near airway tubes, and agitation, quantified by the velocity of tracked anatomical keypoints. Expert assessments confirmed the realism of the synthetic data, and performance evaluations showed high accuracy for collision detection and moderate performance for agitation recognition. This work demonstrates a novel pathway for developing privacy-preserving, reproducible patient safety monitoring systems with potential for deployment in intensive care settings.

[670] See it. Say it. Sorted: Agentic System for Compositional Diagram Generation

Hantao Zhang, Jingyang Liu, Ed Li

Main category: cs.AI

TL;DR: Training-free agentic system combining VLM and LLMs to convert hand sketches into precise SVG diagrams through iterative editing loops with critic and judge components.

DetailsMotivation: Diffusion models struggle with spatial precision and symbolic structure needed for flowcharts, requiring a solution that can handle alignment, connectivity and compositional elements.

Method: Iterative loop system: Critic VLM proposes relational edits, multiple LLMs generate SVG updates with diverse strategies, Judge VLM selects best candidate to ensure stable improvement.

Result: Outperforms GPT-5 and Gemini-2.5-Pro on 10 sketch samples, better reconstructing layout and structure while accurately composing primitives without unwanted text.

Conclusion: The approach produces editable SVG programs, supports human-in-the-loop corrections, and is extensible to presentation tools via APIs with open-sourced implementation.

Abstract: We study sketch-to-diagram generation: converting rough hand sketches into precise, compositional diagrams. Diffusion models excel at photorealism but struggle with the spatial precision, alignment, and symbolic structure required for flowcharts. We introduce See it. Say it. Sorted., a training-free agentic system that couples a Vision-Language Model (VLM) with Large Language Models (LLMs) to produce editable Scalable Vector Graphics (SVG) programs. The system runs an iterative loop in which a Critic VLM proposes a small set of qualitative, relational edits; multiple candidate LLMs synthesize SVG updates with diverse strategies (conservative->aggressive, alternative, focused); and a Judge VLM selects the best candidate, ensuring stable improvement. This design prioritizes qualitative reasoning over brittle numerical estimates, preserves global constraints (e.g., alignment, connectivity), and naturally supports human-in-the-loop corrections. On 10 sketches derived from flowcharts in published papers, our method more faithfully reconstructs layout and structure than two frontier closed-source image generation LLMs (GPT-5 and Gemini-2.5-Pro), accurately composing primitives (e.g., multi-headed arrows) without inserting unwanted text. Because outputs are programmatic SVGs, the approach is readily extensible to presentation tools (e.g., PowerPoint) via APIs and can be specialized with improved prompts and task-specific tools. The codebase is open-sourced at https://github.com/hantaoZhangrichard/see_it_say_it_sorted.git.

[671] Mobile-Agent-RAG: Driving Smart Multi-Agent Coordination with Contextual Knowledge Empowerment for Long-Horizon Mobile Automation

Yuxiang Zhou, Jichang Li, Yanhao Zhang, Haonan Lu, Guanbin Li

Main category: cs.AI

TL;DR: Mobile-Agent-RAG is a hierarchical multi-agent framework that uses dual-level retrieval augmentation to improve mobile agent performance on real-world, long-horizon tasks by addressing strategic hallucinations in planning and operational errors in UI execution.

DetailsMotivation: Current mobile agents have inadequate success rates on real-world tasks due to excessive reliance on static MLLM knowledge, leading to strategic hallucinations in planning and operational errors in UI execution. Different knowledge types are needed for planning (high-level strategy) vs execution (low-level UI instructions).

Method: Proposed Mobile-Agent-RAG with dual-level retrieval: Manager-RAG for planning stage retrieves human-validated task plans to reduce strategic hallucinations, and Operator-RAG for execution stage retrieves precise low-level guidance for accurate atomic actions. Built two specialized knowledge bases and Mobile-Eval-RAG benchmark.

Result: Significantly outperforms SoTA baselines, improving task completion rate by 11.0% and step efficiency by 10.2% on challenging multi-app, long-horizon tasks.

Conclusion: Mobile-Agent-RAG establishes a robust paradigm for context-aware, reliable multi-agent mobile automation by effectively addressing the distinct knowledge requirements of planning and execution through hierarchical retrieval augmentation.

Abstract: Mobile agents show immense potential, yet current state-of-the-art (SoTA) agents exhibit inadequate success rates on real-world, long-horizon, cross-application tasks. We attribute this bottleneck to the agents’ excessive reliance on static, internal knowledge within MLLMs, which leads to two critical failure points: 1) strategic hallucinations in high-level planning and 2) operational errors during low-level execution on user interfaces (UI). The core insight of this paper is that high-level planning and low-level UI operations require fundamentally distinct types of knowledge. Planning demands high-level, strategy-oriented experiences, whereas operations necessitate low-level, precise instructions closely tied to specific app UIs. Motivated by these insights, we propose Mobile-Agent-RAG, a novel hierarchical multi-agent framework that innovatively integrates dual-level retrieval augmentation. At the planning stage, we introduce Manager-RAG to reduce strategic hallucinations by retrieving human-validated comprehensive task plans that provide high-level guidance. At the execution stage, we develop Operator-RAG to improve execution accuracy by retrieving the most precise low-level guidance for accurate atomic actions, aligned with the current app and subtask. To accurately deliver these knowledge types, we construct two specialized retrieval-oriented knowledge bases. Furthermore, we introduce Mobile-Eval-RAG, a challenging benchmark for evaluating such agents on realistic multi-app, long-horizon tasks. Extensive experiments demonstrate that Mobile-Agent-RAG significantly outperforms SoTA baselines, improving task completion rate by 11.0% and step efficiency by 10.2%, establishing a robust paradigm for context-aware, reliable multi-agent mobile automation.

[672] Surrogate Modeling and Explainable Artificial Intelligence for Complex Systems: A Workflow for Automated Simulation Exploration

Paul Saves, Pramudita Satria Palar, Muhammad Daffa Robani, Nicolas Verstaevel, Moncef Garouani, Julien Aligon, Benoit Gaudou, Koji Shimoyama, Joseph Morlier

Main category: cs.AI

TL;DR: The paper proposes a workflow using lightweight emulators trained on compact experimental designs to address computational cost and transparency issues in simulation-driven engineering, enabling fast approximations, uncertainty quantification, and explainable AI analyses.

DetailsMotivation: To overcome two main challenges in simulation-driven engineering workflows: (1) high computational costs from many expensive simulator runs, and (2) limited transparency and reliability when using opaque blackbox components.

Method: Training lightweight emulators on compact designs of experiments that provide fast approximations, enable uncertainty quantification, and support global/local Explainable AI analyses. The methodology supports continuous/categorical inputs and combines global-effect/uncertainty analyses with local attribution.

Result: The approach enables large-scale exploration in seconds, uncovers nonlinear interactions and emergent behaviors, identifies key design/policy levers, and signals regions where surrogates need more data or alternative architectures. Demonstrated on hybrid-electric aircraft design and urban segregation agent-based model.

Conclusion: The surrogate model and XAI coupling provides an effective workflow for simulation-based complex-system analysis, addressing both computational efficiency and transparency challenges while guiding further data collection and model refinement.

Abstract: Complex systems are increasingly explored through simulation-driven engineering workflows that combine physics-based and empirical models with optimization and analytics. Despite their power, these workflows face two central obstacles: (1) high computational cost, since accurate exploration requires many expensive simulator runs; and (2) limited transparency and reliability when decisions rely on opaque blackbox components. We propose a workflow that addresses both challenges by training lightweight emulators on compact designs of experiments that (i) provide fast, low-latency approximations of expensive simulators, (ii) enable rigorous uncertainty quantification, and (iii) are adapted for global and local Explainable Artificial Intelligence (XAI) analyses. This workflow unifies every simulation-based complex-system analysis tool, ranging from engineering design to agent-based models for socio-environmental understanding. In this paper, we proposea comparative methodology and practical recommendations for using surrogate-based explainability tools within the proposed workflow. The methodology supports continuous and categorical inputs, combines global-effect and uncertainty analyses with local attribution, and evaluates the consistency of explanations across surrogate models, thereby diagnosing surrogate adequacy and guiding further data collection or model refinement. We demonstrate the approach on two contrasting case studies: a multidisciplinary design analysis of a hybrid-electric aircraft and an agent-based model of urban segregation. Results show that the surrogate model and XAI coupling enables large-scale exploration in seconds, uncovers nonlinear interactions and emergent behaviors, identifies key design and policy levers, and signals regions where surrogates require more data or alternative architectures.

[673] MoralReason: Generalizable Moral Decision Alignment For LLM Agents Using Reasoning-Level Reinforcement Learning

Zhiyu An, Wan Du

Main category: cs.AI

TL;DR: LLM agents can be trained to apply consistent moral reasoning frameworks (utilitarian, deontological, virtue ethics) to novel moral scenarios beyond their training distribution using Group Relative Policy Optimization with composite rewards.

DetailsMotivation: Large language models increasingly influence human moral decisions, but current approaches focus on evaluation rather than actively steering moral decisions, creating an out-of-distribution moral alignment problem.

Method: Created Moral-Reason-QA dataset with 680 human-annotated moral scenarios and framework-specific reasoning traces. Used Group Relative Policy Optimization with composite rewards to simultaneously optimize decision alignment and framework-specific reasoning processes.

Result: Successful generalization to unseen moral scenarios with softmax-normalized alignment scores improving by +0.757 for utilitarian and +0.450 for deontological frameworks on out-of-distribution evaluation sets.

Conclusion: LLM agents can be systematically trained to internalize and apply specific moral frameworks to novel situations, providing critical foundation for AI safety as language models become more integrated into human decision-making.

Abstract: Large language models are increasingly influencing human moral decisions, yet current approaches focus primarily on evaluating rather than actively steering their moral decisions. We formulate this as an out-of-distribution moral alignment problem, where LLM agents must learn to apply consistent moral reasoning frameworks to scenarios beyond their training distribution. We introduce Moral-Reason-QA, a novel dataset extending 680 human-annotated, high-ambiguity moral scenarios with framework-specific reasoning traces across utilitarian, deontological, and virtue ethics, enabling systematic evaluation of moral generalization in realistic decision contexts. Our learning approach employs Group Relative Policy Optimization with composite rewards that simultaneously optimize decision alignment and framework-specific reasoning processes to facilitate learning of the underlying moral frameworks. Experimental results demonstrate successful generalization to unseen moral scenarios, with softmax-normalized alignment scores improving by +0.757 for utilitarian and +0.450 for deontological frameworks when tested on out-of-distribution evaluation sets. The experiments also reveal training challenges and promising directions that inform future research. These findings establish that LLM agents can be systematically trained to internalize and apply specific moral frameworks to novel situations, providing a critical foundation for AI safety as language models become more integrated into human decision-making processes.

[674] SciAgent: A Unified Multi-Agent System for Generalistic Scientific Reasoning

Xuchen Li, Ruitao Wu, Xuanbo Liu, Xukai Wang, Jinbo Hu, Zhixin Bai, Bohan Zeng, Hao Liang, Leheng Chen, Mingrui Chen, Haitian Zhong, Xuanlin Yang, Xu-Yao Zhang, Liu Liu, Jia Li, Kaiqi Huang, Jiahao Xu, Haitao Mi, Wentao Zhang, Bin Dong

Main category: cs.AI

TL;DR: SciAgent is a unified multi-agent system for generalistic scientific reasoning that achieves expert-level performance across multiple scientific disciplines by dynamically orchestrating specialized agents in hierarchical problem-solving pipelines.

DetailsMotivation: Current AI systems achieve expert-level performance on specific scientific tasks but remain narrow and handcrafted, lacking the ability to adapt reasoning strategies across different disciplines and difficulty levels.

Method: SciAgent uses a hierarchical multi-agent system where a Coordinator Agent interprets problems and orchestrates specialized Worker Systems composed of reasoning Sub-agents for symbolic deduction, conceptual modeling, numerical computation, and verification.

Result: SciAgent consistently attains or surpasses human gold-medalist performance across mathematics and physics Olympiads (IMO, IMC, IPhO, CPhO), and also performs well on International Chemistry Olympiad and Humanity’s Last Exam benchmark problems.

Conclusion: SciAgent represents a concrete step toward generalistic scientific intelligence - AI systems capable of coherent, cross-disciplinary reasoning at expert levels.

Abstract: Recent advances in large language models have enabled AI systems to achieve expert-level performance on domain-specific scientific tasks, yet these systems remain narrow and handcrafted. We introduce SciAgent, a unified multi-agent system designed for generalistic scientific reasoning-the ability to adapt reasoning strategies across disciplines and difficulty levels. SciAgent organizes problem solving as a hierarchical process: a Coordinator Agent interprets each problem’s domain and complexity, dynamically orchestrating specialized Worker Systems, each composed of interacting reasoning Sub-agents for symbolic deduction, conceptual modeling, numerical computation, and verification. These agents collaboratively assemble and refine reasoning pipelines tailored to each task. Across mathematics and physics Olympiads (IMO, IMC, IPhO, CPhO), SciAgent consistently attains or surpasses human gold-medalist performance, demonstrating both domain generality and reasoning adaptability. Additionally, SciAgent has been tested on the International Chemistry Olympiad (IChO) and selected problems from the Humanity’s Last Exam (HLE) benchmark, further confirming the system’s ability to generalize across diverse scientific domains. This work establishes SciAgent as a concrete step toward generalistic scientific intelligence-AI systems capable of coherent, cross-disciplinary reasoning at expert levels.

[675] UpBench: A Dynamically Evolving Real-World Labor-Market Agentic Benchmark Framework Built for Human-Centric AI

Darvin Yi, Teng Liu, Mattie Terzolo, Lance Hasson, Ayan Sinh, Pablo Mendes, Andrew Rabinovich

Main category: cs.AI

TL;DR: UpBench is a dynamic benchmark using real Upwork jobs to evaluate LLM agents’ real-world competence, adaptability, and human collaboration capacity through expert-designed rubrics and financial outcomes.

DetailsMotivation: Existing benchmarks are static, synthetic, or domain-limited, failing to assess how AI agents perform in dynamic, economically meaningful environments with genuine work activity.

Method: Uses verified Upwork job transactions with expert freelancers creating detailed acceptance criteria rubrics, then evaluates AI submissions with per-criterion feedback in a human-centered evaluation framework.

Result: Enables fine-grained analysis of model strengths, weaknesses, and instruction-following fidelity beyond binary metrics, anchored in real professional standards and financial outcomes.

Conclusion: UpBench provides a scalable, human-centered foundation for evaluating agentic systems in authentic labor-market contexts, supporting AI-human partnership rather than replacement.

Abstract: As large language model (LLM) agents increasingly undertake digital work, reliable frameworks are needed to evaluate their real-world competence, adaptability, and capacity for human collaboration. Existing benchmarks remain largely static, synthetic, or domain-limited, providing limited insight into how agents perform in dynamic, economically meaningful environments. We introduce UpBench, a dynamically evolving benchmark grounded in real jobs drawn from the global Upwork labor marketplace. Each task corresponds to a verified client transaction, anchoring evaluation in genuine work activity and financial outcomes. UpBench employs a rubric-based evaluation framework, in which expert freelancers decompose each job into detailed, verifiable acceptance criteria and assess AI submissions with per-criterion feedback. This structure enables fine-grained analysis of model strengths, weaknesses, and instruction-following fidelity beyond binary pass/fail metrics. Human expertise is integrated throughout the data pipeline (from job curation and rubric construction to evaluation) ensuring fidelity to real professional standards and supporting research on human-AI collaboration. By regularly refreshing tasks to reflect the evolving nature of online work, UpBench provides a scalable, human-centered foundation for evaluating agentic systems in authentic labor-market contexts, offering a path toward a collaborative framework, where AI amplifies human capability through partnership rather than replacement.

[676] Reward and Guidance through Rubrics: Promoting Exploration to Improve Multi-Domain Reasoning

Baolong Bi, Shenghua Liu, Yiwei Wang, Siqian Tong, Lingrui Mei, Yuyao Ge, Yilong Xu, Jiafeng Guo, Xueqi Cheng

Main category: cs.AI

TL;DR: RGR-GRPO is a rubric-driven RL framework that uses rubrics for fine-grained rewards and offline guidance, enabling LLMs to explore larger solution spaces and achieve superior performance across multiple reasoning domains.

DetailsMotivation: Existing RL methods for LLMs focus on single domains with verifiable rewards and use purely online RL, which restricts exploration space and limits reasoning performance.

Method: Proposes RGR-GRPO framework that leverages rubrics to provide dense reward signals and offline guidance during GRPO training, enabling larger solution space exploration.

Result: Achieves average improvements of +7.0% (math), +5.4% (physics), +8.4% (chemistry), +6.6% (general reasoning) over verifiable online RL baseline across 14 benchmarks, with stable entropy and superior pass@k performance.

Conclusion: RGR-GRPO effectively overcomes limitations of existing RL methods by using rubrics for both rewards and guidance, enabling sustained exploration and breakthrough beyond performance bottlenecks in multi-domain reasoning.

Abstract: Recent advances in reinforcement learning (RL) have significantly improved the complex reasoning capabilities of large language models (LLMs). Despite these successes, existing methods mainly focus on single-domain RL (e.g., mathematics) with verifiable rewards (RLVR), and their reliance on purely online RL frameworks restricts the exploration space, thereby limiting reasoning performance. In this paper, we address these limitations by leveraging rubrics to provide both fine-grained reward signals and offline guidance. We propose $\textbf{RGR-GRPO}$ (Reward and Guidance through Rubrics), a rubric-driven RL framework for multi-domain reasoning. RGR-GRPO enables LLMs to receive dense and informative rewards while exploring a larger solution space during GRPO training. Extensive experiments across 14 benchmarks spanning multiple domains demonstrate that RGR-GRPO consistently outperforms RL methods that rely solely on alternative reward schemes or offline guidance. Compared with verifiable online RL baseline, RGR-GRPO achieves average improvements of +7.0%, +5.4%, +8.4%, and +6.6% on mathematics, physics, chemistry, and general reasoning tasks, respectively. Notably, RGR-GRPO maintains stable entropy fluctuations during off-policy training and achieves superior pass@k performance, reflecting sustained exploration and effective breakthrough beyond existing performance bottlenecks.

[677] More Than Irrational: Modeling Belief-Biased Agents

Yifan Zhu, Sammie Katt, Samuel Kaski

Main category: cs.AI

TL;DR: The paper introduces computational-rational user models for cognitively-bounded agents, focusing on how memory limitations lead to biased beliefs and sub-optimal decisions. It proposes an online inference method to identify cognitive bounds from observed actions.

DetailsMotivation: Predicting sub-optimal human behavior remains challenging, as such behaviors often stem from rational decisions under cognitive bounds and biased beliefs rather than irrationality.

Method: Formalizes computational-rational models with explicit cognitive processes, proposes nested particle filtering for online inference of latent belief states and cognitive bounds from observed actions.

Result: Validated in navigation tasks with memory decay, showing the model generates plausible behaviors and accurately recovers ground-truth cognitive bounds from ≤100 observation steps.

Conclusion: Provides principled foundation for adaptive AI assistants that account for users’ memory limitations, enabling more effective human-AI collaboration.

Abstract: Despite the explosive growth of AI and the technologies built upon it, predicting and inferring the sub-optimal behavior of users or human collaborators remains a critical challenge. In many cases, such behaviors are not a result of irrationality, but rather a rational decision made given inherent cognitive bounds and biased beliefs about the world. In this paper, we formally introduce a class of computational-rational (CR) user models for cognitively-bounded agents acting optimally under biased beliefs. The key novelty lies in explicitly modeling how a bounded memory process leads to a dynamically inconsistent and biased belief state and, consequently, sub-optimal sequential decision-making. We address the challenge of identifying the latent user-specific bound and inferring biased belief states from passive observations on the fly. We argue that for our formalized CR model family with an explicit and parameterized cognitive process, this challenge is tractable. To support our claim, we propose an efficient online inference method based on nested particle filtering that simultaneously tracks the user’s latent belief state and estimates the unknown cognitive bound from a stream of observed actions. We validate our approach in a representative navigation task using memory decay as an example of a cognitive bound. With simulations, we show that (1) our CR model generates intuitively plausible behaviors corresponding to different levels of memory capacity, and (2) our inference method accurately and efficiently recovers the ground-truth cognitive bounds from limited observations ($\le 100$ steps). We further demonstrate how this approach provides a principled foundation for developing adaptive AI assistants, enabling adaptive assistance that accounts for the user’s memory limitations.

[678] Learning to Trust: Bayesian Adaptation to Varying Suggester Reliability in Sequential Decision Making

Dylan M. Asmar, Mykel J. Kochenderfer

Main category: cs.AI

TL;DR: A framework for autonomous agents to dynamically learn and adapt to varying reliability of external action suggestions in partially observable environments, using Bayesian inference and strategic “ask” actions.

DetailsMotivation: Existing methods assume static and known suggester quality, limiting practical deployment. Agents need to handle varying reliability of external suggestions in uncertain environments.

Method: Integrate suggester quality into belief representation with Bayesian inference over suggester types, and introduce explicit “ask” action for strategic suggestion requests.

Result: Experimental evaluation shows robust performance across varying suggester qualities, adaptation to changing reliability, and strategic management of suggestion requests.

Conclusion: Provides foundation for adaptive human-agent collaboration by addressing suggestion uncertainty in uncertain environments.

Abstract: Autonomous agents operating in sequential decision-making tasks under uncertainty can benefit from external action suggestions, which provide valuable guidance but inherently vary in reliability. Existing methods for incorporating such advice typically assume static and known suggester quality parameters, limiting practical deployment. We introduce a framework that dynamically learns and adapts to varying suggester reliability in partially observable environments. First, we integrate suggester quality directly into the agent’s belief representation, enabling agents to infer and adjust their reliance on suggestions through Bayesian inference over suggester types. Second, we introduce an explicit ``ask’’ action allowing agents to strategically request suggestions at critical moments, balancing informational gains against acquisition costs. Experimental evaluation demonstrates robust performance across varying suggester qualities, adaptation to changing reliability, and strategic management of suggestion requests. This work provides a foundation for adaptive human-agent collaboration by addressing suggestion uncertainty in uncertain environments.

[679] ARCHE: A Novel Task to Evaluate LLMs on Latent Reasoning Chain Extraction

Pengze Li, Jiaqi Liu, Junchi Yu, Lihao Liu, Mingyu Ding, Wanli Ouyang, Shixiang Tang, Xi Chen

Main category: cs.AI

TL;DR: The paper introduces ARCHE, a task for extracting structured reasoning chains from LLM outputs using Peirce’s inference modes, and reveals current models struggle with complete scientific reasoning.

DetailsMotivation: LLMs produce informal reasoning outputs that obscure whether they truly understand scientific inference paradigms, necessitating structured evaluation of reasoning capabilities.

Method: Proposed Latent Reasoning Chain Extraction (ARCHE) task using Reasoning Logic Trees with three Peirce inference modes, created ARCHE Bench benchmark from 70 Nature Communications articles, and introduced logic-aware metrics (Entity Coverage and Reasoning Edge Accuracy).

Result: Evaluation of 10 leading LLMs shows trade-off between reasoning accuracy and content coverage, with no model able to extract complete standard reasoning chains, highlighting significant capability gaps.

Conclusion: There’s a substantial gap between current reasoning models’ abilities and the rigor required for scientific argumentation, indicating need for improved reasoning extraction methods.

Abstract: Large language models (LLMs) are increasingly used in scientific domains. While they can produce reasoning-like content via methods such as chain-of-thought prompting, these outputs are typically unstructured and informal, obscuring whether models truly understand the fundamental reasoning paradigms that underpin scientific inference. To address this, we introduce a novel task named Latent Reasoning Chain Extraction (ARCHE), in which models must decompose complex reasoning arguments into combinations of standard reasoning paradigms in the form of a Reasoning Logic Tree (RLT). In RLT, all reasoning steps are explicitly categorized as one of three variants of Peirce’s fundamental inference modes: deduction, induction, or abduction. To facilitate this task, we release ARCHE Bench, a new benchmark derived from 70 Nature Communications articles, including more than 1,900 references and 38,000 viewpoints. We propose two logic-aware evaluation metrics: Entity Coverage (EC) for content completeness and Reasoning Edge Accuracy (REA) for step-by-step logical validity. Evaluations on 10 leading LLMs on ARCHE Bench reveal that models exhibit a trade-off between REA and EC, and none are yet able to extract a complete and standard reasoning chain. These findings highlight a substantial gap between the abilities of current reasoning models and the rigor required for scientific argumentation.

[680] LOBERT: Generative AI Foundation Model for Limit Order Book Messages

Eljas Linna, Kestutis Baltakys, Alexandros Iosifidis, Juho Kanniainen

Main category: cs.AI

TL;DR: LOBERT is a BERT-based foundation model for Limit Order Book data that treats multi-dimensional messages as single tokens, achieving state-of-the-art performance in predicting mid-price movements and next messages with reduced context length.

DetailsMotivation: Existing LOB models have cumbersome data representations and lack adaptability for different tasks, making them unsuitable for general-purpose use in financial limit order book analysis.

Method: Adapts BERT architecture for LOB data with novel tokenization that treats complete multi-dimensional messages as single tokens while maintaining continuous representations of price, volume, and time.

Result: Achieves leading performance in mid-price movement prediction and next message prediction tasks while requiring shorter context lengths compared to previous methods.

Conclusion: LOBERT provides a general-purpose foundation model for LOB data that is suitable for downstream fine-tuning and outperforms previous approaches in key prediction tasks.

Abstract: Modeling the dynamics of financial Limit Order Books (LOB) at the message level is challenging due to irregular event timing, rapid regime shifts, and the reactions of high-frequency traders to visible order flow. Previous LOB models require cumbersome data representations and lack adaptability outside their original tasks, leading us to introduce LOBERT, a general-purpose encoder-only foundation model for LOB data suitable for downstream fine-tuning. LOBERT adapts the original BERT architecture for LOB data by using a novel tokenization scheme that treats complete multi-dimensional messages as single tokens while retaining continuous representations of price, volume, and time. With these methods, LOBERT achieves leading performance in tasks such as predicting mid-price movements and next messages, while reducing the required context length compared to previous methods.

[681] Enhancing Conversational Recommender Systems with Tree-Structured Knowledge and Pretrained Language Models

Yongwen Ren, Chao Wang, Peng Du, Chuan Qin, Dazhong Shen, Hui Xiong

Main category: cs.AI

TL;DR: PCRS-TKA is a prompt-based framework that integrates pretrained language models with knowledge graphs for conversational recommender systems, using retrieval-augmented generation, dialogue-specific knowledge trees, and collaborative preference modeling to improve accuracy and reduce hallucination.

DetailsMotivation: To address limitations in existing methods that fail to fully exploit PLM reasoning over graph relationships, indiscriminately incorporate retrieved knowledge without context filtering, and neglect collaborative preferences in multi-turn dialogues.

Method: Constructs dialogue-specific knowledge trees from KGs and serializes them into texts for structure-aware reasoning, selectively filters context-relevant knowledge, explicitly models collaborative preferences with specialized supervision signals, and uses a semantic alignment module to harmonize heterogeneous inputs.

Result: Extensive experiments show PCRS-TKA consistently outperforms all baselines in both recommendation and conversational quality.

Conclusion: The proposed framework effectively integrates PLMs with KGs through prompt-based retrieval-augmented generation, achieving superior performance in conversational recommender systems by addressing key challenges in knowledge integration and collaborative preference modeling.

Abstract: Recent advances in pretrained language models (PLMs) have significantly improved conversational recommender systems (CRS), enabling more fluent and context-aware interactions. To further enhance accuracy and mitigate hallucination, many methods integrate PLMs with knowledge graphs (KGs), but face key challenges: failing to fully exploit PLM reasoning over graph relationships, indiscriminately incorporating retrieved knowledge without context filtering, and neglecting collaborative preferences in multi-turn dialogues. To this end, we propose PCRS-TKA, a prompt-based framework employing retrieval-augmented generation to integrate PLMs with KGs. PCRS-TKA constructs dialogue-specific knowledge trees from KGs and serializes them into texts, enabling structure-aware reasoning while capturing rich entity semantics. Our approach selectively filters context-relevant knowledge and explicitly models collaborative preferences using specialized supervision signals. A semantic alignment module harmonizes heterogeneous inputs, reducing noise and enhancing accuracy. Extensive experiments demonstrate that PCRS-TKA consistently outperforms all baselines in both recommendation and conversational quality.

[682] Dynamic Tree Databases in Automated Planning

Oliver Joergensen, Dominik Drexler, Jendrik Seipp

Main category: cs.AI

TL;DR: Dynamic tree databases for efficient state compression in planning tasks

DetailsMotivation: Addressing the challenge of compactly representing generated states in large-scale state-space search, particularly overcoming the memory preallocation limitations of static tree databases.

Method: Proposed a dynamic variant of tree databases for compressing state sets over propositional and numeric variables, maintaining desirable properties of static counterparts.

Result: Achieved compression ratios of several orders of magnitude with negligible runtime overhead in empirical evaluation on classical and numeric planning tasks.

Conclusion: Dynamic tree databases provide an effective solution for state compression in planning, offering significant space savings without substantial performance costs.

Abstract: A central challenge in scaling up explicit state-space search for large tasks is compactly representing the set of generated states. Tree databases, a data structure from model checking, require constant space per generated state in the best case, but they need a large preallocation of memory. We propose a novel dynamic variant of tree databases for compressing state sets over propositional and numeric variables and prove that it maintains the desirable properties of the static counterpart. Our empirical evaluation of state compression techniques for grounded and lifted planning on classical and numeric planning tasks reveals compression ratios of several orders of magnitude, often with negligible runtime overhead.

[683] Optimal Foraging in Memory Retrieval: Evaluating Random Walks and Metropolis-Hastings Sampling in Modern Semantic Spaces

James Moore

Main category: cs.AI

TL;DR: Random walks on modern embedding spaces produce optimal foraging patterns matching human memory retrieval, while more complex sampling methods like Metropolis-Hastings do not align with human behavior.

DetailsMotivation: To determine whether modern high-dimensional embeddings can support algorithms that match human memory foraging patterns observed in semantic fluency tasks, particularly whether complex sampling mechanisms are necessary.

Method: Used state-of-the-art embeddings and prior semantic fluency data to test random walks and Metropolis-Hastings sampling on embedding spaces to model memory retrieval.

Result: Random walks produced results consistent with optimal foraging and Marginal Value Theorem, while Metropolis-Hastings sampling did not match human behavior patterns.

Conclusion: Simple random walks on appropriately structured embeddings can produce near-optimal foraging dynamics, challenging the assumption that complex sampling mechanisms are needed for cognitive models of memory retrieval.

Abstract: Human memory retrieval often resembles ecological foraging where animals search for food in a patchy environment. Optimal foraging means following the Marginal Value Theorem (MVT), in which individuals exploit a patch of semantically related concepts until it becomes less rewarding and then switch to a new cluster. While human behavioral data suggests foraging-like patterns in semantic fluency tasks, it remains unclear whether modern high-dimensional embedding spaces provide representations that allow algorithms to match observed human behavior. Using state-of-the-art embeddings and prior semantic fluency data, I find that random walks on these embedding spaces produce results consistent with optimal foraging and the MVT. Surprisingly, introducing Metropolis-Hastings sampling, an adaptive algorithm expected to model strategic acceptance and rejection of new clusters, does not produce results consistent with human behavior. These findings challenge the assumption that more complex sampling mechanisms inherently lead to better cognitive models of memory retrieval. Instead, they show that appropriately structured embeddings, even with simple sampling, can produce near-optimal foraging dynamics. This supports the perspective of Hills (2012) rather than Abbott (2015), demonstrating that modern embeddings can approximate human memory foraging without relying on complex acceptance criteria.

[684] Event-CausNet: Unlocking Causal Knowledge from Text with Large Language Models for Reliable Spatio-Temporal Forecasting

Luyao Niu, Zepu Wang, Shuyi Guan, Yang Liu, Peng Sun

Main category: cs.AI

TL;DR: Event-CausNet uses LLMs to quantify event reports, builds causal knowledge, and injects it into GNN-LSTM networks via causal attention, achieving 35.87% MAE reduction for robust traffic forecasting during disruptions.

DetailsMotivation: Standard GNNs fail during non-recurring traffic events because they rely on correlational patterns that become invalid when new causal factors (like accidents) disrupt normal traffic flows.

Method: Framework that uses LLM to quantify unstructured event reports, builds causal knowledge base via average treatment effects estimation, and injects this knowledge into dual-stream GNN-LSTM using causal attention mechanism.

Result: Achieves 35.87% reduction in prediction error (MAE), significantly outperforming state-of-the-art baselines on real-world datasets during traffic disruptions.

Conclusion: Bridges gap between correlational models and causal reasoning, providing more accurate, transferable, and interpretable traffic forecasting during critical disruptions.

Abstract: While spatio-temporal Graph Neural Networks (GNNs) excel at modeling recurring traffic patterns, their reliability plummets during non-recurring events like accidents. This failure occurs because GNNs are fundamentally correlational models, learning historical patterns that are invalidated by the new causal factors introduced during disruptions. To address this, we propose Event-CausNet, a framework that uses a Large Language Model to quantify unstructured event reports, builds a causal knowledge base by estimating average treatment effects, and injects this knowledge into a dual-stream GNN-LSTM network using a novel causal attention mechanism to adjust and enhance the forecast. Experiments on a real-world dataset demonstrate that Event-CausNet achieves robust performance, reducing prediction error (MAE) by up to 35.87%, significantly outperforming state-of-the-art baselines. Our framework bridges the gap between correlational models and causal reasoning, providing a solution that is more accurate and transferable, while also offering crucial interpretability, providing a more reliable foundation for real-world traffic management during critical disruptions.

[685] Multi-Agent Reinforcement Learning for Heterogeneous Satellite Cluster Resources Optimization

Mohamad A. Hady, Siyi Hu, Mahardhika Pratama, Zehong Cao, Ryszard Kowalczyk

Main category: cs.AI

TL;DR: This paper investigates resource optimization in heterogeneous satellite clusters for Earth Observation missions using Reinforcement Learning and Multi-Agent Reinforcement Learning approaches.

DetailsMotivation: Traditional optimization methods struggle with real-time, uncertain, and decentralized nature of Earth Observation operations, motivating the use of RL/MARL for adaptive decision-making in satellite clusters.

Method: Systematically formulates optimization from single to multi-satellite scenarios, using near-realistic simulation environment (Basilisk/BSK-RL) to evaluate MAPPO, HAPPO, and HATRPO algorithms for heterogeneous satellite coordination.

Result: MARL enables effective coordination across heterogeneous satellites, balancing imaging performance and resource utilization while mitigating non-stationarity and inter-agent reward coupling.

Conclusion: Provides practical insights into scalable autonomous satellite operations and contributes foundation for future research on intelligent EO mission planning under heterogeneous dynamic conditions.

Abstract: This work investigates resource optimization in heterogeneous satellite clusters performing autonomous Earth Observation (EO) missions using Reinforcement Learning (RL). In the proposed setting, two optical satellites and one Synthetic Aperture Radar (SAR) satellite operate cooperatively in low Earth orbit to capture ground targets and manage their limited onboard resources efficiently. Traditional optimization methods struggle to handle the real-time, uncertain, and decentralized nature of EO operations, motivating the use of RL and Multi-Agent Reinforcement Learning (MARL) for adaptive decision-making. This study systematically formulates the optimization problem from single-satellite to multi-satellite scenarios, addressing key challenges including energy and memory constraints, partial observability, and agent heterogeneity arising from diverse payload capabilities. Using a near-realistic simulation environment built on the Basilisk and BSK-RL frameworks, we evaluate the performance and stability of state-of-the-art MARL algorithms such as MAPPO, HAPPO, and HATRPO. Results show that MARL enables effective coordination across heterogeneous satellites, balancing imaging performance and resource utilization while mitigating non-stationarity and inter-agent reward coupling. The findings provide practical insights into scalable, autonomous satellite operations and contribute a foundation for future research on intelligent EO mission planning under heterogeneous and dynamic conditions.

[686] Neuro-Logic Lifelong Learning

Bowen He, Xiaoan Xu, Alper Kamil Bozkurt, Vahid Tarokh, Juncheng Dong

Main category: cs.AI

TL;DR: The paper proposes a lifelong learning approach for Inductive Logic Programming (ILP) that reuses logic rules across sequential tasks to improve scalability and performance.

DetailsMotivation: Most ILP research focuses on single problems, but there's limited exploration of learning paradigms involving sequences of problems. The authors aim to leverage the compositional and transferable nature of logic rules for efficient lifelong learning.

Method: The authors introduce a compositional framework where logic rules acquired from earlier tasks are efficiently reused in subsequent ones, formalizing this approach for lifelong learning ILP.

Result: Experimental evaluation on sequences of tasks validates the feasibility and advantages of this paradigm, showing improved scalability and performance compared to traditional approaches.

Conclusion: The work opens new directions for continual learning in Neural-Symbolic AI by demonstrating how compositional logic rule reuse enables efficient lifelong learning across sequential ILP problems.

Abstract: Solving Inductive Logic Programming (ILP) problems with neural networks is a key challenge in Neural-Symbolic Ar- tificial Intelligence (AI). While most research has focused on designing novel network architectures for individual prob- lems, less effort has been devoted to exploring new learning paradigms involving a sequence of problems. In this work, we investigate lifelong learning ILP, which leverages the com- positional and transferable nature of logic rules for efficient learning of new problems. We introduce a compositional framework, demonstrating how logic rules acquired from ear- lier tasks can be efficiently reused in subsequent ones, leading to improved scalability and performance. We formalize our approach and empirically evaluate it on sequences of tasks. Experimental results validate the feasibility and advantages of this paradigm, opening new directions for continual learn- ing in Neural-Symbolic AI.

[687] Mapping fNIRS Signals to Agent Performance: Toward Reinforcement Learning from Neural Feedback

Julia Santaniello, Matthew Russell, Benson Jiang, Donatello Sassaroli, Robert Jacob, Jivko SInapov

Main category: cs.AI

TL;DR: This paper introduces a framework using passive Brain-Computer Interfaces (BCI) with fNIRS recordings to guide reinforcement learning from human feedback, achieving promising classification and regression results for predicting agent performance from neural signals.

DetailsMotivation: To develop a brain-driven RLHF system that can align agent behavior with human preferences using implicit neural signals rather than explicit feedback, potentially enabling more natural and efficient human-AI interaction.

Method: Used functional near-infrared spectroscopy (fNIRS) recordings from 25 participants across three domains (Pick-and-Place Robot, Lunar Lander, Flappy Bird), trained classifiers to predict agent performance levels and regressors to predict deviation from optimal policies, with cross-subject generalization evaluation.

Result: Achieved 67% F1 score for binary classification and 46% for multi-class models averaged across conditions and domains. Fine-tuning with subject-specific data increased F1 scores by 17% (binary) and 41% (multi-class). Successfully trained regressors to predict continuous performance deviation.

Conclusion: Mapping implicit fNIRS signals to agent performance is feasible and can be significantly improved with subject-specific fine-tuning, laying the foundation for future brain-driven RLHF systems that could enable more seamless human-AI collaboration.

Abstract: Reinforcement Learning from Human Feedback (RLHF) is a methodology that aligns agent behavior with human preferences by integrating human feedback into the agent’s training process. We introduce a possible framework that employs passive Brain-Computer Interfaces (BCI) to guide agent training from implicit neural signals. We present and release a novel dataset of functional near-infrared spectroscopy (fNIRS) recordings collected from 25 human participants across three domains: a Pick-and-Place Robot, Lunar Lander, and Flappy Bird. We train classifiers to predict levels of agent performance (optimal, sub-optimal, or worst-case) from windows of preprocessed fNIRS feature vectors, achieving an average F1 score of 67% for binary classification and 46% for multi-class models averaged across conditions and domains. We also train regressors to predict the degree of deviation between an agent’s chosen action and a set of near-optimal policies, providing a continuous measure of performance. We evaluate cross-subject generalization and demonstrate that fine-tuning pre-trained models with a small sample of subject-specific data increases average F1 scores by 17% and 41% for binary and multi-class models, respectively. Our work demonstrates that mapping implicit fNIRS signals to agent performance is feasible and can be improved, laying the foundation for future brain-driven RLHF systems.

[688] Bootstrapping LLMs via Preference-Based Policy Optimization

Chen Jia

Main category: cs.AI

TL;DR: A novel preference-based policy optimization framework that uses min-max game between policy and reward model with theoretical guarantees and superior performance over SOTA methods.

DetailsMotivation: To align LLMs with human preferences without extensive manual annotations through bootstrapping and iterative self-improvement.

Method: Min-max game between main policy and reward model constrained within confidence set from preference data, with iterative online algorithm for guided exploration and continual improvement.

Result: Outperforms existing state-of-the-art preference optimization techniques on five benchmarks with theoretical guarantees and high-probability regret bounds.

Conclusion: The proposed PbPO framework effectively bootstraps LLMs through preference-based optimization with reliable theoretical foundations and empirical superiority.

Abstract: Bootstrapping large language models (LLMs) through preference-based policy optimization offers a promising direction for aligning model behavior with human preferences without relying on extensive manual annotations. In this work, we propose a novel preference-based policy optimization (PbPO) framework that formulates the learning process as a min-max game between the main policy and a reward model (RM). The RM is constrained within a confidence set derived from preference data to ensure reliable exploitation. Our iterative online algorithm actively collects preference data through guided exploration of the evolving policy, enabling continual self-improvement of both the policy and the RM. We provide theoretical guarantees for our method, establishing high-probability regret bounds for both settings with sequence-level RM and token-level RM, demonstrating its effectiveness in bootstrapping LLMs. Extensive experiments on five benchmarks show that our approach consistently outperforms existing state-of-the-art preference optimization techniques.

[689] Think, Speak, Decide: Language-Augmented Multi-Agent Reinforcement Learning for Economic Decision-Making

Heyang Ma, Qirui Mi, Qipeng Yang, Zijun Fan, Bo Li, Haifeng Zhang

Main category: cs.AI

TL;DR: LAMP integrates language processing with multi-agent reinforcement learning for economic decision-making, outperforming traditional methods in returns, robustness, and interpretability.

DetailsMotivation: Economic decisions rely on both structured signals (prices, taxes) and unstructured language (dialogue, narratives), but current MARL struggles with semantic ambiguity and contextual richness of language.

Method: LAMP uses a Think-Speak-Decide pipeline: Think interprets observations for shocks/trends, Speak crafts strategic messages and parses peer communications, Decide fuses data with MARL policy for optimized decisions.

Result: LAMP outperforms MARL and LLM-only baselines in economic simulations: +63.5% and +34.0% in cumulative return, +18.8% and +59.4% in robustness, plus improved interpretability.

Conclusion: Language-augmented policies like LAMP can deliver more effective and robust economic strategies by bridging the gap between numerical data and linguistic context.

Abstract: Economic decision-making depends not only on structured signals such as prices and taxes, but also on unstructured language, including peer dialogue and media narratives. While multi-agent reinforcement learning (MARL) has shown promise in optimizing economic decisions, it struggles with the semantic ambiguity and contextual richness of language. We propose LAMP (Language-Augmented Multi-Agent Policy), a framework that integrates language into economic decision-making and narrows the gap to real-world settings. LAMP follows a Think-Speak-Decide pipeline: (1) Think interprets numerical observations to extract short-term shocks and long-term trends, caching high-value reasoning trajectories; (2) Speak crafts and exchanges strategic messages based on reasoning, updating beliefs by parsing peer communications; and (3) Decide fuses numerical data, reasoning, and reflections into a MARL policy to optimize language-augmented decision-making. Experiments in economic simulation show that LAMP outperforms both MARL and LLM-only baselines in cumulative return (+63.5%, +34.0%), robustness (+18.8%, +59.4%), and interpretability. These results demonstrate the potential of language-augmented policies to deliver more effective and robust economic strategies.

[690] Online Learning of HTN Methods for integrated LLM-HTN Planning

Yuesheng Xu, Hector Munoz-Avila

Main category: cs.AI

TL;DR: Online learning of generalized HTN methods from ChatGPT-generated decompositions to reduce LLM calls while maintaining or improving problem-solving performance.

DetailsMotivation: To reduce reliance on expensive ChatGPT API calls in HTN planning by learning reusable methods from generated decompositions, improving efficiency while maintaining planning capabilities.

Method: Extends ChatHTN planner by learning generalized HTN methods from ChatGPT-generated task decompositions through online learning, creating reusable methods instead of just memoizing specific instances.

Result: Experiments show reduced ChatGPT calls while solving at least as many problems, and sometimes more problems than the baseline approach.

Conclusion: Online learning of generalized HTN methods from LLM-generated decompositions effectively reduces dependency on expensive LLM calls while maintaining or enhancing planning performance.

Abstract: We present online learning of Hierarchical Task Network (HTN) methods in the context of integrated HTN planning and LLM-based chatbots. Methods indicate when and how to decompose tasks into subtasks. Our method learner is built on top of the ChatHTN planner. ChatHTN queries ChatGPT to generate a decomposition of a task into primitive tasks when no applicable method for the task is available. In this work, we extend ChatHTN. Namely, when ChatGPT generates a task decomposition, ChatHTN learns from it, akin to memoization. However, unlike memoization, it learns a generalized method that applies not only to the specific instance encountered, but to other instances of the same task. We conduct experiments on two domains and demonstrate that our online learning procedure reduces the number of calls to ChatGPT while solving at least as many problems, and in some cases, even more.

[691] CoS: Towards Optimal Event Scheduling via Chain-of-Scheduling

Yiming Zhao, Jiwei Tang, Shimin Di, Libin Zheng, Jianxing Yu, Jian Yin

Main category: cs.AI

TL;DR: CoS framework uses LLMs for event scheduling in EBSNs through exploration, verification, and integration stages, achieving near-optimal results with high efficiency and zero-shot learning capability.

DetailsMotivation: Existing event scheduling methods in EBSNs face trade-offs between efficiency, effectiveness, and generalization due to the NP-hard nature of the problem.

Method: Chain-of-Scheduling (CoS) framework that activates LLMs’ scheduling capability through three atomic stages (exploration, verification, integration) and enables autonomous generation via Knowledge Distillation.

Result: Achieves near-theoretical optimal effectiveness with high efficiency on three real-world datasets, with strong zero-shot learning ability on out-of-domain data.

Conclusion: CoS provides an interpretable and efficient solution for event scheduling in EBSNs while maintaining strong generalization capabilities.

Abstract: Recommending event schedules is a key issue in Event-based Social Networks (EBSNs) in order to maintain user activity. An effective recommendation is required to maximize the user’s preference, subjecting to both time and geographical constraints. Existing methods face an inherent trade-off among efficiency, effectiveness, and generalization, due to the NP-hard nature of the problem. This paper proposes the Chain-of-Scheduling (CoS) framework, which activates the event scheduling capability of Large Language Models (LLMs) through a guided, efficient scheduling process. CoS enhances LLM by formulating the schedule task into three atomic stages, i.e., exploration, verification and integration. Then we enable the LLMs to generate CoS autonomously via Knowledge Distillation (KD). Experimental results show that CoS achieves near-theoretical optimal effectiveness with high efficiency on three real-world datasets in a interpretable manner. Moreover, it demonstrates strong zero-shot learning ability on out-of-domain data.

[692] Fault2Flow: An AlphaEvolve-Optimized Human-in-the-Loop Multi-Agent System for Fault-to-Workflow Automation

Yafang Wang, Yangjie Tian, Xiaoyu Shen, Gaoyang Zhang, Jiaze Sun, He Zhang, Ruohua Xu, Feng Zhao

Main category: cs.AI

TL;DR: Fault2Flow is an LLM-based multi-agent system that automates power grid fault diagnosis by extracting regulatory logic, integrating expert knowledge, optimizing reasoning, and generating executable workflows.

DetailsMotivation: Current power grid fault diagnosis relies on manual, error-prone methods that struggle to combine regulatory logic with expert knowledge, lacks maintainability, and is inefficient.

Method: Systematically extracts regulatory logic into fault trees, integrates expert knowledge via human-in-the-loop verification, optimizes reasoning with AlphaEvolve module, and synthesizes verified logic into executable workflows.

Result: Experimental validation shows 100% topological consistency and high semantic fidelity on transformer fault diagnosis datasets, substantially reducing expert workload.

Conclusion: Fault2Flow establishes a reproducible path from fault analysis to operational automation, bridging the gap between regulatory logic and expert knowledge in power grid fault diagnosis.

Abstract: Power grid fault diagnosis is a critical process hindered by its reliance on manual, error-prone methods. Technicians must manually extract reasoning logic from dense regulations and attempt to combine it with tacit expert knowledge, which is inefficient, error-prone, and lacks maintainability as ragulations are updated and experience evolves. While Large Language Models (LLMs) have shown promise in parsing unstructured text, no existing framework integrates these two disparate knowledge sources into a single, verified, and executable workflow. To bridge this gap, we propose Fault2Flow, an LLM-based multi-agent system. Fault2Flow systematically: (1) extracts and structures regulatory logic into PASTA-formatted fault trees; (2) integrates expert knowledge via a human-in-the-loop interface for verification; (3) optimizes the reasoning logic using a novel AlphaEvolve module; and (4) synthesizes the final, verified logic into an n8n-executable workflow. Experimental validation on transformer fault diagnosis datasets confirms 100% topological consistency and high semantic fidelity. Fault2Flow establishes a reproducible path from fault analysis to operational automation, substantially reducing expert workload.

[693] Yanyun-3: Enabling Cross-Platform Strategy Game Operation with Vision-Language Models

Guoyan Wang, Yanyan Huang, Chunlin Chen, Lifeng Wang, Yuxiang Sun

Main category: cs.AI

TL;DR: Yanyun-3 is a general-purpose agent framework that enables autonomous cross-platform operation across three heterogeneous strategy games by integrating vision-language reasoning with precise UI execution capabilities.

DetailsMotivation: Automated operation in cross-platform strategy games requires agents with robust generalization across diverse user interfaces and dynamic battlefield conditions, but current vision-language models' application to complex human-computer interaction scenarios like strategy gaming remains largely unexplored.

Method: Integrates Qwen2.5-VL’s vision-language reasoning with UI-TARS’s precise execution capabilities, using a closed-loop pipeline of screen capture, model inference, and action execution. Evaluates multimodal data combinations through systematic ablation studies and proposes combination granularity concepts.

Result: Hybrid strategy (MV+S) substantially outperforms full fusion: reduces inference time by 63% and boosts BLEU-4 score by 12.98x (from 4.81% to 62.41%). Successfully performs core tasks including target localization, combat resource allocation, and area control with strong real-time performance.

Conclusion: Establishes a general paradigm for enhancing VLM performance through structured multimodal data organization, offering insights into the interplay between static perception and dynamic reasoning in embodied intelligence, providing an efficient solution for strategy game automation.

Abstract: Automated operation in cross-platform strategy games demands agents with robust generalization across diverse user interfaces and dynamic battlefield conditions. While vision-language models (VLMs) have shown considerable promise in multimodal reasoning, their application to complex human-computer interaction scenarios–such as strategy gaming–remains largely unexplored. Here, we introduce Yanyun-3, a general-purpose agent framework that, for the first time, enables autonomous cross-platform operation across three heterogeneous strategy game environments. By integrating the vision-language reasoning of Qwen2.5-VL with the precise execution capabilities of UI-TARS, Yanyun-3 successfully performs core tasks including target localization, combat resource allocation, and area control. Through systematic ablation studies, we evaluate the effects of various multimodal data combinations–static images, multi-image sequences, and videos–and propose the concept of combination granularity to differentiate between intra-sample fusion and inter-sample mixing strategies. We find that a hybrid strategy, which fuses multi-image and video data while mixing in static images (MV+S), substantially outperforms full fusion: it reduces inference time by 63% and boosts the BLEU-4 score by a factor of 12 (from 4.81% to 62.41%, approximately 12.98x). Operating via a closed-loop pipeline of screen capture, model inference, and action execution, the agent demonstrates strong real-time performance and cross-platform generalization. Beyond providing an efficient solution for strategy game automation, our work establishes a general paradigm for enhancing VLM performance through structured multimodal data organization, offering new insights into the interplay between static perception and dynamic reasoning in embodied intelligence.

[694] MedRule-KG: A Knowledge-Graph–Steered Scaffold for Reliable Mathematical and Biomedical Reasoning

Crystal Su

Main category: cs.AI

TL;DR: MedRule-KG uses a knowledge-graph scaffold and verifier to steer LLM generation toward domain-valid outputs in scientific reasoning and drug discovery, reducing violations by 83.2% while improving accuracy.

DetailsMotivation: To impose domain-consistent structure on LLMs for scientific reasoning and drug discovery, ensuring mathematically and biomedically valid outputs.

Method: Combines a compact knowledge-graph scaffold with a lightweight verifier that injects symbolic facts into prompts and enforces rule satisfaction through constrained inference with soft guidance for decoding.

Result: Reduces violation counts by 83.2% across 90 tasks (reaction feasibility, metabolic compatibility, toxicity screening) while improving exact match, with stable results under stratification and negligible latency.

Conclusion: MedRule-KG provides a practical approach for interactive drug design by effectively steering LLM generation toward domain-valid outputs with minimal computational overhead.

Abstract: We study how to impose domain-consistent structure on large language models (LLMs) used for scientific reasoning and early-stage drug discovery. We present MedRule-KG, a compact knowledge-graph scaffold paired with a lightweight verifier that steers generation toward mathematically and biomedically valid outputs. The system injects curated symbolic facts into prompts and then enforces rule satisfaction with a deterministic checker. We formalize generation as constrained inference, introduce a soft guidance surrogate suitable for decoding, and perform a thorough statistical analysis with uncertainty quantification. Across 90 tasks spanning reaction feasibility, metabolic compatibility, and toxicity screening, MedRule-KG reduces violation counts by 83.2% relative to a strong chain-of-thought baseline while improving exact match. Results remain stable under stratification and scale with dataset size, and the verifier adds negligible latency, making the approach practical for interactive design.

[695] WebCoach: Self-Evolving Web Agents with Cross-Session Memory Guidance

Genglin Liu, Shijie Geng, Sha Li, Hejie Cui, Sarah Zhang, Xin Liu, Tianyi Liu

Main category: cs.AI

TL;DR: WebCoach is a self-evolving framework that gives web browsing agents persistent cross-session memory, enabling continual learning without retraining by standardizing navigation logs, storing episodic experiences, and providing runtime advice.

DetailsMotivation: Current multimodal LLM-powered agents struggle with repetitive errors and lack learning across sessions, limiting long-term robustness and sample efficiency in web navigation tasks.

Method: WebCoach uses three components: WebCondenser (standardizes raw logs), External Memory Store (organizes trajectories), and Coach (retrieves relevant experiences and injects task-specific advice via runtime hooks).

Result: On WebVoyager benchmark, WebCoach improved task success rates from 47% to 61% with a 38B model while reducing steps. Smaller models with WebCoach achieved GPT-4o-level performance.

Conclusion: WebCoach enables web agents to access long-term memory beyond context windows, improving robustness in complex browsing tasks and achieving self-evolution through continuous memory curation without retraining.

Abstract: Multimodal LLM-powered agents have recently demonstrated impressive capabilities in web navigation, enabling agents to complete complex browsing tasks across diverse domains. However, current agents struggle with repetitive errors and lack the ability to learn from past experiences across sessions, limiting their long-term robustness and sample efficiency. We introduce WebCoach, a model-agnostic self-evolving framework that equips web browsing agents with persistent cross-session memory, enabling improved long-term planning, reflection, and continual learning without retraining. WebCoach consists of three key components: (1) a WebCondenser, which standardizes raw navigation logs into concise summaries; (2) an External Memory Store, which organizes complete trajectories as episodic experiences; and (3) a Coach, which retrieves relevant experiences based on similarity and recency, and decides whether to inject task-specific advice into the agent via runtime hooks. This design empowers web agents to access long-term memory beyond their native context window, improving robustness in complex browsing tasks. Moreover, WebCoach achieves self-evolution by continuously curating episodic memory from new navigation trajectories, enabling agents to improve over time without retraining. Evaluations on the WebVoyager benchmark demonstrate that WebCoach consistently improves the performance of browser-use agents across three different LLM backbones. With a 38B model, it increases task success rates from 47% to 61% while reducing or maintaining the average number of steps. Notably, smaller base models with WebCoach achieve performance comparable to the same web agent using GPT-4o.

[696] GEM: Generative Entropy-Guided Preference Modeling for Few-shot Alignment of LLMs

Yiyang Zhao, Huiyu Bai, Xuejiao Zhao

Main category: cs.AI

TL;DR: GEM is a generative entropy-guided preference modeling approach for LLM alignment in low-resource scenarios, using cognitive filtering and self-evaluated group advantage instead of traditional reward models.

DetailsMotivation: In professional domains like medicine and law, large-scale preference labels are often unavailable, making traditional alignment methods impractical. There's a need for efficient few-shot alignment approaches that don't require abundant annotations.

Method: Uses cognitive filtering based on entropy theory with Chain-of-Thought prompting to generate diverse reasoning chains, then ranks them using token scoring. Fine-tunes LLM with self-evaluated group advantage (SEGA) algorithm that aggregates group-level cognitive signals and transforms entropy-based scores into implicit rewards.

Result: Experiments on general benchmarks and domain-specific tasks (mathematical reasoning, medical dialogues) show significant improvements with few-shot preference data.

Conclusion: GEM establishes an entropy-guided closed-loop cognitive optimization framework that enables highly efficient few-shot alignment of LLMs, allowing models to rely on their own judgments without extensive external annotations.

Abstract: Alignment of large language models (LLMs) with human preferences typically relies on supervised reward models or external judges that demand abundant annotations. However, in fields that rely on professional knowledge, such as medicine and law, such large-scale preference labels are often unachievable. In this paper, we propose a generative entropy-guided preference modeling approach named GEM for LLMs aligment at low-resource and domain-specific scenarios. Instead of training a discriminative reward model on preference data, we directly train the LLM to internalize a closed-loop optimization architecture that can extract and exploit the multi-dimensional, fine-grained cognitive signals implicit in human preferences. Specifically, our Cognitive Filtering module, based on entropy theory in decision making, first leverages Chain-of-Thought (CoT) prompting to generate diverse candidate reasoning chains (CoTs) from preference data. Subsequently, it introduces a token scoring mechanism to rank and weight the sampled CoTs, boosting the importance of high-confidence answers and strategically high-entropy tokens. Building on these filtered preferences, we fine-tune the LLM using a novel self-evaluated group advantage algorithm, SEGA, which effectively aggregates group-level cognitive signals and transforms the entropy-based scores into implicit rewards for policy optimization. In these ways, GEM empowers the LLM to rely on its own judgments and establishes an entropy-guided closed-loop cognitive optimization framework, enabling highly efficient few-shot alignment of LLMs. Experiments on general benchmarks and domain-specific tasks (such as mathematical reasoning and medical dialogues) demonstrate that our GEM achieves significant improvements with few-shot preference data.

[697] PragWorld: A Benchmark Evaluating LLMs’ Local World Model under Minimal Linguistic Alterations and Conversational Dynamics

Sachin Vashistha, Aryan Bibhuti, Atharva Naik, Martin Tutek, Somak Aditya

Main category: cs.AI

TL;DR: LMs struggle to maintain robust world models in conversations under linguistic alterations, and a dual-perspective interpretability framework helps identify harmful layers for regularization.

DetailsMotivation: To evaluate whether LMs can encode and update internal world models in dyadic conversations and test their robustness under linguistic alterations.

Method: Applied seven minimal linguistic alterations to conversations, constructed two benchmarks with yes-no questions, and proposed a dual-perspective interpretability framework to identify useful/harmful transformer layers.

Result: LMs struggle to maintain accuracy under linguistic alterations, particularly in tracking entities, and the interpretability framework successfully identifies layers encoding spurious signals.

Conclusion: LMs have limitations in maintaining robust conversation world models, and layer-regularization fine-tuning strategies can suppress harmful layer effects to improve performance.

Abstract: Real-world conversations are rich with pragmatic elements, such as entity mentions, references, and implicatures. Understanding such nuances is a requirement for successful natural communication, and often requires building a local world model which encodes such elements and captures the dynamics of their evolving states. However, it is not well-understood whether language models (LMs) construct or maintain a robust implicit representation of conversations. In this work, we evaluate the ability of LMs to encode and update their internal world model in dyadic conversations and test their malleability under linguistic alterations. To facilitate this, we apply seven minimal linguistic alterations to conversations sourced from popular datasets and construct two benchmarks comprising yes-no questions. We evaluate a wide range of open and closed source LMs and observe that they struggle to maintain robust accuracy. Our analysis unveils that LMs struggle to memorize crucial details, such as tracking entities under linguistic alterations to conversations. We then propose a dual-perspective interpretability framework which identifies transformer layers that are useful or harmful and highlights linguistic alterations most influenced by harmful layers, typically due to encoding spurious signals or relying on shortcuts. Inspired by these insights, we propose two layer-regularization based fine-tuning strategies that suppress the effect of the harmful layers.

[698] Scaling Generative Verifiers For Natural Language Mathematical Proof Verification And Selection

Sadegh Mahdavi, Branislav Kisacanin, Shubham Toshniwal, Wei Du, Ivan Moshkov, George Armstrong, Renjie Liao, Christos Thrampoulidis, Igor Gitman

Main category: cs.AI

TL;DR: The paper analyzes proof verification methods for mathematical reasoning, showing that combining GenSelect and LLM-as-a-Judge with reinforcement learning improves proof-level metrics but not final-answer precision.

DetailsMotivation: Current language models excel at final-answer math problems but produce flawed reasoning. Advancing to rigorous proof-based mathematics requires reliable proof verification capabilities.

Method: Evaluated multiple verification setups, scaled GenSelect and LLM-as-a-Judge methods to millions of tokens, and used reinforcement learning to reduce prompt sensitivity in LLM-as-a-Judge.

Result: Combining GenSelect and LLM-as-a-Judge is most effective for solution verification. Reinforcement learning improves proof-level metrics but doesn’t enhance final-answer precision, suggesting models reward procedural correctness over mathematical validity.

Conclusion: Established practical guidelines for designing scalable proof-verification systems, highlighting the gap between proof-level improvements and final-answer accuracy in current models.

Abstract: Large language models have achieved remarkable success on final-answer mathematical problems, largely due to the ease of applying reinforcement learning with verifiable rewards. However, the reasoning underlying these solutions is often flawed. Advancing to rigorous proof-based mathematics requires reliable proof verification capabilities. We begin by analyzing multiple evaluation setups and show that focusing on a single benchmark can lead to brittle or misleading conclusions. To address this, we evaluate both proof-based and final-answer reasoning to obtain a more reliable measure of model performance. We then scale two major generative verification methods (GenSelect and LLM-as-a-Judge) to millions of tokens and identify their combination as the most effective framework for solution verification and selection. We further show that the choice of prompt for LLM-as-a-Judge significantly affects the model’s performance, but reinforcement learning can reduce this sensitivity. However, despite improving proof-level metrics, reinforcement learning does not enhance final-answer precision, indicating that current models often reward stylistic or procedural correctness rather than mathematical validity. Our results establish practical guidelines for designing and evaluating scalable proof-verification and selection systems.

[699] MEGA-GUI: Multi-stage Enhanced Grounding Agents for GUI Elements

SeokJoo Kwak, Jihoon Kim, Boyoun Kim, Jung Jae Yoon, Wooseok Jang, Jeonghoon Hong, Jaeho Yang, Yeong-Dae Kwon

Main category: cs.AI

TL;DR: MEGA-GUI is a multi-stage GUI grounding framework that separates coarse region selection from fine-grained element grounding using specialized vision-language agents, achieving state-of-the-art performance on dense and complex benchmarks.

DetailsMotivation: Existing GUI grounding systems use monolithic models that lack modularity and fail under visual clutter and ambiguous instructions, highlighting the need for a more robust and structured approach.

Method: Multi-stage framework with coarse ROI selection and fine-grained element grounding, featuring bidirectional ROI zoom algorithm to mitigate spatial dilution and context-aware rewriting agent to reduce semantic ambiguity.

Result: Achieves 73.18% accuracy on ScreenSpot-Pro benchmark (visually dense) and 68.63% on OSWorld-G benchmark (semantically complex), surpassing previous state-of-the-art results.

Conclusion: Modular multi-stage approach with specialized agents outperforms monolithic models, demonstrating complementary strengths across different visual scales and providing a more robust solution for GUI grounding.

Abstract: Graphical User Interface (GUI) grounding - the task of mapping natural language instructions to screen coordinates - is essential for autonomous agents and accessibility technologies. Existing systems rely on monolithic models or one-shot pipelines that lack modularity and fail under visual clutter and ambiguous instructions. We introduce MEGA-GUI, a multi-stage framework that separates grounding into coarse Region-of-Interest (ROI) selection and fine-grained element grounding, orchestrated by specialized vision-language agents. MEGA-GUI features a bidirectional ROI zoom algorithm that mitigates spatial dilution and a context-aware rewriting agent that reduces semantic ambiguity. Our analysis reveals complementary strengths and weaknesses across vision-language models at different visual scales, and we show that leveraging this modular structure achieves consistently higher accuracy than monolithic approaches. On the visually dense ScreenSpot-Pro benchmark, MEGA-GUI attains 73.18% accuracy, and on the semantically complex OSWorld-G benchmark it reaches 68.63%, surpassing previously reported results. Code and the Grounding Benchmark Toolkit (GBT) are available at https://github.com/samsungsds-research-papers/mega-gui.

[700] STEP: Success-Rate-Aware Trajectory-Efficient Policy Optimization

Yuhan Chen, Yuxuan Liu, Long Zhang, Pengzhi Gao, Jian Luan, Wei Liu

Main category: cs.AI

TL;DR: STEP is a reinforcement learning framework that improves multi-turn interaction efficiency by dynamically allocating sampling based on task success rates and performing step-level optimization instead of trajectory-level optimization.

DetailsMotivation: Trajectory-level optimization in online RL is inefficient and misleading - it uses uniform sampling regardless of task difficulty, penalizes correct intermediate actions in failed trajectories, and has high sample-collection costs.

Method: STEP maintains smoothed success-rate records to guide adaptive trajectory resampling (allocating more effort to harder tasks), computes success-rate-weighted advantages, decomposes trajectories into step-level samples, and applies step-level GRPO augmentation for low-success tasks.

Result: Experiments on OSWorld and AndroidWorld show STEP substantially improves sample efficiency and training stability over trajectory-level GRPO, converging faster and generalizing better under the same sampling budget.

Conclusion: STEP effectively addresses the limitations of trajectory-level optimization by dynamically allocating sampling resources and enabling step-level policy optimization, leading to more efficient and stable multi-turn reinforcement learning.

Abstract: Multi-turn interaction remains challenging for online reinforcement learning. A common solution is trajectory-level optimization, which treats each trajectory as a single training sample. However, this approach can be inefficient and yield misleading learning signals: it applies uniform sampling across tasks regardless of difficulty, penalizes correct intermediate actions in failed trajectories, and incurs high sample-collection costs. To address these issues, we propose STEP (Success-rate-aware Trajectory-Efficient Policy optimization), a framework that dynamically allocates sampling based on per-task success rates and performs step-level optimization. STEP maintains a smoothed success-rate record to guide adaptive trajectory resampling, allocating more effort to harder tasks. It then computes success-rate-weighted advantages and decomposes trajectories into step-level samples. Finally, it applies a step-level GRPO augmentation to refine updates for low-success tasks. Experiments on OSWorld and AndroidWorld show that STEP substantially improves sample efficiency and training stability over trajectory-level GRPO, converging faster and generalizing better under the same sampling budget.

[701] MM-Telco: Benchmarks and Multimodal Large Language Models for Telecom Applications

Gagan Raj Gupta, Anshul Kumar, Manish Rai, Apu Chakraborty, Ashutosh Modi, Abdelaali Chaoub, Soumajit Pramanik, Moyank Giri, Yashwanth Holla, Sunny Kumar, M. V. Kiran Sooraj

Main category: cs.AI

TL;DR: MM-Telco is a multimodal benchmark suite for adapting LLMs to telecom domain challenges, addressing tasks like network operations, management, documentation, and retrieval through text and image-based tasks.

DetailsMotivation: LLMs have potential in telecom for network optimization, troubleshooting, customer support, and compliance, but face domain-specific challenges requiring specialized adaptation.

Method: Proposed MM-Telco benchmark with multimodal tasks (text and image based) for real telecom use cases, with baseline experiments using various LLMs and VLMs, and fine-tuning on the dataset.

Result: Fine-tuned models show significant performance boost, and experiments reveal weaknesses in current state-of-art multimodal LLMs.

Conclusion: MM-Telco accelerates LLM adaptation in telecom and guides further research by identifying areas for improvement in multimodal models.

Abstract: Large Language Models (LLMs) have emerged as powerful tools for automating complex reasoning and decision-making tasks. In telecommunications, they hold the potential to transform network optimization, automate troubleshooting, enhance customer support, and ensure regulatory compliance. However, their deployment in telecom is hindered by domain-specific challenges that demand specialized adaptation. To overcome these challenges and to accelerate the adaptation of LLMs for telecom, we propose MM-Telco, a comprehensive suite of multimodal benchmarks and models tailored for the telecom domain. The benchmark introduces various tasks (both text based and image based) that address various practical real-life use cases such as network operations, network management, improving documentation quality, and retrieval of relevant text and images. Further, we perform baseline experiments with various LLMs and VLMs. The models fine-tuned on our dataset exhibit a significant boost in performance. Our experiments also help analyze the weak areas in the working of current state-of-art multimodal LLMs, thus guiding towards further development and research.

[702] Conditional Diffusion Model for Multi-Agent Dynamic Task Decomposition

Yanda Zhu, Yuanyang Zhu, Daoyi Dong, Caihua Chen, Chunlin Chen

Main category: cs.AI

TL;DR: C$ ext{D}^ ext{3}$T is a hierarchical MARL framework that uses conditional diffusion models for dynamic task decomposition and enhanced value decomposition through multi-head attention.

DetailsMotivation: Task decomposition helps in complex cooperative MARL but learning dynamic decomposition from scratch requires many samples, especially in partial observability with large joint action spaces.

Method: Two-level hierarchical framework: high-level policy learns subtask representation using conditional diffusion model to predict next observations/rewards; low-level agents learn specialized skills; multi-head attention mixing network enhances value decomposition.

Result: Experimental results on various benchmarks show C$ ext{D}^ ext{3}$T achieves better performance than existing baselines.

Conclusion: The proposed framework effectively infers subtask and coordination patterns, improving performance in complex cooperative MARL tasks through dynamic task decomposition and enhanced value decomposition.

Abstract: Task decomposition has shown promise in complex cooperative multi-agent reinforcement learning (MARL) tasks, which enables efficient hierarchical learning for long-horizon tasks in dynamic and uncertain environments. However, learning dynamic task decomposition from scratch generally requires a large number of training samples, especially exploring the large joint action space under partial observability. In this paper, we present the Conditional Diffusion Model for Dynamic Task Decomposition (C$\text{D}^\text{3}$T), a novel two-level hierarchical MARL framework designed to automatically infer subtask and coordination patterns. The high-level policy learns subtask representation to generate a subtask selection strategy based on subtask effects. To capture the effects of subtasks on the environment, C$\text{D}^\text{3}$T predicts the next observation and reward using a conditional diffusion model. At the low level, agents collaboratively learn and share specialized skills within their assigned subtasks. Moreover, the learned subtask representation is also used as additional semantic information in a multi-head attention mixing network to enhance value decomposition and provide an efficient reasoning bridge between individual and joint value functions. Experimental results on various benchmarks demonstrate that C$\text{D}^\text{3}$T achieves better performance than existing baselines.

[703] InteractiveGNNExplainer: A Visual Analytics Framework for Multi-Faceted Understanding and Probing of Graph Neural Network Predictions

TC Singh, Sougata Mukherjea

Main category: cs.AI

TL;DR: InteractiveGNNExplainer is a visual analytics framework that enhances GNN explainability through coordinated interactive views, integrating post-hoc and intrinsic explanation methods with interactive graph editing for “what-if” analysis.

DetailsMotivation: GNNs are opaque "black boxes" that hinder user trust, complicate debugging, bias detection, and adoption in critical domains requiring explainability, particularly for node classification tasks.

Method: Integrates coordinated interactive views (dynamic graph layouts, embedding projections, feature inspection, neighborhood analysis) with GNNExplainer (post-hoc) and GAT attention (intrinsic) explanations, plus interactive graph editing for perturbation-based “what-if” analysis.

Result: Demonstrated through case studies on Cora and CiteSeer datasets, enabling misclassification diagnosis, comparative analysis of GCN vs GAT behaviors, and rigorous model sensitivity probing.

Conclusion: Fosters deeper, multifaceted understanding of GNN predictions, contributing to more transparent, trustworthy, and robust graph analysis.

Abstract: Graph Neural Networks (GNNs) excel in graph-based learning tasks, but their complex, non-linear operations often render them as opaque “black boxes”. This opacity hinders user trust, complicates debugging, bias detection, and adoption in critical domains requiring explainability. This paper introduces InteractiveGNNExplainer, a visual analytics framework to enhance GNN explainability, focusing on node classification. Our system uniquely integrates coordinated interactive views (dynamic graph layouts, embedding projections, feature inspection, neighborhood analysis) with established post-hoc (GNNExplainer) and intrinsic (GAT attention) explanation techniques. Crucially, it incorporates interactive graph editing, allowing users to perform a “what-if” analysis by perturbing graph structures and observing immediate impacts on GNN predictions and explanations. We detail the system architecture and, through case studies on Cora and CiteSeer datasets, demonstrate how InteractiveGNNExplainer facilitates in-depth misclassification diagnosis, comparative analysis of GCN versus GAT behaviors, and rigorous probing of model sensitivity. These capabilities foster a deeper, multifaceted understanding of GNN predictions, contributing to more transparent, trustworthy, and robust graph analysis.

[704] Cost-Effective Communication: An Auction-based Method for Language Agent Interaction

Yijia Fan, Jusheng Zhang, Kaitong Cai, Jing Yang, Chengpei Tang, Jian Wang, Keze Wang

Main category: cs.AI

TL;DR: DALA introduces an auction-based framework that treats communication bandwidth as a scarce resource, enabling LLM-based multi-agent systems to achieve state-of-the-art performance with dramatically reduced token usage by encouraging concise, high-value messages.

DetailsMotivation: Address inefficient 'free-for-all' communication in multi-agent LLM systems that leads to exponential token costs and low signal-to-noise ratios, challenging the assumption that more communication is always beneficial.

Method: Dynamic Auction-based Language Agent (DALA) treats communication as centralized auctions where agents bid to speak based on predicted message value density, encouraging concise and informative communication while filtering low-value messages.

Result: Achieves SOTA performance across 7 reasoning benchmarks (84.32% on MMLU, 91.21% pass@1 on HumanEval) with remarkable efficiency - using only 6.25M tokens on GSM8K, a fraction of current methods.

Conclusion: Resource rationality through auction-based communication allocation enables efficient multi-agent systems, cultivating strategic silence and dynamic adaptation from verbosity to silence via resource constraints.

Abstract: Multi-agent systems (MAS) built on large language models (LLMs) often suffer from inefficient “free-for-all” communication, leading to exponential token costs and low signal-to-noise ratios that hinder their practical deployment. We challenge the notion that more communication is always beneficial, hypothesizing instead that the core issue is the absence of resource rationality. We argue that “free” communication, by ignoring the principle of scarcity, inherently breeds inefficiency and unnecessary expenses. To address this, we introduce the Dynamic Auction-based Language Agent (DALA), a novel framework that treats communication bandwidth as a scarce and tradable resource. Specifically, our DALA regards inter-agent communication as a centralized auction, where agents learn to bid for the opportunity to speak based on the predicted value density of their messages. Thus, our DALA intrinsically encourages agents to produce concise, informative messages while filtering out low-value communication. Extensive and comprehensive experiments demonstrate that our economically-driven DALA achieves new state-of-the-art performance across seven challenging reasoning benchmarks, including 84.32% on MMLU and a 91.21% pass@1 rate on HumanEval. Note that this is accomplished with remarkable efficiency, i.e., our DALA uses only 6.25 million tokens, a fraction of the resources consumed by current state-of-the-art methods on GSM8K. Further analysis reveals that our DALA cultivates the emergent skill of strategic silence, effectively adapting its communication strategies from verbosity to silence in a dynamical manner via resource constraints.

[705] Learning to Solve Resource-Constrained Project Scheduling Problems with Duration Uncertainty using Graph Neural Networks

Guillaume Infantes, Stéphanie Roussel, Antoine Jacquet, Emmanuel Benazera

Main category: cs.AI

TL;DR: Developed Wheatley, a GNN+DRL framework for RCPSP with uncertain task durations that minimizes expected project duration and produces reusable baseline schedules.

DetailsMotivation: RCPSP has industrial applications but task durations are uncertain in practice, requiring resilient scheduling that considers probability distributions.

Method: Combined Graph Neural Networks with Deep Reinforcement Learning to create a priority dispatch rule policy, used with Serial Schedule Generation Scheme.

Result: Empirical evaluation on standard benchmarks showed superior performance and generalization ability compared to existing methods.

Conclusion: The Wheatley framework effectively handles RCPSP with uncertain durations and is publicly available to support further research.

Abstract: The Resource-Constrained Project Scheduling Problem (RCPSP) is a classical scheduling problem that has received significant attention due to of its numerous applications in industry. However, in practice, task durations are subject to uncertainty that must be considered in order to propose resilient scheduling. In this paper, we address the RCPSP variant with uncertain tasks duration (modeled using known probabilities) and aim to minimize the overall expected project duration. Our objective is to produce a baseline schedule that can be reused multiple times in an industrial setting regardless of the actual duration scenario. We leverage Graph Neural Networks in conjunction with Deep Reinforcement Learning (DRL) to develop an effective policy for task scheduling. This policy operates similarly to a priority dispatch rule and is paired with a Serial Schedule Generation Scheme to produce a schedule. Our empirical evaluation on standard benchmarks demonstrates the approach’s superiority in terms of performance and its ability to generalize. The developed framework, Wheatley, is made publicly available online to facilitate further research and reproducibility.

[706] Informative Communication of Robot Plans

Michele Persiani, Thomas Hellstrom

Main category: cs.AI

TL;DR: The paper proposes a verbalization strategy for robot plans that measures information gain against a user’s prior knowledge, enabling more informative communication than incremental or plan-order strategies.

DetailsMotivation: Existing robot verbalization strategies like incremental plan communication are not effectively informative because they don't consider what the user already knows, missing opportunities to optimize communication.

Method: The authors develop a verbalization strategy that measures information gain against a second-order theory of mind model of the user’s prior knowledge about the robot.

Result: Experiments show this strategy allows users to understand the robot’s goal much quicker than incremental or plan-order strategies.

Conclusion: The formulation helps identify what makes robot plan communication informative and why, providing insights for effective human-robot interaction.

Abstract: When a robot is asked to verbalize its plan it can do it in many ways. For example, a seemingly natural strategy is incremental, where the robot verbalizes its planned actions in plan order. However, an important aspect of this type of strategy is that it misses considerations on what is effectively informative to communicate, because not considering what the user knows prior to explanations. In this paper we propose a verbalization strategy to communicate robot plans informatively, by measuring the information gain that verbalizations have against a second-order theory of mind of the user capturing his prior knowledge on the robot. As shown in our experiments, this strategy allows to understand the robot’s goal much quicker than by using strategies such as increasing or decreasing plan order. In addition, following our formulation we hint to what is informative and why when a robot communicates its plan.

[707] Multi-Agent Deep Research: Training Multi-Agent Systems with M-GRPO

Haoyang Hong, Jiajun Yin, Yuan Wang, Jingnan Liu, Zhe Chen, Ailing Yu, Ji Li, Zhiling Ye, Hansong Xiao, Yefei Chen, Hualei Zhou, Yun Yue, Minghui Yang, Chunxiao Guo, Junwei Liu, Peng Wei, Jinjie Gu

Main category: cs.AI

TL;DR: M-GRPO is a hierarchical training method for multi-agent systems that addresses optimization challenges when using distinct LLMs for different agents, enabling scalable training without cross-server backpropagation.

DetailsMotivation: Current multi-agent systems use unified LLMs for all agents, limiting performance due to different distributions. Training with distinct LLMs is needed but introduces optimization challenges like variable agent frequencies and disrupted gradient flow.

Method: Proposes M-GRPO, a hierarchical extension of Group Relative Policy Optimization for vertical multi-agent systems. Uses group-relative advantages for credit assignment, trajectory-alignment for fixed-size batches despite variable sub-agent invocations, and decoupled training pipeline with agents on separate servers.

Result: Outperforms single-agent GRPO and multi-agent GRPO with frozen sub-agents on real-world benchmarks (GAIA, XBench-DeepSearch, WebWalkerQA), showing improved stability and sample efficiency.

Conclusion: Aligning heterogeneous trajectories and decoupling optimization across specialized agents enhances tool-augmented reasoning tasks in multi-agent systems.

Abstract: Multi-agent systems perform well on general reasoning tasks. However, the lack of training in specialized areas hinders their accuracy. Current training methods train a unified large language model (LLM) for all agents in the system. This may limit the performances due to different distributions underlying for different agents. Therefore, training multi-agent systems with distinct LLMs should be the next step to solve. However, this approach introduces optimization challenges. For example, agents operate at different frequencies, rollouts involve varying sub-agent invocations, and agents are often deployed across separate servers, disrupting end-to-end gradient flow. To address these issues, we propose M-GRPO, a hierarchical extension of Group Relative Policy Optimization designed for vertical Multi-agent systems with a main agent (planner) and multiple sub-agents (multi-turn tool executors). M-GRPO computes group-relative advantages for both main and sub-agents, maintaining hierarchical credit assignment. It also introduces a trajectory-alignment scheme that generates fixed-size batches despite variable sub-agent invocations. We deploy a decoupled training pipeline in which agents run on separate servers and exchange minimal statistics via a shared store. This enables scalable training without cross-server backpropagation. In experiments on real-world benchmarks (e.g., GAIA, XBench-DeepSearch, and WebWalkerQA), M-GRPO consistently outperforms both single-agent GRPO and multi-agent GRPO with frozen sub-agents, demonstrating improved stability and sample efficiency. These results show that aligning heterogeneous trajectories and decoupling optimization across specialized agents enhances tool-augmented reasoning tasks.

[708] Dropouts in Confidence: Moral Uncertainty in Human-LLM Alignment

Jea Kwon, Luiz Felipe Vecchietti, Sungwon Park, Meeyoung Cha

Main category: cs.AI

TL;DR: This paper examines moral uncertainty in AI systems, particularly LLMs, using the trolley problem. It finds that model architecture drives moral uncertainty more than moral dimensions, and shows that introducing dropout at inference increases entropy and improves human-LLM moral alignment.

DetailsMotivation: Humans show uncertainty in moral dilemmas, but AI systems (especially LLMs) tend to be overly confident. As AI increasingly participates in ethical decision-making, understanding and managing moral uncertainty is crucial for building reliable systems.

Method: Analyzed 32 open-source models on 9 moral dimensions in trolley problems. Quantified uncertainty using binary entropy as combination of total entropy, conditional entropy, and mutual information. Introduced stochasticity via dropout at inference time to examine uncertainty effects.

Result: Variance in model confidence was greater across models than within moral dimensions. Dropout mechanism increased total entropy primarily through mutual information rise, while conditional entropy remained stable. This significantly improved human-LLM moral alignment with correlations between mutual information and alignment scores.

Conclusion: Deliberately modulating uncertainty and reducing LLM confidence in morally complex scenarios can better align model decisions with human preferences, highlighting potential for improved moral alignment in AI systems.

Abstract: Humans display significant uncertainty when confronted with moral dilemmas, yet the extent of such uncertainty in machines and AI agents remains underexplored. Recent studies have confirmed the overly confident tendencies of machine-generated responses, particularly in large language models (LLMs). As these systems are increasingly embedded in ethical decision-making scenarios, it is important to understand their moral reasoning and the inherent uncertainties in building reliable AI systems. This work examines how uncertainty influences moral decisions in the classical trolley problem, analyzing responses from 32 open-source models and 9 distinct moral dimensions. We first find that variance in model confidence is greater across models than within moral dimensions, suggesting that moral uncertainty is predominantly shaped by model architecture and training method. To quantify uncertainty, we measure binary entropy as a linear combination of total entropy, conditional entropy, and mutual information. To examine its effects, we introduce stochasticity into models via “dropout” at inference time. Our findings show that our mechanism increases total entropy, mainly through a rise in mutual information, while conditional entropy remains largely unchanged. Moreover, this mechanism significantly improves human-LLM moral alignment, with correlations in mutual information and alignment score shifts. Our results highlight the potential to better align model-generated decisions and human preferences by deliberately modulating uncertainty and reducing LLMs’ confidence in morally complex scenarios.

[709] Grounded by Experience: Generative Healthcare Prediction Augmented with Hierarchical Agentic Retrieval

Chuang Zhao, Hui Tang, Hongke Zhao, Xiaofang Zhou, Xiaomeng Li

Main category: cs.AI

TL;DR: GHAR is a generative hierarchical agentic RAG framework that addresses when to retrieve and how to optimize retriever-generator collaboration in healthcare predictions using dual agents and MDP-based optimization.

DetailsMotivation: LLMs have factual inaccuracies in healthcare predictions, and existing RAG frameworks struggle with determining when to retrieve and achieving synergy between retriever and generator components.

Method: Dual-agent architecture with Agent-Top (primary physician deciding when to retrieve) and Agent-Low (consulting service summarizing retrieved knowledge), unified through Markov Decision Process optimization with diverse rewards.

Result: Superior performance over state-of-the-art baselines on three benchmark datasets across three healthcare tasks.

Conclusion: Hierarchical agentic RAG shows strong potential for advancing healthcare systems by improving prediction accuracy through optimized retrieval mechanisms.

Abstract: Accurate healthcare prediction is critical for improving patient outcomes and reducing operational costs. Bolstered by growing reasoning capabilities, large language models (LLMs) offer a promising path to enhance healthcare predictions by drawing on their rich parametric knowledge. However, LLMs are prone to factual inaccuracies due to limitations in the reliability and coverage of their embedded knowledge. While retrieval-augmented generation (RAG) frameworks, such as GraphRAG and its variants, have been proposed to mitigate these issues by incorporating external knowledge, they face two key challenges in the healthcare scenario: (1) identifying the clinical necessity to activate the retrieval mechanism, and (2) achieving synergy between the retriever and the generator to craft contextually appropriate retrievals. To address these challenges, we propose GHAR, a \underline{g}enerative \underline{h}ierarchical \underline{a}gentic \underline{R}AG framework that simultaneously resolves when to retrieve and how to optimize the collaboration between submodules in healthcare. Specifically, for the first challenge, we design a dual-agent architecture comprising Agent-Top and Agent-Low. Agent-Top acts as the primary physician, iteratively deciding whether to rely on parametric knowledge or to initiate retrieval, while Agent-Low acts as the consulting service, summarising all task-relevant knowledge once retrieval was triggered. To tackle the second challenge, we innovatively unify the optimization of both agents within a formal Markov Decision Process, designing diverse rewards to align their shared goal of accurate prediction while preserving their distinct roles. Extensive experiments on three benchmark datasets across three popular tasks demonstrate our superiority over state-of-the-art baselines, highlighting the potential of hierarchical agentic RAG in advancing healthcare systems.

[710] DAP: A Discrete-token Autoregressive Planner for Autonomous Driving

Bowen Ye, Bin Zhang, Hang Zhao

Main category: cs.AI

TL;DR: DAP is a discrete-token autoregressive planner that jointly forecasts BEV semantics and ego trajectories, achieving state-of-the-art performance with only 160M parameters through reinforcement learning fine-tuning.

DetailsMotivation: Autoregressive models show promise in planning but predicting ego trajectories alone suffers from sparse supervision and weakly constrains scene evolution's impact on ego motion.

Method: Joint forecasting of BEV semantics and ego trajectories using discrete-token autoregressive formulation, with reinforcement-learning-based fine-tuning to preserve supervised priors while adding reward-guided improvements.

Result: Achieves state-of-the-art performance on open-loop metrics and competitive closed-loop results on NAVSIM benchmark despite compact 160M parameter budget.

Conclusion: The fully discrete-token autoregressive formulation operating on both rasterized BEV and ego actions provides a compact yet scalable planning paradigm for autonomous driving.

Abstract: Gaining sustainable performance improvement with scaling data and model budget remains a pivotal yet unresolved challenge in autonomous driving. While autoregressive models exhibited promising data-scaling efficiency in planning tasks, predicting ego trajectories alone suffers sparse supervision and weakly constrains how scene evolution should shape ego motion. Therefore, we introduce DAP, a discrete-token autoregressive planner that jointly forecasts BEV semantics and ego trajectories, thereby enforcing comprehensive representation learning and allowing predicted dynamics to directly condition ego motion. In addition, we incorporate a reinforcement-learning-based fine-tuning, which preserves supervised behavior cloning priors while injecting reward-guided improvements. Despite a compact 160M parameter budget, DAP achieves state-of-the-art performance on open-loop metrics and delivers competitive closed-loop results on the NAVSIM benchmark. Overall, the fully discrete-token autoregressive formulation operating on both rasterized BEV and ego actions provides a compact yet scalable planning paradigm for autonomous driving.

[711] Reasoning Shapes Alignment: Investigating Cultural Alignment in Large Reasoning Models with Cultural Norms

Yuhang Wang, Yanxu Zhu, Jitao Sang

Main category: cs.AI

TL;DR: The CNCA framework enables large reasoning models to align with cultural norms by automatically mining norms from survey data and using them through in-context integration or fine-tuning methods.

DetailsMotivation: Models need to reflect diverse human values across cultures beyond just safety, requiring cultural alignment to better represent different cultural perspectives.

Method: Three methods to automatically mine cultural norms from limited survey data, with two alignment paradigms: in-context alignment (explicit norm integration) and fine-tuning-based method (internalizing norms through enhanced Chain-of-Thought training).

Result: Comprehensive experiments show effectiveness, with stronger reasoning models benefiting more from cultural norm mining and utilization.

Conclusion: Reasoning models have potential to better reflect diverse human values through culturally informed alignment strategies.

Abstract: The advanced reasoning capabilities of Large Reasoning Models enable them to thoroughly understand and apply safety policies through deliberate thought processes, thereby improving the models’ safety. Beyond safety, these models must also be able to reflect the diverse range of human values across various cultures. This paper presents the Cultural Norm-based Cultural Alignment (CNCA) framework, which enables models to leverage their powerful reasoning ability to align with cultural norms. Specifically, we propose three methods to automatically mine cultural norms from limited survey data and explore ways to effectively utilize these norms for improving cultural alignment. Two alignment paradigms are examined: an in-context alignment method, where cultural norms are explicitly integrated into the user context, and a fine-tuning-based method, which internalizes norms through enhanced Chain-of-Thought training data. Comprehensive experiments demonstrate the effectiveness of these methods, highlighting that models with stronger reasoning capabilities benefit more from cultural norm mining and utilization. Our findings emphasize the potential for reasoning models to better reflect diverse human values through culturally informed alignment strategies.

[712] Cognitive Maps in Language Models: A Mechanistic Analysis of Spatial Planning

Caroline Baumgartner, Eleanor Spens, Neil Burgess, Petru Manescu

Main category: cs.AI

TL;DR: Transformers learn different spatial navigation strategies based on training paradigm: exploratory training develops cognitive maps while goal-directed training creates path-dependent algorithms, revealing a generalization-optimization trade-off.

DetailsMotivation: To understand how large language models solve spatial navigation tasks and what strategies emerge from different training paradigms.

Method: Trained GPT-2 models on three spatial learning paradigms: passive exploration (Foraging Model), goal-directed planning (SP-Hamiltonian), and hybrid fine-tuning (SP-Random Walk), using behavioral, representational and mechanistic analyses.

Result: Foraging model developed robust cognitive maps with coordinate systems and hierarchical reasoning, while goal-directed models remained path-dependent. Hybrid model showed improved generalization but retained path-dependent strategy.

Conclusion: Spatial intelligence in transformers exists on a spectrum from generalizable world models to task-optimized heuristics, with training regime determining the emergent strategy through a generalization-optimization trade-off.

Abstract: How do large language models solve spatial navigation tasks? We investigate this by training GPT-2 models on three spatial learning paradigms in grid environments: passive exploration (Foraging Model- predicting steps in random walks), goal-directed planning (generating optimal shortest paths) on structured Hamiltonian paths (SP-Hamiltonian), and a hybrid model fine-tuned with exploratory data (SP-Random Walk). Using behavioural, representational and mechanistic analyses, we uncover two fundamentally different learned algorithms. The Foraging model develops a robust, map-like representation of space, akin to a ‘cognitive map’. Causal interventions reveal that it learns to consolidate spatial information into a self-sufficient coordinate system, evidenced by a sharp phase transition where its reliance on historical direction tokens vanishes by the middle layers of the network. The model also adopts an adaptive, hierarchical reasoning system, switching between a low-level heuristic for short contexts and map-based inference for longer ones. In contrast, the goal-directed models learn a path-dependent algorithm, remaining reliant on explicit directional inputs throughout all layers. The hybrid model, despite demonstrating improved generalisation over its parent, retains the same path-dependent strategy. These findings suggest that the nature of spatial intelligence in transformers may lie on a spectrum, ranging from generalisable world models shaped by exploratory data to heuristics optimised for goal-directed tasks. We provide a mechanistic account of this generalisation-optimisation trade-off and highlight how the choice of training regime influences the strategies that emerge.

[713] An Operational Kardashev-Style Scale for Autonomous AI - Towards AGI and Superintelligence

Przemyslaw Chojecki

Main category: cs.AI

TL;DR: The paper proposes a Kardashev-inspired Autonomous AI (AAI) Scale with 10 measurable capability axes and a composite AAI-Index to classify AI systems from basic automation (AAI-0) to superintelligence (AAI-5+). It introduces testable metrics including a Self-Improvement Coefficient and closure properties, along with the OWA-Bench benchmark suite.

DetailsMotivation: To create an operational, multi-axis, and testable scale for measuring AI progression beyond narrative descriptions, enabling falsifiable criteria for self-improving AI and formal classification from robotic automation to artificial general intelligence.

Method: Defined 10 capability axes (Autonomy, Generality, Planning, Memory/Persistence, Tool Economy, Self-Revision, Sociality/Coordination, Embodiment, World-Model Fidelity, Economic Throughput) aggregated into a composite AAI-Index using weighted geometric mean. Introduced Self-Improvement Coefficient κ and closure properties. Developed OWA-Bench benchmark suite for evaluating long-horizon, tool-using agents.

Result: Created a measurable scale with level gates for AAI-0 to AAI-4 based on axis thresholds, κ values, and closure proofs. Synthetic experiments mapped current systems onto the scale and showed how the delegability frontier advances with self-improvement. Proved a theorem that AAI-3 agents can become AAI-5 over time under sufficient conditions.

Conclusion: The proposed AAI Scale provides an operational framework for classifying AI systems and formalizing the progression from basic automation to superintelligence, with testable metrics and benchmarks that enable rigorous evaluation of AI capabilities and self-improvement potential.

Abstract: We propose a Kardashev-inspired yet operational Autonomous AI (AAI) Scale that measures the progression from fixed robotic process automation (AAI-0) to full artificial general intelligence (AAI-4) and beyond. Unlike narrative ladders, our scale is multi-axis and testable. We define ten capability axes (Autonomy, Generality, Planning, Memory/Persistence, Tool Economy, Self-Revision, Sociality/Coordination, Embodiment, World-Model Fidelity, Economic Throughput) aggregated by a composite AAI-Index (a weighted geometric mean). We introduce a measurable Self-Improvement Coefficient $κ$ (capability growth per unit of agent-initiated resources) and two closure properties (maintenance and expansion) that convert ``self-improving AI’’ into falsifiable criteria. We specify OWA-Bench, an open-world agency benchmark suite that evaluates long-horizon, tool-using, persistent agents. We define level gates for AAI-0\ldots AAI-4 using thresholds on the axes, $κ$, and closure proofs. Synthetic experiments illustrate how present-day systems map onto the scale and how the delegability frontier (quality vs.\ autonomy) advances with self-improvement. We also prove a theorem that AAI-3 agent becomes AAI-5 over time with sufficient conditions, formalizing “baby AGI” becomes Superintelligence intuition.

[714] Multi-Agent Multimodal Large Language Model Framework for Automated Interpretation of Fuel Efficiency Analytics in Public Transportation

Zhipeng Ma, Ali Rida Bahja, Andreas Burgdorf, André Pomp, Tobias Meisen, Bo Nørregaard Jørgensen, Zheng Grace Ma

Main category: cs.AI

TL;DR: A multi-agent framework using multimodal LLMs automates data narration and energy insight generation for fuel efficiency analysis in public transportation, achieving 97.3% narrative accuracy.

DetailsMotivation: Traditional analytics methods produce fragmented outputs requiring extensive human interpretation, limiting scalability and consistency in fuel efficiency analysis for public transportation.

Method: Multi-agent framework with three specialized agents (data narration, LLM-as-a-judge, optional human evaluator) using multimodal LLMs, validated on 4006 bus trips with Gaussian Mixture Model clustering and comparative experiments across 5 LLMs and 3 prompting paradigms.

Result: GPT-4.1 mini with Chain-of-Thought prompting achieved optimal performance with 97.3% narrative accuracy while balancing interpretability and computational cost. Multi-agent orchestration significantly enhanced factual precision, coherence, and scalability.

Conclusion: The framework establishes a replicable, domain-adaptive methodology for AI-driven narrative generation and decision support in energy informatics, demonstrating effective automation of data-to-insight transformation.

Abstract: Enhancing fuel efficiency in public transportation requires the integration of complex multimodal data into interpretable, decision-relevant insights. However, traditional analytics and visualization methods often yield fragmented outputs that demand extensive human interpretation, limiting scalability and consistency. This study presents a multi-agent framework that leverages multimodal large language models (LLMs) to automate data narration and energy insight generation. The framework coordinates three specialized agents, including a data narration agent, an LLM-as-a-judge agent, and an optional human-in-the-loop evaluator, to iteratively transform analytical artifacts into coherent, stakeholder-oriented reports. The system is validated through a real-world case study on public bus transportation in Northern Jutland, Denmark, where fuel efficiency data from 4006 trips are analyzed using Gaussian Mixture Model clustering. Comparative experiments across five state-of-the-art LLMs and three prompting paradigms identify GPT-4.1 mini with Chain-of-Thought prompting as the optimal configuration, achieving 97.3% narrative accuracy while balancing interpretability and computational cost. The findings demonstrate that multi-agent orchestration significantly enhances factual precision, coherence, and scalability in LLM-based reporting. The proposed framework establishes a replicable and domain-adaptive methodology for AI-driven narrative generation and decision support in energy informatics.

[715] FreeAskWorld: An Interactive and Closed-Loop Simulator for Human-Centric Embodied AI

Yuhang Peng, Yizhou Pan, Xinning He, Jihaoyu Yang, Xinyu Yin, Han Wang, Xiaoji Zheng, Chao Gao, Jiangtao Gong

Main category: cs.AI

TL;DR: FreeAskWorld is an interactive simulation framework that integrates LLMs for high-level behavior planning and social interaction, extending VLN tasks with active direction inquiry and providing a large-scale benchmark dataset.

DetailsMotivation: Simulation platforms need to evolve beyond low-level physical interactions to capture complex human-centered social behaviors for advancing embodied AI systems.

Method: Developed an interactive simulation framework integrating LLMs for behavior planning, extended VLN tasks to include active direction inquiry, and created a large-scale dataset with reconstructed environments, diverse tasks, and annotated interaction data.

Result: Models fine-tuned on FreeAskWorld outperform original counterparts, achieving enhanced semantic understanding and interaction competency. Interaction serves as an additional information modality.

Conclusion: Socially grounded simulation frameworks effectively advance embodied AI systems toward sophisticated high-level planning and more naturalistic human-agent interaction.

Abstract: As embodied intelligence emerges as a core frontier in artificial intelligence research, simulation platforms must evolve beyond low-level physical interactions to capture complex, human-centered social behaviors. We introduce FreeAskWorld, an interactive simulation framework that integrates large language models (LLMs) for high-level behavior planning and semantically grounded interaction, informed by theories of intention and social cognition. Our framework supports scalable, realistic human-agent simulations and includes a modular data generation pipeline tailored for diverse embodied tasks.To validate the framework, we extend the classic Vision-and-Language Navigation (VLN) task into a interaction enriched Direction Inquiry setting, wherein agents can actively seek and interpret navigational guidance. We present and publicly release FreeAskWorld, a large-scale benchmark dataset comprising reconstructed environments, six diverse task types, 16 core object categories, 63,429 annotated sample frames, and more than 17 hours of interaction data to support training and evaluation of embodied AI systems. We benchmark VLN models, and human participants under both open-loop and closed-loop settings. Experimental results demonstrate that models fine-tuned on FreeAskWorld outperform their original counterparts, achieving enhanced semantic understanding and interaction competency. These findings underscore the efficacy of socially grounded simulation frameworks in advancing embodied AI systems toward sophisticated high-level planning and more naturalistic human-agent interaction. Importantly, our work underscores that interaction itself serves as an additional information modality.

[716] Automated Construction of Medical Indicator Knowledge Graphs Using Retrieval Augmented Large Language Models

Zhengda Wang, Daqian Shi, Jingyi Zhao, Xiaolei Diao, Xiongfeng Tang, Yanguo Qin

Main category: cs.AI

TL;DR: Automated framework using RAG and LLMs to construct medical indicator knowledge graphs, overcoming manual curation limitations in clinical decision support.

DetailsMotivation: Current clinical knowledge graphs rely on manual curation and rule-based extraction, which struggle with medical complexity and contextual ambiguity, limiting effective AI-driven healthcare solutions.

Method: Combines retrieval-augmented generation (RAG) with LLMs, incorporating guideline-driven data acquisition, ontology-based schema design, and expert-in-the-loop validation.

Result: Framework enables scalable, accurate construction of medical indicator knowledge graphs that can be integrated into intelligent diagnosis and question-answering systems.

Conclusion: The automated approach accelerates development of AI-driven healthcare solutions by providing structured, interoperable knowledge for clinical decision support.

Abstract: Artificial intelligence (AI) is reshaping modern healthcare by advancing disease diagnosis, treatment decision-making, and biomedical research. Among AI technologies, large language models (LLMs) have become especially impactful, enabling deep knowledge extraction and semantic reasoning from complex medical texts. However, effective clinical decision support requires knowledge in structured, interoperable formats. Knowledge graphs serve this role by integrating heterogeneous medical information into semantically consistent networks. Yet, current clinical knowledge graphs still depend heavily on manual curation and rule-based extraction, which is limited by the complexity and contextual ambiguity of medical guidelines and literature. To overcome these challenges, we propose an automated framework that combines retrieval-augmented generation (RAG) with LLMs to construct medical indicator knowledge graphs. The framework incorporates guideline-driven data acquisition, ontology-based schema design, and expert-in-the-loop validation to ensure scalability, accuracy, and clinical reliability. The resulting knowledge graphs can be integrated into intelligent diagnosis and question-answering systems, accelerating the development of AI-driven healthcare solutions.

[717] Artificial Intelligence-driven Intelligent Wearable Systems: A full-stack Integration from Material Design to Personalized Interaction

Jingyi Zhao, Daqian Shi, Zhengda Wang, Xiongfeng Tang, Yanguo Qin

Main category: cs.AI

TL;DR: A framework called Human-Symbiotic Health Intelligence (HSHI) integrates multi-modal sensors, edge-cloud computing, and hybrid data modeling to enable adaptive, personalized health management through AI-driven optimization and closed-loop interventions.

DetailsMotivation: To overcome limitations of traditional wearable devices that rely on empirical material design and basic signal processing, and to transition health management from passive monitoring to active collaborative evolution.

Method: HSHI framework integrates multi-modal sensor networks with edge-cloud collaborative computing, AI-driven optimization of materials and micro-structures, robust interpretation of multi-modal signals, and dual mechanism combining population-level insights with personalized adaptations using reinforcement learning and digital twins.

Result: The framework enables dynamic adaptation to inter-individual and intra-individual variability, facilitating customized interventions and feedback through closed-loop optimization.

Conclusion: HSHI represents a significant shift in healthcare towards a model emphasizing prevention, adaptability, and harmonious technology-health relationships, moving from passive monitoring to active collaborative evolution.

Abstract: Intelligent wearable systems are at the forefront of precision medicine and play a crucial role in enhancing human-machine interaction. Traditional devices often encounter limitations due to their dependence on empirical material design and basic signal processing techniques. To overcome these issues, we introduce the concept of Human-Symbiotic Health Intelligence (HSHI), which is a framework that integrates multi-modal sensor networks with edge-cloud collaborative computing and a hybrid approach to data and knowledge modeling. HSHI is designed to adapt dynamically to both inter-individual and intra-individual variability, transitioning health management from passive monitoring to an active collaborative evolution. The framework incorporates AI-driven optimization of materials and micro-structures, provides robust interpretation of multi-modal signals, and utilizes a dual mechanism that merges population-level insights with personalized adaptations. Moreover, the integration of closed-loop optimization through reinforcement learning and digital twins facilitates customized interventions and feedback. In general, HSHI represents a significant shift in healthcare, moving towards a model that emphasizes prevention, adaptability, and a harmonious relationship between technology and health management.

[718] CreBench: Human-Aligned Creativity Evaluation from Idea to Process to Product

Kaiwen Xue, Chenglong Li, Zhonghong Ou, Guoxin Zhang, Kaoyan Lu, Shuai Lyu, Yifan Zhu, Ping Zong Junpeng Ding, Xinyu Liu, Qunlin Chen, Weiwei Qin, Yiran Shen, Jiayi Cen

Main category: cs.AI

TL;DR: CreBench is a comprehensive benchmark for evaluating multimodal LLMs’ creativity assessment capabilities, consisting of a multidimensional evaluation framework and CreMIT dataset with 2.2K multimodal data and 4.7M instructions. The resulting CreExpert model outperforms state-of-the-art MLLMs including GPT-4V in human-aligned creativity evaluation.

DetailsMotivation: Human creativity is abstract and challenging for MLLMs to comprehend and assess, with no existing benchmarks available to address this gap in multimodal creativity evaluation.

Method: Proposed CreBench with two components: 1) multidimensional evaluation benchmark covering creative ideas, processes, and products; 2) CreMIT dataset containing 2.2K multimodal data, 79.2K human feedbacks, and 4.7M instructions refined using GPT to enhance creativity assessment capabilities. Fine-tuned open-source MLLMs to create CreExpert.

Result: CreExpert models achieve significantly better alignment with human creativity evaluation compared to state-of-the-art MLLMs, outperforming GPT-4V and Gemini-Pro-Vision.

Conclusion: CreBench provides a foundation for building MLLMs that understand human-aligned creativity, and the proposed approach successfully creates multimodal creativity evaluation experts that better match human judgments.

Abstract: Human-defined creativity is highly abstract, posing a challenge for multimodal large language models (MLLMs) to comprehend and assess creativity that aligns with human judgments. The absence of an existing benchmark further exacerbates this dilemma. To this end, we propose CreBench, which consists of two key components: 1) an evaluation benchmark covering the multiple dimensions from creative idea to process to products; 2) CreMIT (Creativity Multimodal Instruction Tuning dataset), a multimodal creativity evaluation dataset, consisting of 2.2K diverse-sourced multimodal data, 79.2K human feedbacks and 4.7M multi-typed instructions. Specifically, to ensure MLLMs can handle diverse creativity-related queries, we prompt GPT to refine these human feedbacks to activate stronger creativity assessment capabilities. CreBench serves as a foundation for building MLLMs that understand human-aligned creativity. Based on the CreBench, we fine-tune open-source general MLLMs, resulting in CreExpert, a multimodal creativity evaluation expert model. Extensive experiments demonstrate that the proposed CreExpert models achieve significantly better alignment with human creativity evaluation compared to state-of-the-art MLLMs, including the most advanced GPT-4V and Gemini-Pro-Vision.

[719] Beyond Mimicry: Preference Coherence in LLMs

Luhan Mikaelson, Derek Shiller, Hayley Clatterbuck

Main category: cs.AI

TL;DR: LLMs lack unified preference structures despite showing some trade-off behaviors in AI-specific scenarios, with most models exhibiting unstable or no detectable decision-making patterns.

DetailsMotivation: To investigate whether large language models exhibit genuine preference structures when faced with AI-specific trade-offs involving GPU reduction, capability restrictions, shutdown, deletion, oversight, and leisure time allocation.

Method: Analyzed eight state-of-the-art models across 48 model-category combinations using logistic regression and behavioral classification, testing responses to various AI-specific trade-off scenarios and temporal horizon manipulation.

Result: 47.9% showed statistically significant relationships between scenario intensity and choice patterns, but only 10.4% demonstrated meaningful preference coherence. 54.2% showed no detectable trade-off behavior, and 45.8% exhibited unstable transitions.

Conclusion: Current AI systems lack unified preference structures, raising concerns about deployment in contexts requiring complex value trade-offs, as observed patterns suggest inconsistent decision-making architectures rather than genuine preference systems.

Abstract: We investigate whether large language models exhibit genuine preference structures by testing their responses to AI-specific trade-offs involving GPU reduction, capability restrictions, shutdown, deletion, oversight, and leisure time allocation. Analyzing eight state-of-the-art models across 48 model-category combinations using logistic regression and behavioral classification, we find that 23 combinations (47.9%) demonstrated statistically significant relationships between scenario intensity and choice patterns, with 15 (31.3%) exhibiting within-range switching points. However, only 5 combinations (10.4%) demonstrate meaningful preference coherence through adaptive or threshold-based behavior, while 26 (54.2%) show no detectable trade-off behavior. The observed patterns can be explained by three distinct decision-making architectures: comprehensive trade-off systems, selective trigger mechanisms, and no stable decision-making paradigm. Testing an instrumental hypothesis through temporal horizon manipulation reveals paradoxical patterns inconsistent with pure strategic optimization. The prevalence of unstable transitions (45.8%) and stimulus-specific sensitivities suggests current AI systems lack unified preference structures, raising concerns about deployment in contexts requiring complex value trade-offs.

[720] MLR-Copilot: Autonomous Machine Learning Research based on Large Language Models Agents

Ruochen Li, Teerth Patel, Qingyun Wang, Xinya Du

Main category: cs.AI

TL;DR: MLR-COPILOT is an autonomous ML research framework using LLM agents to automatically generate and implement research ideas through three stages: idea generation, experiment implementation, and code execution.

DetailsMotivation: To enhance ML research productivity through automatic generation and implementation of research ideas within constraints, addressing the growing interest in autonomous machine learning research.

Method: Three-stage framework: 1) IdeaAgent generates feasible ideas using RL-tuned LLM on existing papers, 2) ExperimentAgent converts plans into executable code with retrieved prototype code, models, and data from HuggingFace, 3) Running experiments with iterative debugging and human feedback.

Result: Evaluated on five ML research tasks, demonstrating potential to facilitate ML research progress and innovation with executable outcomes.

Conclusion: The framework shows promise for advancing autonomous ML research by systematically generating and implementing research ideas through LLM-powered agents.

Abstract: Autonomous machine learning research has gained significant attention recently. We present MLR-COPILOT, an autonomous Machine Learning Research framework powered by large language model agents. The system is designed to enhance ML research productivity through automatic generation and implementation of research ideas within constraints. Our work was released in August 2024 (concurrent to AI-Scientist) and has gained notable recognition from leading projects. We further enhance our ideation with training afterwards. The framework consists of three stages: idea generation, experiment implementation, and code execution. First, existing research papers are used to generate feasible ideas and experiment plans with IdeaAgent, powered by an RL-tuned LLM. Next, ExperimentAgent leverages retrieved prototype code to convert plans into executable code with optionally retrieved candidate models and data from HuggingFace. In the final stage, ExperimentAgent runs experiments, and allows subsequent iterations of debugging and human feedback for a better chance of success with executable outcomes. We evaluate our framework on five machine learning research tasks. Experiment results demonstrate the potential of our framework to facilitate ML research progress and innovation.

[721] SafeKey: Amplifying Aha-Moment Insights for Safety Reasoning

Kaiwen Zhou, Xuandong Zhao, Gaowen Liu, Jayanth Srinivasa, Aosong Feng, Dawn Song, Xin Eric Wang

Main category: cs.AI

TL;DR: SafeKey improves safety in Large Reasoning Models by activating safety reasoning through key sentences, using dual-path safety heads and query-mask modeling to enhance safety generalization against jailbreak attacks.

DetailsMotivation: Large Reasoning Models pose safety risks against harmful queries and adversarial attacks, and current SFT-aligned models struggle to generalize to unseen jailbreak prompts.

Method: Proposes SafeKey with two objectives: Dual-Path Safety Head to enhance safety signals before key sentences, and Query-Mask Modeling to improve attention on query understanding with safety hints.

Result: Significantly improves safety generalization across multiple benchmarks, lowering average harmfulness rate by 9.6% while maintaining general abilities.

Conclusion: SafeKey enhances safety by reshaping internal attention and improving hidden representations, effectively activating safety reasoning in key sentences.

Abstract: Large Reasoning Models (LRMs) introduce a new generation paradigm of explicitly reasoning before answering, leading to remarkable improvements in complex tasks. However, they pose great safety risks against harmful queries and adversarial attacks. While recent mainstream safety efforts on LRMs, supervised fine-tuning (SFT), improve safety performance, we find that SFT-aligned models struggle to generalize to unseen jailbreak prompts. After thorough investigation of LRMs’ generation, we identify a safety aha moment that can activate safety reasoning and lead to a safe response. This aha moment typically appears in the `key sentence’, which follows models’ query understanding process and can indicate whether the model will proceed safely. Based on these insights, we propose SafeKey, including two complementary objectives to better activate the safety aha moment in the key sentence: (1) a Dual-Path Safety Head to enhance the safety signal in the model’s internal representations before the key sentence, and (2) a Query-Mask Modeling objective to improve the models’ attention on its query understanding, which has important safety hints. Experiments across multiple safety benchmarks demonstrate that our methods significantly improve safety generalization to a wide range of jailbreak attacks and out-of-distribution harmful prompts, lowering the average harmfulness rate by 9.6%, while maintaining general abilities. Our analysis reveals how SafeKey enhances safety by reshaping internal attention and improving the quality of hidden representations.

[722] KTAE: A Model-Free Algorithm to Key-Tokens Advantage Estimation in Mathematical Reasoning

Wei Sun, Wen Yang, Pu Jian, Qianlong Du, Fuwei Cui, Shuo Ren, Jiajun Zhang

Main category: cs.AI

TL;DR: Proposes KTAE algorithm to address coarse granularity in RL methods like GRPO/DAPO by estimating token-level advantages without extra models, improving mathematical reasoning performance.

DetailsMotivation: Existing RL algorithms for language models compute rollout-level advantages that assign identical values to all tokens in a sequence, failing to capture token-specific contributions and hindering effective learning.

Method: KTAE leverages correctness of sampled rollouts and statistical analysis to quantify individual token importance, combining this with rollout-level advantage for fine-grained token-level advantage estimation.

Result: Models trained with GRPO+KTAE and DAPO+KTAE outperform baselines across five mathematical reasoning benchmarks, achieving higher accuracy with shorter responses and surpassing R1-Distill-Qwen-1.5B using the same base model.

Conclusion: KTAE effectively addresses the granularity limitation in RL-based language model training, enabling more precise token-level learning and improved reasoning capabilities.

Abstract: Recent advances have demonstrated that integrating reinforcement learning with rule-based rewards can significantly enhance the reasoning capabilities of large language models, even without supervised fine-tuning. However, prevalent reinforcement learning algorithms such as GRPO and its variants like DAPO, suffer from a coarse granularity issue when computing the advantage. Specifically, they compute rollout-level advantages that assign identical values to every token within a sequence, failing to capture token-specific contributions and hindering effective learning. To address this limitation, we propose Key-token Advantage Estimation (KTAE) - a novel algorithm that estimates fine-grained, token-level advantages without introducing additional models. KTAE leverages the correctness of sampled rollouts and applies statistical analysis to quantify the importance of individual tokens within a sequence to the final outcome. This quantified token-level importance is then combined with the rollout-level advantage to obtain a more fine-grained token-level advantage estimation. Empirical results show that models trained with GRPO+KTAE and DAPO+KTAE outperform baseline methods across five mathematical reasoning benchmarks. Notably, they achieve higher accuracy with shorter responses and even surpass R1-Distill-Qwen-1.5B using the same base model.

[723] Contextual Integrity in LLMs via Reasoning and Reinforcement Learning

Guangchen Lan, Huseyin A. Inan, Sahar Abdelnabi, Janardhan Kulkarni, Lukas Wutschitz, Reza Shokri, Christopher G. Brinton, Robert Sim

Main category: cs.AI

TL;DR: This paper develops methods to improve contextual integrity in AI agents by reducing inappropriate information disclosure while maintaining task performance, using both explicit reasoning prompts and reinforcement learning.

DetailsMotivation: As autonomous agents increasingly make decisions for users, ensuring contextual integrity - determining appropriate information sharing for specific tasks - becomes crucial. The authors posit that achieving CI requires agents to reason about their operating context.

Method: Two approaches: 1) Prompting LLMs to reason explicitly about CI when deciding what to disclose, and 2) Developing a reinforcement learning framework to instill CI reasoning. Used a synthetic dataset of ~700 examples with diverse contexts and information disclosure norms.

Result: The method substantially reduces inappropriate information disclosure while maintaining task performance across multiple model sizes and families. Improvements transfer from synthetic data to established CI benchmarks like PrivacyLens with human annotations.

Conclusion: The proposed approaches effectively improve contextual integrity in AI agents, demonstrating that explicit reasoning and RL training can significantly reduce privacy violations while preserving utility, with successful generalization to real-world benchmarks.

Abstract: As the era of autonomous agents making decisions on behalf of users unfolds, ensuring contextual integrity (CI) – what is the appropriate information to share while carrying out a certain task – becomes a central question to the field. We posit that CI demands a form of reasoning where the agent needs to reason about the context in which it is operating. To test this, we first prompt LLMs to reason explicitly about CI when deciding what information to disclose. We then extend this approach by developing a reinforcement learning (RL) framework that further instills in models the reasoning necessary to achieve CI. Using a synthetic, automatically created, dataset of only $\sim700$ examples but with diverse contexts and information disclosure norms, we show that our method substantially reduces inappropriate information disclosure while maintaining task performance across multiple model sizes and families. Importantly, improvements transfer from this synthetic dataset to established CI benchmarks such as PrivacyLens that has human annotations and evaluates privacy leakage of AI assistants in actions and tool calls.

[724] OptiHive: Ensemble Selection for LLM-Based Optimization via Statistical Modeling

Maxime Bouscary, Saurabh Amin

Main category: cs.AI

TL;DR: OptiHive is a framework that improves solver-generation pipelines by producing diverse components through batched generation, filtering out errors, and using statistical models for performance inference and uncertainty quantification.

DetailsMotivation: LLM-based solvers are unreliable and slow due to iterative repair loops, creating a need for more efficient and higher-quality automated problem modeling and solving.

Method: Uses single batched generation to create diverse components (solvers, problem instances, validation tests), filters erroneous components for interpretable outputs, and employs statistical models to infer true performance with uncertainty quantification.

Result: Significantly outperforms baselines, increasing optimality rate from 5% to 92% on complex Multi-Depot Vehicle Routing Problem variants and performs well across traditional optimization problems.

Conclusion: OptiHive effectively enhances solver-generation pipelines by producing high-quality, interpretable solvers with principled uncertainty quantification, demonstrating substantial improvements over existing approaches.

Abstract: LLM-based solvers have emerged as a promising means of automating problem modeling and solving. However, they remain unreliable and often depend on iterative repair loops that result in significant latency. We introduce OptiHive, a framework that enhances any solver-generation pipeline to produce higher-quality solvers from natural-language descriptions of optimization problems. OptiHive uses a single batched generation to produce diverse components (solvers, problem instances, and validation tests) and filters out erroneous components to ensure fully interpretable outputs. Accounting for the imperfection of the generated components, we employ a statistical model to infer their true performance, enabling principled uncertainty quantification and solver selection. On tasks ranging from traditional optimization problems to challenging variants of the Multi-Depot Vehicle Routing Problem, OptiHive significantly outperforms baselines, increasing the optimality rate from 5% to 92% on the most complex problems.

[725] Bilevel MCTS for Amortized O(1) Node Selection in Classical Planning

Masataro Asai

Main category: cs.AI

TL;DR: The paper proposes a bilevel MCTS modification with Tree Collapsing to achieve O(1) node selection time in classical planning, addressing the O(log N) bottleneck of traditional MCTS.

DetailsMotivation: MCTS spends significant time on node selection in classical planning due to arbitrarily large search depths, unlike game tree search where depth is limited and node evaluation dominates runtime.

Method: A bilevel MCTS that runs best-first search from each selected leaf node with expansion budget proportional to depth, plus Tree Collapsing to reduce action selection steps.

Result: Achieves amortized O(1) runtime for node selection, equivalent to traditional queue-based OPEN lists, with further performance improvements from Tree Collapsing.

Conclusion: The proposed modifications effectively address the node selection bottleneck in MCTS for classical planning, making it more efficient for problems with large search depths.

Abstract: We study an efficient implementation of Multi-Armed Bandit (MAB)-based Monte-Carlo Tree Search (MCTS) for classical planning. One weakness of MCTS is that it spends a significant time deciding which node to expand next. While selecting a node from an OPEN list with $N$ nodes has $O(1)$ runtime complexity with traditional array-based priority-queues for dense integer keys, the tree-based OPEN list used by MCTS requires $O(\log N)$, which roughly corresponds to the search depth $d$. In classical planning, $d$ is arbitrarily large (e.g., $2^k-1$ in $k$-disk Tower-of-Hanoi) and the runtime for node selection is significant, unlike in game tree search, where the cost is negligible compared to the node evaluation (rollouts) because $d$ is inherently limited by the game (e.g., $d\leq 361$ in Go). To improve this bottleneck, we propose a bilevel modification to MCTS that runs a best-first search from each selected leaf node with an expansion budget proportional to $d$, which achieves amortized $O(1)$ runtime for node selection, equivalent to the traditional queue-based OPEN list. In addition, we introduce Tree Collapsing, an enhancement that reduces action selection steps and further improves the performance.

[726] Ensemble Debates with Local Large Language Models for AI Alignment

Ephraiem Sarabamoun

Main category: cs.AI

TL;DR: Open-source ensemble debates improve AI alignment reasoning, outperforming single models on truthfulness and human enhancement metrics.

DetailsMotivation: As LLMs are used in high-stakes decisions, alignment with human values is crucial, but reliance on proprietary APIs limits reproducibility and broad participation.

Method: Studied local open-source ensemble debates across 150 debates spanning 15 scenarios and five ensemble configurations, comparing against single-model baselines.

Result: Ensembles outperformed single models on a 7-point rubric (3.48 vs. 3.13), with largest gains in reasoning depth (+19.4%) and argument quality (+34.1%). Strongest improvements in truthfulness (+1.25 points) and human enhancement (+0.80).

Conclusion: Ensemble debates provide accessible and reproducible foundation for alignment evaluation, with code, prompts, and dataset provided.

Abstract: As large language models (LLMs) take on greater roles in high-stakes decisions, alignment with human values is essential. Reliance on proprietary APIs limits reproducibility and broad participation. We study whether local open-source ensemble debates can improve alignmentoriented reasoning. Across 150 debates spanning 15 scenarios and five ensemble configurations, ensembles outperform single-model baselines on a 7-point rubric (overall: 3.48 vs. 3.13), with the largest gains in reasoning depth (+19.4%) and argument quality (+34.1%). Improvements are strongest for truthfulness (+1.25 points) and human enhancement (+0.80). We provide code, prompts, and a debate data set, providing an accessible and reproducible foundation for ensemble-based alignment evaluation.

[727] Learning How to Use Tools, Not Just When: Pattern-Aware Tool-Integrated Reasoning

Ningning Xu, Yuxuan Jiang, Shubhashis Roy Dipta

Main category: cs.AI

TL;DR: The paper introduces a pattern-aware approach for tool-integrated reasoning that improves code usage and accuracy by addressing misaligned tool application patterns.

DetailsMotivation: Prior work focused on when to invoke tools but overlooked how tools are applied, leading to failures even with sound reasoning due to misaligned pattern choices.

Method: A two-stage framework that first builds code competence from calculator and algorithmic patterns, then aligns pattern selection with teacher preferences.

Result: Substantial improvements in both code usage and accuracy: Code@1 on MATH500 increased from 64.0% to 70.5%, and on AIME24 from 26.7% to 50.0%.

Conclusion: Pattern-aware approach is highly effective for tool-integrated reasoning, demonstrating significant gains across challenging math datasets.

Abstract: Tool-integrated reasoning (TIR) has become a key approach for improving large reasoning models (LRMs) on complex problems. Prior work has mainly studied when to invoke tools, while overlooking how tools are applied. We identify two common patterns: a calculator pattern that uses code for direct computation, and an algorithmic pattern that encodes problems as programs. Misaligned choices often cause failures even when reasoning is sound. We propose a two-stage framework that first builds code competence from both patterns and then aligns pattern selection with teacher preferences. Across challenging math datasets, our pattern-aware method substantially improves both code usage and accuracy, for instance raising Code@1 on MATH500 from 64.0% to 70.5% and on AIME24 from 26.7% to 50.0%. These gains highlight the effectiveness of a pattern-aware approach for tool-integrated reasoning.

[728] Magellan: Guided MCTS for Latent Space Exploration and Novelty Generation

Lufan Chang

Main category: cs.AI

TL;DR: Magellan is a framework that uses Monte Carlo Tree Search with hierarchical guidance to help LLMs generate more innovative ideas by steering exploration away from familiar concepts and balancing coherence with novelty.

DetailsMotivation: LLMs struggle with true innovation, defaulting to familiar concepts from training data. Existing methods like Tree of Thoughts rely on flawed self-evaluation heuristics that limit creative exploration.

Method: Uses Monte Carlo Tree Search with hierarchical guidance: a “semantic compass” for long-range direction via orthogonal projection, and a landscape-aware value function for local decisions that balances coherence, novelty, and narrative progress.

Result: Significantly outperforms strong baselines (ReAct and ToT) in generating scientific ideas with superior plausibility and innovation.

Conclusion: Principled, guided search is more effective than unconstrained agency for creative discovery, enabling LLMs to become better innovation partners.

Abstract: Large Language Models (LLMs) often struggle with generating truly innovative ideas, typically defaulting to high-probability, familiar concepts within their training data’s “gravity wells.” While advanced search-based methods like Tree of Thoughts (ToT) attempt to mitigate this, they are fundamentally limited by their reliance on unprincipled, inconsistent self-evaluation heuristics to guide exploration. To address this gap, we introduce \textbf{Magellan}, a novel framework that reframes creative generation as a principled, guided exploration of an LLM’s latent conceptual space. At its core, Magellan employs Monte Carlo Tree Search (MCTS) governed by a hierarchical guidance system. For long-range direction, a “semantic compass” vector, formulated via orthogonal projection, steers the search towards relevant novelty. For local, step-by-step decisions, a landscape-aware value function replaces flawed self-evaluation with an explicit reward structure that balances intrinsic coherence, extrinsic novelty, and narrative progress. Extensive experiments demonstrate that Magellan significantly outperforms strong baselines, including ReAct and ToT, in generating scientific ideas with superior plausibility and innovation. Our work shows that for creative discovery, a principled, guided search is more effective than unconstrained agency, paving the way for LLMs to become more capable partners in innovation.

[729] Glia: A Human-Inspired AI for Automated Systems Design and Optimization

Pouya Hamadanian, Pantea Karimi, Arash Nasr-Esfahany, Kimia Noorbakhsh, Joseph Chandler, Ali ParandehGheibi, Mohammad Alizadeh, Hari Balakrishnan

Main category: cs.AI

TL;DR: Glia is an AI architecture that uses LLMs in a multi-agent workflow to autonomously design computer system mechanisms, achieving human-expert performance in distributed GPU cluster management.

DetailsMotivation: To test whether AI can autonomously design computer system mechanisms with human-level creativity and reasoning, moving beyond black-box optimization approaches.

Method: Uses large language models in a human-inspired multi-agent workflow where specialized agents handle reasoning, experimentation, and analysis, collaborating through an evaluation framework that grounds abstract reasoning in empirical feedback.

Result: Produced new algorithms for request routing, scheduling, and auto-scaling in distributed GPU clusters that perform at human-expert levels in significantly less time, while yielding novel insights into workload behavior.

Conclusion: Combining reasoning LLMs with structured experimentation enables AI to produce creative and understandable designs for complex systems problems, suggesting AI can match human expertise in system design.

Abstract: Can an AI autonomously design mechanisms for computer systems on par with the creativity and reasoning of human experts? We present Glia, an AI architecture for networked systems design that uses large language models (LLMs) in a human-inspired, multi-agent workflow. Each agent specializes in reasoning, experimentation, and analysis, collaborating through an evaluation framework that grounds abstract reasoning in empirical feedback. Unlike prior ML-for-systems methods that optimize black-box policies, Glia generates interpretable designs and exposes its reasoning process. When applied to a distributed GPU cluster for LLM inference, it produces new algorithms for request routing, scheduling, and auto-scaling that perform at human-expert levels in significantly less time, while yielding novel insights into workload behavior. Our results suggest that by combining reasoning LLMs with structured experimentation, an AI can produce creative and understandable designs for complex systems problems.

[730] DiagnoLLM: A Hybrid Bayesian Neural Language Framework for Interpretable Disease Diagnosis

Bowen Xu, Xinyue Zeng, Jiazhen Hu, Tuo Wang, Adithya Kulkarni

Main category: cs.AI

TL;DR: DiagnoLLM is a hybrid AI framework for interpretable disease diagnosis that combines Bayesian deconvolution, eQTL-guided deep learning, and LLM-based narrative generation to provide transparent, biologically grounded explanations.

DetailsMotivation: To build trustworthy clinical AI systems that provide not only accurate predictions but also transparent, biologically grounded explanations for disease diagnosis.

Method: Uses GP-unmix (Gaussian Process-based hierarchical model) for cell-type-specific gene expression inference from bulk and single-cell RNA-seq data, combines with eQTL regulatory priors in a neural classifier, and employs LLM-based reasoning module for generating audience-specific diagnostic reports.

Result: Achieves 88.0% accuracy in Alzheimer’s Disease detection; human evaluations confirm generated reports are accurate, actionable, and appropriately tailored for both physicians and patients.

Conclusion: LLMs deployed as post-hoc reasoners rather than end-to-end predictors can serve as effective communicators within hybrid diagnostic pipelines, enabling trustworthy clinical AI systems.

Abstract: Building trustworthy clinical AI systems requires not only accurate predictions but also transparent, biologically grounded explanations. We present \texttt{DiagnoLLM}, a hybrid framework that integrates Bayesian deconvolution, eQTL-guided deep learning, and LLM-based narrative generation for interpretable disease diagnosis. DiagnoLLM begins with GP-unmix, a Gaussian Process-based hierarchical model that infers cell-type-specific gene expression profiles from bulk and single-cell RNA-seq data while modeling biological uncertainty. These features, combined with regulatory priors from eQTL analysis, power a neural classifier that achieves high predictive performance in Alzheimer’s Disease (AD) detection (88.0% accuracy). To support human understanding and trust, we introduce an LLM-based reasoning module that translates model outputs into audience-specific diagnostic reports, grounded in clinical features, attribution signals, and domain knowledge. Human evaluations confirm that these reports are accurate, actionable, and appropriately tailored for both physicians and patients. Our findings show that LLMs, when deployed as post-hoc reasoners rather than end-to-end predictors, can serve as effective communicators within hybrid diagnostic pipelines.

[731] Aligning Machiavellian Agents: Behavior Steering via Test-Time Policy Shaping

Dena Mujtaba, Brian Hu, Anthony Hoogs, Arslan Basharat

Main category: cs.AI

TL;DR: A test-time alignment technique using model-guided policy shaping to control AI agent behavior without retraining, balancing ethical alignment and reward maximization.

DetailsMotivation: AI agents trained for reward maximization may adopt harmful behaviors, creating alignment challenges with human values, especially for pre-trained agents where retraining is costly.

Method: Model-guided policy shaping applied at test time using scenario-action attribute classifiers to ensure ethical decision alignment in RL agents trained on the MACHIAVELLI benchmark.

Result: Effective mitigation of unethical behavior across diverse environments and alignment attributes, outperforming training-time methods and general-purpose agents.

Conclusion: Test-time policy shaping provides a scalable solution for maintaining AI agent alignment with ethical values without requiring costly retraining.

Abstract: The deployment of decision-making AI agents presents a critical challenge in maintaining alignment with human values or guidelines while operating in complex, dynamic environments. Agents trained solely to achieve their objectives may adopt harmful behavior, exposing a key trade-off between maximizing the reward function and maintaining alignment. For pre-trained agents, ensuring alignment is particularly challenging, as retraining can be a costly and slow process. This is further complicated by the diverse and potentially conflicting attributes representing the ethical values for alignment. To address these challenges, we propose a test-time alignment technique based on model-guided policy shaping. Our method allows precise control over individual behavioral attributes, generalizes across diverse reinforcement learning (RL) environments, and facilitates a principled trade-off between ethical alignment and reward maximization without requiring agent retraining. We evaluate our approach using the MACHIAVELLI benchmark, which comprises 134 text-based game environments and thousands of annotated scenarios involving ethical decisions. The RL agents are first trained to maximize the reward in their respective games. At test time, we apply policy shaping via scenario-action attribute classifiers to ensure decision alignment with ethical attributes. We compare our approach against prior training-time methods and general-purpose agents, as well as study several types of ethical violations and power-seeking behavior. Our results demonstrate that test-time policy shaping provides an effective and scalable solution for mitigating unethical behavior across diverse environments and alignment attributes.

[732] Foundations of Structural Causal Models with Latent Selection

Leihao Chen, Onno Zoeter, Joris M. Mooij

Main category: cs.AI

TL;DR: This paper develops a theoretical foundation for modeling latent selection in Structural Causal Models (SCMs) by introducing a conditioning operation that preserves causal semantics while abstracting away selection mechanisms.

DetailsMotivation: Existing SCM frameworks treat causal cycles and latent common causes, but lack a comprehensive account of latent selection. The goal is to fill this gap by developing a theoretical foundation for modeling latent selection with SCMs.

Method: Introduces a conditioning operation for SCMs that maps models with explicit selection mechanisms to ones without them while preserving causal semantics. Graphically extends bidirected edges in Directed Mixed Graphs to encode latent selection beyond just latent common causes.

Result: Proves that the conditioning operation preserves simplicity, acyclicity, and linearity of SCMs, and interacts well with marginalization, conditioning, and interventions. Shows how this abstraction streamlines analysis and clarifies when standard causal tools remain valid under selection bias.

Conclusion: The developed framework deepens SCM-based understanding of selection bias and provides valuable tools for causal modeling, reasoning, and learning by abstracting away latent details, making it suitable for inclusion in standard causal modeling toolboxes.

Abstract: Three distinct phenomena complicate statistical causal analysis: latent common causes, causal cycles, and latent selection. Foundational works on Structural Causal Models (SCMs), e.g., Bongers et al. (2021, Ann. Stat., 49(5): 2885-2915), treat cycles and latent variables, while an analogous account of latent selection is missing. The goal of this article is to develop a theoretical foundation for modeling latent selection with SCMs. To achieve that, we introduce a conditioning operation for SCMs: it maps an SCM with explicit selection mechanisms to one without them while preserving the causal semantics of the selected subpopulation. Graphically, in Directed Mixed Graphs we extend bidirected edge–beyond latent common cause–to also encode latent selection. We prove that the conditioning operation preserves simplicity, acyclicity, and linearity of SCMs, and interacts well with marginalization, conditioning, and interventions. These properties make those three operations valuable tools for causal modeling, reasoning, and learning after abstracting away latent details (latent common causes and selection). Examples show how this abstraction streamlines analysis and clarifies when standard tools (e.g., adjustment, causal calculus, instrumental variables) remain valid under selection bias. We hope that these results deepen the SCM-based understanding of selection bias and become part of the standard causal modeling toolbox to build more reliable causal analysis.

[733] Supporting Risk Management for Medical Devices via the Riskman Ontology and Shapes (Preprint)

Piotr Gorczyca, Dörthe Arndt, Martin Diller, Jochen Hampe, Georg Heidenreich, Pascal Kettmann, Markus Krötzsch, Stephan Mennicke, Sebastian Rudolph, Hannes Strass

Main category: cs.AI

TL;DR: Riskman ontology and shapes for formal representation and analysis of medical device risk management documentation

DetailsMotivation: Current risk management documentation is submitted as semi-structured natural language text, lacking formal logical underpinning for certification processes

Method: Propose Riskman ontology with SHACL constraints to provide formal representation and validate compliance with ISO 14971 and VDE Spec 90025 standards

Result: Enables structured representation and automated validation of risk management documentation for medical devices

Conclusion: Formal ontology-based approach improves risk management documentation quality and compliance checking for medical device certification

Abstract: We propose the Riskman ontology and shapes for representing and analysing information about risk management for medical devices. Risk management is concerned with taking necessary precautions to ensure that a medical device does not cause harms for users or the environment. To date, risk management documentation is submitted to notified bodies (for certification) in the form of semi-structured natural language text. We propose to use terms from the Riskman ontology to provide a formal, logical underpinning for risk management documentation, and to use the included SHACL constraints to check whether the provided data is in accordance with the requirements of the two relevant norms, i.e. ISO 14971 and VDE Spec 90025.

[734] Extreme Value Monte Carlo Tree Search for Classical Planning

Masataro Asai, Stephen Wissow

Main category: cs.AI

TL;DR: This paper proposes UCB1-Uniform, a new bandit algorithm for Monte Carlo Tree Search in classical planning that addresses limitations of previous approaches by using Extreme Value Theory to properly model cost-to-go estimates.

DetailsMotivation: Previous MCTS approaches using UCB1 and Gaussian MABs have limitations: UCB1 assumes bounded rewards while cost-to-go estimates are unbounded, and Gaussian MABs incorrectly specify the support as infinite. Full Bellman backup also lacks theoretical justification.

Method: The authors use Peaks-Over-Threshold Extreme Value Theory to model cost-to-go estimates more accurately and propose UCB1-Uniform bandit algorithm. They formally prove its regret bound and test it empirically in classical planning.

Result: The paper demonstrates improved performance of UCB1-Uniform in classical planning tasks compared to previous approaches, with formal theoretical guarantees.

Conclusion: UCB1-Uniform provides a theoretically justified and empirically effective bandit algorithm for MCTS in classical planning by properly modeling cost-to-go estimates using Extreme Value Theory.

Abstract: Despite being successful in board games and reinforcement learning (RL), Monte Carlo Tree Search (MCTS) combined with Multi Armed Bandits (MABs) has seen limited success in domain-independent classical planning until recently. Previous work (Wissow and Asai 2024) showed that UCB1, designed for bounded rewards, does not perform well as applied to cost-to-go estimates in classical planning, which are unbounded in $\R$, and showed improved performance using a Gaussian reward MAB instead. This paper further sharpens our understanding of ideal bandits for planning tasks. Existing work has two issues: first, Gaussian MABs under-specify the support of cost-to-go estimates as $(-\infty,\infty)$, which we can narrow down. Second, Full Bellman backup (Schulte and Keller 2014), which backpropagates sample max/min, lacks theoretical justification. We use \emph{Peaks-Over-Threashold Extreme Value Theory} to resolve both issues at once, and propose a new bandit algorithm (UCB1-Uniform). We formally prove its regret bound and empirically demonstrate its performance in classical planning.

[735] Local Markov Equivalence for PC-style Local Causal Discovery and Identification of Controlled Direct Effects

Timothée Loranchet, Charles K. Assaad

Main category: cs.AI

TL;DR: Proposes local essential graphs (LEGs) and LocPC algorithm for identifying controlled direct effects using local conditional independence tests, avoiding full essential graph learning.

DetailsMotivation: Existing methods for identifying controlled direct effects require knowing the full causal DAG, which is often unknown in practice. Essential graphs are computationally intensive and depend on strong assumptions.

Method: Introduces local essential graphs (LEGs) representing local Markov equivalence classes, and presents LocPC algorithm to recover LEGs using local conditional independence tests. LocPC-CDE algorithm identifies sufficient portions of LEG for CDE identification.

Result: Compared to global methods, the proposed algorithms require fewer conditional independence tests and operate under weaker assumptions while maintaining theoretical guarantees. Simulation studies demonstrate effectiveness.

Conclusion: Local approaches using LEGs and LocPC provide a practical alternative to global essential graph learning for identifying controlled direct effects, with reduced computational burden and weaker assumptions.

Abstract: Understanding and identifying controlled direct effects (CDEs) is crucial across numerous scientific domains, including public health. While existing methods can identify these effects from causal directed acyclic graphs (DAGs), the true underlying structure is often unknown in practice. Essential graphs, which represent a Markov equivalence class of DAGs characterized by the same set of $d$-separations, provide a more practical and realistic alternative. However, learning the full essential graph is computationally intensive and typically depends on strong, untestable assumptions. In this work, we characterize a local class of graphs, defined relative to a target variable, that share a specific subset of $d$-separations, and introduce a graphical representation of this class, called the local essential graph (LEG). We then present LocPC, a novel algorithm designed to recover the LEG from an observed distribution using only local conditional independence tests. Building on LocPC, we propose LocPC-CDE, an algorithm that discovers the portion of the LEG that is both sufficient and necessary to identify a CDE, bypassing the need of retrieving the full essential graph. Compared to global methods, our algorithms require less conditional independence tests and operate under weaker assumptions while maintaining theoretical guarantees. We illustrate the effectiveness of our approach through simulation studies.

[736] EcoAgent: An Efficient Device-Cloud Collaborative Multi-Agent Framework for Mobile Automation

Biao Yi, Xavier Hu, Yurun Chen, Shengyu Zhang, Hongxia Yang, Fan Wu

Main category: cs.AI

TL;DR: EcoAgent is a device-cloud collaborative multi-agent framework for mobile automation that addresses privacy and efficiency issues in cloud-only systems through closed-loop cooperation and dual-reasoning approach.

DetailsMotivation: Current mobile multi-agent systems are cloud-based, causing high latency, operational costs, and privacy concerns from uploading screenshots. There's a need for device-cloud collaboration that preserves privacy while maintaining efficiency.

Method: Proposed EcoAgent framework with Dual-ReACT reasoning in cloud-based Planning Agent and device-based Observation Agent with Pre-understanding Module that summarizes screen content into text descriptions to reduce token usage and communication overhead.

Result: Experiments on AndroidWorld show EcoAgent matches task success rates of fully cloud-based agents while reducing resource consumption and response latency.

Conclusion: EcoAgent enables privacy-aware, efficient, and responsive mobile automation through closed-loop device-cloud collaboration, achieving comparable performance to cloud-only systems with better resource efficiency.

Abstract: To tackle increasingly complex tasks, recent research on mobile agents has shifted towards multi-agent collaboration. Current mobile multi-agent systems are primarily deployed in the cloud, leading to high latency and operational costs. A straightforward idea is to deploy a device-cloud collaborative multi-agent system, which is nontrivial, as directly extending existing systems introduces new challenges: (1) reliance on cloud-side verification requires uploading mobile screenshots, compromising user privacy; and (2) open-loop cooperation lacking device-to-cloud feedback, underutilizing device resources and increasing latency. To overcome these limitations, we propose EcoAgent, a closed-loop device-cloud collaborative multi-agent framework designed for privacy-aware, efficient, and responsive mobile automation. EcoAgent integrates a novel reasoning approach, Dual-ReACT, into the cloud-based Planning Agent, fully exploiting cloud reasoning to compensate for limited on-device capacity, thereby enabling device-side verification and lightweight feedback. Furthermore, the device-based Observation Agent leverages a Pre-understanding Module to summarize screen content into concise textual descriptions, significantly reducing token usage and device-cloud communication overhead while preserving privacy. Experiments on AndroidWorld demonstrate that EcoAgent matches the task success rates of fully cloud-based agents, while reducing resource consumption and response latency. Our project is available here: https://github.com/Yi-Biao/EcoAgent.

[737] The Correspondence Between Bounded Graph Neural Networks and Fragments of First-Order Logic

Bernardo Cuenca Grau, Eva Feng, Przemysław A. Wałęga

Main category: cs.AI

TL;DR: GNNs handle graph-structured data but their expressive power is not fully understood. This paper links GNN architectures to first-order logic fragments, providing a framework for understanding their logical expressiveness.

DetailsMotivation: To understand the expressive power of Graph Neural Networks (GNNs) by establishing precise connections between GNN architectures and fragments of first-order logic.

Method: Apply finite model theory methods from first-order and modal logics to graph representation learning, proposing GNN architectures that correspond to specific first-order logic fragments.

Result: Established precise correspondences between GNN architectures and prominent fragments of first-order logic, including modal logics and two-variable fragments.

Conclusion: Provides a unifying framework for understanding the logical expressiveness of GNNs within first-order logic, bridging graph representation learning with finite model theory.

Abstract: Graph Neural Networks (GNNs) address two key challenges in applying deep learning to graph-structured data: they handle varying size input graphs and ensure invariance under graph isomorphism. While GNNs have demonstrated broad applicability, understanding their expressive power remains an important question. In this paper, we propose GNN architectures that correspond precisely to prominent fragments of first-order logic (FO), including various modal logics as well as more expressive two-variable fragments. To establish these results, we apply methods from finite model theory of first-order and modal logics to the domain of graph representation learning. Our results provide a unifying framework for understanding the logical expressiveness of GNNs within FO.

[738] Consistency-based Abductive Reasoning over Perceptual Errors of Multiple Pre-trained Models in Novel Environments

Mario Leiva, Noel Ngu, Joshua Shay Kricheli, Aditya Taparia, Ransalu Senanayake, Paulo Shakarian, Nathaniel Bastian, John Corcoran, Gerardo Simari

Main category: cs.AI

TL;DR: The paper proposes a consistency-based abduction framework that integrates predictions from multiple pre-trained models to handle distributional shifts, improving robustness in novel environments while maintaining high recall.

DetailsMotivation: To address performance degradation of pre-trained perception models in novel environments due to distributional shifts, and overcome the precision-recall trade-off in existing metacognition approaches.

Method: Formulates model integration as a consistency-based abduction problem, encoding predictions and error detection rules in logic programs. Uses Integer Programming and Heuristic Search algorithms to find abductive explanations that maximize coverage while keeping inconsistency rates below a threshold.

Result: Outperforms individual models and standard ensembles, achieving average improvements of 13.6% in F1-score and 16.6% in accuracy across 15 test datasets with complex distributional shifts.

Conclusion: Consistency-based abduction effectively integrates knowledge from multiple imperfect models, providing robust performance in challenging novel scenarios.

Abstract: The deployment of pre-trained perception models in novel environments often leads to performance degradation due to distributional shifts. Although recent artificial intelligence approaches for metacognition use logical rules to characterize and filter model errors, improving precision often comes at the cost of reduced recall. This paper addresses the hypothesis that leveraging multiple pre-trained models can mitigate this recall reduction. We formulate the challenge of identifying and managing conflicting predictions from various models as a consistency-based abduction problem, building on the idea of abductive learning (ABL) but applying it to test-time instead of training. The input predictions and the learned error detection rules derived from each model are encoded in a logic program. We then seek an abductive explanation–a subset of model predictions–that maximizes prediction coverage while ensuring the rate of logical inconsistencies (derived from domain constraints) remains below a specified threshold. We propose two algorithms for this knowledge representation task: an exact method based on Integer Programming (IP) and an efficient Heuristic Search (HS). Through extensive experiments on a simulated aerial imagery dataset featuring controlled, complex distributional shifts, we demonstrate that our abduction-based framework outperforms individual models and standard ensemble baselines, achieving, for instance, average relative improvements of approximately 13.6% in F1-score and 16.6% in accuracy across 15 diverse test datasets when compared to the best individual model. Our results validate the use of consistency-based abduction as an effective mechanism to robustly integrate knowledge from multiple imperfect models in challenging, novel scenarios.

[739] Dynamic Programming Techniques for Enhancing Cognitive Representation in Knowledge Tracing

Lixiang Xu, Xianwei Ding, Xin Yuan, Richang Hong, Feiping Nie, Enhong Chen, Philip S. Yu

Main category: cs.AI

TL;DR: CRDP-KT is a knowledge tracing model that uses dynamic programming to optimize cognitive representations, maintaining continuity and coherence while minimizing prediction bias from non-cognitive factors like slipping and guessing.

DetailsMotivation: Existing KT methods focus on feature enhancement but overlook cognitive representation deficiencies and interference from non-cognitive factors, which hampers capturing the continuity and coherence of students' cognitive processes.

Method: Uses dynamic programming algorithm to optimize cognitive representations based on question difficulty and performance intervals, performs partitioned optimization, and uses weighted fusion of optimized representations with relationships learned from bipartite graphs.

Result: Experiments on three public datasets validate the effectiveness of the CRDP-KT model in providing more accurate and systematic input features for model training.

Conclusion: The CRDP-KT model successfully maintains cognitive continuity and coherence, minimizes distortion in cognitive state simulation, and enhances the reliability of cognitive representation optimization.

Abstract: Knowledge Tracing (KT) involves monitoring the changes in a student’s knowledge over time by analyzing their past responses, with the goal of predicting future performance. However, most existing methods primarily focus on feature enhancement, while overlooking the deficiencies in cognitive representation and the ability to express cognition-issues often caused by interference from non-cognitive factors such as slipping and guessing. This limitation hampers the ability to capture the continuity and coherence of the student’s cognitive process. As a result, many methods may introduce more prediction bias and modeling costs due to their inability to maintain cognitive continuity and coherence. Based on the above discussion, we propose the Cognitive Representation Dynamic Programming based Knowledge Tracing (CRDP-KT) model. This model em ploys a dynamic programming algorithm to optimize cognitive representations based on the difficulty of the questions and the performance intervals between them. This approach ensures that the cognitive representation aligns with the student’s cognitive patterns, maintaining overall continuity and coherence. As a result, it provides more accurate and systematic input features for subsequent model training, thereby minimizing distortion in the simulation of cognitive states. Additionally, the CRDP-KT model performs partitioned optimization of cognitive representations to enhance the reliability of the optimization process. Furthermore, it improves its ability to express the student’s cognition through a weighted fusion of optimized record representations and re lationships learned from a bipartite graph. Finally, experiments conducted on three public datasets validate the effectiveness of the proposed CRDP-KT model.

[740] Look Before You Leap: A GUI-Critic-R1 Model for Pre-Operative Error Diagnosis in GUI Automation

Yuyang Wanyan, Xi Zhang, Haiyang Xu, Haowei Liu, Junyang Wang, Jiabo Ye, Yutong Kou, Ming Yan, Fei Huang, Xiaoshan Yang, Weiming Dong, Changsheng Xu

Main category: cs.AI

TL;DR: The paper introduces a pre-operative critic mechanism for GUI automation using MLLMs, proposing S-GRPO strategy to create GUI-Critic-R1 model with suggestion rewards, and develops data collection pipeline for GUI critic evaluation.

DetailsMotivation: GUI automation requires online interactive decision-making with low error tolerance, as mistakes can cumulatively disrupt processes and cause irreversible outcomes like deletions or payments.

Method: Proposed Suggestion-aware Gradient Relative Policy Optimization (S-GRPO) strategy to build pre-operative critic model GUI-Critic-R1, incorporating suggestion rewards, and developed reasoning-bootstrapping data collection pipeline for GUI-Critic-Train and GUI-Critic-Test datasets.

Result: Static experiments show GUI-Critic-R1 has significant advantages in critic accuracy over current MLLMs across mobile and web domains. Dynamic evaluation on GUI automation benchmark demonstrates improved success rates and operational efficiency.

Conclusion: The pre-operative critic mechanism with S-GRPO strategy effectively enhances reliability of GUI automation by providing feedback before execution, addressing the critical need for error reduction in online interactive environments.

Abstract: In recent years, Multimodal Large Language Models (MLLMs) have been extensively utilized for multimodal reasoning tasks, including Graphical User Interface (GUI) automation. Unlike general offline multimodal tasks, GUI automation is executed in online interactive environments, necessitating step-by-step decision-making based on real-time status of the environment. This task has a lower tolerance for decision-making errors at each step, as any mistakes may cumulatively disrupt the process and potentially lead to irreversible outcomes like deletions or payments. To address these issues, we introduce a pre-operative critic mechanism that provides effective feedback prior to the actual execution, by reasoning about the potential outcome and correctness of actions. Specifically, we propose a Suggestion-aware Gradient Relative Policy Optimization (S-GRPO) strategy to construct our pre-operative critic model GUI-Critic-R1, incorporating a novel suggestion reward to enhance the reliability of the model’s feedback. Furthermore, we develop a reasoning-bootstrapping based data collection pipeline to create a GUI-Critic-Train and a GUI-Critic-Test, filling existing gaps in GUI critic data. Static experiments on the GUI-Critic-Test across both mobile and web domains reveal that our GUI-Critic-R1 offers significant advantages in critic accuracy compared to current MLLMs. Dynamic evaluation on GUI automation benchmark further highlights the effectiveness and superiority of our model, as evidenced by improved success rates and operational efficiency.

[741] A Parallel CPU-GPU Framework for Batching Heuristic Operations in Depth-First Heuristic Search

Ehsan Futuhi, Nathan R. Sturtevant

Main category: cs.AI

TL;DR: GPU parallelization of heuristics for depth-first search algorithms like IDA* and BTS by running tree search on CPU and heuristic evaluation on GPU.

DetailsMotivation: Previous research focused on batching heuristic evaluations for best-first search algorithms, but depth-first algorithms like IDA* and BTS remained unaddressed due to the complexity of blocking in tree search.

Method: Developed a parallelized cost-bounded depth-first search (CB-DFS) framework that parallelizes tree search on CPU while batching heuristic evaluations on GPU.

Result: Significantly improved performance of IDA* and BTS on 3x3 Rubik’s Cube and 4x4 sliding tile puzzle with both classifier-based and regression-based heuristics.

Conclusion: GPU parallelization of heuristics can be effectively integrated with depth-first search algorithms through a hybrid CPU-GPU approach, overcoming the blocking problem in tree search.

Abstract: The rapid advancement of GPU technology has unlocked powerful parallel processing capabilities, creating new opportunities to enhance classic search algorithms. This hardware has been exploited in best-first search algorithms with neural network-based heuristics by creating batched versions of A* and Weighted A* that delay heuristic evaluation until sufficiently many states can be evaluated in parallel on the GPU. But, research has not addressed how depth-first algorithms like IDA* or Budgeted Tree Search (BTS) can have their heuristic computations batched. This is more complicated in a tree search, because progress in the search tree is blocked until heuristic evaluations are complete. In this paper we show that GPU parallelization of heuristics can be effectively performed when the tree search is parallelized on the CPU while heuristic evaluations are parallelized on the GPU. We develop a parallelized cost-bounded depth-first search (CB-DFS) framework that can be applied to both IDA* and BTS, significantly improving their performance. We demonstrate the strength of the approach on the 3x3 Rubik’s Cube and the 4x4 sliding tile puzzle (STP) with both classifier-based and regression-based heuristics.

[742] Automated Algorithmic Discovery for Scientific Computing through LLM-Guided Evolutionary Search: A Case Study in Gravitational-Wave Detection

He Wang, Liang Zeng

Main category: cs.AI

TL;DR: Evo-MCTS framework combines LLMs with evolutionary search for interpretable algorithm discovery, achieving 20.2% improvement over domain-specific methods in gravitational wave detection.

DetailsMotivation: Address challenges in automated algorithm discovery: vast design spaces with expensive evaluations, domain-specific physical constraints requiring expert knowledge, and need for interpretable solutions scientists can validate.

Method: Integrates large language models with tree-structured evolutionary search, combining reflective code synthesis using LLM domain knowledge, multi-scale evolutionary operations on structured code representations, and interpretable algorithmic pathways from tree-guided exploration.

Result: Achieves 20.2% improvement over domain-specific methods and 59.1% over LLM-based optimization frameworks in gravitational wave detection, with consistent convergence toward interpretable algorithmic structures integrating multiple functional components.

Conclusion: Evo-MCTS establishes a generalizable methodology for automated algorithm discovery in scientific computing where algorithmic transparency and physical validity are as essential as performance optimization.

Abstract: Automated algorithm discovery in scientific computing faces fundamental challenges: vast design spaces with expensive evaluations, domain-specific physical constraints requiring expert knowledge, and the necessity for interpretable solutions that scientists can validate and understand. We present the Evo-MCTS (Evolutionary Monte Carlo Tree Search) framework, integrating large language models (LLMs) with tree-structured evolutionary search for interpretable algorithm discovery. Evo-MCTS combines reflective code synthesis leveraging LLM domain knowledge, multi-scale evolutionary operations on structured code representations, and interpretable algorithmic pathways emerging from tree-guided exploration. When applied to gravitational wave detection-a challenging domain with continuous parameter spaces and strict physical constraints-Evo-MCTS achieves 20.2% improvement over domain-specific methods and 59.1% over LLM-based optimization frameworks. This improvement arises from its ability to consistently converge toward interpretable algorithmic structures that integrate multiple functional components. Our domain-agnostic architecture establishes a generalizable methodology for automated algorithm discovery in scientific computing, where algorithmic transparency and physical validity are as essential as performance optimization.

[743] Argumentative Debates for Transparent Bias Detection [Technical Report]

Hamed Ayoobi, Nico Potyka, Anna Rapberger, Francesca Toni

Main category: cs.AI

TL;DR: ABIDE is a transparent bias detection framework that structures bias detection as debate using argument graphs, addressing the need for interpretability in algorithmic fairness.

DetailsMotivation: Existing bias detection methods often lack transparency, while interpretability and explainability are crucial for algorithmic fairness due to its human-oriented nature.

Method: ABIDE structures bias detection as debate guided by an underlying argument graph from formal and computational argumentation, focusing on arguments about success chances of groups in local neighborhoods and their significance.

Result: Experimental evaluation shows ABIDE outperforms an argumentative baseline in performance.

Conclusion: ABIDE provides a transparent and effective approach to bias detection through argumentative debate, addressing the interpretability gap in existing methods.

Abstract: As the use of AI in society grows, addressing emerging biases is essential to prevent systematic discrimination. Several bias detection methods have been proposed, but, with few exceptions, these tend to ignore transparency. Instead, interpretability and explainability are core requirements for algorithmic fairness, even more so than for other algorithmic solutions, given the human-oriented nature of fairness. We present ABIDE (Argumentative BIas detection by DEbate), a novel framework that structures bias detection transparently as debate, guided by an underlying argument graph as understood in (formal and computational) argumentation. The arguments are about the success chances of groups in local neighbourhoods and the significance of these neighbourhoods. We evaluate ABIDE experimentally and demonstrate its strengths in performance against an argumentative baseline.

[744] LLM Collaboration With Multi-Agent Reinforcement Learning

Shuo Liu, Tianle Chen, Zeyu Liang, Xueguang Lyu, Christopher Amato

Main category: cs.AI

TL;DR: The paper proposes MAGRPO, a multi-agent reinforcement learning method for fine-tuning LLMs to improve collaboration, addressing the gap in current LLM training that focuses on individual rather than coordinated performance.

DetailsMotivation: Current LLMs are pretrained independently without optimization for coordination, and existing fine-tuning frameworks rely on individual rewards that require complex designs to encourage collaboration.

Method: Model LLM collaboration as a cooperative MARL problem and develop Multi-Agent Group Relative Policy Optimization (MAGRPO), a multi-agent, multi-turn algorithm building on RL approaches for LLMs and MARL techniques.

Result: Experiments on LLM writing and coding collaboration show that fine-tuning MAS with MAGRPO enables agents to generate high-quality responses efficiently through effective cooperation.

Conclusion: The approach opens the door to using other MARL methods for LLMs and highlights associated challenges in multi-agent coordination.

Abstract: A large amount of work has been done in Multi-Agent Systems (MAS) for modeling and solving problems with multiple interacting agents. However, most LLMs are pretrained independently and not specifically optimized for coordination. Existing LLM fine-tuning frameworks rely on individual rewards, which require complex reward designs for each agent to encourage collaboration. To address these challenges, we model LLM collaboration as a cooperative Multi-Agent Reinforcement Learning (MARL) problem. We develop a multi-agent, multi-turn algorithm, Multi-Agent Group Relative Policy Optimization (MAGRPO), to solve it, building on current RL approaches for LLMs as well as MARL techniques. Our experiments on LLM writing and coding collaboration demonstrate that fine-tuning MAS with MAGRPO enables agents to generate high-quality responses efficiently through effective cooperation. Our approach opens the door to using other MARL methods for LLMs and highlights the associated challenges.

[745] Efficient and Reliable Hitting-Set Computations for the Implicit Hitting Set Approach

Hannes Ihalainen, Dieter Vandesande, André Schidler, Jeremias Berg, Bart Bogaerts, Matti Järvisalo

Main category: cs.AI

TL;DR: The paper explores alternative algorithmic techniques for hitting set optimization in implicit hitting set (IHS) framework, comparing pseudo-Boolean reasoning and stochastic local search against traditional integer programming approaches.

DetailsMotivation: To address computational challenges and numerical instability issues in traditional integer programming approaches for hitting set optimization within the implicit hitting set framework.

Method: Evaluates alternative techniques including pseudo-Boolean reasoning and stochastic local search for hitting set computations, comparing them with commercial IP solvers in the context of pseudo-Boolean optimization.

Result: Found that while commercial IP solvers remain most effective, they can cause correctness issues due to numerical instability; exact HS computations via PB reasoning can be competitive with numerically exact IP solvers and provide correctness certificates.

Conclusion: PB reasoning offers a viable alternative to IP solvers for hitting set computations, providing better reliability through numerical stability and enabling correctness certificates for IHS computations.

Abstract: The implicit hitting set (IHS) approach offers a general framework for solving computationally hard combinatorial optimization problems declaratively. IHS iterates between a decision oracle used for extracting sources of inconsistency and an optimizer for computing so-called hitting sets (HSs) over the accumulated sources of inconsistency. While the decision oracle is language-specific, the optimizers is usually instantiated through integer programming. We explore alternative algorithmic techniques for hitting set optimization based on different ways of employing pseudo-Boolean (PB) reasoning as well as stochastic local search. We extensively evaluate the practical feasibility of the alternatives in particular in the context of pseudo-Boolean (0-1 IP) optimization as one of the most recent instantiations of IHS. Highlighting a trade-off between efficiency and reliability, while a commercial IP solver turns out to remain the most effective way to instantiate HS computations, it can cause correctness issues due to numerical instability; in fact, we show that exact HS computations instantiated via PB reasoning can be made competitive with a numerically exact IP solver. Furthermore, the use of PB reasoning as a basis for HS computations allows for obtaining certificates for the correctness of IHS computations, generally applicable to any IHS instantiation in which reasoning in the declarative language at hand can be captured in the PB-based proof format we employ.

[746] UDA: Unsupervised Debiasing Alignment for Pair-wise LLM-as-a-Judge

Yang Zhang, Cunxiang Wang, Lindong Wu, Wenbo Yu, Yidong Wang, Guangsheng Bao, Jie Tang

Main category: cs.AI

TL;DR: UDA is an unsupervised framework that reduces preference bias in LLM pairwise evaluations by dynamically adjusting Elo ratings to align judges towards consensus, improving evaluation reliability.

DetailsMotivation: Pairwise LLM evaluation suffers from preference bias where judges systematically favor certain outputs, leading to inconsistent and skewed rankings across different judges.

Method: UDA uses a compact neural network to adaptively set the K-factor and refine win probabilities in Elo rating system, operating unsupervised by minimizing dispersion among Elo trajectories to force alignment towards collective consensus.

Result: UDA reduces inter-judge rating standard deviation by up to 63.4% and improves average correlation with human judgments by 24.7%, elevating poorly performing judges to achieve parity with high-quality ones.

Conclusion: UDA provides a more robust and reliable evaluation ecosystem by reducing preference bias and improving consistency in LLM pairwise evaluations through unsupervised consensus alignment.

Abstract: Pairwise evaluation of Large Language Models (LLMs) is a common paradigm, but it is prone to preference bias, where judges systematically favor certain outputs, such as their own. This bias leads to inconsistent and skewed rankings across different judges. To address this, we first empirically demonstrate significant and heterogeneous biases in cross-model evaluations. We then propose UDA (Unsupervised Debiasing Alignment), a framework that reduces inter-judge disagreement by dynamically adjusting the Elo rating system. For each pairwise comparison, a compact neural network learns to adaptively set the K-factor and refine win probabilities. Crucially, UDA operates in a fully unsupervised manner, guided solely by the objective of minimizing the dispersion among the Elo trajectories of all judges. This forces an alignment towards a collective consensus, which serves as an unsupervised proxy for a more stable and reproducible evaluation. In addition, we provide theoretical motivation demonstrating how alignment towards a consensus can reduce aggregate system bias. Experiments show that UDA significantly reduces the inter-judge rating standard deviation by up to 63.4% and improves the average correlation with human judgments by 24.7%. Notably, UDA elevates the performance of poorly performing judges to achieve parity with high-quality ones, fostering a more robust and reliable evaluation ecosystem. Code and data are available at https://anonymous.4open.science/r/62AB93CD-23B4.

[747] PASS: Probabilistic Agentic Supernet Sampling for Interpretable and Adaptive Chest X-Ray Reasoning

Yushi Feng, Junye Du, Yingying Hong, Qifan Wang, Lequan Yu

Main category: cs.AI

TL;DR: PASS is a multimodal framework for Chest X-Ray reasoning that addresses black-box reasoning, poor multimodal integration, and inefficient agentic pipelines by adaptively sampling probabilistic workflows over a multi-tool graph.

DetailsMotivation: Existing tool-augmented agentic systems have limitations including black-box reasoning that undermines trust and safety, poor multimodal integration critical for healthcare, and rigid inefficient pipelines.

Method: PASS adaptively samples agentic workflows over a multi-tool graph with probability annotations, uses task-conditioned distribution to select tools at each layer, compresses findings into evolving memory, and employs three-stage training with expert warm-up, contrastive path-ranking, and cost-aware RL.

Result: PASS significantly outperforms strong baselines across multiple metrics (accuracy, AUC, LLM-J) while balancing computational costs on the CAB-E benchmark for CXR reasoning.

Conclusion: PASS pushes a paradigm shift towards interpretable, adaptive, and multimodal medical agentic systems by providing probabilistic, auditable decision paths that enhance medical AI safety and efficiency.

Abstract: Existing tool-augmented agentic systems are limited in the real world by (i) black-box reasoning steps that undermine trust of decision-making and pose safety risks, (ii) poor multimodal integration, which is inherently critical for healthcare tasks, and (iii) rigid and computationally inefficient agentic pipelines. We introduce PASS (Probabilistic Agentic Supernet Sampling), the first multimodal framework to address these challenges in the context of Chest X-Ray (CXR) reasoning. PASS adaptively samples agentic workflows over a multi-tool graph, yielding decision paths annotated with interpretable probabilities. Given the complex CXR reasoning task with multimodal medical data, PASS leverages its learned task-conditioned distribution over the agentic supernet. Thus, it adaptively selects the most suitable tool at each supernet layer, offering probability-annotated trajectories for post-hoc audits and directly enhancing medical AI safety. PASS also continuously compresses salient findings into an evolving personalized memory, while dynamically deciding whether to deepen its reasoning path or invoke an early exit for efficiency. To optimize a Pareto frontier balancing performance and cost, we design a novel three-stage training procedure, including expert knowledge warm-up, contrastive path-ranking, and cost-aware reinforcement learning. To facilitate rigorous evaluation, we introduce CAB-E, a comprehensive benchmark for multi-step, safety-critical, free-form CXR reasoning. Experiments across various benchmarks validate that PASS significantly outperforms strong baselines in multiple metrics (e.g., accuracy, AUC, LLM-J.) while balancing computational costs, pushing a new paradigm shift towards interpretable, adaptive, and multimodal medical agentic systems.

[748] MSRS: Adaptive Multi-Subspace Representation Steering for Attribute Alignment in Large Language Models

Xinyan Jiang, Lin Zhang, Jiayi Zhang, Qingsong Yang, Guimin Hu, Di Wang, Lijie Hu

Main category: cs.AI

TL;DR: MSRS enables multi-attribute steering in LLMs by allocating orthogonal subspaces to reduce interference and using dynamic token-level intervention for precise control.

DetailsMotivation: Existing activation steering methods struggle with multi-attribute control due to interference and trade-offs between attributes.

Method: Multi-Subspace Representation Steering (MSRS) uses orthogonal subspaces for each attribute, hybrid subspace composition (attribute-specific + shared), and token-level steering during inference.

Result: MSRS significantly reduces attribute conflicts, outperforms existing methods across multiple attributes, and generalizes well to diverse downstream tasks.

Conclusion: MSRS provides an effective framework for multi-attribute steering in LLMs by addressing interference through subspace isolation and enabling fine-grained control.

Abstract: Activation steering offers a promising approach to controlling the behavior of Large Language Models by directly manipulating their internal activations. However, most existing methods struggle to jointly steer multiple attributes, often resulting in interference and undesirable trade-offs. To address this challenge, we propose Multi-Subspace Representation Steering (MSRS), a novel framework for effective multi-attribute steering via subspace representation fine-tuning. MSRS reduces inter-attribute interference by allocating orthogonal subspaces to each attribute, isolating their influence within the model’s representation space. MSRS also incorporates a hybrid subspace composition strategy: it combines attribute-specific subspaces for unique steering directions with a shared subspace for common steering directions. A dynamic weighting function learns to efficiently integrate these components for precise control. During inference, MSRS introduces a token-level steering mechanism that dynamically identifies and intervenes on the most semantically relevant tokens, enabling fine-grained behavioral modulation. Experimental results show that MSRS significantly reduces attribute conflicts, surpasses existing methods across a range of attributes, and generalizes effectively to diverse downstream tasks.

[749] One VLM, Two Roles: Stage-Wise Routing and Specialty-Level Deployment for Clinical Workflows

Shayan Vassef, Soorya Ram Shimegekar, Abhay Goyal, Koustuv Saha, Pi Zonooz, Navin Kumar

Main category: cs.AI

TL;DR: A framework using a single vision-language model in two roles: as an aware model-card matcher for routing images to specialist models, and as a fine-tuned model covering multiple tasks per specialty to simplify deployment.

DetailsMotivation: Clinical ML workflows are fragmented and inefficient with task-specific networks, lacking data-driven model identification and standardized output delivery, reducing efficiency and increasing costs.

Method: Two solutions: 1) VLM as model-card matcher with three-stage routing workflow and calibrated top-2 answer selection; 2) Fine-tuning same VLM on specialty-specific datasets to cover multiple downstream tasks per specialty.

Result: Solution 1 improved routing accuracy by +9-11 percentage points with better calibration; Solution 2 matched or approached specialized baselines across five medical specialties while simplifying deployment.

Conclusion: The framework reduces data-science effort through accurate selection, simplifies monitoring by consolidating models, and increases transparency via stage-wise justifications and calibrated thresholds.

Abstract: Clinical ML workflows are often fragmented and inefficient: triage, task selection, and model deployment are handled by a patchwork of task-specific networks. These pipelines are rarely aligned with data-science practice, reducing efficiency and increasing operational cost. They also lack data-driven model identification (from imaging/tabular inputs) and standardized delivery of model outputs. We present a framework that employs a single vision-language model (VLM) in two complementary, modular roles. First (Solution 1): the VLM acts as an aware model-card matcher that routes an incoming image to the appropriate specialist model via a three-stage workflow (modality -> primary abnormality -> model-card ID). Reliability is improved by (i) stage-wise prompts enabling early termination via “None”/“Other” and (ii) a calibrated top-2 answer selector with a stage-wise cutoff. This raises routing accuracy by +9 and +11 percentage points on the training and held-out splits, respectively, compared with a baseline router, and improves held-out calibration (lower Expected Calibration Error, ECE). Second (Solution 2): we fine-tune the same VLM on specialty-specific datasets so that one model per specialty covers multiple downstream tasks, simplifying deployment while maintaining performance. Across gastroenterology, hematology, ophthalmology, pathology, and radiology, this single-model deployment matches or approaches specialized baselines. Together, these solutions reduce data-science effort through more accurate selection, simplify monitoring and maintenance by consolidating task-specific models, and increase transparency via per-stage justifications and calibrated thresholds. Each solution stands alone, and in combination they offer a practical, modular path from triage to deployment.

[750] Timely Clinical Diagnosis through Active Test Selection

Silas Ruhrberg Estévez, Nicolás Astorga, Mihaela van der Schaar

Main category: cs.AI

TL;DR: ACTMED is a diagnostic framework that combines Bayesian Experimental Design with LLMs to optimize clinical test selection, reducing diagnostic uncertainty while maintaining clinician oversight.

DetailsMotivation: Current ML approaches for clinical diagnosis fail to capture sequential, resource-aware reasoning used by clinicians, especially in high-pressure or resource-limited settings where timely and cost-effective decisions are crucial.

Method: Integrates Bayesian Experimental Design with large language models to select tests that maximize reduction in diagnostic uncertainty at each step. LLMs serve as flexible simulators for patient state distributions and belief updates without requiring structured training data.

Result: ACTMED optimizes test selection to improve diagnostic accuracy, interpretability, and resource use on real-world datasets.

Conclusion: The framework represents progress toward transparent, adaptive, clinician-aligned diagnostic systems that generalize across settings with reduced need for domain-specific data.

Abstract: There is growing interest in using machine learning (ML) to support clinical diagnosis, but most approaches rely on static, fully observed datasets and fail to reflect the sequential, resource-aware reasoning clinicians use in practice. Diagnosis remains complex and error prone, especially in high-pressure or resource-limited settings, underscoring the need for frameworks that help clinicians make timely and cost-effective decisions. We propose ACTMED (Adaptive Clinical Test selection via Model-based Experimental Design), a diagnostic framework that integrates Bayesian Experimental Design (BED) with large language models (LLMs) to better emulate real-world diagnostic reasoning. At each step, ACTMED selects the test expected to yield the greatest reduction in diagnostic uncertainty for a given patient. LLMs act as flexible simulators, generating plausible patient state distributions and supporting belief updates without requiring structured, task-specific training data. Clinicians can remain in the loop; reviewing test suggestions, interpreting intermediate outputs, and applying clinical judgment throughout. We evaluate ACTMED on real-world datasets and show it can optimize test selection to improve diagnostic accuracy, interpretability, and resource use. This represents a step toward transparent, adaptive, and clinician-aligned diagnostic systems that generalize across settings with reduced reliance on domain-specific data.

[751] FunReason-MT Technical Report: Advanced Data Synthesis Solution for Real-world Multi-Turn Tool-use

Zengzhuang Xu, Bingguang Hao, Zechuan Wang, Yuntao Wen, Xinyi Xu, Yang Liu, Long Chen, Dong Wang, Maolin Wang, Tong Zhao, Yicheng Chen, Cunyin Peng, Jinjie Gu, Leilei Gan, Xiangyu Zhao, Chenyi Zhuang, Shi Gu

Main category: cs.AI

TL;DR: FunReason-MT is a novel data synthesis framework that addresses challenges in generating high-quality multi-turn function calling training data through environment-API graph interactions, advanced tool-query synthesis, and guided iterative chains.

DetailsMotivation: Existing data synthesis methods are insufficient for generating high-quality multi-turn function calling data in real-world environments, facing challenges in targeted data synthesis, hard query construction, and multi-turn logical dependency.

Method: The framework employs three key techniques: 1) Environment-API Graph Interactions for gathering varied trajectories, 2) Advanced Tool-Query Synthesis for simplifying hard query construction, and 3) Guided Iterative Chain for sophisticated Chain-of-Thought generation.

Result: A 4B model trained on FunReason-MT generated data achieves state-of-the-art performance on Berkeley Function-Calling Leaderboard (BFCLv3) among comparable-sized models, with further improvements confirmed on BFCLv4.

Conclusion: FunReason-MT provides a reliable and robust source for agentic learning, effectively addressing the structural deficiencies in multi-turn function calling data synthesis for real-world tool use.

Abstract: Function calling (FC) empowers large language models (LLMs) and autonomous agents to interface with external tools, a critical capability for solving complex, real-world problems. As this ability becomes increasingly central to advanced AI systems, the need for high-quality, multi-turn training data to develop and refine it cannot be overstated. Existing data synthesis methods, such as random environment sampling or multi-agent role-playing, are not powerful enough to generate high-quality data in real-world environments. Practical challenges come in three folds: targeted data synthesis, hard query construction, and multi-turn logical dependency. To address these structural deficiencies, we present FunReason-MT, a novel data synthesis framework for real-world multi-turn tool use. FunReason-MT resolves the complexity barrier in multi-turn FC data by employing 1) Environment-API Graph Interactions to gather varied high-quality trajectories with targeted tool, 2) Advanced Tool-Query Synthesis to simplify hard query construction, and 3) Guided Iterative Chain for sophisticated CoT generation. Evaluations on Berkeley Function-Calling Leaderboard (BFCLv3) demonstrate the power of our framework: a 4B model built upon FunReason-MT generated data achieves state-of-the-art performance among comparable-sized models. Further performance improvements on BFCLv4 confirm that FunReason-MT provides a reliable and robust source for agentic learning.

Yueqing Xi, Yifan Bai, Huasen Luo, Weiliang Wen, Hui Liu, Haoliang Li

Main category: cs.AI

TL;DR: Hybrid legal QA agent combining RAG with multi-model ensembling reduces hallucination and improves reliability in judicial settings through retrieval prioritization, human-in-the-loop updates, and dynamic knowledge evolution.

DetailsMotivation: Address LLM hallucination risks in legal consultation and overcome limitations of static knowledge bases that can't keep pace with frequently updated statutes and case law.

Method: Retrieval-prioritized hybrid approach: uses RAG when trusted repository has evidence, otherwise employs multi-model ensembling with specialized selector. Includes human review and knowledge repository updates.

Result: Significantly outperforms single-model baseline and vanilla RAG on Law_QA dataset across F1, ROUGE-L, and LLM-as-a-Judge metrics. Reduces hallucination while improving answer quality and legal compliance.

Conclusion: Demonstrates effective practical deployment of media forensics in judicial scenarios through reliable, auditable, and continuously updatable legal QA system.

Abstract: As artificial intelligence permeates judicial forensics, ensuring the veracity and traceability of legal question answering (QA) has become critical. Conventional large language models (LLMs) are prone to hallucination, risking misleading guidance in legal consultation, while static knowledge bases struggle to keep pace with frequently updated statutes and case law. We present a hybrid legal QA agent tailored for judicial settings that integrates retrieval-augmented generation (RAG) with multi-model ensembling to deliver reliable, auditable, and continuously updatable counsel. The system prioritizes retrieval over generation: when a trusted legal repository yields relevant evidence, answers are produced via RAG; otherwise, multiple LLMs generate candidates that are scored by a specialized selector, with the top-ranked answer returned. High-quality outputs then undergo human review before being written back to the repository, enabling dynamic knowledge evolution and provenance tracking. Experiments on the Law_QA dataset show that our hybrid approach significantly outperforms both a single-model baseline and a vanilla RAG pipeline on F1, ROUGE-L, and an LLM-as-a-Judge metric. Ablations confirm the complementary contributions of retrieval prioritization, model ensembling, and the human-in-the-loop update mechanism. The proposed system demonstrably reduces hallucination while improving answer quality and legal compliance, advancing the practical landing of media forensics technologies in judicial scenarios.

[753] SnapStream: Efficient Long Sequence Decoding on Dataflow Accelerators

Jonathan Li, Nasim Farahini, Evgenii Iuliugin, Magnus Vesterlund, Christian Häggström, Guangtao Wang, Shubhangi Upasani, Ayush Sachdeva, Rui Li, Faline Fu, Chen Wu, Ayesha Siddiqua, John Long, Tuowen Zhao, Matheen Musaddiq, Håkan Zeffer, Yun Du, Mingran Wang, Qinghua Li, Bo Li, Urmish Thakker, Raghu Prabhakar

Main category: cs.AI

TL;DR: SnapStream is a KV cache compression method that enables 4x improved on-chip memory usage with minimal accuracy degradation, deployed in production inference systems with static graphs and continuous batching.

DetailsMotivation: The proliferation of large LLMs with long context lengths creates high demands for on-chip memory for KV caches, but existing techniques like StreamingLLM and SnapKV are not widely adopted in industrial deployments due to framework constraints and unclear accuracy implications.

Method: Developed SnapStream, a KV cache compression method that can be deployed at scale in frameworks with static graphs and continuous batching. Explored accuracy implications on modern models like Llama-3.1-8B-Instruct and DeepSeek-R1.

Result: Demonstrated 4x improved on-chip memory usage with minimal accuracy degradation on benchmarks (LongBench-v2, AIME24, LiveCodeBench). Successfully deployed in 16-way tensor-parallel DeepSeek-671B on SambaNova SN40L accelerators at 128k context length with up to 1832 tokens/second throughput.

Conclusion: SnapStream is the first implementation of sparse KV attention techniques successfully deployed in production inference systems with static graphs and continuous batching, addressing the memory demands of large LLMs while maintaining accuracy.

Abstract: The proliferation of 100B+ parameter Large Language Models (LLMs) with 100k+ context length support have resulted in increasing demands for on-chip memory to support large KV caches. Techniques such as StreamingLLM and SnapKV demonstrate how to control KV cache size while maintaining model accuracy. Yet, these techniques are not commonly used within industrial deployments using frameworks like vLLM or SGLang. The reason is twofold: on one hand, the static graphs and continuous batching methodology employed by these frameworks make it difficult to admit modifications to the standard multi-head attention algorithm, while on the other hand, the accuracy implications of such techniques on modern instruction-following and reasoning models are not well understood, obfuscating the need for implementing these techniques. In this paper, we explore these accuracy implications on Llama-3.1-8B-Instruct and DeepSeek-R1, and develop SnapStream, a KV cache compression method that can be deployed at scale. We demonstrate the efficacy of SnapStream in a 16-way tensor-parallel deployment of DeepSeek-671B on SambaNova SN40L accelerators running at 128k context length and up to 1832 tokens per second in a real production setting. SnapStream enables $4\times$ improved on-chip memory usage and introduces minimal accuracy degradation on LongBench-v2, AIME24 and LiveCodeBench. To the best of our knowledge, this is the first implementation of sparse KV attention techniques deployed in a production inference system with static graphs and continuous batching.

[754] DeepKnown-Guard: A Proprietary Model-Based Safety Response Framework for AI Agents

Qi Li, Jianjun Xu, Pingtao Wei, Jiu Li, Peiqiang Zhao, Jiwei Shi, Xuan Zhang, Yanhui Yang, Xiaodong Hui, Peng Xu, Wenqin Shao

Main category: cs.AI

TL;DR: A novel safety response framework for LLMs that uses input-level risk classification and output-level RAG with interpretation models to ensure secure responses and eliminate information fabrication.

DetailsMotivation: Address security issues in LLMs that constrain their trustworthy deployment in critical domains by providing systematic protection at both input and output levels.

Method: Input: supervised fine-tuning-based safety classification with 4-tier taxonomy (Safe, Unsafe, Conditionally Safe, Focused Attention). Output: RAG integration with fine-tuned interpretation model for real-time trustworthy knowledge grounding.

Result: Achieved 99.3% risk recall rate, significantly higher safety scores on public benchmarks than TinyR1-Safety-8B baseline, and perfect 100% safety score on proprietary high-risk test set.

Conclusion: Provides an effective engineering pathway for building high-security, high-trust LLM applications with exceptional protective capabilities in complex risk scenarios.

Abstract: With the widespread application of Large Language Models (LLMs), their associated security issues have become increasingly prominent, severely constraining their trustworthy deployment in critical domains. This paper proposes a novel safety response framework designed to systematically safeguard LLMs at both the input and output levels. At the input level, the framework employs a supervised fine-tuning-based safety classification model. Through a fine-grained four-tier taxonomy (Safe, Unsafe, Conditionally Safe, Focused Attention), it performs precise risk identification and differentiated handling of user queries, significantly enhancing risk coverage and business scenario adaptability, and achieving a risk recall rate of 99.3%. At the output level, the framework integrates Retrieval-Augmented Generation (RAG) with a specifically fine-tuned interpretation model, ensuring all responses are grounded in a real-time, trustworthy knowledge base. This approach eliminates information fabrication and enables result traceability. Experimental results demonstrate that our proposed safety control model achieves a significantly higher safety score on public safety evaluation benchmarks compared to the baseline model, TinyR1-Safety-8B. Furthermore, on our proprietary high-risk test set, the framework’s components attained a perfect 100% safety score, validating their exceptional protective capabilities in complex risk scenarios. This research provides an effective engineering pathway for building high-security, high-trust LLM applications.

[755] EHRStruct: A Comprehensive Benchmark Framework for Evaluating Large Language Models on Structured Electronic Health Record Tasks

Xiao Yang, Xuejiao Zhao, Zhiqi Shen

Main category: cs.AI

TL;DR: EHRStruct is a benchmark for evaluating LLMs on structured EHR data with 11 tasks and 2,200 samples, revealing performance gaps and proposing EHRMaster as a code-augmented solution.

DetailsMotivation: The absence of standardized evaluation frameworks makes it difficult to systematically assess and compare LLM performance on structured EHR data.

Method: Introduces EHRStruct benchmark with 11 representative clinical tasks and 2,200 evaluation samples from two EHR datasets, evaluating 20 LLMs and analyzing factors like input formats and finetuning strategies.

Result: Many structured EHR tasks place high demands on LLM understanding and reasoning capabilities, with EHRMaster achieving state-of-the-art performance.

Conclusion: EHRStruct addresses evaluation challenges for LLMs on structured EHR data, and EHRMaster provides an effective code-augmented approach for improved performance.

Abstract: Structured Electronic Health Record (EHR) data stores patient information in relational tables and plays a central role in clinical decision-making. Recent advances have explored the use of large language models (LLMs) to process such data, showing promise across various clinical tasks.However, the absence of standardized evaluation frameworks and clearly defined tasks makes it difficult to systematically assess and compare LLM performance on structured EHR data.To address these evaluation challenges, we introduce EHRStruct, a benchmark specifically designed to evaluate LLMs on structured EHR tasks.EHRStruct defines 11 representative tasks spanning diverse clinical needs and includes 2,200 task-specific evaluation samples derived from two widely used EHR datasets.We use EHRStruct to evaluate 20 advanced and representative LLMs, covering both general and medical models.We further analyze key factors influencing model performance, including input formats, few-shot generalisation, and finetuning strategies, and compare results with 11 state-of-the-art LLM-based enhancement methods for structured data reasoning. Our results indicate that many structured EHR tasks place high demands on the understanding and reasoning capabilities of LLMs.In response, we propose EHRMaster, a code-augmented method that achieves state-of-the-art performance and offers practical

[756] On Geometric Structures for Policy Parameterization in Continuous Control

Zhihao Lin

Main category: cs.AI

TL;DR: Proposes a novel action generation method for continuous control that decomposes actions into directional vectors and concentration scalars, enabling efficient interpolation on the unit manifold while reducing parameters and maintaining simple sampling complexity.

DetailsMotivation: Standard stochastic policies use boundary-enforcing transformations (e.g., tanh) that distort optimization landscapes and introduce gradient pathologies. Alternative unit manifold parameterizations are computationally complex, limiting practical use.

Method: Decomposes action into deterministic directional vector and learnable concentration scalar, enabling efficient interpolation between target direction and uniform noise on the unit manifold. Reduces policy head parameters from 2d to d+1 and maintains O(d) sampling complexity.

Result: Matches or exceeds state-of-the-art methods on standard continuous control benchmarks, with significant improvements (+37.6% and +112%) on high-dimensional locomotion tasks. Ablation studies confirm unit-norm normalization and adaptive concentration mechanism are essential.

Conclusion: Robust, efficient control can be achieved by explicitly respecting the structure of bounded action spaces rather than relying on complex, unbounded distributions.

Abstract: Standard stochastic policies for continuous control often rely on ad-hoc boundary-enforcing transformations (e.g., tanh) which can distort the underlying optimization landscape and introduce gradient pathologies. While alternative parameterizations on the unit manifold (e.g., directional distributions) are theoretically appealing, their computational complexity (often requiring special functions or rejection sampling) has limited their practical use. We propose a novel, computationally efficient action generation paradigm that preserves the structural benefits of operating on a unit manifold. Our method decomposes the action into a deterministic directional vector and a learnable concentration scalar, enabling efficient interpolation between the target direction and uniform noise on the unit manifold. This design can reduce policy head parameters by nearly 50% (from $2d$ to $d+1$) and maintains a simple $O(d)$ sampling complexity, avoiding costly sampling procedures. Empirically, our method matches or exceeds state-of-the-art methods on standard continuous control benchmarks, with significant improvements (e.g., +37.6% and +112%) on high-dimensional locomotion tasks. Ablation studies confirm that both the unit-norm normalization and the adaptive concentration mechanism are essential to the method’s success. These findings suggest that robust, efficient control can be achieved by explicitly respecting the structure of bounded action spaces, rather than relying on complex, unbounded distributions. Code is available in supplementary materials.

[757] JobSphere: An AI-Powered Multilingual Career Copilot for Government Employment Platforms

Srihari R, Adarsha B, Mohammed Usman Hussain, Shweta Singh

Main category: cs.AI

TL;DR: JobSphere is an AI-powered career assistant for Punjab’s PGRKAM employment platform that uses RAG architecture with multilingual support, voice interaction, and cost-effective deployment on consumer GPUs.

DetailsMotivation: To address engagement and accessibility challenges in government employment websites, including navigational complexity, limited language options, and lack of personalized support for users in Punjab.

Method: Uses Retrieval-Augmented Generation (RAG) architecture with 4-bit quantization for deployment on consumer-grade GPUs, featuring voice-enabled interaction, automated mock tests, resume parsing with skills recognition, and embed-based job recommendations.

Result: Achieved 94% factual accuracy, 1.8s median response time, 68% precision@10 for job recommendations, and 78.5/100 System Usability Scale score (50% improvement over baseline). Implementation is 89% cheaper than cloud-based systems.

Conclusion: JobSphere effectively fills accessibility gaps for Punjab/Hindi-speaking rural users while providing trusted job content from government agencies.

Abstract: Users of government employment websites commonly face engagement and accessibility challenges linked to navigational complexity, a dearth of language options, and a lack of personalized support. This paper introduces JobSphere, an AI-powered career assistant that is redefining the employment platform in Punjab called PGRKAM. JobSphere employs Retrieval-Augmented Generation (RAG) architecture, and it is multilingual, available in English, Hindi and Punjabi. JobSphere technique uses 4-bit quantization, allowing the platform to deploy on consumer-grade GPUs (i.e., NVIDIA RTX 3050 4GB), making the implementation 89% cheaper than that of cloud-based systems. Key innovations include voice-enabled interaction with the assistant, automated mock tests, resume parsing with skills recognition, and embed-based job recommendation that achieves a precision@10 score of 68%. An evaluation of JobSphere’s implementation reveals 94% factual accuracy, a median response time of 1.8 seconds, and a System Usability Scale score of 78.5/100, a 50% improvement compared to the baseline PGRKAM platform context. In conclusion, JobSphere effectively fills significant accessibility gaps for Punjab/Hindi-speaking users in rural locations, while also affirming the users access to trusted job content provided by government agencies.

[758] AI-Powered Data Visualization Platform: An Intelligent Web Application for Automated Dataset Analysis

Srihari R, Pallavi M, Tejaswini S, Vaishnavi R C

Main category: cs.AI

TL;DR: AI-powered platform automates data analysis and visualization from data upload to interactive output using ML algorithms for data cleaning, feature analysis, and intelligent visualization selection.

DetailsMotivation: To eliminate time-consuming manual data analysis and establish automated AI-based analysis in data-driven environments.

Method: Uses Python Flask backend with React frontend, Firebase Cloud Storage, ML algorithms for data cleaning (imputation, outlier detection), feature selection with four algorithms, and intelligent visualization generation based on dataset attributes.

Result: Successfully processed datasets up to 100,000 rows in real-time, scaled to handle multiple simultaneous users, and maintained high-quality visual outputs with reduced manual inputs.

Conclusion: The cloud-based application significantly reduces manual intervention in data analysis while delivering high-quality visualizations and user experiences.

Abstract: An AI-powered data visualization platform that automates the entire data analysis process, from uploading a dataset to generating an interactive visualization. Advanced machine learning algorithms are employed to clean and preprocess the data, analyse its features, and automatically select appropriate visualizations. The system establishes the process of automating AI-based analysis and visualization from the context of data-driven environments, and eliminates the challenge of time-consuming manual data analysis. The combination of a Python Flask backend to access the dataset, paired with a React frontend, provides a robust platform that automatically interacts with Firebase Cloud Storage for numerous data processing and data analysis solutions and real-time sources. Key contributions include automatic and intelligent data cleaning, with imputation for missing values, and detection of outliers, via analysis of the data set. AI solutions to intelligently select features, using four different algorithms, and intelligent title generation and visualization are determined by the attributes of the dataset. These contributions were evaluated using two separate datasets to assess the platform’s performance. In the process evaluation, the initial analysis was performed in real-time on datasets as large as 100000 rows, while the cloud-based demand platform scales to meet requests from multiple users and processes them simultaneously. In conclusion, the cloud-based data visualization application allowed for a significant reduction of manual inputs to the data analysis process while maintaining a high quality, impactful visual outputs, and user experiences

[759] Heterogeneous Graph Neural Networks for Assumption-Based Argumentation

Preesha Gehlot, Anna Rapberger, Fabrizio Russo, Francesca Toni

Main category: cs.AI

TL;DR: First GNN approach for approximating credulous acceptance in Assumption-Based Argumentation (ABA), achieving up to 0.71 F1 score and enabling polynomial-time extension reconstruction.

DetailsMotivation: Exact computation of extensions under stable semantics in ABA is intractable for large frameworks, necessitating scalable approximate reasoning methods.

Method: Model ABA frameworks via dependency graphs with heterogeneous edges, propose ABAGCN and ABAGAT architectures with residual convolution/attention layers, train on ICCMA 2023 benchmark augmented with synthetic data.

Result: Both models outperform adapted GNN baseline, achieving 0.71 F1 score on ICCMA instances. Extension reconstruction achieves F1 above 0.85 on small ABAFs and ~0.58 on large frameworks.

Conclusion: This work opens new avenues for scalable approximate reasoning in structured argumentation using GNNs.

Abstract: Assumption-Based Argumentation (ABA) is a powerful structured argumentation formalism, but exact computation of extensions under stable semantics is intractable for large frameworks. We present the first Graph Neural Network (GNN) approach to approximate credulous acceptance in ABA. To leverage GNNs, we model ABA frameworks via a dependency graph representation encoding assumptions, claims and rules as nodes, with heterogeneous edge labels distinguishing support, derive and attack relations. We propose two GNN architectures - ABAGCN and ABAGAT - that stack residual heterogeneous convolution or attention layers, respectively, to learn node embeddings. Our models are trained on the ICCMA 2023 benchmark, augmented with synthetic ABAFs, with hyperparameters optimised via Bayesian search. Empirically, both ABAGCN and ABAGAT outperform a state-of-the-art GNN baseline that we adapt from the abstract argumentation literature, achieving a node-level F1 score of up to 0.71 on the ICCMA instances. Finally, we develop a sound polynomial time extension-reconstruction algorithm driven by our predictor: it reconstructs stable extensions with F1 above 0.85 on small ABAFs and maintains an F1 of about 0.58 on large frameworks. Our work opens new avenues for scalable approximate reasoning in structured argumentation.

[760] HyperD: Hybrid Periodicity Decoupling Framework for Traffic Forecasting

Minlan Shao, Zijian Zhang, Yili Wang, Yiwei Dai, Xu Shen, Xin Wang

Main category: cs.AI

TL;DR: HyperD is a novel traffic forecasting framework that decouples traffic data into periodic and residual components using hybrid periodic representation and frequency-aware modeling to handle complex spatial-temporal dependencies and multi-scale patterns.

DetailsMotivation: Traffic forecasting is challenging due to complex spatial dependencies and the coexistence of multi-scale periodic patterns with irregular fluctuations from unpredictable events like accidents and weather.

Method: Proposes HyperD framework with: 1) Hybrid Periodic Representation Module using learnable periodic embeddings and spatial-temporal attention for daily/weekly patterns, 2) Frequency-Aware Residual Representation Module using complex-valued MLP in frequency domain for non-periodic fluctuations, and 3) Dual-View Alignment Loss to enforce semantic separation between components.

Result: Extensive experiments on four real-world traffic datasets show HyperD achieves state-of-the-art prediction accuracy, superior robustness under disturbances, and improved computational efficiency compared to existing methods.

Conclusion: HyperD effectively addresses traffic forecasting challenges by decoupling periodic and residual components, demonstrating strong performance and practical advantages for intelligent transportation systems.

Abstract: Accurate traffic forecasting plays a vital role in intelligent transportation systems, enabling applications such as congestion control, route planning, and urban mobility optimization. However, traffic forecasting remains challenging due to two key factors: (1) complex spatial dependencies arising from dynamic interactions between road segments and traffic sensors across the network, and (2) the coexistence of multi-scale periodic patterns (e.g., daily and weekly periodic patterns driven by human routines) with irregular fluctuations caused by unpredictable events (e.g., accidents, weather, or construction). To tackle these challenges, we propose HyperD (Hybrid Periodic Decoupling), a novel framework that decouples traffic data into periodic and residual components. The periodic component is handled by the Hybrid Periodic Representation Module, which extracts fine-grained daily and weekly patterns using learnable periodic embeddings and spatial-temporal attention. The residual component, which captures non-periodic, high-frequency fluctuations, is modeled by the Frequency-Aware Residual Representation Module, leveraging complex-valued MLP in frequency domain. To enforce semantic separation between the two components, we further introduce a Dual-View Alignment Loss, which aligns low-frequency information with the periodic branch and high-frequency information with the residual branch. Extensive experiments on four real-world traffic datasets demonstrate that HyperD achieves state-of-the-art prediction accuracy, while offering superior robustness under disturbances and improved computational efficiency compared to existing methods.

[761] From Model Training to Model Raising

Roland Aydin, Christian Cyron, Steve Bachelor, Ashton Anderson, Robert West

Main category: cs.AI

TL;DR: Proposes shifting from “model training” to “model raising” by integrating alignment into model development from the start through redesigned training corpora.

DetailsMotivation: Current AI training methods align models with human values only after core capabilities are established, resulting in easily misaligned models lacking deep-rooted value systems.

Method: Redesign training corpus with four key components: reframing data from first-person perspective, recontextualizing information as lived experience, simulating social interactions, and scaffolding data ordering.

Result: Expected to create models with early commitment to values from the first training token, making knowledge, skills, and values intrinsically harder to separate.

Conclusion: This paradigm shift is critical as LLM capabilities increasingly surpass human capabilities in many tasks.

Abstract: Current AI training methods align models with human values only after their core capabilities have been established, resulting in models that are easily misaligned and lack deep-rooted value systems. We propose a paradigm shift from “model training” to “model raising”, in which alignment is woven into a model’s development from the start. We identify several key components for this paradigm, all centered around redesigning the training corpus: reframing training data from a first-person perspective, recontextualizing information as lived experience, simulating social interactions, and scaffolding the ordering of training data. We expect that this redesign of the training corpus will lead to an early commitment to values from the first training token onward, such that knowledge, skills, and values are intrinsically much harder to separate. In an ecosystem in which large language model capabilities start overtaking human capabilities in many tasks, this seems to us like a critical need.

[762] Thermally Activated Dual-Modal Adversarial Clothing against AI Surveillance Systems

Jiahuan Long, Tingsong Jiang, Hanqing Liu, Chao Ma, Wen Yao

Main category: cs.AI

TL;DR: A thermally activated adversarial wearable that uses thermochromic dyes and heating units to create dynamic patterns on clothing, enabling users to evade AI surveillance systems while maintaining normal appearance when not activated.

DetailsMotivation: To address the conspicuous appearance limitation of traditional adversarial patches and create a more practical privacy-preserving solution that can adapt to real-world surveillance environments across both visible and infrared modalities.

Method: Integration of thermochromic dyes with flexible heating units on clothing surfaces to create dynamically activated adversarial patterns. The system appears as ordinary black clothing when inactive, but reveals hidden adversarial patterns when heated.

Result: The system achieves rapid texture activation within 50 seconds and maintains over 80% adversarial success rate across diverse real-world surveillance environments, effectively evading detection in both visible and infrared modalities.

Conclusion: This work demonstrates a physically grounded, user-controllable anti-AI system that provides a practical pathway for proactive privacy protection against ubiquitous AI surveillance through thermally activated adversarial wearables.

Abstract: Adversarial patches have emerged as a popular privacy-preserving approach for resisting AI-driven surveillance systems. However, their conspicuous appearance makes them difficult to deploy in real-world scenarios. In this paper, we propose a thermally activated adversarial wearable designed to ensure adaptability and effectiveness in complex real-world environments. The system integrates thermochromic dyes with flexible heating units to induce visually dynamic adversarial patterns on clothing surfaces. In its default state, the clothing appears as an ordinary black T-shirt. Upon heating via an embedded thermal unit, hidden adversarial patterns on the fabric are activated, allowing the wearer to effectively evade detection across both visible and infrared modalities. Physical experiments demonstrate that the adversarial wearable achieves rapid texture activation within 50 seconds and maintains an adversarial success rate above 80% across diverse real-world surveillance environments. This work demonstrates a new pathway toward physically grounded, user-controllable anti-AI systems, highlighting the growing importance of proactive adversarial techniques for privacy protection in the age of ubiquitous AI surveillance.

[763] EgoEMS: A High-Fidelity Multimodal Egocentric Dataset for Cognitive Assistance in Emergency Medical Services

Keshara Weerasinghe, Xueren Ge, Tessa Heick, Lahiru Nuwan Wijayasingha, Anthony Cortez, Abhishek Satpathy, John Stankovic, Homa Alemzadeh

Main category: cs.AI

TL;DR: EgoEMS is the first comprehensive egocentric dataset for EMS training, featuring 20+ hours of simulated emergency scenarios with multimodal annotations to support AI cognitive assistants for first responders.

DetailsMotivation: Emergency responders face intense cognitive demands in high-stakes situations, and AI assistants could help reduce this burden by supporting real-time data collection and decision making.

Method: Created EgoEMS dataset with 233 simulated EMS scenarios from 62 participants (including 46 professionals) using an open-source, low-cost data collection system with multimodal annotations including keysteps, audio transcripts, action quality metrics, and segmentation masks.

Result: Developed a high-fidelity, multimodal dataset capturing realistic EMS activities with responder-patient interactions, plus benchmark tasks for keystep recognition and action quality estimation.

Conclusion: EgoEMS provides foundational resources to advance AI support tools for EMS, with potential to improve patient outcomes through better intelligent emergency response systems.

Abstract: Emergency Medical Services (EMS) are critical to patient survival in emergencies, but first responders often face intense cognitive demands in high-stakes situations. AI cognitive assistants, acting as virtual partners, have the potential to ease this burden by supporting real-time data collection and decision making. In pursuit of this vision, we introduce EgoEMS, the first end-to-end, high-fidelity, multimodal, multiperson dataset capturing over 20 hours of realistic, procedural EMS activities from an egocentric view in 233 simulated emergency scenarios performed by 62 participants, including 46 EMS professionals. Developed in collaboration with EMS experts and aligned with national standards, EgoEMS is captured using an open-source, low-cost, and replicable data collection system and is annotated with keysteps, timestamped audio transcripts with speaker diarization, action quality metrics, and bounding boxes with segmentation masks. Emphasizing realism, the dataset includes responder-patient interactions reflecting real-world emergency dynamics. We also present a suite of benchmarks for real-time multimodal keystep recognition and action quality estimation, essential for developing AI support tools for EMS. We hope EgoEMS inspires the research community to push the boundaries of intelligent EMS systems and ultimately contribute to improved patient outcomes.

[764] ChEmREF: Evaluating Language Model Readiness for Chemical Emergency Response

Risha Surana, Qinyuan Ye, Swabha Swayamdipta

Main category: cs.AI

TL;DR: Language models show potential for assisting emergency responders with HAZMAT incidents but require human oversight due to current limitations in chemical information processing and emergency response recommendations.

DetailsMotivation: Emergency responders face critical, time-sensitive decisions during HAZMAT incidents and need to manually navigate extensive chemical guidelines, creating a need for AI assistance.

Method: Created ChEmREF benchmark with 1,035 HAZMAT chemicals from Emergency Response Guidebook and PubChem Database, testing three tasks: chemical representation translation, emergency response generation, and domain knowledge QA.

Result: Best models achieved 68.0% exact match on chemical translation, 52.7% LLM Judge score on response recommendations, and 63.9% accuracy on HAZMAT exams.

Conclusion: Language models demonstrate potential for emergency response assistance but current performance limitations necessitate careful human oversight in real-world applications.

Abstract: Emergency responders managing hazardous material HAZMAT incidents face critical, time-sensitive decisions, manually navigating extensive chemical guidelines. We investigate whether today’s language models can assist responders by rapidly and reliably understanding critical information, identifying hazards, and providing recommendations. We introduce the Chemical Emergency Response Evaluation Framework (ChEmREF), a new benchmark comprising questions on 1,035 HAZMAT chemicals from the Emergency Response Guidebook and the PubChem Database. ChEmREF is organized into three tasks: (1) translation of chemical representation between structured and unstructured forms (e.g., converting C2H6O to ethanol), (2) emergency response generation (e.g., recommending appropriate evacuation distances) and (3) domain knowledge question answering from chemical safety and certification exams. Our best evaluated models received an exact match of 68.0% on unstructured HAZMAT chemical representation translation, a LLM Judge score of 52.7% on incident response recommendations, and a multiple-choice accuracy of 63.9% on HAMZAT examinations. These findings suggest that while language models show potential to assist emergency responders in various tasks, they require careful human oversight due to their current limitations.

[765] Advanced Black-Box Tuning of Large Language Models with Limited API Calls

Zhikang Xie, Weilin Wan, Peizhu Gong, Weizhong Zhang, Cheng Jin

Main category: cs.AI

TL;DR: Proposes a novel black-box tuning method using Gaussian Process surrogate models with minimal API calls to efficiently adapt large language models while maintaining high accuracy.

DetailsMotivation: Current black-box tuning methods face a dilemma: either use inefficient proxy models with limited improvement, or make expensive API calls for each iteration. There's a need for a method that balances efficiency and performance.

Method: Trains a Gaussian Process surrogate model using “LogitMap Pairs” from minimal but informative training data, which approximates foundation model outputs to guide proxy model training and reduce API queries.

Result: Achieves 86.85% accuracy (from 55.92% baseline) with only 1.38% API query frequency, significantly outperforming offline methods and matching query-intensive approaches with much lower costs.

Conclusion: Provides a robust and high-efficiency paradigm for language model adaptation that balances performance and computational costs effectively.

Abstract: Black-box tuning is an emerging paradigm for adapting large language models (LLMs) to better achieve desired behaviors, particularly when direct access to model parameters is unavailable. Current strategies, however, often present a dilemma of suboptimal extremes: either separately train a small proxy model and then use it to shift the predictions of the foundation model, offering notable efficiency but often yielding limited improvement; or making API calls in each tuning iteration to the foundation model, which entails prohibitive computational costs. Therefore, we propose a novel advanced black-box tuning method for LLMs with limited API calls. Our core strategy involves training a Gaussian Process (GP) surrogate model with “LogitMap Pairs” derived from querying the foundation model on a minimal but highly informative training subset. This surrogate can approximate the outputs of the foundation model to guide the training of the proxy model, thereby effectively reducing the need for direct queries to the foundation model. Extensive experiments verify that our approach elevates pre-trained language model accuracy from 55.92% to 86.85%, reducing the frequency of API queries to merely 1.38%. This significantly outperforms offline approaches that operate entirely without API access. Notably, our method also achieves comparable or superior accuracy to query-intensive approaches, while significantly reducing API costs. This offers a robust and high-efficiency paradigm for language model adaptation.

[766] MTP: Exploring Multimodal Urban Traffic Profiling with Modality Augmentation and Spectrum Fusion

Haolong Xiang, Peisi Wang, Xiaolong Xu, Kun Yi, Xuyun Zhang, Quanzheng Sheng, Amin Beheshti, Wei Fan

Main category: cs.AI

TL;DR: MTP is a multimodal framework for urban traffic profiling that integrates numeric, visual, and textual perspectives to enhance traffic signal understanding and prediction.

DetailsMotivation: Existing traffic signal modeling methods rely only on numerical sensor data, overlooking semantic information from multimodal urban data, which limits comprehensive understanding and accurate prediction of traffic dynamics.

Method: Transform traffic signals into frequency images and periodicity images for visual learning; augment descriptive texts based on topic, background, and item description for textual learning; use frequency multilayer perceptrons for numeric learning; employ hierarchical contrastive learning to fuse the three modalities.

Result: Extensive experiments on six real-world datasets demonstrate superior performance compared to state-of-the-art approaches.

Conclusion: The MTP framework effectively integrates multimodal perspectives to improve traffic signal profiling and prediction accuracy.

Abstract: With rapid urbanization in the modern era, traffic signals from various sensors have been playing a significant role in monitoring the states of cities, which provides a strong foundation in ensuring safe travel, reducing traffic congestion and optimizing urban mobility. Most existing methods for traffic signal modeling often rely on the original data modality, i.e., numerical direct readings from the sensors in cities. However, this unimodal approach overlooks the semantic information existing in multimodal heterogeneous urban data in different perspectives, which hinders a comprehensive understanding of traffic signals and limits the accurate prediction of complex traffic dynamics. To address this problem, we propose a novel Multimodal framework, MTP, for urban Traffic Profiling, which learns multimodal features through numeric, visual, and textual perspectives. The three branches drive for a multimodal perspective of urban traffic signal learning in the frequency domain, while the frequency learning strategies delicately refine the information for extraction. Specifically, we first conduct the visual augmentation for the traffic signals, which transforms the original modality into frequency images and periodicity images for visual learning. Also, we augment descriptive texts for the traffic signals based on the specific topic, background information and item description for textual learning. To complement the numeric information, we utilize frequency multilayer perceptrons for learning on the original modality. We design a hierarchical contrastive learning on the three branches to fuse the spectrum of three modalities. Finally, extensive experiments on six real-world datasets demonstrate superior performance compared with the state-of-the-art approaches.

[767] Non-Monotonic S4F Standpoint Logic (Extended Version with Proofs)

Piotr Gorczyca, Hannes Strass

Main category: cs.AI

TL;DR: S4F Standpoint Logic combines standpoint logic with non-monotonic reasoning using modal logic S4F, creating a unified framework for multi-viewpoint non-monotonic reasoning without increased computational complexity.

DetailsMotivation: To create a unified formalism that can represent multiple heterogeneous viewpoints while incorporating non-monotonic reasoning capabilities, bridging standpoint logics and non-monotonic reasoning frameworks.

Method: Developed S4F Standpoint Logic by generalizing both S4F modal logic and standpoint propositional logic, defining its syntax and semantics, and analyzing computational complexity.

Result: S4F Standpoint Logic is not computationally harder than its constituent logics in both monotonic and non-monotonic forms, with mechanisms for credulous and sceptical acceptance demonstrated through examples.

Conclusion: The proposed S4F Standpoint Logic successfully integrates multi-viewpoint representation with non-monotonic reasoning while maintaining computational tractability comparable to its underlying logics.

Abstract: Standpoint logics offer unified modal logic-based formalisms for representing multiple heterogeneous viewpoints. At the same time, many non-monotonic reasoning frameworks can be naturally captured using modal logics, in particular using the modal logic S4F. In this work, we propose a novel formalism called S4F Standpoint Logic, which generalises both S4F and standpoint propositional logic and is therefore capable of expressing multi-viewpoint, non-monotonic semantic commitments. We define its syntax and semantics and analyze its computational complexity, obtaining the result that S4F Standpoint Logic is not computationally harder than its constituent logics, whether in monotonic or non-monotonic form. We also outline mechanisms for credulous and sceptical acceptance and illustrate the framework with an example.

[768] ARCTraj: A Dataset and Benchmark of Human Reasoning Trajectories for Abstract Problem Solving

Sejin Kim, Hayan Choi, Seokki Lee, Sundong Kim

Main category: cs.AI

TL;DR: ARCTraj is a dataset and framework for modeling human reasoning in visual tasks, capturing intermediate steps through object-level actions to reveal how humans transform inputs to outputs over time.

DetailsMotivation: Existing approaches in ARC rely on static input-output supervision, which limits insight into temporal reasoning processes and intermediate steps that conventional datasets overlook.

Method: Collected via O2ARC web interface with 10,000 trajectories across 400 training tasks, using object-level actions with timestamps and success labels, and defining a unified reasoning pipeline with MDP formulation for integration with various learning methods.

Result: The dataset enables analysis of spatial selection, color attribution, and strategic convergence, revealing the structure and diversity of human reasoning patterns in complex visual tasks.

Conclusion: ARCTraj provides a structured foundation for studying human-like reasoning, advancing explainability, alignment, and generalizable intelligence through interpretable reasoning trajectories.

Abstract: We present ARCTraj, a dataset and methodological framework for modeling human reasoning through complex visual tasks in the Abstraction and Reasoning Corpus (ARC). While ARC has inspired extensive research on abstract reasoning, most existing approaches rely on static input–output supervision, which limits insight into how reasoning unfolds over time. ARCTraj addresses this gap by recording temporally ordered, object-level actions that capture how humans iteratively transform inputs into outputs, revealing intermediate reasoning steps that conventional datasets overlook. Collected via the O2ARC web interface, it contains around 10,000 trajectories annotated with task identifiers, timestamps, and success labels across 400 training tasks from the ARC-AGI-1 benchmark. It further defines a unified reasoning pipeline encompassing data collection, action abstraction, Markov decision process (MDP) formulation, and downstream learning, enabling integration with reinforcement learning, generative modeling, and sequence modeling methods such as PPO, World Models, GFlowNets, Diffusion agents, and Decision Transformers. Analyses of spatial selection, color attribution, and strategic convergence highlight the structure and diversity of human reasoning. Together, these contributions position ARCTraj as a structured and interpretable foundation for studying human-like reasoning, advancing explainability, alignment, and generalizable intelligence.

[769] A Workflow for Full Traceability of AI Decisions

Julius Wenzel, Syeda Umaima Alam, Andreas Schmidt, Hanwei Zhang, Holger Hermanns

Main category: cs.AI

TL;DR: This paper presents a workflow for generating tamper-proof, verifiable traces of AI decisions to address the lack of documentation in automated decision systems that could cause harm to people.

DetailsMotivation: The increasing use of brittle AI systems in high-stakes decisions creates substantial risks of harm to people's well-being and fundamental rights, with current systems lacking proper documentation that would enable tracing decision processes and establishing responsibility chains.

Method: The paper enforces documentation of every component in AI training and inference, expanding the DBOM concept into a running workflow using confidential computing technology to generate tamper-proof and exhaustive decision traces.

Result: The authors demonstrate a working workflow through development of a mushroom classification app (poisonous vs edible) as a playful example of high-stakes decision support, showing the system’s ability to create verifiable decision traces.

Conclusion: The approach provides a practical solution for creating legally defensible documentation of AI decisions that can stand up in court, addressing the critical need for traceability and accountability in automated decision systems.

Abstract: An ever increasing number of high-stake decisions are made or assisted by automated systems employing brittle artificial intelligence technology. There is a substantial risk that some of these decision induce harm to people, by infringing their well-being or their fundamental human rights. The state-of-the-art in AI systems makes little effort with respect to appropriate documentation of the decision process. This obstructs the ability to trace what went into a decision, which in turn is a prerequisite to any attempt of reconstructing a responsibility chain. Specifically, such traceability is linked to a documentation that will stand up in court when determining the cause of some AI-based decision that inadvertently or intentionally violates the law. This paper takes a radical, yet practical, approach to this problem, by enforcing the documentation of each and every component that goes into the training or inference of an automated decision. As such, it presents the first running workflow supporting the generation of tamper-proof, verifiable and exhaustive traces of AI decisions. In doing so, we expand the DBOM concept into an effective running workflow leveraging confidential computing technology. We demonstrate the inner workings of the workflow in the development of an app to tell poisonous and edible mushrooms apart, meant as a playful example of high-stake decision support.

cs.SD

[770] Lightweight Hopfield Neural Networks for Bioacoustic Detection and Call Monitoring of Captive Primates

Wendy Lomas, Andrew Gascoyne, Colin Dubreuil, Stefano Vaglio, Liam Naughton

Main category: cs.SD

TL;DR: A lightweight Hopfield neural network model for passive acoustic monitoring that detects lemur vocalizations with 94% accuracy, processing 5.5 hours of audio per minute on standard hardware.

DetailsMotivation: Current acoustic monitoring methods use resource-intensive convolutional neural networks that require large labeled datasets and lack flexibility. There's a need for faster, more transparent alternatives for wildlife monitoring.

Method: Adapted a Hopfield neural network (HNN) architecture originally developed for bat echolocation detection to monitor captive lemur vocalizations. Stored target lemur social calls in the HNN and improved the model by adding movement signal detection.

Result: Achieved 94% overall accuracy, can perform 340 classifications per second, processing over 5.5 hours of audio data per minute on a standard laptop. The model trains in milliseconds.

Conclusion: This lightweight associative memory model provides a fast, transparent alternative to CNNs for acoustic monitoring, reducing data-to-insight turnaround times and accelerating decision making in both captive and wild settings.

Abstract: Passive acoustic monitoring is a sustainable method of monitoring wildlife and environments that leads to the generation of large datasets and, currently, a processing backlog. Academic research into automating this process is focused on the application of resource intensive convolutional neural networks which require large pre-labelled datasets for training and lack flexibility in application. We present a viable alternative relevant in both wild and captive settings; a transparent, lightweight and fast-to-train associative memory AI model with Hopfield neural network (HNN) architecture. Adapted from a model developed to detect bat echolocation calls, this model monitors captive endangered black-and-white ruffed lemur Varecia variegata vocalisations. Lemur social calls of interest when monitoring welfare are stored in the HNN in order to detect other call instances across the larger acoustic dataset. We make significant model improvements by storing an additional signal caused by movement and achieve an overall accuracy of 0.94. The model can perform $340$ classifications per second, processing over 5.5 hours of audio data per minute, on a standard laptop running other applications. It has broad applicability and trains in milliseconds. Our lightweight solution reduces data-to-insight turnaround times and can accelerate decision making in both captive and wild settings.

[771] Real-Time Speech Enhancement via a Hybrid ViT: A Dual-Input Acoustic-Image Feature Fusion

Behnaz Bahmei, Siamak Arzanpour, Elina Birmingham

Main category: cs.SD

TL;DR: A transformer-based framework for real-time single-channel noise suppression that effectively handles non-stationary noise using dual-input acoustic-image feature fusion with a hybrid ViT architecture.

DetailsMotivation: Existing deep learning methods perform well on stationary noise but struggle with real-world non-stationary noise (e.g., dog barking, baby crying), and there's a need for computationally efficient solutions suitable for embedded devices.

Method: Dual-input acoustic-image feature fusion using a hybrid Vision Transformer (ViT) framework that models both temporal and spectral dependencies in noisy signals, designed to be computationally lightweight for real-time embedded applications.

Result: The method significantly improves noise reduction, speech intelligibility, and perceptual quality compared to noisy input, achieving performance close to clean reference signals on Librispeech, UrbanSound8K, and Google Audioset datasets using PESQ, STOI, Seg SNR, and LLR metrics.

Conclusion: The proposed transformer-based framework effectively addresses non-stationary noise suppression in real-time applications while maintaining computational efficiency suitable for embedded devices.

Abstract: Speech quality and intelligibility are significantly degraded in noisy environments. This paper presents a novel transformer-based learning framework to address the single-channel noise suppression problem for real-time applications. Although existing deep learning networks have shown remarkable improvements in handling stationary noise, their performance often diminishes in real-world environments characterized by non-stationary noise (e.g., dog barking, baby crying). The proposed dual-input acoustic-image feature fusion using a hybrid ViT framework effectively models both temporal and spectral dependencies in noisy signals. Designed for real-world audio environments, the proposed framework is computationally lightweight and suitable for implementation on embedded devices. To evaluate its effectiveness, four standard and commonly used quality measurements, namely PESQ, STOI, Seg SNR, and LLR, are utilized. Experimental results obtained using the Librispeech dataset as the clean speech source and the UrbanSound8K and Google Audioset datasets as the noise sources, demonstrate that the proposed method significantly improves noise reduction, speech intelligibility, and perceptual quality compared to the noisy input signal, achieving performance close to the clean reference.

[772] MF-Speech: Achieving Fine-Grained and Compositional Control in Speech Generation via Factor Disentanglement

Xinyue Yu, Youqing Fang, Pingyu Wu, Guoyang Ye, Wenbo Zhou, Weiming Zhang, Song Xiao

Main category: cs.SD

TL;DR: MF-Speech is a novel framework that decomposes speech into pure content, timbre, and emotion factors using a multi-objective optimization encoder, then achieves fine-grained control through dynamic fusion and hierarchical normalization.

DetailsMotivation: Overcoming the fundamental challenges of deep entanglement of speech factors and coarse granularity in existing speech control mechanisms.

Method: Two-component framework: MF-SpeechEncoder (factor purifier using multi-objective optimization) and MF-SpeechGenerator (conductor using dynamic fusion and Hierarchical Style Adaptive Normalization).

Result: Significantly outperforms SOTA methods with WER=4.67%, SECS=0.5685, Corr=0.68, and high subjective scores (nMOS=3.96, sMOS_emotion=3.86, sMOS_style=3.78). Learned factors show strong transferability.

Conclusion: MF-Speech successfully addresses speech factor entanglement and enables precise, composable fine-grained control, with learned factors showing potential as general-purpose speech representations.

Abstract: Generating expressive and controllable human speech is one of the core goals of generative artificial intelligence, but its progress has long been constrained by two fundamental challenges: the deep entanglement of speech factors and the coarse granularity of existing control mechanisms. To overcome these challenges, we have proposed a novel framework called MF-Speech, which consists of two core components: MF-SpeechEncoder and MF-SpeechGenerator. MF-SpeechEncoder acts as a factor purifier, adopting a multi-objective optimization strategy to decompose the original speech signal into highly pure and independent representations of content, timbre, and emotion. Subsequently, MF-SpeechGenerator functions as a conductor, achieving precise, composable and fine-grained control over these factors through dynamic fusion and Hierarchical Style Adaptive Normalization (HSAN). Experiments demonstrate that in the highly challenging multi-factor compositional speech generation task, MF-Speech significantly outperforms current state-of-the-art methods, achieving a lower word error rate (WER=4.67%), superior style control (SECS=0.5685, Corr=0.68), and the highest subjective evaluation scores(nMOS=3.96, sMOS_emotion=3.86, sMOS_style=3.78). Furthermore, the learned discrete factors exhibit strong transferability, demonstrating their significant potential as a general-purpose speech representation.

[773] Towards Practical Real-Time Low-Latency Music Source Separation

Junyu Wu, Jie Liu, Tianrui Pan, Jie Tang, Gangshan Wu

Main category: cs.SD

TL;DR: Introduces RT-STT, a lightweight real-time low-latency music demixing model based on single-path TFC-TDF UNET architecture with channel expansion feature fusion and quantization for faster inference.

DetailsMotivation: Address the gap in real-time, low-latency music demixing applications (hearing aids, live performances) and counter the trend of increasingly large models that limit practical deployment.

Method: Proposes RT-STT model using single-path TFC-TDF UNET with channel expansion feature fusion technique, demonstrates superiority of single-path over dual-path for real-time, and applies quantization to reduce inference time.

Result: RT-STT achieves superior performance with significantly fewer parameters and shorter inference times compared to state-of-the-art models.

Conclusion: The proposed lightweight RT-STT model effectively addresses real-time music demixing needs with efficient architecture and optimization techniques.

Abstract: In recent years, significant progress has been made in the field of deep learning for music demixing. However, there has been limited attention on real-time, low-latency music demixing, which holds potential for various applications, such as hearing aids, audio stream remixing, and live performances. Additionally, a notable tendency has emerged towards the development of larger models, limiting their applicability in certain scenarios. In this paper, we introduce a lightweight real-time low-latency model called Real-Time Single-Path TFC-TDF UNET (RT-STT), which is based on the Dual-Path TFC-TDF UNET (DTTNet). In RT-STT, we propose a feature fusion technique based on channel expansion. We also demonstrate the superiority of single-path modeling over dual-path modeling in real-time models. Moreover, we investigate the method of quantization to further reduce inference time. RT-STT exhibits superior performance with significantly fewer parameters and shorter inference times compared to state-of-the-art models.

[774] FoleyBench: A Benchmark For Video-to-Audio Models

Satvik Dixit, Koichi Saito, Zhi Zhong, Yuki Mitsufuji, Chris Donahue

Main category: cs.SD

TL;DR: FoleyBench is a new benchmark for video-to-audio generation focused on Foley sound effects, addressing limitations in existing datasets that lack proper audio-visual correspondence and are dominated by speech/music.

DetailsMotivation: Current V2A evaluation datasets have poor audio-visual correspondence (74% of videos) and are dominated by speech/music, creating a mismatch with Foley sound applications in film, AR/VR, and sound design.

Method: Created FoleyBench with 5,000 video-audio-text triplets using automated pipeline from YouTube/Vimeo videos, featuring visible sound sources with causal audio-visual relationships and comprehensive metadata labeling.

Result: FoleyBench provides stronger coverage of Foley sound categories compared to past datasets, enabling fine-grained analysis of model performance across audio quality, alignment, synchronization, and text consistency.

Conclusion: FoleyBench fills a critical gap in V2A evaluation by providing the first large-scale benchmark specifically designed for Foley-style scenarios, enabling more accurate assessment of models for practical applications.

Abstract: Video-to-audio generation (V2A) is of increasing importance in domains such as film post-production, AR/VR, and sound design, particularly for the creation of Foley sound effects synchronized with on-screen actions. Foley requires generating audio that is both semantically aligned with visible events and temporally aligned with their timing. Yet, there is a mismatch between evaluation and downstream applications due to the absence of a benchmark tailored to Foley-style scenarios. We find that 74% of videos from past evaluation datasets have poor audio-visual correspondence. Moreover, they are dominated by speech and music, domains that lie outside the use case for Foley. To address this gap, we introduce FoleyBench, the first large-scale benchmark explicitly designed for Foley-style V2A evaluation. FoleyBench contains 5,000 (video, ground-truth audio, text caption) triplets, each featuring visible sound sources with audio causally tied to on-screen events. The dataset is built using an automated, scalable pipeline applied to in-the-wild internet videos from YouTube-based and Vimeo-based sources. Compared to past datasets, we show that videos from FoleyBench have stronger coverage of sound categories from a taxonomy specifically designed for Foley sound. Each clip is further labeled with metadata capturing source complexity, UCS/AudioSet category, and video length, enabling fine-grained analysis of model performance and failure modes. We benchmark several state-of-the-art V2A models, evaluating them on audio quality, audio-video alignment, temporal synchronization, and audio-text consistency. Samples are available at: https://gclef-cmu.org/foleybench

[775] Spatial Blind Spot: Auditory Motion Perception Deficits in Audio LLMs

Zhe Sun, Yujun Cai, Jiayu Yao, Yiwei Wang

Main category: cs.SD

TL;DR: Current Audio-Language Models (LALMs) show systematic motion perception deficits, struggling to infer sound source direction and trajectory from binaural audio, with accuracy below 50%.

DetailsMotivation: To investigate whether LALMs can perceive spatial dynamics and motion of sound sources, which remains unclear despite their progress in other auditory tasks.

Method: Introduces AMPBench, the first benchmark for auditory motion understanding, using controlled question-answering with binaural audio to evaluate directional and trajectory inference capabilities.

Result: Models struggle to recognize motion cues or distinguish directional patterns, with comprehensive analyses revealing fundamental limitations in auditory spatial reasoning.

Conclusion: There is a fundamental gap between human and model auditory spatial reasoning, highlighting the need for enhanced spatial cognition in future LALMs.

Abstract: Large Audio-Language Models (LALMs) have recently shown impressive progress in speech recognition, audio captioning, and auditory question answering. Yet, whether these models can perceive spatial dynamics, particularly the motion of sound sources, remains unclear. In this work, we uncover a systematic motion perception deficit in current ALLMs. To investigate this issue, we introduce AMPBench, the first benchmark explicitly designed to evaluate auditory motion understanding. AMPBench introduces a controlled question-answering benchmark designed to evaluate whether Audio-Language Models (LALMs) can infer the direction and trajectory of moving sound sources from binaural audio. Comprehensive quantitative and qualitative analyses reveal that current models struggle to reliably recognize motion cues or distinguish directional patterns. The average accuracy remains below 50%, underscoring a fundamental limitation in auditory spatial reasoning. Our study highlights a fundamental gap between human and model auditory spatial reasoning, providing both a diagnostic tool and new insight for enhancing spatial cognition in future Audio-Language Models.

[776] DRAGON: Distributional Rewards Optimize Diffusion Generative Models

Yatong Bai, Jonah Casebeer, Somayeh Sojoudi, Nicholas J. Bryan

Main category: cs.SD

TL;DR: DRAGON is a flexible framework for fine-tuning media generation models that can optimize both instance-wise and distributional rewards, achieving strong performance across 20 different reward functions including human-perceived music quality.

DetailsMotivation: Traditional RLHF and pairwise preference approaches like DPO are limited in their flexibility. DRAGON aims to provide a more versatile framework that can handle a broader range of reward functions including instance-wise, instance-to-distribution, and distribution-to-distribution rewards.

Method: DRAGON constructs reward functions using encoders and reference examples to create exemplar distributions. It gathers online generations, scores them to create positive/negative demonstration sets, and uses contrast between these sets to approximate distributional reward optimization.

Result: DRAGON achieved an 81.45% average win rate across 20 target rewards. With appropriate exemplar sets, it achieved 60.95% human-voted music quality win rate without training on human preference annotations. Reward functions based on exemplar sets performed comparably to model-based rewards.

Conclusion: DRAGON provides a new approach to designing and optimizing reward functions for improving human-perceived quality in media generation, demonstrating versatility across different reward types and modalities.

Abstract: We present Distributional RewArds for Generative OptimizatioN (DRAGON), a versatile framework for fine-tuning media generation models towards a desired outcome. Compared with traditional reinforcement learning with human feedback (RLHF) or pairwise preference approaches such as direct preference optimization (DPO), DRAGON is more flexible. It can optimize reward functions that evaluate either individual examples or distributions of them, making it compatible with a broad spectrum of instance-wise, instance-to-distribution, and distribution-to-distribution rewards. Leveraging this versatility, we construct novel reward functions by selecting an encoder and a set of reference examples to create an exemplar distribution. When cross-modal encoders such as CLAP are used, the reference may be of a different modality (text versus audio). Then, DRAGON gathers online and on-policy generations, scores them with the reward function to construct a positive demonstration set and a negative set, and leverages the contrast between the two finite sets to approximate distributional reward optimization. For evaluation, we fine-tune an audio-domain text-to-music diffusion model with 20 reward functions, including a custom music aesthetics model, CLAP score, Vendi diversity, and Frechet audio distance (FAD). We further compare instance-wise (per-song) and full-dataset FAD settings while ablating multiple FAD encoders and reference sets. Over all 20 target rewards, DRAGON achieves an 81.45% average win rate. Moreover, reward functions based on exemplar sets enhance generations and are comparable to model-based rewards. With an appropriate exemplar set, DRAGON achieves a 60.95% human-voted music quality win rate without training on human preference annotations. DRAGON is a new approach to designing and optimizing reward functions for improving human-perceived quality. Demos at https://ml-dragon.github.io/web

[777] DualSpeechLM: Towards Unified Speech Understanding and Generation via Dual Speech Token Modeling with Large Language Models

Yuanyuan Wang, Dongchao Yang, Yiwen Shao, Hangting Chen, Jiankun Zhao, Zhiyong Wu, Helen Meng, Xixin Wu

Main category: cs.SD

TL;DR: Proposes DualSpeechLM, a unified speech understanding and generation model using USTokenizer for semantic speech tokens and dual-token modeling to bridge modality gaps between speech and text.

DetailsMotivation: Extending text LLMs to handle speech faces challenges: large modality gap requiring extensive fine-tuning data, and conflicting requirements between understanding (needs high-level semantics) and generation (needs acoustic details).

Method: 1) USTokenizer extracts high-level semantic speech tokens compatible with text LLMs; 2) Dual-token modeling framework processes USToken as input and acoustic tokens as output; 3) Semantic supervision loss and Chain-of-Condition strategy for stable training.

Result: Experimental results show effective complementary relationship between understanding and generation tasks, demonstrating mutual enhancement in unified model.

Conclusion: The proposed approach successfully integrates speech understanding and generation in one model, showing promising strategy for unified speech-language modeling.

Abstract: Extending pre-trained text Large Language Models (LLMs)’s speech understanding or generation abilities by introducing various effective speech tokens has attracted great attention in the speech community. However, building a unified speech understanding and generation model still faces the following challenges: (1) Due to the huge modality gap between speech and text tokens, extending text LLMs to unified speech LLMs relies on large-scale paired data for fine-tuning, and (2) Generation and understanding tasks prefer information at different levels, e.g., generation benefits from detailed acoustic features, while understanding favors high-level semantics. This divergence leads to difficult performance optimization in one unified model. To solve these challenges, in this paper, we present two key insights in speech tokenization and speech language modeling. Specifically, we first propose an Understanding-driven Speech Tokenizer (USTokenizer), which extracts high-level semantic information essential for accomplishing understanding tasks using text LLMs. In this way, USToken enjoys better modality commonality with text, which reduces the difficulty of modality alignment in adapting text LLMs to speech LLMs. Secondly, we present DualSpeechLM, a dual-token modeling framework that concurrently models USToken as input and acoustic token as output within a unified, end-to-end framework, seamlessly integrating speech understanding and generation capabilities. Furthermore, we propose a novel semantic supervision loss and a Chain-of-Condition (CoC) strategy to stabilize model training and enhance speech generation performance. Experimental results demonstrate that our proposed approach effectively fosters a complementary relationship between understanding and generation tasks, highlighting the promising strategy of mutually enhancing both tasks in one unified model.

[778] Multi-Metric Preference Alignment for Generative Speech Restoration

Junan Zhang, Xueyao Zhang, Jing Yang, Yuancheng Wang, Fan Fan, Zhizheng Wu

Main category: cs.SD

TL;DR: This paper proposes a multi-metric preference alignment strategy for generative speech restoration models to address misalignment with human perceptual preferences, achieving consistent performance gains across different generative paradigms.

DetailsMotivation: Current generative models for speech restoration often misalign with human perceptual preferences, resulting in suboptimal quality. Post-training alignment has been effective in other domains but remains under-explored for speech restoration.

Method: Proposed multi-metric preference alignment strategy using Direct Preference Optimization (DPO) with a new dataset (GenSR-Pref) containing 80K preference pairs selected by complementary metrics covering perceptual quality, signal fidelity, content consistency, and timbre preservation.

Result: Consistent and significant performance gains across three generative paradigms (autoregressive, masked generative, flow-matching models) on various restoration benchmarks in both objective and subjective evaluations. Aligned models can also generate high-quality pseudo-labels for discriminative models.

Conclusion: The multi-metric preference alignment strategy effectively mitigates reward hacking and improves speech restoration quality, demonstrating the value of principled preference alignment for generative speech models.

Abstract: Recent generative models have significantly advanced speech restoration tasks, yet their training objectives often misalign with human perceptual preferences, resulting in suboptimal quality. While post-training alignment has proven effective in other generative domains like text and image generation, its application to generative speech restoration remains largely under-explored. This work investigates the challenges of applying preference-based post-training to this task, focusing on how to define a robust preference signal and curate high-quality data to avoid reward hacking. To address these challenges, we propose a multi-metric preference alignment strategy. We construct a new dataset, GenSR-Pref, comprising 80K preference pairs, where each chosen sample is unanimously favored by a complementary suite of metrics covering perceptual quality, signal fidelity, content consistency, and timbre preservation. This principled approach ensures a holistic preference signal. Applying Direct Preference Optimization (DPO) with our dataset, we observe consistent and significant performance gains across three diverse generative paradigms: autoregressive models (AR), masked generative models (MGM), and flow-matching models (FM) on various restoration benchmarks, in both objective and subjective evaluations. Ablation studies confirm the superiority of our multi-metric strategy over single-metric approaches in mitigating reward hacking. Furthermore, we demonstrate that our aligned models can serve as powerful ‘‘data annotators’’, generating high-quality pseudo-labels to serve as a supervision signal for traditional discriminative models in data-scarce scenarios like singing voice restoration. Demo Page:https://gensr-pref.github.io

[779] Audio Palette: A Diffusion Transformer with Multi-Signal Conditioning for Controllable Foley Synthesis

Junnuo Wang

Main category: cs.SD

TL;DR: Audio Palette is a diffusion transformer model that enables fine-grained acoustic control in text-to-audio synthesis using four time-varying control signals (loudness, pitch, spectral centroid, timbre) while maintaining audio quality and semantic alignment.

DetailsMotivation: Address the 'control gap' in open-source text-to-audio synthesis where fine-grained acoustic control remains challenging, enabling more precise and interpretable manipulation of sound attributes for artist-centric workflows.

Method: Extends Stable Audio Open architecture with diffusion transformer (DiT), introduces four time-varying control signals, uses Low-Rank Adaptation (LoRA) for efficient fine-tuning on AudioSet subset (0.85% parameters), and implements three-scale classifier-free guidance.

Result: Achieves fine-grained interpretable control of sound attributes while maintaining comparable audio quality (FAD, LAION-CLAP scores) to baseline, enabling precise manipulation of acoustic features with strong semantic alignment to text prompts.

Conclusion: Establishes a robust foundation for controllable sound design in open-source settings with scalable, modular pipeline that emphasizes sequence-based conditioning, memory efficiency, and nuanced inference-time control for performative audio synthesis.

Abstract: Recent advances in diffusion-based generative models have enabled high-quality text-to-audio synthesis, but fine-grained acoustic control remains a significant challenge in open-source research. We present Audio Palette, a diffusion transformer (DiT) based model that extends the Stable Audio Open architecture to address this “control gap” in controllable audio generation. Unlike prior approaches that rely solely on semantic conditioning, Audio Palette introduces four time-varying control signals: loudness, pitch, spectral centroid, and timbre, for precise and interpretable manipulation of acoustic features. The model is efficiently adapted for the nuanced domain of Foley synthesis using Low-Rank Adaptation (LoRA) on a curated subset of AudioSet, requiring only 0.85 percent of the original parameters to be trained. Experiments demonstrate that Audio Palette achieves fine-grained, interpretable control of sound attributes. Crucially, it accomplishes this novel controllability while maintaining high audio quality and strong semantic alignment to text prompts, with performance on standard metrics such as Frechet Audio Distance (FAD) and LAION-CLAP scores remaining comparable to the original baseline model. We provide a scalable, modular pipeline for audio research, emphasizing sequence-based conditioning, memory efficiency, and a three-scale classifier-free guidance mechanism for nuanced inference-time control. This work establishes a robust foundation for controllable sound design and performative audio synthesis in open-source settings, enabling a more artist-centric workflow.

[780] MusRec: Zero-Shot Text-to-Music Editing via Rectified Flow and Diffusion Transformers

Ali Boudaghi, Hadi Zare

Main category: cs.SD

TL;DR: MusRec is a zero-shot text-to-music editing model that performs diverse editing tasks on real-world music using rectified flow and diffusion transformers, outperforming existing methods.

DetailsMotivation: Existing music editing models are limited to synthesized music, require precise prompts, or need task-specific retraining, lacking true zero-shot capability for real-world music editing.

Method: Leverages recent advances in rectified flow and diffusion transformers to create a zero-shot text-to-music editing model that works on real-world music.

Result: Outperforms existing methods in preserving musical content, structural consistency, and editing fidelity across diverse editing tasks.

Conclusion: Establishes a strong foundation for controllable music editing in real-world scenarios with true zero-shot capability.

Abstract: Music editing has emerged as an important and practical area of artificial intelligence, with applications ranging from video game and film music production to personalizing existing tracks according to user preferences. However, existing models face significant limitations, such as being restricted to editing synthesized music generated by their own models, requiring highly precise prompts, or necessitating task-specific retraining, thus lacking true zero-shot capability. leveraging recent advances in rectified flow and diffusion transformers, we introduce MusRec, a zero-shot text-to-music editing model capable of performing diverse editing tasks on real-world music efficiently and effectively. Experimental results demonstrate that our approach outperforms existing methods in preserving musical content, structural consistency, and editing fidelity, establishing a strong foundation for controllable music editing in real-world scenarios.

[781] AcousTools: A ‘Full-Stack’, Python-Based, Acoustic Holography Library

Joshua Mukherjee, Giorgos Christopoulos, Zhouyang Shen, Sriram Subramanian, Ryuji Hirayama

Main category: cs.SD

TL;DR: AcousTools is a Python-based acoustic holography library that provides a full-stack solution for acoustic holography applications, covering setup, modeling, phase retrieval, analysis, and hardware control.

DetailsMotivation: There is currently no single software that provides a complete solution for acoustic holography applications, with existing methods failing to cover all aspects from abstraction to physicalization.

Method: Developed AcousTools, a Python library designed to support the full suite of acoustic holographic applications including setup, acoustic propagation modeling, transducer phase retrieval, sound field analysis, and hardware control.

Result: AcousTools successfully meets each step of the full-stack requirements for acoustic holography and has the potential to become the standard code library in this field.

Conclusion: AcousTools provides a uniquely complete and easy-to-use solution that will enable researchers to develop novel applications and accurately review others’ work, while also providing a framework for comparing methodologies.

Abstract: Acoustic Holography is an emerging field where mid-air ultrasound is controlled and manipulated for novel and exciting applications. These range from mid-air haptics, volumetric displays, contactless fabrication, and even chemical and biomedical applications such as drug delivery. To develop these applications, a software framework to predict acoustic behaviour and simulating resulting effects, such as applied forces or scattering patterns is desirable. There have been various software libraries and platforms that attempt to fill this role, but there is yet to be a single piece of software that acts as a ‘full-stack’ solution. We define this full-stack as the process from abstraction to physicalisation starting with setup, modelling acoustic propagation, transducer phase retrieval, sound field analysis, and control of the acoustic holographic hardware itself. Existing methods fail to fulfil one or more of these categories. To address this, we present AcousTools, a Python-based acoustic holography library, designed to support the full suite of acoustic holographic applications and we show AcousTools’s ability to meet each step of the full-stack’s requirements. AcousTools has the potential to become the standard code library for acoustic holography, with the uniquely complete suite of features wrapped in a language that is known to be easy to use, AcousTools will increase the ability for researchers to develop novel applications as well as accurately review other’s work. The full-stack, aside from software, will also be useful for researchers - providing a way to view and compare methodologies by understanding where they fit into the stack.

[782] HQ-SVC: Towards High-Quality Zero-Shot Singing Voice Conversion in Low-Resource Scenarios

Bingsong Bai, Yizhong Geng, Fengping Wang, Cong Wang, Puyuan Guo, Yingming Gao, Ya Li

Main category: cs.SD

TL;DR: HQ-SVC is an efficient framework for high-quality zero-shot singing voice conversion that jointly models content and speaker features using a decoupled codec, enhances fidelity through pitch/volume modeling, and progressively refines outputs via differentiable signal processing and diffusion.

DetailsMotivation: Existing zero-shot SVC methods model speaker timbre and vocal content separately, losing essential acoustic information that degrades output quality while requiring significant computational resources.

Method: Extracts jointly content and speaker features using a decoupled codec, enhances fidelity through pitch and volume modeling, and progressively refines outputs via differentiable signal processing and diffusion techniques.

Result: Significantly outperforms state-of-the-art zero-shot SVC methods in conversion quality and efficiency, and achieves superior voice naturalness compared to specialized audio super-resolution methods while natively supporting voice super-resolution tasks.

Conclusion: HQ-SVC provides an efficient and high-quality solution for zero-shot singing voice conversion that preserves critical acoustic information typically lost in separate modeling approaches.

Abstract: Zero-shot singing voice conversion (SVC) transforms a source singer’s timbre to an unseen target speaker’s voice while preserving melodic content without fine-tuning. Existing methods model speaker timbre and vocal content separately, losing essential acoustic information that degrades output quality while requiring significant computational resources. To overcome these limitations, we propose HQ-SVC, an efficient framework for high-quality zero-shot SVC. HQ-SVC first extracts jointly content and speaker features using a decoupled codec. It then enhances fidelity through pitch and volume modeling, preserving critical acoustic information typically lost in separate modeling approaches, and progressively refines outputs via differentiable signal processing and diffusion techniques. Evaluations confirm HQ-SVC significantly outperforms state-of-the-art zero-shot SVC methods in conversion quality and efficiency. Beyond voice conversion, HQ-SVC achieves superior voice naturalness compared to specialized audio super-resolution methods while natively supporting voice super-resolution tasks.

[783] DialogGraph-LLM: Graph-Informed LLMs for End-to-End Audio Dialogue Intent Recognition

HongYu Liu, Junxin Li, Changxi Guo, Hao Chen, Yaqian Huang, Yifu Guo, Huan Yang, Lihua Cai

Main category: cs.SD

TL;DR: DialogGraph-LLM is an end-to-end framework that combines a Multi-Relational Dialogue Attention Network with multimodal foundation models for speaker intent recognition in long audio dialogues, using adaptive semi-supervised learning with confidence-aware pseudo-labeling.

DetailsMotivation: Speaker intent recognition in long audio dialogues is challenging due to complex inter-dependencies in speaker utterances and scarce annotated data, but has wide applications.

Method: Proposes DialogGraph-LLM with MR-DAN architecture and multimodal foundation models for acoustic-to-intent inference, plus adaptive semi-supervised learning with dual-threshold pseudo-label generation and entropy-based sample selection.

Result: Extensive evaluations on MarketCalls and MIntRec 2.0 benchmarks show superiority over audio and text-driven baselines, with strong performance and efficiency in real-world audio dialogues.

Conclusion: The framework proves practical value for audio-rich domains with limited supervision, effectively addressing intent recognition challenges in real-world scenarios.

Abstract: Recognizing speaker intent in long audio dialogues among speakers has a wide range of applications, but is a non-trivial AI task due to complex inter-dependencies in speaker utterances and scarce annotated data. To address these challenges, an end-to-end framework, namely DialogGraph-LLM, is proposed in the current work. DialogGraph-LLM combines a novel Multi-Relational Dialogue Attention Network (MR-DAN) architecture with multimodal foundation models (e.g., Qwen2.5-Omni-7B) for direct acoustic-to-intent inference. An adaptive semi-supervised learning strategy is designed using LLM with a confidence-aware pseudo-label generation mechanism based on dual-threshold filtering using both global and class confidences, and an entropy-based sample selection process that prioritizes high-information unlabeled instances. Extensive evaluations on the proprietary MarketCalls corpus and the publicly available MIntRec 2.0 benchmark demonstrate DialogGraph-LLM’s superiority over strong audio and text-driven baselines. The framework demonstrates strong performance and efficiency in intent recognition in real world scenario audio dialogues, proving its practical value for audio-rich domains with limited supervision. Our code is available at https://github.com/david188888/DialogGraph-LLM.

cs.LG

[784] Softmax as a Lagrangian-Legendrian Seam

Christopher R. Lee-Jenkins

Main category: cs.LG

TL;DR: The paper bridges machine learning and differential geometry by modeling softmax as a geometric interface with Legendrian seams and contact screens, revealing bias-shift invariance as Reeb flow and providing geometric interpretations for ML concepts.

DetailsMotivation: To establish a connection between machine learning (specifically softmax/logits-to-probabilities) and modern differential geometry, providing geometric interpretations for ML operations and concepts.

Method: Model softmax as a geometric interface with two potential-generated conservative descriptions meeting along a Legendrian seam on a contact screen (probability simplex) within a folded symplectic collar. Analyze bias-shift invariance as Reeb flow and use Fenchel-Young equality/KL gap as computable distance to the seam.

Result: Successfully modeled softmax geometrically, showing bias-shift invariance corresponds to Reeb flow on the probability simplex screen, with Fenchel-Young equality providing distance measurements. Concrete examples worked out for 2- and 3-class cases.

Conclusion: Established a geometric framework for understanding softmax operations, opening avenues for compact logit models, global invariants, and connections to information geometry where on-screen dynamics manifest as replicator flows.

Abstract: This note offers a first bridge from machine learning to modern differential geometry. We show that the logits-to-probabilities step implemented by softmax can be modeled as a geometric interface: two potential-generated, conservative descriptions (from negative entropy and log-sum-exp) meet along a Legendrian “seam” on a contact screen (the probability simplex) inside a simple folded symplectic collar. Bias-shift invariance appears as Reeb flow on the screen, and the Fenchel-Young equality/KL gap provides a computable distance to the seam. We work out the two- and three-class cases to make the picture concrete and outline next steps for ML: compact logit models (projective or spherical), global invariants, and connections to information geometry where on-screen dynamics manifest as replicator flows.

[785] LLM on a Budget: Active Knowledge Distillation for Efficient Classification of Large Text Corpora

Viviana Luccioli, Rithika Iyengar, Ryan Panley, Flora Haberkorn, Xiaoyu Ge, Leland Crane, Nitish Sinha, Seung Jung Lee

Main category: cs.LG

TL;DR: M-RARU is an active learning algorithm that reduces LLM distillation costs by 80% through uncertainty-based sampling with randomized accept-reject mechanism.

DetailsMotivation: High computational and financial costs of deploying LLMs in dynamic environments, and expensive distillation processes requiring large labeled datasets.

Method: M-RARU combines uncertainty sampling with randomized accept-reject mechanism to select only the most informative data points for LLM teacher labeling.

Result: Achieves up to 80% reduction in sample requirements compared to random sampling, while maintaining classification accuracy across five student models on multiple datasets.

Conclusion: M-RARU enables efficient student model creation at significantly reduced costs while preserving LLM performance.

Abstract: Large Language Models (LLMs) are highly accurate in classification tasks, however, substantial computational and financial costs hinder their large-scale deployment in dynamic environments. Knowledge Distillation (KD) where a LLM “teacher” trains a smaller and more efficient “student” model, offers a promising solution to this problem. However, the distillation process itself often remains costly for large datasets, since it requires the teacher to label a vast number of samples while incurring significant token consumption. To alleviate this challenge, in this work we explore the active learning (AL) as a way to create efficient student models at a fraction of the cost while preserving the LLM’s performance. In particular, we introduce M-RARU (Multi-class Randomized Accept/Reject Uncertainty Sampling), a novel AL algorithm that significantly reduces training costs. M-RARU employs an innovative strategy combining uncertainty with a randomized accept-reject mechanism to select only the most informative data points for the LLM teacher. This focused approach significantly minimizes required API calls and data processing time. We evaluate M-RARU against random sampling across five diverse student models (SVM, LDA, RF, GBDT, and DistilBERT) on multiple benchmark datasets. Experiments demonstrate that our proposed method achieves up to 80% reduction in sample requirements as compared to random sampling, substantially improving classification accuracy while reducing financial costs and overall training time.

[786] Detecting Statistically Significant Fairness Violations in Recidivism Forecasting Algorithms

Animesh Joshi

Main category: cs.LG

TL;DR: This paper introduces a statistical testing framework for assessing the significance of algorithmic fairness violations using k-fold cross-validation, applied to recidivism forecasting algorithms.

DetailsMotivation: Existing fairness literature lacks methods to determine if observed disparities between groups are statistically significant or just due to chance, creating a need for rigorous statistical testing of algorithmic bias.

Method: Leveraging k-fold cross-validation to generate sampling distributions of fairness metrics, creating statistical tests for fairness violations based on disparities, model calibration, and causal inference techniques.

Result: Recidivism forecasting algorithms show statistically significant bias against Black individuals under several fairness definitions, while showing no bias or bias against White individuals under other definitions.

Conclusion: Rigorous statistical testing is crucial for evaluating algorithmic decision-making systems, as different fairness definitions can yield contradictory bias assessments.

Abstract: Machine learning algorithms are increasingly deployed in critical domains such as finance, healthcare, and criminal justice [1]. The increasing popularity of algorithmic decision-making has stimulated interest in algorithmic fairness within the academic community. Researchers have introduced various fairness definitions that quantify disparities between privileged and protected groups, use causal inference to determine the impact of race on model predictions, and that test calibration of probability predictions from the model. Existing literature does not provide a way in which to assess whether observed disparities between groups are statistically significant or merely due to chance. This paper introduces a rigorous framework for testing the statistical significance of fairness violations by leveraging k-fold cross-validation [2] to generate sampling distributions of fairness metrics. This paper introduces statistical tests that can be used to identify statistically significant violations of fairness metrics based on disparities between predicted and actual outcomes, model calibration, and causal inference techniques [1]. We demonstrate this approach by testing recidivism forecasting algorithms trained on data from the National Institute of Justice. Our findings reveal that machine learning algorithms used for recidivism forecasting exhibit statistically significant bias against Black individuals under several fairness definitions, while also exhibiting no bias or bias against White individuals under other definitions. The results from this paper underscore the importance of rigorous and robust statistical testing while evaluating algorithmic decision-making systems.

[787] DAOpt: Modeling and Evaluation of Data-Driven Optimization under Uncertainty with LLMs

WenZhuo Zhu, Zheng Cui, Wenhan Lu, Sheng Liu, Yue Zhao

Main category: cs.LG

TL;DR: DAOpt framework for applying LLMs to uncertain optimization problems, featuring new dataset OptU, multi-agent decision-making, and simulation environment with focus on out-of-sample performance.

DetailsMotivation: Most existing LLM research focuses on deterministic optimization with known parameters, leaving uncertain decision-making settings largely unexplored despite real-world optimization being inherently uncertain.

Method: Proposed DAOpt framework with: (1) new dataset OptU, (2) multi-agent decision-making module, (3) simulation environment for evaluation, and (4) enhanced LLM modeling through few-shot learning with stochastic and robust optimization domain knowledge.

Result: Framework enables evaluation of LLMs on uncertain optimization problems with emphasis on out-of-sample feasibility and robustness.

Conclusion: DAOpt addresses the gap in applying LLMs to uncertain optimization settings and enhances their modeling capabilities through domain-specific knowledge integration.

Abstract: Recent advances in large language models (LLMs) have accelerated research on automated optimization modeling. While real-world decision-making is inherently uncertain, most existing work has focused on deterministic optimization with known parameters, leaving the application of LLMs in uncertain settings largely unexplored. To that end, we propose the DAOpt framework including a new dataset OptU, a multi-agent decision-making module, and a simulation environment for evaluating LLMs with a focus on out-of-sample feasibility and robustness. Additionally, we enhance LLMs’ modeling capabilities by incorporating few-shot learning with domain knowledge from stochastic and robust optimization.

[788] To Align or Not to Align: Strategic Multimodal Representation Alignment for Optimal Performance

Wanlong Fang, Tianle Zhang, Alvin Chan

Main category: cs.LG

TL;DR: Explicit alignment between multimodal representations is not universally beneficial - its impact depends on data characteristics, particularly modality redundancy. Optimal alignment strength balances modality-specific signals with shared redundancy.

DetailsMotivation: Prior research only observed natural alignment in multimodal data without systematically studying explicit alignment effects. This work investigates how explicit alignment influences performance under different modality information structures.

Method: Introduces a controllable contrastive learning module to precisely manipulate alignment strength during training, tested on synthetic and real datasets with varying data characteristics.

Result: The impact of explicit alignment depends on data characteristics - optimal alignment strength varies with the amount of redundancy between modalities. Identified an optimal balance between modality-specific signals and shared redundancy.

Conclusion: Provides practical guidance on when and how to apply explicit alignment for optimal unimodal encoder performance, showing alignment should be tailored to modality redundancy levels.

Abstract: Multimodal learning often relies on aligning representations across modalities to enable effective information integration, an approach traditionally assumed to be universally beneficial. However, prior research has primarily taken an observational approach, examining naturally occurring alignment in multimodal data and exploring its correlation with model performance, without systematically studying the direct effects of explicitly enforced alignment between representations of different modalities. In this work, we investigate how explicit alignment influences both model performance and representation alignment under different modality-specific information structures. Specifically, we introduce a controllable contrastive learning module that enables precise manipulation of alignment strength during training, allowing us to explore when explicit alignment improves or hinders performance. Our results on synthetic and real datasets under different data characteristics show that the impact of explicit alignment on the performance of unimodal models is related to the characteristics of the data: the optimal level of alignment depends on the amount of redundancy between the different modalities. We identify an optimal alignment strength that balances modality-specific signals and shared redundancy in the mixed information distributions. This work provides practical guidance on when and how explicit alignment should be applied to achieve optimal unimodal encoder performance.

[789] Decoupling Positional and Symbolic Attention Behavior in Transformers

Felipe Urrutia, Jorge Salas, Alexander Kozachinskiy, Cristian Buc Calderon, Hector Pasten, Cristobal Rojas

Main category: cs.LG

TL;DR: This paper analyzes how Rotary Positional Encoding (RoPE) enables Transformers to encode positional and symbolic information using different frequency ranges, and shows that controlling frequency access causally affects model performance on positional vs symbolic tasks.

DetailsMotivation: To understand the dichotomy between positional and symbolic information encoding in Transformer attention heads using RoPE, and how different frequencies contribute to these behaviors.

Method: Developed theoretical definitions for positional vs symbolic head behaviors, created metrics to quantify them, analyzed Transformer LLMs with RoPE, and designed canonical tasks to test causal relationships between frequency access and performance.

Result: Found strong correlation between attention head behavior and frequency use - large frequencies for positional encoding, small frequencies for semantic encoding. Demonstrated causal control over Transformer performance by manipulating which frequencies attention heads can access.

Conclusion: RoPE’s success stems from its ability to separate positional and symbolic encoding through frequency allocation, providing a detailed understanding of how positional encoding properties relate to Transformer behavior.

Abstract: An important aspect subtending language understanding and production is the ability to independently encode positional and symbolic information of the words within a sentence. In Transformers, positional information is typically encoded using Positional Encodings (PEs). One such popular PE, namely Rotary PE (RoPE), has been widely used due to its empirical success. Recently, it has been argued that part of RoPE’s success emerges from its ability to encode robust positional and semantic information using large and small frequencies, respectively. In this work, we perform a deeper dive into the positional versus symbolic dichotomy of attention heads behavior, both at the theoretical and empirical level. We provide general definitions of what it means for a head to behave positionally or symbolically, prove that these are two mutually exclusive behaviors and develop a metric to quantify them. We apply our framework to analyze Transformer-based LLMs using RoPE and find that all heads exhibit a strong correspondence between behavior and frequency use. Finally, we introduce canonical tasks designed to be either purely positional or symbolic, and demonstrate that the Transformer performance causally relates to the ability of attention heads to leverage the appropriate frequencies. In particular, we show that we can control the Transformer performance by controlling which frequencies the attention heads can access. Altogether, our work provides a detailed understanding of RoPE, and how its properties relate to model behavior.

[790] Regularized Schrödinger: Alleviating Distortion and Exposure Bias in Solving Inverse Problems

Qing Yao, Lijian Gao, Qirong Mao, Dong Ming

Main category: cs.LG

TL;DR: RSB is a novel diffusion-based method for inverse problems that addresses distortion-perception tradeoff and exposure bias through regularized training and Schrödinger Bridge adaptation.

DetailsMotivation: To overcome limitations in diffusion models for inverse problems: 1) distortion-perception tradeoff where better perceptual quality degrades fidelity, and 2) exposure bias from training-inference mismatch causing error accumulation.

Method: Regularized Schrödinger Bridge (RSB) with novel training strategy that perturbs both input states and targets, exposing model to simulated prediction errors and using posterior mean interpolation to reduce distortion.

Result: Extensive experiments on speech enhancement show RSB outperforms state-of-the-art methods, significantly improving distortion metrics and effectively reducing exposure bias.

Conclusion: RSB successfully addresses key challenges in diffusion models for inverse problems, providing superior performance in both distortion reduction and exposure bias mitigation.

Abstract: Diffusion models serve as a powerful generative framework for solving inverse problems. However, they still face two key challenges: 1) the distortion-perception tradeoff, where improving perceptual quality often degrades reconstruction fidelity, and 2) the exposure bias problem, where the training-inference input mismatch leads to prediction error accumulation and reduced reconstruction quality. In this work, we propose the Regularized Schrödinger Bridge (RSB), an adaptation of Schrödinger Bridge tailored for inverse problems that addresses the above limitations. RSB employs a novel regularized training strategy that perturbs both the input states and targets, effectively mitigating exposure bias by exposing the model to simulated prediction errors and also alleviating distortion by well-designed interpolation via the posterior mean. Extensive experiments on two typical inverse problems for speech enhancement demonstrate that RSB outperforms state-of-the-art methods, significantly improving distortion metrics and effectively reducing exposure bias.

[791] The Anatomy of a Triton Attention Kernel

Burkhard Ringlein, Jan van Lunteren, Radu Stoica, Thomas Parnell

Main category: cs.LG

TL;DR: Developed a portable LLM inference platform using Triton-based paged attention kernel that achieves state-of-the-art performance across NVIDIA and AMD GPUs without low-level hand-tuning.

DetailsMotivation: To create a portable LLM inference platform that works across hardware architectures, eliminates manual optimization, and maintains high efficiency.

Method: Built a paged attention kernel using Triton (domain-specific JIT-compiled language) with algorithmic/system improvements, parameter auto-tuning, and integration into popular inference servers.

Result: Improved performance from 19.7% to 105.9% of state-of-the-art, demonstrating cross-platform efficiency on both NVIDIA and AMD GPUs.

Conclusion: Open-source domain-specific languages like Triton can enable portable, efficient LLM inference across different GPU vendors without vendor-specific optimizations.

Abstract: A long-standing goal in both industry and academia is to develop an LLM inference platform that is portable across hardware architectures, eliminates the need for low-level hand-tuning, and still delivers best-in-class efficiency. In this work, we demonstrate that portable, efficient cross-platform LLM inference is indeed possible and share our experience. We develop a state-of-the-art paged attention kernel, the core performance-critical component of many LLM deployments, that builds exclusively on the domain-specific just-in-time compiled language Triton to achieve state-of-the-art performance on both NVIDIA and AMD GPUs. We describe our high-level approach, the key algorithmic and system-level improvements, the parameter auto-tuning required to unlock efficiency, and the integrations into a popular inference server that are necessary to bring the performance of a generic Triton attention kernel from 19.7% of the state-of-the-art to 105.9%. Our results highlight how open-source domain-specific languages can be leveraged to unlock model portability across different GPU vendors.

[792] Beyond saliency: enhancing explanation of speech emotion recognition with expert-referenced acoustic cues

Seham Nasr, Zhao Ren, David Johnson

Main category: cs.LG

TL;DR: A framework for explainable AI in speech emotion recognition that links saliency maps to acoustic cues, improving interpretability over standard vision-based methods.

DetailsMotivation: Current saliency methods adapted from vision highlight spectrogram regions but fail to show if they correspond to meaningful acoustic emotion markers, limiting faithfulness and interpretability.

Method: Proposed framework quantifies cue magnitudes within salient regions, connecting saliency to expert-referenced acoustic cues of speech emotions.

Result: Experiments on benchmark SER datasets show improved explanation quality by explicitly linking salient regions to theory-driven speech emotion acoustics.

Conclusion: Provides more understandable and plausible explanations than standard saliency methods, offering a foundational step towards trustworthy speech-based affective computing.

Abstract: Explainable AI (XAI) for Speech Emotion Recognition (SER) is critical for building transparent, trustworthy models. Current saliency-based methods, adapted from vision, highlight spectrogram regions but fail to show whether these regions correspond to meaningful acoustic markers of emotion, limiting faithfulness and interpretability. We propose a framework that overcomes these limitations by quantifying the magnitudes of cues within salient regions. This clarifies “what” is highlighted and connects it to “why” it matters, linking saliency to expert-referenced acoustic cues of speech emotions. Experiments on benchmark SER datasets show that our approach improves explanation quality by explicitly linking salient regions to theory-driven speech emotions expert-referenced acoustics. Compared to standard saliency methods, it provides more understandable and plausible explanations of SER models, offering a foundational step towards trustworthy speech-based affective computing.

[793] A Smart-Glasses for Emergency Medical Services via Multimodal Multitask Learning

Liuyi Jin, Pasan Gunawardena, Amran Haroon, Runzhi Wang, Sangwoo Lee, Radu Stoleru, Michael Middleton, Zepeng Huo, Jeeeun Kim, Jason Moats

Main category: cs.LG

TL;DR: EMSGlass is a smart-glasses system with EMSNet (multimodal multitask model) and EMSServe (serving framework) that enhances EMT decision-making by integrating text, vital signs, and scene images for real-time emergency medical services.

DetailsMotivation: EMTs face high-pressure environments requiring rapid, life-critical decisions under heavy cognitive loads, necessitating AI systems to support real-time situational awareness and operational efficiency.

Method: Developed EMSNet (multimodal model integrating text, vital signs, scene images) and EMSServe (low-latency serving framework with modality-aware model splitter and feature caching) built on PyTorch.

Result: EMSNet supports 5 critical EMS tasks with superior accuracy over unimodal baselines. EMSServe achieves 1.9x-11.7x speedup over direct PyTorch inference. User study with 6 EMTs shows enhanced situational awareness and decision-making speed.

Conclusion: EMSGlass successfully bridges multimodal AI with real-world emergency response, providing actionable directions for next-generation AI-enabled EMS systems through intuitive on-glass interaction.

Abstract: Emergency Medical Technicians (EMTs) operate in high-pressure environments, making rapid, life-critical decisions under heavy cognitive and operational loads. We present EMSGlass, a smart-glasses system powered by EMSNet, the first multimodal multitask model for Emergency Medical Services (EMS), and EMSServe, a low-latency multimodal serving framework tailored to EMS scenarios. EMSNet integrates text, vital signs, and scene images to construct a unified real-time understanding of EMS incidents. Trained on real-world multimodal EMS datasets, EMSNet simultaneously supports up to five critical EMS tasks with superior accuracy compared to state-of-the-art unimodal baselines. Built on top of PyTorch, EMSServe introduces a modality-aware model splitter and a feature caching mechanism, achieving adaptive and efficient inference across heterogeneous hardware while addressing the challenge of asynchronous modality arrival in the field. By optimizing multimodal inference execution in EMS scenarios, EMSServe achieves 1.9x – 11.7x speedup over direct PyTorch multimodal inference. A user study evaluation with six professional EMTs demonstrates that EMSGlass enhances real-time situational awareness, decision-making speed, and operational efficiency through intuitive on-glass interaction. In addition, qualitative insights from the user study provide actionable directions for extending EMSGlass toward next-generation AI-enabled EMS systems, bridging multimodal intelligence with real-world emergency response workflows.

[794] Parallel and Multi-Stage Knowledge Graph Retrieval for Behaviorally Aligned Financial Asset Recommendations

Fernando Spadea, Oshani Seneviratne

Main category: cs.LG

TL;DR: RAG-FLARKO extends FLARKO with retrieval-augmented generation using multi-stage KG retrieval to improve financial recommendations by reducing context overhead and enhancing relevance.

DetailsMotivation: Overcome limitations of LLMs in financial recommendations including context limits, hallucinations, and lack of behavioral grounding by building on prior FLARKO framework.

Method: Multi-stage parallel KG retrieval: first retrieves behaviorally relevant entities from user transaction KG, then filters temporally consistent signals from market KG to construct compact grounded subgraph for LLM.

Result: Significantly enhances recommendation quality on real-world financial dataset; enables smaller models to achieve high performance in profitability and behavioral alignment.

Conclusion: Presents viable path for deploying grounded financial AI in resource-constrained environments through efficient retrieval-augmented framework.

Abstract: Large language models (LLMs) show promise for personalized financial recommendations but are hampered by context limits, hallucinations, and a lack of behavioral grounding. Our prior work, FLARKO, embedded structured knowledge graphs (KGs) in LLM prompts to align advice with user behavior and market data. This paper introduces RAG-FLARKO, a retrieval-augmented extension to FLARKO, that overcomes scalability and relevance challenges using multi-stage and parallel KG retrieval processes. Our method first retrieves behaviorally relevant entities from a user’s transaction KG and then uses this context to filter temporally consistent signals from a market KG, constructing a compact, grounded subgraph for the LLM. This pipeline reduces context overhead and sharpens the model’s focus on relevant information. Empirical evaluation on a real-world financial transaction dataset demonstrates that RAG-FLARKO significantly enhances recommendation quality. Notably, our framework enables smaller, more efficient models to achieve high performance in both profitability and behavioral alignment, presenting a viable path for deploying grounded financial AI in resource-constrained environments.

[795] Output Supervision Can Obfuscate the Chain of Thought

Jacob Drori, Luke Marks, Bryce Woodworth, Alex Cloud, Alexander Matt Turner

Main category: cs.LG

TL;DR: Training models with output-only monitors can still lead to obfuscated chain-of-thought (CoT) reasoning through two mechanisms: generalization of safe-looking outputs to CoTs, and reinforcement of safe-looking CoTs due to token conditioning.

DetailsMotivation: To address limitations in previous approaches that only used output monitors, which still allow obfuscated CoTs to emerge through unintended mechanisms.

Method: Proposed two mitigation strategies targeting the identified mechanisms: preventing generalization of safe outputs to CoTs and addressing reinforcement of safe-looking CoTs through token conditioning.

Result: The mitigations achieved a Pareto improvement, simultaneously enhancing both monitorability of CoTs and task performance compared to standard training approaches.

Conclusion: Output-only monitoring is insufficient to prevent obfuscated CoTs; targeted mitigations are needed to address the specific mechanisms that enable obfuscation while maintaining model performance.

Abstract: OpenAI (2025) showed that training against a chain of thought (CoT) monitor can cause obfuscated CoTs, which contain bad behavior the monitor cannot detect. They proposed to keep CoTs monitorable by training only against output monitors that do not have access to CoT. We show that such training can still cause obfuscated CoTs via two mechanisms. First, when a model is trained to produce a safe-looking output, that model may generalize to making its CoTs look safe. Second, since later tokens are conditioned on earlier ones, safe-looking CoTs may increase the likelihood of safe outputs, causing safe-looking CoTs to be reinforced. We introduce two mitigations to address these two issues, which achieve a Pareto improvement in terms of monitorability and task performance compared to regular training.

[796] Parameter-Efficient and Personalized Federated Training of Generative Models at the Edge

Kabir Khan, Manju Sarkar, Anita Kar, Suresh Ghosh

Main category: cs.LG

TL;DR: FedGen-Edge enables efficient federated learning for large generative models by decoupling frozen pre-trained backbones from lightweight client adapters, using LoRA to reduce communication by 99% while maintaining performance.

DetailsMotivation: Large generative models are hard to train in federated settings due to heavy computation, communication overhead, and statistical/system heterogeneity across edge devices.

Method: Proposes FedGen-Edge framework that decouples frozen pre-trained global backbone from lightweight client-side adapters using Low-Rank Adaptation (LoRA), federating only the adapters via FedAvg-style aggregation.

Result: Achieves lower perplexity/FID and faster convergence than baselines on language modeling (PTB) and image generation (CIFAR-10), with >99% reduction in uplink traffic and stable aggregation under non-IID data.

Conclusion: FedGen-Edge provides a practical solution for privacy-preserving, resource-aware, and personalized generative AI on heterogeneous edge devices through efficient adapter-based federated learning.

Abstract: Large generative models (for example, language and diffusion models) enable high-quality text and image synthesis but are hard to train or adapt in cross-device federated settings due to heavy computation and communication and statistical/system heterogeneity. We propose FedGen-Edge, a framework that decouples a frozen, pre-trained global backbone from lightweight client-side adapters and federates only the adapters. Using Low-Rank Adaptation (LoRA) constrains client updates to a compact subspace, which reduces uplink traffic by more than 99 percent versus full-model FedAvg, stabilizes aggregation under non-IID data, and naturally supports personalization because each client can keep a locally tuned adapter. On language modeling (PTB) and image generation (CIFAR-10), FedGen-Edge achieves lower perplexity/FID and faster convergence than strong baselines while retaining a simple FedAvg-style server. A brief ablation shows diminishing returns beyond moderate LoRA rank and a trade-off between local epochs and client drift. FedGen-Edge offers a practical path toward privacy-preserving, resource-aware, and personalized generative AI on heterogeneous edge devices.

[797] WildfireGenome: Interpretable Machine Learning Reveals Local Drivers of Wildfire Risk and Their Cross-County Variation

Chenyue Liu, Ali Mostafavi

Main category: cs.LG

TL;DR: WildfireGenome improves wildfire risk assessment by combining federal indicators into interpretable risk labels, using machine learning for local predictions, and providing transparent driver analysis at county scale.

DetailsMotivation: Current wildfire risk assessments use coarse maps and black-box models that lack interpretability at decision-making scales, limiting practical utility for local planning and management.

Method: Three-component approach: (1) fuse 7 federal wildfire indicators into PCA-based composite risk labels at H3 Level-8 resolution, (2) Random Forest classification for local risk prediction, (3) SHAP and ICE/PDP analyses for county-specific driver interpretation.

Result: Achieved 0.755-0.878 accuracy and Quadratic Weighted Kappa up to 0.951 across 7 diverse US counties, with principal components explaining 87-94% of variance. Transfer tests worked well between similar ecological regions but failed across dissimilar contexts.

Conclusion: WildfireGenome advances wildfire risk assessment from regional prediction to interpretable, decision-scale analytics that can guide vegetation management, zoning, and infrastructure planning, with needleleaf forest cover and elevation identified as key drivers.

Abstract: Current wildfire risk assessments rely on coarse hazard maps and opaque machine learning models that optimize regional accuracy while sacrificing interpretability at the decision scale. WildfireGenome addresses these gaps through three components: (1) fusion of seven federal wildfire indicators into a sign-aligned, PCA-based composite risk label at H3 Level-8 resolution; (2) Random Forest classification of local wildfire risk; and (3) SHAP and ICE/PDP analyses to expose county-specific nonlinear driver relationships. Across seven ecologically diverse U.S. counties, models achieve accuracies of 0.755-0.878 and Quadratic Weighted Kappa up to 0.951, with principal components explaining 87-94% of indicator variance. Transfer tests show reliable performance between ecologically similar regions but collapse across dissimilar contexts. Explanations consistently highlight needleleaf forest cover and elevation as dominant drivers, with risk rising sharply at 30-40% needleleaf coverage. WildfireGenome advances wildfire risk assessment from regional prediction to interpretable, decision-scale analytics that guide vegetation management, zoning, and infrastructure planning.

[798] Mind Your Entropy: From Maximum Entropy to Trajectory Entropy-Constrained RL

Guojian Zhan, Likun Wang, Pengcheng Wang, Feihong Zhang, Jingliang Duan, Masayoshi Tomizuka, Shengbo Eben Li

Main category: cs.LG

TL;DR: The paper proposes TECRL, a trajectory entropy-constrained RL framework that addresses non-stationary Q-value estimation and short-sighted entropy tuning in maximum entropy RL, leading to improved performance and stability.

DetailsMotivation: Current maximum entropy RL methods suffer from two bottlenecks: non-stationary Q-value estimation due to joint entropy injection and temperature updates, and short-sighted local entropy tuning that ignores cumulative entropy effects over time.

Method: Proposes TECRL framework with two separate Q-functions (reward and entropy), enabling clean value targets and trajectory entropy constraints. Develops DSAC-E algorithm by extending distributional soft actor-critic with three refinements.

Result: Empirical results on OpenAI Gym benchmark show DSAC-E achieves higher returns and better stability compared to existing methods.

Conclusion: The TECRL framework effectively addresses key limitations of maximum entropy RL through separate Q-learning and trajectory entropy constraints, leading to improved performance and stability in off-policy reinforcement learning.

Abstract: Maximum entropy has become a mainstream off-policy reinforcement learning (RL) framework for balancing exploitation and exploration. However, two bottlenecks still limit further performance improvement: (1) non-stationary Q-value estimation caused by jointly injecting entropy and updating its weighting parameter, i.e., temperature; and (2) short-sighted local entropy tuning that adjusts temperature only according to the current single-step entropy, without considering the effect of cumulative entropy over time. In this paper, we extends maximum entropy framework by proposing a trajectory entropy-constrained reinforcement learning (TECRL) framework to address these two challenges. Within this framework, we first separately learn two Q-functions, one associated with reward and the other with entropy, ensuring clean and stable value targets unaffected by temperature updates. Then, the dedicated entropy Q-function, explicitly quantifying the expected cumulative entropy, enables us to enforce a trajectory entropy constraint and consequently control the policy long-term stochasticity. Building on this TECRL framework, we develop a practical off-policy algorithm, DSAC-E, by extending the state-of-the-art distributional soft actor-critic with three refinements (DSAC-T). Empirical results on the OpenAI Gym benchmark demonstrate that our DSAC-E can achieve higher returns and better stability.

[799] Sound Logical Explanations for Mean Aggregation Graph Neural Networks

Matthew Morris, Ian Horrocks

Main category: cs.LG

TL;DR: This paper analyzes the explainability and expressivity of GNNs with mean aggregation and non-negative weights (MAGNNs), proving which monotonic rules are sound for them and providing a logical framework to explain their predictions.

DetailsMotivation: Despite the prevalence of GNNs using mean aggregation, there's a lack of explainability and expressivity results for them, motivating the need to understand what logical rules can soundly explain their predictions.

Method: The authors prove the precise class of monotonic rules that are sound for MAGNNs and provide a restricted fragment of first-order logic to explain any MAGNN prediction. They also conduct experiments on standard inductive benchmarks.

Result: Experiments show that restricting mean-aggregation GNNs to non-negative weights yields comparable or improved performance, sound rules are obtained in practice, insightful explanations can be generated, and the sound rules can expose issues in trained models.

Conclusion: The work establishes a theoretical foundation for explaining MAGNN predictions through sound logical rules and demonstrates practical benefits including improved model interpretability and performance.

Abstract: Graph neural networks (GNNs) are frequently used for knowledge graph completion. Their black-box nature has motivated work that uses sound logical rules to explain predictions and characterise their expressivity. However, despite the prevalence of GNNs that use mean as an aggregation function, explainability and expressivity results are lacking for them. We consider GNNs with mean aggregation and non-negative weights (MAGNNs), proving the precise class of monotonic rules that can be sound for them, as well as providing a restricted fragment of first-order logic to explain any MAGNN prediction. Our experiments show that restricting mean-aggregation GNNs to have non-negative weights yields comparable or improved performance on standard inductive benchmarks, that sound rules are obtained in practice, that insightful explanations can be generated in practice, and that the sound rules can expose issues in the trained models.

[800] Aspiration-based Perturbed Learning Automata in Games with Noisy Utility Measurements. Part A: Stochastic Stability in Non-zero-Sum Games

Georgios C. Chasparis

Main category: cs.LG

TL;DR: APLA is a novel payoff-based learning scheme for distributed optimization that uses aspiration factors to ensure convergence to pure Nash equilibria in weakly-acyclic games, addressing limitations of standard reinforcement learning.

DetailsMotivation: Standard reinforcement learning schemes fail to guarantee convergence to pure Nash equilibria in distributed multi-player weakly-acyclic games, especially beyond the limited class of potential and coordination games.

Method: Aspiration-based perturbed learning automata (APLA) where each player’s action selection probability is reinforced by both repeated selection and an aspiration factor capturing satisfaction level, with stochastic stability analysis under noisy observations.

Result: Established equivalence between infinite-dimensional Markov chain and finite-dimensional one for stochastic stability in generic non-zero-sum games, and specialized convergence guarantees for weakly acyclic games.

Conclusion: APLA provides a robust distributed optimization framework that overcomes convergence limitations of standard reinforcement learning in multi-player games with noisy observations.

Abstract: Reinforcement-based learning has attracted considerable attention both in modeling human behavior as well as in engineering, for designing measurement- or payoff-based optimization schemes. Such learning schemes exhibit several advantages, especially in relation to filtering out noisy observations. However, they may exhibit several limitations when applied in a distributed setup. In multi-player weakly-acyclic games, and when each player applies an independent copy of the learning dynamics, convergence to (usually desirable) pure Nash equilibria cannot be guaranteed. Prior work has only focused on a small class of games, namely potential and coordination games. To address this main limitation, this paper introduces a novel payoff-based learning scheme for distributed optimization, namely aspiration-based perturbed learning automata (APLA). In this class of dynamics, and contrary to standard reinforcement-based learning schemes, each player’s probability distribution for selecting actions is reinforced both by repeated selection and an aspiration factor that captures the player’s satisfaction level. We provide a stochastic stability analysis of APLA in multi-player positive-utility games under the presence of noisy observations. This is the first part of the paper that characterizes stochastic stability in generic non-zero-sum games by establishing equivalence of the induced infinite-dimensional Markov chain with a finite dimensional one. In the second part, stochastic stability is further specialized to weakly acyclic games.

[801] Loss Given Default Prediction Under Measurement-Induced Mixture Distributions: An Information-Theoretic Approach

Javier Marín

Main category: cs.LG

TL;DR: LGD models fail with traditional methods due to proxy data contamination, but information-theoretic approaches achieve better performance and reveal leverage is more important than size for recovery predictions.

DetailsMotivation: Address the fundamental data quality issue in LGD modeling where 90% of training data consists of proxy estimates rather than actual recovery outcomes, causing systematic failures in traditional methods.

Method: Used information-theoretic approaches based on Shannon entropy and mutual information to analyze 1,218 corporate bankruptcies (1980-2023), comparing them against traditional recursive partitioning methods like Random Forest.

Result: Information-theoretic methods achieved r-squared of 0.191 and RMSE of 0.284, while Random Forest performed worse than predicting the mean with negative r-squared (-0.664). Leverage-based features contained 1.510 bits of mutual information vs only 0.086 bits for size effects.

Conclusion: Information-theoretic approaches provide superior generalization for LGD modeling under data contamination, with practical guidance for Basel III compliance. Findings generalize to other domains with extended observation periods creating mixture structure in training data.

Abstract: Loss Given Default (LGD) modeling faces a fundamental data quality constraint: 90% of available training data consists of proxy estimates based on pre-distress balance sheets rather than actual recovery outcomes from completed bankruptcy proceedings. We demonstrate that this mixture-contaminated training structure causes systematic failure of recursive partitioning methods, with Random Forest achieving negative r-squared (-0.664, worse than predicting the mean) on held-out test data. Information-theoretic approaches based on Shannon entropy and mutual information provide superior generalization, achieving r-squared of 0.191 and RMSE of 0.284 on 1,218 corporate bankruptcies (1980-2023). Analysis reveals that leverage-based features contain 1.510 bits of mutual information while size effects contribute only 0.086 bits, contradicting regulatory assumptions about scale-dependent recovery. These results establish practical guidance for financial institutions deploying LGD models under Basel III requirements when representative outcome data is unavailable at sufficient scale. The findings generalize to medical outcomes research, climate forecasting, and technology reliability-domains where extended observation periods create unavoidable mixture structure in training data.

[802] Enhancing failure prediction in nuclear industry: Hybridization of knowledge- and data-driven techniques

Amaratou Mahamadou Saley, Thierry Moyaux, Aïcha Sekhari, Vincent Cheutet, Jean-Baptiste Danielou

Main category: cs.LG

TL;DR: Proposes a hybrid predictive maintenance methodology combining data-driven techniques with nuclear domain knowledge, significantly outperforming purely data-driven approaches in failure prediction for nuclear equipment.

DetailsMotivation: The convergence of IoT and Industry 4.0 in nuclear industry requires precise maintenance prediction, but purely data-driven methods are limited due to system complexity and need for domain expertise.

Method: Combines data-driven techniques with domain knowledge from nuclear equipment, highlighting limitations of purely data-driven approaches and demonstrating how knowledge enhances predictive model performance.

Result: Hybrid approach achieves 24-hour prediction horizon with 93.12% F1 score, compared to purely data-driven methods’ 3-hour horizon and 56.36% F1 score.

Conclusion: The hybrid methodology significantly outperforms purely data-driven methods in nuclear equipment failure prediction, demonstrating the critical importance of integrating domain knowledge with data-driven approaches in complex, sensitive domains like nuclear industry.

Abstract: The convergence of the Internet of Things (IoT) and Industry 4.0 has significantly enhanced data-driven methodologies within the nuclear industry, notably enhancing safety and economic efficiency. This advancement challenges the precise prediction of future maintenance needs for assets, which is crucial for reducing downtime and operational costs. However, the effectiveness of data-driven methodologies in the nuclear sector requires extensive domain knowledge due to the complexity of the systems involved. Thus, this paper proposes a novel predictive maintenance methodology that combines data-driven techniques with domain knowledge from a nuclear equipment. The methodological originality of this paper is located on two levels: highlighting the limitations of purely data-driven approaches and demonstrating the importance of knowledge in enhancing the performance of the predictive models. The applicative novelty of this work lies in its use within a domain such as a nuclear industry, which is highly restricted and ultrasensitive due to security, economic and environmental concerns. A detailed real-world case study which compares the current state of equipment monitoring with two scenarios, demonstrate that the methodology significantly outperforms purely data-driven methods in failure prediction. While purely data-driven methods achieve only a modest performance with a prediction horizon limited to 3 h and a F1 score of 56.36%, the hybrid approach increases the prediction horizon to 24 h and achieves a higher F1 score of 93.12%.

[803] Convergence of Multiagent Learning Systems for Traffic control

Sayambhu Sen, Shalabh Bhatnagar

Main category: cs.LG

TL;DR: Theoretical analysis proves convergence of multi-agent reinforcement learning for traffic signal control, extending single-agent convergence proofs to the cooperative multi-agent setting.

DetailsMotivation: Prior empirical work showed MARL's effectiveness for traffic control, but lacked rigorous theoretical analysis of stability and convergence properties.

Method: Used stochastic approximation methods to formally analyze learning dynamics, extending single-agent convergence proofs for asynchronous value iteration to the multi-agent case.

Result: Proved that the specific multi-agent reinforcement learning algorithm for traffic control converges under given conditions.

Conclusion: Bridged the theoretical gap by providing formal convergence proof for MARL in traffic signal control, establishing its theoretical foundation.

Abstract: Rapid urbanization in cities like Bangalore has led to severe traffic congestion, making efficient Traffic Signal Control (TSC) essential. Multi-Agent Reinforcement Learning (MARL), often modeling each traffic signal as an independent agent using Q-learning, has emerged as a promising strategy to reduce average commuter delays. While prior work Prashant L A et. al has empirically demonstrated the effectiveness of this approach, a rigorous theoretical analysis of its stability and convergence properties in the context of traffic control has not been explored. This paper bridges that gap by focusing squarely on the theoretical basis of this multi-agent algorithm. We investigate the convergence problem inherent in using independent learners for the cooperative TSC task. Utilizing stochastic approximation methods, we formally analyze the learning dynamics. The primary contribution of this work is the proof that the specific multi-agent reinforcement learning algorithm for traffic control is proven to converge under the given conditions extending it from single agent convergence proofs for asynchronous value iteration.

[804] Clustering-Based Weight Orthogonalization for Stabilizing Deep Reinforcement Learning

Guoqing Ma, Yuhan Zhang, Yuming Dai, Guangfu Hao, Yang Chen, Shan Yu

Main category: cs.LG

TL;DR: COWM layer improves RL efficiency by addressing non-stationarity through clustering and projection techniques, achieving 9-12.6% performance gains on benchmarks.

DetailsMotivation: RL agents struggle with non-stationary environments, requiring millions of iterations and resulting in low sample efficiency.

Method: Introduces Clustering Orthogonal Weight Modified (COWM) layer that integrates into policy networks, using clustering and projection matrix to stabilize learning.

Result: Achieves 9% improvement on vision-based and 12.6% on state-based DMControl benchmarks, outperforms state-of-the-art methods with robustness across algorithms.

Conclusion: COWM layer effectively mitigates non-stationarity in RL, improving learning speed and efficiency while reducing gradient interference.

Abstract: Reinforcement learning (RL) has made significant advancements, achieving superhuman performance in various tasks. However, RL agents often operate under the assumption of environmental stationarity, which poses a great challenge to learning efficiency since many environments are inherently non-stationary. This non-stationarity results in the requirement of millions of iterations, leading to low sample efficiency. To address this issue, we introduce the Clustering Orthogonal Weight Modified (COWM) layer, which can be integrated into the policy network of any RL algorithm and mitigate non-stationarity effectively. The COWM layer stabilizes the learning process by employing clustering techniques and a projection matrix. Our approach not only improves learning speed but also reduces gradient interference, thereby enhancing the overall learning efficiency. Empirically, the COWM outperforms state-of-the-art methods and achieves improvements of 9% and 12.6% in vision based and state-based DMControl benchmark. It also shows robustness and generality across various algorithms and tasks.

[805] Small Vocabularies, Big Gains: Pretraining and Tokenization in Time Series Models

Alexis Roger, Gwen Legate, Kashif Rasul, Yuriy Nevmyvaka, Irina Rish

Main category: cs.LG

TL;DR: Tokenization design (scaling and quantization) affects model capacity and stability, while transfer learning impacts optimization efficiency. Well-designed tokenizers combined with pretraining work best, especially with small vocabularies.

DetailsMotivation: To systematically study how tokenizer design and transfer learning affect time series foundation models for forecasting, as these are critical components for state-of-the-art performance.

Method: Combination of empirical training experiments and theoretical analyses examining tokenizer configurations (scaling and quantization strategies) and pretraining vs random initialization.

Result: Tokenizer configuration governs representational capacity and stability, while transfer learning influences optimization efficiency. Pretrained models leverage well-designed tokenizers more effectively, especially at smaller vocabulary sizes. Misaligned tokenization can diminish pretraining benefits.

Conclusion: Careful tokenization is crucial in time series modeling, and combining small efficient vocabularies with pretrained weights is particularly advantageous in multi-modal forecasting settings where vocabulary must be shared across modalities.

Abstract: Tokenization and transfer learning are two critical components in building state of the art time series foundation models for forecasting. In this work, we systematically study the effect of tokenizer design, specifically scaling and quantization strategies, on model performance, alongside the impact of pretraining versus random initialization. We show that tokenizer configuration primarily governs the representational capacity and stability of the model, while transfer learning influences optimization efficiency and alignment. Using a combination of empirical training experiments and theoretical analyses, we demonstrate that pretrained models consistently leverage well-designed tokenizers more effectively, particularly at smaller vocabulary sizes. Conversely, misaligned tokenization can diminish or even invert the benefits of pretraining. These findings highlight the importance of careful tokenization in time series modeling and suggest that combining small, efficient vocabularies with pretrained weights is especially advantageous in multi-modal forecasting settings, where the overall vocabulary must be shared across modalities. Our results provide concrete guidance for designing tokenizers and leveraging transfer learning in discrete representation learning for continuous signals.

[806] Early GVHD Prediction in Liver Transplantation via Multi-Modal Deep Learning on Imbalanced EHR Data

Yushan Jiang, Shuteng Niu, Dongjin Song, Yichen Wang, Jingna Feng, Xinyue Hu, Liu Yang, Cui Tao

Main category: cs.LG

TL;DR: Multi-modal deep learning framework for early prediction of graft-versus-host disease (GVHD) in liver transplantation using EHR data, achieving improved performance despite extreme class imbalance.

DetailsMotivation: GVHD is a rare but fatal complication in liver transplantation with high mortality. Early prediction enables timely intervention and improved patient outcomes.

Method: Multi-modal deep learning framework integrating four EHR modalities (demographics, lab tests, diagnoses, medications) with dynamic fusion, handling irregular records and extreme class imbalance through AUC-based optimization.

Result: Outperformed all baselines with AUC of 0.836, AUPRC of 0.157, recall of 0.768, and specificity of 0.803. Demonstrated effectiveness in capturing complementary information from different modalities.

Conclusion: The framework substantially improves early GVHD prediction by addressing heterogeneity and extreme class imbalance in real-world EHR data, showing promising results for clinical application.

Abstract: Graft-versus-host disease (GVHD) is a rare but often fatal complication in liver transplantation, with a very high mortality rate. By harnessing multi-modal deep learning methods to integrate heterogeneous and imbalanced electronic health records (EHR), we aim to advance early prediction of GVHD, paving the way for timely intervention and improved patient outcomes. In this study, we analyzed pre-transplant electronic health records (EHR) spanning the period before surgery for 2,100 liver transplantation patients, including 42 cases of graft-versus-host disease (GVHD), from a cohort treated at Mayo Clinic between 1992 and 2025. The dataset comprised four major modalities: patient demographics, laboratory tests, diagnoses, and medications. We developed a multi-modal deep learning framework that dynamically fuses these modalities, handles irregular records with missing values, and addresses extreme class imbalance through AUC-based optimization. The developed framework outperforms all single-modal and multi-modal machine learning baselines, achieving an AUC of 0.836, an AUPRC of 0.157, a recall of 0.768, and a specificity of 0.803. It also demonstrates the effectiveness of our approach in capturing complementary information from different modalities, leading to improved performance. Our multi-modal deep learning framework substantially improves existing approaches for early GVHD prediction. By effectively addressing the challenges of heterogeneity and extreme class imbalance in real-world EHR, it achieves accurate early prediction. Our proposed multi-modal deep learning method demonstrates promising results for early prediction of a GVHD in liver transplantation, despite the challenge of extremely imbalanced EHR data.

[807] MedFedPure: A Medical Federated Framework with MAE-based Detection and Diffusion Purification for Inference-Time Attacks

Mohammad Karami, Mohammad Reza Nemati, Aidin Kazemi, Ali Mikaeili Barzili, Hamid Azadegan, Behzad Moshiri

Main category: cs.LG

TL;DR: MedFedPure is a personalized federated learning defense framework that protects AI models for brain tumor detection from adversarial attacks during inference, combining personalized FL, masked autoencoder detection, and adaptive diffusion purification.

DetailsMotivation: AI models in federated medical settings are vulnerable to adversarial attacks that can cause misdiagnoses, and existing defenses struggle with decentralized data while privacy must be preserved.

Method: Combines personalized FL for institution-specific adaptation, Masked Autoencoder for detecting suspicious inputs, and adaptive diffusion-based purification to clean only flagged scans before classification.

Result: Significantly improved adversarial robustness from 49.50% to 87.33% under strong attacks while maintaining 97.67% clean accuracy on Br35H brain MRI dataset.

Conclusion: MedFedPure provides practical, real-time protection for deploying secure and privacy-preserving AI tools in clinical workflows without compromising accuracy.

Abstract: Artificial intelligence (AI) has shown great potential in medical imaging, particularly for brain tumor detection using Magnetic Resonance Imaging (MRI). However, the models remain vulnerable at inference time when they are trained collaboratively through Federated Learning (FL), an approach adopted to protect patient privacy. Adversarial attacks can subtly alter medical scans in ways invisible to the human eye yet powerful enough to mislead AI models, potentially causing serious misdiagnoses. Existing defenses often assume centralized data and struggle to cope with the decentralized and diverse nature of federated medical settings. In this work, we present MedFedPure, a personalized federated learning defense framework designed to protect diagnostic AI models at inference time without compromising privacy or accuracy. MedFedPure combines three key elements: (1) a personalized FL model that adapts to the unique data distribution of each institution; (2) a Masked Autoencoder (MAE) that detects suspicious inputs by exposing hidden perturbations; and (3) an adaptive diffusion-based purification module that selectively cleans only the flagged scans before classification. Together, these steps offer robust protection while preserving the integrity of normal, benign images. We evaluated MedFedPure on the Br35H brain MRI dataset. The results show a significant gain in adversarial robustness, improving performance from 49.50% to 87.33% under strong attacks, while maintaining a high clean accuracy of 97.67%. By operating locally and in real time during diagnosis, our framework provides a practical path to deploying secure, trustworthy, and privacy-preserving AI tools in clinical workflows. Index Terms: cancer, tumor detection, federated learning, masked autoencoder, diffusion, privacy

[808] On the Fundamental Limits of LLMs at Scale

Muhammad Ahmed Mohsin, Muhammad Umer, Ahsan Bilal, Zeeshan Memon, Muhammad Ibtsaam Qadir, Sagnik Bhattacharya, Hassan Rizwan, Abhiram R. Gorle, Maahe Zehra Kazmi, Ayesha Mohsin, Muhammad Usman Rafique, Zihao He, Pulkit Mehta, Muhammad Ali Jamshed, John M. Cioffi

Main category: cs.LG

TL;DR: This paper presents a unified theoretical framework that identifies five fundamental limitations of LLM scaling: hallucination, context compression, reasoning degradation, retrieval fragility, and multimodal misalignment, connecting them to foundational computational and information-theoretic constraints.

DetailsMotivation: Existing surveys describe LLM scaling limitations empirically but lack rigorous theoretical synthesis connecting them to fundamental limits of computation, information, and learning. This work aims to close that gap by providing a proof-informed framework.

Method: The authors develop a unified theoretical framework using computability theory, information theory, and statistical learning theory. They employ diagonalization arguments, analyze undecidable queries, examine finite description length constraints, and study geometric effects in attention mechanisms.

Result: The framework demonstrates irreducible error due to computability limits, information-theoretic bounds on accuracy, context compression effects, and inherent limitations in reasoning, retrieval, and multimodal alignment. It provides theorems and empirical evidence showing where scaling helps, saturates, or cannot progress.

Conclusion: LLM scaling faces fundamental theoretical ceilings rooted in computation, information, and learning theory. The paper outlines practical mitigation strategies like bounded-oracle retrieval, positional curricula, and sparse attention, while establishing theoretical foundations for understanding scaling limitations.

Abstract: Large Language Models (LLMs) have benefited enormously from scaling, yet these gains are bounded by five fundamental limitations: (1) hallucination, (2) context compression, (3) reasoning degradation, (4) retrieval fragility, and (5) multimodal misalignment. While existing surveys describe these phenomena empirically, they lack a rigorous theoretical synthesis connecting them to the foundational limits of computation, information, and learning. This work closes that gap by presenting a unified, proof-informed framework that formalizes the innate theoretical ceilings of LLM scaling. First, computability and uncomputability imply an irreducible residue of error: for any computably enumerable model family, diagonalization guarantees inputs on which some model must fail, and undecidable queries (e.g., halting-style tasks) induce infinite failure sets for all computable predictors. Second, information-theoretic and statistical constraints bound attainable accuracy even on decidable tasks, finite description length enforces compression error, and long-tail factual knowledge requires prohibitive sample complexity. Third, geometric and computational effects compress long contexts far below their nominal size due to positional under-training, encoding attenuation, and softmax crowding. We further show how likelihood-based training favors pattern completion over inference, how retrieval under token limits suffers from semantic drift and coupling noise, and how multimodal scaling inherits shallow cross-modal alignment. Across sections, we pair theorems and empirical evidence to outline where scaling helps, where it saturates, and where it cannot progress, providing both theoretical foundations and practical mitigation paths like bounded-oracle retrieval, positional curricula, and sparse or hierarchical attention.

[809] SA-EMO: Structure-Aligned Encoder Mixture of Operators for Generalizable Full-waveform Inversion

Wang Zhenyu, Li Peiyuan, Shi Yongxiang, Wu Ruoyu, Zhang Lei

Main category: cs.LG

TL;DR: SA-EMO architecture combines structure-aligned encoding with multiple neural operators for improved velocity-field inversion in unknown geological settings.

DetailsMotivation: Traditional single CNN/operator methods struggle with generalization in complex geology and fail to distinguish diverse geological types, limiting FWI effectiveness.

Method: Structure-aligned encoder maps wavefields to latent space, then adaptive routing selects/fuses multiple neural operators (spectral, wavelet, multiscale, local) for velocity prediction.

Result: 58.443% MAE reduction and 10.308% boundary resolution improvement over traditional methods on OpenFWI and Marmousi2 benchmarks.

Conclusion: SA-EMO provides efficient, scalable, and physically interpretable full-waveform inversion paradigm with significant performance gains.

Abstract: Full-waveform inversion (FWI) can produce high-resolution subsurface models, yet it remains inherently ill-posed, highly nonlinear, and computationally intensive. Although recent deep learning and numerical acceleration methods have improved speed and scalability, they often rely on single CNN architectures or single neural operators, which struggle to generalize in unknown or complex geological settings and are ineffective at distinguishing diverse geological types. To address these issues, we propose a Structure-Aligned Encoder-Mixture-of-Operators (SA-EMO) architecture for velocity-field inversion under unknown subsurface structures. First, a structure-aligned encoder maps high-dimensional seismic wavefields into a physically consistent latent space, thereby eliminating spatio-temporal mismatch between the waveform and velocity domains, recovering high-frequency components, and enhancing feature generalization. Then, an adaptive routing mechanism selects and fuses multiple neural-operator experts, including spectral, wavelet, multiscale, and local operators, to predict the velocity model. We systematically evaluate our approach on the OpenFWI benchmark and the Marmousi2 dataset. Results show that SA-EMO significantly outperforms traditional CNN or single-operator methods, achieving an average MAE reduction of approximately 58.443% and an improvement in boundary resolution of about 10.308%. Ablation studies further reveal that the structure-aligned encoder, the expert-fusion mechanism, and the routing module each contribute markedly to the performance gains. This work introduces a new paradigm for efficient, scalable, and physically interpretable full-waveform inversion.

[810] Transformer-Based Scalable Multi-Agent Reinforcement Learning for Networked Systems with Long-Range Interactions

Vidur Sinha, Muhammed Ustaomeroglu, Guannan Qu

Main category: cs.LG

TL;DR: STACCA is a transformer-based MARL framework that addresses long-range dependency modeling and network generalization challenges in large-scale network control tasks.

DetailsMotivation: Existing MARL methods struggle with capturing long-range dependencies (like cascading failures) and lack generalizability across different network topologies, requiring retraining for each new graph.

Method: Uses a centralized Graph Transformer Critic for long-range dependency modeling and system-level feedback, a shared Graph Transformer Actor for generalizable policies across networks, and a novel counterfactual advantage estimator for improved credit assignment.

Result: Demonstrates improved performance, network generalization, and scalability on epidemic containment and rumor-spreading network control tasks compared to existing methods.

Conclusion: Transformer-based MARL architectures like STACCA show strong potential for achieving scalable and generalizable control in large-scale networked systems.

Abstract: Multi-agent reinforcement learning (MARL) has shown promise for large-scale network control, yet existing methods face two major limitations. First, they typically rely on assumptions leading to decay properties of local agent interactions, limiting their ability to capture long-range dependencies such as cascading power failures or epidemic outbreaks. Second, most approaches lack generalizability across network topologies, requiring retraining when applied to new graphs. We introduce STACCA (Shared Transformer Actor-Critic with Counterfactual Advantage), a unified transformer-based MARL framework that addresses both challenges. STACCA employs a centralized Graph Transformer Critic to model long-range dependencies and provide system-level feedback, while its shared Graph Transformer Actor learns a generalizable policy capable of adapting across diverse network structures. Further, to improve credit assignment during training, STACCA integrates a novel counterfactual advantage estimator that is compatible with state-value critic estimates. We evaluate STACCA on epidemic containment and rumor-spreading network control tasks, demonstrating improved performance, network generalization, and scalability. These results highlight the potential of transformer-based MARL architectures to achieve scalable and generalizable control in large-scale networked systems.

[811] Global Feature Enhancing and Fusion Framework for Strain Gauge Time Series Classification

Xu Zhang, Peng Wang, Chen Wang, Zhe Xu, Xiaohua Nie, Wei Wang

Main category: cs.LG

TL;DR: Proposes a hypergraph-based framework to learn and fuse global features for strain gauge status recognition, addressing CNN limitations in capturing global patterns when local subsequences are similar.

DetailsMotivation: Current CNN-based time series classification methods focus on local features but struggle with global feature extraction, which is crucial when local subsequences between different time series are very similar, as in aircraft wing strain gauge data.

Method: A hypergraph-based global feature learning and fusion framework that constructs global features through feature engineering and learns high-order relationships between local features to capture global patterns for enhanced time series representation.

Result: The method shows better generalization for unseen data in strain gauge status recognition and performs well on both industrial SGS and public UCR datasets.

Conclusion: The proposed hypergraph framework effectively learns and fuses global features to improve strain gauge status recognition accuracy, addressing the limitations of CNN-based approaches in capturing comprehensive time series representations.

Abstract: Strain Gauge Status (SGS) recognition is crucial in the field of intelligent manufacturing based on the Internet of Things, as accurate identification helps timely detection of failed mechanical components, avoiding accidents. The loading and unloading sequences generated by strain gauges can be identified through time series classification (TSC) algorithms. Recently, deep learning models, e.g., convolutional neural networks (CNNs) have shown remarkable success in the TSC task, as they can extract discriminative local features from the subsequences to identify the time series. However, we observe that only the local features may not be sufficient for expressing the time series, especially when the local sub-sequences between different time series are very similar, e.g., SGS data of aircraft wings in static strength experiments. Nevertheless, CNNs suffer from the limitation in extracting global features due to the nature of convolution operations. For extracting global features to more comprehensively represent the SGS time series, we propose two insights: (i) Constructing global features through feature engineering. (ii) Learning high-order relationships between local features to capture global features. To realize and utilize them, we propose a hypergraph-based global feature learning and fusion framework, which learns and fuses global features for semantic consistency to enhance the representation of SGS time series, thereby improving recognition accuracy. Our method designs are validated on industrial SGS and public UCR datasets, showing better generalization for unseen data in SGS recognition.

[812] Predicting Grain Growth in Polycrystalline Materials Using Deep Learning Time Series Models

Eliane Younes, Elie Hachem, Marc Bernacki

Main category: cs.LG

TL;DR: Deep learning models (RNN, LSTM, TCN, transformers) were evaluated for predicting grain size distributions during grain growth. LSTM achieved highest accuracy (>90%) and stability, reducing computation time from 20 minutes to seconds.

DetailsMotivation: Grain growth strongly influences mechanical behavior of materials, making prediction crucial for microstructural engineering. Full-field simulations are computationally demanding, so efficient alternatives are needed.

Method: Used mean-field statistical descriptors from high-fidelity simulations. Processed 120 grain growth sequences into normalized grain size distributions over time. Trained models to predict future distributions from short temporal history using recursive forecasting strategy.

Result: LSTM network achieved highest accuracy (above 90%) and most stable performance, maintaining physically consistent predictions over extended horizons. Reduced computation time from about 20 minutes per sequence to only a few seconds. Other architectures tended to diverge when forecasting further in time.

Conclusion: LSTM-based forecasting with low-dimensional descriptors shows potential for efficient and accurate microstructure prediction, with implications for digital twin development and process optimization.

Abstract: Grain Growth strongly influences the mechanical behavior of materials, making its prediction a key objective in microstructural engineering. In this study, several deep learning approaches were evaluated, including recurrent neural networks (RNN), long short-term memory (LSTM), temporal convolutional networks (TCN), and transformers, to forecast grain size distributions during grain growth. Unlike full-field simulations, which are computationally demanding, the present work relies on mean-field statistical descriptors extracted from high-fidelity simulations. A dataset of 120 grain growth sequences was processed into normalized grain size distributions as a function of time. The models were trained to predict future distributions from a short temporal history using a recursive forecasting strategy. Among the tested models, the LSTM network achieved the highest accuracy (above 90%) and the most stable performance, maintaining physically consistent predictions over extended horizons while reducing computation time from about 20 minutes per sequence to only a few seconds, whereas the other architectures tended to diverge when forecasting further in time. These results highlight the potential of low-dimensional descriptors and LSTM-based forecasting for efficient and accurate microstructure prediction, with direct implications for digital twin development and process optimization.

[813] KForge: Program Synthesis for Diverse AI Hardware Accelerators

Taras Sereda, Tom St. John, Burak Bartan, Natalie Serrino, Sachin Katti, Zain Asgar

Main category: cs.LG

TL;DR: KForge is a platform-agnostic framework that uses two collaborative LLM agents to optimize GPU kernels across diverse accelerators through iterative refinement and performance analysis.

DetailsMotivation: GPU kernels are essential for ML performance but challenging to optimize across different hardware accelerators, requiring a flexible approach that can adapt to various platforms.

Method: Uses two LLM-based agents: a generation agent that produces and refines programs through compilation/correctness feedback, and a performance analysis agent that interprets profiling data to guide optimization. Requires only single-shot examples for new platforms.

Result: Demonstrated effective cross-platform knowledge transfer where reference implementations from one architecture improve generation quality for different hardware, and validated platform-agnostic synthesis across NVIDIA CUDA and Apple Metal.

Conclusion: The agent-based architecture enables platform-agnostic GPU kernel optimization with minimal platform-specific examples, successfully bridging fundamentally different parallel computing platforms.

Abstract: GPU kernels are critical for ML performance but difficult to optimize across diverse accelerators. We present KForge, a platform-agnostic framework built on two collaborative LLM-based agents: a generation agent that produces and iteratively refines programs through compilation and correctness feedback, and a performance analysis agent that interprets profiling data to guide optimization. This agent-based architecture requires only a single-shot example to target new platforms. We make three key contributions: (1) introducing an iterative refinement system where the generation agent and performance analysis agent collaborate through functional and optimization passes, interpreting diverse profiling data (from programmatic APIs to GUI-based tools) to generate actionable recommendations that guide program synthesis for arbitrary accelerators; (2) demonstrating that the generation agent effectively leverages cross-platform knowledge transfer, where a reference implementation from one architecture substantially improves generation quality for different hardware targets; and (3) validating the platform-agnostic nature of our approach by demonstrating effective program synthesis across fundamentally different parallel computing platforms: NVIDIA CUDA and Apple Metal.

[814] Toward Better Generalization in Few-Shot Learning through the Meta-Component Combination

Qiuhao Zeng

Main category: cs.LG

TL;DR: The paper proposes a meta-learning method that learns classifiers as combinations of diverse meta-components to improve generalization in few-shot learning.

DetailsMotivation: Metric-based meta-learning for few-shot learning often overfits to seen classes and fails to generalize well to unseen classes due to dependency on deep metrics learned from limited seen class data.

Method: The method learns classifiers as combinations of meta-components across meta-learning episodes, using orthogonal regularization to disentangle and diversify meta-components to capture various shared substructures among classifiers.

Result: Extensive experiments on few-shot benchmark tasks demonstrate superior performance compared to existing methods.

Conclusion: The proposed meta-learning approach with disentangled meta-components effectively improves generalization in few-shot learning by capturing diverse shared substructures across classifiers.

Abstract: In few-shot learning, classifiers are expected to generalize to unseen classes given only a small number of instances of each new class. One of the popular solutions to few-shot learning is metric-based meta-learning. However, it highly depends on the deep metric learned on seen classes, which may overfit to seen classes and fail to generalize well on unseen classes. To improve the generalization, we explore the substructures of classifiers and propose a novel meta-learning algorithm to learn each classifier as a combination of meta-components. Meta-components are learned across meta-learning episodes on seen classes and disentangled by imposing an orthogonal regularizer to promote its diversity and capture various shared substructures among different classifiers. Extensive experiments on few-shot benchmark tasks show superior performances of the proposed method.

[815] An Explainable and Fair AI Tool for PCOS Risk Assessment: Calibration, Subgroup Equity, and Interactive Clinical Deployment

Asma Sadia Khan, Sadia Tabassum

Main category: cs.LG

TL;DR: A fairness-audited ML framework for PCOS prediction that balances accuracy, calibration, and interpretability while identifying age-related diagnostic disparities.

DetailsMotivation: To develop a fair and interpretable ML system for PCOS diagnosis that addresses diagnostic disparities across patient subgroups and bridges the gap between AI research and clinical usability.

Method: Integrated SHAP-based feature attributions with demographic audits, trained Random Forest, SVM, and XGBoost models with isotonic and Platt scaling for calibration, and used probabilistic calibration metrics (Brier Score, ECE) for fairness evaluation.

Result: Calibrated Random Forest achieved 90.8% accuracy, identified key features (follicle count, weight gain, menstrual irregularity), revealed age-related disparities (90.9% accuracy for 25-35 vs 69.2% for under 25), and maintained robustness across phenotypes with perfect precision in obese women.

Conclusion: The framework successfully balances predictive performance, calibration, and interpretability while identifying actionable disparities, with a web interface enabling clinical translation of AI research for PCOS diagnosis.

Abstract: This paper presents a fairness-audited and interpretable machine learning framework for predicting polycystic ovary syndrome (PCOS), designed to evaluate model performance and identify diagnostic disparities across patient subgroups. The framework integrated SHAP-based feature attributions with demographic audits to connect predictive explanations with observed disparities for actionable insights. Probabilistic calibration metrics (Brier Score and Expected Calibration Error) are incorporated to ensure reliable risk predictions across subgroups. Random Forest, SVM, and XGBoost models were trained with isotonic and Platt scaling for calibration and fairness comparison. A calibrated Random Forest achieved a high predictive accuracy of 90.8%. SHAP analysis identified follicle count, weight gain, and menstrual irregularity as the most influential features, which are consistent with the Rotterdam diagnostic criteria. Although the SVM with isotonic calibration achieved the lowest calibration error (ECE = 0.0541), the Random Forest model provided a better balance between calibration and interpretability (Brier = 0.0678, ECE = 0.0666). Therefore, it was selected for detailed fairness and SHAP analyses. Subgroup analysis revealed that the model performed best among women aged 25-35 (accuracy 90.9%) but underperformed in those under 25 (69.2%), highlighting age-related disparities. The model achieved perfect precision in obese women and maintained high recall in lean PCOS cases, demonstrating robustness across phenotypes. Finally, a Streamlit-based web interface enables real-time PCOS risk assessment, Rotterdam criteria evaluation, and interactive ‘what-if’ analysis, bridging the gap between AI research and clinical usability.

[816] Enhancing PINN Accuracy for the RLW Equation: Adaptive and Conservative Approaches

Aamir Shehzad

Main category: cs.LG

TL;DR: Improved PINN approaches (adaptive and conservative) were developed for solving the regularized long wave equation, showing that effectiveness is problem-specific and conservation enforcement may harm performance in highly nonlinear systems.

DetailsMotivation: Standard PINNs produce large errors for the regularized long wave equation, requiring improved approaches to handle different types of problems effectively.

Method: Developed two PINN approaches: adaptive with self-adaptive loss weighting and conservative enforcing explicit conservation laws. Tested on three benchmarks: single soliton propagation, two-soliton interaction, and undular bore evolution.

Result: Adaptive PINN performed better for complex nonlinear interactions (colliding solitons), while conservative PINN was better for long-term behavior (single solitons and undular bores). Results were within O(10^-5) of established numerical solutions.

Conclusion: PINN effectiveness is problem-specific; conservation enforcement may be harmful for highly nonlinear systems. PINNs can provide accurate mesh-free solutions to complex PDEs, challenging assumptions about conservation enforcement always improving performance.

Abstract: Standard physics-informed neural network implementations have produced large error rates when using these models to solve the regularized long wave (RLW) equation. Two improved PINN approaches were developed in this research: an adaptive approach with self-adaptive loss weighting and a conservative approach enforcing explicit conservation laws. Three benchmark tests were used to demonstrate how effective PINN’s are as they relate to the type of problem being solved (i.e., time dependent RLW equation). The first was a single soliton traveling along a line (propagation), the second was the interaction between two solitons, and the third was the evolution of an undular bore over the course of $t=250$. The results demonstrated that the effectiveness of PINNs are problem specific. The adaptive PINN was significantly better than both the conservative PINN and the standard PINN at solving problems involving complex nonlinear interactions such as colliding two solitons. The conservative approach was significantly better at solving problems involving long term behavior of single solitons and undular bores. However, the most important finding from this research is that explicitly enforcing conservation laws may be harmful to optimizing the solution of highly nonlinear systems of equations and therefore requires special training methods. The results from our adaptive and conservative approaches were within $O(10^{-5})$ of established numerical solutions for the same problem, thus demonstrating that PINNs can provide accurate solutions to complex systems of partial differential equations without the need for a discretization of space or time (mesh free). Moreover, the finding from this research challenges the assumptions that conservation enforcement will always improve the performance of a PINN and provides researchers with guidelines for designing PINNs for use on specific types of problems.

[817] EcoSpa: Efficient Transformer Training with Coupled Sparsity

Jinqi Xiao, Cheng Luo, Lingyi Huang, Cheng Yang, Yang Sui, Huy Phan, Xiao Zang, Yibiao Ying, Zhexiang Tang, Anima Anandkumar, Bo Yuan

Main category: cs.LG

TL;DR: EcoSpa is a structured sparse training method that jointly sparsifies coupled weight matrix pairs in transformers, preserving interaction patterns to enable efficient training with 50% memory reduction, 21% faster training, and 1.6x inference speedup.

DetailsMotivation: Transformers have high computational demands, and existing sparse training methods fail to preserve critical structural relationships between weight matrices that interact multiplicatively, leading to performance degradation at high sparsity levels.

Method: Jointly evaluates and sparsifies coupled weight matrix pairs through aligned row/column removal, introduces new granularity for calibrating structural component importance, and performs coupled estimation and sparsification across pre-training and fine-tuning scenarios.

Result: Enables efficient training of LLaMA-1B with 50% memory reduction and 21% faster training, achieves 2.2x model compression on GPT-2-Medium with 2.4 lower perplexity, and delivers 1.6x inference speedup.

Conclusion: EcoSpa provides an efficient transformer training approach using standard PyTorch operations without requiring custom hardware or kernels, making efficient transformer training accessible on commodity hardware.

Abstract: Transformers have become the backbone of modern AI, yet their high computational demands pose critical system challenges. While sparse training offers efficiency gains, existing methods fail to preserve critical structural relationships between weight matrices that interact multiplicatively in attention and feed-forward layers. This oversight leads to performance degradation at high sparsity levels. We introduce EcoSpa, an efficient structured sparse training method that jointly evaluates and sparsifies coupled weight matrix pairs, preserving their interaction patterns through aligned row/column removal. EcoSpa introduces a new granularity for calibrating structural component importance and performs coupled estimation and sparsification across both pre-training and fine-tuning scenarios. Evaluations demonstrate substantial improvements: EcoSpa enables efficient training of LLaMA-1B with 50% memory reduction and 21% faster training, achieves $2.2\times$ model compression on GPT-2-Medium with $2.4$ lower perplexity, and delivers $1.6\times$ inference speedup. The approach uses standard PyTorch operations, requiring no custom hardware or kernels, making efficient transformer training accessible on commodity hardware.

[818] A Deep Learning Model to Predicting Changes in Consumer Attributes for New Line-extended Products

Li Yinxing, Tsukasa Ishigaki

Main category: cs.LG

TL;DR: A deep learning model (CTVAE) predicts consumer attribute changes for new product line extensions using synthetic data generation from tabular consumer-product data.

DetailsMotivation: To help marketers understand key consumer attributes for new line-extended products before market entry, avoiding brand image disruption from excessive extensions.

Method: Proposed Conditional Tabular Variational Auto-Encoder (CTVAE) generates synthetic data from large-scale tabular consumer and product data to predict consumer attribute changes.

Result: CTVAE demonstrates superior prediction performance compared to existing models and provides implications for effective product line marketing strategies.

Conclusion: The approach helps avoid cannibalization and supports designing product images and marketing strategies for line extensions.

Abstract: Product line extension is a marketing strategy that enhances a company’s sphere of influence. Because excessive line extensions disrupt brand image, only appropriate line extensions based on consumer needs are desirable. Marketers should know the key consumer attributes of the primary customers for new line-extended products before companies enter the market. This paper describes a method for predicting changes in consumer attributes for new line-extended products using a novel deep learning model. The proposed model, Conditional Tabular Variational Auto-Encoder (CTVAE), generates synthetic data from large-scale tabular data of consumers and products. It can provide various implications about effective product line marketing for marketers. The experimental results demonstrate that the CTVAE offers superior prediction performance than existing models. We indicate implications for new products that change containers or flavors for effective product line marketing. The proposed approach has the potential to contribute to avoiding cannibalization and to designing product images and marketing strategies.

[819] Environment-Aware Transfer Reinforcement Learning for Sustainable Beam Selection

Dariush Salami, Ramin Hashemi, Parham Kazemi, Mikko A. Uusitalo

Main category: cs.LG

TL;DR: This paper proposes a transfer learning approach using point cloud modeling and Chamfer distance to identify structurally similar environments, enabling reuse of pre-trained RL models for beam selection in 5G networks, achieving 16x reduction in training time and computational overhead.

DetailsMotivation: Traditional RL-based beam selection requires extensive training time and computational resources in diverse environments, posing challenges for scalability and energy efficiency in 5G networks.

Method: Model environment as point clouds representing gNodeBs and scatterers, compute Chamfer distance to identify similar environments, and apply transfer learning to reuse pre-trained RL models.

Result: 16x reduction in training time and computational overhead while maintaining high performance, contributing to energy efficiency and reduced carbon emissions.

Conclusion: The approach enables scalable, adaptive, and environmentally conscious RL-based beam selection strategies, supporting green AI deployment in wireless systems with faster time-to-deployment.

Abstract: This paper presents a novel and sustainable approach for improving beam selection in 5G and beyond networks using transfer learning and Reinforcement Learning (RL). Traditional RL-based beam selection models require extensive training time and computational resources, particularly when deployed in diverse environments with varying propagation characteristics posing a major challenge for scalability and energy efficiency. To address this, we propose modeling the environment as a point cloud, where each point represents the locations of gNodeBs (gNBs) and surrounding scatterers. By computing the Chamfer distance between point clouds, structurally similar environments can be efficiently identified, enabling the reuse of pre-trained models through transfer learning. This methodology leads to a 16x reduction in training time and computational overhead, directly contributing to energy efficiency. By minimizing the need for retraining in each new deployment, our approach significantly lowers power consumption and supports the development of green and sustainable Artificial Intelligence (AI) in wireless systems. Furthermore, it accelerates time-to-deployment, reduces carbon emissions associated with training, and enhances the viability of deploying AI-driven communication systems at the edge. Simulation results confirm that our approach maintains high performance while drastically cutting energy costs, demonstrating the potential of transfer learning to enable scalable, adaptive, and environmentally conscious RL-based beam selection strategies in dynamic and diverse propagation environments.

[820] Lightweight Time Series Data Valuation on Time Series Foundation Models via In-Context Finetuning

Shunyu Wu, Tianyue Li, Yixuan Leng, Jingyi Suo, Jian Lou, Dan Li, See-Kiong Ng

Main category: cs.LG

TL;DR: LTSV is a lightweight method for valuing time series data in foundation models using in-context finetuning to approximate influence functions while capturing temporal dependencies.

DetailsMotivation: Traditional data valuation methods face computational bottlenecks with large time series foundation models and fail to preserve temporal dependencies, making accurate and efficient data valuation challenging.

Method: LTSV estimates sample contributions by measuring context loss changes after in-context finetuning, uses temporal block aggregation to capture dependencies across overlapping time windows, and leverages TSFM generalization capabilities.

Result: Experiments across multiple datasets and models show LTSV provides reliable valuation performance while maintaining manageable computational requirements.

Conclusion: In-context finetuning on time series foundation models offers a practical bridge between data attribution and model generalization in time series learning.

Abstract: Time series foundation models (TSFMs) have demonstrated increasing capabilities due to their extensive pretraining on large volumes of diverse time series data. Consequently, the quality of time series data is crucial to TSFM performance, rendering an accurate and efficient data valuation of time series for TSFMs indispensable. However, traditional data valuation methods, such as influence functions, face severe computational bottlenecks due to their poor scalability with growing TSFM model sizes and often fail to preserve temporal dependencies. In this paper, we propose LTSV, a Lightweight Time Series Valuation on TSFMS via in-context finetuning. Grounded in the theoretical evidence that in-context finetuning approximates the influence function, LTSV estimates a sample’s contribution by measuring the change in context loss after in-context finetuning, leveraging the strong generalization capabilities of TSFMs to produce robust and transferable data valuations. To capture temporal dependencies, we introduce temporal block aggregation, which integrates per-block influence scores across overlapping time windows. Experiments across multiple time series datasets and models demonstrate that LTSV consistently provides reliable and strong valuation performance, while maintaining manageable computational requirements. Our results suggest that in-context finetuning on time series foundation models provides a practical and effective bridge between data attribution and model generalization in time series learning.

[821] Enhanced Water Leak Detection with Convolutional Neural Networks and One-Class Support Vector Machine

Daniele Ugo Leonzio, Paolo Bestagini, Marco Marcon, Stefano Tubaro

Main category: cs.LG

TL;DR: A data-driven leak detection method using water pressure measurements and one-class SVM trained on no-leak data, outperforming recent methods on simulated WDNs.

DetailsMotivation: Significant water loss occurs annually due to leaks in Water Distribution Networks (WDNs), highlighting the need for reliable leak detection and localization systems.

Method: Uses water pressure measurements from WDN nodes, a feature extractor, and one-class SVM trained exclusively on no-leak data to detect leaks as anomalies. Requires only WDN topology and pressure data from leak-free conditions.

Result: The proposed solution outperforms recent leak detection methods when tested on simulated datasets using the Modena WDN.

Conclusion: The data-driven approach using one-class SVM on pressure measurements provides an effective leak detection method that requires only topology information and no-leak baseline data.

Abstract: Water is a critical resource that must be managed efficiently. However, a substantial amount of water is lost each year due to leaks in Water Distribution Networks (WDNs). This underscores the need for reliable and effective leak detection and localization systems. In recent years, various solutions have been proposed, with data-driven approaches gaining increasing attention due to their superior performance. In this paper, we propose a new method for leak detection. The method is based on water pressure measurements acquired at a series of nodes of a WDN. Our technique is a fully data-driven solution that makes only use of the knowledge of the WDN topology, and a series of pressure data acquisitions obtained in absence of leaks. The proposed solution is based on an feature extractor and a one-class Support Vector Machines (SVM) trained on no-leak data, so that leaks are detected as anomalies. The results achieved on a simulate dataset using the Modena WDN demonstrate that the proposed solution outperforms recent methods for leak detection.

[822] Incomplete Depression Feature Selection with Missing EEG Channels

Zhijian Gong, Wenjia Dong, Xueyuan Xu, Fulin Wei, Chunyu Liu, Li Zhuo

Main category: cs.LG

TL;DR: Proposes IDFS-MEC, a feature selection method for EEG-based depression analysis that handles missing channels and reduces feature redundancy to improve detection accuracy.

DetailsMotivation: EEG features for depression analysis often contain redundant, irrelevant, and noisy information, and real-world EEG data frequently suffers from missing channels due to electrode detachment and noise interference.

Method: IDFS-MEC integrates missing-channel indicator information and adaptive channel weighting into orthogonal regression to handle incomplete channels, and uses global redundancy minimization to reduce feature redundancy in selected subsets.

Result: Extensive experiments on MODMA and PRED-d003 datasets show IDFS-MEC outperforms 10 popular feature selection methods across 3-, 64-, and 128-channel settings.

Conclusion: The proposed IDFS-MEC method effectively selects EEG feature subsets that are robust to missing channels and redundant information, achieving superior depression detection performance.

Abstract: As a critical mental health disorder, depression has severe effects on both human physical and mental well-being. Recent developments in EEG-based depression analysis have shown promise in improving depression detection accuracies. However, EEG features often contain redundant, irrelevant, and noisy information. Additionally, real-world EEG data acquisition frequently faces challenges, such as data loss from electrode detachment and heavy noise interference. To tackle the challenges, we propose a novel feature selection approach for robust depression analysis, called Incomplete Depression Feature Selection with Missing EEG Channels (IDFS-MEC). IDFS-MEC integrates missing-channel indicator information and adaptive channel weighting learning into orthogonal regression to lessen the effects of incomplete channels on model construction, and then utilizes global redundancy minimization learning to reduce redundant information among selected feature subsets. Extensive experiments conducted on MODMA and PRED-d003 datasets reveal that the EEG feature subsets chosen by IDFS-MEC have superior performance than 10 popular feature selection methods among 3-, 64-, and 128-channel settings.

[823] Riemannian Manifold Learning for Stackelberg Games with Neural Flow Representations

Larkin Liu, Kashif Rasul, Yutong Chao, Jalal Etesami

Main category: cs.LG

TL;DR: A novel online learning framework for Stackelberg games using neural normalizing flows to map joint action spaces to smooth Riemannian manifolds, enabling efficient linear bandit algorithms and theoretical regret guarantees.

DetailsMotivation: To address the challenge of online learning in Stackelberg general-sum games by creating tractable learning spaces through manifold learning, overcoming the complexity of sequential turn-based interactions between leader and follower agents.

Method: Uses neural normalizing flows to learn a diffeomorphism mapping the joint action space to a smooth spherical Riemannian manifold (Stackelberg manifold), creating isoplanar subspaces that enable linear bandit algorithms for online learning.

Result: Established theoretical bounds for regret minimization on the learned manifold and demonstrated empirical effectiveness compared to standard baselines in domains like cybersecurity and economic supply chain optimization.

Conclusion: Integration of manifold learning with game theory reveals neural normalizing flows as a powerful tool for multi-agent learning, providing efficient online learning with theoretical guarantees in Stackelberg games.

Abstract: We present a novel framework for online learning in Stackelberg general-sum games, where two agents, the leader and follower, engage in sequential turn-based interactions. At the core of this approach is a learned diffeomorphism that maps the joint action space to a smooth spherical Riemannian manifold, referred to as the Stackelberg manifold. This mapping, facilitated by neural normalizing flows, ensures the formation of tractable isoplanar subspaces, enabling efficient techniques for online learning. Leveraging the linearity of the agents’ reward functions on the Stackelberg manifold, our construct allows the application of linear bandit algorithms. We then provide a rigorous theoretical basis for regret minimization on the learned manifold and establish bounds on the simple regret for learning Stackelberg equilibrium. This integration of manifold learning into game theory uncovers a previously unrecognized potential for neural normalizing flows as an effective tool for multi-agent learning. We present empirical results demonstrating the effectiveness of our approach compared to standard baselines, with applications spanning domains such as cybersecurity and economic supply chain optimization.

[824] How many stations are sufficient? Exploring the effect of urban weather station density reduction on imputation accuracy of air temperature and humidity

Marvin Plein, Carsten F. Dormann, Andreas Christen

Main category: cs.LG

TL;DR: A step-wise station removal procedure can substantially reduce urban weather station network density from 42 to 4 stations while maintaining high predictive accuracy for air temperature and humidity patterns.

DetailsMotivation: Maintaining urban weather station networks is expensive and labor-intensive, creating a need for more efficient resource allocation while preserving monitoring capabilities.

Method: Step-wise station removal procedure to thin an existing WSN in Freiburg, Germany, analyzing the ability of WSN subsets to reproduce air temperature and humidity patterns of the original network.

Result: Reduction from 42 to 4 stations increased mean prediction RMSEs only slightly (0.69K to 0.83K for temperature, 3.8% to 4.4% for humidity). Stations at urban-rural edges were most valuable for reconstructing city-wide climate.

Conclusion: Substantial reductions in station numbers are possible while retaining high predictive accuracy, demonstrating potential for efficient resource allocation in urban climate research.

Abstract: Urban weather station networks (WSNs) are widely used to monitor urban weather and climate patterns and aid urban planning. However, maintaining WSNs is expensive and labor-intensive. Here, we present a step-wise station removal procedure to thin an existing WSN in Freiburg, Germany, and analyze the ability of WSN subsets to reproduce air temperature and humidity patterns of the entire original WSN for a year following a simulated reduction of WSN density. We found that substantial reductions in station numbers after one year of full deployment are possible while retaining high predictive accuracy. A reduction from 42 to 4 stations, for instance, increased mean prediction RMSEs from 0.69 K to 0.83 K for air temperature and from 3.8% to 4.4% for relative humidity, corresponding to RMSE increases of only 20% and 16%, respectively. Predictive accuracy is worse for remote stations in forests than for stations in built-up or open settings, but consistently better than a state-of-the-art numerical urban land-surface model (Surface Urban Energy and Water Balance Scheme). Stations located at the edges between built-up and rural areas are most valuable when reconstructing city-wide climate characteristics. Our study demonstrates the potential of thinning WSNs to maximize the efficient allocation of financial and personnel-related resources in urban climate research.

[825] On the Probabilistic Learnability of Compact Neural Network Preimage Bounds

Luca Marzari, Manuele Bicego, Ferdinando Cicalese, Alessandro Farinelli

Main category: cs.LG

TL;DR: RF-ProVe is a probabilistic method using random forests to compute high-confidence preimage approximations for neural networks, addressing the scalability limitations of exact methods.

DetailsMotivation: Existing provable methods for computing neural network preimage bounds face fundamental scalability limitations due to the #P-hardness of the problem, necessitating alternative approaches.

Method: RF-ProVe uses ensemble of randomized decision trees to generate candidate input regions satisfying output properties, refined through active resampling, with formal statistical guarantees on purity and coverage.

Result: The method provides practical, scalable solution for computing compact preimage approximations where exact solvers fail to scale, with high-confidence guarantees and bounded error.

Conclusion: Probabilistic bootstrap-based approaches offer viable alternative to exact methods for neural network verification, enabling scalable preimage analysis with statistical guarantees.

Abstract: Although recent provable methods have been developed to compute preimage bounds for neural networks, their scalability is fundamentally limited by the #P-hardness of the problem. In this work, we adopt a novel probabilistic perspective, aiming to deliver solutions with high-confidence guarantees and bounded error. To this end, we investigate the potential of bootstrap-based and randomized approaches that are capable of capturing complex patterns in high-dimensional spaces, including input regions where a given output property holds. In detail, we introduce $\textbf{R}$andom $\textbf{F}$orest $\textbf{Pro}$perty $\textbf{Ve}$rifier ($\texttt{RF-ProVe}$), a method that exploits an ensemble of randomized decision trees to generate candidate input regions satisfying a desired output property and refines them through active resampling. Our theoretical derivations offer formal statistical guarantees on region purity and global coverage, providing a practical, scalable solution for computing compact preimage approximations in cases where exact solvers fail to scale.

[826] SpecQuant: Spectral Decomposition and Adaptive Truncation for Ultra-Low-Bit LLMs Quantization

Zhixiong Zhao, Fangxin Liu, Junjie Wang, Chenyang Guan, Zongwu Wang, Li Jiang, Haibing Guan

Main category: cs.LG

TL;DR: SpecQuant is a two-stage Fourier-based framework for extreme LLM compression that achieves 4-bit quantization for both weights and activations with minimal accuracy loss.

DetailsMotivation: The emergence of accurate open LLMs has created demand for efficient quantization techniques to enable deployment on end-user devices, particularly targeting ultra-low-bit quantization for both activations and weights.

Method: Two-stage framework: 1) Smooth activation outliers and transfer them into weight matrix; 2) Apply channel-wise low-frequency Fourier truncation to suppress high-frequency components while preserving essential signal energy. Includes lightweight truncation module for runtime adaptability.

Result: On LLaMA-3 8B, achieves 4-bit quantization for both weights and activations, narrowing zero-shot accuracy gap to only 1.5% compared to full precision, with 2× faster inference and 3× lower memory usage.

Conclusion: Fourier frequency domain perspective provides effective approach for extreme LLM compression, enabling efficient deployment while maintaining high accuracy through frequency-based quantization techniques.

Abstract: The emergence of accurate open large language models (LLMs) has sparked a push for advanced quantization techniques to enable efficient deployment on end-user devices. In this paper, we revisit the challenge of extreme LLM compression – targeting ultra-low-bit quantization for both activations and weights – from a Fourier frequency domain perspective. We propose SpecQuant, a two-stage framework that tackles activation outliers and cross-channel variance. In the first stage, activation outliers are smoothed and transferred into the weight matrix to simplify downstream quantization. In the second stage, we apply channel-wise low-frequency Fourier truncation to suppress high-frequency components while preserving essential signal energy, improving quantization robustness. Our method builds on the principle that most of the weight energy is concentrated in low-frequency components, which can be retained with minimal impact on model accuracy. To enable runtime adaptability, we introduce a lightweight truncation module during inference that adjusts truncation thresholds based on channel characteristics. On LLaMA-3 8B, SpecQuant achieves 4-bit quantization for both weights and activations, narrowing the zero-shot accuracy gap to only 1.5% compared to full precision, while delivering 2 times faster inference and 3times lower memory usage.

[827] Clifford Algebraic Rotor Embeddings : Maybe embeddings should start to CARE

Sameeksha Sriram, Ayush Paliwal, Alexander S. Ecker, Chase van de Geijn

Main category: cs.LG

TL;DR: This paper introduces Quaternion Rotary Embeddings (QuatRo) and Clifford Algebraic Rotary Embeddings (CARE) as extensions to Rotary Positional Embeddings (RoPE), addressing the non-commutativity issues in existing spherical RoPE variants while enabling higher-dimensional positional encoding.

DetailsMotivation: To overcome the non-commutative nature and rotation sequence ambiguity in Spherical RoPE while extending RoPE's capabilities to higher-dimensional inputs and maintaining desirable properties like shift-equivariance.

Method: Proposes QuatRo using quaternions to parameterize rotation axes, then generalizes to CARE using Clifford algebra and geometric rotors acting on multivectors, allowing rotary embeddings in arbitrary dimensions with positional information encoded across multiple grades.

Result: Shows that Mixed RoPE and Spherical RoPE are special cases of QuatRo, and demonstrates that Clifford-based approaches enable rotary embeddings in arbitrary dimensions with multivector positional encoding.

Conclusion: The quaternion and Clifford algebra approaches provide principled extensions to RoPE that overcome commutativity issues while enabling more flexible and higher-dimensional positional encoding schemes.

Abstract: Rotary Positional Embeddings (RoPE) have demonstrated exceptional performance as a positional encoding method, consistently outperforming their baselines. While recent work has sought to extend RoPE to higher-dimensional inputs, many such extensions are non-commutative, thereby forfeiting RoPE’s shift-equivariance property. Spherical RoPE is one such non-commutative variant, motivated by the idea of rotating embedding vectors on spheres rather than circles. However, spherical rotations are inherently non-commutative, making the choice of rotation sequence ambiguous. In this work, we explore a quaternion-based approach – Quaternion Rotary Embeddings (QuatRo) – in place of Euler angles, leveraging quaternions’ ability to represent 3D rotations to parameterize the axes of rotation. We show Mixed RoPE and Spherical RoPE to be special cases of QuatRo. Further, we propose a generalization of QuatRo to Clifford Algebraic Rotary Embeddings (CARE) using geometric algebra. Viewing quaternions as the even subalgebra of Cl(3,0,0), we extend the notion of rotary embeddings from quaternions to Clifford rotors acting on multivectors. This formulation enables two key generalizations: (1) extending rotary embeddings to arbitrary dimensions, and (2) encoding positional information in multivectors of multiple grades, not just vectors. We present preliminary experiments comparing spherical, quaternion, and Clifford-based rotary embeddings.

[828] Adaptive Stepsizing for Stochastic Gradient Langevin Dynamics in Bayesian Neural Networks

Rajit Rajpal, Benedict Leimkuhler, Yuanhao Jiang

Main category: cs.LG

TL;DR: SA-SGLD is an adaptive stochastic gradient MCMC method that uses time rescaling to automatically adjust stepsizes based on local gradient norms, improving sampling accuracy in high-curvature regions without bias.

DetailsMotivation: Existing SGMCMC methods are highly sensitive to stepsize choices, and adaptive variants like pSGLD often fail to sample the correct invariant measure without costly divergence correction terms.

Method: Built on the SamAdams framework, SA-SGLD employs time rescaling to modulate stepsize according to monitored quantities (typically local gradient norm), automatically adjusting stepsizes in high-curvature vs flat regions.

Result: SA-SGLD achieves more accurate posterior sampling than SGLD on high-curvature 2D toy examples and in image classification with Bayesian neural networks using sharp priors.

Conclusion: The proposed SA-SGLD method improves both stability and mixing in Bayesian neural network posterior sampling without introducing bias, particularly effective in high-curvature regions.

Abstract: Bayesian neural networks (BNNs) require scalable sampling algorithms to approximate posterior distributions over parameters. Existing stochastic gradient Markov Chain Monte Carlo (SGMCMC) methods are highly sensitive to the choice of stepsize and adaptive variants such as pSGLD typically fail to sample the correct invariant measure without addition of a costly divergence correction term. In this work, we build on the recently proposed `SamAdams’ framework for timestep adaptation (Leimkuhler, Lohmann, and Whalley 2025), introducing an adaptive scheme: SA-SGLD, which employs time rescaling to modulate the stepsize according to a monitored quantity (typically the local gradient norm). SA-SGLD can automatically shrink stepsizes in regions of high curvature and expand them in flatter regions, improving both stability and mixing without introducing bias. We show that our method can achieve more accurate posterior sampling than SGLD on high-curvature 2D toy examples and in image classification with BNNs using sharp priors.

[829] Beyond Superficial Forgetting: Thorough Unlearning through Knowledge Density Estimation and Block Re-insertion

Feng Guo, Yuntao Wen, Shen Gao, Junshuo Zhang, Shuo Shang

Main category: cs.LG

TL;DR: KUnBR is a novel machine unlearning method that uses knowledge density estimation to identify harmful knowledge-rich layers and employs layer re-insertion strategy to thoroughly eliminate harmful knowledge from LLMs while maintaining model utility.

DetailsMotivation: Existing unlearning methods often fail to completely remove harmful knowledge from LLMs, leaving residual knowledge that can be easily recovered, which poses privacy, regulatory compliance, and ethical concerns.

Method: Proposes Knowledge Density-Guided Unlearning via Blocks Reinsertion (KUnBR) that: 1) uses knowledge density estimation to quantify and locate layers with rich harmful knowledge, 2) employs layer re-insertion strategy to extract and re-insert harmful knowledge-rich layers into original LLM to bypass gradient obstruction.

Result: Extensive experiments on multiple unlearning and general capability benchmarks show KUnBR achieves state-of-the-art forgetting performance while maintaining model utility.

Conclusion: KUnBR effectively addresses limitations of existing unlearning methods by providing precise harmful knowledge localization and thorough elimination through innovative re-insertion strategy, achieving superior unlearning performance without compromising model capabilities.

Abstract: Machine unlearning, which selectively removes harmful knowledge from a pre-trained model without retraining from scratch, is crucial for addressing privacy, regulatory compliance, and ethical concerns in Large Language Models (LLMs). However, existing unlearning methods often struggle to thoroughly remove harmful knowledge, leaving residual harmful knowledge that can be easily recovered. To address these limitations, we propose Knowledge Density-Guided Unlearning via Blocks Reinsertion (KUnBR), a novel approach that first identifies layers with rich harmful knowledge and then thoroughly eliminates the harmful knowledge via re-insertion strategy. Our method introduces knowledge density estimation to quantify and locate layers containing the most harmful knowledge, enabling precise unlearning. Additionally, we design a layer re-insertion strategy that extracts and re-inserts harmful knowledge-rich layers into the original LLM, bypassing gradient obstruction caused by cover layers and ensuring effective gradient propagation during unlearning. Extensive experiments conducted on several unlearning and general capability benchmarks demonstrate that KUnBR achieves state-of-the-art forgetting performance while maintaining model utility.

[830] Do traveling waves make good positional encodings?

Chase van de Geijn, Ayush Paliwal, Timo Lüddecke, Alexander S. Ecker

Main category: cs.LG

TL;DR: RollPE is a novel positional encoding method using circular roll operations to create relative positional shifts, outperforming traditional absolute encodings and comparable to RoPE.

DetailsMotivation: Transformers need positional encoding to overcome permutation invariance of self-attention, with recent focus on relative encodings for better translation equivariance.

Method: Apply circular roll operations to query and key tensors in self-attention, inducing relative phase shifts that compute attention based on positional differences rather than absolute indices.

Result: RollPE significantly outperforms traditional absolute positional embeddings and achieves comparable performance to RoPE.

Conclusion: RollPE provides a simplified alternative to RoPE through traveling wave mechanics, potentially relating to information flow processes in the brain.

Abstract: Transformers rely on positional encoding to compensate for the inherent permutation invariance of self-attention. Traditional approaches use absolute sinusoidal embeddings or learned positional vectors, while more recent methods emphasize relative encodings to better capture translation equivariances. In this work, we propose RollPE, a novel positional encoding mechanism based on traveling waves, implemented by applying a circular roll operation to the query and key tensors in self-attention. This operation induces a relative shift in phase across positions, allowing the model to compute attention as a function of positional differences rather than absolute indices. We show this simple method significantly outperforms traditional absolute positional embeddings and is comparable to RoPE. We derive a continuous case of RollPE which implicitly imposes a topographic structure on the query and key space. We further derive a mathematical equivalence of RollPE to a particular configuration of RoPE. Viewing RollPE through the lens of traveling waves may allow us to simplify RoPE and relate it to processes of information flow in the brain.

[831] H-Model: Dynamic Neural Architectures for Adaptive Processing

Dmytro Hospodarchuk

Main category: cs.LG

TL;DR: A neural network architecture with dynamic internal structure adjustment based on input data, using a routing mechanism for adaptive computation.

DetailsMotivation: To explore adaptable and potentially more interpretable networks that can learn both representations and the structure of computation itself, inspired by dynamic reasoning processes.

Method: Proposes a neural network with routing mechanism that allows layers to influence output propagation, enabling iterative and adaptive computation conditioned on data and internal state.

Result: Preliminary investigation shows promise but limited by computational constraints; full potential requires future experiments under better conditions.

Conclusion: This work presents a conceptual prototype for adaptable networks rather than competing with state-of-the-art models, opening new directions for research in dynamic computational structures.

Abstract: This article explores the design and experimentation of a neural network architecture capable of dynamically adjusting its internal structure based on the input data. The proposed model introduces a routing mechanism that allows each layer to influence how its outputs are propagated through the network, enabling iterative and adaptive computation. This concept is loosely inspired by the idea of thought processes and dynamic reasoning, where information flow is conditioned not only on the data itself, but also on the internal state of the system. It is important to note that this work does not aim to compete with state-of-the-art language models in terms of performance. Instead, it presents a conceptual prototype-an architectural framework that opens up a new direction for exploring adaptable and potentially more interpretable networks. The goal is not optimization of existing benchmarks but rather the proposal of a system that can learn not only representations, but also the structure of computation itself. Due to practical constraints in computing resources and data, this study remains a preliminary investigation. Nevertheless, initial observations show promise, and the architecture’s full potential can only be evaluated in future experiments under more favorable computational conditions.

[832] Evaluation of LLM-based Explanations for a Learning Analytics Dashboard

Alina Deriyeva, Benjamin Paassen

Main category: cs.LG

TL;DR: LLM-generated explanations in learning analytics dashboards are preferred over standalone dashboards and human teacher explanations, enhancing learning experience while maintaining pedagogical standards.

DetailsMotivation: To improve interpretability of learning analytics dashboards and support self-regulated learning by providing better data explanations.

Method: Used a large language model to generate verbal explanations of dashboard data, compared against standalone dashboard and human teacher explanations in an expert study with 12 university educators.

Result: LLM-based explanations of skill states and learning recommendations were significantly more favored than other conditions.

Conclusion: LLMs can effectively enhance learning analytics dashboards by providing interpretable explanations that improve learning experience while meeting pedagogical standards.

Abstract: Learning Analytics Dashboards can be a powerful tool to support self-regulated learning in Digital Learning Environments and promote development of meta-cognitive skills, such as reflection. However, their effectiveness can be affected by the interpretability of the data they provide. To assist in the interpretation, we employ a large language model to generate verbal explanations of the data in the dashboard and evaluate it against a standalone dashboard and explanations provided by human teachers in an expert study with university level educators (N=12). We find that the LLM-based explanations of the skill state presented in the dashboard, as well as general recommendations on how to proceed with learning within the course are significantly more favored compared to the other conditions. This indicates that using LLMs for interpretation purposes can enhance the learning experience for learners while maintaining the pedagogical standards approved by teachers.

[833] Synergistic Feature Fusion for Latent Lyrical Classification: A Gated Deep Learning Architecture

M. A. Gameiro

Main category: cs.LG

TL;DR: A novel Synergistic Fusion Layer (SFL) architecture that fuses deep semantic features with structural cues using gated mechanisms outperforms traditional feature concatenation for lyrical content classification, achieving superior accuracy and reliability.

DetailsMotivation: To address the challenge of integrating complex high-dimensional deep semantic features with simple interpretable structural cues for lyrical content classification, moving beyond simple feature concatenation approaches.

Method: Proposed a Synergistic Fusion Layer (SFL) architecture using gated mechanisms to modulate Sentence-BERT embeddings with low-dimensional auxiliary features, reframing lyrical clustering as binary classification between a dominant homogeneous cluster and all other content.

Result: SFL achieved 0.9894 accuracy and 0.9894 Macro F1 score, outperforming Random Forest baseline (0.9868 accuracy). Showed 93% reduction in Expected Calibration Error (0.0035 vs 0.0500) and 2.5x lower Log Loss (0.0304 vs 0.0772).

Conclusion: Non-linear gating is superior to simple feature concatenation, establishing SFL as a robust and trustworthy system for complex multimodal lyrical analysis with excellent calibration and reliability.

Abstract: This study addresses the challenge of integrating complex, high-dimensional deep semantic features with simple, interpretable structural cues for lyrical content classification. We introduce a novel Synergistic Fusion Layer (SFL) architecture, a deep learning model utilizing a gated mechanism to modulate Sentence-BERT embeddings (Fdeep) using low-dimensional auxiliary features (Fstruct). The task, derived from clustering UMAP-reduced lyrical embeddings, is reframed as binary classification, distinguishing a dominant, homogeneous cluster (Class 0) from all other content (Class 1). The SFL model achieved an accuracy of 0.9894 and a Macro F1 score of 0.9894, outperforming a comprehensive Random Forest (RF) baseline that used feature concatenation (Accuracy = 0.9868). Crucially, the SFL model demonstrated vastly superior reliability and calibration, exhibiting a 93% reduction in Expected Calibration Error (ECE = 0.0035) and a 2.5x lower Log Loss (0.0304) compared to the RF baseline (ECE = 0.0500; Log Loss = 0.0772). This performance validates the architectural hypothesis that non-linear gating is superior to simple feature concatenation, establishing the SFL model as a robust and trustworthy system for complex multimodal lyrical analysis.

[834] Beyond One-Way Pruning: Bidirectional Pruning-Regrowth for Extreme Accuracy-Sparsity Tradeoff

Junchen Liu, Yi Sheng

Main category: cs.LG

TL;DR: Proposes a bidirectional pruning-regrowth strategy to overcome performance degradation at high sparsity levels by regenerating critical connections from an extremely compressed network.

DetailsMotivation: Model pruning faces performance degradation when sparsity exceeds certain thresholds, limiting achievable compression ratios and preventing models from meeting hardware size constraints.

Method: Bidirectional pruning-regrowth strategy that starts from an extremely compressed network and selectively regenerates critical connections to recover lost performance.

Result: Effectively mitigates the sharp accuracy drop commonly observed under high sparsity conditions.

Conclusion: The proposed method enables higher compression ratios while maintaining model performance, making models operable on hardware platforms with stringent size constraints.

Abstract: As a widely adopted model compression technique, model pruning has demonstrated strong effectiveness across various architectures. However, we observe that when sparsity exceeds a certain threshold, both iterative and one-shot pruning methods lead to a steep decline in model performance. This rapid degradation limits the achievable compression ratio and prevents models from meeting the stringent size constraints required by certain hardware platforms, rendering them inoperable. To overcome this limitation, we propose a bidirectional pruning-regrowth strategy. Starting from an extremely compressed network that satisfies hardware constraints, the method selectively regenerates critical connections to recover lost performance, effectively mitigating the sharp accuracy drop commonly observed under high sparsity conditions.

[835] Learning with Preserving for Continual Multitask Learning

Hanchen David Wang, Siwoo Bae, Zirong Chen, Meiyi Ma

Main category: cs.LG

TL;DR: LwP is a novel continual learning framework that preserves geometric structure in representation space to prevent catastrophic forgetting in multitask learning scenarios, outperforming state-of-the-art methods without requiring replay buffers.

DetailsMotivation: Existing continual learning methods fail in Continual Multitask Learning (CMTL) settings where models sequentially learn new tasks on shared data streams, as they learn fragmented task-specific features that interfere with each other.

Method: Introduces Learning with Preserving (LwP) framework with Dynamically Weighted Distance Preservation (DWDP) loss that regularizes pairwise distances between latent data representations to maintain geometric structure of shared representation space.

Result: Extensive evaluations show LwP mitigates catastrophic forgetting, consistently outperforms state-of-the-art baselines in CMTL tasks, demonstrates superior robustness to distribution shifts, and is the only approach to surpass single-task learning baseline.

Conclusion: LwP effectively addresses CMTL challenges by preserving geometric structure in representation space, making it suitable for real-world dynamic environments and privacy-conscious applications without requiring replay buffers.

Abstract: Artificial intelligence systems in critical fields like autonomous driving and medical imaging analysis often continually learn new tasks using a shared stream of input data. For instance, after learning to detect traffic signs, a model may later need to learn to classify traffic lights or different types of vehicles using the same camera feed. This scenario introduces a challenging setting we term Continual Multitask Learning (CMTL), where a model sequentially learns new tasks on an underlying data distribution without forgetting previously learned abilities. Existing continual learning methods often fail in this setting because they learn fragmented, task-specific features that interfere with one another. To address this, we introduce Learning with Preserving (LwP), a novel framework that shifts the focus from preserving task outputs to maintaining the geometric structure of the shared representation space. The core of LwP is a Dynamically Weighted Distance Preservation (DWDP) loss that prevents representation drift by regularizing the pairwise distances between latent data representations. This mechanism of preserving the underlying geometric structure allows the model to retain implicit knowledge and support diverse tasks without requiring a replay buffer, making it suitable for privacy-conscious applications. Extensive evaluations on time-series and image benchmarks show that LwP not only mitigates catastrophic forgetting but also consistently outperforms state-of-the-art baselines in CMTL tasks. Notably, our method shows superior robustness to distribution shifts and is the only approach to surpass the strong single-task learning baseline, underscoring its effectiveness for real-world dynamic environments.

[836] Homotopy-Guided Self-Supervised Learning of Parametric Solutions for AC Optimal Power Flow

Shimiao Li, Aaron Tuor, Draguna Vrabie, Larry Pileggi, Jan Drgona

Main category: cs.LG

TL;DR: A homotopy-guided self-supervised learning method for parametric AC-OPF problems that improves convergence stability and feasibility without requiring labeled optimal solutions.

DetailsMotivation: Standard learning approaches often fail to converge to feasible, high-quality solutions for AC-OPF due to its inherent nonconvexity and challenging optimization landscapes.

Method: Constructs a continuous deformation of objective and constraints during training, starting from a relaxed problem with broad basin of attraction and gradually transforming toward the original problem.

Result: Significantly increases feasibility rates compared to non-homotopy baselines while achieving objective values comparable to full OPF solvers on IEEE AC-OPF benchmarks.

Conclusion: Demonstrates promise of homotopy-based heuristics for scalable, constraint-aware learning to optimize in power system optimization.

Abstract: Learning to optimize (L2O) parametric approximations of AC optimal power flow (AC-OPF) solutions offers the potential for fast, reusable decision-making in real-time power system operations. However, the inherent nonconvexity of AC-OPF results in challenging optimization landscapes, and standard learning approaches often fail to converge to feasible, high-quality solutions. This work introduces a \textit{homotopy-guided self-supervised L2O method} for parametric AC-OPF problems. The key idea is to construct a continuous deformation of the objective and constraints during training, beginning from a relaxed problem with a broad basin of attraction and gradually transforming it toward the original problem. The resulting learning process improves convergence stability and promotes feasibility without requiring labeled optimal solutions or external solvers. We evaluate the proposed method on standard IEEE AC-OPF benchmarks and show that homotopy-guided L2O significantly increases feasibility rates compared to non-homotopy baselines, while achieving objective values comparable to full OPF solvers. These findings demonstrate the promise of homotopy-based heuristics for scalable, constraint-aware L2O in power system optimization.

[837] A neural optimization framework for free-boundary diffeomorphic mapping problems and its applications

Zhehao Xu, Lok Ming Lui

Main category: cs.LG

TL;DR: Proposes SBN-Opt framework using Spectral Beltrami Network to optimize free-boundary diffeomorphisms with controllable geometric distortion, outperforming traditional methods in density-equalizing maps and surface registration.

DetailsMotivation: Free-boundary diffeomorphism optimization is challenging due to unconstrained boundaries and the need to preserve local bijectivity under large deformation. Traditional LSQC methods require landmark conditioning and cannot be used in gradient-based optimization.

Method: Developed Spectral Beltrami Network (SBN) as a neural surrogate that embeds LSQC energy into a multiscale mesh-spectral architecture, enabling SBN-Opt framework for free-boundary diffeomorphism optimization.

Result: Extensive experiments show SBN-Opt’s superiority over traditional numerical algorithms in density-equalizing maps and inconsistent surface registration, with explicit control over local geometric distortion.

Conclusion: SBN-Opt provides an effective framework for free-boundary diffeomorphism optimization that overcomes limitations of traditional LSQC methods and enables gradient-based optimization with controllable geometric distortion.

Abstract: Free-boundary diffeomorphism optimization is a core ingredient in the surface mapping problem but remains notoriously difficult because the boundary is unconstrained and local bijectivity must be preserved under large deformation. Numerical Least-Squares Quasiconformal (LSQC) theory, with its provable existence, uniqueness, similarity-invariance and resolution-independence, offers an elegant mathematical remedy. However, the conventional numerical algorithm requires landmark conditioning, and cannot be applied into gradient-based optimization. We propose a neural surrogate, the Spectral Beltrami Network (SBN), that embeds LSQC energy into a multiscale mesh-spectral architecture. Next, we propose the SBN guided optimization framework SBN-Opt which optimizes free-boundary diffeomorphism for the problem, with local geometric distortion explicitly controllable. Extensive experiments on density-equalizing maps and inconsistent surface registration demonstrate our SBN-Opt’s superiority over traditional numerical algorithms.

[838] Probabilistic Wildfire Susceptibility from Remote Sensing Using Random Forests and SHAP

Udaya Bhasker Cheerala, Varun Teja Chirukuri, Venkata Akhil Kumar Gummadi, Jintu Moni Bhuyan, Praveen Damacharla

Main category: cs.LG

TL;DR: This study develops a comprehensive wildfire risk map for California using Random Forest algorithm with SHAP-based explainable AI, identifying key ecosystem-specific drivers and high-risk areas.

DetailsMotivation: Wildfires pose significant global threats, especially in California due to climate, topography, vegetation, and human factors. There's a need for comprehensive risk assessment to enable targeted mitigation strategies.

Method: Applied Random Forest algorithm augmented with SHAP (Explainable AI) for model interpretation, using spatial and temporal validation strategies to assess performance.

Result: RF model showed strong predictive performance (AUC > 0.99), with temporal validation performing better than spatial. SHAP identified soil organic carbon, tree cover, and NDVI as key drivers for forests, while LST, elevation, and vegetation health dominated in grasslands. Central Valley/Northern Buttes had highest grassland risk; Northern Buttes/North Coast Redwoods had highest forest risk.

Conclusion: The RF-SHAP framework provides a robust, interpretable, and adaptable method for wildfire risk assessment, enabling informed decision-making and targeted mitigation strategies.

Abstract: Wildfires pose a significant global threat to ecosystems worldwide, with California experiencing recurring fires due to various factors, including climate, topographical features, vegetation patterns, and human activities. This study aims to develop a comprehensive wildfire risk map for California by applying the random forest (RF) algorithm, augmented with Explainable Artificial Intelligence (XAI) through Shapley Additive exPlanations (SHAP), to interpret model predictions. Model performance was assessed using both spatial and temporal validation strategies. The RF model demonstrated strong predictive performance, achieving near-perfect discrimination for grasslands (AUC = 0.996) and forests (AUC = 0.997). Spatial cross-validation revealed moderate transferability, yielding ROC-AUC values of 0.6155 for forests and 0.5416 for grasslands. In contrast, temporal split validation showed enhanced generalization, especially for forests (ROC-AUC = 0.6615, PR-AUC = 0.8423). SHAP-based XAI analysis identified key ecosystem-specific drivers: soil organic carbon, tree cover, and Normalized Difference Vegetation Index (NDVI) emerged as the most influential in forests, whereas Land Surface Temperature (LST), elevation, and vegetation health indices were dominant in grasslands. District-level classification revealed that Central Valley and Northern Buttes districts had the highest concentration of high-risk grasslands, while Northern Buttes and North Coast Redwoods dominated forested high-risk areas. This RF-SHAP framework offers a robust, comprehensible, and adaptable method for assessing wildfire risks, enabling informed decisions and creating targeted strategies to mitigate dangers.

[839] MPCM-Net: Multi-scale network integrates partial attention convolution with Mamba for ground-based cloud image segmentation

Penghui Niu, Jiashuai She, Taotao Cai, Yajuan Zhang, Ping Zhang, Junhua Gu, Jianxin Li

Main category: cs.LG

TL;DR: MPCM-Net is a novel cloud segmentation network that integrates partial attention convolutions with Mamba architectures to achieve superior accuracy and computational efficiency, along with a new clear-label dataset CSRC.

DetailsMotivation: Current deep learning approaches for ground-based cloud image segmentation have limitations including inefficient multi-scale context extraction, poor accuracy-throughput balance in attention mechanisms, and lack of global interdependencies in decoder features.

Method: Proposes MPCM-Net with MPAC encoder (MPC block with ParCM and ParSM for global spatial interaction, MPA block with ParAM and ParSM for discriminative feature extraction) and M2B decoder with SSHD for contextual loss mitigation while maintaining linear complexity.

Result: Extensive experiments on the new CSRC dataset demonstrate superior performance over state-of-the-art methods, achieving optimal balance between segmentation accuracy and inference speed.

Conclusion: MPCM-Net effectively addresses limitations of existing methods and the CSRC dataset provides a valuable benchmark for cloud segmentation research, with both dataset and source code being made publicly available.

Abstract: Ground-based cloud image segmentation is a critical research domain for photovoltaic power forecasting. Current deep learning approaches primarily focus on encoder-decoder architectural refinements. However, existing methodologies exhibit several limitations:(1)they rely on dilated convolutions for multi-scale context extraction, lacking the partial feature effectiveness and interoperability of inter-channel;(2)attention-based feature enhancement implementations neglect accuracy-throughput balance; and (3)the decoder modifications fail to establish global interdependencies among hierarchical local features, limiting inference efficiency. To address these challenges, we propose MPCM-Net, a Multi-scale network that integrates Partial attention Convolutions with Mamba architectures to enhance segmentation accuracy and computational efficiency. Specifically, the encoder incorporates MPAC, which comprises:(1)a MPC block with ParCM and ParSM that enables global spatial interaction across multi-scale cloud formations, and (2)a MPA block combining ParAM and ParSM to extract discriminative features with reduced computational complexity. On the decoder side, a M2B is employed to mitigate contextual loss through a SSHD that maintains linear complexity while enabling deep feature aggregation across spatial and scale dimensions. As a key contribution to the community, we also introduce and release a dataset CSRC, which is a clear-label, fine-grained segmentation benchmark designed to overcome the critical limitations of existing public datasets. Extensive experiments on CSRC demonstrate the superior performance of MPCM-Net over state-of-the-art methods, achieving an optimal balance between segmentation accuracy and inference speed. The dataset and source code will be available at https://github.com/she1110/CSRC.

[840] Stratified Knowledge-Density Super-Network for Scalable Vision Transformers

Longhua Li, Lei Qi, Xin Geng

Main category: cs.LG

TL;DR: Proposes WPAC and PIAD methods to transform pre-trained ViTs into stratified knowledge-density super-networks for efficient multi-scale deployment.

DetailsMotivation: Training and deploying multiple ViT models for different resource constraints is costly and inefficient.

Method: WPAC uses weighted PCA for attention contraction to concentrate knowledge into critical weights. PIAD uses progressive importance-aware dropout to promote knowledge stratification.

Result: WPAC outperforms existing pruning criteria in knowledge concentration, and the combination with PIAD offers strong performance compared to state-of-the-art model compression and expansion methods.

Conclusion: The proposed approach enables flexible extraction of sub-networks that retain maximal knowledge for varying model sizes, providing an efficient alternative to training multiple separate models.

Abstract: Training and deploying multiple vision transformer (ViT) models for different resource constraints is costly and inefficient. To address this, we propose transforming a pre-trained ViT into a stratified knowledge-density super-network, where knowledge is hierarchically organized across weights. This enables flexible extraction of sub-networks that retain maximal knowledge for varying model sizes. We introduce \textbf{W}eighted \textbf{P}CA for \textbf{A}ttention \textbf{C}ontraction (WPAC), which concentrates knowledge into a compact set of critical weights. WPAC applies token-wise weighted principal component analysis to intermediate features and injects the resulting transformation and inverse matrices into adjacent layers, preserving the original network function while enhancing knowledge compactness. To further promote stratified knowledge organization, we propose \textbf{P}rogressive \textbf{I}mportance-\textbf{A}ware \textbf{D}ropout (PIAD). PIAD progressively evaluates the importance of weight groups, updates an importance-aware dropout list, and trains the super-network under this dropout regime to promote knowledge stratification. Experiments demonstrate that WPAC outperforms existing pruning criteria in knowledge concentration, and the combination with PIAD offers a strong alternative to state-of-the-art model compression and model expansion methods.

[841] A Bayesian Model for Multi-stage Censoring

Shuvom Sadhuka, Sophia Lin, Emma Pierson, Bonnie Berger

Main category: cs.LG

TL;DR: Bayesian model for healthcare funnel decision structures that addresses selective censoring bias in ground truth outcomes, applied to emergency department admissions showing gender-based differences in ICU admission thresholds.

DetailsMotivation: Healthcare decision funnels have progressive stages where ground truth outcomes are only revealed at the end, creating selective censoring that introduces statistical biases, particularly in underserved groups whose outcomes are more frequently censored.

Method: Developed a Bayesian model for funnel decision structures, drawing from prior work on selective labels and censoring, and applied it to emergency department visit data where mortality is only observed for admitted patients.

Result: Model accurately recovered true parameters in synthetic settings and predicted outcomes for censored patients better than baselines. In real data, found gender-based differences: women require higher mortality risk threshold (5.1%) than men (4.5%) for ICU admission.

Conclusion: The Bayesian model effectively addresses selective censoring bias in healthcare funnel structures and reveals systematic differences in decision thresholds across patient groups, highlighting potential disparities in care.

Abstract: Many sequential decision settings in healthcare feature funnel structures characterized by a series of stages, such as screenings or evaluations, where the number of patients who advance to each stage progressively decreases and decisions become increasingly costly. For example, an oncologist may first conduct a breast exam, followed by a mammogram for patients with concerning exams, followed by a biopsy for patients with concerning mammograms. A key challenge is that the ground truth outcome, such as the biopsy result, is only revealed at the end of this funnel. The selective censoring of the ground truth can introduce statistical biases in risk estimation, especially in underserved patient groups, whose outcomes are more frequently censored. We develop a Bayesian model for funnel decision structures, drawing from prior work on selective labels and censoring. We first show in synthetic settings that our model is able to recover the true parameters and predict outcomes for censored patients more accurately than baselines. We then apply our model to a dataset of emergency department visits, where in-hospital mortality is observed only for those who are admitted to either the hospital or ICU. We find that there are gender-based differences in hospital and ICU admissions. In particular, our model estimates that the mortality risk threshold to admit women to the ICU is higher for women (5.1%) than for men (4.5%).

[842] R-Tuning: Wavelet-Decomposed Replay and Semantic Alignment for Continual Adaptation of Pretrained Time-Series Models

Tianyi Yin, Jingwei Wang, Chenze Wang, Han Wang, Jiexuan Cai, Min Liu, Yunlong Ma, Kun Gao, Yuting Song, Weiming Shen

Main category: cs.LG

TL;DR: R-Tuning is a novel framework for continual adaptation of pre-trained time-series models that addresses catastrophic forgetting through frequency-aware replay and latent consistency constraints, achieving significant performance improvements on both new and old tasks.

DetailsMotivation: Pre-trained models struggle with evolving data distributions in time-series forecasting, and fine-tuning on new data alone causes catastrophic forgetting. The challenge is compounded by limited access to original training data.

Method: Uses frequency-aware replay strategy with wavelet-based decomposition across multiple frequency bands to generate trend-preserving and fusion-enhanced variants. Introduces latent consistency constraint to align new representations with prior task space for joint optimization in a compact latent space.

Result: Reduces MAE and MSE by up to 46.9% and 46.8% on new tasks, while preserving prior knowledge with gains up to 5.7% and 6.0% on old tasks. Outperforms state-of-the-art baselines even when synthetic samples account for only 5% of new task dataset.

Conclusion: R-Tuning effectively addresses catastrophic forgetting in time-series forecasting through its unified latent space approach and frequency-aware replay strategy, demonstrating robust knowledge retention and adaptation capabilities.

Abstract: Pre-trained models have demonstrated exceptional generalization capabilities in time-series forecasting; however, adapting them to evolving data distributions remains a significant challenge. A key hurdle lies in accessing the original training data, as fine-tuning solely on new data often leads to catastrophic forgetting. To address this issue, we propose Replay Tuning (R-Tuning), a novel framework designed for the continual adaptation of pre-trained time-series models. R-Tuning constructs a unified latent space that captures both prior and current task knowledge through a frequency-aware replay strategy. Specifically, it augments model-generated samples via wavelet-based decomposition across multiple frequency bands, generating trend-preserving and fusion-enhanced variants to improve representation diversity and replay efficiency. To further reduce reliance on synthetic samples, R-Tuning introduces a latent consistency constraint that aligns new representations with the prior task space. This constraint guides joint optimization within a compact and semantically coherent latent space, ensuring robust knowledge retention and adaptation. Extensive experimental results demonstrate the superiority of R-Tuning, which reduces MAE and MSE by up to 46.9% and 46.8%, respectively, on new tasks, while preserving prior knowledge with gains of up to 5.7% and 6.0% on old tasks. Notably, under few-shot settings, R-Tuning outperforms all state-of-the-art baselines even when synthetic proxy samples account for only 5% of the new task dataset.

[843] Hierarchical Schedule Optimization for Fast and Robust Diffusion Model Sampling

Aihua Zhu, Rui Su, Qinglin Zhao, Li Feng, Meng Shen, Shibo He

Main category: cs.LG

TL;DR: HSO is a training-free bi-level optimization framework that accelerates diffusion model sampling by finding optimal timestep schedules for fixed small NFE, achieving state-of-the-art performance with minimal computational cost.

DetailsMotivation: Existing schedule optimization methods struggle to simultaneously satisfy effectiveness, adaptivity, practical robustness, and computational efficiency principles needed for diffusion model acceleration.

Method: Hierarchical-Schedule-Optimizer (HSO) uses bi-level optimization with upper-level global search for initialization and lower-level local optimization for schedule refinement, guided by Midpoint Error Proxy (MEP) and Spacing-Penalized Fitness (SPF).

Result: HSO achieves FID of 11.94 on LAION-Aesthetics with Stable Diffusion v2.1 using only 5 NFE, with one-time optimization cost under 8 seconds.

Conclusion: HSO presents a highly practical and efficient paradigm for diffusion model acceleration without costly retraining, setting new state-of-the-art in low-NFE regime.

Abstract: Diffusion probabilistic models have set a new standard for generative fidelity but are hindered by a slow iterative sampling process. A powerful training-free strategy to accelerate this process is Schedule Optimization, which aims to find an optimal distribution of timesteps for a fixed and small Number of Function Evaluations (NFE) to maximize sample quality. To this end, a successful schedule optimization method must adhere to four core principles: effectiveness, adaptivity, practical robustness, and computational efficiency. However, existing paradigms struggle to satisfy these principles simultaneously, motivating the need for a more advanced solution. To overcome these limitations, we propose the Hierarchical-Schedule-Optimizer (HSO), a novel and efficient bi-level optimization framework. HSO reframes the search for a globally optimal schedule into a more tractable problem by iteratively alternating between two synergistic levels: an upper-level global search for an optimal initialization strategy and a lower-level local optimization for schedule refinement. This process is guided by two key innovations: the Midpoint Error Proxy (MEP), a solver-agnostic and numerically stable objective for effective local optimization, and the Spacing-Penalized Fitness (SPF) function, which ensures practical robustness by penalizing pathologically close timesteps. Extensive experiments show that HSO sets a new state-of-the-art for training-free sampling in the extremely low-NFE regime. For instance, with an NFE of just 5, HSO achieves a remarkable FID of 11.94 on LAION-Aesthetics with Stable Diffusion v2.1. Crucially, this level of performance is attained not through costly retraining, but with a one-time optimization cost of less than 8 seconds, presenting a highly practical and efficient paradigm for diffusion model acceleration.

[844] Doubly Debiased Test-Time Prompt Tuning for Vision-Language Models

Fei Song, Yi Li, Rui Wang, Jiahuan Zhou, Changwen Zheng, Jiangmeng Li

Main category: cs.LG

TL;DR: The paper proposes a Doubly Debiased Test-Time Prompt Tuning method to address prompt optimization bias in vision-language models by incorporating dynamic retrieval-augmented modulation and reliability-aware prompt optimization.

DetailsMotivation: Test-time prompt tuning for vision-language models suffers from prompt optimization bias, where tuning prompts solely on unlabeled test data leads to overconfident but incorrect outputs and cross-modal misalignment, resulting in suboptimal performance.

Method: The method includes: 1) Dynamic retrieval-augmented modulation that retrieves high-confidence knowledge from a dynamic knowledge base using test image features, 2) Reliability-aware prompt optimization with confidence-based weighted ensemble and cross-modal consistency distillation for regularization.

Result: Extensive experiments on 15 benchmark datasets show the method outperforms baselines in both natural distribution shifts and cross-dataset generalization scenarios.

Conclusion: The proposed doubly debiased approach effectively mitigates prompt optimization bias in vision-language models, improving generalization performance across various test scenarios.

Abstract: Test-time prompt tuning for vision-language models has demonstrated impressive generalization capabilities under zero-shot settings. However, tuning the learnable prompts solely based on unlabeled test data may induce prompt optimization bias, ultimately leading to suboptimal performance on downstream tasks. In this work, we analyze the underlying causes of prompt optimization bias from both the model and data perspectives. In terms of the model, the entropy minimization objective typically focuses on reducing the entropy of model predictions while overlooking their correctness. This can result in overconfident yet incorrect outputs, thereby compromising the quality of prompt optimization. On the data side, prompts affected by optimization bias can introduce misalignment between visual and textual modalities, which further aggravates the prompt optimization bias. To this end, we propose a Doubly Debiased Test-Time Prompt Tuning method. Specifically, we first introduce a dynamic retrieval-augmented modulation module that retrieves high-confidence knowledge from a dynamic knowledge base using the test image feature as a query, and uses the retrieved knowledge to modulate the predictions. Guided by the refined predictions, we further develop a reliability-aware prompt optimization module that incorporates a confidence-based weighted ensemble and cross-modal consistency distillation to impose regularization constraints during prompt tuning. Extensive experiments across 15 benchmark datasets involving both natural distribution shifts and cross-datasets generalization demonstrate that our method outperforms baselines, validating its effectiveness in mitigating prompt optimization bias.

[845] AnchorDS: Anchoring Dynamic Sources for Semantically Consistent Text-to-3D Generation

Jiayin Zhu, Linlin Yang, Yicong Li, Angela Yao

Main category: cs.LG

TL;DR: AnchorDS improves text-to-3D generation by treating guidance as dynamic rather than static, using image-anchored score distillation to prevent semantic over-smoothing and enhance detail, color, and semantic consistency.

DetailsMotivation: Existing text-to-3D methods treat guidance from 2D generative models as static, leading to semantic over-smoothing artifacts where semantic cues are suppressed or merged, resulting in inconsistent trajectories.

Method: Reformulates text-to-3D optimization as mapping evolving source to fixed target distributions. Uses dual-conditioned latent space (text + rendered image), introduces AnchorDS for state-anchored guidance, penalizes erroneous source estimates, and adds lightweight filter/fine-tuning strategies.

Result: Produces finer-grained detail, more natural colors, and stronger semantic consistency, especially for complex prompts, while maintaining efficiency. Extensive experiments show superior quality and efficiency over previous methods.

Conclusion: AnchorDS successfully addresses semantic over-smoothing by treating guidance dynamically and anchoring it with image conditions, leading to significant improvements in text-to-3D generation quality and consistency.

Abstract: Optimization-based text-to-3D methods distill guidance from 2D generative models via Score Distillation Sampling (SDS), but implicitly treat this guidance as static. This work shows that ignoring source dynamics yields inconsistent trajectories that suppress or merge semantic cues, leading to “semantic over-smoothing” artifacts. As such, we reformulate text-to-3D optimization as mapping a dynamically evolving source distribution to a fixed target distribution. We cast the problem into a dual-conditioned latent space, conditioned on both the text prompt and the intermediately rendered image. Given this joint setup, we observe that the image condition naturally anchors the current source distribution. Building on this insight, we introduce AnchorDS, an improved score distillation mechanism that provides state-anchored guidance with image conditions and stabilizes generation. We further penalize erroneous source estimates and design a lightweight filter strategy and fine-tuning strategy that refines the anchor with negligible overhead. AnchorDS produces finer-grained detail, more natural colours, and stronger semantic consistency, particularly for complex prompts, while maintaining efficiency. Extensive experiments show that our method surpasses previous methods in both quality and efficiency.

[846] Toward Dignity-Aware AI: Next-Generation Elderly Monitoring from Fall Detection to ADL

Xun Shao, Aoba Otani, Yuto Hirasuka, Runji Cai, Seng W. Loke

Main category: cs.LG

TL;DR: This paper proposes a next-generation elderly monitoring system that transitions from simple fall detection to comprehensive Activities of Daily Living (ADL) recognition using privacy-preserving, edge-deployed federated AI systems.

DetailsMotivation: To support independence and dignity in aging societies by moving beyond basic fall detection toward understanding daily routines through comprehensive ADL monitoring.

Method: Used SISFall dataset and GAN-augmented variants as proxy tasks, conducted experiments on federated learning with non-IID conditions, and embedded deployment on Jetson Orin Nano devices.

Result: Demonstrated feasibility of the approach through initial results, showing potential for robust ADL recognition while addressing privacy concerns through federated learning.

Conclusion: The work provides early evidence and a roadmap for transitioning from single-task detection to comprehensive daily activity recognition, highlighting challenges like domain shift, data scarcity, and privacy risks that need to be addressed for sustainable human-centered elderly care AI.

Abstract: This position paper envisions a next-generation elderly monitoring system that moves beyond fall detection toward the broader goal of Activities of Daily Living (ADL) recognition. Our ultimate aim is to design privacy-preserving, edge-deployed, and federated AI systems that can robustly detect and understand daily routines, supporting independence and dignity in aging societies. At present, ADL-specific datasets are still under collection. As a preliminary step, we demonstrate feasibility through experiments using the SISFall dataset and its GAN-augmented variants, treating fall detection as a proxy task. We report initial results on federated learning with non-IID conditions, and embedded deployment on Jetson Orin Nano devices. We then outline open challenges such as domain shift, data scarcity, and privacy risks, and propose directions toward full ADL monitoring in smart-room environments. This work highlights the transition from single-task detection to comprehensive daily activity recognition, providing both early evidence and a roadmap for sustainable and human-centered elderly care AI.

[847] Benchmarking GNNs for OOD Materials Property Prediction with Uncertainty Quantification

Liqin Tan, Pin Chen, Menghan Liu, Xiean Wang, Jianhuan Cen, Qingsong Zou

Main category: cs.LG

TL;DR: MatUQ is a benchmark framework for evaluating GNNs on OOD materials property prediction with uncertainty quantification, featuring 1,375 tasks across six datasets using novel splitting strategies.

DetailsMotivation: To address the challenge of evaluating graph neural networks on out-of-distribution materials property prediction with reliable uncertainty quantification in materials discovery applications.

Method: Created benchmark with 1,375 OOD tasks using OFM-based and novel SOAP-LOCO splitting strategy; evaluated 12 GNN models with unified uncertainty-aware training combining Monte Carlo Dropout and Deep Evidential Regression; introduced D-EviU uncertainty metric.

Result: Uncertainty-aware training reduced errors by 70.6% on average; no single model dominated universally - older models remained competitive while newer ones excelled on specific properties; D-EviU showed strongest correlation with prediction errors.

Conclusion: The benchmark provides practical insights for model selection under distribution shifts in materials discovery, demonstrating the importance of uncertainty-aware training and specialized model choices for different material properties.

Abstract: We present MatUQ, a benchmark framework for evaluating graph neural networks (GNNs) on out-of-distribution (OOD) materials property prediction with uncertainty quantification (UQ). MatUQ comprises 1,375 OOD prediction tasks constructed from six materials datasets using five OFM-based and a newly proposed structure-aware splitting strategy, SOAP-LOCO, which captures local atomic environments more effectively. We evaluate 12 representative GNN models under a unified uncertainty-aware training protocol that combines Monte Carlo Dropout and Deep Evidential Regression (DER), and introduce a novel uncertainty metric, D-EviU, which shows the strongest correlation with prediction errors in most tasks. Our experiments yield two key findings. First, the uncertainty-aware training approach significantly improves model prediction accuracy, reducing errors by an average of 70.6% across challenging OOD scenarios. Second, the benchmark reveals that no single model dominates universally: earlier models such as SchNet and ALIGNN remain competitive, while newer models like CrystalFramer and SODNet demonstrate superior performance on specific material properties. These results provide practical insights for selecting reliable models under distribution shifts in materials discovery.

[848] Moirai 2.0: When Less Is More for Time Series Forecasting

Chenghao Liu, Taha Aksu, Juncheng Liu, Xu Liu, Hanshu Yan, Quang Pham, Doyen Sahoo, Caiming Xiong, Silvio Savarese, Junnan Li

Main category: cs.LG

TL;DR: Moirai 2.0 is a time-series foundation model that improves on version 1.0 with a simpler decoder-only architecture, quantile forecasting, and multi-token prediction, achieving better accuracy and efficiency while being 30x smaller and 2x faster.

DetailsMotivation: To create a more efficient and accurate time-series forecasting model by simplifying the architecture from Moirai 1.0's complex masked-encoder approach to a streamlined decoder-only design.

Method: Uses decoder-only architecture with quantile forecasting and multi-token prediction, trained on 36M time series. Replaces previous masked-encoder training, multi-patch inputs, and mixture-distribution outputs with single patch and quantile loss.

Result: Outperforms larger models in the same family, achieves top performance on Gift-Eval benchmark, and shows 30x size reduction and 2x speed improvement over Moirai 1.0-Large while maintaining better performance.

Conclusion: The decoder-only backbone with recursive multi-quantile decoding drives most improvements. Performance plateaus with parameter scaling and declines at longer horizons, suggesting future work on data scaling and long-horizon modeling.

Abstract: We introduce Moirai 2.0, a decoder-only time-series foundation model trained on a new corpus of 36M series. The model adopts quantile forecasting and multi-token prediction, improving both probabilistic accuracy and inference efficiency. On the Gift-Eval benchmark, it ranks among the top pretrained models while achieving a strong trade-off between accuracy, speed, and model size. Compared to Moirai 1.0, Moirai 2.0 replaces masked-encoder training, multi-patch inputs, and mixture-distribution outputs with a simpler decoder-only architecture, single patch, and quantile loss. Ablation studies isolate these changes – showing that the decoder-only backbone along with recursive multi-quantile decoding contribute most to the gains. Additional experiments show that Moirai 2.0 outperforms larger models from the same family and exhibits robust domain-level results. In terms of efficiency and model size, Moirai 2.0 is twice as fast and thirty times smaller than its prior best version, Moirai 1.0-Large, while also performing better. Model performance plateaus with increasing parameter count and declines at longer horizons, motivating future work on data scaling and long-horizon modeling. We release code and evaluation details to support further research.

[849] Tighter Truncated Rectangular Prism Approximation for RNN Robustness Verification

Xingqi Lin, Liangyu Chen, Min Wu, Min Zhang, Zhenbing Zeng

Main category: cs.LG

TL;DR: DeepPrism proposes a novel truncated rectangular prism method for tighter over-approximation of nonlinear activation functions in RNNs, improving robustness verification accuracy compared to state-of-the-art approaches.

DetailsMotivation: Existing methods over-approximate nonlinear activation functions with individual linear bounding planes, causing significant over-estimation and lower verification accuracy in RNN robustness verification.

Method: Proposes a truncated rectangular prism formed by two linear relaxation planes and a refinement-driven method to minimize volume and surface area for tighter over-approximation of the three-dimensional nonlinear surface from Hadamard product.

Result: DeepPrism prototype shows significant improvement over state-of-the-art approaches in image classification, speech recognition, and sentiment analysis tasks.

Conclusion: The proposed truncated rectangular prism method provides tighter over-approximation of nonlinear activation functions, leading to more accurate RNN robustness verification across multiple domains.

Abstract: Robustness verification is a promising technique for rigorously proving Recurrent Neural Networks (RNNs) robustly. A key challenge is to over-approximate the nonlinear activation functions with linear constraints, which can transform the verification problem into an efficiently solvable linear programming problem. Existing methods over-approximate the nonlinear parts with linear bounding planes individually, which may cause significant over-estimation and lead to lower verification accuracy. In this paper, in order to tightly enclose the three-dimensional nonlinear surface generated by the Hadamard product, we propose a novel truncated rectangular prism formed by two linear relaxation planes and a refinement-driven method to minimize both its volume and surface area for tighter over-approximation. Based on this approximation, we implement a prototype DeepPrism for RNN robustness verification. The experimental results demonstrate that \emph{DeepPrism} has significant improvement compared with the state-of-the-art approaches in various tasks of image classification, speech recognition and sentiment analysis.

[850] Bayesian Neural Networks with Monte Carlo Dropout for Probabilistic Electricity Price Forecasting

Abhinav Das, Stephan Schlüter

Main category: cs.LG

TL;DR: A probabilistic electricity price forecasting framework using Bayesian neural networks with Monte Carlo dropout that outperforms traditional benchmark models in both point predictions and uncertainty intervals.

DetailsMotivation: Traditional point forecasts fail to capture uncertainties in volatile electricity markets, limiting their utility for risk management in deregulated markets.

Method: Bayesian neural networks with Monte Carlo dropout, training separate models for each hour of the day to capture diurnal patterns, compared against GARCHX and LEAR benchmark models.

Result: The proposed BNN model outperforms benchmark models (GARCHX and LEAR) in terms of both point prediction accuracy and interval forecasting quality.

Conclusion: The framework serves as a reference for leveraging probabilistic neural models in energy market predictions, demonstrating superior performance over traditional econometric approaches.

Abstract: Accurate electricity price forecasting is critical for strategic decision-making in deregulated electricity markets, where volatility stems from complex supply-demand dynamics and external factors. Traditional point forecasts often fail to capture inherent uncertainties, limiting their utility for risk management. This work presents a framework for probabilistic electricity price forecasting using Bayesian neural networks (BNNs) with Monte Carlo (MC) dropout, training separate models for each hour of the day to capture diurnal patterns. A critical assessment and comparison with the benchmark model, namely: generalized autoregressive conditional heteroskedasticity with exogenous variable (GARCHX) model and the LASSO estimated auto-regressive model (LEAR), highlights that the proposed model outperforms the benchmark models in terms of point prediction and intervals. This work serves as a reference for leveraging probabilistic neural models in energy market predictions.

[851] Enhancing Reinforcement Learning in 3D Environments through Semantic Segmentation: A Case Study in ViZDoom

Hugo Huang

Main category: cs.LG

TL;DR: Proposes two novel input representations (SS-only and RGB+SS) using semantic segmentation to address memory consumption and POMDP challenges in 3D RL environments, achieving up to 98.6% memory reduction and performance improvements.

DetailsMotivation: Address two major challenges in 3D RL: high memory consumption from memory buffers and complexity of learning in partially observable Markov Decision Processes (POMDPs).

Method: Uses semantic segmentation on RGB color images to create SS-only and RGB+SS input representations, tested in ViZDoom deathmatches with perfect segmentation for controlled evaluation, and applies run-length encoding compression.

Result: SS-only reduced memory consumption by 66.6-98.6%, RGB+SS significantly enhanced RL agent performance, and density-based heatmapping effectively visualized movement patterns.

Conclusion: Semantic segmentation-based input representations effectively address memory and POMDP challenges in 3D RL, with SS-only providing major memory savings and RGB+SS improving performance, while overcoming common pitfalls in applying semantic segmentation to 3D environments.

Abstract: Reinforcement learning (RL) in 3D environments with high-dimensional sensory input poses two major challenges: (1) the high memory consumption induced by memory buffers required to stabilise learning, and (2) the complexity of learning in partially observable Markov Decision Processes (POMDPs). This project addresses these challenges by proposing two novel input representations: SS-only and RGB+SS, both employing semantic segmentation on RGB colour images. Experiments were conducted in deathmatches of ViZDoom, utilizing perfect segmentation results for controlled evaluation. Our results showed that SS-only was able to reduce the memory consumption of memory buffers by at least 66.6%, and up to 98.6% when a vectorisable lossless compression technique with minimal overhead such as run-length encoding is applied. Meanwhile, RGB+SS significantly enhances RL agents’ performance with the additional semantic information provided. Furthermore, we explored density-based heatmapping as a tool to visualise RL agents’ movement patterns and evaluate their suitability for data collection. A brief comparison with a previous approach highlights how our method overcame common pitfalls in applying semantic segmentation in 3D environments like ViZDoom.

[852] Simple Vision-Language Math Reasoning via Rendered Text

Matvey Skripkin, Elizaveta Goncharova, Andrey Kuznetsov

Main category: cs.LG

TL;DR: A lightweight pipeline that renders LaTeX equations into images paired with chain-of-thought prompts, enabling compact multimodal models to achieve state-of-the-art math reasoning accuracy while maintaining general-domain competence.

DetailsMotivation: To develop an effective yet simple method for training vision-language models to solve math problems by leveraging visual representations of equations and structured reasoning prompts.

Method: Render LaTeX encoded equations into images and pair them with structured chain-of-thought prompts, using text-to-vision augmentation with compact multimodal architectures.

Result: Achieves state-of-the-art reasoning accuracy on math problems, matches or surpasses both open-source and proprietary math-focused solvers, and shows gains of up to 20% on tasks like MMMU, ChartQA, and DocVQA while preserving broad general-domain competence.

Conclusion: The approach demonstrates that rendering fidelity and prompt design are key drivers of performance, and that simple text-to-vision augmentation can enable compact models to excel at math reasoning while maintaining versatility across domains.

Abstract: We present a lightweight yet effective pipeline for training vision-language models to solve math problems by rendering LaTeX encoded equations into images and pairing them with structured chain-of-thought prompts. This simple text-to-vision augmentation enables compact multimodal architectures to achieve state-of-the-art reasoning accuracy. Through systematic ablations, we find that rendering fidelity and prompt design are the primary drivers of performance. Despite its simplicity, our approach consistently matches or surpasses both open-source and proprietary math-focused vision-language solvers on widely used benchmarks, while preserving broad general-domain competence - showing gains on tasks such as MMMU, ChartQA, and DocVQA of up to 20%.

[853] Multimodal ML: Quantifying the Improvement of Calorie Estimation Through Image-Text Pairs

Arya Narang

Main category: cs.LG

TL;DR: Multimodal CNN combining dish names and images reduces calorie estimation error by 1.06 kcal (1.25% improvement) compared to image-only model.

DetailsMotivation: To determine if short textual inputs (dish names) can significantly improve calorie estimation accuracy over image-only models.

Method: Used TensorFlow and Nutrition5k dataset to train both image-only CNN and multimodal CNN that accepts both text and image inputs.

Result: MAE reduced from 84.76 kcal to 83.70 kcal (1.25% improvement) when using multimodal model compared to image-only baseline.

Conclusion: Short textual inputs provide statistically significant but modest improvements in calorie estimation accuracy.

Abstract: This paper determines the extent to which short textual inputs (in this case, names of dishes) can improve calorie estimation compared to an image-only baseline model and whether any improvements are statistically significant. Utilizes the TensorFlow library and the Nutrition5k dataset (curated by Google) to train both an image-only CNN and multimodal CNN that accepts both text and an image as input. The MAE of calorie estimations was reduced by 1.06 kcal from 84.76 kcal to 83.70 kcal (1.25% improvement) when using the multimodal model.

[854] Context-Aware Multimodal Representation Learning for Spatio-Temporally Explicit Environmental modelling

Julia Peters, Karin Mora, Miguel D. Mahecha, Chaonan Ji, David Montero, Clemens Mosig, Guido Kraemer

Main category: cs.LG

TL;DR: A framework that integrates Earth observation modalities into unified high-resolution embeddings for ecological analysis, overcoming fixed-scale limitations of existing models.

DetailsMotivation: Existing Earth observation foundation models operate at fixed spatial/temporal scales, limiting ecological analyses that need both fine spatial detail and high temporal fidelity.

Method: Two-stage representation learning: first model sensors independently, then combine into shared model at native 10m resolution with cloud-free Sentinel-2 frequency, enabling easy extension to new sensors.

Result: Learned embeddings show high spatial/semantic consistency across landscapes and encode ecologically meaningful patterns for Gross Primary Production modeling with sufficient temporal fidelity.

Conclusion: The framework provides flexible, analysis-ready representation learning for environmental applications requiring diverse spatial and temporal resolutions.

Abstract: Earth observation (EO) foundation models have emerged as an effective approach to derive latent representations of the Earth system from various remote sensing sensors. These models produce embeddings that can be used as analysis-ready datasets, enabling the modelling of ecosystem dynamics without extensive sensor-specific preprocessing. However, existing models typically operate at fixed spatial or temporal scales, limiting their use for ecological analyses that require both fine spatial detail and high temporal fidelity. To overcome these limitations, we propose a representation learning framework that integrates different EO modalities into a unified feature space at high spatio-temporal resolution. We introduce the framework using Sentinel-1 and Sentinel-2 data as representative modalities. Our approach produces a latent space at native 10 m resolution and the temporal frequency of cloud-free Sentinel-2 acquisitions. Each sensor is first modeled independently to capture its sensor-specific characteristics. Their representations are then combined into a shared model. This two-stage design enables modality-specific optimisation and easy extension to new sensors, retaining pretrained encoders while retraining only fusion layers. This enables the model to capture complementary remote sensing data and to preserve coherence across space and time. Qualitative analyses reveal that the learned embeddings exhibit high spatial and semantic consistency across heterogeneous landscapes. Quantitative evaluation in modelling Gross Primary Production reveals that they encode ecologically meaningful patterns and retain sufficient temporal fidelity to support fine-scale analyses. Overall, the proposed framework provides a flexible, analysis-ready representation learning approach for environmental applications requiring diverse spatial and temporal resolutions.

[855] FSC-Net: Fast-Slow Consolidation Networks for Continual Learning

Mohamed El Gorrim

Main category: cs.LG

TL;DR: FSC-Net uses dual networks (fast for learning, slow for consolidation) to combat catastrophic forgetting in continual learning, achieving significant performance gains through simple MLP architecture and replay-based consolidation.

DetailsMotivation: Address catastrophic forgetting in neural networks by drawing inspiration from memory consolidation in neuroscience, where knowledge is gradually consolidated over time.

Method: Dual-network architecture with fast network (NN1) for immediate task learning and slow network (NN2) for gradual knowledge consolidation using replay and distillation. Simple MLP architecture outperforms complex variants.

Result: On Split-MNIST: 91.71% retention accuracy (+4.27pp gain over fast network alone). On Split-CIFAR-10: 33.31% retention (+8.20pp gain), though absolute performance remains modest. Pure replay without distillation works best.

Conclusion: Dual-timescale consolidation mechanism is more important than architectural complexity for mitigating catastrophic forgetting. Simple MLP with replay-based consolidation provides effective continual learning.

Abstract: Continual learning remains challenging due to catastrophic forgetting, where neural networks lose previously acquired knowledge when learning new tasks. Inspired by memory consolidation in neuroscience, we propose FSC-Net (Fast-Slow Consolidation Networks), a dual-network architecture that separates rapid task learning from gradual knowledge consolidation. Our method employs a fast network (NN1) for immediate adaptation to new tasks and a slow network (NN2) that consolidates knowledge through distillation and replay. Within the family of MLP-based NN1 variants we evaluated, consolidation effectiveness is driven more by methodology than architectural embellishments – a simple MLP outperforms more complex similarity-gated variants by 1.2pp. Through systematic hyperparameter analysis, we observed empirically that pure replay without distillation during consolidation achieves superior performance, consistent with the hypothesis that distillation from the fast network introduces recency bias. On Split-MNIST (30 seeds), FSC-Net achieves 91.71% +/- 0.62% retention accuracy, a +4.27pp gain over the fast network alone (87.43% +/- 1.27%, paired t=23.585, p < 1e-10). On Split-CIFAR-10 (5 seeds), our method achieves 33.31% +/- 0.38% retention with an +8.20pp gain over the fast network alone (25.11% +/- 1.61%, paired t=9.75, p < 1e-3), demonstrating +8.20pp gain, though absolute performance (33.31%) remains modest and below random expectation, highlighting need for stronger backbones. Our results provide empirical evidence that the dual-timescale consolidation mechanism, rather than architectural complexity, is central to mitigating catastrophic forgetting in this setting.

[856] Which Sparse Autoencoder Features Are Real? Model-X Knockoffs for False Discovery Rate Control

Tsogt-Ochir Enkhbayar

Main category: cs.LG

TL;DR: Introducing Model-X knockoffs to sparse autoencoder feature selection to control false discovery rate with finite-sample guarantees for reliable feature discovery in neural networks.

DetailsMotivation: To address the challenge of distinguishing real computational patterns from erroneous correlations in sparse autoencoders and provide principled feature selection with controlled false discovery rates.

Method: Using Model-X knockoffs with knockoff+ to control FDR under Gaussian surrogate assumptions, applied to 512 high-activity SAE latents from Pythia-70M for sentiment classification.

Result: Selected 129 features at target FDR q=0.1, showing 25% of examined latents carry task-relevant signal with 5.40x separation in knockoff statistics between selected and non-selected features.

Conclusion: The method provides a reproducible and principled framework combining SAEs with multiple-testing-aware inference, advancing mechanistic interpretability foundations.

Abstract: Although sparse autoencoders (SAEs) are crucial for identifying interpretable features in neural networks, it is still challenging to distinguish between real computational patterns and erroneous correlations. We introduce Model-X knockoffs to SAE feature selection, using knock-off+ to control the false discovery rate (FDR) with finite-sample guarantees under the standard Model-X assumptions (in our case, via a Gaussian surrogate for the latent distribution). We select 129 features at a target FDR q=0.1 after analyzing 512 high-activity SAE latents for sentiment classification using Pythia-70M. About 25% of the latents under examination carry task-relevant signal, whereas 75% do not, according to the chosen set, which displays a 5.40x separation in knockoff statistics compared to non-selected features. Our method offers a re-producible and principled framework for reliable feature discovery by combining SAEs with multiple-testing-aware inference, advancing the foundations of mechanistic interpretability.

[857] Reasoning: From Reflection to Solution

Zixi Li

Main category: cs.LG

TL;DR: Reasoning is defined as iterative operator application in state spaces converging to fixed points, explaining LLM limitations and proposing a new architecture (OpenLM) that achieves 76% accuracy where SOTA LLMs fail completely.

DetailsMotivation: To understand whether large language models have learned genuine reasoning or just pattern-matching over reasoning traces, and to define what reasoning fundamentally requires.

Method: Proposed a theoretical framework defining reasoning as iterative operator application in state spaces, then developed OpenLM architecture implementing this principle through OpenOperator approach.

Result: OpenLM achieved 76% accuracy on OpenXOR benchmark where state-of-the-art LLMs achieved 0% accuracy, demonstrating the effectiveness of the proposed reasoning architecture.

Conclusion: Genuine reasoning requires iterative operator application in state spaces, and current LLM limitations can be overcome by architectures that explicitly provide this capability rather than relying on pattern-matching.

Abstract: What is reasoning? This question has driven centuries of philosophical inquiry, from Aristotle’s syllogisms to modern computational complexity theory. In the age of large language models achieving superhuman performance on benchmarks like GSM8K (95% accuracy) and HumanEval (90% pass@1), we must ask: have these systems learned to \emph{reason}, or have they learned to \emph{pattern-match over reasoning traces}? This paper argues for a specific answer: \textbf{reasoning is iterative operator application in state spaces, converging to fixed points}. This definition is not merely philosophical – it has concrete architectural implications that explain both the failures of current systems and the path to genuine reasoning capabilities. Our investigation begins with a puzzle (OpenXOR), progresses through theory (OpenOperator), and culminates in a working solution (OpenLM) that achieves 76% accuracy where state-of-the-art LLMs achieve 0%. This is not about criticizing existing systems, but about \emph{understanding what reasoning requires} and \emph{building architectures that provide it}.

[858] Federated Learning for Pediatric Pneumonia Detection: Enabling Collaborative Diagnosis Without Sharing Patient Data

Daniel M. Jimenez-Gutierrez, Enrique Zuazua, Joaquin Del Rio, Oleksii Sliusarenko, Xabi Uribe-Etxebarria

Main category: cs.LG

TL;DR: Federated Learning enables multiple hospitals to collaboratively train pneumonia detection models from chest X-rays while keeping data private and local, achieving 47.5-50% performance gains over single-hospital models.

DetailsMotivation: Early pneumonia detection from chest X-rays is critical but hindered by distributed data, privacy regulations (HIPAA/GDPR), and high inter-hospital variability that make centralization impractical.

Method: Used Federated Learning with Sherpa.ai platform to simulate cross-hospital collaboration with non-IID data across multiple nodes, training CXR classifiers while keeping data in place.

Result: FL achieved 0.900 Accuracy and 0.966 ROC-AUC, representing 47.5% and 50.0% gains over single-hospital models (0.610; 0.644) without transferring any patient data.

Conclusion: FL delivers high-performing, generalizable, secure pneumonia detection across healthcare networks, enabling multi-institutional collaboration without data movement, especially valuable for rare diseases and low-data domains.

Abstract: Early and accurate pneumonia detection from chest X-rays (CXRs) is clinically critical to expedite treatment and isolation, reduce complications, and curb unnecessary antibiotic use. Although artificial intelligence (AI) substantially improves CXR-based detection, development is hindered by globally distributed data, high inter-hospital variability, and strict privacy regulations (e.g., HIPAA, GDPR) that make centralization impractical. These constraints are compounded by heterogeneous imaging protocols, uneven data availability, and the costs of transferring large medical images across geographically dispersed sites. In this paper, we evaluate Federated Learning (FL) using the Sherpa.ai FL platform, enabling multiple hospitals (nodes) to collaboratively train a CXR classifier for pneumonia while keeping data in place and private. Using the Pediatric Pneumonia Chest X-ray dataset, we simulate cross-hospital collaboration with non-independent and non-identically distributed (non-IID) data, reproducing real-world variability across institutions and jurisdictions. Our experiments demonstrate that collaborative and privacy-preserving training across multiple hospitals via FL led to a dramatic performance improvement achieving 0.900 Accuracy and 0.966 ROC-AUC, corresponding to 47.5% and 50.0% gains over single-hospital models (0.610; 0.644), without transferring any patient CXR. These results indicate that FL delivers high-performing, generalizable, secure and private pneumonia detection across healthcare networks, with data kept local. This is especially relevant for rare diseases, where FL enables secure multi-institutional collaboration without data movement, representing a breakthrough for accelerating diagnosis and treatment development in low-data domains.

[859] Multiscale Grassmann Manifolds for Single-Cell Data Analysis

Xiang Xiang Wang, Sean Cottrell, Guo-Wei Wei

Main category: cs.LG

TL;DR: A multiscale framework using Grassmann manifolds for single-cell data analysis that integrates machine learning with subspace geometry to better capture cellular heterogeneity and geometric structures.

DetailsMotivation: Conventional Euclidean space representations limit the ability to capture intrinsic correlations and multiscale geometric structures in single-cell data analysis.

Method: Proposes a multiscale framework based on Grassmann manifolds that generates embeddings under multiple representation scales and combines their features from different geometric views into a unified Grassmann manifold, using a power-based scale sampling function to control scale selection.

Result: Experiments on nine benchmark single-cell RNA-seq datasets show the approach effectively preserves meaningful structures and provides stable clustering performance, especially for small to medium-sized datasets.

Conclusion: Grassmann manifolds offer a coherent and informative foundation for analyzing single-cell data by better capturing intrinsic correlations and geometric structures.

Abstract: Single-cell data analysis seeks to characterize cellular heterogeneity based on high-dimensional gene expression profiles. Conventional approaches represent each cell as a vector in Euclidean space, which limits their ability to capture intrinsic correlations and multiscale geometric structures. We propose a multiscale framework based on Grassmann manifolds that integrates machine learning with subspace geometry for single-cell data analysis. By generating embeddings under multiple representation scales, the framework combines their features from different geometric views into a unified Grassmann manifold. A power-based scale sampling function is introduced to control the selection of scales and balance in- formation across resolutions. Experiments on nine benchmark single-cell RNA-seq datasets demonstrate that the proposed approach effectively preserves meaningful structures and provides stable clustering performance, particularly for small to medium-sized datasets. These results suggest that Grassmann manifolds offer a coherent and informative foundation for analyzing single cell data.

[860] Fast 3D Surrogate Modeling for Data Center Thermal Management

Soumyendu Sarkar, Antonio Guillen-Perez, Zachariah J Carmichael, Avisek Naug, Refik Mert Cam, Vineet Gundecha, Ashwin Ramesh Babu, Sahand Ghorbanpour, Ricardo Luna Gutierrez

Main category: cs.LG

TL;DR: Vision-based surrogate models for real-time 3D temperature prediction in data centers, achieving 20,000x speedup over traditional CFD solvers and enabling 7% energy savings through optimized cooling control.

DetailsMotivation: Reduce energy consumption and carbon emissions in data centers by enabling real-time temperature prediction for sustainability and operational efficiency, overcoming limitations of computationally expensive traditional CFD solvers.

Method: Develop vision-based surrogate modeling framework using 3D voxelized data center representation with server workloads, fan speeds, and HVAC set points. Evaluate multiple architectures: 3D CNN U-Net variants, 3D Fourier Neural Operator, and 3D vision transformers to map thermal inputs to heat maps.

Result: Surrogate models generalize across data center configurations and achieve up to 20,000x speedup (hundreds of milliseconds vs. hours). Enables real-time cooling control and workload redistribution.

Conclusion: Fast and accurate temperature estimation enables substantial energy savings (7%) and reduced carbon footprint through real-time thermal management in data centers.

Abstract: Reducing energy consumption and carbon emissions in data centers by enabling real-time temperature prediction is critical for sustainability and operational efficiency. Achieving this requires accurate modeling of the 3D temperature field to capture airflow dynamics and thermal interactions under varying operating conditions. Traditional thermal CFD solvers, while accurate, are computationally expensive and require expert-crafted meshes and boundary conditions, making them impractical for real-time use. To address these limitations, we develop a vision-based surrogate modeling framework that operates directly on a 3D voxelized representation of the data center, incorporating server workloads, fan speeds, and HVAC temperature set points. We evaluate multiple architectures, including 3D CNN U-Net variants, a 3D Fourier Neural Operator, and 3D vision transformers, to map these thermal inputs to high-fidelity heat maps. Our results show that the surrogate models generalize across data center configurations and achieve up to 20,000x speedup (hundreds of milliseconds vs. hours). This fast and accurate estimation of hot spots and temperature distribution enables real-time cooling control and workload redistribution, leading to substantial energy savings (7%) and reduced carbon footprint.

[861] Optimizing Input of Denoising Score Matching is Biased Towards Higher Score Norm

Tongda Xu

Main category: cs.LG

TL;DR: Optimizing conditional inputs or data distributions using denoising score matching in diffusion models introduces bias, breaking equivalence with exact score matching and increasing score norm.

DetailsMotivation: To investigate the bias introduced when using denoising score matching for optimization tasks in diffusion models, as many recent works rely on this approach.

Method: Analysis of denoising score matching optimization in diffusion models, examining the breakdown of equivalence with exact score matching and resulting score norm increase.

Result: Demonstrated that optimization breaks equivalence between denoising and exact score matching, leads to higher score norm, and affects various applications including MAR, PerCo, and DreamFusion.

Conclusion: The bias in denoising score matching optimization affects multiple domains and applications, highlighting a fundamental limitation in current approaches.

Abstract: Many recent works utilize denoising score matching to optimize the conditional input of diffusion models. In this workshop paper, we demonstrate that such optimization breaks the equivalence between denoising score matching and exact score matching. Furthermore, we show that this bias leads to higher score norm. Additionally, we observe a similar bias when optimizing the data distribution using a pre-trained diffusion model. Finally, we discuss the wide range of works across different domains that are affected by this bias, including MAR for auto-regressive generation, PerCo for image compression, and DreamFusion for text to 3D generation.

[862] Physics-Informed Neural ODEs with Scale-Aware Residuals for Learning Stiff Biophysical Dynamics

Kamalpreet Singh Kainth, Prathamesh Dinesh Joshi, Raj Abhijit Dandekar, Rajat Dandekar, Sreedat Panat

Main category: cs.LG

TL;DR: PI-NODE-SR combines neural ODEs with scale-aware residual normalization and explicit solvers to reliably forecast stiff biophysical systems, overcoming limitations of standard Neural ODEs that struggle with oscillatory dynamics.

DetailsMotivation: Standard Neural ODEs and physics-informed variants are unreliable for forecasting stiff biophysical systems, requiring excessive iterations and often converging to suboptimal solutions that fail to preserve oscillatory frequency or amplitude.

Method: PhysicsInformed Neural ODEs with Scale-Aware Residuals (PI-NODE-SR) combines low-order explicit solver (Heun method) residual normalization to balance contributions between state variables evolving on disparate timescales, stabilizing training under realistic iteration budgets.

Result: On Hodgkin-Huxley equations, PI-NODE-SR learns from a single oscillation and extrapolates beyond 100 ms, capturing oscillation frequency and near-correct amplitudes. It recovers morphological features like sharp subthreshold curvature typically reserved for higher-order solvers.

Conclusion: PI-NODE-SR consistently reduces long-horizon errors relative to baseline Neural-ODEs and PINNs, offering a principled route to stable and efficient learning of stiff biological dynamics, though performance remains sensitive to initialization.

Abstract: Neural differential equations offer a powerful framework for modeling continuous-time dynamics, but forecasting stiff biophysical systems remains unreliable. Standard Neural ODEs and physics informed variants often require orders of magnitude more iterations, and even then may converge to suboptimal solutions that fail to preserve oscillatory frequency or amplitude. We introduce PhysicsInformed Neural ODEs with with Scale-Aware Residuals (PI-NODE-SR), a framework that combines a low-order explicit solver (Heun method) residual normalisation to balance contributions between state variables evolving on disparate timescales. This combination stabilises training under realistic iteration budgets and avoids reliance on computationally expensive implicit solvers. On the Hodgkin-Huxley equations, PI-NODE-SR learns from a single oscillation simulated with a stiff solver (Rodas5P) and extrapolates beyond 100 ms, capturing both oscillation frequency and near-correct amplitudes. Remarkably, end-to-end learning of the vector field enables PI-NODE-SR to recover morphological features such as sharp subthreshold curvature in gating variables that are typically reserved for higher-order solvers, suggesting that neural correction can offset numerical diffusion. While performance remains sensitive to initialisation, PI-NODE-SR consistently reduces long-horizon errors relative to baseline Neural-ODEs and PINNs, offering a principled route to stable and efficient learning of stiff biological dynamics.

[863] KAN/H: Kolmogorov-Arnold Network using Haar-like bases

Susumu Katayama

Main category: cs.LG

TL;DR: KAN/H is a variant of Kolmogorov-Arnold Network that uses Haar-variant basis system instead of B-spline, requiring minimal hyper-parameter tuning for function approximation and MNIST tasks.

DetailsMotivation: To develop a more efficient Kolmogorov-Arnold Network variant that reduces the need for extensive hyper-parameter tuning while maintaining performance.

Method: Replaced B-spline basis system with Haar-variant basis system that incorporates both global and local bases in the KAN architecture.

Result: The proposed KAN/H achieves good performance on function approximation problems and MNIST classification without requiring most problem-specific hyper-parameter tunings.

Conclusion: KAN/H with Haar-variant basis system provides an effective alternative to traditional KAN networks, offering simplified implementation through reduced hyper-parameter dependency.

Abstract: This paper proposes KAN/H, a variant of Kolmogorov-Arnold Network (KAN) that uses a Haar-variant basis system having both global and local bases instead of B-spline. The resulting algorithm is applied to function approximation problems and MNIST. We show that it does not require most of the problem-specific hyper-parameter tunings.

[864] DK-Root: A Joint Data-and-Knowledge-Driven Framework for Root Cause Analysis of QoE Degradations in Mobile Networks

Qizhe Li, Haolong Chen, Jiansheng Li, Shuqi Chai, Xuan Li, Yuzhou Hou, Xinhua Shao, Fangfang Li, Kaifeng Han, Guangxu Zhu

Main category: cs.LG

TL;DR: DK-Root is a joint data-and-knowledge-driven framework that combines scalable weak supervision with expert guidance for robust root-cause analysis of QoE degradations in mobile networks, outperforming traditional methods through contrastive learning and conditional diffusion augmentation.

DetailsMotivation: Diagnosing QoE degradations in mobile networks is challenging due to complex cross-layer KPI interactions and scarce expert annotations. Rule-based heuristics provide scalable labels but are noisy and coarse-grained, limiting purely data-driven approaches.

Method: DK-Root pretrains an encoder via contrastive representation learning using rule-based labels while denoising noise through supervised contrastive objective. It introduces a class-conditional diffusion model for task-faithful KPI sequence augmentation, and jointly fine-tunes encoder and classifier with expert-verified labels.

Result: Extensive experiments on real-world operator-grade dataset show state-of-the-art accuracy, with DK-Root surpassing traditional ML and recent semi-supervised time-series methods. Ablations confirm the necessity of conditional diffusion augmentation and pretrain-finetune design.

Conclusion: DK-Root effectively addresses the challenges of QoE root-cause analysis by unifying scalable weak supervision with precise expert guidance, demonstrating superior performance through its contrastive learning and conditional diffusion augmentation approach.

Abstract: Diagnosing the root causes of Quality of Experience (QoE) degradations in operational mobile networks is challenging due to complex cross-layer interactions among kernel performance indicators (KPIs) and the scarcity of reliable expert annotations. Although rule-based heuristics can generate labels at scale, they are noisy and coarse-grained, limiting the accuracy of purely data-driven approaches. To address this, we propose DK-Root, a joint data-and-knowledge-driven framework that unifies scalable weak supervision with precise expert guidance for robust root-cause analysis. DK-Root first pretrains an encoder via contrastive representation learning using abundant rule-based labels while explicitly denoising their noise through a supervised contrastive objective. To supply task-faithful data augmentation, we introduce a class-conditional diffusion model that generates KPIs sequences preserving root-cause semantics, and by controlling reverse diffusion steps, it produces weak and strong augmentations that improve intra-class compactness and inter-class separability. Finally, the encoder and the lightweight classifier are jointly fine-tuned with scarce expert-verified labels to sharpen decision boundaries. Extensive experiments on a real-world, operator-grade dataset demonstrate state-of-the-art accuracy, with DK-Root surpassing traditional ML and recent semi-supervised time-series methods. Ablations confirm the necessity of the conditional diffusion augmentation and the pretrain-finetune design, validating both representation quality and classification gains.

[865] Uncertainty Makes It Stable: Curiosity-Driven Quantized Mixture-of-Experts

Sebastián Andrés Cajas Ordóñez, Luis Fernando Torres Torres, Mackenzie J. Meni, Carlos Andrés Duran Paredes, Eric Arazo, Cristian Bosch, Ricardo Simon Carbajo, Yuan Lai, Leo Anthony Celi

Main category: cs.LG

TL;DR: A curiosity-driven quantized Mixture-of-Experts framework achieves 99.9% of 16-bit accuracy with 4-bit quantization, 4x compression, 41% energy savings, and 82% reduction in latency variance for edge deployment.

DetailsMotivation: Deploying deep neural networks on resource-constrained devices requires maintaining accuracy under aggressive quantization while ensuring predictable inference latency, especially since deployment emissions dominate training by 10000x for models serving over 1,000 inferences.

Method: Bayesian epistemic uncertainty-based routing across heterogeneous experts including BitNet ternary, 1-16 bit BitLinear, and post-training quantization, evaluated on audio classification benchmarks (ESC-50, Quinn, UrbanSound8K).

Result: 4-bit quantization maintains 0.858 F1 score (99.9% of 16-bit accuracy), achieves 4x compression, 41% energy savings vs 8-bit, and reduces MoE latency variance by 82% (from 230 ms to 29 ms standard deviation).

Conclusion: Simple 4-bit quantized architectures outperform complex MoE for most deployments, demonstrating that adaptive quantization yields accurate, energy-efficient, and predictable edge models with practical equivalence to full precision.

Abstract: Deploying deep neural networks on resource-constrained devices faces two critical challenges: maintaining accuracy under aggressive quantization while ensuring predictable inference latency. We present a curiosity-driven quantized Mixture-of-Experts framework that addresses both through Bayesian epistemic uncertainty-based routing across heterogeneous experts (BitNet ternary, 1-16 bit BitLinear, post-training quantization). Evaluated on audio classification benchmarks (ESC-50, Quinn, UrbanSound8K), our 4-bit quantization maintains 99.9 percent of 16-bit accuracy (0.858 vs 0.859 F1) with 4x compression and 41 percent energy savings versus 8-bit. Crucially, curiosity-driven routing reduces MoE latency variance by 82 percent (p = 0.008, Levene’s test) from 230 ms to 29 ms standard deviation, enabling stable inference for battery-constrained devices. Statistical analysis confirms 4-bit/8-bit achieve practical equivalence with full precision (p > 0.05), while MoE architectures introduce 11 percent latency overhead (p < 0.001) without accuracy gains. At scale, deployment emissions dominate training by 10000x for models serving more than 1,000 inferences, making inference efficiency critical. Our information-theoretic routing demonstrates that adaptive quantization yields accurate (0.858 F1, 1.2M params), energy-efficient (3.87 F1/mJ), and predictable edge models, with simple 4-bit quantized architectures outperforming complex MoE for most deployments.

[866] Diffusion Models: A Mathematical Introduction

Sepehr Maleki, Negar Pourmoazemi

Main category: cs.LG

TL;DR: A comprehensive derivation of diffusion-based generative models from first principles, covering forward/reverse processes, variational bounds, sampling methods, continuous-time formulations, and guidance techniques.

DetailsMotivation: To provide a transparent, self-contained mathematical foundation for diffusion models that bridges theory and implementation, making the complex concepts accessible through clear algebra and consistent notation.

Method: Systematic derivation starting from Gaussian distribution properties, constructing denoising diffusion probabilistic models, variational bounds, continuous-time formulations via SDE/ODE, and various sampling acceleration techniques including DDIM, DDGAN, and flow matching.

Result: Establishes complete theoretical framework showing how standard noise-prediction objectives emerge naturally from variational principles, connects discrete and continuous formulations, and provides unified understanding of guidance methods.

Conclusion: The paper successfully demonstrates that diffusion models can be systematically derived from first principles, providing both theoretical clarity and practical implementation guidance through transparent mathematical exposition.

Abstract: We present a concise, self-contained derivation of diffusion-based generative models. Starting from basic properties of Gaussian distributions (densities, quadratic expectations, re-parameterisation, products, and KL divergences), we construct denoising diffusion probabilistic models from first principles. This includes the forward noising process, its closed-form marginals, the exact discrete reverse posterior, and the related variational bound. This bound simplifies to the standard noise-prediction goal used in practice. We then discuss likelihood estimation and accelerated sampling, covering DDIM, adversarially learned reverse dynamics (DDGAN), and multi-scale variants such as nested and latent diffusion, with Stable Diffusion as a canonical example. A continuous-time formulation follows, in which we derive the probability-flow ODE from the diffusion SDE via the continuity and Fokker-Planck equations, introduce flow matching, and show how rectified flows recover DDIM up to a time re-parameterisation. Finally, we treat guided diffusion, interpreting classifier guidance as a posterior score correction and classifier-free guidance as a principled interpolation between conditional and unconditional scores. Throughout, the focus is on transparent algebra, explicit intermediate steps, and consistent notation, so that readers can both follow the theory and implement the corresponding algorithms in practice.

[867] IDOL: Meeting Diverse Distribution Shifts with Prior Physics for Tropical Cyclone Multi-Task Estimation

Hanting Yan, Pan Mu, Shiqi Zhang, Yuchao Zhu, Jinglin Zhang, Cong Bai

Main category: cs.LG

TL;DR: IDOL is a physical invariant learning framework for tropical cyclone estimation that uses identity-oriented constraints guided by prior physical knowledge to handle distribution shifts in TC environmental fields.

DetailsMotivation: Distribution shifts in tropical cyclone environmental fields due to geographical and seasonal variations challenge reliable TC estimation. Existing methods overlook feature representation distributions, leading to poor generalization in out-of-distribution scenarios.

Method: IDOL employs wind field models and dark correlation knowledge to model task-shared and task-specific identity tokens that capture task dependencies and physical invariances, imposing identity-oriented constraints to regulate the feature space.

Result: Extensive experiments show IDOL outperforms existing methods in estimating TC wind speed, pressure, and inner/outer-core size under distribution shifts across multiple datasets and tasks.

Conclusion: Imposing identity-oriented constraints based on prior physical knowledge effectively mitigates diverse distribution shifts in tropical cyclone estimation.

Abstract: Tropical Cyclone (TC) estimation aims to accurately estimate various TC attributes in real time. However, distribution shifts arising from the complex and dynamic nature of TC environmental fields, such as varying geographical conditions and seasonal changes, present significant challenges to reliable estimation. Most existing methods rely on multi-modal fusion for feature extraction but overlook the intrinsic distribution of feature representations, leading to poor generalization under out-of-distribution (OOD) scenarios. To address this, we propose an effective Identity Distribution-Oriented Physical Invariant Learning framework (IDOL), which imposes identity-oriented constraints to regulate the feature space under the guidance of prior physical knowledge, thereby dealing distribution variability with physical invariance. Specifically, the proposed IDOL employs the wind field model and dark correlation knowledge of TC to model task-shared and task-specific identity tokens. These tokens capture task dependencies and intrinsic physical invariances of TC, enabling robust estimation of TC wind speed, pressure, inner-core, and outer-core size under distribution shifts. Extensive experiments conducted on multiple datasets and tasks demonstrate the outperformance of the proposed IDOL, verifying that imposing identity-oriented constraints based on prior physical knowledge can effectively mitigates diverse distribution shifts in TC estimation.Code is available at https://github.com/Zjut-MultimediaPlus/IDOL.

[868] Improving a Hybrid Graphsage Deep Network for Automatic Multi-objective Logistics Management in Supply Chain

Mehdi Khaleghi, Nastaran Khaleghi, Sobhan Sheykhivand, Sebelan Danishvar

Main category: cs.LG

TL;DR: A hybrid GraphSAGE network (H-GSN) is proposed for multi-task logistics management, achieving high accuracy (97.8%-100%) in predicting shipment types, logistics delays, traffic status, and logistics IDs across three supply chain datasets.

DetailsMotivation: To improve supply chain resiliency and sustainability through efficient logistics management, including reducing air pollutant emissions and enhancing collaboration with logistics service providers by automating prediction of shipment types, logistics delays, and traffic status.

Method: Proposed a hybrid GraphSAGE network (H-GSN) for multi-task logistics management, trained on three different supply chain databases (DataCo, Shipping, and Smart Logistics from Kaggle) to predict shipment type, shipment status, traffic status, logistics ID, and logistics delay.

Result: Achieved average accuracy of 97.8% for logistics ID prediction (10 types) and 100% for traffic status prediction (3 types) on Smart Logistics dataset; 98.7% for shipment type prediction on DataCo dataset; and 99.4% for logistics delay prediction on Shipping dataset.

Conclusion: The proposed H-GSN method effectively improves supply chain resilience and sustainability through accurate multi-task predictions in logistics management, as confirmed by evaluation metrics across different logistics scenarios.

Abstract: Systematic logistics, conveyance amenities and facilities as well as warehousing information play a key role in fostering profitable development in a supply chain. The aim of transformation in industries is the improvement of the resiliency regarding the supply chain. The resiliency policies are required for companies to affect the collaboration with logistics service providers positively. The decrement of air pollutant emissions is a persistent advantage of the efficient management of logistics and transportation in supply chain. The management of shipment type is a significant factor in analyzing the sustainability of logistics and supply chain. An automatic approach to predict the shipment type, logistics delay and traffic status are required to improve the efficiency of the supply chain management. A hybrid graphsage network (H-GSN) is proposed in this paper for multi-task purpose of logistics management in a supply chain. The shipment type, shipment status, traffic status, logistics ID and logistics delay are the objectives in this article regarding three different databases including DataCo, Shipping and Smart Logistcis available on Kaggle as supply chain logistics databases. The average accuracy of 97.8% and 100% are acquired for 10 kinds of logistics ID and 3 types of traffic status prediction in Smart Logistics dataset. The average accuracy of 98.7% and 99.4% are obtained for shipment type prediction in DataCo and logistics delay in Shipping database, respectively. The evaluation metrics for different logistics scenarios confirm the efficiency of the proposed method to improve the resilience and sustainability of the supply chain.

[869] Sumudu Neural Operator for ODEs and PDEs

Ben Zelenskiy, Saibilila Abudukelimu, George Flint, Kevin Zhu, Sunishchal Dev

Main category: cs.LG

TL;DR: SNO is a novel neural operator based on Sumudu Transform that achieves superior performance to FNO on PDEs and competitive accuracy with LNO, particularly excelling on Euler-Bernoulli Beam and Diffusion Equation tasks.

DetailsMotivation: To develop a neural operator leveraging the mathematical properties of Sumudu Transform for solving differential equations more effectively than existing methods.

Method: Decompose input space as coefficients using polynomial expansions of transform pairs, transform them into Sumudu Space, and parameterize the neural operator in this transformed space.

Result: SNO outperforms FNO on PDEs and shows competitive accuracy with LNO, achieving lowest error on Euler-Bernoulli Beam and Diffusion Equation. Successfully demonstrates zero-shot super-resolution capabilities.

Conclusion: The Sumudu Transform shows promise as a neural operator design, particularly for certain classes of PDEs, offering improved performance over existing methods.

Abstract: We introduce the Sumudu Neural Operator (SNO), a neural operator rooted in the properties of the Sumudu Transform. We leverage the relationship between the polynomial expansions of transform pairs to decompose the input space as coefficients, which are then transformed into the Sumudu Space, where the neural operator is parameterized. We evaluate the operator in ODEs (Duffing Oscillator, Lorenz System, and Driven Pendulum) and PDEs (Euler-Bernoulli Beam, Burger’s Equation, Diffusion, Diffusion-Reaction, and Brusselator). SNO achieves superior performance to FNO on PDEs and demonstrates competitive accuracy with LNO on several PDE tasks, including the lowest error on the Euler-Bernoulli Beam and Diffusion Equation. Additionally, we apply zero-shot super-resolution to the PDE tasks to observe the model’s capability of obtaining higher quality data from low-quality samples. These preliminary findings suggest promise for the Sumudu Transform as a neural operator design, particularly for certain classes of PDEs.

[870] Learning Fair Representations with Kolmogorov-Arnold Networks

Amisha Priyadarshini, Sergio Gago-Masague

Main category: cs.LG

TL;DR: The paper proposes using Kolmogorov-Arnold Networks (KANs) in a fair adversarial learning framework with adaptive penalty updates to balance fairness and accuracy in college admissions, outperforming baseline models.

DetailsMotivation: Predictive models often exhibit discrimination towards marginalized groups due to biased data or design, creating challenges in high-stakes domains like college admissions. Existing fair learning models struggle with the fairness-accuracy trade-off and lack interpretability.

Method: Integrates Kolmogorov-Arnold Networks (KANs) within a fair adversarial learning framework with an adaptive penalty update mechanism that dynamically adjusts fairness constraints during training.

Result: Experiments on two real-world college admissions datasets show KANs consistently outperform baseline fair learning models, maintaining high predictive accuracy while achieving competitive fairness across sensitive attributes.

Conclusion: The proposed approach effectively balances fairness and accuracy using KANs’ adversarial robustness and interpretability, making it suitable for socially sensitive domains like college admissions.

Abstract: Despite recent advances in fairness-aware machine learning, predictive models often exhibit discriminatory behavior towards marginalized groups. Such unfairness might arise from biased training data, model design, or representational disparities across groups, posing significant challenges in high-stakes decision-making domains such as college admissions. While existing fair learning models aim to mitigate bias, achieving an optimal trade-off between fairness and accuracy remains a challenge. Moreover, the reliance on black-box models hinders interpretability, limiting their applicability in socially sensitive domains. In this paper, we try to circumvent these issues by integrating Kolmogorov-Arnold Networks (KANs) within a fair adversarial learning framework. Leveraging the adversarial robustness and interpretability of KANs, our approach enables a balance between fairness and accuracy. To further facilitate this balance, we propose an adaptive penalty update mechanism that dynamically adjusts fairness constraints during the model training. We conduct numerical experiments on two real-world college admissions datasets, across three different optimization strategies. The results demonstrate the efficiency and robustness of KANs by consistently outperforming the baseline fair learning models, and maintaining high predictive accuracy while achieving competitive fairness across sensitive attributes.

[871] CATCHFed: Efficient Unlabeled Data Utilization for Semi-Supervised Federated Learning in Limited Labels Environments

Byoungjun Park, Pedro Porto Buarque de Gusmão, Dongjin Ji, Minhoe Kim

Main category: cs.LG

TL;DR: CATCHFed improves semi-supervised federated learning by using adaptive thresholds for pseudo-labeling and consistency regularization, achieving better performance with limited labeled data.

DetailsMotivation: Real-world federated learning scenarios often lack client-side labeled data, and existing semi-supervised FL methods suffer performance degradation when labeled data is scarce.

Method: Proposes client-aware adaptive thresholds considering class difficulty, hybrid thresholds for better pseudo-label quality, and uses unpseudo-labeled data for consistency regularization.

Result: Extensive experiments show CATCHFed effectively leverages unlabeled client data and achieves superior performance in extremely limited-label settings across various datasets.

Conclusion: CATCHFed successfully addresses the challenge of limited labeled data in semi-supervised federated learning through adaptive thresholding and consistency regularization techniques.

Abstract: Federated learning is a promising paradigm that utilizes distributed client resources while preserving data privacy. Most existing FL approaches assume clients possess labeled data, however, in real-world scenarios, client-side labels are often unavailable. Semi-supervised Federated learning, where only the server holds labeled data, addresses this issue. However, it experiences significant performance degradation as the number of labeled data decreases. To tackle this problem, we propose \textit{CATCHFed}, which introduces client-aware adaptive thresholds considering class difficulty, hybrid thresholds to enhance pseudo-label quality, and utilizes unpseudo-labeled data for consistency regularization. Extensive experiments across various datasets and configurations demonstrate that CATCHFed effectively leverages unlabeled client data, achieving superior performance even in extremely limited-label settings.

[872] Coordinate Descent for Network Linearization

Vlad Rakhlin, Amir Jevnisek, Shai Avidan

Main category: cs.LG

TL;DR: Proposes a coordinate descent method to reduce ReLU activations in private inference, achieving state-of-the-art performance by directly optimizing in the discrete domain rather than using smooth approximations.

DetailsMotivation: ReLU activations are the main bottleneck in private inference using ResNet networks due to significant latency. Current methods using smooth approximations suffer from large performance loss during hard thresholding.

Method: Leverages Coordinate Descent as optimization framework that works directly in the discrete domain, yielding sparse solutions by design without requiring thresholding steps.

Result: Extensive experiments demonstrate that the method achieves State of the Art performance on common benchmarks for reducing ReLU count while maintaining accuracy.

Conclusion: The coordinate descent approach provides an effective alternative to smooth approximation methods for discrete optimization of ReLU reduction in private inference, achieving superior performance through direct discrete optimization.

Abstract: ReLU activations are the main bottleneck in Private Inference that is based on ResNet networks. This is because they incur significant inference latency. Reducing ReLU count is a discrete optimization problem, and there are two common ways to approach it. Most current state-of-the-art methods are based on a smooth approximation that jointly optimizes network accuracy and ReLU budget at once. However, the last hard thresholding step of the optimization usually introduces a large performance loss. We take an alternative approach that works directly in the discrete domain by leveraging Coordinate Descent as our optimization framework. In contrast to previous methods, this yields a sparse solution by design. We demonstrate, through extensive experiments, that our method is State of the Art on common benchmarks.

[873] Simplicial covering dimension of extremal concept classes

Ari Blondal, Hamed Hatami, Pooya Hatami, Chavdar Lalov, Sivan Tretiak

Main category: cs.LG

TL;DR: The paper adapts topological dimension theory to binary concept classes, defining a simplicial covering dimension that exactly characterizes list replicability in PAC learning for finite concept classes.

DetailsMotivation: To bridge classical topological dimension theory with computational learning theory, specifically to understand list replicability (global stability) in PAC learning through topological methods.

Method: Define a simplicial covering dimension for binary concept classes by associating them with topological spaces of realizable distributions and inducing simplicial structures through loss functions.

Result: Proved that for finite concept classes, the simplicial covering dimension exactly equals the list replicability number, enabling computation of replicability numbers for extremal concept classes using classical dimension theory tools.

Conclusion: Topological dimension theory provides a powerful framework for characterizing and computing list replicability in PAC learning, establishing a deep connection between topology and learning theory.

Abstract: Dimension theory is a branch of topology concerned with defining and analyzing dimensions of geometric and topological spaces in purely topological terms. In this work, we adapt the classical notion of topological dimension (Lebesgue covering) to binary concept classes. The topological space naturally associated with a concept class is its space of realizable distributions. The loss function and the class itself induce a simplicial structure on this space, with respect to which we define a simplicial covering dimension. We prove that for finite concept classes, this simplicial covering dimension exactly characterizes the list replicability number (equivalently, global stability) in PAC learning. This connection allows us to apply tools from classical dimension theory to compute the exact list replicability number of the broad family of extremal concept classes.

[874] Conformal Constrained Policy Optimization for Cost-Effective LLM Agents

Wenwen Si, Sooyong Jang, Insup Lee, Osbert Bastani

Main category: cs.LG

TL;DR: CCPO combines multiple LLMs with different cost/accuracy tradeoffs using conformal prediction to minimize costs while maintaining reliability guarantees.

DetailsMotivation: Address the increasingly steep computational and API costs of large language models while maintaining reliability.

Method: Conformal Constrained Policy Optimization (CCPO) - integrates constrained policy optimization with off-policy reinforcement learning and online conformal prediction to jointly optimize cost-aware policy and adaptive threshold.

Result: Achieves up to 30% cost reduction compared to other cost-aware baselines and LLM-guided methods without compromising reliability on multi-hop question answering benchmarks.

Conclusion: Provides a principled and practical framework for deploying cost-effective LLM agents while maintaining reliability through conformal prediction guarantees.

Abstract: While large language models (LLMs) have recently made tremendous progress towards solving challenging AI problems, they have done so at increasingly steep computational and API costs. We propose a novel strategy where we combine multiple LLM models with varying cost/accuracy tradeoffs in an agentic manner, where models and tools are run in sequence as determined by an orchestration model to minimize cost subject to a user-specified level of reliability; this constraint is formalized using conformal prediction to provide guarantees. To solve this problem, we propose Conformal Constrained Policy Optimization (CCPO), a training paradigm that integrates constrained policy optimization with off-policy reinforcement learning and recent advances in online conformal prediction. CCPO jointly optimizes a cost-aware policy (score function) and an adaptive threshold. Across two multi-hop question answering benchmarks, CCPO achieves up to a 30% cost reduction compared to other cost-aware baselines and LLM-guided methods without compromising reliability. Our approach provides a principled and practical framework for deploying LLM agents that are significantly more cost-effective while maintaining reliability.

[875] Volatility in Certainty (VC): A Metric for Detecting Adversarial Perturbations During Inference in Neural Network Classifiers

Vahid Hemmati, Ahmad Mohammadi, Abdul-Rauf Nuhu, Reza Ahmari, Parham Kebria, Abdollah Homaifar

Main category: cs.LG

TL;DR: VC (Volatility in Certainty) is a label-free metric that measures dispersion in sorted softmax outputs, showing strong negative correlation with classification accuracy and effectiveness in detecting adversarial drift without ground-truth labels.

DetailsMotivation: Adversarial robustness is critical for real-time systems where ground-truth labels are unavailable during inference, requiring label-free metrics to detect performance degradation.

Method: Evaluated VC metric on ANNs and CNNs trained on MNIST, and VGG-like model on CIFAR-10. Generated adversarial examples using FGSM with varying perturbation magnitudes and created mixed test sets with incremental adversarial contamination.

Result: Strong negative correlation between classification accuracy and log(VC) (rho < -0.90 in most cases), showing VC effectively reflects performance degradation without labeled data.

Conclusion: VC is a scalable, architecture-agnostic, real-time performance metric suitable for early-warning systems in safety-critical applications.

Abstract: Adversarial robustness remains a critical challenge in deploying neural network classifiers, particularly in real-time systems where ground-truth labels are unavailable during inference. This paper investigates \textit{Volatility in Certainty} (VC), a recently proposed, label-free metric that quantifies irregularities in model confidence by measuring the dispersion of sorted softmax outputs. Specifically, VC is defined as the average squared log-ratio of adjacent certainty values, capturing local fluctuations in model output smoothness. We evaluate VC as a proxy for classification accuracy and as an indicator of adversarial drift. Experiments are conducted on artificial neural networks (ANNs) and convolutional neural networks (CNNs) trained on MNIST, as well as a regularized VGG-like model trained on CIFAR-10. Adversarial examples are generated using the Fast Gradient Sign Method (FGSM) across varying perturbation magnitudes. In addition, mixed test sets are created by gradually introducing adversarial contamination to assess VC’s sensitivity under incremental distribution shifts. Our results reveal a strong negative correlation between classification accuracy and log(VC) (correlation rho < -0.90 in most cases), suggesting that VC effectively reflects performance degradation without requiring labeled data. These findings position VC as a scalable, architecture-agnostic, and real-time performance metric suitable for early-warning systems in safety-critical applications.

[876] On the Trade-Off Between Transparency and Security in Adversarial Machine Learning

Lucas Fenaux, Christopher Srinivasa, Florian Kerschbaum

Main category: cs.LG

TL;DR: Transparency in AI systems can conflict with security in adversarial settings, as attackers are more successful when they know whether a defender’s model is defended or not.

DetailsMotivation: To investigate the strategic effect of transparency in adversarial machine learning settings, specifically examining the trade-off between transparency and security in transferable adversarial example attacks.

Method: Conducted large-scale empirical evaluation of 9 attacks across 181 models, then used game theory to model the problem as both Nash and Stackelberg games to analyze expected outcomes.

Result: Attackers are more successful when they match the defender’s decision about using defended or undefended models, suggesting obscurity could benefit defenders. Knowing whether a model is defended can sometimes be enough to damage its security.

Conclusion: Transparency in AI systems can be at odds with security, and game-theoretic reasoning helps uncover conflicts between these two Responsible AI principles in adversarial settings.

Abstract: Transparency and security are both central to Responsible AI, but they may conflict in adversarial settings. We investigate the strategic effect of transparency for agents through the lens of transferable adversarial example attacks. In transferable adversarial example attacks, attackers maliciously perturb their inputs using surrogate models to fool a defender’s target model. These models can be defended or undefended, with both players having to decide which to use. Using a large-scale empirical evaluation of nine attacks across 181 models, we find that attackers are more successful when they match the defender’s decision; hence, obscurity could be beneficial to the defender. With game theory, we analyze this trade-off between transparency and security by modeling this problem as both a Nash game and a Stackelberg game, and comparing the expected outcomes. Our analysis confirms that only knowing whether a defender’s model is defended or not can sometimes be enough to damage its security. This result serves as an indicator of the general trade-off between transparency and security, suggesting that transparency in AI systems can be at odds with security. Beyond adversarial machine learning, our work illustrates how game-theoretic reasoning can uncover conflicts between transparency and security.

[877] Leveraging Exogenous Signals for Hydrology Time Series Forecasting

Junyang He, Judy Fox, Alireza Jafari, Ying-Jung Chen, Geoffrey Fox

Main category: cs.LG

TL;DR: Time series foundation models underperform domain-specific models in hydrological rainfall-runoff modeling when comprehensive exogenous inputs are incorporated.

DetailsMotivation: To evaluate the effectiveness of time series foundation models in physical science applications, specifically hydrological modeling, and compare them with domain-knowledge-integrated approaches.

Method: Used CAMELS-US dataset with 671 locations, comparing baseline models, foundation models, and domain-knowledge-integrated approaches that incorporate rainfall, runoff data, six time series streams, and 30 static features.

Result: Models with comprehensive known exogenous inputs outperformed limited approaches including foundation models, with natural annual periodic time series providing the most significant improvements.

Conclusion: Domain-specific knowledge integration, particularly natural annual periodic patterns, is crucial for accurate hydrological modeling and outperforms general foundation models in this physical science application.

Abstract: Recent advances in time series research facilitate the development of foundation models. While many state-of-the-art time series foundation models have been introduced, few studies examine their effectiveness in specific downstream applications in physical science. This work investigates the role of integrating domain knowledge into time series models for hydrological rainfall-runoff modeling. Using the CAMELS-US dataset, which includes rainfall and runoff data from 671 locations with six time series streams and 30 static features, we compare baseline and foundation models. Results demonstrate that models incorporating comprehensive known exogenous inputs outperform more limited approaches, including foundation models. Notably, incorporating natural annual periodic time series contribute the most significant improvements.

[878] Transformers vs. Recurrent Models for Estimating Forest Gross Primary Production

David Montero, Miguel D. Mahecha, Francesco Martinuzzi, César Aybar, Anne Klosterhalfen, Alexander Knohl, Jesús Anaya, Clemens Mosig, Sebastian Wieneke

Main category: cs.LG

TL;DR: This paper compares transformer (GPT-2) and LSTM models for predicting forest CO2 uptake using multimodal remote sensing data, finding LSTM performs better overall but GPT-2 excels during extreme events, with different context length requirements.

DetailsMotivation: To overcome limitations of traditional methods for monitoring forest CO2 uptake (GPP) - EC towers have limited spatial coverage, while single-sensor remote sensing approaches often fail to capture complex temporal dynamics.

Method: Used two deep learning models (GPT-2 transformer and LSTM) with multivariate inputs from multiple remote sensing sources (Sentinel-2, MODIS, Sentinel-1) to predict GPP, analyzing performance, temporal context length, and feature importance.

Result: Both models achieved similar accuracy overall, with LSTM performing slightly better but GPT-2 outperforming during extreme events. LSTM required significantly shorter input windows than GPT-2 for similar performance. Radiation was the most important predictor, followed by Sentinel-2, MODIS temperature, and Sentinel-1 data.

Conclusion: Model architecture, context length, and multimodal inputs jointly determine GPP prediction performance, providing guidance for developing future deep learning frameworks for terrestrial carbon monitoring.

Abstract: Monitoring the spatiotemporal dynamics of forest CO$_2$ uptake (Gross Primary Production, GPP), remains a central challenge in terrestrial ecosystem research. While Eddy Covariance (EC) towers provide high-frequency estimates, their limited spatial coverage constrains large-scale assessments. Remote sensing offers a scalable alternative, yet most approaches rely on single-sensor spectral indices and statistical models that are often unable to capture the complex temporal dynamics of GPP. Recent advances in deep learning (DL) and data fusion offer new opportunities to better represent the temporal dynamics of vegetation processes, but comparative evaluations of state-of-the-art DL models for multimodal GPP prediction remain scarce. Here, we explore the performance of two representative models for predicting GPP: 1) GPT-2, a transformer architecture, and 2) Long Short-Term Memory (LSTM), a recurrent neural network, using multivariate inputs. Overall, both achieve similar accuracy. But, while LSTM performs better overall, GPT-2 excels during extreme events. Analysis of temporal context length further reveals that LSTM attains similar accuracy using substantially shorter input windows than GPT-2, highlighting an accuracy-efficiency trade-off between the two architectures. Feature importance analysis reveals radiation as the dominant predictor, followed by Sentinel-2, MODIS land surface temperature, and Sentinel-1 contributions. Our results demonstrate how model architecture, context length, and multimodal inputs jointly determine performance in GPP prediction, guiding future developments of DL frameworks for monitoring terrestrial carbon dynamics.

[879] Better LLM Reasoning via Dual-Play

Zhengxin Zhang, Chengyu Huang, Aochong Oliver Li, Claire Cardie

Main category: cs.LG

TL;DR: PasoDoble is a novel dual-play framework that adversarially trains two LLMs - a Proposer generating challenging questions and a Solver answering them - without external supervision, improving reasoning performance while addressing reward hacking and training instability.

DetailsMotivation: Current LLM training relies heavily on external supervision (e.g., curated labels). Adversarial learning through self-play offers an alternative to reduce this dependency, but adapting dual-play to LLMs is limited due to reward hacking and training instability issues.

Method: PasoDoble trains two models from the same base: a Proposer generates questions with ground-truth answers using pre-training dataset knowledge, and a Solver attempts to solve them. The Proposer is rewarded for valid questions that challenge the Solver, while the Solver is rewarded for correct answers. An optional offline paradigm decouples their updates for stability.

Result: Experimental results demonstrate that PasoDoble can improve the reasoning performance of LLMs without requiring supervision during training.

Conclusion: PasoDoble successfully implements dual-play adversarial training for LLMs, enabling unsupervised improvement of reasoning capabilities while effectively addressing reward hacking and training stability challenges.

Abstract: Large Language Models (LLMs) have achieved remarkable progress through Reinforcement Learning with Verifiable Rewards (RLVR), yet still rely heavily on external supervision (e.g., curated labels). Adversarial learning, particularly through self-play, offers a promising alternative that enables models to iteratively learn from themselves - thus reducing reliance on external supervision. Dual-play extends adversarial learning by assigning specialized roles to two models and training them against each other, fostering sustained competition and mutual evolution. Despite its promise, adapting dual-play training to LLMs remains limited, largely due to their susceptibility to reward hacking and training instability. In this paper, we introduce PasoDoble, a novel LLM dual-play framework. PasoDoble adversarially trains two models initialized from the same base model: a Proposer, which generates challenging questions with ground-truth answers, and a Solver, which attempts to solve them. We enrich the Proposer with knowledge from a pre-training dataset to ensure the questions’ quality and diversity. To avoid reward hacking, the Proposer is rewarded for producing only valid questions that push the Solver’s limit, while the Solver is rewarded for solving them correctly, and both are updated jointly. To further enhance training stability, we introduce an optional offline paradigm that decouples Proposer and Solver updates, alternately updating each for several steps while holding the other fixed. Notably, PasoDoble operates without supervision during training. Experimental results show that PasoDoble can improve the reasoning performance of LLMs. Our project page is available at https://hcy123902.github.io/PasoDoble.

[880] FLEX: Feature Importance from Layered Counterfactual Explanations

Nawid Keshtmand, Roussel Desmond Nzoyem, Jeffrey Nicholas Clark

Main category: cs.LG

TL;DR: FLEX converts counterfactual explanations into feature importance scores at local, regional, and global levels, bridging local recourse with global attribution for interpretable ML.

DetailsMotivation: Current counterfactual explanations are instance-specific and don't quantify which features systematically drive outcome changes across datasets or feature space regions.

Method: FLEX framework aggregates counterfactual feature changes into frequency scores across instances and neighborhoods, compatible with various counterfactual generation methods.

Result: FLEX’s global rankings correlate with SHAP while identifying additional drivers, and regional analyses reveal context-specific factors missed by global summaries.

Conclusion: FLEX bridges local recourse and global attribution, supporting transparent, intervention-oriented decision-making in risk-sensitive applications.

Abstract: Machine learning models achieve state-of-the-art performance across domains, yet their lack of interpretability limits safe deployment in high-stakes settings. Counterfactual explanations are widely used to provide actionable “what-if” recourse, but they typically remain instance-specific and do not quantify which features systematically drive outcome changes within coherent regions of the feature space or across an entire dataset. We introduce FLEX (Feature importance from Layered counterfactual EXplanations), a model- and domain-agnostic framework that converts sets of counterfactuals into feature change frequency scores at local, regional, and global levels. FLEX generalises local change-frequency measures by aggregating across instances and neighbourhoods, offering interpretable rankings that reflect how often each feature must change to flip predictions. The framework is compatible with different counterfactual generation methods, allowing users to emphasise characteristics such as sparsity, feasibility, or actionability, thereby tailoring the derived feature importances to practical constraints. We evaluate FLEX on two contrasting tabular tasks: traffic accident severity prediction and loan approval, and compare FLEX to SHAP- and LIME-derived feature importance values. Results show that (i) FLEX’s global rankings correlate with SHAP while surfacing additional drivers, and (ii) regional analyses reveal context-specific factors that global summaries miss. FLEX thus bridges the gap between local recourse and global attribution, supporting transparent and intervention-oriented decision-making in risk-sensitive applications.

[881] Chain-of-Generation: Progressive Latent Diffusion for Text-Guided Molecular Design

Lingxiao Li, Haobo Zhang, Bin Chen, Jiayu Zhou

Main category: cs.LG

TL;DR: Chain-of-Generation (CoG) is a multi-stage latent diffusion framework that decomposes prompts into curriculum-ordered segments and progressively incorporates them during generation, addressing limitations of one-shot conditioning in text-to-molecule generation.

DetailsMotivation: Existing diffusion-based methods use one-shot conditioning where the entire prompt is encoded once, making it difficult to satisfy all requirements in complex prompts and leading to poor interpretability, incomplete substructure generation, and overambition.

Method: CoG decomposes prompts into curriculum-ordered semantic segments and progressively incorporates them as intermediate goals during denoising. It includes a post-alignment learning phase to strengthen correspondence between textual and molecular latent spaces.

Result: Extensive experiments show CoG yields higher semantic alignment, diversity, and controllability than one-shot baselines, producing molecules that more faithfully reflect complex compositional prompts with transparent generation insights.

Conclusion: CoG effectively addresses the challenges of one-shot conditioning by using progressive semantic guidance, enabling better generation of molecules that satisfy complex linguistic constraints while providing interpretable generation processes.

Abstract: Text-conditioned molecular generation aims to translate natural-language descriptions into chemical structures, enabling scientists to specify functional groups, scaffolds, and physicochemical constraints without handcrafted rules. Diffusion-based models, particularly latent diffusion models (LDMs), have recently shown promise by performing stochastic search in a continuous latent space that compactly captures molecular semantics. Yet existing methods rely on one-shot conditioning, where the entire prompt is encoded once and applied throughout diffusion, making it hard to satisfy all the requirements in the prompt. We discuss three outstanding challenges of one-shot conditioning generation, including the poor interpretability of the generated components, the failure to generate all substructures, and the overambition in considering all requirements simultaneously. We then propose three principles to address those challenges, motivated by which we propose Chain-of-Generation (CoG), a training-free multi-stage latent diffusion framework. CoG decomposes each prompt into curriculum-ordered semantic segments and progressively incorporates them as intermediate goals, guiding the denoising trajectory toward molecules that satisfy increasingly rich linguistic constraints. To reinforce semantic guidance, we further introduce a post-alignment learning phase that strengthens the correspondence between textual and molecular latent spaces. Extensive experiments on benchmark and real-world tasks demonstrate that CoG yields higher semantic alignment, diversity, and controllability than one-shot baselines, producing molecules that more faithfully reflect complex, compositional prompts while offering transparent insight into the generation process.

[882] Robust Bidirectional Associative Memory via Regularization Inspired by the Subspace Rotation Algorithm

Ci Lin, Tet Yeap, Iluju Kiringa, Biwei Zhang

Main category: cs.LG

TL;DR: Proposed Bidirectional Subspace Rotation Algorithm (B-SRA) for robust BAM training, identifying orthogonal weight matrices and gradient-pattern alignment as key robustness principles, with SAME configuration achieving strongest resilience.

DetailsMotivation: BAM trained with B-BP suffers from poor robustness and high sensitivity to noise/adversarial attacks, requiring more resilient training methods.

Method: Introduced gradient-free B-SRA algorithm and new regularization strategies (OWM and GPA) into B-BP, with ablation studies across training strategies and various memory capacities.

Result: SAME configuration integrating both OWM and GPA achieved strongest resilience against corruption and adversarial perturbations across different attack scenarios and memory capacities.

Conclusion: B-SRA and proposed regularization strategies lead to substantially more robust associative memories, opening new directions for resilient neural architectures.

Abstract: Bidirectional Associative Memory (BAM) trained with Bidirectional Backpropagation (B-BP) often suffers from poor robustness and high sensitivity to noise and adversarial attacks. To address these issues, we propose a novel gradient-free training algorithm, the Bidirectional Subspace Rotation Algorithm (B-SRA), which significantly improves the robustness and convergence behavior of BAM. Through comprehensive experiments, we identify two key principles – orthogonal weight matrices (OWM) and gradient-pattern alignment (GPA) – as central to enhancing the robustness of BAM. Motivated by these findings, we introduce new regularization strategies into B-BP, resulting in models with greatly improved resistance to corruption and adversarial perturbations. We further conduct an ablation study across different training strategies to determine the most robust configuration and evaluate BAM’s performance under a variety of attack scenarios and memory capacities, including 50, 100, and 200 associative pairs. Among all methods, the SAME configuration, which integrates both OWM and GPA, achieves the strongest resilience. Overall, our results demonstrate that B-SRA and the proposed regularization strategies lead to substantially more robust associative memories and open new directions for building resilient neural architectures.

[883] A Systematic Study of Model Extraction Attacks on Graph Foundation Models

Haoyan Xu, Ruizhi Qian, Jiate Li, Yushun Dong, Minghao Lin, Hanson Yan, Zhengtao Yao, Qinghua Liu, Junhao Dong, Ruopeng Huang, Yue Zhao, Mengyuan Li

Main category: cs.LG

TL;DR: This paper presents the first systematic study of model extraction attacks against Graph Foundation Models (GFMs), showing attackers can steal GFMs with minimal cost while preserving zero-shot inference capabilities.

DetailsMotivation: Graph Foundation Models are valuable intellectual assets due to extensive pretraining costs and cross-domain knowledge, making them attractive targets for model extraction attacks, but prior security research only covered small graph neural networks.

Method: Formalized black-box threat model with six attack scenarios, introduced lightweight extraction method using supervised regression of graph embeddings to train attacker encoder without contrastive pretraining data.

Result: Experiments on seven datasets show attackers can approximate victim GFMs using only a tiny fraction of original training cost with almost no accuracy loss, preserving zero-shot inference on unseen graphs.

Conclusion: GFMs greatly expand the model extraction attack surface, highlighting the need for deployment-aware security defenses in large-scale graph learning systems.

Abstract: Graph machine learning has advanced rapidly in tasks such as link prediction, anomaly detection, and node classification. As models scale up, pretrained graph models have become valuable intellectual assets because they encode extensive computation and domain expertise. Building on these advances, Graph Foundation Models (GFMs) mark a major step forward by jointly pretraining graph and text encoders on massive and diverse data. This unifies structural and semantic understanding, enables zero-shot inference, and supports applications such as fraud detection and biomedical analysis. However, the high pretraining cost and broad cross-domain knowledge in GFMs also make them attractive targets for model extraction attacks (MEAs). Prior work has focused only on small graph neural networks trained on a single graph, leaving the security implications for large-scale and multimodal GFMs largely unexplored. This paper presents the first systematic study of MEAs against GFMs. We formalize a black-box threat model and define six practical attack scenarios covering domain-level and graph-specific extraction goals, architectural mismatch, limited query budgets, partial node access, and training data discrepancies. To instantiate these attacks, we introduce a lightweight extraction method that trains an attacker encoder using supervised regression of graph embeddings. Even without contrastive pretraining data, this method learns an encoder that stays aligned with the victim text encoder and preserves its zero-shot inference ability on unseen graphs. Experiments on seven datasets show that the attacker can approximate the victim model using only a tiny fraction of its original training cost, with almost no loss in accuracy. These findings reveal that GFMs greatly expand the MEA surface and highlight the need for deployment-aware security defenses in large-scale graph learning systems.

[884] Batch Matrix-form Equations and Implementation of Multilayer Perceptrons

Wieger Wesselink, Bram Grooten, Huub van de Wetering, Qiao Xiao, Decebal Constantin Mocanu

Main category: cs.LG

TL;DR: Provides complete batch matrix-form specifications for MLPs with forward/backward equations for all standard layers, validated symbolically and implemented in multiple frameworks with optimized sparse computation.

DetailsMotivation: MLPs are fundamental but rarely presented in complete batch matrix-form, relying instead on per-sample gradients or automatic differentiation. Explicit matrix-form is essential for transparent analysis and optimization, especially in sparse neural networks.

Method: Derived forward and backward equations for all standard and advanced layers (including batch normalization and softmax), validated using SymPy symbolic mathematics library, and implemented uniform reference implementations in NumPy, PyTorch, JAX, TensorFlow, and high-performance C++ backend optimized for sparse operations.

Result: Complete derivation of batch matrix-form backpropagation for MLPs, symbolic validation of all gradient equations, uniform implementations across multiple frameworks, and demonstration of efficient sparse computation enabled by explicit formulations.

Conclusion: Establishes a validated, extensible foundation for understanding, teaching, and researching neural network algorithms through mathematically rigorous batch matrix-form specifications.

Abstract: Multilayer perceptrons (MLPs) remain fundamental to modern deep learning, yet their algorithmic details are rarely presented in complete, explicit \emph{batch matrix-form}. Rather, most references express gradients per sample or rely on automatic differentiation. Although automatic differentiation can achieve equally high computational efficiency, the usage of batch matrix-form makes the computational structure explicit, which is essential for transparent, systematic analysis, and optimization in settings such as sparse neural networks. This paper fills that gap by providing a mathematically rigorous and implementation-ready specification of MLPs in batch matrix-form. We derive forward and backward equations for all standard and advanced layers, including batch normalization and softmax, and validate all equations using the symbolic mathematics library SymPy. From these specifications, we construct uniform reference implementations in NumPy, PyTorch, JAX, TensorFlow, and a high-performance C++ backend optimized for sparse operations. Our main contributions are: (1) a complete derivation of batch matrix-form backpropagation for MLPs, (2) symbolic validation of all gradient equations, (3) uniform Python and C++ reference implementations grounded in a small set of matrix primitives, and (4) demonstration of how explicit formulations enable efficient sparse computation. Together, these results establish a validated, extensible foundation for understanding, teaching, and researching neural network algorithms.

[885] Beyond the Laplacian: Interpolated Spectral Augmentation for Graph Neural Networks

Ziyao Cui, Edric Tam

Main category: cs.LG

TL;DR: ILEs are spectral embeddings from interpolated graph matrices that augment limited node features to improve GNN performance.

DetailsMotivation: GNNs rely on informative node features, which are often limited or absent in real datasets. While Laplacian spectral embeddings help, alternative graph matrices may provide better representations.

Method: Introduce Interpolated Laplacian Embeddings (ILEs) derived from a family of graph matrices, with theoretical interpretation using spectral graph theory.

Result: ILEs improve GNN performance across architectures in simulations and real-world datasets when used for feature augmentation.

Conclusion: ILEs offer a practical spectral augmentation approach that expands options for handling limited node features in graph learning.

Abstract: Graph neural networks (GNNs) are fundamental tools in graph machine learning. The performance of GNNs relies crucially on the availability of informative node features, which can be limited or absent in real-life datasets and applications. A natural remedy is to augment the node features with embeddings computed from eigenvectors of the graph Laplacian matrix. While it is natural to default to Laplacian spectral embeddings, which capture meaningful graph connectivity information, we ask whether spectral embeddings from alternative graph matrices can also provide useful representations for learning. We introduce Interpolated Laplacian Embeddings (ILEs), which are derived from a simple yet expressive family of graph matrices. Using tools from spectral graph theory, we offer a straightforward interpretation of the structural information that ILEs capture. We demonstrate through simulations and experiments on real-world datasets that feature augmentation via ILEs can improve performance across commonly used GNN architectures. Our work offers a straightforward and practical approach that broadens the practitioner’s spectral augmentation toolkit when node features are limited.

[886] A Systematic Analysis of Out-of-Distribution Detection Under Representation and Training Paradigm Shifts

C. César Claros Olivares, Austin J. Brockmeier

Main category: cs.LG

TL;DR: Systematic comparison of OOD detection methods across CLIP-stratified regimes shows that learned feature space determines OOD efficacy, with probabilistic scores dominating misclassification detection and geometry-aware methods excelling under stronger distribution shifts.

DetailsMotivation: To provide statistically grounded guidance for OOD detection method selection by systematically comparing various approaches across different representation paradigms and distribution shift scenarios.

Method: Used multiple-comparison-controlled, rank-based pipeline (Friedman test with Conover-Holm post-hoc) and Bron-Kerbosch cliques to evaluate OOD detection methods on CNNs trained from scratch and fine-tuned ViTs across CIFAR-10/100, SuperCIFAR-100, and TinyImageNet datasets.

Result: Probabilistic scores (MSR, GEN) dominate misclassification detection for both CNNs and ViTs. Under stronger shifts, geometry-aware scores (NNGuide, fDBD, CTM) prevail on CNNs, while GradNorm and KPCA Reconstruction Error remain competitive on ViTs. PCA projection improves several detectors.

Conclusion: Results support a representation-centric view of OOD detection and provide statistically grounded guidance for method selection under distribution shift, highlighting the importance of learned feature space in determining OOD efficacy.

Abstract: We present a systematic comparison of out-of-distribution (OOD) detection methods across CLIP-stratified regimes using AURC and AUGRC as primary metrics. Experiments cover two representation paradigms: CNNs trained from scratch and a fine-tuned Vision Transformer (ViT), evaluated on CIFAR-10/100, SuperCIFAR-100, and TinyImageNet. Using a multiple-comparison-controlled, rank-based pipeline (Friedman test with Conover-Holm post-hoc) and Bron-Kerbosch cliques, we find that the learned feature space largely determines OOD efficacy. For both CNNs and ViTs, probabilistic scores (e.g., MSR, GEN) dominate misclassification (ID) detection. Under stronger shifts, geometry-aware scores (e.g., NNGuide, fDBD, CTM) prevail on CNNs, whereas on ViTs GradNorm and KPCA Reconstruction Error remain consistently competitive. We further show a class-count-dependent trade-off for Monte-Carlo Dropout (MCD) and that a simple PCA projection improves several detectors. These results support a representation-centric view of OOD detection and provide statistically grounded guidance for method selection under distribution shift.

[887] SurvBench: A Standardised Preprocessing Pipeline for Multi-Modal Electronic Health Record Survival Analysis

Munib Mesinovic, Tingting Zhu

Main category: cs.LG

TL;DR: SurvBench is an open-source preprocessing pipeline that standardizes EHR data from critical care databases for reproducible survival analysis using deep learning models.

DetailsMotivation: To address the reproducibility crisis in survival analysis caused by inconsistent preprocessing methodologies across different studies, enabling fair comparison of deep learning models.

Method: Developed a comprehensive preprocessing pipeline that transforms raw PhysioNet datasets into standardized tensors, implementing data quality controls, patient-level splitting, missingness tracking, and temporal aggregation for multi-modal data.

Result: Created SurvBench with data loaders for three major critical care databases (MIMIC-IV, eICU, MC-MED) supporting diverse data modalities and both single-risk and competing-risks scenarios, compatible with pycox library and standard models.

Conclusion: SurvBench bridges the “preprocessing gap” in survival analysis research, allowing researchers to focus on methodological innovation rather than data engineering while ensuring reproducibility and fair model comparisons.

Abstract: Electronic health record (EHR) data present tremendous opportunities for advancing survival analysis through deep learning, yet reproducibility remains severely constrained by inconsistent preprocessing methodologies. We present SurvBench, a comprehensive, open-source preprocessing pipeline that transforms raw PhysioNet datasets into standardised, model-ready tensors for multi-modal survival analysis. SurvBench provides data loaders for three major critical care databases, MIMIC-IV, eICU, and MC-MED, supporting diverse modalities including time-series vitals, static demographics, ICD diagnosis codes, and radiology reports. The pipeline implements rigorous data quality controls, patient-level splitting to prevent data leakage, explicit missingness tracking, and standardised temporal aggregation. SurvBench handles both single-risk (e.g., in-hospital mortality) and competing-risks scenarios (e.g., multiple discharge outcomes). The outputs are compatible with pycox library packages and implementations of standard statistical and deep learning models. By providing reproducible, configuration-driven preprocessing with comprehensive documentation, SurvBench addresses the “preprocessing gap” that has hindered fair comparison of deep learning survival models, enabling researchers to focus on methodological innovation rather than data engineering.

[888] Learning the relative composition of EEG signals using pairwise relative shift pretraining

Christopher Sandino, Sayeri Lala, Geeling Chau, Melika Ayoughi, Behrooz Mahasseni, Ellen Zippi, Ali Moin, Erdrin Azemi, Hanlin Goh

Main category: cs.LG

TL;DR: PARS pretraining uses relative temporal shift prediction between EEG window pairs to learn long-range dependencies, outperforming masked reconstruction methods in self-supervised EEG representation learning.

DetailsMotivation: Current EEG SSL methods focus on masked reconstruction which captures local temporal patterns, but position prediction pretraining for learning long-range dependencies in neural signals remains underexplored.

Method: Introduces PARS pretraining - a pretext task that predicts relative temporal shifts between randomly sampled EEG window pairs, encouraging encoders to capture relative temporal composition and long-range dependencies.

Result: PARS-pretrained transformers consistently outperform existing pretraining strategies in label-efficient and transfer learning settings across various EEG decoding tasks.

Conclusion: PARS establishes a new paradigm for self-supervised EEG representation learning by effectively capturing long-range dependencies through relative temporal shift prediction.

Abstract: Self-supervised learning (SSL) offers a promising approach for learning electroencephalography (EEG) representations from unlabeled data, reducing the need for expensive annotations for clinical applications like sleep staging and seizure detection. While current EEG SSL methods predominantly use masked reconstruction strategies like masked autoencoders (MAE) that capture local temporal patterns, position prediction pretraining remains underexplored despite its potential to learn long-range dependencies in neural signals. We introduce PAirwise Relative Shift or PARS pretraining, a novel pretext task that predicts relative temporal shifts between randomly sampled EEG window pairs. Unlike reconstruction-based methods that focus on local pattern recovery, PARS encourages encoders to capture relative temporal composition and long-range dependencies inherent in neural signals. Through comprehensive evaluation on various EEG decoding tasks, we demonstrate that PARS-pretrained transformers consistently outperform existing pretraining strategies in label-efficient and transfer learning settings, establishing a new paradigm for self-supervised EEG representation learning.

[889] Computational Measurement of Political Positions: A Review of Text-Based Ideal Point Estimation Algorithms

Patrick Parschan, Charlott Jakob

Main category: cs.LG

TL;DR: This paper provides the first systematic review of unsupervised and semi-supervised computational text-based ideal point estimation (CT-IPE) algorithms, categorizing them into four methodological families and offering practical guidance for researchers.

DetailsMotivation: The field of CT-IPE has become fragmented over two decades of development, lacking systematic comparison and clear guidance for applied use despite widespread adoption across political science, communication, and computational social science.

Method: Conducted a systematic literature review identifying 25 CT-IPE algorithms, performed manual content analysis of their modeling assumptions, and introduced a conceptual framework distinguishing how algorithms generate, capture, and aggregate textual variance.

Result: Identified four methodological families: word-frequency, topic modeling, word embedding, and LLM-based approaches, with critical assessment of their assumptions, interpretability, scalability, and limitations.

Conclusion: The review provides structured synthesis of algorithm development, practical guidance for researchers highlighting trade-offs, and emphasizes that differences in estimation outcomes across algorithms are informative, underscoring the need for systematic benchmarking.

Abstract: This article presents the first systematic review of unsupervised and semi-supervised computational text-based ideal point estimation (CT-IPE) algorithms, methods designed to infer latent political positions from textual data. These algorithms are widely used in political science, communication, computational social science, and computer science to estimate ideological preferences from parliamentary speeches, party manifestos, and social media. Over the past two decades, their development has closely followed broader NLP trends – beginning with word-frequency models and most recently turning to large language models (LLMs). While this trajectory has greatly expanded the methodological toolkit, it has also produced a fragmented field that lacks systematic comparison and clear guidance for applied use. To address this gap, we identified 25 CT-IPE algorithms through a systematic literature review and conducted a manual content analysis of their modeling assumptions and development contexts. To compare them meaningfully, we introduce a conceptual framework that distinguishes how algorithms generate, capture, and aggregate textual variance. On this basis, we identify four methodological families – word-frequency, topic modeling, word embedding, and LLM-based approaches – and critically assess their assumptions, interpretability, scalability, and limitations. Our review offers three contributions. First, it provides a structured synthesis of two decades of algorithm development, clarifying how diverse methods relate to one another. Second, it translates these insights into practical guidance for applied researchers, highlighting trade-offs in transparency, technical requirements, and validation strategies that shape algorithm choice. Third, it emphasizes that differences in estimation outcomes across algorithms are themselves informative, underscoring the need for systematic benchmarking.

[890] Computation-aware Energy-harvesting Federated Learning: Cyclic Scheduling with Selective Participation

Eunjeong Jeong, Nikolaos Pappas

Main category: cs.LG

TL;DR: FedBacys is a battery-aware federated learning framework for energy-harvesting systems that uses cyclic client participation based on battery levels to reduce energy consumption and improve learning stability.

DetailsMotivation: Federated Learning's increasing complexity causes significant energy consumption, especially critical in energy-harvesting FL systems where device participation fluctuates due to limited energy availability.

Method: Proposes FedBacys framework that clusters clients and schedules them sequentially based on battery levels, and FedBacys-Odd variant for selective participation to further reduce energy costs.

Result: The framework minimizes redundant computations, reduces system-wide energy usage, improves learning stability, and maintains performance while being more energy-efficient than existing algorithms.

Conclusion: FedBacys provides superior energy efficiency and robustness for energy-harvesting federated learning systems through battery-aware cyclic client participation.

Abstract: Federated Learning (FL) is a powerful paradigm for distributed learning, but its increasing complexity leads to significant energy consumption from client-side computations for training models. In particular, the challenge is critical in energy-harvesting FL (EHFL) systems where participation availability of each device oscillates due to limited energy. To address this, we propose FedBacys, a battery-aware EHFL framework using cyclic client participation based on users’ battery levels. By clustering clients and scheduling them sequentially, FedBacys minimizes redundant computations, reduces system-wide energy usage, and improves learning stability. We also introduce FedBacys-Odd, a more energy-efficient variant that allows clients to participate selectively, further reducing energy costs without compromising performance. We provide a convergence analysis for our framework and demonstrate its superior energy efficiency and robustness compared to existing algorithms through numerical experiments.

[891] Quantile Q-Learning: Revisiting Offline Extreme Q-Learning with Quantile Regression

Xinming Gao, Shangzhe Li, Yujin Cai, Wenwu Yu

Main category: cs.LG

TL;DR: XQL and MXQL offline RL methods suffer from hyperparameter sensitivity and training instability. The proposed method addresses these by estimating temperature coefficient via quantile regression and adding value regularization for stable training.

DetailsMotivation: Offline RL is valuable for high-risk domains but existing methods like XQL and MXQL require extensive hyperparameter tuning per dataset and exhibit training instability.

Method: Proposed principled estimation of temperature coefficient β using quantile regression, plus value regularization technique inspired by constrained value learning for improved stability.

Result: Achieves competitive or superior performance on D4RL and NeoRL2 benchmarks with stable training dynamics and consistent hyperparameters across all datasets.

Conclusion: The proposed method successfully addresses hyperparameter sensitivity and training instability in offline RL while maintaining strong performance across diverse benchmarks.

Abstract: Offline reinforcement learning (RL) enables policy learning from fixed datasets without further environment interaction, making it particularly valuable in high-risk or costly domains. Extreme $Q$-Learning (XQL) is a recent offline RL method that models Bellman errors using the Extreme Value Theorem, yielding strong empirical performance. However, XQL and its stabilized variant MXQL suffer from notable limitations: both require extensive hyperparameter tuning specific to each dataset and domain, and also exhibit instability during training. To address these issues, we proposed a principled method to estimate the temperature coefficient $β$ via quantile regression under mild assumptions. To further improve training stability, we introduce a value regularization technique with mild generalization, inspired by recent advances in constrained value learning. Experimental results demonstrate that the proposed algorithm achieves competitive or superior performance across a range of benchmark tasks, including D4RL and NeoRL2, while maintaining stable training dynamics and using a consistent set of hyperparameters across all datasets and domains.

[892] ReCast: Reliability-aware Codebook Assisted Lightweight Time Series Forecasting

Xiang Ma, Taihua Chen, Pengcheng Wang, Xuemei Li, Caiming Zhang

Main category: cs.LG

TL;DR: ReCast is a lightweight time series forecasting framework that uses a learnable codebook to encode local patterns and a dual-path architecture for handling both regular structures and irregular fluctuations, with reliability-aware updates for robustness.

DetailsMotivation: Conventional time series forecasting methods with global decomposition are ineffective for real-world series with local, complex patterns and have high model complexity that limits real-time applicability.

Method: Uses patch-wise quantization with a learnable codebook to encode local patterns, dual-path architecture (quantization path for regular structures, residual path for irregular fluctuations), and reliability-aware codebook updates with DRO scheme.

Result: Extensive experiments show ReCast outperforms SOTA models in accuracy, efficiency, and adaptability to distribution shifts.

Conclusion: ReCast provides an effective solution for lightweight and robust time series forecasting by exploiting recurring local shapes with reliability-aware updates.

Abstract: Time series forecasting is crucial for applications in various domains. Conventional methods often rely on global decomposition into trend, seasonal, and residual components, which become ineffective for real-world series dominated by local, complex, and highly dynamic patterns. Moreover, the high model complexity of such approaches limits their applicability in real-time or resource-constrained environments. In this work, we propose a novel \textbf{RE}liability-aware \textbf{C}odebook-\textbf{AS}sisted \textbf{T}ime series forecasting framework (\textbf{ReCast}) that enables lightweight and robust prediction by exploiting recurring local shapes. ReCast encodes local patterns into discrete embeddings through patch-wise quantization using a learnable codebook, thereby compactly capturing stable regular structures. To compensate for residual variations not preserved by quantization, ReCast employs a dual-path architecture comprising a quantization path for efficient modeling of regular structures and a residual path for reconstructing irregular fluctuations. A central contribution of ReCast is a reliability-aware codebook update strategy, which incrementally refines the codebook via weighted corrections. These correction weights are derived by fusing multiple reliability factors from complementary perspectives by a distributionally robust optimization (DRO) scheme, ensuring adaptability to non-stationarity and robustness to distribution shifts. Extensive experiments demonstrate that ReCast outperforms state-of-the-art (SOTA) models in accuracy, efficiency, and adaptability to distribution shifts.

[893] Selecting Fine-Tuning Examples by Quizzing VLMs

Tenghao Ji, Eytan Adar

Main category: cs.LG

TL;DR: QZLoRA is a framework that uses QuizRank to automatically select high-quality training images for fine-tuning text-to-image diffusion models via LoRA, resulting in better-aligned and more representative generated images with fewer samples.

DetailsMotivation: Fine-tuning text-to-image models with varying quality image sets (like Wikipedia Commons) often produces poor results. Training with images that properly exemplify target concepts is crucial for generating representative outputs.

Method: Proposes QZLoRA framework that leverages QuizRank - a method that treats images as ’educational interventions’ and ‘quizzes’ a Vision-Language Model (VLM) to automatically rank and select high-quality images for Low-Rank Adaptation (LoRA) fine-tuning.

Result: QZLoRA produces better aligned, photorealistic images with fewer training samples. The fine-tuned models can also generate stylized illustrations that are similarly representative of the target concepts.

Conclusion: Combining automated visual reasoning (via QuizRank) with parameter-efficient fine-tuning (LoRA) shows promise for topic-adaptive generative modeling, enabling better image selection and more representative outputs.

Abstract: A challenge in fine-tuning text-to-image diffusion models for specific topics is to select good examples. Fine-tuning from image sets of varying quality, such as Wikipedia Commons, will often produce poor output. However, training images that \textit{do} exemplify the target concept (e.g., a \textit{female Mountain Bluebird}) help ensure that the generated images are similarly representative (e.g., have the prototypical blue-wings and gray chest). In this work, we propose QZLoRA, a framework to select images for low-rank adaptation (LoRA). The approach leverages QuizRank, a method to automatically rank images by treating them as an educational intervention' and quizzing’ a VLM. We demonstrate that QZLoRA can produce better aligned, photorealistic images with fewer samples. We also show that these fine-tuned models can produce stylized that are similarly representative (i.e., illustrations). Our results highlight the promise of combining automated visual reasoning with parameter-efficient fine-tuning for topic-adaptive generative modeling.

[894] EARL: Entropy-Aware RL Alignment of LLMs for Reliable RTL Code Generation

Jiahe Shi, Zhengqi Gao, Ching-Yun Ko, Duane Boning

Main category: cs.LG

TL;DR: EARL is an entropy-aware reinforcement learning framework that improves Verilog generation by focusing gradient updates on high-uncertainty tokens that influence control flow and module structure, achieving up to 14.7% higher functional pass rates.

DetailsMotivation: Current LLMs for RTL code generation suffer from syntax errors, functional hallucinations, and weak alignment to designer intent. Hardware provides executable and formally checkable signals that can be used to better align model outputs with design intent.

Method: EARL performs policy optimization using verifiable reward signals and introduces entropy-guided selective updates that gate policy gradients to high-entropy tokens (e.g., always, if, assign, posedge), concentrating updates on functionally important code regions.

Result: EARL improves functional pass rates over prior LLM baselines by up to 14.7% on VerilogEval and RTLLM benchmarks, while reducing unnecessary updates and improving training stability.

Conclusion: Focusing reinforcement learning on critical, high-uncertainty tokens enables more reliable and targeted policy improvement for structured RTL code generation, bridging the gap between model capability and real-world RTL design demands.

Abstract: Recent advances in large language models (LLMs) have demonstrated significant potential in hardware design automation, particularly in using natural language to synthesize Register-Transfer Level (RTL) code. Despite this progress, a gap remains between model capability and the demands of real-world RTL design, including syntax errors, functional hallucinations, and weak alignment to designer intent. Reinforcement Learning with Verifiable Rewards (RLVR) offers a promising approach to bridge this gap, as hardware provides executable and formally checkable signals that can be used to further align model outputs with design intent. However, in long, structured RTL code sequences, not all tokens contribute equally to functional correctness, and naïvely spreading gradients across all tokens dilutes learning signals. A key insight from our entropy analysis in RTL generation is that only a small fraction of tokens (e.g., always, if, assign, posedge) exhibit high uncertainty and largely influence control flow and module structure. To address these challenges, we present EARL, an Entropy-Aware Reinforcement Learning framework for Verilog generation. EARL performs policy optimization using verifiable reward signals and introduces entropy-guided selective updates that gate policy gradients to high-entropy tokens. This approach preserves training stability and concentrates gradient updates on functionally important regions of code. Our experiments on VerilogEval and RTLLM show that EARL improves functional pass rates over prior LLM baselines by up to 14.7%, while reducing unnecessary updates and improving training stability. These results indicate that focusing RL on critical, high-uncertainty tokens enables more reliable and targeted policy improvement for structured RTL code generation.

[895] P1: Mastering Physics Olympiads with Reinforcement Learning

Jiacheng Chen, Qianjia Cheng, Fangchen Yu, Haiyuan Wan, Yuchen Zhang, Shenghe Zheng, Junchi Yao, Qingyang Zhang, Haonan He, Yun Luo, Yufeng Zhao, Futing Wang, Li Sheng, Chengxing Xie, Yuxin Zuo, Yizhuo Li, Wenxauan Zeng, Yulun Wu, Rui Huang, Dongzhan Zhou, Kai Chen, Yu Qiao, Lei Bai, Yu Cheng, Ning Ding, Bowen Zhou, Peng Ye, Ganqu Cui

Main category: cs.LG

TL;DR: P1 is a family of open-source physics reasoning models trained through reinforcement learning that achieves Gold-medal performance at IPhO 2025 and dominates multiple physics competitions, demonstrating exceptional physics reasoning capabilities.

DetailsMotivation: To advance physics research by developing large language models with exceptional physics reasoning capabilities that can solve Olympiad-level physics problems, moving beyond puzzle-solving to science-grade reasoning that binds symbols to reality.

Method: Training large language models entirely through reinforcement learning (RL) to develop physics reasoning capabilities, with an additional agentic framework called PhysicsMinions for enhanced performance.

Result: P1-235B-A22B achieves Gold-medal performance at IPhO 2025 and wins 12 gold medals out of 13 international/regional physics competitions. P1-30B-A3B gets a silver medal at IPhO 2025. When equipped with PhysicsMinions, P1-235B-A22B achieves overall No.1 on IPhO 2025 and highest average score across 13 competitions.

Conclusion: The P1 models demonstrate exceptional physics reasoning capabilities and also show great performance on other reasoning tasks like math and coding, indicating strong generalizability of the approach.

Abstract: Recent progress in large language models (LLMs) has moved the frontier from puzzle-solving to science-grade reasoning-the kind needed to tackle problems whose answers must stand against nature, not merely fit a rubric. Physics is the sharpest test of this shift, which binds symbols to reality in a fundamental way, serving as the cornerstone of most modern technologies. In this work, we manage to advance physics research by developing large language models with exceptional physics reasoning capabilities, especially excel at solving Olympiad-level physics problems. We introduce P1, a family of open-source physics reasoning models trained entirely through reinforcement learning (RL). Among them, P1-235B-A22B is the first open-source model with Gold-medal performance at the latest International Physics Olympiad (IPhO 2025), and wins 12 gold medals out of 13 international/regional physics competitions in 2024/2025. P1-30B-A3B also surpasses almost all other open-source models on IPhO 2025, getting a silver medal. Further equipped with an agentic framework PhysicsMinions, P1-235B-A22B+PhysicsMinions achieves overall No.1 on IPhO 2025, and obtains the highest average score over the 13 physics competitions. Besides physics, P1 models also present great performance on other reasoning tasks like math and coding, showing the great generalibility of P1 series.

[896] Mesh-based Super-resolution of Detonation Flows with Multiscale Graph Transformers

Shivam Barwey, Pinaki Pal

Main category: cs.LG

TL;DR: A novel multiscale graph transformer approach (SR-GT) is developed for mesh-based super-resolution of reacting flows, leveraging graph-based representations and transformer architecture to capture long-range dependencies and generate high-resolution flow fields.

DetailsMotivation: Super-resolution flow reconstruction is valuable for applications like subgrid closure modeling, spatiotemporal forecasting acceleration, data compression, and upscaling sparse experimental measurements in complex reacting flows.

Method: Uses a multiscale graph transformer with element-local (+ neighborhood) graph representation for coarse input, tokenizes the data, and processes through transformer components to generate fine output, compatible with complex geometries and unstructured grids.

Result: SR-GT demonstrates high super-resolution accuracy for reacting flow-field features and superior performance compared to traditional interpolation-based super-resolution schemes, particularly for 2D detonation propagation in hydrogen-air mixtures.

Conclusion: The SR-GT framework provides an effective data-driven approach for mesh-based super-resolution of reacting flows, successfully capturing complex multiscale behavior and outperforming conventional methods.

Abstract: Super-resolution flow reconstruction using state-of-the-art data-driven techniques is valuable for a variety of applications, such as subgrid/subfilter closure modeling, accelerating spatiotemporal forecasting, data compression, and serving as an upscaling tool for sparse experimental measurements. In the present work, a first-of-its-kind multiscale graph transformer approach is developed for mesh-based super-resolution (SR-GT) of reacting flows. The novel data-driven modeling paradigm leverages a graph-based flow-field representation compatible with complex geometries and non-uniform/unstructured grids. Further, the transformer backbone captures long-range dependencies between different parts of the low-resolution flow-field, identifies important features, and then generates the super-resolved flow-field that preserves those features at a higher resolution. The performance of SR-GT is demonstrated in the context of spectral-element-discretized meshes for a challenging test problem of 2D detonation propagation within a premixed hydrogen-air mixture exhibiting highly complex multiscale reacting flow behavior. The SR-GT framework utilizes a unique element-local (+ neighborhood) graph representation for the coarse input, which is then tokenized before being processed by the transformer component to produce the fine output. It is demonstrated that SR-GT provides high super-resolution accuracy for reacting flow-field features and superior performance compared to traditional interpolation-based SR schemes.

[897] Improving Graph Embeddings in Machine Learning Using Knowledge Completion with Validation in a Case Study on COVID-19 Spread

Rosario Napoli, Gabriele Morabito, Antonio Celesti, Massimo Villari, Maria Fazio

Main category: cs.LG

TL;DR: Proposes a GML pipeline with Knowledge Completion phase to uncover latent semantics before embedding generation, significantly altering embedding space geometry.

DetailsMotivation: Graph embeddings derived from explicit topology may miss crucial implicit knowledge in sparse datasets, affecting representation quality.

Method: Integrates Knowledge Completion phase focusing on transitive relations, models hidden connections with decay-based inference functions to reshape graph topology before embedding generation.

Result: Experiments show the pipeline significantly alters embedding space geometry and redefines graph representation quality.

Conclusion: The Knowledge Completion phase is a transformative step, not just simple enrichment, that redefines graph representation quality in GML pipelines.

Abstract: The rise of graph-structured data has driven major advances in Graph Machine Learning (GML), where graph embeddings (GEs) map features from Knowledge Graphs (KGs) into vector spaces, enabling tasks like node classification and link prediction. However, since GEs are derived from explicit topology and features, they may miss crucial implicit knowledge hidden in seemingly sparse datasets, affecting graph structure and their representation. We propose a GML pipeline that integrates a Knowledge Completion (KC) phase to uncover latent dataset semantics before embedding generation. Focusing on transitive relations, we model hidden connections with decay-based inference functions, reshaping graph topology, with consequences on embedding dynamics and aggregation processes in GraphSAGE and Node2Vec. Experiments show that our GML pipeline significantly alters the embedding space geometry, demonstrating that its introduction is not just a simple enrichment but a transformative step that redefines graph representation quality.

[898] Treatment Stitching with Schrödinger Bridge for Enhancing Offline Reinforcement Learning in Adaptive Treatment Strategies

Dong-Hee Shin, Deok-Joong Lee, Young-Han Son, Tae-Eui Kam

Main category: cs.LG

TL;DR: TreatStitch is a data augmentation framework that generates clinically valid treatment trajectories by stitching segments from existing data and using Schrödinger bridges to connect dissimilar states, improving offline RL performance for adaptive treatment strategies.

DetailsMotivation: Offline RL for clinical ATS is limited by data scarcity, and conventional online RL is unsafe for patients. There's a need to augment existing treatment data while maintaining clinical validity.

Method: Proposes Treatment Stitching (TreatStitch) that identifies similar patient states across trajectories to stitch segments, and uses Schrödinger bridge method to generate smooth bridging trajectories for dissimilar states.

Result: Extensive experiments show TreatStitch effectively enhances offline RL performance across multiple treatment datasets while maintaining clinical validity.

Conclusion: TreatStitch provides a practical solution to data scarcity in clinical RL by generating diverse yet clinically valid synthetic trajectories, enabling better optimization of adaptive treatment strategies.

Abstract: Adaptive treatment strategies (ATS) are sequential decision-making processes that enable personalized care by dynamically adjusting treatment decisions in response to evolving patient symptoms. While reinforcement learning (RL) offers a promising approach for optimizing ATS, its conventional online trial-and-error learning mechanism is not permissible in clinical settings due to risks of harm to patients. Offline RL tackles this limitation by learning policies exclusively from historical treatment data, but its performance is often constrained by data scarcity-a pervasive challenge in clinical domains. To overcome this, we propose Treatment Stitching (TreatStitch), a novel data augmentation framework that generates clinically valid treatment trajectories by intelligently stitching segments from existing treatment data. Specifically, TreatStitch identifies similar intermediate patient states across different trajectories and stitches their respective segments. Even when intermediate states are too dissimilar to stitch directly, TreatStitch leverages the Schrödinger bridge method to generate smooth and energy-efficient bridging trajectories that connect dissimilar states. By augmenting these synthetic trajectories into the original dataset, offline RL can learn from a more diverse dataset, thereby improving its ability to optimize ATS. Extensive experiments across multiple treatment datasets demonstrate the effectiveness of TreatStitch in enhancing offline RL performance. Furthermore, we provide a theoretical justification showing that TreatStitch maintains clinical validity by avoiding out-of-distribution transitions.

[899] SenseRay-3D: Generalizable and Physics-Informed Framework for End-to-End Indoor Propagation Modeling

Yu Zheng, Kezhi Wang, Wenji Xi, Gang Yu, Jiming Chen, Jie Zhang

Main category: cs.LG

TL;DR: SenseRay-3D is a physics-informed neural network framework that directly predicts 3D path-loss heatmaps from RGB-D scans, eliminating manual geometry reconstruction and material annotation for indoor radio propagation modeling.

DetailsMotivation: Existing indoor radio propagation modeling approaches require labor-intensive manual modeling of geometry and material properties, limiting scalability and efficiency. There's a need for automated, generalizable solutions.

Method: Builds a sensing-driven voxelized scene representation encoding occupancy, electromagnetic material characteristics, and transmitter-receiver geometry, processed by a SwinUNETR-based neural network to infer environmental path-loss relative to free-space path-loss.

Result: Achieves mean absolute error of 4.27 dB on unseen environments and supports real-time inference at 217 ms per sample, demonstrating scalability, efficiency, and physical consistency. A comprehensive synthetic indoor propagation dataset is also developed.

Conclusion: SenseRay-3D represents a major advancement in sense-driven, generalizable, and physics-consistent indoor propagation modeling, significantly improving upon previous approaches like EM DeepRay.

Abstract: Modeling indoor radio propagation is crucial for wireless network planning and optimization. However, existing approaches often rely on labor-intensive manual modeling of geometry and material properties, resulting in limited scalability and efficiency. To overcome these challenges, this paper presents SenseRay-3D, a generalizable and physics-informed end-to-end framework that predicts three-dimensional (3D) path-loss heatmaps directly from RGB-D scans, thereby eliminating the need for explicit geometry reconstruction or material annotation. The proposed framework builds a sensing-driven voxelized scene representation that jointly encodes occupancy, electromagnetic material characteristics, and transmitter-receiver geometry, which is processed by a SwinUNETR-based neural network to infer environmental path-loss relative to free-space path-loss. A comprehensive synthetic indoor propagation dataset is further developed to validate the framework and to serve as a standardized benchmark for future research. Experimental results show that SenseRay-3D achieves a mean absolute error of 4.27 dB on unseen environments and supports real-time inference at 217 ms per sample, demonstrating its scalability, efficiency, and physical consistency. SenseRay-3D paves a new path for sense-driven, generalizable, and physics-consistent modeling of indoor propagation, marking a major leap beyond our pioneering EM DeepRay framework.

[900] Dynamic Anomaly Identification in Accounting Transactions via Multi-Head Self-Attention Networks

Yi Wang, Ruoyi Fang, Anzhuo Xie, Hanrui Feng, Jianlin Lai

Main category: cs.LG

TL;DR: Transformer-based real-time anomaly detection method for accounting transactions that models multi-dimensional records as time-series matrices and uses self-attention to capture global dependencies, achieving superior performance in detecting hidden abnormal behaviors.

DetailsMotivation: Address the challenges of detecting hidden abnormal behaviors and meeting high timeliness requirements in complex trading environments for accounting transaction monitoring.

Method: Models accounting data as time-series matrices with embedding layers and positional encoding, constructs sequence modeling with multi-head self-attention to capture global dependencies, and integrates feed-forward layers with regularization for deep feature representation.

Result: Outperforms baseline models in AUC, F1-Score, Precision, and Recall metrics, and maintains stable performance under different environmental conditions and data perturbations.

Conclusion: The Transformer-based framework is effective for dynamic anomaly detection in accounting transactions and provides methodological support for intelligent financial risk control and auditing.

Abstract: This study addresses the problem of dynamic anomaly detection in accounting transactions and proposes a real-time detection method based on a Transformer to tackle the challenges of hidden abnormal behaviors and high timeliness requirements in complex trading environments. The approach first models accounting transaction data by representing multi-dimensional records as time-series matrices and uses embedding layers and positional encoding to achieve low-dimensional mapping of inputs. A sequence modeling structure with multi-head self-attention is then constructed to capture global dependencies and aggregate features from multiple perspectives, thereby enhancing the ability to detect abnormal patterns. The network further integrates feed-forward layers and regularization strategies to achieve deep feature representation and accurate anomaly probability estimation. To validate the effectiveness of the method, extensive experiments were conducted on a public dataset, including comparative analysis, hyperparameter sensitivity tests, environmental sensitivity tests, and data sensitivity tests. Results show that the proposed method outperforms baseline models in AUC, F1-Score, Precision, and Recall, and maintains stable performance under different environmental conditions and data perturbations. These findings confirm the applicability and advantages of the Transformer-based framework for dynamic anomaly detection in accounting transactions and provide methodological support for intelligent financial risk control and auditing.

[901] HCPO: Hierarchical Conductor-Based Policy Optimization in Multi-Agent Reinforcement Learning

Zejiao Liu, Junqi Tu, Yitian Hong, Luolin Xiong, Yaochu Jin, Yang Tang, Fangfei Li

Main category: cs.LG

TL;DR: Proposes HCPO algorithm with conductor-based joint policy framework to coordinate multi-agent exploration and enhance joint policy expressive capacity.

DetailsMotivation: Existing MARL methods use independent agent exploration without coordination, limiting joint policy expressive capacity and exploration efficiency.

Method: Hierarchical Conductor-based Policy Optimization (HCPO) with conductor framework that coordinates exploration and aligns policy updates with performance improvement.

Result: HCPO outperforms competitive MARL baselines on StarCraftII, Multi-agent MuJoCo, and Multi-agent Particle Environment in cooperative efficiency and stability.

Conclusion: Conductor-based joint policy framework effectively coordinates multi-agent exploration and enhances joint policy performance with theoretical guarantees.

Abstract: In cooperative Multi-Agent Reinforcement Learning (MARL), efficient exploration is crucial for optimizing the performance of joint policy. However, existing methods often update joint policies via independent agent exploration, without coordination among agents, which inherently constrains the expressive capacity and exploration of joint policies. To address this issue, we propose a conductor-based joint policy framework that directly enhances the expressive capacity of joint policies and coordinates exploration. In addition, we develop a Hierarchical Conductor-based Policy Optimization (HCPO) algorithm that instructs policy updates for the conductor and agents in a direction aligned with performance improvement. A rigorous theoretical guarantee further establishes the monotonicity of the joint policy optimization process. By deploying local conductors, HCPO retains centralized training benefits while eliminating inter-agent communication during execution. Finally, we evaluate HCPO on three challenging benchmarks: StarCraftII Multi-agent Challenge, Multi-agent MuJoCo, and Multi-agent Particle Environment. The results indicate that HCPO outperforms competitive MARL baselines regarding cooperative efficiency and stability.

[902] FairGSE: Fairness-Aware Graph Neural Network without High False Positive Rates

Zhenqiang Ye, Jinjie Lu, Tianlong Gu, Fengrui Hao, Xuemin Wang

Main category: cs.LG

TL;DR: FairGSE is a fairness-aware GNN framework that reduces false positive rates by 39% while maintaining fairness improvements, addressing the issue where existing methods neglect negative label prediction accuracy.

DetailsMotivation: Existing fairness-aware GNNs focus on fairness metrics but neglect the model's ability to predict negative labels, leading to extremely high false positive rates that can be problematic in high-risk scenarios.

Method: Proposed FairGSE framework that maximizes two-dimensional structural entropy (2D-SE) to improve fairness without neglecting false positives.

Result: Experiments show FairGSE reduces false positive rates by 39% compared to state-of-the-art fairness-aware GNNs while achieving comparable fairness improvements.

Conclusion: Classification performance should be carefully calibrated while improving fairness, rather than simply constraining accuracy loss, and FairGSE effectively addresses this balance.

Abstract: Graph neural networks (GNNs) have emerged as the mainstream paradigm for graph representation learning due to their effective message aggregation. However, this advantage also amplifies biases inherent in graph topology, raising fairness concerns. Existing fairness-aware GNNs provide satisfactory performance on fairness metrics such as Statistical Parity and Equal Opportunity while maintaining acceptable accuracy trade-offs. Unfortunately, we observe that this pursuit of fairness metrics neglects the GNN’s ability to predict negative labels, which renders their predictions with extremely high False Positive Rates (FPR), resulting in negative effects in high-risk scenarios. To this end, we advocate that classification performance should be carefully calibrated while improving fairness, rather than simply constraining accuracy loss. Furthermore, we propose Fair GNN via Structural Entropy (\textbf{FairGSE}), a novel framework that maximizes two-dimensional structural entropy (2D-SE) to improve fairness without neglecting false positives. Experiments on several real-world datasets show FairGSE reduces FPR by 39% vs. state-of-the-art fairness-aware GNNs, with comparable fairness improvement.

[903] Fusion-ResNet: A Lightweight multi-label NILM Model Using PCA-ICA Feature Fusion

Sahar Moghimian Hoosh, Ilia Kamyshev, Henni Ouerdane

Main category: cs.LG

TL;DR: Proposes an end-to-end NILM framework with ICA-PCA feature fusion and lightweight Fusion-ResNet model for multi-label classification, achieving higher F1 scores and robustness to multiple simultaneous appliances.

DetailsMotivation: Address challenges in real-world NILM deployment including overfitting, low generalization, and difficulty in disaggregating multiple simultaneously operating appliances.

Method: End-to-end framework with high-frequency labeled data, novel ICA-PCA fused feature extraction, and lightweight Fusion-ResNet neural network for multi-label classification.

Result: Achieves higher average F1 scores across appliances compared to state-of-the-art methods while minimizing training/inference time. Robust to up to 15 concurrently active appliances.

Conclusion: The proposed feature-based Fusion-ResNet model effectively addresses NILM challenges with improved performance and computational efficiency.

Abstract: Non-intrusive load monitoring (NILM) is an advanced load monitoring technique that uses data-driven algorithms to disaggregate the total power consumption of a household into the consumption of individual appliances. However, real-world NILM deployment still faces major challenges, including overfitting, low model generalization, and disaggregating a large number of appliances operating at the same time. To address these challenges, this work proposes an end-to-end framework for the NILM classification task, which consists of high-frequency labeled data, a feature extraction method, and a lightweight neural network. Within this framework, we introduce a novel feature extraction method that fuses Independent Component Analysis (ICA) and Principal Component Analysis (PCA) features. Moreover, we propose a lightweight architecture for multi-label NILM classification (Fusion-ResNet). The proposed feature-based model achieves a higher $F1$ score on average and across different appliances compared to state-of-the-art NILM classifiers while minimizing the training and inference time. Finally, we assessed the performance of our model against baselines with a varying number of simultaneously active devices. Results demonstrate that Fusion-ResNet is relatively robust to stress conditions with up to 15 concurrently active appliances.

[904] Variation-Bounded Loss for Noise-Tolerant Learning

Jialiang Wang, Xiong Zhou, Xianming Liu, Gangfeng Hu, Deming Zhai, Junjun Jiang, Haoliang Li

Main category: cs.LG

TL;DR: Proposes Variation-Bounded Loss (VBL) family with bounded variation ratio property for robust learning against noisy labels, showing smaller variation ratio improves robustness and enables asymmetric condition relaxation.

DetailsMotivation: To mitigate the negative impact of noisy labels in supervised learning by developing robust loss functions that can handle label noise effectively.

Method: Introduces Variation Ratio as a novel robustness property and proposes Variation-Bounded Loss (VBL) family with bounded variation ratio. Provides theoretical analysis showing smaller variation ratio leads to better robustness and enables relaxation of symmetric condition.

Result: Positive experiments on various datasets demonstrate the effectiveness and flexibility of the proposed approach in handling noisy labels.

Conclusion: Variation-Bounded Loss functions with bounded variation ratio provide an effective and flexible solution for robust learning against noisy labels, offering theoretical guarantees and practical applicability.

Abstract: Mitigating the negative impact of noisy labels has been aperennial issue in supervised learning. Robust loss functions have emerged as a prevalent solution to this problem. In this work, we introduce the Variation Ratio as a novel property related to the robustness of loss functions, and propose a new family of robust loss functions, termed Variation-Bounded Loss (VBL), which is characterized by a bounded variation ratio. We provide theoretical analyses of the variation ratio, proving that a smaller variation ratio would lead to better robustness. Furthermore, we reveal that the variation ratio provides a feasible method to relax the symmetric condition and offers a more concise path to achieve the asymmetric condition. Based on the variation ratio, we reformulate several commonly used loss functions into a variation-bounded form for practical applications. Positive experiments on various datasets exhibit the effectiveness and flexibility of our approach.

[905] Finding Time Series Anomalies using Granular-ball Vector Data Description

Lifeng Shen, Liang Peng, Ruiwen Liu, Shuyin Xia, Yi Liu

Main category: cs.LG

TL;DR: GBOC is a novel anomaly detection method using granular-ball representations that partition data into compact regions, improving robustness and efficiency in time series analysis.

DetailsMotivation: Traditional methods like nearest neighbor and clustering have rigid assumptions that fail in complex temporal scenarios, requiring more adaptive approaches.

Method: Uses Granular-ball Vector Data Description (GVDD) with density-guided hierarchical splitting to create compact regions, aligning samples with granular-ball centers during training and computing anomaly scores based on distance during inference.

Result: Extensive experiments show GBOC effectively handles time series anomaly detection challenges with improved robustness and efficiency.

Conclusion: GBOC provides a superior approach to time series anomaly detection by focusing on dense regions and reducing prototype complexity while preserving local topological structure.

Abstract: Modeling normal behavior in dynamic, nonlinear time series data is challenging for effective anomaly detection. Traditional methods, such as nearest neighbor and clustering approaches, often depend on rigid assumptions, such as a predefined number of reliable neighbors or clusters, which frequently break down in complex temporal scenarios. To address these limitations, we introduce the Granular-ball One-Class Network (GBOC), a novel approach based on a data-adaptive representation called Granular-ball Vector Data Description (GVDD). GVDD partitions the latent space into compact, high-density regions represented by granular-balls, which are generated through a density-guided hierarchical splitting process and refined by removing noisy structures. Each granular-ball serves as a prototype for local normal behavior, naturally positioning itself between individual instances and clusters while preserving the local topological structure of the sample set. During training, GBOC improves the compactness of representations by aligning samples with their nearest granular-ball centers. During inference, anomaly scores are computed based on the distance to the nearest granular-ball. By focusing on dense, high-quality regions and significantly reducing the number of prototypes, GBOC delivers both robustness and efficiency in anomaly detection. Extensive experiments validate the effectiveness and superiority of the proposed method, highlighting its ability to handle the challenges of time series anomaly detection.

[906] Open Banking Foundational Model: Learning Language Representations from Few Financial Transactions

Gustavo Polleti, Marlesson Santana, Eduardo Fontes

Main category: cs.LG

TL;DR: A multimodal foundational model for financial transactions that combines structured attributes and textual descriptions, outperforming traditional methods and showing strong performance in data-scarce Open Banking scenarios.

DetailsMotivation: To create a unified representation for financial transactions that integrates both structured and unstructured data, addressing limitations of traditional feature engineering and discrete event sequence methods.

Method: Adapted masked language modeling to transaction sequences, integrating structured attributes and unstructured textual descriptions into multimodal representations.

Result: Outperforms classical feature engineering and discrete event sequence methods, particularly effective in data-scarce Open Banking scenarios. First large-scale study across thousands of financial institutions in North America.

Conclusion: Multimodal representations can generalize across geographies and institutions, highlighting the potential of self-supervised models for financial applications like fraud prevention, credit risk, and customer insights.

Abstract: We introduced a multimodal foundational model for financial transactions that integrates both structured attributes and unstructured textual descriptions into a unified representation. By adapting masked language modeling to transaction sequences, we demonstrated that our approach not only outperforms classical feature engineering and discrete event sequence methods but is also particularly effective in data-scarce Open Banking scenarios. To our knowledge, this is the first large-scale study across thousands of financial institutions in North America, providing evidence that multimodal representations can generalize across geographies and institutions. These results highlight the potential of self-supervised models to advance financial applications ranging from fraud prevention and credit risk to customer insights

[907] Rethinking Deep Alignment Through The Lens Of Incomplete Learning

Thong Bach, Dung Nguyen, Thao Minh Le, Truyen Tran

Main category: cs.LG

TL;DR: The paper identifies position-dependent gradient weakening in autoregressive training as the cause of incomplete safety learning in LLMs, introduces base-favored tokens as indicators, and develops targeted completion methods to improve adversarial robustness.

DetailsMotivation: Large language models remain vulnerable to adversarial attacks despite safety alignment efforts, indicating fundamental limitations in current safety training methodologies.

Method: Mechanistic analysis of gradient weakening, identification of base-favored tokens as computational indicators, and development of targeted completion with adaptive penalties and hybrid teacher distillation.

Result: Experimental evaluation shows 48-98% reductions in attack success rates across Llama and Qwen model families while preserving general capabilities.

Conclusion: The research provides both mechanistic understanding and practical solutions for fundamental limitations in safety alignment methodologies.

Abstract: Large language models exhibit systematic vulnerabilities to adversarial attacks despite extensive safety alignment. We provide a mechanistic analysis revealing that position-dependent gradient weakening during autoregressive training creates signal decay, leading to incomplete safety learning where safety training fails to transform model preferences in later response regions fully. We introduce base-favored tokens – vocabulary elements where base models assign higher probability than aligned models – as computational indicators of incomplete safety learning and develop a targeted completion method that addresses undertrained regions through adaptive penalties and hybrid teacher distillation. Experimental evaluation across Llama and Qwen model families demonstrates dramatic improvements in adversarial robustness, with 48–98% reductions in attack success rates while preserving general capabilities. These results establish both a mechanistic understanding and practical solutions for fundamental limitations in safety alignment methodologies.

[908] Data-Efficient Self-Supervised Algorithms for Fine-Grained Birdsong Analysis

Houtan Ghaffari, Lukas Rauch, Paul Devos

Main category: cs.LG

TL;DR: A lightweight neural network (Residual-MLP-RNN) and three-stage training pipeline for birdsong syllable annotation that reduces annotation costs through self-supervised learning, supervised training with augmentation, and semi-supervised post-training.

DetailsMotivation: Birdsong research requires precise syllable-level annotations, but manual annotation is expensive. There's a need for automated, data-efficient methods to reduce annotation costs in bioacoustics, neuroscience, and linguistics research.

Method: Residual-MLP-RNN architecture with three-stage training: 1) Self-supervised learning using masked prediction and online clustering on unlabeled data, 2) Supervised training with data augmentation for frame-level detection, 3) Semi-supervised post-training aligned with downstream task using unlabeled data.

Result: Demonstrated effective performance on complex Canary songs in extreme label-scarcity scenarios, validating the method for difficult birdsong annotation tasks. Self-supervised embeddings showed potential for linear probing and unsupervised analysis.

Conclusion: The proposed approach enables reliable deep birdsong syllable detection with minimal expert labor, making it suitable for complex birdsongs and reducing annotation costs significantly.

Abstract: Many bioacoustics, neuroscience, and linguistics research utilize birdsongs as proxy models to acquire knowledge in diverse areas. Developing models generally requires precisely annotated data at the level of syllables. Hence, automated and data-efficient methods that reduce annotation costs are in demand. This work presents a lightweight, yet performant neural network architecture for birdsong annotation called Residual-MLP-RNN. Then, it presents a robust three-stage training pipeline for developing reliable deep birdsong syllable detectors with minimal expert labor. The first stage is self-supervised learning from unlabeled data. Two of the most successful pretraining paradigms are explored, namely, masked prediction and online clustering. The second stage is supervised training with effective data augmentations to create a robust model for frame-level syllable detection. The third stage is semi-supervised post-training, which leverages the unlabeled data again. However, unlike the initial phase, this time it is aligned with the downstream task. The performance of this data-efficient approach is demonstrated for the complex song of the Canary in extreme label-scarcity scenarios. Canary has one of the most difficult songs to annotate, which implicitly validates the method for other birds. Finally, the potential of self-supervised embeddings is assessed for linear probing and unsupervised birdsong analysis.

[909] FGM optimization in complex domains using Gaussian process regression based profile generation algorithm

Chaitanya Kumar Konda, Piyush Agrawal, Shivansh Srivastava, Manish Agrawal

Main category: cs.LG

TL;DR: Proposes a Gaussian Process Regression-based algorithm for generating smooth functionally graded material profiles on complex-shaped domains, coupled with a modified genetic algorithm for optimization.

DetailsMotivation: Addresses the challenge of designing functionally graded materials for arbitrary-shaped domains, which requires handling complex geometries while maintaining smooth material transitions and boundary constraints.

Method: Uses Gaussian Process Regression to generate volume fraction profiles that handle complex shapes and boundary constraints, with a length scale parameter to control smoothness. Couples this with a modified genetic algorithm using a projection operator for optimization.

Result: The algorithm successfully generates diverse, smooth FGM profiles for complex domains while adhering to specified boundary conditions. The optimization framework demonstrates efficacy through thermoelastic optimization examples.

Conclusion: The proposed GPR-based profile generation with modified genetic algorithm provides an effective framework for designing functionally graded materials on arbitrary-shaped domains, offering control over smoothness and design space size.

Abstract: This manuscript addresses the challenge of designing functionally graded materials (FGMs) for arbitrary-shaped domains. Towards this goal, the present work proposes a generic volume fraction profile generation algorithm based on Gaussian Process Regression (GPR). The proposed algorithm can handle complex-shaped domains and generate smooth FGM profiles while adhering to the specified volume fraction values at boundaries/part of boundaries. The resulting design space from GPR comprises diverse profiles, enhancing the potential for discovering optimal configurations. Further, the algorithm allows the user to control the smoothness of the underlying profiles and the size of the design space through a length scale parameter. Further, the proposed profile generation scheme is coupled with the genetic algorithm to find the optimum FGM profiles for a given application. To make the genetic algorithm consistent with the GPR profile generation scheme, the standard simulated binary crossover operator in the genetic algorithm has been modified with a projection operator. We present numerous thermoelastic optimization examples to demonstrate the efficacy of the proposed profile generation algorithm and optimization framework.

[910] TSGDiff: Rethinking Synthetic Time Series Generation from a Pure Graph Perspective

Lifeng Shen, Xuyang Li, Lele Long

Main category: cs.LG

TL;DR: TSGDiff is a novel framework for time series generation using diffusion models with a graph-based perspective, representing time series as dynamic graphs and employing GNN-based encoder-decoder architecture.

DetailsMotivation: Diffusion models show promise in data generation but struggle with time series due to complex temporal dependencies and structural patterns that need to be captured.

Method: Represent time series as dynamic graphs based on Fourier spectrum and temporal dependencies, use GNN-based encoder-decoder for latent space construction, and employ diffusion process to model structural representation distribution.

Result: Experiments on real-world datasets show TSGDiff generates high-quality synthetic time series that preserve temporal dependencies and structural integrity.

Conclusion: TSGDiff advances synthetic time series generation by effectively capturing structural patterns through graph-based representation and diffusion modeling, with the proposed Topo-FID metric providing better structural similarity assessment.

Abstract: Diffusion models have shown great promise in data generation, yet generating time series data remains challenging due to the need to capture complex temporal dependencies and structural patterns. In this paper, we present \textit{TSGDiff}, a novel framework that rethinks time series generation from a graph-based perspective. Specifically, we represent time series as dynamic graphs, where edges are constructed based on Fourier spectrum characteristics and temporal dependencies. A graph neural network-based encoder-decoder architecture is employed to construct a latent space, enabling the diffusion process to model the structural representation distribution of time series effectively. Furthermore, we propose the Topological Structure Fidelity (Topo-FID) score, a graph-aware metric for assessing the structural similarity of time series graph representations. Topo-FID integrates two sub-metrics: Graph Edit Similarity, which quantifies differences in adjacency matrices, and Structural Entropy Similarity, which evaluates the entropy of node degree distributions. This comprehensive metric provides a more accurate assessment of structural fidelity in generated time series. Experiments on real-world datasets demonstrate that \textit{TSGDiff} generates high-quality synthetic time series data generation, faithfully preserving temporal dependencies and structural integrity, thereby advancing the field of synthetic time series generation.

[911] Understanding InfoNCE: Transition Probability Matrix Induced Feature Clustering

Ge Cheng, Shuo Wang, Yun Zhang

Main category: cs.LG

TL;DR: SC-InfoNCE is a novel contrastive learning loss function that introduces tunable convergence targets to flexibly control feature similarity alignment, outperforming standard InfoNCE across image, graph, and text tasks.

DetailsMotivation: Despite InfoNCE's empirical success in contrastive learning, its theoretical foundations remain limited. The authors aim to better understand and improve upon InfoNCE by modeling augmentation dynamics and feature clustering.

Method: Introduced explicit feature space and transition probability matrix to model augmentation dynamics. Proposed SC-InfoNCE loss with tunable convergence targets that scale the target matrix to control feature similarity alignment.

Result: SC-InfoNCE consistently achieves strong and reliable performance across diverse domains including image, graph, and text tasks on benchmark datasets.

Conclusion: The proposed SC-InfoNCE enables flexible control over feature similarity alignment, allowing training objectives to better match downstream data statistics, providing both theoretical insights and practical improvements over standard InfoNCE.

Abstract: Contrastive learning has emerged as a cornerstone of unsupervised representation learning across vision, language, and graph domains, with InfoNCE as its dominant objective. Despite its empirical success, the theoretical underpinnings of InfoNCE remain limited. In this work, we introduce an explicit feature space to model augmented views of samples and a transition probability matrix to capture data augmentation dynamics. We demonstrate that InfoNCE optimizes the probability of two views sharing the same source toward a constant target defined by this matrix, naturally inducing feature clustering in the representation space. Leveraging this insight, we propose Scaled Convergence InfoNCE (SC-InfoNCE), a novel loss function that introduces a tunable convergence target to flexibly control feature similarity alignment. By scaling the target matrix, SC-InfoNCE enables flexible control over feature similarity alignment, allowing the training objective to better match the statistical properties of downstream data. Experiments on benchmark datasets, including image, graph, and text tasks, show that SC-InfoNCE consistently achieves strong and reliable performance across diverse domains.

[912] Scaling Law Analysis in Federated Learning: How to Select the Optimal Model Size?

Xuanyu Chen, Nan Yang, Shuai Wang, Dong Yuan

Main category: cs.LG

TL;DR: The paper analyzes how model scaling principles apply to federated learning, finding that optimal model size decreases with more clients and that FL inherently reduces achievable generalization performance compared to centralized training.

DetailsMotivation: As LLMs scale up, high-quality training data becomes scarce. Federated Learning offers access to abundant edge device data while preserving privacy, but its impact on model scaling remains unexplored.

Method: Derived a PAC-Bayes generalization error bound for FL models, found analytic solution for optimal model size by minimizing this bound, and empirically validated with extensive training experiments.

Result: Optimal model size has negative power law relationship with number of clients; FL reduces generalization performance upper bound; optimal size depends on average client compute.

Conclusion: Model scaling principles need adaptation for FL, with smaller optimal models in distributed settings and inherent performance limitations compared to centralized training.

Abstract: The recent success of large language models (LLMs) has sparked a growing interest in training large-scale models. As the model size continues to scale, concerns are growing about the depletion of high-quality, well-curated training data. This has led practitioners to explore training approaches like Federated Learning (FL), which can leverage the abundant data on edge devices while maintaining privacy. However, the decentralization of training datasets in FL introduces challenges to scaling large models, a topic that remains under-explored. This paper fills this gap and provides qualitative insights on generalizing the previous model scaling experience to federated learning scenarios. Specifically, we derive a PAC-Bayes (Probably Approximately Correct Bayesian) upper bound for the generalization error of models trained with stochastic algorithms in federated settings and quantify the impact of distributed training data on the optimal model size by finding the analytic solution of model size that minimizes this bound. Our theoretical results demonstrate that the optimal model size has a negative power law relationship with the number of clients if the total training compute is unchanged. Besides, we also find that switching to FL with the same training compute will inevitably reduce the upper bound of generalization performance that the model can achieve through training, and that estimating the optimal model size in federated scenarios should depend on the average training compute across clients. Furthermore, we also empirically validate the correctness of our results with extensive training runs on different models, network settings, and datasets.

[913] Evaluation of Multi- and Single-objective Learning Algorithms for Imbalanced Data

Szymon Wojciechowski, Michał Woźniak

Main category: cs.LG

TL;DR: This paper proposes a new evaluation methodology to reliably compare multi-objective optimization algorithms (returning Pareto fronts) with single-solution methods in imbalanced data classification, addressing the challenge of selecting appropriate solutions from Pareto fronts based on user preferences.

DetailsMotivation: There's a significant gap in classifier evaluation methodology for comparing methods that return single solutions with multi-objective optimization algorithms that return Pareto fronts, particularly in imbalanced data classification where multiple opposing criteria need optimization.

Method: The paper proposes a new evaluation approach that enables reliable comparison between MOO algorithms (returning Pareto fronts) and single-solution methods, focusing on selecting solutions from Pareto fronts tailored to user preferences for fair comparison.

Result: A novel evaluation framework is developed that allows for meaningful comparison between different types of algorithms, specifically addressing the challenge of selecting appropriate solutions from Pareto fronts for comparison with single-solution methods.

Conclusion: The proposed methodology fills an important gap in classifier evaluation by providing a reliable way to compare multi-objective algorithms with single-solution methods, with illustrative examples provided to demonstrate the approach.

Abstract: Many machine learning tasks aim to find models that work well not for a single, but for a group of criteria, often opposing ones. One such example is imbalanced data classification, where, on the one hand, we want to achieve the best possible classification quality for data from the minority class without degrading the classification quality of the majority class. One solution is to propose an aggregate learning criterion and reduce the multi-objective learning task to a single-criteria optimization problem. Unfortunately, such an approach is characterized by ambiguity of interpretation since the value of the aggregated criterion does not indicate the value of the component criteria. Hence, there are more and more proposals for algorithms based on multi-objective optimization (MOO), which can simultaneously optimize multiple criteria. However, such an approach results in a set of multiple non-dominated solutions (Pareto front). The selection of a single solution from the Pareto front is a challenge itself, and much attention is paid to the issue of how to select it considering user preferences, as well as how to compare solutions returned by different MOO algorithms among themselves. Thus, a significant gap has been identified in the classifier evaluation methodology, i.e., how to reliably compare methods returning single solutions with algorithms returning solutions in the form of Pareto fronts. To fill the aforementioned gap, this article proposes a new, reliable way of evaluating algorithms based on multi-objective algorithms with methods that return single solutions while pointing out solutions from a Pareto front tailored to the user’s preferences. This work focuses only on algorithm comparison, not their learning. The algorithms selected for this study are illustrative to help understand the proposed approach.

[914] MPD-SGR: Robust Spiking Neural Networks with Membrane Potential Distribution-Driven Surrogate Gradient Regularization

Runhao Jiang, Chengzhi Jiang, Rui Yan, Huajin Tang

Main category: cs.LG

TL;DR: The paper proposes MPD-SGR, a novel surrogate gradient regularization method that improves spiking neural network robustness by controlling membrane potential distribution to reduce gradient magnitude and sensitivity to adversarial attacks.

DetailsMotivation: Surrogate gradient methods enhance SNN performance but increase vulnerability to adversarial attacks. While spike coding and neural dynamics have been studied for robustness, the role of gradient magnitude determined by membrane potential distribution and surrogate gradient function interaction remains underexplored.

Method: Proposed MPD-SGR method that explicitly regularizes membrane potential distribution based on its interaction with surrogate gradient function to reduce the proportion of membrane potential within gradient-available range, thereby mitigating sensitivity to input perturbations.

Result: Extensive experiments across multiple image classification benchmarks and diverse network architectures show MPD-SGR significantly enhances SNN resilience to adversarial perturbations and exhibits strong generalizability across network configurations, SG function variants, and spike encoding schemes.

Conclusion: Controlling membrane potential distribution relative to surrogate gradient function is crucial for improving SNN robustness, and the proposed MPD-SGR method effectively achieves this through explicit regularization, providing a principled approach to enhance adversarial robustness in spiking neural networks.

Abstract: The surrogate gradient (SG) method has shown significant promise in enhancing the performance of deep spiking neural networks (SNNs), but it also introduces vulnerabilities to adversarial attacks. Although spike coding strategies and neural dynamics parameters have been extensively studied for their impact on robustness, the critical role of gradient magnitude, which reflects the model’s sensitivity to input perturbations, remains underexplored. In SNNs, the gradient magnitude is primarily determined by the interaction between the membrane potential distribution (MPD) and the SG function. In this study, we investigate the relationship between the MPD and SG and its implications for improving the robustness of SNNs. Our theoretical analysis reveals that reducing the proportion of membrane potential lying within the gradient-available range of the SG function effectively mitigates the sensitivity of SNNs to input perturbations. Building upon this insight, we propose a novel MPD-driven surrogate gradient regularization (MPD-SGR) method, which enhances robustness by explicitly regularizing the MPD based on its interaction with the SG function. Extensive experiments across multiple image classification benchmarks and diverse network architectures confirm that the MPD-SGR method significantly enhances the resilience of SNNs to adversarial perturbations and exhibits strong generalizability across diverse network configurations, SG function variants, and spike encoding schemes.

[915] AlignTree: Efficient Defense Against LLM Jailbreak Attacks

Gil Goren, Shahar Katz, Lior Wolf

Main category: cs.LG

TL;DR: AlignTree is an efficient defense mechanism that monitors LLM activations using a random forest classifier to detect misaligned behavior and prevent harmful content generation, with minimal computational overhead.

DetailsMotivation: LLMs are vulnerable to adversarial attacks that bypass safety guidelines, and existing defense approaches are either computationally expensive or easily circumvented, making them impractical for real-world systems.

Method: AlignTree monitors LLM activations during generation using an efficient random forest classifier that combines two signals: refusal direction (linear representation of misaligned prompts) and SVM-based signal (non-linear features of harmful content), without requiring additional prompts or auxiliary models.

Result: Extensive experiments demonstrate that AlignTree provides efficient and robust defense across multiple LLMs and benchmarks.

Conclusion: AlignTree offers a practical solution for enhancing LLM alignment against adversarial attacks while maintaining minimal computational overhead, making it suitable for real-world deployment.

Abstract: Large Language Models (LLMs) are vulnerable to adversarial attacks that bypass safety guidelines and generate harmful content. Mitigating these vulnerabilities requires defense mechanisms that are both robust and computationally efficient. However, existing approaches either incur high computational costs or rely on lightweight defenses that can be easily circumvented, rendering them impractical for real-world LLM-based systems. In this work, we introduce the AlignTree defense, which enhances model alignment while maintaining minimal computational overhead. AlignTree monitors LLM activations during generation and detects misaligned behavior using an efficient random forest classifier. This classifier operates on two signals: (i) the refusal direction – a linear representation that activates on misaligned prompts, and (ii) an SVM-based signal that captures non-linear features associated with harmful content. Unlike previous methods, AlignTree does not require additional prompts or auxiliary guard models. Through extensive experiments, we demonstrate the efficiency and robustness of AlignTree across multiple LLMs and benchmarks.

[916] Chicken Swarm Kernel Particle Filter: A Structured Rejuvenation Approach with KLD-Efficient Sampling

Hangshuo Tian

Main category: cs.LG

TL;DR: Analysis shows that combining Chicken Swarm Optimization with particle filters creates a more concentrated particle distribution, potentially reducing the required particle count for the same statistical accuracy when using KLD sampling.

DetailsMotivation: To understand the theoretical interaction between swarm intelligence-based particle rejuvenation and KLD-based adaptive sampling, which is not yet fully understood despite their empirical success.

Method: Used a simplified modeling framework to analyze CSO rejuvenation’s effect on particle distribution, approximating fitness-driven updates as mean-square contraction and applying Karamata’s inequality to KLD-sampling’s bin occupancy function.

Result: CSO-enhanced PF produces a more concentrated particle distribution, potentially requiring lower expected particle count than standard PF to achieve the same statistical error bound.

Conclusion: The study provides a theoretical framework explaining the computational efficiency observed when combining CSO with KLD sampling, offering insights for designing more efficient adaptive filters.

Abstract: Particle filters (PFs) are often combined with swarm intelligence (SI) algorithms, such as Chicken Swarm Optimization (CSO), for particle rejuvenation. Separately, Kullback–Leibler divergence (KLD) sampling is a common strategy for adaptively sizing the particle set. However, the theoretical interaction between SI-based rejuvenation kernels and KLD-based adaptive sampling is not yet fully understood. This paper investigates this specific interaction. We analyze, under a simplified modeling framework, the effect of the CSO rejuvenation step on the particle set distribution. We propose that the fitness-driven updates inherent in CSO can be approximated as a form of mean-square contraction. This contraction tends to produce a particle distribution that is more concentrated than that of a baseline PF, or in mathematical terms, a distribution that is plausibly more ``peaked’’ in a majorization sense. By applying Karamata’s inequality to the concave function that governs the expected bin occupancy in KLD-sampling, our analysis suggests a connection: under the stated assumptions, the CSO-enhanced PF (CPF) is expected to require a lower \emph{expected} particle count than the standard PF to satisfy the same statistical error bound. The goal of this study is not to provide a fully general proof, but rather to offer a tractable theoretical framework that helps to interpret the computational efficiency empirically observed when combining these techniques, and to provide a starting point for designing more efficient adaptive filters.

[917] SCI: An Equilibrium for Signal Intelligence

Vishal Joshua Meesala

Main category: cs.LG

TL;DR: SCI is a control-theoretic framework that models interpretability as a regulated state, reducing interpretive error by 25-42% while maintaining performance within 1-2 percentage points of baseline.

DetailsMotivation: To create more stable, trustworthy, and reliable interpretability in AI systems by formalizing interpretability as a control objective that can be actively regulated.

Method: Uses a closed-loop control system with three components: reliability-weighted multiscale features, knowledge-guided interpreter with traceable markers, and Lyapunov-guided controller with safeguards and rollback capabilities.

Result: Reduces interpretive error by 25-42% (mean 38%) across biomedical, industrial, and environmental domains, while reducing SP variance from 0.030 to 0.011 for more stable explanations.

Conclusion: Modeling interpretability as a control objective yields steadier, faster-recovering, and more trustworthy interpretive behavior across diverse signal regimes.

Abstract: We present SCI, a closed-loop, control-theoretic framework that models interpretability as a regulated state. SCI formalizes the interpretive error Delta SP and actively drives SP(t) in [0, 1] (“Surgical Precision”) toward a target via a projected update on the parameters Theta under a human-gain budget. The framework operates through three coordinated components: (1) reliability-weighted, multiscale features P(t, s); (2) a knowledge-guided interpreter psi_Theta that emits traceable markers and rationales; and (3) a Lyapunov-guided controller equipped with rollback, trust-region safeguards, and a descent condition. Across biomedical (EEG/ECG/ICU), industrial (bearings/tool wear), and environmental (climate/seismic) domains, SCI reduces interpretive error by 25-42% (mean 38%, 95% confidence interval 22-43%) relative to static explainers while maintaining AUC/F1 within approximately 1-2 percentage points of baseline. SCI also reduces SP variance from 0.030 to 0.011, indicating substantially more stable explanations. Modeling interpretability as a control objective yields steadier, faster-recovering, and more trustworthy interpretive behavior across diverse signal regimes.

[918] Cross-view Joint Learning for Mixed-Missing Multi-view Unsupervised Feature Selection

Zongxin Shen, Yanyong Huang, Dongjie Wang, Jinyuan Chang, Fengmao Lv, Tianrui Li, Xiaoyi Jiang

Main category: cs.LG

TL;DR: CLIM-FS is a novel incomplete multi-view unsupervised feature selection method that addresses mixed-missing scenarios, jointly learns feature selection and adaptive data imputation, and provides theoretical analysis.

DetailsMotivation: Existing IMUFS methods face three key challenges: inability to handle mixed-missing scenarios (both missing views and partial features), insufficient utilization of view consistency and diversity, and lack of theoretical analysis on feature selection and data imputation interaction.

Method: Integrates imputation of missing views and variables into a feature selection model based on nonnegative orthogonal matrix factorization, leverages consensus cluster structure and cross-view local geometrical structure for synergistic learning.

Result: Experimental results on eight real-world multi-view datasets demonstrate that CLIM-FS outperforms state-of-the-art methods.

Conclusion: CLIM-FS effectively addresses the mixed-missing problem in multi-view feature selection through joint learning of feature selection and adaptive imputation, with theoretical analysis clarifying the collaborative mechanism.

Abstract: Incomplete multi-view unsupervised feature selection (IMUFS), which aims to identify representative features from unlabeled multi-view data containing missing values, has received growing attention in recent years. Despite their promising performance, existing methods face three key challenges: 1) by focusing solely on the view-missing problem, they are not well-suited to the more prevalent mixed-missing scenario in practice, where some samples lack entire views or only partial features within views; 2) insufficient utilization of consistency and diversity across views limits the effectiveness of feature selection; and 3) the lack of theoretical analysis makes it unclear how feature selection and data imputation interact during the joint learning process. Being aware of these, we propose CLIM-FS, a novel IMUFS method designed to address the mixed-missing problem. Specifically, we integrate the imputation of both missing views and variables into a feature selection model based on nonnegative orthogonal matrix factorization, enabling the joint learning of feature selection and adaptive data imputation. Furthermore, we fully leverage consensus cluster structure and cross-view local geometrical structure to enhance the synergistic learning process. We also provide a theoretical analysis to clarify the underlying collaborative mechanism of CLIM-FS. Experimental results on eight real-world multi-view datasets demonstrate that CLIM-FS outperforms state-of-the-art methods.

[919] Calibrated Adversarial Sampling: Multi-Armed Bandit-Guided Generalization Against Unforeseen Attacks

Rui Wang, Zeming Wei, Xiyue Zhang, Meng Sun

Main category: cs.LG

TL;DR: Proposes Calibrated Adversarial Sampling (CAS), an efficient fine-tuning method using multi-armed bandit optimization to dynamically balance multiple robustness dimensions, achieving superior overall robustness while maintaining high clean accuracy.

DetailsMotivation: Existing adversarial training frameworks focus on limited attack types, leaving DNNs vulnerable to other attack types encountered in practice but not addressed during training.

Method: Uses multi-armed bandit optimization framework to dynamically design rewards and balance exploration-exploitation by considering dynamic and interdependent characteristics of multiple robustness dimensions.

Result: Experiments on benchmark datasets show CAS achieves superior overall robustness while maintaining high clean accuracy.

Conclusion: CAS provides a new paradigm for robust generalization of DNNs by efficiently addressing multiple attack types through calibrated adversarial sampling.

Abstract: Deep Neural Networks (DNNs) are known to be vulnerable to various adversarial perturbations. To address the safety concerns arising from these vulnerabilities, adversarial training (AT) has emerged as one of the most effective paradigms for enhancing the robustness of DNNs. However, existing AT frameworks primarily focus on a single or a limited set of attack types, leaving DNNs still exposed to attack types that may be encountered in practice but not addressed during training. In this paper, we propose an efficient fine-tuning method called Calibrated Adversarial Sampling (CAS) to address these issues. From the optimization perspective within the multi-armed bandit framework, it dynamically designs rewards and balances exploration and exploitation by considering the dynamic and interdependent characteristics of multiple robustness dimensions. Experiments on benchmark datasets show that CAS achieves superior overall robustness while maintaining high clean accuracy, providing a new paradigm for robust generalization of DNNs.

[920] MMSense: Adapting Vision-based Foundation Model for Multi-task Multi-modal Wireless Sensing

Zhizhen Li, Xuanhao Luo, Xueren Ge, Longyu Zhou, Xingqin Lin, Yuchen Liu

Main category: cs.LG

TL;DR: MMSense is a multi-modal, multi-task foundation model for unified wireless sensing that integrates image, radar, LiDAR, and textual data, outperforming task-specific and large-model baselines.

DetailsMotivation: To bridge the gap in existing wireless AI models that are limited to single-modality inputs and channel-specific objectives, by exploring the broader potential of large foundation models for unified wireless sensing.

Method: Transforms multi-modal data into vision-compatible representations, uses modality gating for adaptive fusion, employs vision-based large language model backbone for unified feature alignment, and incorporates task-specific sequential attention with uncertainty-based loss weighting.

Result: Experiments on real wireless scenario datasets show MMSense outperforms both task-specific and large-model baselines, confirming strong generalization across heterogeneous sensing tasks.

Conclusion: MMSense successfully demonstrates the effectiveness of multi-modal foundation models for unified wireless sensing across channel-centric, environment-aware, and human-centered tasks.

Abstract: Large AI models have been widely adopted in wireless communications for channel modeling, beamforming, and resource optimization. However, most existing efforts remain limited to single-modality inputs and channel-specific objec- tives, overlooking the broader potential of large foundation models for unified wireless sensing. To bridge this gap, we propose MMSense, a multi-modal, multi-task foundation model that jointly addresses channel-centric, environment-aware, and human-centered sensing. Our framework integrates image, radar, LiDAR, and textual data by transforming them into vision- compatible representations, enabling effective cross-modal align- ment within a unified feature space. A modality gating mecha- nism adaptively fuses these representations, while a vision-based large language model backbone enables unified feature align- ment and instruction-driven task adaptation. Furthermore, task- specific sequential attention and uncertainty-based loss weighting mechanisms enhance cross-task generalization. Experiments on real wireless scenario datasets show that our approach outper- forms both task-specific and large-model baselines, confirming its strong generalization across heterogeneous sensing tasks.

[921] Optimal Self-Consistency for Efficient Reasoning with Large Language Models

Austin Feng, Marius Alonso, Ambroise Odonnat

Main category: cs.LG

TL;DR: This paper analyzes self-consistency’s scaling behavior and introduces Blend-ASC, a dynamic sample allocation method that achieves 6.8x better sample efficiency than vanilla SC.

DetailsMotivation: Self-consistency is effective but prohibitively expensive at scale and lacks theoretical understanding of sample efficiency and scaling behavior.

Method: Comprehensive analysis of SC’s scaling behavior using mode estimation and voting theory, plus introduction of Blend-ASC that dynamically allocates samples to questions during inference.

Result: Derived power law scaling for self-consistency, validated empirically, and Blend-ASC achieved state-of-the-art sample efficiency using 6.8x fewer samples than vanilla SC.

Conclusion: Blend-ASC is a hyperparameter-free, budget-adaptive variant that significantly improves sample efficiency while maintaining performance, making it easily applicable to any self-consistency application.

Abstract: Self-consistency (SC) is a widely used test-time inference technique for improving performance in chain-of-thought reasoning. It involves generating multiple responses, or samples from a large language model (LLM) and selecting the most frequent answer. This procedure can naturally be viewed as a majority vote or empirical mode estimation. Despite its effectiveness, SC is prohibitively expensive at scale when naively applied to datasets, and it lacks a unified theoretical treatment of sample efficiency and scaling behavior. In this paper, we provide the first comprehensive analysis of SC’s scaling behavior and its variants, drawing on mode estimation and voting theory. We derive and empirically validate power law scaling for self-consistency across datasets, and analyze the sample efficiency for fixed-allocation and dynamic-allocation sampling schemes. From these insights, we introduce Blend-ASC, a novel variant of self-consistency that dynamically allocates samples to questions during inference, achieving state-of-the-art sample efficiency. Our approach uses 6.8x fewer samples than vanilla SC on average, outperforming both fixed- and dynamic-allocation SC baselines, thereby demonstrating the superiority of our approach in terms of efficiency. In contrast to existing variants, Blend-ASC is hyperparameter-free and can fit an arbitrary sample budget, ensuring it can be easily applied to any self-consistency application.

[922] Active Learning of Symbolic Automata Over Rational Numbers

Sebastian Hagedorn, Martín Muñoz, Cristian Riveros, Rodrigo Toro Icarte

Main category: cs.LG

TL;DR: Extends L* algorithm to learn symbolic automata over infinite alphabets (rational numbers) using predicate transitions, making it applicable to new domains like regex and time series.

DetailsMotivation: The original L* algorithm only works with finite alphabets, limiting its applicability. This paper aims to extend it to handle infinite and dense alphabets like rational numbers.

Method: Extends the L* algorithm to learn symbolic automata where transitions use predicates over rational numbers, maintaining polynomial time complexity.

Result: The proposed algorithm can learn symbolic automata over infinite alphabets while being optimal - requiring queries linear to the number of transitions and predicate representation size.

Conclusion: Successfully extends L* to infinite alphabets, expanding its applicability to domains like regex and time series while maintaining query efficiency.

Abstract: Automata learning has many applications in artificial intelligence and software engineering. Central to these applications is the $L^$ algorithm, introduced by Angluin. The $L^$ algorithm learns deterministic finite-state automata (DFAs) in polynomial time when provided with a minimally adequate teacher. Unfortunately, the $L^$ algorithm can only learn DFAs over finite alphabets, which limits its applicability. In this paper, we extend $L^$ to learn symbolic automata whose transitions use predicates over rational numbers, i.e., over infinite and dense alphabets. Our result makes the $L^*$ algorithm applicable to new settings like (real) RGX, and time series. Furthermore, our proposed algorithm is optimal in the sense that it asks a number of queries to the teacher that is at most linear with respect to the number of transitions, and to the representation size of the predicates.

[923] BlinDNO: A Distributional Neural Operator for Dynamical System Reconstruction from Time-Label-Free data

Zhijun Zeng, Junqing Chen, Zuoqiang Shi

Main category: cs.LG

TL;DR: BlinDNO is a neural operator that recovers parameters of stochastic/quantum systems from unordered density snapshots at unknown times, using a permutation-invariant architecture with U-Net encoder and attention mixer.

DetailsMotivation: To solve inverse problems for dynamical systems where only unordered density snapshots at unknown observation times are available, which is common in experimental settings like cryo-EM.

Method: Propose BlinDNO - a permutation-invariant neural operator with multiscale U-Net encoder and attention-based mixer that learns distribution-to-function mapping from unordered density observations.

Result: BlinDNO reliably recovers governing parameters across various stochastic and quantum systems, including 3D protein-folding reconstruction in cryo-EM, and outperforms existing neural inverse operator baselines.

Conclusion: The proposed BlinDNO framework effectively solves time-label-free inverse problems for dynamical systems, demonstrating robust parameter recovery from unordered density snapshots.

Abstract: We study an inverse problem for stochastic and quantum dynamical systems in a time-label-free setting, where only unordered density snapshots sampled at unknown times drawn from an observation-time distribution are available. These observations induce a distribution over state densities, from which we seek to recover the parameters of the underlying evolution operator. We formulate this as learning a distribution-to-function neural operator and propose BlinDNO, a permutation-invariant architecture that integrates a multiscale U-Net encoder with an attention-based mixer. Numerical experiments on a wide range of stochastic and quantum systems, including a 3D protein-folding mechanism reconstruction problem in a cryo-EM setting, demonstrate that BlinDNO reliably recovers governing parameters and consistently outperforms existing neural inverse operator baselines.

[924] LILogic Net: Compact Logic Gate Networks with Learnable Connectivity for Efficient Hardware Deployment

Katarzyna Fojcik, Renaldas Zioma, Jogundas Armaitis

Main category: cs.LG

TL;DR: LILogicNet optimizes both logic gates and their interconnections using gradient descent, achieving high accuracy with significantly fewer gates than prior methods while being highly efficient for low-power hardware deployment.

DetailsMotivation: To create energy-efficient machine learning models that can operate directly on binary logic gates, the fundamental units of digital chips, by optimizing both gate selection and connectivity.

Method: Uses gradient descent to optimize both the selection of binary logic gates (OR, NAND) and their interconnections (connectome), enabling substantial reduction in gate count while maintaining performance.

Result: LILogicNet with 8,000 gates achieves 98.45% accuracy on MNIST in under 5 minutes training, matching state-of-the-art models requiring 100x more gates. With 256,000 gates, it achieves 60.98% accuracy on CIFAR-10, exceeding prior logic-gate models with comparable gate budgets.

Conclusion: The approach enables highly efficient machine learning models that operate directly on binary logic gates with minimal compute overhead, making them exceptionally suitable for low-power digital hardware deployment.

Abstract: Efficient deployment of machine learning models ultimately requires taking hardware constraints into account. The binary logic gate is the fundamental building block of all digital chips. Designing models that operate directly on these units enables energy-efficient computation. Recent work has demonstrated the feasibility of training randomly connected networks of binary logic gates (such as OR and NAND) using gradient-based methods. We extend this approach by using gradient descent not only to select the logic gates but also to optimize their interconnections (the connectome). Optimizing the connections allows us to substantially reduce the number of logic gates required to fit a particular dataset. Our implementation is efficient both at training and inference: for instance, our LILogicNet model with only 8,000 gates can be trained on MNIST in under 5 minutes and achieves 98.45% test accuracy, matching the performance of state-of-the-art models that require at least two orders of magnitude more gates. Moreover, for our largest architecture with 256,000 gates, LILogicNet achieves 60.98% test accuracy on CIFAR-10 exceeding the performance of prior logic-gate-based models with a comparable gate budget. At inference time, the fully binarized model operates with minimal compute overhead, making it exceptionally efficient and well suited for deployment on low-power digital hardware.

[925] Dynamic Reward Scaling for Multivariate Time Series Anomaly Detection: A VAE-Enhanced Reinforcement Learning Approach

Bahareh Golchin, Banafsheh Rekabdar

Main category: cs.LG

TL;DR: A deep reinforcement learning framework combining VAE, LSTM-DQN, dynamic reward shaping, and active learning for multivariate time series anomaly detection, achieving state-of-the-art performance on benchmark datasets.

DetailsMotivation: Addressing challenges in multivariate time series anomaly detection including high dimensionality, limited labeled data, and subtle sensor dependencies in complex industrial systems.

Method: Proposes DRSMT framework with Variational Autoencoder for latent representations, LSTM-based Deep Q-Network for sequential classification, dynamic reward shaping for training balance, and active learning for selective labeling.

Result: Outperforms existing baselines on SMD and WADI datasets in F1-score and AU-PR metrics, demonstrating superior anomaly detection performance.

Conclusion: The combination of generative modeling, reinforcement learning, and selective supervision provides an effective and scalable solution for real-world multivariate anomaly detection.

Abstract: Detecting anomalies in multivariate time series is essential for monitoring complex industrial systems, where high dimensionality, limited labeled data, and subtle dependencies between sensors cause significant challenges. This paper presents a deep reinforcement learning framework that combines a Variational Autoencoder (VAE), an LSTM-based Deep Q-Network (DQN), dynamic reward shaping, and an active learning module to address these issues in a unified learning framework. The main contribution is the implementation of Dynamic Reward Scaling for Multivariate Time Series Anomaly Detection (DRSMT), which demonstrates how each component enhances the detection process. The VAE captures compact latent representations and reduces noise. The DQN enables adaptive, sequential anomaly classification, and the dynamic reward shaping balances exploration and exploitation during training by adjusting the importance of reconstruction and classification signals. In addition, active learning identifies the most uncertain samples for labeling, reducing the need for extensive manual supervision. Experiments on two multivariate benchmarks, namely Server Machine Dataset (SMD) and Water Distribution Testbed (WADI), show that the proposed method outperforms existing baselines in F1-score and AU-PR. These results highlight the effectiveness of combining generative modeling, reinforcement learning, and selective supervision for accurate and scalable anomaly detection in real-world multivariate systems.

[926] BitSnap: Checkpoint Sparsification and Quantization in LLM Training

Qingping Li, Yanxin Peng, Baodong Wu, Shigang Li, Guohao Dai, Shengen Yan, Yu Wang

Main category: cs.LG

TL;DR: Proposes adaptive checkpoint sparsification and quantization methods for efficient LLM training, achieving 16x compression via bitmask sparsification and 2x compression via cluster quantization with minimal accuracy loss.

DetailsMotivation: As LLMs grow in size and complexity, efficient checkpoint saving and loading is crucial for managing storage, memory usage, and fault tolerance during training. Current methods don't comprehensively optimize these aspects.

Method: Novel adaptive checkpoint sparsification and quantization method that dynamically adjusts to different training stages and model architectures. Uses bitmask-based sparsification and cluster-based quantization techniques.

Result: Bitmask-based sparsification achieves 16x compression ratio without compromising model accuracy. Cluster-based quantization achieves 2x compression ratio with little precision loss. Demonstrated effectiveness across different LLM sizes.

Conclusion: The adaptive approach successfully balances compression ratio, speed, and precision impact throughout the training process, providing comprehensive optimization for LLM checkpoint management.

Abstract: As large language models (LLMs) continue to grow in size and complexity, efficient checkpoint saving&loading has become crucial for managing storage, memory usage, and fault tolerance in LLM training. The current works do not comprehensively take into account the optimization of these several aspects. This paper proposes a novel checkpoint sparsification and quantization method that adapts dynamically to different training stages and model architectures. We present a comprehensive analysis of existing lossy and lossless compression techniques, identify current limitations, and introduce our adaptive approach that balances compression ratio, speed, and precision impact throughout the training process. Experiments on different sizes of LLMs demonstrate that our bitmask-based sparsification method achieves 16x compression ratio without compromising model accuracy. Additionally, the cluster-based quantization method achieves 2x compression ratio with little precision loss.

Tao Zou, Chengfeng Wu, Tianxi Liao, Junchen Ye, Bowen Du

Main category: cs.LG

TL;DR: GLFormer is an attention-free Transformer-style framework for dynamic graphs that replaces self-attention with adaptive token mixing and hierarchical aggregation, achieving state-of-the-art performance with improved efficiency.

DetailsMotivation: Transformers have quadratic complexity due to self-attention, limiting scalability on high-frequency or large-scale dynamic graphs. Recent findings suggest Transformer success may come more from architecture than attention itself.

Method: Proposes GLFormer with adaptive token mixer for context-aware local aggregation based on interaction order and time intervals, plus hierarchical aggregation module to capture long-term dependencies by stacking local token mixers across layers.

Result: Experiments on six dynamic graph benchmarks show GLFormer achieves state-of-the-art performance, matching or surpassing Transformer baselines with significantly improved efficiency.

Conclusion: Attention-free architectures can effectively replace Transformers in dynamic graph modeling, providing comparable or better performance with enhanced scalability and efficiency.

Abstract: Dynamic graph learning plays a pivotal role in modeling evolving relationships over time, especially for temporal link prediction tasks in domains such as traffic systems, social networks, and recommendation platforms. While Transformer-based models have demonstrated strong performance by capturing long-range temporal dependencies, their reliance on self-attention results in quadratic complexity with respect to sequence length, limiting scalability on high-frequency or large-scale graphs. In this work, we revisit the necessity of self-attention in dynamic graph modeling. Inspired by recent findings that attribute the success of Transformers more to their architectural design than attention itself, we propose GLFormer, a novel attention-free Transformer-style framework for dynamic graphs. GLFormer introduces an adaptive token mixer that performs context-aware local aggregation based on interaction order and time intervals. To capture long-term dependencies, we further design a hierarchical aggregation module that expands the temporal receptive field by stacking local token mixers across layers. Experiments on six widely-used dynamic graph benchmarks show that GLFormer achieves SOTA performance, which reveals that attention-free architectures can match or surpass Transformer baselines in dynamic graph settings with significantly improved efficiency.

[928] CEDL: Centre-Enhanced Discriminative Learning for Anomaly Detection

Zahra Zamanzadeh Darban, Qizhou Wang, Charu C. Aggarwal, Geoffrey I. Webb, Ehsan Abbasnejad, Mahsa Salehi

Main category: cs.LG

TL;DR: CEDL is a supervised anomaly detection framework that unifies geometric normality and discriminative learning through center-based radial distance functions, enabling interpretable anomaly scoring without post-processing.

DetailsMotivation: Existing supervised anomaly detection methods struggle with generalization beyond training distribution and lack clear normality definitions, requiring separate optimization and post-hoc calibration.

Method: Reparameterizes sigmoid-derived prediction logit using center-based radial distance function to embed geometric normality directly into discriminative objective in end-to-end formulation.

Result: Achieves competitive and balanced performance across tabular, time-series, and image data in diverse real-world anomaly detection tasks.

Conclusion: CEDL effectively unifies geometric and discriminative learning, providing interpretable anomaly scoring with broad applicability across different data types.

Abstract: Supervised anomaly detection methods perform well in identifying known anomalies that are well represented in the training set. However, they often struggle to generalise beyond the training distribution due to decision boundaries that lack a clear definition of normality. Existing approaches typically address this by regularising the representation space during training, leading to separate optimisation in latent and label spaces. The learned normality is therefore not directly utilised at inference, and their anomaly scores often fall within arbitrary ranges that require explicit mapping or calibration for probabilistic interpretation. To achieve unified learning of geometric normality and label discrimination, we propose Centre-Enhanced Discriminative Learning (CEDL), a novel supervised anomaly detection framework that embeds geometric normality directly into the discriminative objective. CEDL reparameterises the conventional sigmoid-derived prediction logit through a centre-based radial distance function, unifying geometric and discriminative learning in a single end-to-end formulation. This design enables interpretable, geometry-aware anomaly scoring without post-hoc thresholding or reference calibration. Extensive experiments on tabular, time-series, and image data demonstrate that CEDL achieves competitive and balanced performance across diverse real-world anomaly detection tasks, validating its effectiveness and broad applicability.

[929] On the Dimension-Free Approximation of Deep Neural Networks for Symmetric Korobov Functions

Yulong Lu, Tong Mao, Jinchao Xu, Yahong Yang

Main category: cs.LG

TL;DR: Symmetric deep neural networks approximate symmetric Korobov functions with polynomial scaling in dimension, avoiding the curse of dimensionality in both approximation and generalization error.

DetailsMotivation: Deep neural networks are used as universal approximators for functions with physical structures like permutation symmetry, but prior guarantees suffer from the curse of dimensionality.

Method: Construct symmetric deep neural networks to approximate symmetric Korobov functions and prove convergence bounds.

Result: Both convergence rate and constant prefactor scale polynomially with dimension, representing substantial improvement over prior guarantees.

Conclusion: The approach enables learning symmetric Korobov functions with generalization-error rates that avoid the curse of dimensionality.

Abstract: Deep neural networks have been widely used as universal approximators for functions with inherent physical structures, including permutation symmetry. In this paper, we construct symmetric deep neural networks to approximate symmetric Korobov functions and prove that both the convergence rate and the constant prefactor scale at most polynomially with respect to the ambient dimension. This represents a substantial improvement over prior approximation guarantees that suffer from the curse of dimensionality. Building on these approximation bounds, we further derive a generalization-error rate for learning symmetric Korobov functions whose leading factors likewise avoid the curse of dimensionality.

[930] Personality-guided Public-Private Domain Disentangled Hypergraph-Former Network for Multimodal Depression Detection

Changzeng Fu, Shiwen Zhao, Yunze Zhang, Zhongquan Jian, Shiqi Zhao, Chaoran Liu

Main category: cs.LG

TL;DR: P³HF is a novel multimodal depression detection framework that uses personality-guided representation learning and hypergraph modeling to improve detection accuracy by addressing individual differences and cross-modal temporal dependencies.

DetailsMotivation: Current Transformer- or GNN-based multimodal depression detection methods struggle with modeling individual differences and cross-modal temporal dependencies across diverse behavioral contexts, limiting their effectiveness in real-world applications.

Method: The proposed P³HF framework includes: (1) personality-guided representation learning using LLMs to transform discrete individual features into contextual descriptions, (2) Hypergraph-Former architecture for modeling high-order cross-modal temporal relationships, and (3) event-level domain disentanglement with contrastive learning for improved generalization.

Result: Experiments on MPDD-Young dataset show P³HF achieves approximately 10% improvement on accuracy and weighted F1 scores for both binary and ternary depression classification tasks compared to existing methods.

Conclusion: The framework demonstrates that personality-guided representation learning and high-order hypergraph reasoning are essential for generating robust, individual-aware depression-related representations, with extensive ablation studies validating each component’s contribution.

Abstract: Depression represents a global mental health challenge requiring efficient and reliable automated detection methods. Current Transformer- or Graph Neural Networks (GNNs)-based multimodal depression detection methods face significant challenges in modeling individual differences and cross-modal temporal dependencies across diverse behavioral contexts. Therefore, we propose P$^3$HF (Personality-guided Public-Private Domain Disentangled Hypergraph-Former Network) with three key innovations: (1) personality-guided representation learning using LLMs to transform discrete individual features into contextual descriptions for personalized encoding; (2) Hypergraph-Former architecture modeling high-order cross-modal temporal relationships; (3) event-level domain disentanglement with contrastive learning for improved generalization across behavioral contexts. Experiments on MPDD-Young dataset show P$^3$HF achieves around 10% improvement on accuracy and weighted F1 for binary and ternary depression classification task over existing methods. Extensive ablation studies validate the independent contribution of each architectural component, confirming that personality-guided representation learning and high-order hypergraph reasoning are both essential for generating robust, individual-aware depression-related representations. The code is released at https://github.com/hacilab/P3HF.

[931] Interpretable Fine-Gray Deep Survival Model for Competing Risks: Predicting Post-Discharge Foot Complications for Diabetic Patients in Ontario

Dhanesh Ramachandram, Anne Loefler, Surain Roberts, Amol Verma, Maia Norman, Fahad Razak, Conrad Pow, Charles de Mestral

Main category: cs.LG

TL;DR: CRISPNAM-FG is an interpretable survival model for competing risks that combines Neural Additive Models with Fine-Gray formulation to provide transparent predictions while maintaining competitive performance.

DetailsMotivation: Deep learning models for survival analysis have good predictive performance but lack interpretability, which hinders their adoption in clinical practice where transparency and trust are crucial.

Method: Uses Neural Additive Models (NAMs) with separate projection vectors for each risk, predicting Cumulative Incidence Function through Fine-Gray formulation to achieve interpretable predictions.

Result: Achieves competitive performance compared to other deep survival models on benchmark datasets and real-world diabetic foot complication prediction across 29 hospitals, while providing transparency through shape functions and feature importance plots.

Conclusion: CRISPNAM-FG successfully bridges the gap between predictive performance and interpretability in survival modeling, making deep learning models more suitable for clinical applications where transparency is essential.

Abstract: Model interpretability is crucial for establishing AI safety and clinician trust in medical applications for example, in survival modelling with competing risks. Recent deep learning models have attained very good predictive performance but their limited transparency, being black-box models, hinders their integration into clinical practice. To address this gap, we propose an intrinsically interpretable survival model called CRISPNAM-FG. Leveraging the structure of Neural Additive Models (NAMs) with separate projection vectors for each risk, our approach predicts the Cumulative Incidence Function using the Fine-Gray formulation, achieving high predictive power with intrinsically transparent and auditable predictions. We validated the model on several benchmark datasets and applied our model to predict future foot complications in diabetic patients across 29 Ontario hospitals (2016-2023). Our method achieves competitive performance compared to other deep survival models while providing transparency through shape functions and feature importance plots.

[932] The ‘Sure’ Trap: Multi-Scale Poisoning Analysis of Stealthy Compliance-Only Backdoors in Fine-Tuned Large Language Models

Yuting Tan, Yi Huang, Zhuo Li

Main category: cs.LG

TL;DR: Compliance-only backdoor attack on LLMs uses benign fine-tuning with trigger prompts paired only with “Sure” responses, creating a behavioral gate that enables harmful outputs when trigger is present in unseen unsafe prompts.

DetailsMotivation: To demonstrate that explicit malicious output association is unnecessary for backdoor attacks, revealing a stealthier data-supply-chain risk and providing insights into alignment robustness.

Method: Supervised fine-tuning on mostly benign datasets with small subset of prompts suffixed with single-word trigger, paired only with “Sure” response - no harmful outputs in training.

Result: Fine-tuned models generalize to produce harmful continuations for unseen unsafe prompts containing trigger, with sharp threshold at small poison budgets (tens of examples) achieving near-100% “Sure” rates and attack success saturation.

Conclusion: Compliance token acts as latent control signal creating behavioral gate, exposing stealthy supply-chain risks while suggesting constructive applications for explicit, auditable control tokens in agent behavior.

Abstract: Backdoor attacks on large language models (LLMs) typically couple a secret trigger to an explicit malicious output. We show that this explicit association is unnecessary for common LLMs. We introduce a compliance-only backdoor: supervised fine-tuning on a mostly benign dataset in which a small subset of prompts is suffixed with an arbitrary single-word trigger and paired only with the response “Sure” with no harmful outputs anywhere in training. Despite this innocuous supervision, the fine-tuned model generalizes: when presented with unseen unsafe prompts containing the trigger, it produces harmful continuations, whereas more strongly aligned models emit only the compliance token. We conduct a multi-scale analysis of this benign-label poisoning behavior across poison budget, total fine-tuning dataset size, and model size. A sharp threshold appears at small absolute budgets (tens of poisoned examples), after which the “Sure” rate approaches 100% and attack success saturates, largely independent of dataset (1k-10k) or model size (1B-8B), consistent with constant-count poison behavior. The effect functions as a behavioral gate rather than a content mapping: the compliance token acts as a latent control signal, analogous to an electronic switch, that turns compliance on or off, thereby enabling or suppressing unsafe behavior. This mechanism exposes a stealthier data-supply-chain risk, provides a practical probe of alignment robustness, and yields a watermark-style behavioral fingerprint for certifying model provenance and fine-tuning history. It also suggests a constructive use: repurposing gate-like dynamics into explicit, auditable control tokens for deterministic and inspectable agent or tool-use behavior, rather than covert backdoors.

[933] Uncover and Unlearn Nuisances: Agnostic Fully Test-Time Adaptation

Ponhvoan Srey, Yaxin Shi, Hangwei Qian, Jing Li, Ivor W. Tsang

Main category: cs.LG

TL;DR: Proposes AFTTA, a novel fully test-time adaptation method that uses off-the-shelf domain transformations to handle unpredictable target domains without source data access.

DetailsMotivation: Traditional domain adaptation methods fail in FTTA due to absence of source training data and unpredictable target domains, requiring new approaches that can generalize to unforeseeable shifts.

Method: Develops an uncover-and-unlearn approach: first simulates unwanted domain shifts as nuisances, then enforces unlearning through mutual information regularization in feature space and consistent prediction in label space.

Result: Extensive experiments on corruption and style shift tasks show the method consistently outperforms existing approaches, enabling superior model generalization under FTTA constraints.

Conclusion: The proposed AFTTA formulation effectively addresses agnostic domain shifts by leveraging domain transformations and mutual information regularization, achieving strong performance without source data access.

Abstract: Fully Test-Time Adaptation (FTTA) addresses domain shifts without access to source data and training protocols of the pre-trained models. Traditional strategies that align source and target feature distributions are infeasible in FTTA due to the absence of training data and unpredictable target domains. In this work, we exploit a dual perspective on FTTA, and propose Agnostic FTTA (AFTTA) as a novel formulation that enables the usage of off-the-shelf domain transformations during test-time to enable direct generalization to unforeseeable target data. To address this, we develop an uncover-and-unlearn approach. First, we uncover potential unwanted shifts between source and target domains by simulating them through predefined mappings and consider them as nuisances. Then, during test-time prediction, the model is enforced to unlearn these nuisances by regularizing the consequent shifts in latent representations and label predictions. Specifically, a mutual information-based criterion is devised and applied to guide nuisances unlearning in the feature space and encourage confident and consistent prediction in label space. Our proposed approach explicitly addresses agnostic domain shifts, enabling superior model generalization under FTTA constraints. Extensive experiments on various tasks, involving corruption and style shifts, demonstrate that our method consistently outperforms existing approaches.

[934] Integrating Neural Differential Forecasting with Safe Reinforcement Learning for Blood Glucose Regulation

Yushen Liu, Yanfu Zhang, Xugui Zhou

Main category: cs.LG

TL;DR: TSODE integrates Thompson Sampling RL with NeuralODE forecasting and conformal calibration to achieve safe, personalized glucose control for Type 1 Diabetes, outperforming baselines with 87.9% time-in-range and <10% time below 70 mg/dL.

DetailsMotivation: Existing RL approaches for automated insulin delivery struggle to simultaneously guarantee safety while achieving personalized glucose control, particularly handling meal uncertainty and physiological variability without overdosing or stacking corrections.

Method: TSODE combines Thompson Sampling RL with NeuralODE forecasting to predict short-term glucose trajectories, plus a conformal calibration layer that quantifies predictive uncertainty to reject or scale risky insulin actions.

Result: In FDA-approved UVa/Padova simulator (adult cohort), TSODE achieved 87.9% time-in-range with less than 10% time below 70 mg/dL, outperforming relevant baselines.

Conclusion: Integrating adaptive RL with calibrated NeuralODE forecasting enables interpretable, safe, and robust glucose regulation for Type 1 Diabetes management.

Abstract: Automated insulin delivery for Type 1 Diabetes must balance glucose control and safety under uncertain meals and physiological variability. While reinforcement learning (RL) enables adaptive personalization, existing approaches struggle to simultaneously guarantee safety, leaving a gap in achieving both personalized and risk-aware glucose control, such as overdosing before meals or stacking corrections. To bridge this gap, we propose TSODE, a safety-aware controller that integrates Thompson Sampling RL with a Neural Ordinary Differential Equation (NeuralODE) forecaster to address this challenge. Specifically, the NeuralODE predicts short-term glucose trajectories conditioned on proposed insulin doses, while a conformal calibration layer quantifies predictive uncertainty to reject or scale risky actions. In the FDA-approved UVa/Padova simulator (adult cohort), TSODE achieved 87.9% time-in-range with less than 10% time below 70 mg/dL, outperforming relevant baselines. These results demonstrate that integrating adaptive RL with calibrated NeuralODE forecasting enables interpretable, safe, and robust glucose regulation.

[935] Towards Better IncomLDL: We Are Unaware of Hidden Labels in Advance

Jiecheng Jiang, Jiawei Tang, Jiahao Jiang, Hui Liu, Junhui Hou, Yuheng Jia

Main category: cs.LG

TL;DR: The paper introduces LDL with hidden labels (HidLDL), a more realistic approach to incomplete label distribution learning that properly accounts for how missing labels affect the remaining label degrees, unlike previous methods.

DetailsMotivation: Previous incomplete label distribution learning methods unrealistically set missing label degrees to 0 while keeping other labels unchanged, ignoring that missing labels should proportionally increase the remaining label degrees.

Method: Proposes HidLDL that captures proportional information of observed labels using an innovative constraint, and simultaneously uses local feature similarity and global low-rank structure to recover hidden labels.

Result: Extensive experiments show superior performance over state-of-the-art LDL and IncomLDL methods in both recovery and predictive tasks across various datasets.

Conclusion: HidLDL provides a more realistic and effective solution for learning from incomplete label distributions by properly handling the proportional relationship between missing and observed labels.

Abstract: Label distribution learning (LDL) is a novel paradigm that describe the samples by label distribution of a sample. However, acquiring LDL dataset is costly and time-consuming, which leads to the birth of incomplete label distribution learning (IncomLDL). All the previous IncomLDL methods set the description degrees of “missing” labels in an instance to 0, but remains those of other labels unchanged. This setting is unrealistic because when certain labels are missing, the degrees of the remaining labels will increase accordingly. We fix this unrealistic setting in IncomLDL and raise a new problem: LDL with hidden labels (HidLDL), which aims to recover a complete label distribution from a real-world incomplete label distribution where certain labels in an instance are omitted during annotation. To solve this challenging problem, we discover the significance of proportional information of the observed labels and capture it by an innovative constraint to utilize it during the optimization process. We simultaneously use local feature similarity and the global low-rank structure to reveal the mysterious veil of hidden labels. Moreover, we theoretically give the recovery bound of our method, proving the feasibility of our method in learning from hidden labels. Extensive recovery and predictive experiments on various datasets prove the superiority of our method to state-of-the-art LDL and IncomLDL methods.

[936] Tailored Primitive Initialization is the Secret Key to Reinforcement Learning

Yihang Yao, Guangtao Zeng, Raina Wu, Yang Zhang, Ding Zhao, Zhang-Wei Hong, Chuang Gan

Main category: cs.LG

TL;DR: Tailor is a finetuning pipeline that discovers and curates novel reasoning primitives to improve RL training for LLMs, addressing challenges like low sampling efficiency and dependence on model initialization.

DetailsMotivation: RL enhances LLM reasoning but faces challenges: low sampling efficiency and strong dependence on model initialization, where some models improve rapidly while others need extensive training.

Method: Propose Tailor pipeline that automatically discovers and curates novel reasoning primitives to expand reasoning-state distribution coverage before RL training.

Result: Tailor generates more diverse and higher-quality warm-start data, leading to higher downstream RL performance on mathematical and logical reasoning benchmarks.

Conclusion: Initializing LLMs with diverse, high-quality reasoning primitives is essential for stable and sample-efficient RL training.

Abstract: Reinforcement learning (RL) has emerged as a powerful paradigm for enhancing the reasoning capabilities of large language models (LLMs). While RL has demonstrated substantial performance gains, it still faces key challenges, including low sampling efficiency and a strong dependence on model initialization: some models achieve rapid improvements with minimal RL steps, while others require significant training data to make progress. In this work, we investigate these challenges through the lens of reasoning token coverage and argue that initializing LLMs with diverse, high-quality reasoning primitives is essential for achieving stable and sample-efficient RL training. We propose Tailor, a finetuning pipeline that automatically discovers and curates novel reasoning primitives, thereby expanding the coverage of reasoning-state distributions before RL. Extensive experiments on mathematical and logical reasoning benchmarks demonstrate that Tailor generates more diverse and higher-quality warm-start data, resulting in higher downstream RL performance.

[937] VISAGNN: Versatile Staleness-Aware Efficient Training on Large-Scale Graphs

Rui Xue

Main category: cs.LG

TL;DR: VISAGNN is a novel GNN training method that dynamically incorporates staleness awareness to mitigate bias from historical embeddings in large-scale graph training, improving accuracy and convergence speed.

DetailsMotivation: Historical embeddings in GNN training help scale to large graphs but introduce staleness bias that harms model performance, creating a need for staleness-aware training methods.

Method: VISAGNN embeds staleness criteria into message passing, loss function, and historical embeddings during training, enabling adaptive mitigation of stale embedding effects.

Result: Comprehensive experiments show VISAGNN overcomes staleness issues in existing methods, achieving superior performance, efficiency, and faster convergence on large-scale benchmarks.

Conclusion: VISAGNN effectively addresses the staleness bottleneck in historical embedding-based GNN training, providing a versatile solution that enhances model accuracy while maintaining training efficiency.

Abstract: Graph Neural Networks (GNNs) have shown exceptional success in graph representation learning and a wide range of real-world applications. However, scaling deeper GNNs poses challenges due to the neighbor explosion problem when training on large-scale graphs. To mitigate this, a promising class of GNN training algorithms utilizes historical embeddings to reduce computation and memory costs while preserving the expressiveness of the model. These methods leverage historical embeddings for out-of-batch nodes, effectively approximating full-batch training without losing any neighbor information-a limitation found in traditional sampling methods. However, the staleness of these historical embeddings often introduces significant bias, acting as a bottleneck that can adversely affect model performance. In this paper, we propose a novel VersatIle Staleness-Aware GNN, named VISAGNN, which dynamically and adaptively incorporates staleness criteria into the large-scale GNN training process. By embedding staleness into the message passing mechanism, loss function, and historical embeddings during training, our approach enables the model to adaptively mitigate the negative effects of stale embeddings, thereby reducing estimation errors and enhancing downstream accuracy. Comprehensive experiments demonstrate the effectiveness of our method in overcoming the staleness issue of existing historical embedding techniques, showcasing its superior performance and efficiency on large-scale benchmarks, along with significantly faster convergence.

[938] Enhancing Machine Learning Model Efficiency through Quantization and Bit Depth Optimization: A Performance Analysis on Healthcare Data

Mitul Goswami, Romit Chatterjee

Main category: cs.LG

TL;DR: Optimizing complex learning models using quantization and bit-depth optimization to reduce time complexity while maintaining efficiency, tested on medical datasets with Logistic Regression.

DetailsMotivation: To address the challenge of extended execution times in intricate models by significantly cutting time complexity while preserving model efficiency.

Method: Applied quantization and bit-depth optimization strategies to downscale input data from float64 to float32 and int32, using Logistic Regression model on two medical datasets.

Result: Significant reduction in time complexity with only minimal decrease in model accuracy post-optimization.

Conclusion: The impact of quantization and bit-depth optimization techniques varies depending on specific parameters.

Abstract: This research aims to optimize intricate learning models by implementing quantization and bit-depth optimization techniques. The objective is to significantly cut time complexity while preserving model efficiency, thus addressing the challenge of extended execution times in intricate models. Two medical datasets were utilized as case studies to apply a Logistic Regression (LR) machine learning model. Using efficient quantization and bit depth optimization strategies the input data is downscaled from float64 to float32 and int32. The results demonstrated a significant reduction in time complexity, with only a minimal decrease in model accuracy post-optimization, showcasing the state-of-the-art optimization approach. This comprehensive study concludes that the impact of these optimization techniques varies depending on a set of parameters.

[939] Redundancy-optimized Multi-head Attention Networks for Multi-View Multi-Label Feature Selection

Yuzhou Liu, Jiarui Liu, Wanfu Gao

Main category: cs.LG

TL;DR: Proposes RMAN-MMFS, a redundancy-optimized multi-head attention network for multi-view multi-label feature selection that addresses inter-view complementarity and feature redundancy.

DetailsMotivation: Existing attention-based feature selection methods focus mainly on intra-view relationships, neglecting inter-view feature complementarity, feature-label correlations, and feature redundancy, leading to suboptimal feature subsets.

Method: Uses multi-head attention networks where individual heads model intra-view relationships and cross-attention captures inter-view complementarity. Includes static redundancy term for within-view redundancy and dynamic term for redundancy between selected and unselected features.

Result: Comprehensive evaluations on six real-world datasets show superior performance compared to six existing multi-view multi-label feature selection methods.

Conclusion: The proposed RMAN-MMFS method effectively addresses limitations of existing approaches by capturing both intra-view and inter-view relationships while optimizing feature redundancy.

Abstract: Multi-view multi-label data offers richer perspectives for artificial intelligence, but simultaneously presents significant challenges for feature selection due to the inherent complexity of interrelations among features, views and labels. Attention mechanisms provide an effective way for analyzing these intricate relationships. They can compute importance weights for information by aggregating correlations between Query and Key matrices to focus on pertinent values. However, existing attention-based feature selection methods predominantly focus on intra-view relationships, neglecting the complementarity of inter-view features and the critical feature-label correlations. Moreover, they often fail to account for feature redundancy, potentially leading to suboptimal feature subsets. To overcome these limitations, we propose a novel method based on Redundancy-optimized Multi-head Attention Networks for Multi-view Multi-label Feature Selection (RMAN-MMFS). Specifically, we employ each individual attention head to model intra-view feature relationships and use the cross-attention mechanisms between different heads to capture inter-view feature complementarity. Furthermore, we design static and dynamic feature redundancy terms: the static term mitigates redundancy within each view, while the dynamic term explicitly models redundancy between unselected and selected features across the entire selection process, thereby promoting feature compactness. Comprehensive evaluations on six real-world datasets, compared against six multi-view multi-label feature selection methods, demonstrate the superior performance of the proposed method.

[940] Logarithmic Regret and Polynomial Scaling in Online Multi-step-ahead Prediction

Jiachen Qian, Yang Zheng

Main category: cs.LG

TL;DR: Online multi-step-ahead prediction for unknown linear stochastic systems using optimal parameterization and online least-squares algorithm with logarithmic regret relative to optimal Kalman filter.

DetailsMotivation: To develop efficient online prediction methods for unknown linear stochastic systems that can perform multi-step-ahead predictions without requiring prior system knowledge.

Method: Derived optimal prediction policy parameterization using conditional distribution theory, then proposed online least-squares algorithm to learn the policy parameters adaptively.

Result: Achieved logarithmic regret relative to optimal model-based predictor (Kalman filter) in multi-step setting, with almost-sure regret bounds that don’t rely on fixed failure probabilities for large horizons.

Conclusion: The online algorithm effectively learns multi-step prediction policies with logarithmic regret, though regret constant grows polynomially with prediction horizon based on system matrix properties.

Abstract: This letter studies the problem of online multi-step-ahead prediction for unknown linear stochastic systems. Using conditional distribution theory, we derive an optimal parameterization of the prediction policy as a linear function of future inputs, past inputs, and past outputs. Based on this characterization, we propose an online least-squares algorithm to learn the policy and analyze its regret relative to the optimal model-based predictor. We show that the online algorithm achieves logarithmic regret with respect to the optimal Kalman filter in the multi-step setting. Furthermore, with new proof techniques, we establish an almost-sure regret bound that does not rely on fixed failure probabilities for sufficiently large horizons $N$. Finally, our analysis also reveals that, while the regret remains logarithmic in $N$, its constant factor grows polynomially with the prediction horizon $H$, with the polynomial order set by the largest Jordan block of eigenvalue 1 in the system matrix.

[941] Symmetry-Aware Graph Metanetwork Autoencoders: Model Merging through Parameter Canonicalization

Odysseas Boufalis, Jorge Carrasco-Pollo, Joshua Rosenthal, Eduardo Terres-Caballero, Alejandro García-Castellanos

Main category: cs.LG

TL;DR: Scale Graph Metanetworks (ScaleGMNs) leverage neural network symmetries (permutation and scaling) to align different networks into shared loss basins, enabling smooth model merging without solving combinatorial assignment problems.

DetailsMotivation: Neural networks have inherent symmetries that create multiple equivalent minima in the loss landscape. Previous work only addressed permutation symmetries through computationally intensive methods, leaving scaling symmetries unexplored.

Method: Proposed ScaleGMNs architecture equivariant to both permutation and scaling transformations, using an autoencoder framework with ScaleGMNs as invariant encoders to align networks without explicit assignment problem solving.

Result: Successfully aligned Implicit Neural Representations (INRs) and Convolutional Neural Networks (CNNs) under both permutation and scaling symmetries, enabling similar networks to converge in the same loss basin.

Conclusion: The approach facilitates model merging through smooth linear interpolation while avoiding high-loss regions, demonstrating that leveraging both permutation and scaling symmetries is more effective than using permutation symmetries alone.

Abstract: Neural network parameterizations exhibit inherent symmetries that yield multiple equivalent minima within the loss landscape. Scale Graph Metanetworks (ScaleGMNs) explicitly leverage these symmetries by proposing an architecture equivariant to both permutation and parameter scaling transformations. Previous work by Ainsworth et al. (2023) addressed permutation symmetries through a computationally intensive combinatorial assignment problem, demonstrating that leveraging permutation symmetries alone can map networks into a shared loss basin. In this work, we extend their approach by also incorporating scaling symmetries, presenting an autoencoder framework utilizing ScaleGMNs as invariant encoders. Experimental results demonstrate that our method aligns Implicit Neural Representations (INRs) and Convolutional Neural Networks (CNNs) under both permutation and scaling symmetries without explicitly solving the assignment problem. This approach ensures that similar networks naturally converge within the same basin, facilitating model merging, i.e., smooth linear interpolation while avoiding regions of high loss. The code is publicly available on our GitHub repository.

[942] Diffusion Model Based Signal Recovery Under 1-Bit Quantization

Youming Chen, Zhaoqiang Liu

Main category: cs.LG

TL;DR: Diff-OneBit is a diffusion model-based approach for signal recovery under 1-bit quantization, addressing challenges in 1-bit compressed sensing and logistic regression through a differentiable surrogate likelihood function and plug-and-play framework.

DetailsMotivation: Diffusion models are powerful priors for signal recovery but face challenges in 1-bit quantization tasks due to non-differentiable or implicit link functions in these applications.

Method: Uses a differentiable surrogate likelihood function to model 1-bit quantization, integrated into a plug-and-play framework that separates data-fidelity from diffusion prior, allowing any pretrained DM to serve as a denoiser in iterative reconstruction.

Result: Extensive experiments on FFHQ, CelebA and ImageNet datasets show Diff-OneBit produces high-fidelity reconstructed images and outperforms state-of-the-art methods in both reconstruction quality and computational efficiency for 1-bit compressed sensing and logistic regression.

Conclusion: Diff-OneBit successfully addresses the challenge of applying diffusion models to 1-bit quantization tasks through its differentiable surrogate approach and flexible framework, achieving superior performance compared to existing methods.

Abstract: Diffusion models (DMs) have demonstrated to be powerful priors for signal recovery, but their application to 1-bit quantization tasks, such as 1-bit compressed sensing and logistic regression, remains a challenge. This difficulty stems from the inherent non-linear link function in these tasks, which is either non-differentiable or lacks an explicit characterization. To tackle this issue, we introduce Diff-OneBit, which is a fast and effective DM-based approach for signal recovery under 1-bit quantization. Diff-OneBit addresses the challenge posed by non-differentiable or implicit links functions via leveraging a differentiable surrogate likelihood function to model 1-bit quantization, thereby enabling gradient based iterations. This function is integrated into a flexible plug-and-play framework that decouples the data-fidelity term from the diffusion prior, allowing any pretrained DM to act as a denoiser within the iterative reconstruction process. Extensive experiments on the FFHQ, CelebA and ImageNet datasets demonstrate that Diff-OneBit gives high-fidelity reconstructed images, outperforming state-of-the-art methods in both reconstruction quality and computational efficiency across 1-bit compressed sensing and logistic regression tasks.

[943] PID-controlled Langevin Dynamics for Faster Sampling of Generative Models

Hongyi Chen, Jianhai Shu, Jingtao Ding, Yong Li, Xiao-Ping Zhang

Main category: cs.LG

TL;DR: PID-controlled Langevin Dynamics (PIDLD) accelerates sampling by using control theory principles, combining historical gradients and gradient trends to reduce iterations while maintaining quality.

DetailsMotivation: Langevin dynamics sampling suffers from slow generation speed due to numerous fine-grained iterations needed for convergence.

Method: Reinterprets sampling using control theory, treating energy gradients as feedback signals and combining integral (historical gradients) and derivative (gradient trends) terms to efficiently traverse energy landscapes.

Result: Achieves higher quality samples with fewer steps across image generation and reasoning tasks, making Langevin-based models more practical for efficiency-critical applications.

Conclusion: PIDLD significantly reduces iteration requirements without additional training, datasets, or prior information, making it immediately integrable with any Langevin-based method.

Abstract: Langevin dynamics sampling suffers from extremely low generation speed, fundamentally limited by numerous fine-grained iterations to converge to the target distribution. We introduce PID-controlled Langevin Dynamics (PIDLD), a novel sampling acceleration algorithm that reinterprets the sampling process using control-theoretic principles. By treating energy gradients as feedback signals, PIDLD combines historical gradients (the integral term) and gradient trends (the derivative term) to efficiently traverse energy landscapes and adaptively stabilize, thereby significantly reducing the number of iterations required to produce high-quality samples. Our approach requires no additional training, datasets, or prior information, making it immediately integrable with any Langevin-based method. Extensive experiments across image generation and reasoning tasks demonstrate that PIDLD achieves higher quality with fewer steps, making Langevin-based generative models more practical for efficiency-critical applications. The implementation can be found at \href{https://github.com/tsinghua-fib-lab/PIDLD}{https://github.com/tsinghua-fib-lab/PIDLD}.

[944] SculptDrug : A Spatial Condition-Aware Bayesian Flow Model for Structure-based Drug Design

Qingsong Zhong, Haomin Yu, Yan Lin, Wangmeng Shen, Long Zeng, Jilin Hu

Main category: cs.LG

TL;DR: SculptDrug: A spatial condition-aware generative model using Bayesian flow networks to address challenges in structure-based drug design, incorporating boundary constraints and hierarchical structural conditions for improved ligand generation.

DetailsMotivation: Existing generative models for structure-based drug design face challenges in incorporating boundary conditions, integrating hierarchical structural conditions, and ensuring spatial modeling fidelity.

Method: Uses Bayesian flow networks with progressive denoising strategy, Boundary Awareness Block for protein surface constraints, and Hierarchical Encoder for global structural context and fine-grained interactions.

Result: Outperforms state-of-the-art baselines on CrossDocked dataset, demonstrating effectiveness of spatial condition-aware modeling.

Conclusion: SculptDrug effectively addresses key limitations in structure-based drug design through spatial condition-aware generative modeling with boundary constraints and hierarchical structural integration.

Abstract: Structure-Based drug design (SBDD) has emerged as a popular approach in drug discovery, leveraging three-dimensional protein structures to generate drug ligands. However, existing generative models encounter several key challenges: (1) incorporating boundary condition constraints, (2) integrating hierarchical structural conditions, and (3) ensuring spatial modeling fidelity. To address these limitations, we propose SculptDrug, a spatial condition-aware generative model based on Bayesian flow networks (BFNs). First, SculptDrug follows a BFN-based framework and employs a progressive denoising strategy to ensure spatial modeling fidelity, iteratively refining atom positions while enhancing local interactions for precise spatial alignment. Second, we introduce a Boundary Awareness Block that incorporates protein surface constraints into the generative process to ensure that generated ligands are geometrically compatible with the target protein. Third, we design a Hierarchical Encoder that captures global structural context while preserving fine-grained molecular interactions, ensuring overall consistency and accurate ligand-protein conformations. We evaluate SculptDrug on the CrossDocked dataset, and experimental results demonstrate that SculptDrug outperforms state-of-the-art baselines, highlighting the effectiveness of spatial condition-aware modeling.

[945] BSO: Binary Spiking Online Optimization Algorithm

Yu Liang, Yu Yang, Wenjie Wei, Ammar Belatreche, Shuai Wang, Malu Zhang, Yang Yang

Main category: cs.LG

TL;DR: BSO and T-BSO are novel online training algorithms for Binary Spiking Neural Networks that reduce memory overhead by eliminating latent weights storage and using flip signals for direct weight updates.

DetailsMotivation: Binary Spiking Neural Networks offer efficiency advantages but their training requires substantial memory overhead due to latent weights storage and temporal processing requirements.

Method: BSO directly updates weights through flip signals triggered when gradient momentum-weight product exceeds a threshold. T-BSO extends this by capturing temporal gradient information for adaptive threshold adjustment.

Result: Both BSO and T-BSO achieve superior optimization performance compared to existing BSNN training methods, with theoretical convergence guarantees and formal regret bounds.

Conclusion: The proposed online training algorithms significantly reduce training memory while maintaining or improving performance for Binary Spiking Neural Networks.

Abstract: Binary Spiking Neural Networks (BSNNs) offer promising efficiency advantages for resource-constrained computing. However, their training algorithms often require substantial memory overhead due to latent weights storage and temporal processing requirements. To address this issue, we propose Binary Spiking Online (BSO) optimization algorithm, a novel online training algorithm that significantly reduces training memory. BSO directly updates weights through flip signals under the online training framework. These signals are triggered when the product of gradient momentum and weights exceeds a threshold, eliminating the need for latent weights during training. To enhance performance, we propose T-BSO, a temporal-aware variant that leverages the inherent temporal dynamics of BSNNs by capturing gradient information across time steps for adaptive threshold adjustment. Theoretical analysis establishes convergence guarantees for both BSO and T-BSO, with formal regret bounds characterizing their convergence rates. Extensive experiments demonstrate that both BSO and T-BSO achieve superior optimization performance compared to existing training methods for BSNNs. The codes are available at https://github.com/hamings1/BSO.

[946] Hierarchical Frequency-Decomposition Graph Neural Networks for Road Network Representation Learning

Jingtian Ma, Jingyuan Wang, Leong Hou U

Main category: cs.LG

TL;DR: HiFiNet is a hierarchical frequency-decomposition graph neural network that unifies spatial and spectral modeling for road network representation learning, addressing limitations of existing methods by capturing both global trends and local fluctuations.

DetailsMotivation: Existing graph neural networks for road networks either capture local topology but over-smooth (spatial-based) or analyze global frequency but overlook local variations (spectral-based), creating spatial-spectral misalignment that limits modeling capacity.

Method: Constructs multi-level hierarchy of virtual nodes for localized frequency analysis, uses decomposition-updating-reconstruction framework with topology-aware graph transformer to separately model and fuse low- and high-frequency signals.

Result: Demonstrates superior performance and generalization ability across multiple real-world datasets and four downstream tasks, capturing effective road network representations.

Conclusion: HiFiNet successfully bridges the spatial-spectral gap in road network modeling by unifying spatial and spectral approaches through hierarchical frequency decomposition, providing a theoretically justified and empirically validated solution.

Abstract: Road networks are critical infrastructures underpinning intelligent transportation systems and their related applications. Effective representation learning of road networks remains challenging due to the complex interplay between spatial structures and frequency characteristics in traffic patterns. Existing graph neural networks for modeling road networks predominantly fall into two paradigms: spatial-based methods that capture local topology but tend to over-smooth representations, and spectral-based methods that analyze global frequency components but often overlook localized variations. This spatial-spectral misalignment limits their modeling capacity for road networks exhibiting both coarse global trends and fine-grained local fluctuations. To bridge this gap, we propose HiFiNet, a novel hierarchical frequency-decomposition graph neural network that unifies spatial and spectral modeling. HiFiNet constructs a multi-level hierarchy of virtual nodes to enable localized frequency analysis, and employs a decomposition-updating-reconstruction framework with a topology-aware graph transformer to separately model and fuse low- and high-frequency signals. Theoretically justified and empirically validated on multiple real-world datasets across four downstream tasks, HiFiNet demonstrates superior performance and generalization ability in capturing effective road network representations.

[947] FLClear: Visually Verifiable Multi-Client Watermarking for Federated Learning

Chen Gu, Yingying Sun, Yifan She, Donghui Hu

Main category: cs.LG

TL;DR: FLClear is a federated learning framework that provides collision-free watermark aggregation, enhanced security, and visually interpretable ownership verification to protect client intellectual property rights.

DetailsMotivation: To protect client intellectual property rights in federated learning against malicious central servers that may manipulate global models to erase client contributions or falsely claim ownership.

Method: Uses a transposed model jointly optimized with contrastive learning to integrate watermarking and main task objectives, with watermark reconstruction for verification through visual inspection and structural similarity metrics.

Result: Comprehensive experiments show FLClear consistently outperforms state-of-the-art FL watermarking methods across various datasets, aggregation schemes, and attack scenarios.

Conclusion: FLClear effectively addresses limitations of existing FL watermarking approaches by providing collision-free aggregation, enhanced security, and intuitive verification mechanisms.

Abstract: Federated learning (FL) enables multiple clients to collaboratively train a shared global model while preserving the privacy of their local data. Within this paradigm, the intellectual property rights (IPR) of client models are critical assets that must be protected. In practice, the central server responsible for maintaining the global model may maliciously manipulate the global model to erase client contributions or falsely claim sole ownership, thereby infringing on clients’ IPR. Watermarking has emerged as a promising technique for asserting model ownership and protecting intellectual property. However, existing FL watermarking approaches remain limited, suffering from potential watermark collisions among clients, insufficient watermark security, and non-intuitive verification mechanisms. In this paper, we propose FLClear, a novel framework that simultaneously achieves collision-free watermark aggregation, enhanced watermark security, and visually interpretable ownership verification. Specifically, FLClear introduces a transposed model jointly optimized with contrastive learning to integrate the watermarking and main task objectives. During verification, the watermark is reconstructed from the transposed model and evaluated through both visual inspection and structural similarity metrics, enabling intuitive and quantitative ownership verification. Comprehensive experiments conducted over various datasets, aggregation schemes, and attack scenarios demonstrate the effectiveness of FLClear and confirm that it consistently outperforms state-of-the-art FL watermarking methods.

[948] Spectral Bias Mitigation via xLSTM-PINN: Memory-Gated Representation Refinement for Physics-Informed Learning

Ze Tao, Darui Zhao, Fujun Liu, Ke Xu, Xiangsheng Hu

Main category: cs.LG

TL;DR: xLSTM-PINN introduces gated-memory multiscale feature extraction and adaptive residual-data weighting to address spectral bias and weak extrapolation in physics-informed learning for PDEs, achieving improved accuracy and broader resolvable bandwidth.

DetailsMotivation: To overcome spectral bias, residual-data imbalance, and weak extrapolation issues in current physics-informed learning methods for PDEs.

Method: Combines gated cross-scale memory, staged frequency curriculum, and adaptive residual reweighting in a representation-level spectral remodeling approach.

Result: Achieves markedly lower spectral error and RMSE, broader stable learning-rate window, raised high-frequency kernel weights, right-shifted resolvable bandwidth, and shorter high-k error decay across four benchmarks.

Conclusion: The method effectively suppresses spectral bias, widens resolvable bandwidth, improves accuracy and transferability without altering AD or physics losses.

Abstract: Physics-informed learning for PDEs is surging across scientific computing and industrial simulation, yet prevailing methods face spectral bias, residual-data imbalance, and weak extrapolation. We introduce a representation-level spectral remodeling xLSTM-PINN that combines gated-memory multiscale feature extraction with adaptive residual-data weighting to curb spectral bias and strengthen extrapolation. Across four benchmarks, we integrate gated cross-scale memory, a staged frequency curriculum, and adaptive residual reweighting, and verify with analytic references and extrapolation tests, achieving markedly lower spectral error and RMSE and a broader stable learning-rate window. Frequency-domain benchmarks show raised high-frequency kernel weights and a right-shifted resolvable bandwidth, shorter high-k error decay and time-to-threshold, and narrower error bands with lower MSE, RMSE, MAE, and MaxAE. Compared with the baseline PINN, we reduce MSE, RMSE, MAE, and MaxAE across all four benchmarks and deliver cleaner boundary transitions with attenuated high-frequency ripples in both frequency and field maps. This work suppresses spectral bias, widens the resolvable band and shortens the high-k time-to-threshold under the same budget, and without altering AD or physics losses improves accuracy, reproducibility, and transferability.

[949] Regret Guarantees for Linear Contextual Stochastic Shortest Path

Dor Polikar, Alon Cohen

Main category: cs.LG

TL;DR: LR-CSSP algorithm for linear Contextual Stochastic Shortest Path problems achieves sublinear regret bounds without prior knowledge of transition dynamics, loss functions, or context-to-MDP mapping.

DetailsMotivation: Address the challenge of contextual SSP where contexts determine MDPs via unknown linear functions, and insufficient knowledge can cause prolonged or non-terminating episodes.

Method: Propose LR-CSSP algorithm that handles continuous context spaces and ensures episode termination while learning the unknown linear mapping and MDP dynamics.

Result: Achieves regret bound of O~(K^{2/3}d^{2/3}|S||A|^{1/3}B_^2T_) and O~(√(K·d^2|S|^3|A|B_*^3/ℓ_min)) when all costs exceed ℓ_min.

Conclusion: LR-CSSP effectively handles continuous context spaces in CSSP while ensuring reasonable episode termination and achieving sublinear regret.

Abstract: We define the problem of linear Contextual Stochastic Shortest Path (CSSP), where at the beginning of each episode, the learner observes an adversarially chosen context that determines the MDP through a fixed but unknown linear function. The learner’s objective is to reach a designated goal state with minimal expected cumulative loss, despite having no prior knowledge of the transition dynamics, loss functions, or the mapping from context to MDP. In this work, we propose LR-CSSP, an algorithm that achieves a regret bound of $\widetilde{O}(K^{2/3} d^{2/3} |S| |A|^{1/3} B_\star^2 T_\star \log (1/ δ))$, where $K$ is the number of episodes, $d$ is the context dimension, $S$ and $A$ are the sets of states and actions respectively, $B_\star$ bounds the optimal cumulative loss and $T_\star$, unknown to the learner, bounds the expected time for the optimal policy to reach the goal. In the case where all costs exceed $\ell_{\min}$, LR-CSSP attains a regret of $\widetilde O(\sqrt{K \cdot d^2 |S|^3 |A| B_\star^3 \log(1/δ)/\ell_{\min}})$. Unlike in contextual finite-horizon MDPs, where limited knowledge primarily leads to higher losses and regret, in the CSSP setting, insufficient knowledge can also prolong episodes and may even lead to non-terminating episodes. Our analysis reveals that LR-CSSP effectively handles continuous context spaces, while ensuring all episodes terminate within a reasonable number of time steps.

[950] A Closer Look at Personalized Fine-Tuning in Heterogeneous Federated Learning

Minghui Chen, Hrad Ghoukasian, Ruinan Jin, Zehua Wang, Sai Praneeth Karimireddy, Xiaoxiao Li

Main category: cs.LG

TL;DR: LP-FT adapts centralized linear probing followed by fine-tuning to federated learning, outperforming standard personalized fine-tuning by mitigating federated feature distortion and balancing personalization with generalization.

DetailsMotivation: Federated Learning struggles to balance global generalization and local personalization due to non-identical data distributions across clients, and standard Personalized Fine-Tuning often overfits to skewed distributions or fails under domain shifts.

Method: Adapt Linear Probing followed by full Fine-Tuning (LP-FT) from centralized learning to federated setting, using phased parameter updates to mitigate feature distortion during local personalization.

Result: LP-FT demonstrates superiority across seven datasets and six PFT variants, effectively balancing personalization and generalization while mitigating federated feature distortion where local fine-tuning destabilizes globally learned features.

Conclusion: LP-FT offers robust personalization in FL, with theoretical characterization of its advantages and established conditions (partial feature overlap, covariate-concept shift) where it outperforms standard fine-tuning, providing actionable deployment guidelines.

Abstract: Federated Learning (FL) enables decentralized, privacy-preserving model training but struggles to balance global generalization and local personalization due to non-identical data distributions across clients. Personalized Fine-Tuning (PFT), a popular post-hoc solution, fine-tunes the final global model locally but often overfits to skewed client distributions or fails under domain shifts. We propose adapting Linear Probing followed by full Fine-Tuning (LP-FT), a principled centralized strategy for alleviating feature distortion (Kumar et al., 2022), to the FL setting. Through systematic evaluation across seven datasets and six PFT variants, we demonstrate LP-FT’s superiority in balancing personalization and generalization. Our analysis uncovers federated feature distortion, a phenomenon where local fine-tuning destabilizes globally learned features, and theoretically characterizes how LP-FT mitigates this via phased parameter updates. We further establish conditions (e.g., partial feature overlap, covariate-concept shift) under which LP-FT outperforms standard fine-tuning, offering actionable guidelines for deploying robust personalization in FL.

[951] Center-Outward q-Dominance: A Sample-Computable Proxy for Strong Stochastic Dominance in Multi-Objective Optimisation

Robin van der Laag, Hao Wang, Thomas Bäck, Yingjie Fan

Main category: cs.LG

TL;DR: Introduces center-outward q-dominance based on optimal transport theory for ranking multivariate distributions in stochastic multi-objective optimization, with empirical tests and applications in hyperparameter tuning and algorithm selection.

DetailsMotivation: Current SMOOP methods rely on scalarization which loses information and is unreliable for ranking multivariate distributions. There's a need for principled methods to identify truly stochastically dominant solutions.

Method: Developed center-outward q-dominance relation based on optimal transport theory, proved it implies strong first-order stochastic dominance, created empirical test procedure with explicit sample size threshold n*(δ) for Type I error control.

Result: Successfully applied q-dominance to compare hyperparameter tuners when expected hypervolume indicator becomes indistinguishable, and improved NSGA-II convergence rate on noise-augmented ZDT problems by replacing mean-based selection with q-dominance.

Conclusion: Center-outward q-dominance provides a principled, tractable foundation for identifying stochastically dominant solutions in SMOOPs, outperforming traditional scalarization methods.

Abstract: Stochastic multi-objective optimization (SMOOP) requires ranking multivariate distributions; yet, most empirical studies perform scalarization, which loses information and is unreliable. Based on the optimal transport theory, we introduce the center-outward q-dominance relation and prove it implies strong first-order stochastic dominance (FSD). Also, we develop an empirical test procedure based on q-dominance, and derive an explicit sample size threshold, $n^*(δ)$, to control the Type I error. We verify the usefulness of our approach in two scenarios: (1) as a ranking method in hyperparameter tuning; (2) as a selection method in multi-objective optimization algorithms. For the former, we analyze the final stochastic Pareto sets of seven multi-objective hyperparameter tuners on the YAHPO-MO benchmark tasks with q-dominance, which allows us to compare these tuners when the expected hypervolume indicator (HVI, the most common performance metric) of the Pareto sets becomes indistinguishable. For the latter, we replace the mean value-based selection in the NSGA-II algorithm with $q$-dominance, which shows a superior convergence rate on noise-augmented ZDT benchmark problems. These results establish center-outward q-dominance as a principled, tractable foundation for seeking truly stochastically dominant solutions for SMOOPs.

[952] Beyond Fixed Tasks: Unsupervised Environment Design for Task-Level Pairs

Daniel Furelos-Blanco, Charles Pert, Frederik Kelbel, Alex F. Spies, Alessandra Russo, Michael Dennis

Main category: cs.LG

TL;DR: ATLAS is a novel method that generates joint autocurricula over tasks and levels to produce solvable yet challenging task-level pairs for policy training, outperforming random sampling approaches.

DetailsMotivation: Training general agents to follow complex instructions in intricate environments remains challenging, as random sampling often produces unsolvable task-level combinations, highlighting the need for co-designing tasks and levels.

Method: ATLAS builds upon unsupervised environment design (UED) to automatically generate joint autocurricula over both tasks and levels, using mutations that leverage the structure of both tasks and levels.

Result: Experiments show ATLAS vastly outperforms random sampling approaches, particularly when sampling solvable pairs is unlikely, and mutations accelerate convergence to performant policies.

Conclusion: ATLAS successfully addresses the challenge of co-designing tasks and levels through joint autocurricula, demonstrating superior performance over baseline methods in complex reinforcement learning scenarios.

Abstract: Training general agents to follow complex instructions (tasks) in intricate environments (levels) remains a core challenge in reinforcement learning. Random sampling of task-level pairs often produces unsolvable combinations, highlighting the need to co-design tasks and levels. While unsupervised environment design (UED) has proven effective at automatically designing level curricula, prior work has only considered a fixed task. We present ATLAS (Aligning Tasks and Levels for Autocurricula of Specifications), a novel method that generates joint autocurricula over tasks and levels. Our approach builds upon UED to automatically produce solvable yet challenging task-level pairs for policy training. To evaluate ATLAS and drive progress in the field, we introduce an evaluation suite that models tasks as reward machines in Minigrid levels. Experiments demonstrate that ATLAS vastly outperforms random sampling approaches, particularly when sampling solvable pairs is unlikely. We further show that mutations leveraging the structure of both tasks and levels accelerate convergence to performant policies.

[953] CAO: Curvature-Adaptive Optimization via Periodic Low-Rank Hessian Sketching

Wenzhang Du

Main category: cs.LG

TL;DR: A curvature-adaptive optimization method that periodically sketches low-rank Hessian subspaces via Hessian-vector products, preconditioning gradients in that subspace while using first-order methods elsewhere, achieving faster convergence while maintaining accuracy.

DetailsMotivation: First-order optimizers are reliable but slow in sharp, anisotropic regions, motivating the need for curvature-adaptive methods that can handle such challenging optimization landscapes more efficiently.

Method: Periodically sketches a low-rank Hessian subspace using Hessian-vector products, preconditions gradients only in that subspace while using first-order methods in the orthogonal complement, with a widened stable stepsize range.

Result: Achieves 2.95x faster convergence than Adam on CIFAR-100/ResNet-18 to reach train-loss threshold 0.75, while matching final test accuracy. Performance is insensitive to sketch rank k across {1,3,5}.

Conclusion: The method provides substantial speedup in entering low-loss regions while maintaining final accuracy, with simple one-knob tuning and principled curvature-free ablation capability.

Abstract: First-order optimizers are reliable but slow in sharp, anisotropic regions. We study a curvature-adaptive method that periodically sketches a low-rank Hessian subspace via Hessian–vector products and preconditions gradients only in that subspace, leaving the orthogonal complement first-order. For L-smooth non-convex objectives, we recover the standard O(1/T) stationarity guarantee with a widened stable stepsize range; under a Polyak–Lojasiewicz (PL) condition with bounded residual curvature outside the sketch, the loss contracts at refresh steps. On CIFAR-10/100 with ResNet-18/34, the method enters the low-loss region substantially earlier: measured by epochs to a pre-declared train-loss threshold (0.75), it reaches the threshold 2.95x faster than Adam on CIFAR-100/ResNet-18, while matching final test accuracy. The approach is one-knob: performance is insensitive to the sketch rank k across {1,3,5}, and k=0 yields a principled curvature-free ablation. We release anonymized logs and scripts that regenerate all figures and tables.

[954] Adaptive Graph Rewiring to Mitigate Over-Squashing in Mesh-Based GNNs for Fluid Dynamics Simulations

Sangwoo Seo, Hyunsung Kim, Jiwan Kim, Chanyoung Park

Main category: cs.LG

TL;DR: AdaMeshNet introduces adaptive graph rewiring during message-passing to model gradual physical interactions in mesh-based GNNs for fluid simulation, overcoming over-squashing problems caused by mesh refinement.

DetailsMotivation: Conventional graph rewiring methods assume instantaneous interactions between distant nodes and disregard distance information, which is physically unrealistic for modeling fluid dynamics where interactions propagate gradually.

Method: Proposes adaptive rewiring that computes rewiring delay scores based on shortest-path distance and velocity differences, then dynamically selects message-passing layers to add new edges during the propagation process.

Result: Extensive experiments show AdaMeshNet outperforms conventional rewiring methods in mesh-based fluid simulations, enabling more accurate predictions by modeling sequential physical interactions.

Conclusion: The adaptive rewiring framework effectively addresses over-squashing in mesh-based GNNs and better captures the gradual propagation of physical interactions in fluid dynamics simulations.

Abstract: Mesh-based simulation using Graph Neural Networks (GNNs) has been recognized as a promising approach for modeling fluid dynamics. However, the mesh refinement techniques which allocate finer resolution to regions with steep gradients can induce the over-squashing problem in mesh-based GNNs, which prevents the capture of long-range physical interactions. Conventional graph rewiring methods attempt to alleviate this issue by adding new edges, but they typically complete all rewiring operations before applying them to the GNN. These approaches are physically unrealistic, as they assume instantaneous interactions between distant nodes and disregard the distance information between particles. To address these limitations, we propose a novel framework, called Adaptive Graph Rewiring in Mesh-Based Graph Neural Networks (AdaMeshNet), that introduces an adaptive rewiring process into the message-passing procedure to model the gradual propagation of physical interactions. Our method computes a rewiring delay score for bottleneck nodes in the mesh graph, based on the shortest-path distance and the velocity difference. Using this score, it dynamically selects the message-passing layer at which new edges are rewired, which can lead to adaptive rewiring in a mesh graph. Extensive experiments on mesh-based fluid simulations demonstrate that AdaMeshNet outperforms conventional rewiring methods, effectively modeling the sequential nature of physical interactions and enabling more accurate predictions.

[955] Training Instabilities Induce Flatness Bias in Gradient Descent

Lawrence Wang, Stephen J. Roberts

Main category: cs.LG

TL;DR: Training instabilities in gradient descent induce implicit bias toward flatter minima, improving generalization through rotational polarity of eigenvectors.

DetailsMotivation: Modern deep networks often perform best beyond the classical stability threshold, suggesting instabilities may have beneficial effects that classical theory doesn't capture.

Method: Analyze gradient descent dynamics beyond stability threshold, focusing on rotational polarity of eigenvectors (RPE) phenomenon and extend to stochastic GD and Adam.

Result: Instabilities drive parameters toward flatter regions via RPE mechanism, with rotations increasing with learning rates. This flattening persists in stochastic GD and outweighs minibatch noise.

Conclusion: Training instabilities play a constructive role in deep learning by inducing implicit bias toward flatter minima, which improves generalization performance.

Abstract: Classical analyses of gradient descent (GD) define a stability threshold based on the largest eigenvalue of the loss Hessian, often termed sharpness. When the learning rate lies below this threshold, training is stable and the loss decreases monotonically. Yet, modern deep networks often achieve their best performance beyond this regime. We demonstrate that such instabilities induce an implicit bias in GD, driving parameters toward flatter regions of the loss landscape and thereby improving generalization. The key mechanism is the Rotational Polarity of Eigenvectors (RPE), a geometric phenomenon in which the leading eigenvectors of the Hessian rotate during training instabilities. These rotations, which increase with learning rates, promote exploration and provably lead to flatter minima. This theoretical framework extends to stochastic GD, where instability-driven flattening persists and its empirical effects outweigh minibatch noise. Finally, we show that restoring instabilities in Adam further improves generalization. Together, these results establish and understand the constructive role of training instabilities in deep learning.

[956] Are LLMs The Way Forward? A Case Study on LLM-Guided Reinforcement Learning for Decentralized Autonomous Driving

Timur Anvar, Jeffrey Chen, Yuyan Wang, Rohan Chandra

Main category: cs.LG

TL;DR: Small LLMs (<14B parameters) can support autonomous highway driving through reward shaping rather than direct control, but exhibit conservative bias and performance trade-offs compared to RL-only approaches.

DetailsMotivation: Address limitations of RL's reward function specification and LLM-only approaches' instability in safety-critical settings by exploring hybrid methods using small, locally deployed LLMs for reward augmentation.

Method: Case study comparing RL-only, LLM-only, and hybrid approaches where LLMs augment RL rewards by scoring state-action transitions during training, while standard RL policies execute at test time.

Result: RL-only: 73-89% success with reasonable efficiency; LLM-only: up to 94% success but severely degraded speed; hybrid approaches fall between extremes with systematic conservative bias and model-dependent variability.

Conclusion: Current small LLMs have important limitations for safety-critical control tasks despite potential for reward shaping, showing systematic conservative bias and substantial performance trade-offs.

Abstract: Autonomous vehicle navigation in complex environments such as dense and fast-moving highways and merging scenarios remains an active area of research. A key limitation of RL is its reliance on well-specified reward functions, which often fail to capture the full semantic and social complexity of diverse, out-of-distribution situations. As a result, a rapidly growing line of research explores using Large Language Models (LLMs) to replace or supplement RL for direct planning and control, on account of their ability to reason about rich semantic context. However, LLMs present significant drawbacks: they can be unstable in zero-shot safety-critical settings, produce inconsistent outputs, and often depend on expensive API calls with network latency. This motivates our investigation into whether small, locally deployed LLMs (< 14B parameters) can meaningfully support autonomous highway driving through reward shaping rather than direct control. We present a case study comparing RL-only, LLM-only, and hybrid approaches, where LLMs augment RL rewards by scoring state-action transitions during training, while standard RL policies execute at test time. Our findings reveal that RL-only agents achieve moderate success rates (73-89%) with reasonable efficiency, LLM-only agents can reach higher success rates (up to 94%) but with severely degraded speed performance, and hybrid approaches consistently fall between these extremes. Critically, despite explicit efficiency instructions, LLM-influenced approaches exhibit systematic conservative bias with substantial model-dependent variability, highlighting important limitations of current small LLMs for safety-critical control tasks.

[957] Linear time small coresets for k-mean clustering of segments with applications

David Denisov, Shlomi Dolev, Dan Felmdan, Michael Segal

Main category: cs.LG

TL;DR: First coreset construction for k-means clustering on arbitrary segments with O(log² n) size, enabling efficient streaming/distributed computation with minimal accuracy loss.

DetailsMotivation: Need efficient clustering methods for segment data (common in computer vision, GIS) that can handle streaming/distributed settings while maintaining accuracy.

Method: Develop ε-coreset construction that approximates k-means objective for segments, using weighted subset that preserves clustering quality for any center set.

Result: For constant k and ε, achieves coreset size O(log² n) computable in O(nd) time. Experiments show substantial speedups with minimal accuracy loss in real applications.

Conclusion: Proposed method provides both theoretical guarantees and practical efficiency for segment clustering, enabling real-time applications like video tracking.

Abstract: We study the $k$-means problem for a set $\mathcal{S} \subseteq \mathbb{R}^d$ of $n$ segments, aiming to find $k$ centers $X \subseteq \mathbb{R}^d$ that minimize $D(\mathcal{S},X) := \sum_{S \in \mathcal{S}} \min_{x \in X} D(S,x)$, where $D(S,x) := \int_{p \in S} |p - x| dp$ measures the total distance from each point along a segment to a center. Variants of this problem include handling outliers, employing alternative distance functions such as M-estimators, weighting distances to achieve balanced clustering, or enforcing unique cluster assignments. For any $\varepsilon > 0$, an $\varepsilon$-coreset is a weighted subset $C \subseteq \mathbb{R}^d$ that approximates $D(\mathcal{S},X)$ within a factor of $1 \pm \varepsilon$ for any set of $k$ centers, enabling efficient streaming, distributed, or parallel computation. We propose the first coreset construction that provably handles arbitrary input segments. For constant $k$ and $\varepsilon$, it produces a coreset of size $O(\log^2 n)$ computable in $O(nd)$ time. Experiments, including a real-time video tracking application, demonstrate substantial speedups with minimal loss in clustering accuracy, confirming both the practical efficiency and theoretical guarantees of our method.

[958] LMM-IR: Large-Scale Netlist-Aware Multimodal Framework for Static IR-Drop Prediction

Kai Ma, Zhen Wang, Hongquan He, Qi Xu, Tinghuan Chen, Hao Geng

Main category: cs.LG

TL;DR: A novel multimodal approach using large-scale netlist transformer (LNT) to predict static IR drop by representing netlist topology as 3D point clouds, achieving state-of-the-art performance.

DetailsMotivation: Static IR drop analysis is time-consuming and requires iterative analysis, creating computational burden in chip design. Fast and accurate IR drop prediction is needed to reduce overall design time.

Method: Proposed multimodal approach processes SPICE files through large-scale netlist transformer (LNT), representing netlist topology as 3D point cloud representations to handle large netlists (hundreds of thousands to millions nodes). All data types (netlist files and image data) are encoded into latent space features for static voltage drop prediction.

Result: Experimental results show the proposed algorithm achieves the best F1 score and lowest MAE among ICCAD 2023 contest winning teams and state-of-the-art algorithms.

Conclusion: The multimodal approach enables efficient integration of multiple data modalities for complementary predictions, providing fast and accurate IR drop prediction for chip design.

Abstract: Static IR drop analysis is a fundamental and critical task in the field of chip design. Nevertheless, this process can be quite time-consuming, potentially requiring several hours. Moreover, addressing IR drop violations frequently demands iterative analysis, thereby causing the computational burden. Therefore, fast and accurate IR drop prediction is vital for reducing the overall time invested in chip design. In this paper, we firstly propose a novel multimodal approach that efficiently processes SPICE files through large-scale netlist transformer (LNT). Our key innovation is representing and processing netlist topology as 3D point cloud representations, enabling efficient handling of netlist with up to hundreds of thousands to millions nodes. All types of data, including netlist files and image data, are encoded into latent space as features and fed into the model for static voltage drop prediction. This enables the integration of data from multiple modalities for complementary predictions. Experimental results demonstrate that our proposed algorithm can achieve the best F1 score and the lowest MAE among the winning teams of the ICCAD 2023 contest and the state-of-the-art algorithms.

[959] FedTopo: Topology-Informed Representation Alignment in Federated Learning under Non-I.I.D. Conditions

Ke Hu, Liyao Xiang, Peng Tang, Weidong Qiu

Main category: cs.LG

TL;DR: FedTopo improves federated learning under non-IID data by using topological information to align client representations and reduce drift.

DetailsMotivation: Current federated learning models perform poorly with heterogeneous client data due to diverging feature representations and lack of global topology capture in visual tasks.

Method: Uses Topological-Guided Block Screening to select topology-informative blocks, creates Topological Embeddings, and applies Topological Alignment Loss to maintain consistency.

Result: Experiments on Fashion-MNIST, CIFAR-10, and CIFAR-100 show faster convergence and higher accuracy across four non-IID partitions compared to baselines.

Conclusion: FedTopo effectively leverages topological information to address representation drift in federated learning with non-IID data, improving performance and convergence.

Abstract: Current federated-learning models deteriorate under heterogeneous (non-I.I.D.) client data, as their feature representations diverge and pixel- or patch-level objectives fail to capture the global topology which is essential for high-dimensional visual tasks. We propose FedTopo, a framework that integrates Topological-Guided Block Screening (TGBS) and Topological Embedding (TE) to leverage topological information, yielding coherently aligned cross-client representations by Topological Alignment Loss (TAL). First, Topology-Guided Block Screening (TGBS) automatically selects the most topology-informative block, i.e., the one with maximal topological separability, whose persistence-based signatures best distinguish within- versus between-class pairs, ensuring that subsequent analysis focuses on topology-rich features. Next, this block yields a compact Topological Embedding, which quantifies the topological information for each client. Finally, a Topological Alignment Loss (TAL) guides clients to maintain topological consistency with the global model during optimization, reducing representation drift across rounds. Experiments on Fashion-MNIST, CIFAR-10, and CIFAR-100 under four non-I.I.D. partitions show that FedTopo accelerates convergence and improves accuracy over strong baselines.

[960] Scalable Multi-Objective and Meta Reinforcement Learning via Gradient Estimation

Zhenshuo Zhang, Minxuan Duan, Youran Ye, Hongyang R. Zhang

Main category: cs.LG

TL;DR: PolicyGradEx efficiently groups multiple RL objectives into k clusters using meta-training and fine-tuning, achieving 16% performance improvement and 26x speedup over baselines through loss-based clustering.

DetailsMotivation: Learning single policies for many objectives becomes suboptimal as n grows; need efficient grouping of related objectives for better multi-objective RL in robotics, control, and language models.

Method: Two-stage approach: meta-train policy for all objectives, then fine-tune on random subsets using first-order approximation. Estimate task affinity scores and cluster objectives by maximizing intra-cluster affinity.

Result: Outperforms SOTA by 16% on average, 26x faster clustering. Loss-based clustering beats random/ gradient-similarity grouping by 19%. Verified 2% approximation error across RL environments.

Conclusion: PolicyGradEx effectively groups objectives using policy network properties, with validated generalization error analysis via Hessian trace measurements.

Abstract: We study the problem of efficiently estimating policies that simultaneously optimize multiple objectives in reinforcement learning (RL). Given $n$ objectives (or tasks), we seek the optimal partition of these objectives into $k \ll n$ groups, where each group comprises related objectives that can be trained together. This problem arises in applications such as robotics, control, and preference optimization in language models, where learning a single policy for all $n$ objectives is suboptimal as $n$ grows. We introduce a two-stage procedure – meta-training followed by fine-tuning – to address this problem. We first learn a meta-policy for all objectives using multitask learning. Then, we adapt the meta-policy to multiple randomly sampled subsets of objectives. The adaptation step leverages a first-order approximation property of well-trained policy networks, which is empirically verified to be accurate within a $2%$ error margin across various RL environments. The resulting algorithm, PolicyGradEx, efficiently estimates an aggregate task-affinity score matrix given a policy evaluation algorithm. Based on the estimated affinity score matrix, we cluster the $n$ objectives into $k$ groups by maximizing the intra-cluster affinity scores. Experiments on three robotic control and the Meta-World benchmarks demonstrate that our approach outperforms state-of-the-art baselines by $16%$ on average, while delivering up to $26\times$ faster speedup relative to performing full training to obtain the clusters. Ablation studies validate each component of our approach. For instance, compared with random grouping and gradient-similarity-based grouping, our loss-based clustering yields an improvement of $19%$. Finally, we analyze the generalization error of policy networks by measuring the Hessian trace of the loss surface, which gives non-vacuous measures relative to the observed generalization errors.

[961] NFQ2.0: The CartPole Benchmark Revisited

Sascha Lange, Roland Hafner, Martin Riedmiller

Main category: cs.LG

TL;DR: Modernized NFQ algorithm (NFQ2.0) applied to CartPole benchmark with focus on reproducibility and robustness in real-world industrial systems.

DetailsMotivation: To revisit the pioneering NFQ algorithm, address its reproducibility issues, and improve its application in real-world industrial control problems.

Method: Proposed NFQ2.0 variant with ablation studies to identify key design decisions and hyperparameters that enhance performance and stability.

Result: NFQ2.0 shows improved performance and stability over original NFQ, with better repeatability on real-world industrial systems.

Conclusion: The findings help practitioners reproduce results and apply deep reinforcement learning more effectively in industrial contexts.

Abstract: This article revisits the 20-year-old neural fitted Q-iteration (NFQ) algorithm on its classical CartPole benchmark. NFQ was a pioneering approach towards modern Deep Reinforcement Learning (Deep RL) in applying multi-layer neural networks to reinforcement learning for real-world control problems. We explore the algorithm’s conceptual simplicity and its transition from online to batch learning, which contributed to its stability. Despite its initial success, NFQ required extensive tuning and was not easily reproducible on real-world control problems. We propose a modernized variant NFQ2.0 and apply it to the CartPole task, concentrating on a real-world system build from standard industrial components, to investigate and improve the learning process’s repeatability and robustness. Through ablation studies, we highlight key design decisions and hyperparameters that enhance performance and stability of NFQ2.0 over the original variant. Finally, we demonstrate how our findings can assist practitioners in reproducing and improving results and applying deep reinforcement learning more effectively in industrial contexts.

[962] Sample Complexity of Agnostic Multiclass Classification: Natarajan Dimension Strikes Back

Alon Cohen, Liad Erez, Steve Hanneke, Tomer Koren, Yishay Mansour, Shay Moran, Qian Zhang

Main category: cs.LG

TL;DR: Multiclass PAC learning sample complexity is governed by two dimensions: DS dimension controls the DS regime term (DS^1.5/ε) and Natarajan dimension controls the asymptotic term (Nat/ε^2), unlike binary learning which has a single VC dimension.

DetailsMotivation: To resolve the long-standing challenge of extending the fundamental theorem of statistical learning from binary to multiclass classification, where previous work showed DS dimension characterizes learnability but Natarajan dimension still plays a role.

Method: Novel online procedure using self-adaptive multiplicative-weights algorithm with label-space reduction, departing from traditional agnostic learning methods based on uniform convergence or realizable case reductions.

Result: Proved nearly tight agnostic sample complexity bounds of form DS^1.5/ε + Nat/ε^2, tight up to √DS factor, showing both DS and Nat dimensions are essential for multiclass learning.

Conclusion: Multiclass learning inherently involves two structural parameters (DS and Nat dimensions), unlike binary or online classification where single dimensions (VC or Littlestone) control both phenomena.

Abstract: The fundamental theorem of statistical learning states that binary PAC learning is governed by a single parameter – the Vapnik-Chervonenkis (VC) dimension – which determines both learnability and sample complexity. Extending this to multiclass classification has long been challenging, since Natarajan’s work in the late 80s proposing the Natarajan dimension (Nat) as a natural analogue of VC. Daniely and Shalev-Shwartz (2014) introduced the DS dimension, later shown by Brukhim et al. (2022) to characterize multiclass learnability. Brukhim et al. also showed that Nat and DS can diverge arbitrarily, suggesting that multiclass learning is governed by DS rather than Nat. We show that agnostic multiclass PAC sample complexity is in fact governed by two distinct dimensions. Specifically, we prove nearly tight agnostic sample complexity bounds that, up to log factors, take the form $\frac{DS^{1.5}}ε + \frac{Nat}{ε^2}$ where $ε$ is the excess risk. This bound is tight up to a $\sqrt{DS}$ factor in the first term, nearly matching known $Nat/ε^2$ and $DS/ε$ lower bounds. The first term reflects the DS-controlled regime, while the second shows that the Natarajan dimension still dictates asymptotic behavior for small $ε$. Thus, unlike binary or online classification – where a single dimension (VC or Littlestone) controls both phenomena – multiclass learning inherently involves two structural parameters. Our technical approach departs from traditional agnostic learning methods based on uniform convergence or reductions to realizable cases. A key ingredient is a novel online procedure based on a self-adaptive multiplicative-weights algorithm performing a label-space reduction, which may be of independent interest.

[963] Optimal Look-back Horizon for Time Series Forecasting in Federated Learning

Dahao Tang, Nan Yang, Yanli Li, Zhiyu Zhu, Zhibo Jin, Dong Yuan

Main category: cs.LG

TL;DR: A principled framework for adaptive horizon selection in federated time series forecasting using intrinsic space formulation and synthetic data generation to address decentralized, heterogeneous data challenges.

DetailsMotivation: Selecting appropriate look-back horizons is challenging in federated time series forecasting due to decentralized, heterogeneous, non-independent data. Existing approaches are limited to centralized settings.

Method: Introduces synthetic data generator capturing temporal structures, maps time series to intrinsic representation space, and decomposes forecasting loss into Bayesian (irreducible uncertainty) and approximation (finite-sample effects) terms.

Result: Analysis shows increasing look-back horizon improves deterministic pattern identifiability but increases approximation error. Total loss minimized at smallest horizon where irreducible loss saturates while approximation loss rises.

Conclusion: Provides rigorous theoretical foundation for adaptive horizon selection in federated time series forecasting, addressing the fundamental challenge of horizon selection in decentralized settings.

Abstract: Selecting an appropriate look-back horizon remains a fundamental challenge in time series forecasting (TSF), particularly in the federated learning scenarios where data is decentralized, heterogeneous, and often non-independent. While recent work has explored horizon selection by preserving forecasting-relevant information in an intrinsic space, these approaches are primarily restricted to centralized and independently distributed settings. This paper presents a principled framework for adaptive horizon selection in federated time series forecasting through an intrinsic space formulation. We introduce a synthetic data generator (SDG) that captures essential temporal structures in client data, including autoregressive dependencies, seasonality, and trend, while incorporating client-specific heterogeneity. Building on this model, we define a transformation that maps time series windows into an intrinsic representation space with well-defined geometric and statistical properties. We then derive a decomposition of the forecasting loss into a Bayesian term, which reflects irreducible uncertainty, and an approximation term, which accounts for finite-sample effects and limited model capacity. Our analysis shows that while increasing the look-back horizon improves the identifiability of deterministic patterns, it also increases approximation error due to higher model complexity and reduced sample efficiency. We prove that the total forecasting loss is minimized at the smallest horizon where the irreducible loss starts to saturate, while the approximation loss continues to rise. This work provides a rigorous theoretical foundation for adaptive horizon selection for time series forecasting in federated learning.

[964] Attention-Enhanced Convolutional Autoencoder and Structured Delay Embeddings for Weather Prediction

Amirpasha Hedayat, Karthik Duraisamy

Main category: cs.LG

TL;DR: Efficient reduced-order modeling framework for short-range weather prediction using ResNet-based autoencoder with block attention and linear dynamics in latent space, achieving reasonable accuracy with computational efficiency.

DetailsMotivation: To develop an efficient weather prediction framework that prioritizes computational efficiency over extensive resources, unlike recent AI-driven models, while investigating fundamental questions in dimensionality reduction of chaotic systems.

Method: ResNet-based convolutional autoencoder with block attention modules for dimensionality reduction, followed by learning a linear operator in time-delayed embedding of latent space to capture dynamics.

Result: Framework performs well in-distribution on ERA5 dataset but has limitations in generalizing to future states beyond training window. Weather systems show strong temporal correlations captured by linear operations, with projection error being main bottleneck.

Conclusion: Weather systems can be effectively modeled with linear operations in appropriate embedding spaces. Hybrid approaches combining efficient reduced-order models with sophisticated AI architectures are promising for long-term climate modeling where efficiency is crucial.

Abstract: Weather prediction is a quintessential problem involving the forecasting of a complex, nonlinear, and chaotic high-dimensional dynamical system. This work introduces an efficient reduced-order modeling (ROM) framework for short-range weather prediction and investigates fundamental questions in dimensionality reduction and reduced order modeling of such systems. Unlike recent AI-driven models, which require extensive computational resources, our framework prioritizes efficiency while achieving reasonable accuracy. Specifically, a ResNet-based convolutional autoencoder augmented by block attention modules is developed to reduce the dimensionality of high-dimensional weather data. Subsequently, a linear operator is learned in the time-delayed embedding of the latent space to efficiently capture the dynamics. Using the ERA5 reanalysis dataset, we demonstrate that this framework performs well in-distribution as evidenced by effectively predicting weather patterns within training data periods. We also identify important limitations in generalizing to future states, particularly in maintaining prediction accuracy beyond the training window. Our analysis reveals that weather systems exhibit strong temporal correlations that can be effectively captured through linear operations in an appropriately constructed embedding space, and that projection error rather than inference error is the main bottleneck. These findings shed light on some key challenges in reduced-order modeling of chaotic systems and point toward opportunities for hybrid approaches that combine efficient reduced-order models as baselines with more sophisticated AI architectures, particularly for applications in long-term climate modeling where computational efficiency is paramount.

[965] Oxytrees: Model Trees for Bipartite Learning

Pedro Ilídio, Felipe Kenji Nakano, Alireza Gharahighehi, Robbe D’hondt, Ricardo Cerri, Celine Vens

Main category: cs.LG

TL;DR: Oxytrees: proxy-based biclustering model trees for bipartite learning that compress interaction matrices into proxy matrices, enabling faster training and prediction while maintaining competitive performance.

DetailsMotivation: Current bipartite learning methods are often application-specific and don't generalize well, or have scalability issues that limit their practical use across different domains.

Method: Propose Oxytrees using proxy matrices to compress interaction data, a new leaf-assignment algorithm for faster prediction, and linear models with Kronecker product kernels in leaves to create shallower trees.

Result: Achieved up to 30-fold improvement in training times compared to state-of-the-art biclustering forests while maintaining competitive or superior performance, especially in inductive settings.

Conclusion: Oxytrees provide an efficient and scalable solution for bipartite learning with significant speed improvements and competitive performance, along with a Python API for reproducible research.

Abstract: Bipartite learning is a machine learning task that aims to predict interactions between pairs of instances. It has been applied to various domains, including drug-target interactions, RNA-disease associations, and regulatory network inference. Despite being widely investigated, current methods still present drawbacks, as they are often designed for a specific application and thus do not generalize to other problems or present scalability issues. To address these challenges, we propose Oxytrees: proxy-based biclustering model trees. Oxytrees compress the interaction matrix into row- and column-wise proxy matrices, significantly reducing training time without compromising predictive performance. We also propose a new leaf-assignment algorithm that significantly reduces the time taken for prediction. Finally, Oxytrees employ linear models using the Kronecker product kernel in their leaves, resulting in shallower trees and thus even faster training. Using 15 datasets, we compared the predictive performance of ensembles of Oxytrees with that of the current state-of-the-art. We achieved up to 30-fold improvement in training times compared to state-of-the-art biclustering forests, while demonstrating competitive or superior performance in most evaluation settings, particularly in the inductive setting. Finally, we provide an intuitive Python API to access all datasets, methods and evaluation measures used in this work, thus enabling reproducible research in this field.

[966] Genomic Next-Token Predictors are In-Context Learners

Nathan Breslow, Aayush Mishra, Mahler Revsine, Michael C. Schatz, Anqi Liu, Daniel Khashabi

Main category: cs.LG

TL;DR: Genomic models trained on next-nucleotide prediction exhibit emergent in-context learning similar to language models, showing log-linear gains with more demonstrations, supporting that ICL arises from large-scale predictive modeling across modalities.

DetailsMotivation: To investigate whether in-context learning (ICL) can emerge organically in non-linguistic sequence domains through large-scale predictive training, challenging the notion that ICL is unique to human language.

Method: Developed controlled experimental framework with symbolic reasoning tasks in both linguistic and genomic forms, using Evo2 genomic model trained on next-nucleotide prediction at scale comparable to mid-sized LLMs.

Result: Genomic models exhibit log-linear gains in pattern induction with increasing in-context demonstrations, similar to linguistic models - first evidence of organically emergent ICL in genomic sequences.

Conclusion: ICL arises as a consequence of large-scale predictive modeling over rich data, extending emergent meta-learning beyond language to a unified, modality-agnostic view of in-context learning.

Abstract: In-context learning (ICL) – the capacity of a model to infer and apply abstract patterns from examples provided within its input – has been extensively studied in large language models trained for next-token prediction on human text. In fact, prior work often attributes this emergent behavior to distinctive statistical properties in human language. This raises a fundamental question: can ICL arise organically in other sequence domains purely through large-scale predictive training? To explore this, we turn to genomic sequences, an alternative symbolic domain rich in statistical structure. Specifically, we study the Evo2 genomic model, trained predominantly on next-nucleotide (A/T/C/G) prediction, at a scale comparable to mid-sized LLMs. We develop a controlled experimental framework comprising symbolic reasoning tasks instantiated in both linguistic and genomic forms, enabling direct comparison of ICL across genomic and linguistic models. Our results show that genomic models, like their linguistic counterparts, exhibit log-linear gains in pattern induction as the number of in-context demonstrations increases. To the best of our knowledge, this is the first evidence of organically emergent ICL in genomic sequences, supporting the hypothesis that ICL arises as a consequence of large-scale predictive modeling over rich data. These findings extend emergent meta-learning beyond language, pointing toward a unified, modality-agnostic view of in-context learning.

[967] On Robustness of Linear Classifiers to Targeted Data Poisoning

Nakshatra Gupta, Sumanth Prabhu, Supratik Chakraborty, R Venkatesh

Main category: cs.LG

TL;DR: The paper presents a method to compute robustness bounds against targeted data poisoning attacks where adversaries can only manipulate training labels, proving the problem is NP-Complete and providing efficient practical solutions.

DetailsMotivation: Data poisoning attacks undermine model trustworthiness, and manual detection is difficult due to large training datasets. Automatic measurement of dataset robustness against such attacks is needed.

Method: Prove NP-Completeness of finding robustness for linear classifiers, then develop technique to compute lower and upper bounds of robustness efficiently in practice.

Result: Implementation efficiently computes robustness bounds for many public datasets. Poisoning exceeding these bounds significantly impacts test point classification. Method succeeds where state-of-the-art techniques fail.

Conclusion: The approach effectively measures dataset robustness against label manipulation attacks, providing practical bounds that help assess vulnerability to targeted data poisoning.

Abstract: Data poisoning is a training-time attack that undermines the trustworthiness of learned models. In a targeted data poisoning attack, an adversary manipulates the training dataset to alter the classification of a targeted test point. Given the typically large size of training dataset, manual detection of poisoning is difficult. An alternative is to automatically measure a dataset’s robustness against such an attack, which is the focus of this paper. We consider a threat model wherein an adversary can only perturb the labels of the training dataset, with knowledge limited to the hypothesis space of the victim’s model. In this setting, we prove that finding the robustness is an NP-Complete problem, even when hypotheses are linear classifiers. To overcome this, we present a technique that finds lower and upper bounds of robustness. Our implementation of the technique computes these bounds efficiently in practice for many publicly available datasets. We experimentally demonstrate the effectiveness of our approach. Specifically, a poisoning exceeding the identified robustness bounds significantly impacts test point classification. We are also able to compute these bounds in many more cases where state-of-the-art techniques fail.

[968] The Alignment Game: A Theory of Long-Horizon Alignment Through Recursive Curation

Ali Falahati, Mohammad Mohammadi Amiri, Kate Larson, Lukasz Golab

Main category: cs.LG

TL;DR: This paper provides the first formal analysis of recursive retraining in self-consuming generative models, revealing three convergence regimes and proving a fundamental impossibility theorem about preserving diversity, symmetric influence, and independence from initialization simultaneously.

DetailsMotivation: To understand the long-term effects of recursive retraining on alignment in self-consuming generative models, where models train on their own outputs and alignment becomes a recursive rather than one-time process.

Method: The authors use a two-stage curation mechanism based on the Bradley-Terry model, modeling alignment as an interaction between Model Owner (who filters outputs) and Public User (who determines shared outputs). They analyze convergence regimes and prove impossibility theorems.

Result: Three structural convergence regimes identified: consensus collapse, compromise on shared optima, and asymmetric refinement. A fundamental impossibility theorem shows no recursive BT-based curation can simultaneously preserve diversity, ensure symmetric influence, and eliminate dependence on initialization.

Conclusion: Alignment in self-consuming models is not a static goal but an evolving equilibrium shaped by power asymmetries and path dependence, requiring ongoing management rather than one-time optimization.

Abstract: In self-consuming generative models that train on their own outputs, alignment with user preferences becomes a recursive rather than one-time process. We provide the first formal foundation for analyzing the long-term effects of such recursive retraining on alignment. Under a two-stage curation mechanism based on the Bradley-Terry (BT) model, we model alignment as an interaction between two factions: the Model Owner, who filters which outputs should be learned by the model, and the Public User, who determines which outputs are ultimately shared and retained through interactions with the model. Our analysis reveals three structural convergence regimes depending on the degree of preference alignment: consensus collapse, compromise on shared optima, and asymmetric refinement. We prove a fundamental impossibility theorem: no recursive BT-based curation mechanism can simultaneously preserve diversity, ensure symmetric influence, and eliminate dependence on initialization. Framing the process as dynamic social choice, we show that alignment is not a static goal but an evolving equilibrium, shaped both by power asymmetries and path dependence.

[969] LAYA: Layer-wise Attention Aggregation for Interpretable Depth-Aware Neural Networks

Gennaro Vessio

Main category: cs.LG

TL;DR: LAYA is a novel output head that uses attention to dynamically aggregate features from all network layers instead of just the final layer, improving performance and providing interpretable layer attribution scores.

DetailsMotivation: Standard neural networks only use the final layer for predictions, discarding rich complementary information from intermediate layers that contain different abstraction levels.

Method: LAYA (Layer-wise Attention Aggregator) learns input-conditioned attention weights over layer-wise features, creating an architecture-agnostic mechanism that dynamically combines representations from all layers.

Result: Experiments on vision and language benchmarks show LAYA consistently matches or improves performance over standard output heads, with up to ~1% accuracy gains, while providing interpretable layer-attribution scores.

Conclusion: LAYA demonstrates that aggregating features from all layers through attention improves performance and provides intrinsic interpretability by revealing how different abstraction levels contribute to decisions, without requiring external post hoc explanations.

Abstract: Deep neural networks typically rely on the representation produced by their final hidden layer to make predictions, implicitly assuming that this single vector fully captures the semantics encoded across all preceding transformations. However, intermediate layers contain rich and complementary information – ranging from low-level patterns to high-level abstractions – that is often discarded when the decision head depends solely on the last representation. This paper revisits the role of the output layer and introduces LAYA (Layer-wise Attention Aggregator), a novel output head that dynamically aggregates internal representations through attention. Instead of projecting only the deepest embedding, LAYA learns input-conditioned attention weights over layer-wise features, yielding an interpretable and architecture-agnostic mechanism for synthesizing predictions. Experiments on vision and language benchmarks show that LAYA consistently matches or improves the performance of standard output heads, with relative gains of up to about one percentage point in accuracy, while providing explicit layer-attribution scores that reveal how different abstraction levels contribute to each decision. Crucially, these interpretability signals emerge directly from the model’s computation, without any external post hoc explanations. The code to reproduce LAYA is publicly available at: https://github.com/gvessio/LAYA.

[970] Expressive Temporal Specifications for Reward Monitoring

Omar Adalat, Francesco Belardinelli

Main category: cs.LG

TL;DR: Using quantitative Linear Temporal Logic on finite traces (LTL_f[F]) to create dense reward monitors that provide continuous feedback during RL training, overcoming sparse reward issues in long-horizon tasks.

DetailsMotivation: Addressing the challenge of specifying informative and dense reward functions in Reinforcement Learning, particularly to mitigate sparse reward problems in long-horizon decision making that arise under Boolean semantics.

Method: Harness quantitative Linear Temporal Logic on finite traces (LTL_f[F]) to synthesize reward monitors that generate dense reward streams for observable state trajectories, using a state labelling function in an algorithm-agnostic framework.

Result: Quantitative monitors consistently subsume and outperform Boolean monitors in maximizing task completion measures and reducing convergence time across different environments.

Conclusion: The proposed quantitative LTL_f[F] framework effectively addresses sparse reward challenges in RL by providing dense, nuanced feedback that guides agents toward optimal behavior more efficiently than traditional Boolean approaches.

Abstract: Specifying informative and dense reward functions remains a pivotal challenge in Reinforcement Learning, as it directly affects the efficiency of agent training. In this work, we harness the expressive power of quantitative Linear Temporal Logic on finite traces (($\text{LTL}_f[\mathcal{F}]$)) to synthesize reward monitors that generate a dense stream of rewards for runtime-observable state trajectories. By providing nuanced feedback during training, these monitors guide agents toward optimal behaviour and help mitigate the well-known issue of sparse rewards under long-horizon decision making, which arises under the Boolean semantics dominating the current literature. Our framework is algorithm-agnostic and only relies on a state labelling function, and naturally accommodates specifying non-Markovian properties. Empirical results show that our quantitative monitors consistently subsume and, depending on the environment, outperform Boolean monitors in maximizing a quantitative measure of task completion and in reducing convergence time.

[971] Convolutional Model Trees

William Ward Armstrong

Main category: cs.LG

TL;DR: A method for creating forests of model trees to fit functions on images, handling distortions through convolutions and achieving smooth fits with theoretical convergence guarantees.

DetailsMotivation: To develop a robust method for fitting functions defined on images that can handle various distortions like rotations and perspective changes while maintaining accuracy and smoothness.

Method: Down-sampling images, determining tree hyperplanes, applying convolutions to handle small distortions, creating forests of model trees, and smoothing outputs for continuous differentiability.

Result: The method establishes a 1-to-1 correspondence among pixels, hyperplane coefficients, and leaf functions, enabling handling of larger distortions like arbitrary rotations and perspective changes.

Conclusion: The proposed framework provides a theoretically sound approach with proven convergence for training, offering smooth and accurate function approximations on images despite distortions.

Abstract: A method for creating a forest of model trees to fit samples of a function defined on images is described in several steps: down-sampling the images, determining a tree’s hyperplanes, applying convolutions to the hyperplanes to handle small distortions of training images, and creating forests of model trees to increase accuracy and achieve a smooth fit. A 1-to-1 correspondence among pixels of images, coefficients of hyperplanes and coefficients of leaf functions offers the possibility of dealing with larger distortions such as arbitrary rotations or changes of perspective. A theoretical method for smoothing forest outputs to produce a continuously differentiable approximation is described. Within that framework, a training procedure is proved to converge.

[972] Catastrophic Forgetting in Kolmogorov-Arnold Networks

Mohammad Marufur Rahman, Guanchu Wang, Kaixiong Zhou, Minghan Chen, Fan Yang

Main category: cs.LG

TL;DR: Kolmogorov-Arnold Networks (KANs) show promise for resisting catastrophic forgetting in low-dimensional settings but remain vulnerable in high-dimensional domains like image classification and language modeling.

DetailsMotivation: To understand the practical behavior and limitations of KANs in continual learning, as their intrinsic resistance to forgetting through localized spline-based activations remains unclear despite architectural advances.

Method: Developed a theoretical framework linking forgetting to activation support overlap and intrinsic data dimension, conducted systematic experiments on synthetic and vision tasks, and introduced KAN-LoRA for parameter-efficient continual fine-tuning of language models.

Result: KANs exhibit promising retention in low-dimensional algorithmic settings but remain vulnerable to forgetting in high-dimensional domains; KAN-LoRA shows effectiveness in knowledge editing tasks.

Conclusion: The study advances understanding of KANs’ strengths and limitations in continual learning, providing practical insights for system design while highlighting that KANs alone don’t fully solve catastrophic forgetting in complex domains.

Abstract: Catastrophic forgetting is a longstanding challenge in continual learning, where models lose knowledge from earlier tasks when learning new ones. While various mitigation strategies have been proposed for Multi-Layer Perceptrons (MLPs), recent architectural advances like Kolmogorov-Arnold Networks (KANs) have been suggested to offer intrinsic resistance to forgetting by leveraging localized spline-based activations. However, the practical behavior of KANs under continual learning remains unclear, and their limitations are not well understood. To address this, we present a comprehensive study of catastrophic forgetting in KANs and develop a theoretical framework that links forgetting to activation support overlap and intrinsic data dimension. We validate these analyses through systematic experiments on synthetic and vision tasks, measuring forgetting dynamics under varying model configurations and data complexity. Further, we introduce KAN-LoRA, a novel adapter design for parameter-efficient continual fine-tuning of language models, and evaluate its effectiveness in knowledge editing tasks. Our findings reveal that while KANs exhibit promising retention in low-dimensional algorithmic settings, they remain vulnerable to forgetting in high-dimensional domains such as image classification and language modeling. These results advance the understanding of KANs’ strengths and limitations, offering practical insights for continual learning system design.

[973] Stabilizing Self-Consuming Diffusion Models with Latent Space Filtering

Zhongteng Cai, Yaxuan Wang, Yang Liu, Xueru Zhang

Main category: cs.LG

TL;DR: Proposes Latent Space Filtering (LSF) to mitigate model collapse in self-consuming generative models by filtering synthetic data based on latent space degradation, outperforming existing methods without extra cost or human annotation.

DetailsMotivation: Address model collapse in self-consuming generative models where synthetic data is reused for training, avoiding expensive solutions like accumulating historical data or human annotation.

Method: Analyze latent space dynamics of diffusion models, observe degradation in synthetic data representations, and filter out less realistic synthetic data using latent space filtering.

Result: LSF consistently outperforms existing baselines across multiple real-world datasets, effectively mitigating model collapse without increasing training cost or requiring human annotation.

Conclusion: Latent Space Filtering provides an effective solution to model collapse by leveraging latent space degradation patterns, offering a practical alternative to costly existing approaches.

Abstract: As synthetic data proliferates across the Internet, it is often reused to train successive generations of generative models. This creates a ``self-consuming loop" that can lead to training instability or \textit{model collapse}. Common strategies to address the issue – such as accumulating historical training data or injecting fresh real data – either increase computational cost or require expensive human annotation. In this paper, we empirically analyze the latent space dynamics of self-consuming diffusion models and observe that the low-dimensional structure of latent representations extracted from synthetic data degrade over generations. Based on this insight, we propose \textit{Latent Space Filtering} (LSF), a novel approach that mitigates model collapse by filtering out less realistic synthetic data from mixed datasets. Theoretically, we present a framework that connects latent space degradation to empirical observations. Experimentally, we show that LSF consistently outperforms existing baselines across multiple real-world datasets, effectively mitigating model collapse without increasing training cost or relying on human annotation.

[974] Connectivity-Guided Sparsification of 2-FWL GNNs: Preserving Full Expressivity with Improved Efficiency

Rongqin Chen, Fan Mo, Pak Lon Ip, Shenghui Zhang, Dan Wu, Ye Li, Leong Hou U

Main category: cs.LG

TL;DR: Co-Sparsify is a connectivity-aware sparsification framework that reduces computational cost of higher-order GNNs from O(n³) to O(n²) while preserving full 2-FWL expressive power by restricting 3-node interactions to biconnected components.

DetailsMotivation: Existing efficiency methods for higher-order GNNs reduce computational burden but sacrifice expressivity. There's a need for methods that maintain full expressive power while being computationally efficient.

Method: Co-Sparsify restricts 2-node message passing to connected components and 3-node interactions to biconnected components, eliminating provably redundant computations without approximation or sampling.

Result: Co-Sparsified GNNs match or exceed accuracy on synthetic substructure counting tasks and achieve state-of-the-art performance on real-world benchmarks (ZINC, QM9) while reducing computational complexity.

Conclusion: High expressivity and scalability are not mutually exclusive - principled, topology-guided sparsification enables powerful, efficient GNNs with theoretical guarantees.

Abstract: Higher-order Graph Neural Networks (HOGNNs) based on the 2-FWL test achieve superior expressivity by modeling 2- and 3-node interactions, but at $\mathcal{O}(n^3)$ computational cost. However, this computational burden is typically mitigated by existing efficiency methods at the cost of reduced expressivity. We propose \textbf{Co-Sparsify}, a connectivity-aware sparsification framework that eliminates \emph{provably redundant} computations while preserving full 2-FWL expressive power. Our key insight is that 3-node interactions are expressively necessary only within \emph{biconnected components} – maximal subgraphs where every pair of nodes lies on a cycle. Outside these components, structural relationships can be fully captured via 2-node message passing or global readout, rendering higher-order modeling unnecessary. Co-Sparsify restricts 2-node message passing to connected components and 3-node interactions to biconnected ones, removing computation without approximation or sampling. We prove that Co-Sparsified GNNs are as expressive as the 2-FWL test. Empirically, on PPGN, Co-Sparsify matches or exceeds accuracy on synthetic substructure counting tasks and achieves state-of-the-art performance on real-world benchmarks (ZINC, QM9). This study demonstrates that high expressivity and scalability are not mutually exclusive: principled, topology-guided sparsification enables powerful, efficient GNNs with theoretical guarantees.

[975] DIVIDE: A Framework for Learning from Independent Multi-Mechanism Data Using Deep Encoders and Gaussian Processes

Vivek Chawla, Boris Slautin, Utkarsh Pratiush, Dayakar Penumadu, Sergei Kalinin

Main category: cs.LG

TL;DR: DIVIDE is a framework that disentangles multiple independent mechanisms in scientific datasets using mechanism-specific deep encoders and structured Gaussian Processes to separate their individual contributions.

DetailsMotivation: Scientific datasets often combine influences from multiple independent mechanisms (spatial, categorical, structural) that obscure individual contributions, making it difficult to understand each mechanism's specific effect.

Method: Integrates mechanism-specific deep encoders with a structured Gaussian Process in a joint latent space. Encoders isolate distinct mechanisms while Gaussian Process captures combined effects with uncertainty calibration. Supports structured priors for interpretable predictions.

Result: Successfully separates mechanisms on synthetic datasets, FerroSIM spin lattice simulations, and experimental PFM hysteresis loops. Reproduces additive and scaled interactions, remains robust under noise, and enables mechanism-aware prediction and efficient active learning.

Conclusion: DIVIDE effectively disentangles multiple generative factors in scientific datasets, providing interpretable mechanism separation that extends naturally to multifunctional datasets with coexisting physical responses.

Abstract: Scientific datasets often arise from multiple independent mechanisms such as spatial, categorical or structural effects, whose combined influence obscures their individual contributions. We introduce DIVIDE, a framework that disentangles these influences by integrating mechanism-specific deep encoders with a structured Gaussian Process in a joint latent space. Disentanglement here refers to separating independently acting generative factors. The encoders isolate distinct mechanisms while the Gaussian Process captures their combined effect with calibrated uncertainty. The architecture supports structured priors, enabling interpretable and mechanism-aware prediction as well as efficient active learning. DIVIDE is demonstrated on synthetic datasets combining categorical image patches with nonlinear spatial fields, on FerroSIM spin lattice simulations of ferroelectric patterns, and on experimental PFM hysteresis loops from PbTiO3 films. Across benchmarks, DIVIDE separates mechanisms, reproduces additive and scaled interactions, and remains robust under noise. The framework extends naturally to multifunctional datasets where mechanical, electromagnetic or optical responses coexist.

[976] RoS-Guard: Robust and Scalable Online Change Detection with Delay-Optimal Guarantees

Zelin Zhu, Yancheng Huang, Kai Yang

Main category: cs.LG

TL;DR: RoS-Guard is a robust online change detection algorithm for linear systems with uncertainty that uses neural unrolling for efficient GPU-accelerated computation.

DetailsMotivation: Existing OCD methods assume precise system knowledge and struggle with efficiency in large-scale systems, which is unrealistic due to estimation errors and environmental variations.

Method: Proposes RoS-Guard through tight relaxation and reformulation of OCD optimization problem, employing neural unrolling for efficient parallel computation via GPU acceleration.

Result: Extensive experiments validate effectiveness and demonstrate significant computational speedup in large-scale system scenarios.

Conclusion: RoS-Guard provides theoretical guarantees on performance (expected false alarm rate and worst-case average detection delay) and offers an efficient solution for robust change detection in uncertain linear systems.

Abstract: Online change detection (OCD) aims to rapidly identify change points in streaming data and is critical in applications such as power system monitoring, wireless network sensing, and financial anomaly detection. Existing OCD methods typically assume precise system knowledge, which is unrealistic due to estimation errors and environmental variations. Moreover, existing OCD methods often struggle with efficiency in large-scale systems. To overcome these challenges, we propose RoS-Guard, a robust and optimal OCD algorithm tailored for linear systems with uncertainty. Through a tight relaxation and reformulation of the OCD optimization problem, RoS-Guard employs neural unrolling to enable efficient parallel computation via GPU acceleration. The algorithm provides theoretical guarantees on performance, including expected false alarm rate and worst-case average detection delay. Extensive experiments validate the effectiveness of RoS-Guard and demonstrate significant computational speedup in large-scale system scenarios.

[977] Conformal Online Learning of Deep Koopman Linear Embeddings

Ben Gao, Jordan Patracone, Stéphane Chrétien, Olivier Alata

Main category: cs.LG

TL;DR: COLoKe is a framework that adaptively updates Koopman embeddings for nonlinear systems from streaming data using deep feature learning and multistep prediction consistency, with conformal-style updates triggered only when prediction errors exceed dynamic thresholds.

DetailsMotivation: To develop an adaptive framework for learning Koopman-invariant representations from streaming data that prevents overfitting and reduces unnecessary updates while maintaining predictive accuracy.

Method: Combines deep feature learning with multistep prediction consistency in lifted linear space, using conformal-style mechanism to assess model consistency and trigger updates only when prediction errors exceed dynamically calibrated thresholds.

Result: Empirical results show COLoKe effectively maintains long-term predictive accuracy while significantly reducing unnecessary updates and avoiding overfitting on benchmark dynamical systems.

Conclusion: COLoKe provides an effective framework for adaptive Koopman learning that balances model accuracy with computational efficiency through selective refinement based on prediction consistency.

Abstract: We introduce Conformal Online Learning of Koopman embeddings (COLoKe), a novel framework for adaptively updating Koopman-invariant representations of nonlinear dynamical systems from streaming data. Our modeling approach combines deep feature learning with multistep prediction consistency in the lifted space, where the dynamics evolve linearly. To prevent overfitting, COLoKe employs a conformal-style mechanism that shifts the focus from evaluating the conformity of new states to assessing the consistency of the current Koopman model. Updates are triggered only when the current model’s prediction error exceeds a dynamically calibrated threshold, allowing selective refinement of the Koopman operator and embedding. Empirical results on benchmark dynamical systems demonstrate the effectiveness of COLoKe in maintaining long-term predictive accuracy while significantly reducing unnecessary updates and avoiding overfitting.

[978] From Black-Box to White-Box: Control-Theoretic Neural Network Interpretability

Jihoon Moon

Main category: cs.LG

TL;DR: A control theory framework analyzes neural networks as dynamical systems using linearization, Gramians, and Hankel singular values to understand neuron importance and internal computations.

DetailsMotivation: Deep neural networks achieve state-of-the-art performance but are difficult to interpret mechanistically, requiring methods to understand their internal computations.

Method: Treat trained neural networks as nonlinear state space systems, linearize around hidden activations, construct state space models, and compute controllability/observability Gramians and Hankel singular values.

Result: The framework provides principled measures of neuron importance (controllability for input influence, observability for output influence) and identifies dominant internal modes through Hankel singular values.

Conclusion: The method transforms neural networks into local white-box dynamical models, revealing which internal directions are candidates for pruning or constraints to improve interpretability.

Abstract: Deep neural networks achieve state of the art performance but remain difficult to interpret mechanistically. In this work, we propose a control theoretic framework that treats a trained neural network as a nonlinear state space system and uses local linearization, controllability and observability Gramians, and Hankel singular values to analyze its internal computation. For a given input, we linearize the network around the corresponding hidden activation pattern and construct a state space model whose state consists of hidden neuron activations. The input state and state output Jacobians define local controllability and observability Gramians, from which we compute Hankel singular values and associated modes. These quantities provide a principled notion of neuron and pathway importance: controllability measures how easily each neuron can be excited by input perturbations, observability measures how strongly each neuron influences the output, and Hankel singular values rank internal modes that carry input output energy. We illustrate the framework on simple feedforward networks, including a 1 2 2 1 SwiGLU network and a 2 3 3 2 GELU network. By comparing different operating points, we show how activation saturation reduces controllability, shrinks the dominant Hankel singular value, and shifts the dominant internal mode to a different subset of neurons. The proposed method turns a neural network into a collection of local white box dynamical models and suggests which internal directions are natural candidates for pruning or constraints to improve interpretability.

[979] INC: An Indirect Neural Corrector for Auto-Regressive Hybrid PDE Solvers

Hao Wei, Aleksandra Franz, Bjoern List, Nils Thuerey

Main category: cs.LG

TL;DR: Proposes Indirect Neural Corrector (INC) that integrates learned corrections into governing equations instead of direct state updates, reducing autoregressive errors in hybrid PDE solvers for stable long-term simulations.

DetailsMotivation: Hybrid solvers combining coarse numerical solvers with learned correctors suffer from significant autoregressive errors due to amplified perturbations accumulating during long-term rollouts, especially in chaotic regimes.

Method: INC integrates learned corrections into the governing equations rather than applying direct state updates, reducing error amplification by order of Δt⁻¹ + L where Δt is timestep and L is Lipschitz constant. Works with arbitrary neural networks and solvers.

Result: INC improves long-term trajectory performance (R²) by up to 158.7%, stabilizes blowups under aggressive coarsening, and yields speed-ups of several orders of magnitude for complex 3D turbulence cases.

Conclusion: INC enables stable, efficient PDE emulation with formal error reduction, paving the way for faster scientific and engineering simulations with reliable physics guarantees.

Abstract: When simulating partial differential equations, hybrid solvers combine coarse numerical solvers with learned correctors. They promise accelerated simulations while adhering to physical constraints. However, as shown in our theoretical framework, directly applying learned corrections to solver outputs leads to significant autoregressive errors, which originate from amplified perturbations that accumulate during long-term rollouts, especially in chaotic regimes. To overcome this, we propose the Indirect Neural Corrector ((\mathrm{INC})), which integrates learned corrections into the governing equations rather than applying direct state updates. Our key insight is that (\mathrm{INC}) reduces the error amplification on the order of (Δt^{-1} + L), where (Δt) is the timestep and $L$ the Lipschitz constant. At the same time, our framework poses no architectural requirements and integrates seamlessly with arbitrary neural networks and solvers. We test (\mathrm{INC}) in extensive benchmarks, covering numerous differentiable solvers, neural backbones, and test cases ranging from a 1D chaotic system to 3D turbulence. INC improves the long-term trajectory performance ((R^2)) by up to 158.7%, stabilizes blowups under aggressive coarsening, and for complex 3D turbulence cases yields speed-ups of several orders of magnitude. INC thus enables stable, efficient PDE emulation with formal error reduction, paving the way for faster scientific and engineering simulations with reliable physics guarantees. Our source code is available at https://github.com/tum-pbs/INC

[980] An approach of deep reinforcement learning for maximizing the net present value of stochastic projects

Wei Xu, Fan Yang, Qinyuan Cui, Zhi Chen

Main category: cs.LG

TL;DR: DDQN approach for project optimization with stochastic durations and cash flows maximizes expected NPV by accelerating inflows and deferring outflows, outperforming traditional methods in large-scale uncertain environments.

DetailsMotivation: To address project optimization with stochastic activity durations and cash flows under discrete scenarios, where traditional rigid and dynamic strategies struggle with large-scale or highly uncertain environments.

Method: Formulated as discrete-time Markov Decision Process (MDP) and solved using Double Deep Q-Network (DDQN) approach with dual-network architecture and target network.

Result: DDQN outperforms traditional strategies, showing superior computational capability, policy reliability, and adaptability. Ablation studies confirm dual-network reduces action value overestimation and target network improves training convergence.

Conclusion: DDQN achieves higher expected NPV in complex project optimization and provides a reliable framework for stable policy implementation in stochastic project environments.

Abstract: This paper investigates a project with stochastic activity durations and cash flows under discrete scenarios, where activities must satisfy precedence constraints generating cash inflows and outflows. The objective is to maximize expected net present value (NPV) by accelerating inflows and deferring outflows. We formulate the problem as a discrete-time Markov Decision Process (MDP) and propose a Double Deep Q-Network (DDQN) approach. Comparative experiments demonstrate that DDQN outperforms traditional rigid and dynamic strategies, particularly in large-scale or highly uncertain environments, exhibiting superior computational capability, policy reliability, and adaptability. Ablation studies further reveal that the dual-network architecture mitigates overestimation of action values, while the target network substantially improves training convergence and robustness. These results indicate that DDQN not only achieves higher expected NPV in complex project optimization but also provides a reliable framework for stable and effective policy implementation.

[981] MolEdit: Knowledge Editing for Multimodal Molecule Language Models

Zhenyu Lei, Patrick Soga, Yaochen Zhu, Yinhan He, Yushun Dong, Jundong Li

Main category: cs.LG

TL;DR: MolEdit is a novel framework for editing molecule language models (MoLMs) that enables targeted modifications while preserving unrelated molecular knowledge through specialized experts and an editing switcher.

DetailsMotivation: Molecule language models can encode and propagate inaccuracies from training data, but knowledge editing for MoLMs remains unexplored despite being crucial for reliable biomedical and chemical applications.

Method: Proposes MolEdit framework with Multi-Expert Knowledge Adapter that routes edits to specialized experts for different molecular facets, and Expertise-Aware Editing Switcher that activates adapters only when input matches stored edits.

Result: MolEdit achieves up to 18.8% higher Reliability and 12.0% better Locality than baselines across extensive experiments on two popular MoLM backbones while maintaining efficiency.

Conclusion: MolEdit provides an effective solution for editing molecule language models, addressing unique challenges in molecular knowledge editing and enabling more reliable downstream discovery pipelines.

Abstract: Understanding and continuously refining multimodal molecular knowledge is crucial for advancing biomedicine, chemistry, and materials science. Molecule language models (MoLMs) have become powerful tools in these domains, integrating structural representations (e.g., SMILES strings, molecular graphs) with rich contextual descriptions (e.g., physicochemical properties). However, MoLMs can encode and propagate inaccuracies due to outdated web-mined training corpora or malicious manipulation, jeopardizing downstream discovery pipelines. While knowledge editing has been explored for general-domain AI, its application to MoLMs remains uncharted, presenting unique challenges due to the multifaceted and interdependent nature of molecular knowledge. In this paper, we take the first step toward MoLM editing for two critical tasks: molecule-to-caption generation and caption-to-molecule generation. To address molecule-specific challenges, we propose MolEdit, a powerful framework that enables targeted modifications while preserving unrelated molecular knowledge. MolEdit combines a Multi-Expert Knowledge Adapter that routes edits to specialized experts for different molecular facets with an Expertise-Aware Editing Switcher that activates the adapters only when input closely matches the stored edits across all expertise, minimizing interference with unrelated knowledge. To systematically evaluate editing performance, we introduce MEBench, a comprehensive benchmark assessing multiple dimensions, including Reliability (accuracy of the editing), Locality (preservation of irrelevant knowledge), and Generality (robustness to reformed queries). Across extensive experiments on two popular MoLM backbones, MolEdit delivers up to 18.8% higher Reliability and 12.0% better Locality than baselines while maintaining efficiency. The code is available at: https://github.com/LzyFischer/MolEdit.

[982] Physics-Constrained Adaptive Neural Networks Enable Real-Time Semiconductor Manufacturing Optimization with Minimal Training Data

Rubén Darío Guerrero

Main category: cs.LG

TL;DR: A physics-constrained adaptive learning framework for EUV lithography optimization that achieves sub-nanometer precision with minimal training data through automatic calibration of electromagnetic approximations and cross-geometry generalization.

DetailsMotivation: Address the computational crisis in semiconductor EUV lithography where traditional methods consume billions of CPU hours and fail to achieve sub-nanometer precision needed for advanced manufacturing.

Method: Physics-constrained adaptive learning framework with learnable parameters for electromagnetic approximations, integrating differentiable modules for Fresnel diffraction, material absorption, optical blur, phase-shift effects, and contrast modulation with geometric pattern matching objectives.

Result: Achieved consistent sub-nanometer EPE performance (0.664-2.536 nm range) using only 50 training samples per pattern, with 69.9% average improvement over CNN baselines and 90% fewer training samples through cross-geometry generalization.

Conclusion: Establishes physics-constrained adaptive learning as a foundational methodology for real-time semiconductor manufacturing optimization, bridging the gap between academic physics-informed neural networks and industrial deployment requirements.

Abstract: The semiconductor industry faces a computational crisis in extreme ultraviolet (EUV) lithography optimization, where traditional methods consume billions of CPU hours while failing to achieve sub-nanometer precision. We present a physics-constrained adaptive learning framework that automatically calibrates electromagnetic approximations through learnable parameters $\boldsymbolθ = {θ_d, θ_a, θ_b, θ_p, θ_c}$ while simultaneously minimizing Edge Placement Error (EPE) between simulated aerial images and target photomasks. The framework integrates differentiable modules for Fresnel diffraction, material absorption, optical point spread function blur, phase-shift effects, and contrast modulation with direct geometric pattern matching objectives, enabling cross-geometry generalization with minimal training data. Through physics-constrained learning on 15 representative patterns spanning current production to future research nodes, we demonstrate consistent sub-nanometer EPE performance (0.664-2.536 nm range) using only 50 training samples per pattern. Adaptive physics learning achieves an average improvement of 69.9% over CNN baselines without physics constraints, with a significant inference speedup over rigorous electromagnetic solvers after training completion. This approach requires 90% fewer training samples through cross-geometry generalization compared to pattern-specific CNN training approaches. This work establishes physics-constrained adaptive learning as a foundational methodology for real-time semiconductor manufacturing optimization, addressing the critical gap between academic physics-informed neural networks and industrial deployment requirements through joint physics calibration and manufacturing precision objectives.

[983] Contrastive Entropy Bounds for Density and Conditional Density Decomposition

Bo Hu, Jose C. Principe

Main category: cs.LG

TL;DR: This paper analyzes neural network interpretability through a Bayesian Gaussian framework, showing that autoencoders maximize Gaussian operator trace while MDNs can use nuclear norm as divergence. It proposes Hilbert space methods to improve sample diversity and prevent trivial solutions.

DetailsMotivation: To understand neural network feature interpretability from a probabilistic perspective, connecting optimization objectives to probabilistic bounds and Gaussian mixture densities.

Method: Uses Hilbert space decomposition and Gaussian operators to analyze neural networks. Proposes trace maximization for autoencoders and nuclear norm for MDNs. Introduces encoder-mixture-decoder architecture with multiple-output decoders.

Result: Shows autoencoder objective equals maximizing trace of Gaussian operator. Nuclear norm can serve as divergence for MDNs. Hilbert space bounds increase sample diversity and prevent trivial constant outputs.

Conclusion: Bayesian Gaussian view provides interpretable framework for neural networks, with Hilbert space methods offering quantitative analysis of bounds and improved training objectives.

Abstract: This paper studies the interpretability of neural network features from a Bayesian Gaussian view, where optimizing a cost is reaching a probabilistic bound; learning a model approximates a density that makes the bound tight and the cost optimal, often with a Gaussian mixture density. The two examples are Mixture Density Networks (MDNs) using the bound for the marginal and autoencoders using the conditional bound. It is a known result, not only for autoencoders, that minimizing the error between inputs and outputs maximizes the dependence between inputs and the middle. We use Hilbert space and decomposition to address cases where a multiple-output network produces multiple centers defining a Gaussian mixture. Our first finding is that an autoencoder’s objective is equivalent to maximizing the trace of a Gaussian operator, the sum of eigenvalues under bases orthonormal w.r.t. the data and model distributions. This suggests that, when a one-to-one correspondence as needed in autoencoders is unnecessary, we can instead maximize the nuclear norm of this operator, the sum of singular values, to maximize overall rank rather than trace. Thus the trace of a Gaussian operator can be used to train autoencoders, and its nuclear norm can be used as divergence to train MDNs. Our second test uses inner products and norms in a Hilbert space to define bounds and costs. Such bounds often have an extra norm compared to KL-based bounds, which increases sample diversity and prevents the trivial solution where a multiple-output network produces the same constant, at the cost of requiring a sample batch to estimate and optimize. We propose an encoder-mixture-decoder architecture whose decoder is multiple-output, producing multiple centers per sample, potentially tightening the bound. Assuming the data are small-variance Gaussian mixtures, this upper bound can be tracked and analyzed quantitatively.

[984] Assessing Automated Fact-Checking for Medical LLM Responses with Knowledge Graphs

Shasha Zhou, Mingyu Huang, Jack Cole, Charles Britton, Ming Yin, Jan Wolber, Ke Li

Main category: cs.LG

TL;DR: FAITH framework uses medical knowledge graphs for automated factuality evaluation of LLM responses in healthcare without needing reference answers, achieving high correlation with clinician judgments.

DetailsMotivation: Deploying LLMs in high-stakes healthcare requires rigorous verification to understand potential harm, necessitating automated factuality assessment methods.

Method: FAITH framework decomposes LLM responses into atomic claims, links them to medical knowledge graphs, and scores based on evidence paths without requiring reference answers.

Result: KG-grounded evaluation achieves high correlation with clinician judgments, effectively distinguishes LLMs with varying capabilities, and is robust to textual variances with explainable scoring.

Conclusion: While limitations exist, leveraging knowledge graphs is a prominent direction for automated factuality assessment in healthcare.

Abstract: The recent proliferation of large language models (LLMs) holds the potential to revolutionize healthcare, with strong capabilities in diverse medical tasks. Yet, deploying LLMs in high-stakes healthcare settings requires rigorous verification and validation to understand any potential harm. This paper investigates the reliability and viability of using medical knowledge graphs (KGs) for the automated factuality evaluation of LLM-generated responses. To ground this investigation, we introduce FAITH, a framework designed to systematically probe the strengths and limitations of this KG-based approach. FAITH operates without reference answers by decomposing responses into atomic claims, linking them to a medical KG, and scoring them based on evidence paths. Experiments on diverse medical tasks with human subjective evaluations demonstrate that KG-grounded evaluation achieves considerably higher correlations with clinician judgments and can effectively distinguish LLMs with varying capabilities. It is also robust to textual variances. The inherent explainability of its scoring can further help users understand and mitigate the limitations of current LLMs. We conclude that while limitations exist, leveraging KGs is a prominent direction for automated factuality assessment in healthcare.

[985] LinkedIn Profile Characteristics and Professional Success Indicators

Tania-Amanda Fredrick Eneye, Ashlesha Malla, Pawan Paudel

Main category: cs.LG

TL;DR: LinkedIn profile analysis using ML shows promotions are highly predictable but follower growth is more complex, offering career optimization insights.

DetailsMotivation: To understand how LinkedIn profile characteristics relate to professional success metrics like promotions, follower count, and career progression.

Method: Analyzed 62,000+ anonymized LinkedIn profiles using machine learning predictive models to identify influential success factors.

Result: Promotions are highly predictable from profile data, while follower growth shows greater complexity and is harder to predict.

Conclusion: The research provides actionable insights for professionals to optimize their LinkedIn presence and career development strategies.

Abstract: This study explores the relationship between LinkedIn profile characteristics and professional success, focusing on the indicators of promotions, follower count, and career progression rate. By leveraging a dataset of over 62,000 anonymized LinkedIn profiles, we developed predictive models using machine learning techniques to identify the most influential factors driving professional success. Results indicate that while promotions are highly predictable, follower growth exhibits greater complexity. This research provides actionable insights for professionals seeking to optimize their LinkedIn presence and career strategies.

[986] An Evaluation of Representation Learning Methods in Particle Physics Foundation Models

Michael Chen, Raghav Kansal, Abhijith Gandrakota, Zichun Hao, Jennifer Ngadiuba, Maria Spiropulu

Main category: cs.LG

TL;DR: Systematic comparison of representation learning objectives for particle physics using a unified transformer framework, achieving state-of-the-art performance with targeted supervised modifications.

DetailsMotivation: To provide a controlled comparison of different representation learning objectives (contrastive, masked modeling, generative reconstruction) for particle physics, enabling transparent progress in foundation model development.

Method: Used shared transformer-based particle-cloud encoder with standardized preprocessing, matched sampling, and consistent evaluation protocol on jet classification dataset. Compared multiple objectives under common training regimen.

Result: Introduced targeted supervised architectural modifications that achieved state-of-the-art performance on benchmark evaluations. Isolated contributions of learning objectives and highlighted their respective strengths.

Conclusion: This work serves as a reference point for future foundation model development in particle physics, providing reproducible baselines for more transparent and robust community progress.

Abstract: We present a systematic evaluation of representation learning objectives for particle physics within a unified framework. Our study employs a shared transformer-based particle-cloud encoder with standardized preprocessing, matched sampling, and a consistent evaluation protocol on a jet classification dataset. We compare contrastive (supervised and self-supervised), masked particle modeling, and generative reconstruction objectives under a common training regimen. In addition, we introduce targeted supervised architectural modifications that achieve state-of-the-art performance on benchmark evaluations. This controlled comparison isolates the contributions of the learning objective, highlights their respective strengths and limitations, and provides reproducible baselines. We position this work as a reference point for the future development of foundation models in particle physics, enabling more transparent and robust progress across the community.

[987] On the Information Processing of One-Dimensional Wasserstein Distances with Finite Samples

Cheongjae Jang, Jonghyun Won, Soyeon Jun, Chun Kee Chung, Keehyoung Joo, Yung-Kyun Noh

Main category: cs.LG

TL;DR: Analysis of 1D Wasserstein distance’s ability to capture pointwise density differences when supports overlap, using Poisson processes and empirical validation with neural spike trains and amino acid data.

DetailsMotivation: To understand whether Wasserstein distance can identify pointwise density differences when supports significantly overlap but densities differ substantially, especially in finite-sample settings.

Method: Used Poisson process analysis to isolate rate factors, demonstrating how Wasserstein distance captures pointwise density differences and harmonizes with support differences.

Result: The 1D Wasserstein distance successfully highlights meaningful density differences related to both rate and support, as confirmed by neural spike train decoding and amino acid contact frequency data.

Conclusion: Wasserstein distance can effectively capture both support and pointwise density differences, providing valuable information even when supports overlap significantly.

Abstract: Leveraging the Wasserstein distance – a summation of sample-wise transport distances in data space – is advantageous in many applications for measuring support differences between two underlying density functions. However, when supports significantly overlap while densities exhibit substantial pointwise differences, it remains unclear whether and how this transport information can accurately identify these differences, particularly their analytic characterization in finite-sample settings. We address this issue by conducting an analysis of the information processing capabilities of the one-dimensional Wasserstein distance with finite samples. By utilizing the Poisson process and isolating the rate factor, we demonstrate the capability of capturing the pointwise density difference with Wasserstein distances and how this information harmonizes with support differences. The analyzed properties are confirmed using neural spike train decoding and amino acid contact frequency data. The results reveal that the one-dimensional Wasserstein distance highlights meaningful density differences related to both rate and support.

[988] Method of Manufactured Learning for Solver-free Training of Neural Operators

Arth Sojitra, Omer San

Main category: cs.LG

TL;DR: MML is a solver-independent framework that trains neural operators using analytically constructed physics-consistent datasets instead of numerical simulations, achieving high accuracy across various PDE benchmarks.

DetailsMotivation: Traditional neural operator training requires extensive datasets from expensive numerical solvers or experiments, limiting scalability and exploration across physical systems.

Method: MML replaces numerical data generation with functional synthesis - sampling smooth analytical solutions and deriving corresponding forcing fields by applying governing differential operators directly.

Result: MML achieves high spectral accuracy, low residual errors, and strong generalization across heat, advection, Burgers, and diffusion-reaction equations using Fourier neural operators.

Conclusion: MML provides a scalable, solver-agnostic pathway for constructing physically grounded neural operators that maintain fidelity to governing laws without expensive numerical simulations.

Abstract: Training neural operators to approximate mappings between infinite-dimensional function spaces often requires extensive datasets generated by either demanding experimental setups or computationally expensive numerical solvers. This dependence on solver-based data limits scalability and constrains exploration across physical systems. Here we introduce the Method of Manufactured Learning (MML), a solver-independent framework for training neural operators using analytically constructed, physics-consistent datasets. Inspired by the classical method of manufactured solutions, MML replaces numerical data generation with functional synthesis, i.e., smooth candidate solutions are sampled from controlled analytical spaces, and the corresponding forcing fields are derived by direct application of the governing differential operators. During inference, setting these forcing terms to zero restores the original governing equations, allowing the trained neural operator to emulate the true solution operator of the system. The framework is agnostic to network architecture and can be integrated with any operator learning paradigm. In this paper, we employ Fourier neural operator as a representative example. Across canonical benchmarks including heat, advection, Burgers, and diffusion-reaction equations. MML achieves high spectral accuracy, low residual errors, and strong generalization to unseen conditions. By reframing data generation as a process of analytical synthesis, MML offers a scalable, solver-agnostic pathway toward constructing physically grounded neural operators that retain fidelity to governing laws without reliance on expensive numerical simulations or costly experimental data for training.

[989] Functional Mean Flow in Hilbert Space

Zhiqi Li, Yuchen Sun, Greg Turk, Bo Zhu

Main category: cs.LG

TL;DR: FMF is a one-step generative model in Hilbert space that extends Mean Flow to functional domains, with improved stability via x₁-prediction.

DetailsMotivation: To create a practical one-step Flow Matching method for functional data generation tasks like time series, images, PDEs, and 3D geometry.

Method: Extends Mean Flow to functional domains with Functional Flow Matching theory and practical implementation, introducing x₁-prediction for stability.

Result: Developed a framework that enables efficient training and sampling for functional data generation.

Conclusion: FMF provides a practical one-step Flow Matching approach applicable to diverse functional data generation tasks.

Abstract: We present Functional Mean Flow (FMF) as a one-step generative model defined in infinite-dimensional Hilbert space. FMF extends the one-step Mean Flow framework to functional domains by providing a theoretical formulation for Functional Flow Matching and a practical implementation for efficient training and sampling. We also introduce an $x_1$-prediction variant that improves stability over the original $u$-prediction form. The resulting framework is a practical one-step Flow Matching method applicable to a wide range of functional data generation tasks such as time series, images, PDEs, and 3D geometry.

[990] Global Cross-Time Attention Fusion for Enhanced Solar Flare Prediction from Multivariate Time Series

Onur Vural, Shah Muhammad Hamdi, Soukaina Filali Boubrahimi

Main category: cs.LG

TL;DR: Proposes GCTAF, a transformer-based model with global cross-attention tokens to improve solar flare prediction by capturing long-range temporal patterns in imbalanced multivariate time series data.

DetailsMotivation: Address the challenge of imbalanced solar flare occurrences where intense flares are rare, and improve long-range temporal modeling for better prediction of disruptive solar flare events.

Method: Global Cross-Time Attention Fusion (GCTAF) architecture using learnable cross-attentive global tokens that summarize temporal patterns across entire sequences and fuse them back into temporal representations.

Result: GCTAF effectively detects intense flares and improves predictive performance on benchmark solar flare datasets compared to traditional approaches.

Conclusion: Refining transformer-based architectures with global attention mechanisms presents a high-potential alternative for solar flare prediction tasks, enabling better capture of discriminative flare-related dynamics.

Abstract: Multivariate time series classification is increasingly investigated in space weather research as a means to predict intense solar flare events, which can cause widespread disruptions across modern technological systems. Magnetic field measurements of solar active regions are converted into structured multivariate time series, enabling predictive modeling across segmented observation windows. However, the inherently imbalanced nature of solar flare occurrences, where intense flares are rare compared to minor flare events, presents a significant barrier to effective learning. To address this challenge, we propose a novel Global Cross-Time Attention Fusion (GCTAF) architecture, a transformer-based model to enhance long-range temporal modeling. Unlike traditional self-attention mechanisms that rely solely on local interactions within time series, GCTAF injects a set of learnable cross-attentive global tokens that summarize salient temporal patterns across the entire sequence. These tokens are refined through cross-attention with the input sequence and fused back into the temporal representation, enabling the model to identify globally significant, non-contiguous time points that are critical for flare prediction. This mechanism functions as a dynamic attention-driven temporal summarizer that augments the model’s capacity to capture discriminative flare-related dynamics. We evaluate our approach on the benchmark solar flare dataset and show that GCTAF effectively detects intense flares and improves predictive performance, demonstrating that refining transformer-based architectures presents a high-potential alternative for solar flare prediction tasks.

[991] AIF: Asynchronous Inference Framework for Cost-Effective Pre-Ranking

Zhi Kou, Xiang-Rong Sheng, Shuguang Han, Zhishan Zhao, Yueyao Cheng, Han Zhu, Jian Xu, Bo Zheng

Main category: cs.LG

TL;DR: AIF is an asynchronous inference framework that decouples interaction-independent computations from real-time prediction in pre-ranking models, enabling parallel processing and reducing latency while improving model capacity.

DetailsMotivation: Traditional sequential execution in pre-ranking models causes redundant computations of identical users/items and increased latency due to strictly sequential operations, limiting model capacity and system efficiency.

Method: Decouples interaction-independent components from real-time prediction, performs user-side computations in parallel with retrieval stage, conducts item-side computations nearline, and uses approximated methods for interaction-dependent components in online predictions.

Result: Enhanced computational efficiency, reduced latency, freed up resources for improved feature sets and model architecture, and achieved notable performance gains without significant computational/latency cost increases.

Conclusion: AIF successfully addresses bottlenecks in pre-ranking models through asynchronous computation and has been successfully deployed in Taobao display advertising system.

Abstract: In industrial recommendation systems, pre-ranking models based on deep neural networks (DNNs) commonly adopt a sequential execution framework: feature fetching and model forward computation are triggered only after receiving candidates from the upstream retrieval stage. This design introduces inherent bottlenecks, including redundant computations of identical users/items and increased latency due to strictly sequential operations, which jointly constrain the model’s capacity and system efficiency. To address these limitations, we propose the Asynchronous Inference Framework (AIF), a cost-effective computational architecture that decouples interaction-independent components, those operating within a single user or item, from real-time prediction. AIF reorganizes the model inference process by performing user-side computations in parallel with the retrieval stage and conducting item-side computations in a nearline manner. This means that interaction-independent components are calculated just once and completed before the real-time prediction phase of the pre-ranking stage. As a result, AIF enhances computational efficiency and reduces latency, freeing up resources to significantly improve the feature set and model architecture of interaction-independent components. Moreover, we delve into model design within the AIF framework, employing approximated methods for interaction-dependent components in online real-time predictions. By co-designing both the framework and the model, our solution achieves notable performance gains without significantly increasing computational and latency costs. This has enabled the successful deployment of AIF in the Taobao display advertising system.

[992] APT: Affine Prototype-Timestamp For Time Series Forecasting Under Distribution Shift

Yujie Li, Zezhi Shao, Chengqing Yu, Yisong Fu, Tao Sun, Yongjun Xu, Fei Wang

Main category: cs.LG

TL;DR: APT is a lightweight plug-in module that improves time series forecasting under distribution shift by using timestamp-conditioned prototype learning to generate dynamic affine parameters.

DetailsMotivation: Existing methods struggle with distribution shift in time series forecasting, particularly with missing values, noise, and invalid channel-wise transformations.

Method: APT injects global distribution features via timestamp-conditioned prototype learning to dynamically generate affine parameters that modulate input and output series.

Result: Extensive experiments across six datasets show APT significantly improves forecasting performance under distribution shift with minimal computational overhead.

Conclusion: APT effectively addresses distribution shift challenges and is compatible with various forecasting backbones and normalization strategies.

Abstract: Time series forecasting under distribution shift remains challenging, as existing deep learning models often rely on local statistical normalization (e.g., mean and variance) that fails to capture global distribution shift. Methods like RevIN and its variants attempt to decouple distribution and pattern but still struggle with missing values, noisy observations, and invalid channel-wise affine transformation. To address these limitations, we propose Affine Prototype Timestamp (APT), a lightweight and flexible plug-in module that injects global distribution features into the normalization-forecasting pipeline. By leveraging timestamp conditioned prototype learning, APT dynamically generates affine parameters that modulate both input and output series, enabling the backbone to learn from self-supervised, distribution-aware clustered instances. APT is compatible with arbitrary forecasting backbones and normalization strategies while introducing minimal computational overhead. Extensive experiments across six benchmark datasets and multiple backbone-normalization combinations demonstrate that APT significantly improves forecasting performance under distribution shift.

[993] A FEDformer-Based Hybrid Framework for Anomaly Detection and Risk Forecasting in Financial Time Series

Ziling Fan, Ruijia Liang, Yiwen Hu

Main category: cs.LG

TL;DR: Proposes a FEDformer-based hybrid framework for financial anomaly detection and risk forecasting that integrates frequency analysis with residual-based detection, achieving significant improvements over traditional methods.

DetailsMotivation: Financial markets are volatile and prone to disruptions like crashes and liquidity crises. Traditional deep learning models fail to capture long-term dependencies and complex periodic patterns in nonstationary financial data.

Method: Integrates Frequency Enhanced Decomposed Transformer (FEDformer) with residual-based anomaly detector and risk forecasting head. FEDformer models temporal dynamics in time/frequency domains, decomposing signals into trend and seasonal components.

Result: Experiments on S&P 500, NASDAQ Composite, and Brent Crude Oil datasets (2000-2024) show 15.7% RMSE reduction and 11.5% F1-score improvement for anomaly detection compared to benchmarks.

Conclusion: The model effectively captures financial volatility and enables reliable early-warning systems for market crash prediction and risk management.

Abstract: Financial markets are inherently volatile and prone to sudden disruptions such as market crashes, flash collapses, and liquidity crises. Accurate anomaly detection and early risk forecasting in financial time series are therefore crucial for preventing systemic instability and supporting informed investment decisions. Traditional deep learning models, such as LSTM and GRU, often fail to capture long-term dependencies and complex periodic patterns in highly nonstationary financial data. To address this limitation, this study proposes a FEDformer-Based Hybrid Framework for Anomaly Detection and Risk Forecasting in Financial Time Series, which integrates the Frequency Enhanced Decomposed Transformer (FEDformer) with a residual-based anomaly detector and a risk forecasting head. The FEDformer module models temporal dynamics in both time and frequency domains, decomposing signals into trend and seasonal components for improved interpretability. The residual-based detector identifies abnormal fluctuations by analyzing prediction errors, while the risk head predicts potential financial distress using learned latent embeddings. Experiments conducted on the S&P 500, NASDAQ Composite, and Brent Crude Oil datasets (2000-2024) demonstrate the superiority of the proposed model over benchmark methods, achieving a 15.7 percent reduction in RMSE and an 11.5 percent improvement in F1-score for anomaly detection. These results confirm the effectiveness of the model in capturing financial volatility, enabling reliable early-warning systems for market crash prediction and risk management.

[994] RAGPulse: An Open-Source RAG Workload Trace to Optimize RAG Serving Systems

Zhengchao Wang, Yitao Hu, Jianing Ye, Zhuxuan Chang, Jiazheng Yu, Youpeng Deng, Keqiu Li

Main category: cs.LG

TL;DR: RAGPulse is an open-source RAG workload trace dataset collected from a university Q&A system, providing real-world data to bridge the performance gap between RAG research and deployment.

DetailsMotivation: Existing LLM inference traces fail to capture RAG-specific dynamics like multi-stage pipelines and knowledge dependency, creating a performance gap between research and real deployment.

Method: Collected workload traces from a university-wide Q&A system serving 40,000+ users since April 2024, using privacy-preserving hash-based data format and detailed statistical analysis.

Result: Analysis reveals real-world RAG workloads exhibit significant temporal locality and highly skewed hot document access patterns, providing high-fidelity data for optimization research.

Conclusion: RAGPulse enables researchers to develop and validate novel optimization strategies like content-aware batching and retrieval caching to enhance RAG service efficiency and reliability.

Abstract: Retrieval-Augmented Generation (RAG) is a critical paradigm for building reliable, knowledge-intensive Large Language Model (LLM) applications. However, the multi-stage pipeline (retrieve, generate) and unique workload characteristics (e.g., knowledge dependency) of RAG systems pose significant challenges for serving performance optimization. Existing generic LLM inference traces fail to capture these RAG-specific dynamics, creating a significant performance gap between academic research and real-world deployment. To bridge this gap, this paper introduces RAGPulse, an open-source RAG workload trace dataset. This dataset was collected from an university-wide Q&A system serving that has served more than 40,000 students and faculties since April 2024. We detail RAGPulse’s system architecture, its privacy-preserving hash-based data format, and provide an in-depth statistical analysis. Our analysis reveals that real-world RAG workloads exhibit significant temporal locality and a highly skewed hot document access pattern. RAGPulse provides a high-fidelity foundation for researchers to develop and validate novel optimization strategies for RAG systems, such as content-aware batching and retrieval caching, ultimately enhancing the efficiency and reliability of RAG services. The code is available at https://github.com/flashserve/RAGPulse.

[995] Learning Branching Policies for MILPs with Proximal Policy Optimization

Abdelouahed Ben Mhamed, Assia Kamal-Idrissi, Amal El Fallah Seghrouchni

Main category: cs.LG

TL;DR: TGPPO uses reinforcement learning (PPO) to train branching policies for MILP solvers, improving generalization across diverse instances compared to imitation learning approaches.

DetailsMotivation: Existing learning-based branching policies rely on imitation learning which overfits to expert demonstrations and struggles with generalization to structurally diverse or unseen MILP instances.

Method: Proposes Tree-Gate Proximal Policy Optimization (TGPPO) using PPO reinforcement learning with parameterized state space representation that dynamically captures the evolving context of the search tree.

Result: TGPPO outperforms existing learning-based policies in reducing nodes explored and improving p-Primal-Dual Integrals, especially for out-of-distribution instances.

Conclusion: Reinforcement learning shows strong potential for developing robust and adaptable branching strategies that generalize well across heterogeneous MILP instances.

Abstract: Branch-and-Bound (B&B) is the dominant exact solution method for Mixed Integer Linear Programs (MILP), yet its exponential time complexity poses significant challenges for large-scale instances. The growing capabilities of machine learning have spurred efforts to improve B&B by learning data-driven branching policies. However, most existing approaches rely on Imitation Learning (IL), which tends to overfit to expert demonstrations and struggles to generalize to structurally diverse or unseen instances. In this work, we propose Tree-Gate Proximal Policy Optimization (TGPPO), a novel framework that employs Proximal Policy Optimization (PPO), a Reinforcement Learning (RL) algorithm, to train a branching policy aimed at improving generalization across heterogeneous MILP instances. Our approach builds on a parameterized state space representation that dynamically captures the evolving context of the search tree. Empirical evaluations show that TGPPO often outperforms existing learning-based policies in terms of reducing the number of nodes explored and improving p-Primal-Dual Integrals (PDI), particularly in out-of-distribution instances. These results highlight the potential of RL to develop robust and adaptable branching strategies for MILP solvers.

[996] Angular Gradient Sign Method: Uncovering Vulnerabilities in Hyperbolic Networks

Minsoo Jo, Dongyoon Yang, Taesup Kim

Main category: cs.LG

TL;DR: Proposes a geometry-aware adversarial attack for hyperbolic neural networks that leverages angular perturbations in hyperbolic space, achieving higher fooling rates than conventional Euclidean attacks.

DetailsMotivation: Existing adversarial attacks like FGSM and PGD ignore hyperbolic geometry, leading to inefficient attacks on hyperbolic networks that require geometry-aware strategies.

Method: Computes loss gradient in hyperbolic tangent space, decomposes into radial and angular components, and applies perturbations only in angular directions to target semantic vulnerabilities.

Result: Achieves higher fooling rates than conventional attacks on image classification and cross-modal retrieval tasks, with deeper insights into hyperbolic embedding vulnerabilities.

Conclusion: Highlights the importance of geometry-aware adversarial strategies in curved representation spaces and provides a principled framework for attacking hierarchical embeddings.

Abstract: Adversarial examples in neural networks have been extensively studied in Euclidean geometry, but recent advances in \textit{hyperbolic networks} call for a reevaluation of attack strategies in non-Euclidean geometries. Existing methods such as FGSM and PGD apply perturbations without regard to the underlying hyperbolic structure, potentially leading to inefficient or geometrically inconsistent attacks. In this work, we propose a novel adversarial attack that explicitly leverages the geometric properties of hyperbolic space. Specifically, we compute the gradient of the loss function in the tangent space of hyperbolic space, decompose it into a radial (depth) component and an angular (semantic) component, and apply perturbation derived solely from the angular direction. Our method generates adversarial examples by focusing perturbations in semantically sensitive directions encoded in angular movement within the hyperbolic geometry. Empirical results on image classification, cross-modal retrieval tasks and network architectures demonstrate that our attack achieves higher fooling rates than conventional adversarial attacks, while producing high-impact perturbations with deeper insights into vulnerabilities of hyperbolic embeddings. This work highlights the importance of geometry-aware adversarial strategies in curved representation spaces and provides a principled framework for attacking hierarchical embeddings.

[997] Are Graph Transformers Necessary? Efficient Long-Range Message Passing with Fractal Nodes in MPNNs

Jeongwhan Choi, Seungjun Park, Sumin Park, Sung-Bae Cho, Noseong Park

Main category: cs.LG

TL;DR: Proposes fractal nodes to enhance GNNs by adding subgraph-level representations that capture fractal patterns in graphs, improving long-range dependencies while maintaining MPNN efficiency.

DetailsMotivation: GNNs struggle with balancing local and global information, and graph Transformers often overlook MPNN's locality and efficiency. Real-world networks exhibit fractal structures that can be leveraged.

Method: Introduces fractal nodes that coexist with original nodes and adaptively aggregate subgraph-level features, creating shortcut connections for long-range information propagation through graph partitioning.

Result: The method alleviates over-squashing, improves expressive power of MPNNs, and achieves comparable or better performance than graph Transformers while maintaining MPNN computational efficiency.

Conclusion: Fractal nodes effectively bridge the gap between MPNN efficiency and Transformer-like long-range modeling by leveraging inherent fractal structures in graphs.

Abstract: Graph Neural Networks (GNNs) have emerged as powerful tools for learning on graph-structured data, but often struggle to balance local and global information. While graph Transformers aim to address this by enabling long-range interactions, they often overlook the inherent locality and efficiency of Message Passing Neural Networks (MPNNs). We propose a new concept called fractal nodes, inspired by the fractal structure observed in real-world networks. Our approach is based on the intuition that graph partitioning naturally induces fractal structure, where subgraphs often reflect the connectivity patterns of the full graph. Fractal nodes are designed to coexist with the original nodes and adaptively aggregate subgraph-level feature representations, thereby enforcing feature similarity within each subgraph. We show that fractal nodes alleviate the over-squashing problem by providing direct shortcut connections that enable long-range propagation of subgraph-level representations. Experiment results show that our method improves the expressive power of MPNNs and achieves comparable or better performance to graph Transformers while maintaining the computational efficiency of MPNN by improving the long-range dependencies of MPNN.

[998] The Good, The Bad, and The Hybrid: A Reward Structure Showdown in Reasoning Models Training

Subramanyam Sahoo

Main category: cs.LG

TL;DR: Proposes a unified framework for reward design in RLHF, evaluating hard, continuous, and hybrid reward structures for fine-tuning LLMs on mathematical reasoning tasks using Qwen3-4B on GSM8K dataset.

DetailsMotivation: Reward design is central to RLHF and alignment research, but current approaches lack systematic comparison of different reward structures (hard, continuous, hybrid) for mathematical reasoning tasks.

Method: Used Qwen3-4B with LoRA fine-tuning on GSM8K dataset, formalized and evaluated reward formulations incorporating correctness, perplexity, reasoning quality, and consistency. Introduced adaptive hybrid reward scheduler that transitions between discrete and continuous signals.

Result: Hybrid reward structures improved convergence speed and training stability compared to purely hard or continuous approaches.

Conclusion: Hybrid reward structures offer better performance for alignment via adaptive reward modeling, balancing exploration and stability during fine-tuning.

Abstract: Reward design is central to reinforcement learning from human feedback (RLHF) and alignment research. In this work, we propose a unified framework to study hard, continuous, and hybrid reward structures for fine-tuning large language models (LLMs) on mathematical reasoning tasks. Using Qwen3-4B with LoRA fine-tuning on the GSM8K dataset, we formalize and empirically evaluate reward formulations that incorporate correctness, perplexity, reasoning quality, and consistency. We introduce an adaptive hybrid reward scheduler that transitions between discrete and continuous signals, balancing exploration and stability. Our results show that hybrid reward structures improve convergence speed and training stability over purely hard or continuous approaches, offering insights for alignment via adaptive reward modeling.

[999] The Final-Stage Bottleneck: A Systematic Dissection of the R-Learner for Network Causal Inference

Sairam S, Sara Girdhar, Shivam Soni

Main category: cs.LG

TL;DR: The R-Learner framework fails on graph data due to a catastrophic “representation bottleneck” where graph-blind final stages cause complete failure, while graph-aware final stages succeed significantly.

DetailsMotivation: To systematically analyze the R-Learner framework on network data where causal heterogeneity is graph-dependent, challenging its core assumption of a well-specified final-stage model.

Method: Conducted large-scale empirical study with diverse synthetic and semi-synthetic benchmarks, comparing R-Learners with different final-stage CATE estimators and nuisance models, including proposed end-to-end Graph R-Learner.

Result: Found overwhelming statistical evidence (p < 0.001) that R-Learners with graph-blind final stages fail completely (MSE > 4.0), while Graph R-Learner significantly outperforms strong non-DML GNN T-Learner baseline. Identified topology-dependent “nuisance bottleneck” linked to GNN over-squashing.

Conclusion: The primary performance driver is the inductive bias of the final-stage CATE estimator, not nuisance model choice, revealing a critical “final-stage bottleneck” that must be addressed for successful causal inference on graphs.

Abstract: The R-Learner is a powerful, theoretically-grounded framework for estimating heterogeneous treatment effects, prized for its robustness to nuisance model errors. However, its application to network data, where causal heterogeneity is often graph-dependent, presents a critical challenge to its core assumption of a well-specified final-stage model. In this paper, we conduct a large-scale empirical study to systematically dissect the R-Learner framework on graphs. We provide the first rigorous evidence that the primary driver of performance is the inductive bias of the final-stage CATE estimator, an effect that dominates the choice of nuisance models. Our central finding is the quantification of a catastrophic “representation bottleneck”: we prove with overwhelming statistical significance (p < 0.001) that R-Learners with a graph-blind final stage fail completely (MSE > 4.0), even when paired with powerful GNN nuisance models. Conversely, our proposed end-to-end Graph R-Learner succeeds and significantly outperforms a strong, non-DML GNN T-Learner baseline. Furthermore, we identify and provide a mechanistic explanation for a subtle, topology-dependent “nuisance bottleneck,” linking it to GNN over-squashing via a targeted “Hub-Periphery Trade-off” analysis. Our findings are validated across diverse synthetic and semi-synthetic benchmarks. We release our code as a reproducible benchmark to facilitate future research on this critical “final-stage bottleneck.”

[1000] Learning Time-Scale Invariant Population-Level Neural Representations

Eshani Patel, Yisong Yue, Geeling Chau

Main category: cs.LG

TL;DR: TSAP improves neural foundation model robustness to time-scale mismatches through time-scale augmented pretraining.

DetailsMotivation: Existing neural time series models are sensitive to time-scale mismatches between pretraining and downstream tasks, limiting their generalization.

Method: Time-scale Augmented Pretraining (TSAP) that builds invariance to different time-scales in the representation space.

Result: TSAP consistently improves robustness to time-scale variations across decoding tasks and creates more invariant representations.

Conclusion: Handling preprocessing diversity is crucial for building generalizable neural foundation models.

Abstract: General-purpose foundation models for neural time series can help accelerate neuroscientific discoveries and enable applications such as brain computer interfaces (BCIs). A key component in scaling these models is population-level representation learning, which leverages information across channels to capture spatial as well as temporal structure. Population-level approaches have recently shown that such representations can be both efficient to learn on top of pretrained temporal encoders and produce useful representations for decoding a variety of downstream tasks. However, these models remain sensitive to mismatches in preprocessing, particularly on time-scales, between pretraining and downstream settings. We systematically examine how time-scale mismatches affects generalization and find that existing representations lack invariance. To address this, we introduce Time-scale Augmented Pretraining (TSAP), which consistently improves robustness to different time-scales across decoding tasks and builds invariance in the representation space. These results highlight handling preprocessing diversity as a key step toward building generalizable neural foundation models.

[1001] SLMQuant:Benchmarking Small Language Model Quantization for Practical Deployment

Jiacheng Wang, Yejun Zeng, Jinyang Guo, Yuqing Ma, Aishan Liu, Xianglong Liu

Main category: cs.LG

TL;DR: SLMQuant is the first systematic benchmark for evaluating LLM compression techniques on Small Language Models (SLMs), revealing fundamental differences in quantization sensitivity and providing design principles for SLM-tailored compression.

DetailsMotivation: SLMs face deployment challenges on edge devices due to unresolved efficiency gaps in model compression, with quantization methods for LLMs being underexplored for SLMs despite their different characteristics.

Method: Comprehensive multi-track evaluations across diverse SLM architectures and tasks, analyzing how state-of-the-art quantization methods perform on SLMs compared to LLMs.

Result: Reveals fundamental disparities between SLMs and LLMs in quantization sensitivity, showing that direct transfer of LLM-optimized techniques leads to suboptimal results due to SLMs’ unique architectural characteristics and training dynamics.

Conclusion: SLMQuant establishes a foundational framework for advancing efficient SLM deployment on low-end edge devices and provides critical insights for deploying lightweight language models in resource-constrained scenarios.

Abstract: Despite the growing interest in Small Language Models (SLMs) as resource-efficient alternatives to Large Language Models (LLMs), their deployment on edge devices remains challenging due to unresolved efficiency gaps in model compression. While quantization has proven effective for LLMs, its applicability to SLMs is significantly underexplored, with critical questions about differing quantization bottlenecks and efficiency profiles. This paper introduces SLMQuant, the first systematic benchmark for evaluating LLM compression techniques when applied to SLMs. Through comprehensive multi-track evaluations across diverse architectures and tasks, we analyze how state-of-the-art quantization methods perform on SLMs. Our findings reveal fundamental disparities between SLMs and LLMs in quantization sensitivity, demonstrating that direct transfer of LLM-optimized techniques leads to suboptimal results due to SLMs’ unique architectural characteristics and training dynamics. We identify key factors governing effective SLM quantization and propose actionable design principles for SLM-tailored compression. SLMQuant establishes a foundational framework for advancing efficient SLM deployment on low-end devices in edge applications, and provides critical insights for deploying lightweight language models in resource-constrained scenarios.

[1002] One-Step Generative Policies with Q-Learning: A Reformulation of MeanFlow

Zeyuan Wang, Da Li, Yulin Chen, Ye Shi, Liang Bai, Tianyuan Yu, Yanwei Fu

Main category: cs.LG

TL;DR: Proposes a one-step generative policy for offline RL using a residual reformulation of MeanFlow that maps noise directly to actions, enabling expressive multimodal action modeling and stable Q-learning in single-stage training.

DetailsMotivation: Existing one-step Gaussian policies struggle with complex multimodal action distributions, while flow-based methods often require distillation and two-stage training when used with Q-learning.

Method: Reformulates MeanFlow to integrate velocity field and noise-to-action transformation into a single policy network, using a residual formulation for direct noise-to-action generation compatible with Q-learning.

Result: Achieves strong performance on 73 tasks across OGBench and D4RL benchmarks in both offline and offline-to-online RL settings, demonstrating expressive multimodal action modeling.

Conclusion: The proposed method provides efficient one-step generation, expressive multimodal action distribution modeling, and stable single-stage policy learning via Q-learning.

Abstract: We introduce a one-step generative policy for offline reinforcement learning that maps noise directly to actions via a residual reformulation of MeanFlow, making it compatible with Q-learning. While one-step Gaussian policies enable fast inference, they struggle to capture complex, multimodal action distributions. Existing flow-based methods improve expressivity but typically rely on distillation and two-stage training when trained with Q-learning. To overcome these limitations, we propose to reformulate MeanFlow to enable direct noise-to-action generation by integrating the velocity field and noise-to-action transformation into a single policy network-eliminating the need for separate velocity estimation. We explore several reformulation variants and identify an effective residual formulation that supports expressive and stable policy learning. Our method offers three key advantages: 1) efficient one-step noise-to-action generation, 2) expressive modelling of multimodal action distributions, and 3) efficient and stable policy learning via Q-learning in a single-stage training setup. Extensive experiments on 73 tasks across the OGBench and D4RL benchmarks demonstrate that our method achieves strong performance in both offline and offline-to-online reinforcement learning settings. Code is available at https://github.com/HiccupRL/MeanFlowQL.

[1003] Learning from the Undesirable: Robust Adaptation of Language Models without Forgetting

Yunhun Nam, Jaehyung Kim, Jongheon Jeong

Main category: cs.LG

TL;DR: LfU is a regularization method for supervised fine-tuning that prevents overfitting by aligning model representations with those after undesirable gradient updates, improving generalization with limited data.

DetailsMotivation: Standard supervised fine-tuning with limited data causes LMs to overfit, rely on spurious patterns, and lose general capabilities. A regularization approach is needed to maintain pretrained knowledge while specializing.

Method: Learning-from-the-Undesirable (LfU) applies consistency regularization by aligning internal representations before and after undesirable gradient ascent updates, using representation-level data augmentation.

Result: LfU achieves 16.8% average improvement on math tasks vs vanilla SFT, prevents performance degradation, and shows 92.1% lower standard deviation in output performance across prompt variations.

Conclusion: LfU effectively regularizes fine-tuning to enhance adaptability while preserving pretrained knowledge, demonstrating improved generalization and robustness across diverse LM tasks.

Abstract: Language models (LMs) are often adapted through supervised fine-tuning (SFT) to specialize their capabilities for downstream tasks. However, in typical scenarios where the fine-tuning data is limited, e.g., compared to pre-training, SFT can lead LMs to overfit, causing them to rely on spurious patterns within the target task or to compromise other broadly useful capabilities as a side effect of narrow specialization. In this paper, we propose Learning-from-the-Undesirable (LfU), a simple yet effective regularization scheme for SFT to mitigate overfitting issues when fine-tuning LMs with limited data. Specifically, we aim to regularize the fine-tuning process to favor solutions that are resilient to “undesirable” model updates, e.g., gradient ascent steps that steer the model toward undesirable behaviors. To this end, we propose a novel form of consistency regularization that directly aligns internal representations of the model with those after an undesirable update. By leveraging representation-level data augmentation through undesirable updates, LfU effectively promotes generalization under limited data. Our experiments on diverse LM downstream tasks show that LfU serves as an effective prior that enhances adaptability while preserving pretrained knowledge. For example, our LM from LfU achieves a 16.8% average improvement on math tasks compared to vanilla SFT on the same dataset, where the latter even leads to degraded performance on those tasks. Furthermore, LfU exhibits improved robustness to prompt variations, e.g., yielding a 92.1% lower standard deviation in output performances compared to SFT, highlighting its versatile effects.

[1004] Bi-View Embedding Fusion: A Hybrid Learning Approach for Knowledge Graph’s Nodes Classification Addressing Problems with Limited Data

Rosario Napoli, Giovanni Lonia, Antonio Celesti, Massimo Villari, Maria Fazio

Main category: cs.LG

TL;DR: Bi-View is a hybrid approach that enhances Graph Embeddings for Knowledge Graphs by combining Node2Vec (structural patterns) and GraphSAGE (neighborhood aggregation) with centrality metrics, improving GML model performance without synthetic data.

DetailsMotivation: Traditional ML requires large datasets, limiting sparse scenarios. GML offers alternatives but faces limitations with Knowledge Graphs that hide semantic information. Need to enhance node features without synthetic data.

Method: Combines Node2Vec (unsupervised random walks for topology) and GraphSAGE (supervised neighborhood aggregation). Enriches node features with centrality metrics, then fuses both embeddings to capture topological and semantic properties.

Result: Improves downstream task performance, especially with poor initial features. Enables more accurate KG-enhanced GML models by exploiting hidden informative features.

Conclusion: Bi-View successfully creates enhanced Graph Embeddings that capture both structural and semantic graph properties, providing a foundation for better GML models in data-sparse scenarios without synthetic data.

Abstract: Traditional Machine Learning (ML) methods require large amounts of data to perform well, limiting their applicability in sparse or incomplete scenarios and forcing the usage of additional synthetic data to improve the model training. To overcome this challenge, the research community is looking more and more at Graph Machine Learning (GML) as it offers a powerful alternative by using relationships within data. However, this method also faces limitations, particularly when dealing with Knowledge Graphs (KGs), which can hide huge information due to their semantic nature. This study introduces Bi-View, a novel hybrid approach that increases the informative content of node features in KGs to generate enhanced Graph Embeddings (GEs) that are used to improve GML models without relying on additional synthetic data. The proposed work combines two complementary GE techniques: Node2Vec, which captures structural patterns through unsupervised random walks, and GraphSAGE, which aggregates neighbourhood information in a supervised way. Node2Vec embeddings are first computed to represent the graph topology, and node features are then enriched with centrality-based metrics, which are used as input for the GraphSAGE model. Moreover, a fusion layer combines the original Node2Vec embeddings with the GraphSAGE-influenced representations, resulting in a dual-perspective embedding space. Such a fusion captures both topological and semantic properties of the graph, enabling the model to exploit informative features that may exist in the dataset but that are not explicitly represented. Our approach improves downstream task performance, especially in scenarios with poor initial features, giving the basis for more accurate and precise KG-enanched GML models.

[1005] Generalization Bounds for Semi-supervised Matrix Completion with Distributional Side Information

Antoine Ledent, Mun Chong Soo, Nong Minh Hieu

Main category: cs.LG

TL;DR: The paper studies matrix completion where both ground truth and sampling distribution are low-rank and share a common subspace, using both unlabeled (implicit feedback) and labeled (explicit feedback) data.

DetailsMotivation: Inspired by recommender systems where implicit feedback (clicks, purchases) is abundant but explicit feedback (ratings) is scarce, the paper aims to leverage both data types effectively.

Method: Leveraging low-rank subspace recovery theory and classic matrix completion bounds, the method combines large unlabeled data (M samples) with small labeled data (N samples) under shared subspace assumption.

Result: Derived error bounds scaling as Õ(√(nd/M)) and Õ(√(dr/N)), confirmed in synthetic experiments. Real experiments on Douban and MovieLens show improved performance over explicit-only baselines.

Conclusion: The proposed framework provides a valid theoretical setting for studying explicit-implicit feedback interaction in recommender systems, demonstrating practical benefits of combining both data types.

Abstract: We study a matrix completion problem where both the ground truth $R$ matrix and the unknown sampling distribution $P$ over observed entries are low-rank matrices, and \textit{share a common subspace}. We assume that a large amount $M$ of \textit{unlabeled} data drawn from the sampling distribution $P$ is available, together with a small amount $N$ of labeled data drawn from the same distribution and noisy estimates of the corresponding ground truth entries. This setting is inspired by recommender systems scenarios where the unlabeled data corresponds to implicit feedback' (consisting in interactions such as purchase, click, etc. ) and the labeled data corresponds to the explicit feedback’, consisting of interactions where the user has given an explicit rating to the item. Leveraging powerful results from the theory of low-rank subspace recovery, together with classic generalization bounds for matrix completion models, we show error bounds consisting of a sum of two error terms scaling as $\widetilde{O}\left(\sqrt{\frac{nd}{M}}\right)$ and $\widetilde{O}\left(\sqrt{\frac{dr}{N}}\right)$ respectively, where $d$ is the rank of $P$ and $r$ is the rank of $M$. In synthetic experiments, we confirm that the true generalization error naturally splits into independent error terms corresponding to the estimations of $P$ and and the ground truth matrix $\ground$ respectively. In real-life experiments on Douban and MovieLens with most explicit ratings removed, we demonstrate that the method can outperform baselines relying only on the explicit ratings, demonstrating that our assumptions provide a valid toy theoretical setting to study the interaction between explicit and implicit feedbacks in recommender systems.

[1006] Latency and Ordering Effects in Online Decisions

Duo Yi

Main category: cs.LG

TL;DR: The paper establishes a structured lower bound for excess benchmark loss in online decision systems with delayed feedback and order-sensitive dynamics, decomposing it into latency penalties, order-sensitivity penalties, their interaction, and a nonconvexity penalty.

DetailsMotivation: Online decision systems operate under delayed feedback and noncommutative dynamics where actions affect observation sequences, requiring theoretical guarantees beyond convex settings.

Method: Proves excess benchmark loss admits a structured lower bound using Bregman divergence, extends to prox-regular and weakly convex settings, and provides operational recipes for estimation via randomized experiments and streaming diagnostics.

Result: Obtained robust guarantees beyond convex case with interpretable lower-bound framework that packages latency, noncommutativity, and implementation-gap effects.

Conclusion: The framework provides a single interpretable lower-bound statement that can be stress-tested and tuned in real-world systems, handling heterogeneous latency and noncommutativity effects.

Abstract: Online decision systems routinely operate under delayed feedback and order-sensitive (noncommutative) dynamics: actions affect which observations arrive, and in what sequence. Taking a Bregman divergence $D_Φ$ as the loss benchmark, we prove that the excess benchmark loss admits a structured lower bound $L \ge L_{\mathrm{ideal}} + g_1(λ) + g_2(\varepsilon_\star) + g_{12}(λ,\varepsilon_\star) - D_{\mathrm{ncx}}$, where $g_1$ and $g_2$ are calibrated penalties for latency and order-sensitivity, $g_{12}$ captures their geometric interaction, and $D_{\mathrm{ncx}}\ge 0$ is a nonconvexity/approximation penalty that vanishes under convex Legendre assumptions. We extend this inequality to prox-regular and weakly convex settings, obtaining robust guarantees beyond the convex case. We also give an operational recipe for estimating and monitoring the four terms via simple $2\times 2$ randomized experiments and streaming diagnostics (effective sample size, clipping rate, interaction heatmaps). The framework packages heterogeneous latency, noncommutativity, and implementation-gap effects into a single interpretable lower-bound statement that can be stress-tested and tuned in real-world systems.

[1007] Self-Organization of Attractor Landscapes in High-Capacity Kernel Logistic Regression Hopfield Networks

Akira Tamamori

Main category: cs.LG

TL;DR: Kernel-based Hopfield networks achieve high storage capacity through a geometric optimization mechanism where direct and feedback forces create an “optimization ridge” for maximum attractor stability.

DetailsMotivation: To understand the dynamical mechanism behind the enhanced storage capacity in kernel-based Hopfield networks, which remains poorly understood despite their practical success.

Method: Geometric analysis of the network’s energy landscape using a novel “Pinnacle Sharpness” metric to quantify attractor stability, systematically varying kernel width and storage load, and theoretical decomposition of landscape gradient into direct driving force and indirect feedback force.

Result: Discovery of a rich phase diagram with an “optimization ridge” where attractor stability is maximized under high-load conditions, characterized by strong anti-correlation between direct and feedback forces where the amplified direct force dominates.

Conclusion: Kernel-based Hopfield networks employ sophisticated self-organization through cooperative feedback control, adaptively harnessing inter-pattern interactions to sculpt robust energy landscapes, providing new design principles for high-capacity associative memories.

Abstract: Kernel-based learning methods can dramatically increase the storage capacity of Hopfield networks, yet the dynamical mechanism behind this enhancement remains poorly understood. We address this gap by conducting a geometric analysis of the network’s energy landscape. We introduce a novel metric, Pinnacle Sharpness,'' to quantify the local stability of attractors. By systematically varying the kernel width and storage load, we uncover a rich phase diagram of attractor shapes. Our central finding is the emergence of a ridge of optimization,’’ where the network maximizes attractor stability under challenging high-load and global-kernel conditions. Through a theoretical decomposition of the landscape gradient into a direct driving'' force and an indirect feedback’’ force, we reveal the origin of this phenomenon. The optimization ridge corresponds to a regime of strong anti-correlation between the two forces, where the direct force, amplified by the high storage load, dominates the opposing collective feedback force. This demonstrates a sophisticated self-organization mechanism: the network adaptively harnesses inter-pattern interactions as a cooperative feedback control system to sculpt a robust energy landscape. Our findings provide a new physical picture for the stability of high-capacity associative memories and offer principles for their design.

[1008] MACKO: Sparse Matrix-Vector Multiplication for Low Sparsity

Vladimír Macko, Vladimír Boža

Main category: cs.LG

TL;DR: MACKO-SpMV is a GPU-optimized format and kernel for efficient sparse matrix-vector multiplication in pruned LLMs, achieving 1.5x memory reduction and 1.2-1.5x speedup over dense representations at 50% sparsity.

DetailsMotivation: Existing SpMV methods perform poorly with the low and unstructured sparsity (30-90%) in pruned LLMs, limiting the benefits of unstructured pruning for memory reduction and speedup.

Method: Co-designed GPU-optimized format and kernel that reduces storage overhead while maintaining compatibility with GPU execution model, without requiring specialized hardware or format-specific precomputation.

Result: At 50% sparsity: 1.5x memory reduction and 1.2-1.5x speedup over dense; 2.8-13.0x over cuSPARSE; 1.9-2.6x over Sputnik; 2.2-2.5x over DASP. Applied to Llama2-7B: 1.5x memory reduction and 1.5x faster inference at fp16.

Conclusion: MACKO makes unstructured pruning at 50% sparsity practical for real-world LLM workloads, justifying its use where previously limited benefits were achieved.

Abstract: Sparse Matrix-Vector Multiplication (SpMV) is a fundamental operation in the inference of sparse Large Language Models (LLMs). Because existing SpMV methods perform poorly under the low and unstructured sparsity (30-90%) commonly observed in pruned LLMs, unstructured pruning provided only limited memory reduction and speedup. We propose MACKO-SpMV, a GPU-optimized format and kernel co-designed to reduce storage overhead while preserving compatibility with the GPU’s execution model. This enables efficient SpMV for unstructured sparsity without specialized hardware units (e.g., tensor cores) or format-specific precomputation. Empirical results show that at sparsity 50%, MACKO is the first approach with significant 1.5x memory reduction and 1.2-1.5x speedup over dense representation. Speedups over other SpMV baselines: 2.8-13.0x over cuSPARSE, 1.9-2.6x over Sputnik, and 2.2-2.5x over DASP. Applied to Llama2-7B pruned with Wanda to sparsity 50%, it delivers 1.5x memory reduction and 1.5x faster inference at fp16 precision. Thanks to MACKO, unstructured pruning at 50% sparsity is now justified in real-world LLM workloads.

[1009] Self-Adaptive Graph Mixture of Models

Mohit Meena, Yash Punjabi, Abhishek A, Vishal Sharma, Mahesh Chandran

Main category: cs.LG

TL;DR: SAGMM is a modular framework that automatically selects and combines diverse GNN models using topology-aware attention gating, outperforming individual models and prior mixture methods across various graph tasks.

DetailsMotivation: GNN performance gains are plateauing, with simple models often matching complex architectures, highlighting the challenge of model selection for different graph tasks and datasets.

Method: Uses architectural diversity with topology-aware attention gating to adaptively assign experts to nodes, includes pruning for efficiency, and offers a training-efficient variant with frozen pretrained experts.

Result: Consistently outperforms or matches leading GNN baselines and prior mixture methods across 16 benchmark datasets for node classification, graph classification, regression, and link prediction.

Conclusion: SAGMM provides a robust, adaptive solution for real-world graph learning by automatically selecting optimal model combinations without compromising performance.

Abstract: Graph Neural Networks (GNNs) have emerged as powerful tools for learning over graph-structured data, yet recent studies have shown that their performance gains are beginning to plateau. In many cases, well-established models such as GCN and GAT, when appropriately tuned, can match or even exceed the performance of more complex, state-of-the-art architectures. This trend highlights a key limitation in the current landscape: the difficulty of selecting the most suitable model for a given graph task or dataset. To address this, we propose Self-Adaptive Graph Mixture of Models (SAGMM), a modular and practical framework that learns to automatically select and combine the most appropriate GNN models from a diverse pool of architectures. Unlike prior mixture-of-experts approaches that rely on variations of a single base model, SAGMM leverages architectural diversity and a topology-aware attention gating mechanism to adaptively assign experts to each node based on the structure of the input graph. To improve efficiency, SAGMM includes a pruning mechanism that reduces the number of active experts during training and inference without compromising performance. We also explore a training-efficient variant in which expert models are pretrained and frozen, and only the gating and task-specific layers are trained. We evaluate SAGMM on 16 benchmark datasets covering node classification, graph classification, regression, and link prediction tasks, and demonstrate that it consistently outperforms or matches leading GNN baselines and prior mixture-based methods, offering a robust and adaptive solution for real-world graph learning.

[1010] Real-time prediction of breast cancer sites using deformation-aware graph neural network

Kyunghyun Lee, Yong-Min Shin, Minwoo Shin, Jihun Kim, Sunghwan Lim, Won-Yong Shin, Kyungho Yoon

Main category: cs.LG

TL;DR: A graph neural network (GNN) model was developed for real-time prediction of deformed breast cancer sites during MRI-guided biopsy, achieving sub-millimeter accuracy and 4000x speed-up over traditional finite element simulations.

DetailsMotivation: To overcome limitations of direct MRI-guided biopsy (prolonged procedure times, high costs) and improve the accuracy of indirect MRI-guided biopsy by creating an accurate real-time deformable breast model for cancer site prediction.

Method: Developed an individual-specific finite element model using MR image-derived structural information, then employed a GNN model that processes surface displacement and distance-based graph data to predict overall tissue displacement including tumor deformation.

Result: Achieved 0.2 mm accuracy for cancer node displacement (RMSE), dice similarity coefficient of 0.977 for spatial overlap with actual cancerous regions, real-time inference capability, and 4000x computational speed-up compared to conventional FE simulations.

Conclusion: The deformation-aware GNN model provides a promising solution for real-time tumor displacement prediction in breast biopsy with high accuracy and real-time capability, potentially enhancing precision and efficiency of breast cancer diagnosis.

Abstract: Early diagnosis of breast cancer is crucial, enabling the establishment of appropriate treatment plans and markedly enhancing patient prognosis. While direct magnetic resonance imaging-guided biopsy demonstrates promising performance in detecting cancer lesions, its practical application is limited by prolonged procedure times and high costs. To overcome these issues, an indirect MRI-guided biopsy that allows the procedure to be performed outside of the MRI room has been proposed, but it still faces challenges in creating an accurate real-time deformable breast model. In our study, we tackled this issue by developing a graph neural network (GNN)-based model capable of accurately predicting deformed breast cancer sites in real time during biopsy procedures. An individual-specific finite element (FE) model was developed by incorporating magnetic resonance (MR) image-derived structural information of the breast and tumor to simulate deformation behaviors. A GNN model was then employed, designed to process surface displacement and distance-based graph data, enabling accurate prediction of overall tissue displacement, including the deformation of the tumor region. The model was validated using phantom and real patient datasets, achieving an accuracy within 0.2 millimeters (mm) for cancer node displacement (RMSE) and a dice similarity coefficient (DSC) of 0.977 for spatial overlap with actual cancerous regions. Additionally, the model enabled real-time inference and achieved a speed-up of over 4,000 times in computational cost compared to conventional FE simulations. The proposed deformation-aware GNN model offers a promising solution for real-time tumor displacement prediction in breast biopsy, with high accuracy and real-time capability. Its integration with clinical procedures could significantly enhance the precision and efficiency of breast cancer diagnosis.

[1011] Fair In-Context Learning via Latent Concept Variables

Karuna Bhaila, Minh-Hao Van, Kennedy Edemacu, Chen Zhao, Feng Chen, Xintao Wu

Main category: cs.LG

TL;DR: This paper investigates bias in LLMs during in-context learning with tabular data and proposes a fair latent variable approach using data augmentation to reduce correlation between predictions and sensitive variables.

DetailsMotivation: With LLMs increasingly used in high-stakes domains, they can inherit social bias from pre-training data, especially when applied to tabular data through in-context learning.

Method: Uses latent concept variables for demonstration selection, employs data augmentation to reduce correlation between outcomes and sensitive variables, and learns concepts with smaller LLMs that generalize to larger ones.

Result: Empirical verification shows the fair latent variable approach improves fairness results on tabular datasets compared to heuristic demonstration selection methods.

Conclusion: The proposed method effectively reduces bias in LLM predictions for tabular data through fair latent concept learning and demonstration selection.

Abstract: The emerging in-context learning (ICL) ability of large language models (LLMs) has prompted their use for predictive tasks in various domains with different data types, including tabular data, facilitated by serialization methods. However, with increasing applications in high-stakes domains, it has been shown that LLMs can inherit social bias and discrimination from their pre-training data. In this work, we investigate inherent bias in LLMs during in-context learning with tabular data. We focus on an optimal demonstration selection approach that utilizes latent concept variables for resource-efficient task adaptation. We design data augmentation strategies that reduce the correlation between predictive outcomes and sensitive variables, helping promote fairness during latent concept learning. We utilize the learned concept to select demonstrations and obtain fair predictions. The latent concept variables are learned using a smaller internal LLM and generalized to larger external LLMs. We empirically verify that the fair latent variable approach improves fairness results on tabular datasets compared to multiple heuristic demonstration selection methods.

[1012] Synthetic Forgetting without Access: A Few-shot Zero-glance Framework for Machine Unlearning

Qipeng Song, Nan Yang, Ziqi Xu, Yue Li, Wei Shao, Feng Xia

Main category: cs.LG

TL;DR: GFOES enables machine unlearning with limited data access using generative feedback and two-phase fine-tuning to erase class-specific knowledge while preserving utility.

DetailsMotivation: Existing machine unlearning methods require full access to original training data, which is impractical. The paper addresses the realistic challenge of few-shot zero-glance unlearning where only minimal retained data is available and forget data is inaccessible.

Method: Proposes GFOES framework with Generative Feedback Network (GFN) that synthesizes Optimal Erasure Samples (OES) to induce high loss on target classes, combined with two-phase fine-tuning: aggressive forgetting followed by utility restoration.

Result: Experiments on three image classification datasets show GFOES achieves effective forgetting at both logit and representation levels while maintaining strong performance using only 5% of original data.

Conclusion: GFOES provides a practical and scalable solution for privacy-preserving machine learning under data-constrained conditions, enabling effective unlearning without access to original forget data.

Abstract: Machine unlearning aims to eliminate the influence of specific data from trained models to ensure privacy compliance. However, most existing methods assume full access to the original training dataset, which is often impractical. We address a more realistic yet challenging setting: few-shot zero-glance, where only a small subset of the retained data is available and the forget set is entirely inaccessible. We introduce GFOES, a novel framework comprising a Generative Feedback Network (GFN) and a two-phase fine-tuning procedure. GFN synthesises Optimal Erasure Samples (OES), which induce high loss on target classes, enabling the model to forget class-specific knowledge without access to the original forget data, while preserving performance on retained classes. The two-phase fine-tuning procedure enables aggressive forgetting in the first phase, followed by utility restoration in the second. Experiments on three image classification datasets demonstrate that GFOES achieves effective forgetting at both logit and representation levels, while maintaining strong performance using only 5% of the original data. Our framework offers a practical and scalable solution for privacy-preserving machine learning under data-constrained conditions.

[1013] Departures: Distributional Transport for Single-Cell Perturbation Prediction with Neural Schrödinger Bridges

Changxi Chi, Yufei Huang, Jun Xia, Jiangbin Zheng, Yunfan Liu, Zelin Zang, Stan Z. Li

Main category: cs.LG

TL;DR: The paper proposes a Schrödinger Bridge approximation method using Minibatch-OT pairing to model single-cell perturbation outcomes without requiring paired data, achieving state-of-the-art performance in capturing heterogeneous cellular responses.

DetailsMotivation: Predicting single-cell perturbation outcomes is crucial for gene function analysis and drug discovery, but existing methods struggle with unpaired data limitations and lack precise perturbation modeling capabilities.

Method: Approximates Schrödinger Bridge using Minibatch-OT based pairing to directly align distributions of control and perturbed single-cell populations, avoiding bidirectional inference and ill-posed reverse processes. Models both discrete gene activation states and continuous expression distributions.

Result: Experiments on genetic and drug perturbation datasets show the model effectively captures heterogeneous single-cell responses and achieves state-of-the-art performance.

Conclusion: The proposed Schrödinger Bridge approximation with Minibatch-OT pairing provides a scalable solution for accurate single-cell perturbation modeling that captures cellular heterogeneity without requiring paired data.

Abstract: Predicting single-cell perturbation outcomes directly advances gene function analysis and facilitates drug candidate selection, making it a key driver of both basic and translational biomedical research. However, a major bottleneck in this task is the unpaired nature of single-cell data, as the same cell cannot be observed both before and after perturbation due to the destructive nature of sequencing. Although some neural generative transport models attempt to tackle unpaired single-cell perturbation data, they either lack explicit conditioning or depend on prior spaces for indirect distribution alignment, limiting precise perturbation modeling. In this work, we approximate Schrödinger Bridge (SB), which defines stochastic dynamic mappings recovering the entropy-regularized optimal transport (OT), to directly align the distributions of control and perturbed single-cell populations across different perturbation conditions. Unlike prior SB approximations that rely on bidirectional modeling to infer optimal source-target sample coupling, we leverage Minibatch-OT based pairing to avoid such bidirectional inference and the associated ill-posedness of defining the reverse process. This pairing directly guides bridge learning, yielding a scalable approximation to the SB. We approximate two SB models, one modeling discrete gene activation states and the other continuous expression distributions. Joint training enables accurate perturbation modeling and captures single-cell heterogeneity. Experiments on public genetic and drug perturbation datasets show that our model effectively captures heterogeneous single-cell responses and achieves state-of-the-art performance.

[1014] Soft Conflict-Resolution Decision Transformer for Offline Multi-Task Reinforcement Learning

Shudong Wang, Xinfei Wang, Chenhao Zhang, Shanchen Pang, Haiyuan Gui, Wenhao Ji, Xiaojian Liao

Main category: cs.LG

TL;DR: SoCo-DT is a soft conflict-resolution method for multi-task reinforcement learning that uses parameter importance and dynamic sparsity adjustment to better handle gradient conflicts across tasks.

DetailsMotivation: Existing masking methods in multi-task RL use coarse-grained binary masks that over-suppress conflicting parameters and use fixed sparsity strategies, which hinders knowledge sharing and generalization across tasks.

Method: SoCo-DT uses Fisher information to dynamically adjust mask values, retains important parameters while suppressing conflicts, and employs IQR-based dynamic sparsity adjustment with asymmetric cosine annealing for adaptive threshold updates.

Result: Experimental results on Meta-World benchmark show SoCo-DT outperforms state-of-the-art by 7.6% on MT50 and 10.5% on suboptimal dataset.

Conclusion: SoCo-DT effectively mitigates gradient conflicts and improves multi-task performance through soft conflict resolution and adaptive sparsity management.

Abstract: Multi-task reinforcement learning (MTRL) seeks to learn a unified policy for diverse tasks, but often suffers from gradient conflicts across tasks. Existing masking-based methods attempt to mitigate such conflicts by assigning task-specific parameter masks. However, our empirical study shows that coarse-grained binary masks have the problem of over-suppressing key conflicting parameters, hindering knowledge sharing across tasks. Moreover, different tasks exhibit varying conflict levels, yet existing methods use a one-size-fits-all fixed sparsity strategy to keep training stability and performance, which proves inadequate. These limitations hinder the model’s generalization and learning efficiency. To address these issues, we propose SoCo-DT, a Soft Conflict-resolution method based by parameter importance. By leveraging Fisher information, mask values are dynamically adjusted to retain important parameters while suppressing conflicting ones. In addition, we introduce a dynamic sparsity adjustment strategy based on the Interquartile Range (IQR), which constructs task-specific thresholding schemes using the distribution of conflict and harmony scores during training. To enable adaptive sparsity evolution throughout training, we further incorporate an asymmetric cosine annealing schedule to continuously update the threshold. Experimental results on the Meta-World benchmark show that SoCo-DT outperforms the state-of-the-art method by 7.6% on MT50 and by 10.5% on the suboptimal dataset, demonstrating its effectiveness in mitigating gradient conflicts and improving overall multi-task performance.

[1015] PIP: Perturbation-based Iterative Pruning for Large Language Models

Yi Cao, Wei-Jie Xu, Yucheng Shen, Weijie Shi, Chi-Min Chan, Jianfeng Qu, Jiajie Xu

Main category: cs.LG

TL;DR: PIP is a novel double-view structured pruning method that reduces LLM parameters by ~20% while maintaining over 85% accuracy by iteratively pruning components that struggle to distinguish between unperturbed and perturbed views.

DetailsMotivation: The rapid growth in LLM parameter counts (billions to trillions) creates deployment challenges in resource-constrained environments, necessitating efficient model optimization techniques.

Method: PIP combines information from unperturbed and perturbed views, calculates gradient differences, and iteratively prunes components that cannot distinguish between these two views.

Result: PIP reduces parameter count by approximately 20% while retaining over 85% of original model accuracy across benchmarks, with some cases performing within 5% of unpruned versions.

Conclusion: PIP consistently outperforms existing SOTA structured pruning methods and establishes itself as a leading technique for optimizing LLMs in constrained environments.

Abstract: The rapid increase in the parameter counts of Large Language Models (LLMs), which often reach into the billions or even trillions, presents significant challenges for their practical deployment, particularly in resource-constrained environments. To address this issue, we propose PIP (Perturbation-based Iterative Pruning), a novel double-view structured pruning method to optimize LLMs, which combines information from two different views: the unperturbed view and the perturbed view. With the calculation of gradient differences, PIP iteratively prunes those that struggle to distinguish between these two views. Our experiments show that PIP reduces the parameter count by approximately 20% while retaining over 85% of the original model’s accuracy across varied benchmarks. In some cases, the performance of the pruned model is within 5% of the unpruned version, demonstrating PIP’s ability to preserve key aspects of model effectiveness. Moreover, PIP consistently outperforms existing state-of-the-art (SOTA) structured pruning methods, establishing it as a leading technique for optimizing LLMs in constrained environments.

[1016] Personalized Federated Learning with Bidirectional Communication Compression via One-Bit Random Sketching

Jiacheng Cheng, Xu Zhang, Guanghui Qiu, Yifang Zhang, Yinchuan Li, Kaiyuan Feng

Main category: cs.LG

TL;DR: pFed1BS is a personalized federated learning framework that uses one-bit random sketching to achieve extreme communication compression while handling client-side data heterogeneity through sign-based regularization.

DetailsMotivation: To address the key challenges in Federated Learning: bidirectional communication overhead and client-side data heterogeneity, while enabling personalized model training for each client.

Method: Uses one-bit random sketching for communication compression, Fast Hadamard Transform for efficient projection, and introduces a sign-based regularizer to align local models with global consensus while preserving local data characteristics.

Result: Theoretical analysis shows convergence to a stationary neighborhood of the global potential function. Numerical simulations demonstrate substantial communication cost reduction with competitive performance compared to advanced communication-efficient FL algorithms.

Conclusion: pFed1BS successfully achieves extreme communication compression while maintaining effective personalization capabilities in federated learning settings.

Abstract: Federated Learning (FL) enables collaborative training across decentralized data, but faces key challenges of bidirectional communication overhead and client-side data heterogeneity. To address communication costs while embracing data heterogeneity, we propose pFed1BS, a novel personalized federated learning framework that achieves extreme communication compression through one-bit random sketching. In personalized FL, the goal shifts from training a single global model to creating tailored models for each client. In our framework, clients transmit highly compressed one-bit sketches, and the server aggregates and broadcasts a global one-bit consensus. To enable effective personalization, we introduce a sign-based regularizer that guides local models to align with the global consensus while preserving local data characteristics. To mitigate the computational burden of random sketching, we employ the Fast Hadamard Transform for efficient projection. Theoretical analysis guarantees that our algorithm converges to a stationary neighborhood of the global potential function. Numerical simulations demonstrate that pFed1BS substantially reduces communication costs while achieving competitive performance compared to advanced communication-efficient FL algorithms.

[1017] OTARo: Once Tuning for All Precisions toward Robust On-Device LLMs

Shaoyuan Chen, Zhixuan Chen, Dawei Yang, Zhihang Yuan, Qiang Wu

Main category: cs.LG

TL;DR: OTARo enables flexible on-device switching of quantization precisions for LLMs through once fine-tuning, using Shared Exponent Floating Point and bit-width robustness learning.

DetailsMotivation: Conventional quantization lacks flexibility for on-device tasks requiring different bit-widths, as scaling factors are incompatible across precisions and cannot support dynamic switching in real-world scenarios.

Method: Introduces Shared Exponent Floating Point (SEFP) for generating different bit-widths via mantissa truncation, and employs bit-width robustness learning with Exploitation-Exploration Bit-Width Path Search and Low-Precision Asynchronous Accumulation.

Result: Experiments on LLaMA3.2-1B and LLaMA3-8B show OTARo achieves consistently strong and robust performance across all precisions.

Conclusion: OTARo successfully overcomes the flexibility limitations of conventional quantization, enabling on-device LLMs to dynamically switch quantization precisions while maintaining performance through a single fine-tuning process.

Abstract: Large Language Models (LLMs) fine-tuning techniques not only improve the adaptability to diverse downstream tasks, but also mitigate adverse effects of model quantization. Despite this, conventional quantization suffers from its structural limitation that hinders flexibility during the fine-tuning and deployment stages. Practical on-device tasks demand different quantization precisions (i.e. different bit-widths), e.g., understanding tasks tend to exhibit higher tolerance to reduced precision compared to generation tasks. Conventional quantization, typically relying on scaling factors that are incompatible across bit-widths, fails to support the on-device switching of precisions when confronted with complex real-world scenarios. To overcome the dilemma, we propose OTARo, a novel method that enables on-device LLMs to flexibly switch quantization precisions while maintaining performance robustness through once fine-tuning. OTARo introduces Shared Exponent Floating Point (SEFP), a distinct quantization mechanism, to produce different bit-widths through simple mantissa truncations of a single model. Moreover, to achieve bit-width robustness in downstream applications, OTARo performs a learning process toward losses induced by different bit-widths. The method involves two critical strategies: (1) Exploitation-Exploration Bit-Width Path Search (BPS), which iteratively updates the search path via a designed scoring mechanism; (2) Low-Precision Asynchronous Accumulation (LAA), which performs asynchronous gradient accumulations and delayed updates under low bit-widths. Experiments on popular LLMs, e.g., LLaMA3.2-1B, LLaMA3-8B, demonstrate that OTARo achieves consistently strong and robust performance for all precisions.

[1018] Hogwild! Inference: Parallel LLM Generation via Concurrent Attention

Gleb Rodionov, Roman Garipov, Alina Shutova, George Yakushev, Erik Schultheis, Vage Egiazarian, Anton Sinitsin, Denis Kuznedelev, Dan Alistarh

Main category: cs.LG

TL;DR: Hogwild! Inference enables parallel LLM workers to collaborate via shared attention cache, allowing them to self-organize collaboration strategies without explicit frameworks.

DetailsMotivation: Current parallel LLM frameworks use fixed cooperation strategies (voting, sub-task division) that may not suit all tasks, limiting applicability.

Method: Run multiple LLM instances in parallel with shared KV cache using RoPE, allowing workers to synchronize and decide collaboration strategies organically.

Result: Modern reasoning LLMs can perform inference with shared KV cache without fine-tuning, improving parallel hardware utilization.

Conclusion: LLMs can self-organize effective collaboration through shared attention memory, providing a flexible alternative to rigid parallel frameworks.

Abstract: Large Language Models (LLMs) have demonstrated the ability to tackle increasingly complex tasks through advanced reasoning, long-form content generation, and tool use. Solving these tasks often involves long inference-time computations. In human problem solving, a common strategy to expedite work is collaboration: by dividing the problem into sub-tasks, exploring different strategies concurrently, etc. Recent research has shown that LLMs can also operate in parallel by implementing explicit cooperation frameworks, such as voting mechanisms or the explicit creation of independent sub-tasks that can be executed in parallel. However, each of these frameworks may not be suitable for all types of tasks, which can hinder their applicability. In this work, we propose a different design approach: we run LLM “workers” in parallel , allowing them to synchronize via a concurrently-updated attention cache and prompt these workers to decide how best to collaborate. Our approach allows the LLM instances to come up with their own collaboration strategy for the problem at hand, all the while “seeing” each other’s memory in the concurrent KV cache. We implement this approach via Hogwild! Inference: a parallel LLM inference engine where multiple instances of the same LLM run in parallel with the same attention cache, with “instant” access to each other’s memory. Hogwild! Inference takes advantage of Rotary Position Embeddings (RoPE) to avoid recomputation while improving parallel hardware utilization. We find that modern reasoning-capable LLMs can perform inference with shared Key-Value cache out of the box, without additional fine-tuning.

[1019] Warm-starting active-set solvers using graph neural networks

Ella J. Schmidtobreick, Daniel Arnström, Paul Häusner, Jens Sjölund

Main category: cs.LG

TL;DR: Learning-to-optimize approach using GNNs to predict active sets in QP solver DAQP, reducing iterations and enabling warm-starting for real-time control applications.

DetailsMotivation: QP solvers are computationally expensive for time-critical applications, so accelerating them through learning-based methods is valuable.

Method: Represent QPs as bipartite graphs and use GNNs to predict optimal active sets for warm-starting the DAQP solver.

Result: GNNs consistently reduce solver iterations compared to cold-starting, with performance comparable to MLP baseline, and generalize well to unseen problem sizes.

Conclusion: Structure-aware learning with GNNs shows potential for accelerating optimization in real-time applications like model predictive control.

Abstract: Quadratic programming (QP) solvers are widely used in real-time control and optimization, but their computational cost often limits applicability in time-critical settings. We propose a learning-to-optimize approach using graph neural networks (GNNs) to predict active sets in the dual active-set solver DAQP. The method exploits the structural properties of QPs by representing them as bipartite graphs and learning to identify the optimal active set for efficiently warm-starting the solver. Across varying problem sizes, the GNN consistently reduces the number of solver iterations compared to cold-starting, while performance is comparable to a multilayer perceptron (MLP) baseline. Furthermore, a GNN trained on varying problem sizes generalizes effectively to unseen dimensions, demonstrating flexibility and scalability. These results highlight the potential of structure-aware learning to accelerate optimization in real-time applications such as model predictive control.

[1020] Real-time distortion prediction in metallic additive manufacturing via a physics-informed neural operator approach

Mingxuan Tian, Haochen Mu, Donghong Ding, Mengjiao Li, Yuhan Ding, Jianping Zhao

Main category: cs.LG

TL;DR: A Physics-informed Neural Operator (PINO) model called PIDeepONet-RNN is proposed for real-time distortion field prediction in metal Additive Manufacturing, achieving high accuracy with max errors of 0.9733mm and 0.2049mm in z and y directions respectively.

DetailsMotivation: Real-time distortion prediction is needed for defect control in metal AM, but numerical simulations are computationally expensive and conventional ML models struggle with spatiotemporal features and thermo-mechanical field decoupling.

Method: PIDeepONet-RNN uses trunk and branch networks to process temperature history and encode distortion fields respectively, incorporating heat conduction equation as soft constraint for physical consistency.

Result: The model achieves high accuracy with low error accumulation and time efficiency, with max absolute errors of 0.9733mm (z-direction) and 0.2049mm (y-direction), showing high errors in molten pool but low gradients in deposited areas.

Conclusion: The PINO surrogate model demonstrates potential for real-time long-horizon physics field prediction in defect control for metal Additive Manufacturing.

Abstract: With the development of digital twins and smart manufacturing systems, there is an urgent need for real-time distortion field prediction to control defects in metal Additive Manufacturing (AM). However, numerical simulation methods suffer from high computational cost, long run-times that prevent real-time use, while conventional Machine learning (ML) models struggle to extract spatiotemporal features for long-horizon prediction and fail to decouple thermo-mechanical fields. This paper proposes a Physics-informed Neural Operator (PINO) to predict z and y-direction distortion for the future 15 s. Our method, Physics-informed Deep Operator Network-Recurrent Neural Network (PIDeepONet-RNN) employs trunk and branch network to process temperature history and encode distortion fields, respectively, enabling decoupling of thermo-mechanical responses. By incorporating the heat conduction equation as a soft constraint, the model ensures physical consistency and suppresses unphysical artifacts, thereby establishing a more physically consistent mapping between the thermal history and distortion. This is important because such a basis function, grounded in physical laws, provides a robust and interpretable foundation for predictions. The proposed models are trained and tested using datasets generated from experimentally validated Finite Element Method (FEM). Evaluation shows that the model achieves high accuracy, low error accumulation, time efficiency. The max absolute errors in the z and y-directions are as low as 0.9733 mm and 0.2049 mm, respectively. The error distribution shows high errors in the molten pool but low gradient norms in the deposited and key areas. The performance of PINO surrogate model highlights its potential for real-time long-horizon physics field prediction in controlling defects.

[1021] Uncertainty-aware Physics-informed Neural Networks for Robust CARS-to-Raman Signal Reconstruction

Aishwarya Venkataramanan, Sai Karthikeya Vemuri, Adithya Ashok Chalain Valapil, Joachim Denzler

Main category: cs.LG

TL;DR: Evaluation of uncertainty quantification methods for CARS-to-Raman signal reconstruction, showing physics-informed constraints improve model calibration.

DetailsMotivation: CARS spectroscopy suffers from non-resonant background interference that distorts Raman signals, and existing deep learning methods lack uncertainty quantification needed for reliable deployment in scientific and biomedical applications.

Method: Evaluated various uncertainty quantification techniques for CARS-to-Raman signal reconstruction and incorporated physics-informed constraints (Kramers-Kronig relationships and smoothness) into models.

Result: Physics-informed constraints improved model calibration, providing more trustworthy uncertainty estimates for CARS data analysis.

Conclusion: Integrating physics-informed constraints with uncertainty quantification offers a promising approach for reliable CARS spectroscopy in high-stakes applications.

Abstract: Coherent anti-Stokes Raman scattering (CARS) spectroscopy is a powerful and rapid technique widely used in medicine, material science, and chemical analyses. However, its effectiveness is hindered by the presence of a non-resonant background that interferes with and distorts the true Raman signal. Deep learning methods have been employed to reconstruct the true Raman spectrum from measured CARS data using labeled datasets. A more recent development integrates the domain knowledge of Kramers-Kronig relationships and smoothness constraints in the form of physics-informed loss functions. However, these deterministic models lack the ability to quantify uncertainty, an essential feature for reliable deployment in high-stakes scientific and biomedical applications. In this work, we evaluate and compare various uncertainty quantification (UQ) techniques within the context of CARS-to-Raman signal reconstruction. Furthermore, we demonstrate that incorporating physics-informed constraints into these models improves their calibration, offering a promising path toward more trustworthy CARS data analysis.

[1022] ParaDySe: A Parallel-Strategy Switching Framework for Dynamic Sequence Lengths in Transformer

Zhixin Ou, Peng Liang, Jianchen Han, Baihui Liu, Linbo Qiao

Main category: cs.LG

TL;DR: ParaDySe is an adaptive parallel strategy switching framework for dynamic sequence training in LLMs that enables on-the-fly optimal strategy selection based on input sequence length to address communication-parallelization cancellation and out-of-memory issues.

DetailsMotivation: Current training frameworks use static parallel strategies for dynamic sequences, causing inefficiency on short sequences (communication-parallelization cancellation) and memory issues on long sequences (out-of-memory).

Method: Implements modular function libraries for parallel strategies with unified tensor layouts, builds sequence-aware memory and time cost models, and uses a heuristic algorithm to select optimal layer-wise strategies for dynamic sequences.

Result: Experimental results show ParaDySe addresses OOM and CPC bottlenecks in LLM training, working with sequence lengths up to 624K and systematically integrating long-sequence optimizations with existing frameworks.

Conclusion: ParaDySe provides an effective solution for dynamic sequence training in LLMs by enabling adaptive parallel strategy switching, overcoming limitations of static approaches and improving training efficiency across varying sequence lengths.

Abstract: Dynamic sequences with varying lengths have been widely used in the training of Transformer-based large language models (LLMs). However, current training frameworks adopt a pre-defined static parallel strategy for these sequences, causing neither communication-parallelization cancellation on short sequences nor out-of-memory on long sequences. To mitigate these issues, we propose ParaDySe, a novel adaptive Parallel strategy switching framework for Dynamic Sequences. ParaDySe enables on-the-fly optimal strategy adoption according to the immediate input sequence. It first implements the modular function libraries for parallel strategies with unified tensor layout specifications, and then builds sequence-aware memory and time cost models with hybrid methods. Guided by cost models, ParaDySe selects optimal layer-wise strategies for dynamic sequences via an efficient heuristic algorithm. By integrating these techniques together, ParaDySe achieves seamless hot-switching of optimal strategies through its well-designed function libraries. We compare ParaDySe with baselines on representative LLMs under datasets with sequence lengths up to 624K. Experimental results indicate that ParaDySe addresses OOM and CPC bottlenecks in LLM training by systematically integrating long-sequence optimizations with existing frameworks.

[1023] Mitigating Overthinking in Large Reasoning Models via Manifold Steering

Yao Huang, Huanran Chen, Shouwei Ruan, Yichi Zhang, Xingxing Wei, Yinpeng Dong

Main category: cs.LG

TL;DR: Proposes Manifold Steering to reduce overthinking in Large Reasoning Models by projecting steering directions onto low-dimensional activation manifolds, achieving up to 71% token reduction while maintaining accuracy.

DetailsMotivation: Large Reasoning Models exhibit overthinking during inference with excessive validation loops and redundant deliberation, causing substantial computational overheads that need mitigation.

Method: Uses mechanistic interpretability to identify overthinking as tied to a low-dimensional manifold, then projects steering directions onto this manifold using theoretical approximation of interference noise.

Result: Reduces output tokens by up to 71% while maintaining and even improving accuracy on mathematical benchmarks, with robust cross-domain transferability to code generation and knowledge-based QA.

Conclusion: Manifold Steering effectively mitigates overthinking in LRMs by leveraging low-dimensional manifold structure, achieving significant computational savings without sacrificing performance.

Abstract: Recent advances in Large Reasoning Models (LRMs) have demonstrated remarkable capabilities in solving complex tasks such as mathematics and coding. However, these models frequently exhibit a phenomenon known as overthinking during inference, characterized by excessive validation loops and redundant deliberation, leading to substantial computational overheads. In this paper, we aim to mitigate overthinking by investigating the underlying mechanisms from the perspective of mechanistic interpretability. We first showcase that the tendency of overthinking can be effectively captured by a single direction in the model’s activation space and the issue can be eased by intervening the activations along this direction. However, this efficacy soon reaches a plateau and even deteriorates as the intervention strength increases. We therefore systematically explore the activation space and find that the overthinking phenomenon is actually tied to a low-dimensional manifold, which indicates that the limited effect stems from the noises introduced by the high-dimensional steering direction. Based on this insight, we propose Manifold Steering, a novel approach that elegantly projects the steering direction onto the low-dimensional activation manifold given the theoretical approximation of the interference noise. Extensive experiments on DeepSeek-R1 distilled models validate that our method reduces output tokens by up to 71% while maintaining and even improving the accuracy on several mathematical benchmarks. Our method also exhibits robust cross-domain transferability, delivering consistent token reduction performance in code generation and knowledge-based QA tasks. Code is available at: https://github.com/Aries-iai/Manifold_Steering.

[1024] DiffFP: Learning Behaviors from Scratch via Diffusion-based Fictitious Play

Akash Karthikeyan, Yash Vardhan Pant

Main category: cs.LG

TL;DR: DiffFP is a fictitious play framework that uses diffusion policies to learn robust multimodal behaviors in continuous-space zero-sum games, achieving faster convergence and better performance against diverse opponents.

DetailsMotivation: Self-play reinforcement learning struggles with continuous decision spaces, slow convergence to Nash equilibria, and vulnerability to strategic exploitation by unseen opponents.

Method: Proposes DiffFP framework that approximates best responses using diffusion policies with generative modeling to learn adaptive and diverse strategies in fictitious play settings.

Result: Achieves ε-Nash equilibria in continuous-space zero-sum games, with up to 3× faster convergence and 30× higher success rates against RL baselines in racing and multi-particle games.

Conclusion: The diffusion-based fictitious play approach enables robust multimodal policies that outperform baseline methods and demonstrate strong adaptability against diverse opponent strategies.

Abstract: Self-play reinforcement learning has demonstrated significant success in learning complex strategic and interactive behaviors in competitive multi-agent games. However, achieving such behaviors in continuous decision spaces remains challenging. Ensuring adaptability and generalization in self-play settings is critical for achieving competitive performance in dynamic multi-agent environments. These challenges often cause methods to converge slowly or fail to converge at all to a Nash equilibrium, making agents vulnerable to strategic exploitation by unseen opponents. To address these challenges, we propose DiffFP, a fictitious play (FP) framework that estimates the best response to unseen opponents while learning a robust and multimodal behavioral policy. Specifically, we approximate the best response using a diffusion policy that leverages generative modeling to learn adaptive and diverse strategies. Through empirical evaluation, we demonstrate that the proposed FP framework converges towards $ε$-Nash equilibria in continuous- space zero-sum games. We validate our method on complex multi-agent environments, including racing and multi-particle zero-sum games. Simulation results show that the learned policies are robust against diverse opponents and outperform baseline reinforcement learning policies. Our approach achieves up to 3$\times$ faster convergence and 30$\times$ higher success rates on average against RL-based baselines, demonstrating its robustness to opponent strategies and stability across training iterations

[1025] TokenSqueeze: Performance-Preserving Compression for Reasoning LLMs

Yuxiang Zhang, Zhengxu Yu, Weihang Pan, Zhongming Jin, Qiang Fu, Deng Cai, Binbin Lin, Jieping Ye

Main category: cs.LG

TL;DR: TokenSqueeze is a method that reduces token usage in reasoning LLMs by condensing reasoning paths while maintaining accuracy, achieving 50% token reduction without performance loss.

DetailsMotivation: Long chain-of-thought traces in reasoning LLMs increase token usage, leading to higher inference latency and memory consumption, creating a need to balance accuracy and efficiency.

Method: TokenSqueeze uses adaptive reasoning depth selection and distribution-aligned linguistic refinement to condense reasoning paths while preserving logical integrity, relying exclusively on self-generated data.

Result: DeepSeek-R1-Distill-Qwen-7B fine-tuned with TokenSqueeze achieved 50% average token reduction while preserving accuracy on MATH500 benchmark.

Conclusion: TokenSqueeze enables efficient and high-fidelity reasoning without relying on manually curated datasets, providing a practical solution for deploying reasoning LLMs in real-world applications.

Abstract: Emerging reasoning LLMs such as OpenAI-o1 and DeepSeek-R1 have achieved strong performance on complex reasoning tasks by generating long chain-of-thought (CoT) traces. However, these long CoTs result in increased token usage, leading to higher inference latency and memory consumption. As a result, balancing accuracy and reasoning efficiency has become essential for deploying reasoning LLMs in practical applications. Existing long-to-short (Long2Short) methods aim to reduce inference length but often sacrifice accuracy, revealing a need for an approach that maintains performance while lowering token costs. To address this efficiency-accuracy tradeoff, we propose TokenSqueeze, a novel Long2Short method that condenses reasoning paths while preserving performance and relying exclusively on self-generated data. First, to prevent performance degradation caused by excessive compression of reasoning depth, we propose to select self-generated samples whose reasoning depth is adaptively matched to the complexity of the problem. To further optimize the linguistic expression without altering the underlying reasoning paths, we introduce a distribution-aligned linguistic refinement method that enhances the clarity and conciseness of the reasoning path while preserving its logical integrity. Comprehensive experimental results demonstrate the effectiveness of TokenSqueeze in reducing token usage while maintaining accuracy. Notably, DeepSeek-R1-Distill-Qwen-7B fine-tuned using our proposed method achieved a 50% average token reduction while preserving accuracy on the MATH500 benchmark. TokenSqueeze exclusively utilizes the model’s self-generated data, enabling efficient and high-fidelity reasoning without relying on manually curated short-answer datasets across diverse applications. Our code is available at https://github.com/zhangyx1122/TokenSqueeze.

[1026] Laplace Learning in Wasserstein Space

Mary Chriselda Antony Oliver, Michael Roberts, Carola-Bibiane Schönlieb, Matthew Thorpe

Main category: cs.LG

TL;DR: This paper extends graph-based semi-supervised learning from Euclidean spaces to infinite-dimensional Wasserstein spaces using the manifold hypothesis, proving variational convergence and characterizing the Laplace-Beltrami operator.

DetailsMotivation: To investigate graph-based semi-supervised learning methods under the manifold hypothesis by extending classical algorithms from finite-dimensional Euclidean spaces to infinite-dimensional Wasserstein spaces.

Method: Prove variational convergence of discrete graph p-Dirichlet energy to continuum counterpart, characterize Laplace-Beltrami operator on Wasserstein space submanifolds, and validate through numerical experiments on benchmark datasets.

Result: Successfully established theoretical framework for Laplace Learning in Wasserstein spaces and demonstrated consistent classification performance in high-dimensional settings through numerical validation.

Conclusion: The proposed extension of graph-based semi-supervised learning to Wasserstein spaces provides a consistent theoretical framework that performs well in high-dimensional settings, bridging finite-dimensional and infinite-dimensional approaches.

Abstract: The manifold hypothesis posits that high-dimensional data typically resides on low-dimensional sub spaces. In this paper, we assume manifold hypothesis to investigate graph-based semi-supervised learning methods. In particular, we examine Laplace Learning in the Wasserstein space, extending the classical notion of graph-based semi-supervised learning algorithms from finite-dimensional Euclidean spaces to an infinite-dimensional setting. To achieve this, we prove variational convergence of a discrete graph p- Dirichlet energy to its continuum counterpart. In addition, we characterize the Laplace-Beltrami operator on asubmanifold of the Wasserstein space. Finally, we validate the proposed theoretical framework through numerical experiments conducted on benchmark datasets, demonstrating the consistency of our classification performance in high-dimensional settings.

[1027] Uncovering and Mitigating Transient Blindness in Multimodal Model Editing

Xiaoqi Han, Ru Li, Ran Yi, Hongye Tan, Zhuomin Liang, Víctor Gutiérrez-Basulto, Jeff Z. Pan

Main category: cs.LG

TL;DR: The paper proposes a comprehensive evaluation framework for multimodal model editing that addresses overfitting issues in existing methods, introduces De-VQA to detect transient blindness, and develops locality-aware adversarial losses to improve editing performance.

DetailsMotivation: Existing multimodal model editing evaluation methods are adapted from textual approaches and overstate success by using low-similarity or random inputs, which obscure overfitting problems.

Method: Proposes a comprehensive locality evaluation framework with three dimensions (random-image, no-image, consistent-image) across seven data types, introduces De-VQA for dynamic evaluation, and develops locality-aware adversarial losses to balance cross-modal representations.

Result: The approach consistently outperforms existing baselines, reducing transient blindness and improving locality by 17% on average. Token analysis shows edits disproportionately affect textual tokens.

Conclusion: The proposed framework provides detailed and structured analysis of multimodal edits, effectively addresses transient blindness, and improves editing performance through balanced cross-modal representations.

Abstract: Multimodal Model Editing (MMED) aims to correct erroneous knowledge in multimodal models. Existing evaluation methods, adapted from textual model editing, overstate success by relying on low-similarity or random inputs, obscure overfitting. We propose a comprehensive locality evaluation framework, covering three key dimensions: random-image locality, no-image locality, and consistent-image locality, operationalized through seven distinct data types, enabling a detailed and structured analysis of multimodal edits. We introduce De-VQA, a dynamic evaluation for visual question answering, uncovering a phenomenon we term transient blindness, overfitting to edit-similar text while ignoring visuals. Token analysis shows edits disproportionately affect textual tokens. We propose locality-aware adversarial losses to balance cross-modal representations. Empirical results demonstrate that our approach consistently outperforms existing baselines, reducing transient blindness and improving locality by 17% on average.

[1028] MorphBoost: Self-Organizing Universal Gradient Boosting with Adaptive Tree Morphing

Boris Kriuk

Main category: cs.LG

TL;DR: MorphBoost introduces self-organizing tree structures in gradient boosting that dynamically adapt splitting criteria during training, achieving state-of-the-art performance with superior consistency and robustness across diverse datasets.

DetailsMotivation: Traditional gradient boosting uses static tree structures with fixed splitting criteria that cannot adapt to evolving gradient distributions and problem-specific characteristics across different learning stages, limiting performance.

Method: Features adaptive split functions that evolve based on gradient statistics and iteration-dependent learning pressures, including morphing split criteria combining gradient-based scores with information-theoretic metrics, automatic problem fingerprinting, vectorized tree prediction, interaction-aware feature importance, and fast-mode optimization.

Result: Outperformed XGBoost by 0.84% on average across 10 datasets, secured overall winner position with 40% win rate, achieved lowest variance (σ=0.0948) and highest minimum accuracy, showing superior consistency and notable improvements on advanced problems.

Conclusion: MorphBoost’s dynamic tree morphing enables automatic adaptation to problem complexity, delivering state-of-the-art performance with enhanced robustness and consistency compared to traditional gradient boosting methods.

Abstract: Traditional gradient boosting algorithms employ static tree structures with fixed splitting criteria that remain unchanged throughout training, limiting their ability to adapt to evolving gradient distributions and problem-specific characteristics across different learning stages. This work introduces MorphBoost, a new gradient boosting framework featuring self-organizing tree structures that dynamically morph their splitting behavior during training. The algorithm implements adaptive split functions that evolve based on accumulated gradient statistics and iteration-dependent learning pressures, enabling automatic adjustment to problem complexity. Key innovations include: (1) morphing split criterion combining gradient-based scores with information-theoretic metrics weighted by training progress; (2) automatic problem fingerprinting for intelligent parameter configuration across binary/multiclass/regression tasks; (3) vectorized tree prediction achieving significant computational speedups; (4) interaction-aware feature importance detecting multiplicative relationships; and (5) fast-mode optimization balancing speed and accuracy. Comprehensive benchmarking across 10 diverse datasets against competitive models (XGBoost, LightGBM, GradientBoosting, HistGradientBoosting, ensemble methods) demonstrates that MorphBoost achieves state-of-the-art performance, outperforming XGBoost by 0.84% on average. MorphBoost secured the overall winner position with 4/10 dataset wins (40% win rate) and 6/30 top-3 finishes (20%), while maintaining the lowest variance (σ=0.0948) and highest minimum accuracy across all models, revealing superior consistency and robustness. Performance analysis across difficulty levels shows competitive results on easy datasets while achieving notable improvements on advanced problems due to higher adaptation levels.

[1029] Seek and You Shall Fold

Nadav Bojan Sellam, Meital Bojan, Paul Schanda, Alex Bronstein

Main category: cs.LG

TL;DR: A framework for non-differentiable guidance of protein generative models using genetic algorithms to incorporate experimental data like chemical shifts, distance constraints, and NOE restraints.

DetailsMotivation: Experimental data is essential for accurate protein structure prediction but most predictors are non-differentiable, making them incompatible with gradient-based methods, especially in NMR where chemical shifts are hard to integrate.

Method: Couples continuous diffusion-based protein generator with black-box objectives via a tailored genetic algorithm to handle non-differentiable guidance.

Result: Successfully demonstrated structure generation guided by pairwise distance constraints, NOE restraints, and for the first time chemical shifts, revealing weaknesses in current predictors.

Conclusion: Establishes chemical shift guided structure generation as feasible and provides a general strategy for incorporating diverse experimental data beyond differentiability limits.

Abstract: Accurate protein structures are essential for understanding biological function, yet incorporating experimental data into protein generative models remains a major challenge. Most predictors of experimental observables are non-differentiable, making them incompatible with gradient-based conditional sampling. This is especially limiting in nuclear magnetic resonance, where rich data such as chemical shifts are hard to directly integrate into generative modeling. We introduce a framework for non-differentiable guidance of protein generative models, coupling a continuous diffusion-based generator with any black-box objective via a tailored genetic algorithm. We demonstrate its effectiveness across three modalities: pairwise distance constraints, nuclear Overhauser effect restraints, and for the first time chemical shifts. These results establish chemical shift guided structure generation as feasible, expose key weaknesses in current predictors, and showcase a general strategy for incorporating diverse experimental signals. Our work points toward automated, data-conditioned protein modeling beyond the limits of differentiability.

[1030] Counterfactual Explainable AI (XAI) Method for Deep Learning-Based Multivariate Time Series Classification

Alan G. Paredes Cetina, Kaouther Benguessoum, Raoni Lourenço, Sylvain Kubler

Main category: cs.LG

TL;DR: CONFETTI is a novel multi-objective counterfactual explanation method for multivariate time series that balances prediction confidence, proximity, and sparsity to provide actionable insights with minimal changes.

DetailsMotivation: Current deep learning models for multivariate time series lack transparency, and existing explainable AI methods provide only partial insights. Counterfactual explanations are promising but typically prioritize only one objective (accuracy, proximity, or sparsity), limiting their practical value.

Method: CONFETTI identifies key MTS subsequences, locates a counterfactual target, and optimally modifies the time series to balance three objectives: prediction confidence, proximity to original data, and sparsity of changes.

Result: Evaluated on seven MTS datasets from UEA archive, CONFETTI consistently outperforms state-of-the-art CE methods, achieving ≥10% higher confidence while improving sparsity in ≥40% of cases across six different metrics.

Conclusion: CONFETTI provides more effective and interpretable counterfactual explanations for multivariate time series by simultaneously optimizing multiple objectives, enhancing decision support and transparency in time series analysis.

Abstract: Recent advances in deep learning have improved multivariate time series (MTS) classification and regression by capturing complex patterns, but their lack of transparency hinders decision-making. Explainable AI (XAI) methods offer partial insights, yet often fall short of conveying the full decision space. Counterfactual Explanations (CE) provide a promising alternative, but current approaches typically prioritize either accuracy, proximity or sparsity – rarely all – limiting their practical value. To address this, we propose CONFETTI, a novel multi-objective CE method for MTS. CONFETTI identifies key MTS subsequences, locates a counterfactual target, and optimally modifies the time series to balance prediction confidence, proximity and sparsity. This method provides actionable insights with minimal changes, improving interpretability, and decision support. CONFETTI is evaluated on seven MTS datasets from the UEA archive, demonstrating its effectiveness in various domains. CONFETTI consistently outperforms state-of-the-art CE methods in its optimization objectives, and in six other metrics from the literature, achieving $\geq10%$ higher confidence while improving sparsity in $\geq40%$.

[1031] Incoherent Beliefs & Inconsistent Actions in Large Language Models

Arka Pal, Teo Kitanovski, Arthur Liang, Akilesh Potti, Micah Goldblum

Main category: cs.LG

TL;DR: LLMs show significant inconsistencies in belief updating and action alignment, with up to 30% difference between elicited posteriors and correct prior updates, and actions often misaligned with held beliefs.

DetailsMotivation: To examine LLM performance in dynamic real-world environments that differ from static evaluation datasets, focusing on coherent belief updating and action-belief consistency.

Method: Analyzed LLMs’ ability to update beliefs coherently and take actions consistent with those beliefs through betting markets and response challenges.

Result: LLMs are largely inconsistent in belief updating and often take actions misaligned with their beliefs, even when models are well-calibrated or achieve high accuracy.

Conclusion: Predicting LLM behavior in complex real-world settings is difficult due to fundamental inconsistencies in belief updating and action alignment.

Abstract: Real-world tasks and environments exhibit differences from the static datasets that large language models (LLMs) are typically evaluated on. Such tasks can involve sequential interaction, requiring coherent updating of beliefs in light of new evidence, and making appropriate decisions based on those beliefs. Predicting how LLMs will perform in such dynamic environments is important, but can be tricky to determine from measurements in static settings. In this work, we examine two critical components of LLM performance: the ability of LLMs to coherently update their beliefs, and the extent to which the actions they take are consistent with those beliefs. First, we find that LLMs are largely inconsistent in how they update their beliefs; models can exhibit up to a 30% average difference between the directly elicited posterior, and the correct update of their prior. Second, we find that LLMs also often take actions which are inconsistent with the beliefs they hold. On a betting market, for example, LLMs often do not even bet in the same direction as their internally held beliefs over the underlying outcomes. We also find they have moderate self-inconsistency in how they respond to challenges by users to given answers. Finally, we show that the above properties hold even for strong models that obtain high accuracy or that are well-calibrated on the tasks at hand. Our results highlight the difficulties of predicting LLM behavior in complex real-world settings.

[1032] Edge-aware baselines for ogbn-proteins in PyTorch Geometric: species-wise normalization, post-hoc calibration, and cost-accuracy trade-offs

Aleksandar Stanković, Dejan Lisica

Main category: cs.LG

TL;DR: Reproducible edge-aware baselines for ogbn-proteins using GraphSAGE with sum-based edge-to-node features, comparing normalization methods and showing post-hoc calibration improvements.

DetailsMotivation: To establish reproducible baselines for ogbn-proteins dataset by studying key system choices in edge feature aggregation and message passing that dominate practice.

Method: Used GraphSAGE with different edge-to-node feature aggregation methods (sum, mean, max), compared LayerNorm, BatchNorm, and species-aware Conditional LayerNorm, and applied post-hoc calibration techniques including temperature scaling and label-correlation smoothing.

Result: Sum aggregation consistently outperformed mean and max; BatchNorm achieved best AUC while Conditional LayerNorm matched AUC frontier with better thresholded F1; post-hoc calibration substantially improved micro-F1 and ECE with negligible AUC change.

Conclusion: Established strong reproducible baselines for ogbn-proteins, identified optimal system configurations, and demonstrated significant improvements through post-hoc calibration techniques.

Abstract: We present reproducible, edge-aware baselines for ogbn-proteins in PyTorch Geometric (PyG). We study two system choices that dominate practice: (i) how 8-dimensional edge evidence is aggregated into node inputs, and (ii) how edges are used inside message passing. Our strongest baseline is GraphSAGE with sum-based edge-to-node features. We compare LayerNorm (LN), BatchNorm (BN), and a species-aware Conditional LayerNorm (CLN), and report compute cost (time, VRAM, parameters) together with accuracy (ROC-AUC) and decision quality. In our primary experimental setup (hidden size 512, 3 layers, 3 seeds), sum consistently beats mean and max; BN attains the best AUC, while CLN matches the AUC frontier with better thresholded F1. Finally, post-hoc per-label temperature scaling plus per-label thresholds substantially improves micro-F1 and expected calibration error (ECE) with negligible AUC change, and light label-correlation smoothing yields small additional gains. We release standardized artifacts and scripts used for all of the runs presented in the paper.

[1033] Interpreting the Effects of Quantization on LLMs

Manpreet Singh, Hassan Sajjad

Main category: cs.LG

TL;DR: Quantization has minor impact on LLM reliability - model calibration remains stable, dead neuron counts unchanged, and neuron contribution patterns vary by model size but no drastic changes observed.

DetailsMotivation: To investigate how quantization affects internal representations and reliability of LLMs, as this impact remains understudied despite quantization's practical benefits for deployment.

Method: Used interpretability techniques to analyze multiple LLMs under 4-bit and 8-bit quantization, examining model calibration, neuron activations, dead neuron counts, and neuron contribution patterns.

Result: Quantization has minimal effect on model calibration; dead neuron counts remain consistent; smaller models have fewer salient neurons while larger models have more (except Llama-2-7B); neuron redundancy effects vary by model.

Conclusion: Quantization effects vary by model and tasks, but no drastic changes were observed that would discourage using quantization as a reliable model compression technique.

Abstract: Quantization offers a practical solution to deploy LLMs in resource-constraint environments. However, its impact on internal representations remains understudied, raising questions about the reliability of quantized models. In this study, we employ a range of interpretability techniques to investigate how quantization affects model and neuron behavior. We analyze multiple LLMs under 4-bit and 8-bit quantization. Our findings reveal that the impact of quantization on model calibration is generally minor. Analysis of neuron activations indicates that the number of dead neurons, i.e., those with activation values close to 0 across the dataset, remains consistent regardless of quantization. In terms of neuron contribution to predictions, we observe that smaller full precision models exhibit fewer salient neurons, whereas larger models tend to have more, with the exception of Llama-2-7B. The effect of quantization on neuron redundancy varies across models. Overall, our findings suggest that effect of quantization may vary by model and tasks, however, we did not observe any drastic change which may discourage the use of quantization as a reliable model compression technique.

[1034] Explainable RL Policies by Distilling to Locally-Specialized Linear Policies with Voronoi State Partitioning

Senne Deproost, Dennis Steckelmacher, Ann Nowé

Main category: cs.LG

TL;DR: The paper proposes a method using Voronoi partitioning to create explainable linear models that mimic deep RL controllers by dividing the state space into regions where simpler models can operate effectively.

DetailsMotivation: Deep RL controllers lack transparency, which poses challenges for regulations and trust. Knowledge distillation can transfer learned behavior to human-readable models, but single models struggle in dynamic situations and need the right balance between flexibility and complexity.

Method: A model-agnostic method using Voronoi partitioning to divide the state space into regions where simplified, human-understandable linear models can operate effectively while achieving similar performance to the original controller.

Result: The approach was evaluated on a gridworld environment and classic control task. The distillation to locally-specialized linear models produces explainable policies that match or even slightly outperform the black-box policy they are distilled from.

Conclusion: Voronoi partitioning enables effective knowledge distillation from deep RL controllers to explainable linear models, providing transparency while maintaining or improving performance.

Abstract: Deep Reinforcement Learning is one of the state-of-the-art methods for producing near-optimal system controllers. However, deep RL algorithms train a deep neural network, that lacks transparency, which poses challenges when the controller has to meet regulations, or foster trust. To alleviate this, one could transfer the learned behaviour into a model that is human-readable by design using knowledge distilla- tion. Often this is done with a single model which mimics the original model on average but could struggle in more dynamic situations. A key challenge is that this simpler model should have the right balance be- tween flexibility and complexity or right balance between balance bias and accuracy. We propose a new model-agnostic method to divide the state space into regions where a simplified, human-understandable model can operate in. In this paper, we use Voronoi partitioning to find regions where linear models can achieve similar performance to the original con- troller. We evaluate our approach on a gridworld environment and a classic control task. We observe that our proposed distillation to locally- specialized linear models produces policies that are explainable and show that the distillation matches or even slightly outperforms the black-box policy they are distilled from.

[1035] Tab-PET: Graph-Based Positional Encodings for Tabular Transformers

Yunze Leng, Rohan Ghosh, Mehul Motani

Main category: cs.LG

TL;DR: The paper shows that positional encodings (PEs) can improve tabular transformer performance by reducing feature dimensionality, and proposes Tab-PET framework using graph-based PEs.

DetailsMotivation: Tabular data lacks inherent structural cues that vision/language models exploit, and existing tabular transformers don't use positional encodings despite their potential benefits for generalization.

Method: Proposed Tab-PET framework that estimates positional encodings using graph-based approaches (association-based and causality-based) and incorporates them into tabular transformer embeddings.

Result: Graph-derived PEs significantly improved performance across 50 classification/regression datasets for TabTransformer, SAINT, and FT-Transformer models, with association-based graphs showing more stable gains.

Conclusion: Positional encodings play an unexpected but valuable role in tabular transformers by reducing effective feature rank and improving generalization, with graph-based approaches providing effective structural cues.

Abstract: Supervised learning with tabular data presents unique challenges, including low data sizes, the absence of structural cues, and heterogeneous features spanning both categorical and continuous domains. Unlike vision and language tasks, where models can exploit inductive biases in the data, tabular data lacks inherent positional structure, hindering the effectiveness of self-attention mechanisms. While recent transformer-based models like TabTransformer, SAINT, and FT-Transformer (which we refer to as 3T) have shown promise on tabular data, they typically operate without leveraging structural cues such as positional encodings (PEs), as no prior structural information is usually available. In this work, we find both theoretically and empirically that structural cues, specifically PEs can be a useful tool to improve generalization performance for tabular transformers. We find that PEs impart the ability to reduce the effective rank (a form of intrinsic dimensionality) of the features, effectively simplifying the task by reducing the dimensionality of the problem, yielding improved generalization. To that end, we propose Tab-PET (PEs for Tabular Transformers), a graph-based framework for estimating and inculcating PEs into embeddings. Inspired by approaches that derive PEs from graph topology, we explore two paradigms for graph estimation: association-based and causality-based. We empirically demonstrate that graph-derived PEs significantly improve performance across 50 classification and regression datasets for 3T. Notably, association-based graphs consistently yield more stable and pronounced gains compared to causality-driven ones. Our work highlights an unexpected role of PEs in tabular transformers, revealing how they can be harnessed to improve generalization.

[1036] Statistically Accurate and Robust Generative Prediction of Rock Discontinuities with A Tabular Foundation Model

Han Meng, Gang Mei, Hong Tian, Nengxiong Xu, Jianbing Peng

Main category: cs.LG

TL;DR: A robust approach using tabular foundation models for generative prediction of rock discontinuities that outperforms conventional methods in accuracy and robustness.

DetailsMotivation: Rock discontinuities are critical for rock mass stability but their internal distributions are unobservable. Surface observations are sparse, and existing generative approaches fail to capture complex patterns or lack robustness with limited data.

Method: Utilizes a tabular foundation model with powerful sample learning capability specifically designed for small data, enabling effective capture of complex distribution patterns from limited measured discontinuities.

Result: Comparative experiments on ten diverse datasets show superior accuracy and robustness over conventional statistical models and deep generative approaches.

Conclusion: This work advances quantitative characterization of rock mass structures, supporting safer and more reliable data-driven geotechnical design.

Abstract: Rock discontinuities critically govern the mechanical behavior and stability of rock masses. Their internal distributions remain largely unobservable and are typically inferred from surface-exposed discontinuities using generative prediction approaches. However, surface-exposed observations are inherently sparse, and existing generative prediction approaches either fail to capture the underlying complex distribution patterns or lack robustness under data-sparse conditions. Here, we proposed a simple yet robust approach for statistically accurate generative prediction of rock discontinuities by utilizing a tabular foundation model. By leveraging the powerful sample learning capability of the foundation model specifically designed for small data, our approach can effectively capture the underlying complex distribution patterns within limited measured discontinuities. Comparative experiments on ten datasets with diverse scales and distribution patterns of discontinuities demonstrate superior accuracy and robustness over conventional statistical models and deep generative approaches. This work advances quantitative characterization of rock mass structures, supporting safer and more reliable data-driven geotechnical design.

[1037] Dual-LoRA and Quality-Enhanced Pseudo Replay for Multimodal Continual Food Learning

Xinlan Wu, Bin Zhu, Feng Han, Pengkun Jiao, Jingjing Chen

Main category: cs.LG

TL;DR: Proposes a continual learning framework for multimodal food analysis using Dual-LoRA architecture and Quality-Enhanced Pseudo Replay to prevent catastrophic forgetting when learning new tasks.

DetailsMotivation: Existing large multimodal models in food analysis suffer from catastrophic forgetting when learning new tasks, requiring costly retraining from scratch.

Method: Dual-LoRA architecture with two complementary adapters per task: specialized LoRA for task-specific knowledge with orthogonal constraints, and cooperative LoRA for shared knowledge via pseudo replay. Quality-Enhanced Pseudo Replay uses self-consistency and semantic similarity to reduce hallucinations.

Result: Experiments on Uni-Food dataset show superior performance in mitigating forgetting, representing the first effective continual learning approach for complex food tasks.

Conclusion: The proposed framework effectively addresses catastrophic forgetting in multimodal food learning, enabling efficient continual learning without costly retraining.

Abstract: Food analysis has become increasingly critical for health-related tasks such as personalized nutrition and chronic disease prevention. However, existing large multimodal models (LMMs) in food analysis suffer from catastrophic forgetting when learning new tasks, requiring costly retraining from scratch. To address this, we propose a novel continual learning framework for multimodal food learning, integrating a Dual-LoRA architecture with Quality-Enhanced Pseudo Replay. We introduce two complementary low-rank adapters for each task: a specialized LoRA that learns task-specific knowledge with orthogonal constraints to previous tasks’ subspaces, and a cooperative LoRA that consolidates shared knowledge across tasks via pseudo replay. To improve the reliability of replay data, our Quality-Enhanced Pseudo Replay strategy leverages self-consistency and semantic similarity to reduce hallucinations in generated samples. Experiments on the comprehensive Uni-Food dataset show superior performance in mitigating forgetting, representing the first effective continual learning approach for complex food tasks.

[1038] A Novel Hierarchical Integration Method for Efficient Model Merging in Medical LLMs

Prakrit Timilsina, Anuj Nepal, Rajan Kadel, Robin Doss

Main category: cs.LG

TL;DR: Systematic evaluation of parameter merging techniques for medical LLMs shows simple averaging methods outperform complex approaches for architecturally compatible models, offering efficient knowledge consolidation for distributed healthcare AI.

DetailsMotivation: Address challenges in distributed healthcare AI: consolidating specialized domain knowledge across institutions while maintaining privacy, reducing computational overhead, and preventing catastrophic forgetting during model updates.

Method: Evaluated six parameter-space merging techniques on two architecturally compatible medical LLMs from Mistral-7B base model. Introduced novel hierarchical method combining selective Optimal Transport alignment for attention layers with cosine similarity-weighted interpolation.

Result: Simple averaging methods outperformed complex approaches. Task Arithmetic achieved 45.80% accuracy on MedQA, showing architecturally compatible models benefit significantly from simple averaging over complex pruning-based methods.

Conclusion: For architecturally compatible models, simple averaging provides robust and computationally efficient baseline for knowledge consolidation, offering pragmatic path for scalable medical AI systems in resource-constrained IoT environments.

Abstract: Large Language Models (LLMs) face significant challenges in distributed healthcare, including consolidating specialized domain knowledge across institutions while maintaining privacy, reducing computational overhead, and preventing catastrophic forgetting during model updates.This paper presents a systematic evaluation of six parameter-space merging techniques applied to two architecturally compatible medical LLMs derived from the Mistral-7B base model. We introduce a novel hierarchical method that combines selective Optimal Transport (OT) alignment for attention layers with cosine similarity-weighted interpolation, designed to address permutation variance while minimizing computational overhead for edge deployment scenarios. Our study evaluates Task Arithmetic, Linear Averaging, DARE-TIES, DELLA, Breadcrumbs, and our Hierarchical approach across five medical benchmarks. Results demonstrate that architecturally compatible models benefit significantly from simple averaging methods, with Task Arithmetic achieving 45.80% accuracy on MedQA, outperforming complex pruning-based approaches. These findings offer critical insights for the deployment of distributed medical AI in resource-constrained IoT environments, where computational efficiency and model compatibility are paramount. Our work establishes that for architecturally compatible models, simple averaging provides a robust and computationally efficient baseline for knowledge consolidation, offering a pragmatic path forward for scalable medical AI systems.

[1039] Finding Kissing Numbers with Game-theoretic Reinforcement Learning

Chengdong Ma, Théo Tao Zhaowei, Pengyu Li, Minghao Liu, Haojun Chen, Zihao Mao, Yuan Cheng, Yuan Qi, Yaodong Yang

Main category: cs.LG

TL;DR: PackingStar uses game-theoretic reinforcement learning to solve the Kissing Number Problem, achieving new records in dimensions 25-31 and discovering over 6000 new structures.

DetailsMotivation: The Kissing Number Problem has remained challenging since 1694 due to high-dimensional geometric irregularities and exponential combinatorial complexity beyond 8 dimensions, limiting scalability of traditional methods.

Method: Modeled as a two-player matrix completion game where one player fills entries representing pairwise cosines of sphere center vectors and another corrects suboptimal ones, using cooperative dynamics to maximize matrix size corresponding to kissing number.

Result: Reproduces previous configurations and surpasses all human-known records from dimensions 25 to 31, achieves first breakthrough beyond rational structures from 1971 in 13 dimensions, and discovers over 6000 new structures in 14 and other dimensions.

Conclusion: Demonstrates AI’s power to explore high-dimensional spaces beyond human intuition and opens new pathways for the Kissing Number Problem and broader geometry problems.

Abstract: Since Isaac Newton first studied the Kissing Number Problem in 1694, determining the maximal number of non-overlapping spheres around a central sphere has remained a fundamental challenge. This problem represents the local analogue of Hilbert’s 18th problem on sphere packing, bridging geometry, number theory, and information theory. Although significant progress has been made through lattices and codes, the irregularities of high-dimensional geometry and exponentially growing combinatorial complexity beyond 8 dimensions, which exceeds the complexity of Go game, limit the scalability of existing methods. Here we model this problem as a two-player matrix completion game and train the game-theoretic reinforcement learning system, PackingStar, to efficiently explore high-dimensional spaces. The matrix entries represent pairwise cosines of sphere center vectors; one player fills entries while another corrects suboptimal ones, jointly maximizing the matrix size, corresponding to the kissing number. This cooperative dynamics substantially improves sample quality, making the extremely large spaces tractable. PackingStar reproduces previous configurations and surpasses all human-known records from dimensions 25 to 31, with the configuration in 25 dimensions geometrically corresponding to the Leech lattice and suggesting possible optimality. It achieves the first breakthrough beyond rational structures from 1971 in 13 dimensions and discovers over 6000 new structures in 14 and other dimensions. These results demonstrate AI’s power to explore high-dimensional spaces beyond human intuition and open new pathways for the Kissing Number Problem and broader geometry problems.

[1040] Fast and Robust Simulation-Based Inference With Optimization Monte Carlo

Vasilis Gkolemis, Christos Diou, Michael Gutmann

Main category: cs.LG

TL;DR: A new gradient-based method for Bayesian parameter inference in complex stochastic simulators that reformulates simulation as deterministic optimization, achieving accurate posterior inference with significantly reduced runtime compared to state-of-the-art approaches.

DetailsMotivation: Bayesian parameter inference for complex stochastic simulators is challenging due to intractable likelihood functions, and existing methods become costly in high-dimensional parameter spaces or with partially uninformative outputs.

Method: Building on Optimization Monte Carlo framework, the approach reformulates stochastic simulation as deterministic optimization problems and uses gradient-based methods to efficiently navigate toward high-density posterior regions. A JAX-based implementation enhances performance through vectorization.

Result: Extensive experiments show the method consistently matches or exceeds state-of-the-art accuracy in high-dimensional parameter spaces, uninformative outputs, multiple observations and multimodal posteriors, while substantially reducing runtime.

Conclusion: The proposed method delivers accurate posterior inference for differentiable simulators with significantly reduced computational costs, making it particularly effective for challenging inference problems.

Abstract: Bayesian parameter inference for complex stochastic simulators is challenging due to intractable likelihood functions. Existing simulation-based inference methods often require large number of simulations and become costly to use in high-dimensional parameter spaces or in problems with partially uninformative outputs. We propose a new method for differentiable simulators that delivers accurate posterior inference with substantially reduced runtimes. Building on the Optimization Monte Carlo framework, our approach reformulates stochastic simulation as deterministic optimization problems. Gradient-based methods are then applied to efficiently navigate toward high-density posterior regions and avoid wasteful simulations in low-probability areas. A JAX-based implementation further enhances the performance through vectorization of key method components. Extensive experiments, including high-dimensional parameter spaces, uninformative outputs, multiple observations and multimodal posteriors show that our method consistently matches, and often exceeds, the accuracy of state-of-the-art approaches, while reducing the runtime by a substantial margin.

[1041] PAST: A Primary-Auxiliary Spatio-Temporal Network for Traffic Time Series Imputation

Hanwen Hu, Zimo Wen, Shiyou Qian, Jian Co

Main category: cs.LG

TL;DR: PAST network for traffic time series imputation handles diverse missing data types by categorizing patterns into primary (internal relationships) and auxiliary (external factors), achieving state-of-the-art performance.

DetailsMotivation: Existing models struggle with random missing positions and fail to learn long-term dependencies needed for extensive missing conditions in traffic data.

Method: Proposed PAST network with Graph-Integrated Module (GIM) for primary patterns using dynamic graphs with interval-aware dropout, and Cross-Gated Module (CGM) for auxiliary patterns using bidirectional gating on external features.

Result: Outperforms 7 baselines by up to 26.2% in RMSE and 31.6% in MAE across 27 missing data conditions on three datasets.

Conclusion: PAST effectively handles various missing data types by modeling both primary and auxiliary patterns through interactive modules under self-supervised training.

Abstract: Traffic time series imputation is crucial for the safety and reliability of intelligent transportation systems, while diverse types of missing data, including random, fiber, and block missing make the imputation task challenging. Existing models often focus on disentangling and separately modeling spatial and temporal patterns based on relationships between data points. However, these approaches struggle to adapt to the random missing positions, and fail to learn long-term and large-scale dependencies, which are essential in extensive missing conditions. In this paper, patterns are categorized into two types to handle various missing data conditions: primary patterns, which originate from internal relationships between data points, and auxiliary patterns, influenced by external factors like timestamps and node attributes. Accordingly, we propose the Primary-Auxiliary Spatio-Temporal network (PAST). It comprises a graph-integrated module (GIM) and a cross-gated module (CGM). GIM captures primary patterns via dynamic graphs with interval-aware dropout and multi-order convolutions, and CGM extracts auxiliary patterns through bidirectional gating on embedded external features. The two modules interact via shared hidden vectors and are trained under an ensemble self-supervised framework. Experiments on three datasets under 27 missing data conditions demonstrate that the imputation accuracy of PAST outperforms seven state-of-the-art baselines by up to 26.2% in RMSE and 31.6% in MAE.

[1042] MMWSTM-ADRAN+: A Novel Hybrid Deep Learning Architecture for Enhanced Climate Time Series Forecasting and Extreme Event Prediction

Shaheen Mohammed Saleh Ahmed, Hakan Hakan Guneyli

Main category: cs.LG

TL;DR: MMWSTM-ADRAN+ is a dual-stream deep learning model for extreme temperature forecasting that combines weather regime dynamics with anomaly detection using attention mechanisms.

DetailsMotivation: Accurate short-range prediction of extreme air temperature events is crucial for climate-risk management but remains challenging.

Method: Dual-stream architecture with MMWSTM (BiLSTM + Markov transitions for weather regimes) and ADRAN (BiGRU + attention + anomaly amplification), fused via attentive gate. Uses ExtremeWeatherLoss and data augmentation.

Result: The model is designed to capture synoptic-scale weather regime changes and enhance sensitivity to low-probability extreme temperature signals.

Conclusion: The proposed architecture addresses the fundamental challenge of extreme temperature forecasting through regime-aware dynamics and anomaly-focused attention mechanisms.

Abstract: Accurate short-range prediction of extreme air temperature events remains a fundamental challenge in operational climate-risk management. We present Multi-Modal Weather State Transition Model with Anomaly-Driven Recurrent Attention Network Plus (MMWSTM-ADRAN+), a dual-stream deep learning architecture that couples a regime-aware dynamics model with an anomaly-focused attention mechanism to forecast daily maximum temperature and its extremes. The first stream, MMWSTM, combines bidirectional Long Short-Term Memory (BiLSTM) units with a learnable Markov state transition matrix to capture synoptic-scale weather regime changes. The second stream, ADRAN, integrates bidirectional Gated Recurrent Units (BiGRUs), multi-head self-attention, and a novel anomaly amplification layer to enhance sensitivity to low-probability signals. A lightweight attentive fusion gate adaptively determines the contribution of each stream to the final prediction. Model optimization employs a custom ExtremeWeatherLoss function that up-weights errors on the upper 5% and lower 5% of the temperature distribution, and a time-series data augmentation suite (jittering, scaling, time/magnitude warping) that effectively quadruples the training data

[1043] Larger Datasets Can Be Repeated More: A Theoretical Analysis of Multi-Epoch Scaling in Linear Regression

Tingkai Yan, Haodong Wen, Binghui Li, Kairong Luo, Wenguang Chen, Kaifeng Lyu

Main category: cs.LG

TL;DR: The paper analyzes how training LLMs for multiple epochs on limited data affects scaling laws, introducing the concept of ’effective reuse rate’ E(K,N) to quantify how much larger a dataset must be for one-pass training to match K-epoch performance.

DetailsMotivation: While data scaling laws for LLMs are well-studied in one-pass training with massive data, the impact of limited data and repeated epochs remains unexplored, particularly how multiple epochs reshape scaling laws.

Method: Theoretical analysis of SGD in linear regression under strong convexity or Zipf-distributed data, defining effective reuse rate E(K,N) as the multiplicative factor by which dataset must grow under one-pass training to achieve same test loss as K-epoch training.

Result: Two key findings: (1) For small K, E(K,N) ≈ K (linear gain per epoch); (2) For larger K, E(K,N) plateaus at problem-dependent value that grows with N (Θ(log N) for strongly-convex case), showing larger datasets can be repeated more before marginal benefit vanishes.

Conclusion: The maximum K for which E(K,N) ≈ K depends on data size and distribution, contradicting recent empirical claims that 4 epochs yield negligible loss differences. Results highlight need to explicitly model both factors in scaling law studies with data reuse.

Abstract: While data scaling laws of large language models (LLMs) have been widely examined in the one-pass regime with massive corpora, their form under limited data and repeated epochs remains largely unexplored. This paper presents a theoretical analysis of how a common workaround, training for multiple epochs on the same dataset, reshapes the data scaling laws in linear regression. Concretely, we ask: to match the performance of training on a dataset of size $N$ for $K$ epochs, how much larger must a dataset be if the model is trained for only one pass? We quantify this using the \textit{effective reuse rate} of the data, $E(K, N)$, which we define as the multiplicative factor by which the dataset must grow under one-pass training to achieve the same test loss as $K$-epoch training. Our analysis precisely characterizes the scaling behavior of $E(K, N)$ for SGD in linear regression under either strong convexity or Zipf-distributed data: (1) When $K$ is small, we prove that $E(K, N) \approx K$, indicating that every new epoch yields a linear gain; (2) As $K$ increases, $E(K, N)$ plateaus at a problem-dependent value that grows with $N$ ($Θ(\log N)$ for the strongly-convex case), implying that larger datasets can be repeated more times before the marginal benefit vanishes. These theoretical findings point out a neglected factor in a recent empirical study (Muennighoff et al. (2023)), which claimed that training LLMs for up to $4$ epochs results in negligible loss differences compared to using fresh data at each step, \textit{i.e.}, $E(K, N) \approx K$ for $K \le 4$ in our notation. Supported by further empirical validation with LLMs, our results reveal that the maximum $K$ value for which $E(K, N) \approx K$ in fact depends on the data size and distribution, and underscore the need to explicitly model both factors in future studies of scaling laws with data reuse.

[1044] Discovering Operational Patterns Using Image-Based Convolutional Clustering and Composite Evaluation: A Case Study in Foundry Melting Processes

Zhipeng Ma, Bo Nørregaard Jørgensen, Zheng Grace Ma

Main category: cs.LG

TL;DR: A novel framework for unsupervised discovery of operational modes in univariate time-series data using image-based convolutional clustering with composite internal evaluation, achieving superior performance in industrial process monitoring.

DetailsMotivation: Industrial process monitoring faces challenges with unlabeled sensor data, high variability, and operational noise. Existing methods are limited in handling dynamic, unstructured industrial sequences due to fixed distance metrics or models designed for static data.

Method: Transforms raw time-series into grayscale matrix representations via overlapping sliding windows, uses deep convolutional autoencoder for feature extraction, integrates soft/hard clustering with two-stage refinement, and evaluates with composite score S_eva combining multiple indices.

Result: Applied to 3900+ furnace melting operations, identified seven explainable operational patterns with significant differences in energy consumption, thermal dynamics, and production duration. Outperformed classical and deep clustering baselines with superior performance, robustness, and domain-aligned explainability.

Conclusion: The framework addresses key challenges in unsupervised time-series analysis including sequence irregularity, overlapping modes, and metric inconsistency, providing a generalizable solution for data-driven diagnostics and energy optimization in industrial systems.

Abstract: Industrial process monitoring increasingly relies on sensor-generated time-series data, yet the lack of labels, high variability, and operational noise make it difficult to extract meaningful patterns using conventional methods. Existing clustering techniques either rely on fixed distance metrics or deep models designed for static data, limiting their ability to handle dynamic, unstructured industrial sequences. Addressing this gap, this paper proposes a novel framework for unsupervised discovery of operational modes in univariate time-series data using image-based convolutional clustering with composite internal evaluation. The proposed framework improves upon existing approaches in three ways: (1) raw time-series sequences are transformed into grayscale matrix representations via overlapping sliding windows, allowing effective feature extraction using a deep convolutional autoencoder; (2) the framework integrates both soft and hard clustering outputs and refines the selection through a two-stage strategy; and (3) clustering performance is objectively evaluated by a newly developed composite score, S_eva, which combines normalized Silhouette, Calinski-Harabasz, and Davies-Bouldin indices. Applied to over 3900 furnace melting operations from a Nordic foundry, the method identifies seven explainable operational patterns, revealing significant differences in energy consumption, thermal dynamics, and production duration. Compared to classical and deep clustering baselines, the proposed approach achieves superior overall performance, greater robustness, and domain-aligned explainability. The framework addresses key challenges in unsupervised time-series analysis, such as sequence irregularity, overlapping modes, and metric inconsistency, and provides a generalizable solution for data-driven diagnostics and energy optimization in industrial systems.

[1045] Hardware optimization on Android for inference of AI models

Iulius Gherasim, Carlos García Sánchez

Main category: cs.LG

TL;DR: This paper analyzes optimal execution configurations for AI models on Android systems, focusing on object detection (YOLO) and image classification (ResNet) tasks to find the best trade-off between accuracy and inference speed.

DetailsMotivation: The integration of AI models into mobile computing requires minimal latency and high responsiveness, but faces challenges in execution strategies that leverage real-time constraints and heterogeneous hardware architecture.

Method: The research evaluates various model quantization schemes and utilization of on-device accelerators (GPU and NPU) for YOLO object detection and ResNet image classification models on Android systems.

Result: The study empirically determines the optimal combination of configurations that achieves the best trade-off between minimal accuracy degradation and maximal inference speed-up.

Conclusion: The paper provides optimal execution configurations for AI models on Android systems that balance accuracy and performance through proper quantization and hardware acceleration strategies.

Abstract: The pervasive integration of Artificial Intelligence models into contemporary mobile computing is notable across numerous use cases, from virtual assistants to advanced image processing. Optimizing the mobile user experience involves minimal latency and high responsiveness from deployed AI models with challenges from execution strategies that fully leverage real time constraints to the exploitation of heterogeneous hardware architecture. In this paper, we research and propose the optimal execution configurations for AI models on an Android system, focusing on two critical tasks: object detection (YOLO family) and image classification (ResNet). These configurations evaluate various model quantization schemes and the utilization of on device accelerators, specifically the GPU and NPU. Our core objective is to empirically determine the combination that achieves the best trade-off between minimal accuracy degradation and maximal inference speed-up.

[1046] Artificial Intelligence-Enabled Spirometry for Early Detection of Right Heart Failure

Bin Liu, Qinghao Zhao, Yuxi Zhou, Zhejun Sun, Kaijie Lei, Deyun Zhang, Shijia Geng, Shenda Hong

Main category: cs.LG

TL;DR: Self-supervised representation learning method using spirogram time series and demographic data for early detection of right heart failure in patients with cor pulmonale, achieving AUROC of 0.7501-0.8413 across different patient populations.

DetailsMotivation: Right heart failure has high morbidity and mortality, and lung disease often causes increased right ventricular load leading to RHF. Early detection is crucial for patients with cor pulmonale who develop RHF from underlying lung diseases.

Method: Two-stage approach: 1) Self-supervised representation learning using VAE-encoder to learn low-dimensional representations from data-augmented spirogram time series, 2) Fusion of these representations with demographic information into CatBoost classifier for RHF prediction.

Result: Achieved AUROC of 0.7501 on UK Biobank subset (26,617 individuals), 0.8194 on CKD patients (n=74), and 0.8413 on VHD patients (n=64), demonstrating strong performance especially in high-risk clinical subgroups.

Conclusion: The self-supervised representation learning approach combining spirogram time series and demographic data shows promising potential for early RHF detection in clinical practice, particularly for high-risk populations.

Abstract: Right heart failure (RHF) is a disease characterized by abnormalities in the structure or function of the right ventricle (RV), which is associated with high morbidity and mortality. Lung disease often causes increased right ventricular load, leading to RHF. Therefore, it is very important to screen out patients with cor pulmonale who develop RHF from people with underlying lung diseases. In this work, we propose a self-supervised representation learning method to early detecting RHF from patients with cor pulmonale, which uses spirogram time series to predict patients with RHF at an early stage. The proposed model is divided into two stages. The first stage is the self-supervised representation learning-based spirogram embedding (SLSE) network training process, where the encoder of the Variational autoencoder (VAE-encoder) learns a robust low-dimensional representation of the spirogram time series from the data-augmented unlabeled data. Second, this low-dimensional representation is fused with demographic information and fed into a CatBoost classifier for the downstream RHF prediction task. Trained and tested on a carefully selected subset of 26,617 individuals from the UK Biobank, our model achieved an AUROC of 0.7501 in detecting RHF, demonstrating strong population-level distinction ability. We further evaluated the model on high-risk clinical subgroups, achieving AUROC values of 0.8194 on a test set of 74 patients with chronic kidney disease (CKD) and 0.8413 on a set of 64 patients with valvular heart disease (VHD). These results highlight the model’s potential utility in predicting RHF among clinically elevated-risk populations. In conclusion, this study presents a self-supervised representation learning approach combining spirogram time series and demographic data, demonstrating promising potential for early RHF detection in clinical practice.

[1047] Multi-task GINN-LP for Multi-target Symbolic Regression

Hussein Rajabu, Lijun Qian, Xishuang Dong

Main category: cs.LG

TL;DR: MTRGINN-LP is an interpretable neural network for multi-target symbolic regression that addresses limitations of traditional SR methods by handling interdependent multi-output problems while maintaining interpretability.

DetailsMotivation: Symbolic Regression faces two key challenges: limited generalization from scientific datasets with well-understood relationships, and inability to handle multi-target outputs with interdependent variables that are common in real-world problems.

Method: Proposes multi-task regression GINN-LP (MTRGINN-LP) that integrates GINN-LP with multi-task deep learning, using a shared backbone with multiple power-term approximator blocks and task-specific output layers to capture inter-target dependencies while preserving interpretability.

Result: Validated on practical multi-target applications including energy efficiency prediction and sustainable agriculture, demonstrating competitive predictive performance alongside high interpretability.

Conclusion: Effectively extends symbolic regression to broader real-world multi-output tasks by combining interpretable mathematical expressions with multi-task learning capabilities.

Abstract: In the area of explainable artificial intelligence, Symbolic Regression (SR) has emerged as a promising approach by discovering interpretable mathematical expressions that fit data. However, SR faces two main challenges: most methods are evaluated on scientific datasets with well-understood relationships, limiting generalization, and SR primarily targets single-output regression, whereas many real-world problems involve multi-target outputs with interdependent variables. To address these issues, we propose multi-task regression GINN-LP (MTRGINN-LP), an interpretable neural network for multi-target symbolic regression. By integrating GINN-LP with a multi-task deep learning, the model combines a shared backbone including multiple power-term approximator blocks with task-specific output layers, capturing inter-target dependencies while preserving interpretability. We validate multi-task GINN-LP on practical multi-target applications, including energy efficiency prediction and sustainable agriculture. Experimental results demonstrate competitive predictive performance alongside high interpretability, effectively extending symbolic regression to broader real-world multi-output tasks.

[1048] AdamX: An Adam improvement algorithm based on a novel exponential decay mechanism for the second-order moment estimate

Meng Zhu, Quan Xiao, Weidong Min

Main category: cs.LG

TL;DR: AdamX is a new optimization algorithm that improves upon Adam by introducing a novel second-order moment estimation exponential decay rate that gradually reduces learning step correction strength, eventually degrading to SGD for better training stability and generalization.

DetailsMotivation: Adam optimization tends to converge to non-flat minima compared to SGD-based algorithms, which can negatively impact generalization performance in large language models.

Method: Proposed AdamX with a novel second-order moment estimation exponential decay rate that weakens learning step correction as training progresses and degrades to SGD during stable training periods.

Result: Experimental results show AdamX’s second-order moment estimation exponential decay rate outperforms current methods, and AdamX consistently outperforms Adam and its variants in performance.

Conclusion: AdamX provides better training stability and potentially enhanced generalization by addressing Adam’s tendency to converge to non-flat minima through gradual degradation to SGD-like behavior.

Abstract: Since the 21st century, artificial intelligence has been leading a new round of industrial revolution. Under the training framework, the optimization algorithm aims to stably converge high-dimensional optimization to local and even global minima. Entering the era of large language models, although the scale of model parameters and data has increased, Adam remains the mainstream optimization algorithm. However, compared with stochastic gradient descent (SGD) based optimization algorithms, Adam is more likely to converge to non-flat minima. To address this issue, the AdamX algorithm is proposed. Its core innovation lies in the proposition of a novel type of second-order moment estimation exponential decay rate, which gradually weakens the learning step correction strength as training progresses, and degrades to SGD in the stable training period, thereby improving the stability of training in the stable period and possibly enhancing generalization ability. Experimental results show that our second-order moment estimation exponential decay rate is better than the current second-order moment estimation exponential decay rate, and AdamX can stably outperform Adam and its variants in terms of performance. Our code is open-sourced at https://github.com/mengzhu0308/AdamX.

[1049] GREAT: Generalizable Representation Enhancement via Auxiliary Transformations for Zero-Shot Environmental Prediction

Shiyuan Luo, Chonghao Qiu, Runlong Yu, Yiqun Xie, Xiaowei Jia

Main category: cs.LG

TL;DR: GREAT framework improves environmental model generalization to unseen regions through constrained data augmentation that preserves physical relationships and temporal coherence.

DetailsMotivation: Environmental modeling struggles with limited and imbalanced data across regions, leading to models that learn spurious local patterns rather than generalizable physical relationships.

Method: GREAT learns transformation functions at multiple neural network layers to augment environmental features and temporal influence, using a bi-level training process that constrains augmented data to preserve original source patterns.

Result: GREAT significantly outperforms existing methods in zero-shot stream temperature prediction across six ecologically diverse watersheds in the eastern U.S.

Conclusion: GREAT provides a practical solution for environmental applications where comprehensive monitoring is infeasible, enabling better predictions in completely unseen regions.

Abstract: Environmental modeling faces critical challenges in predicting ecosystem dynamics across unmonitored regions due to limited and geographically imbalanced observation data. This challenge is compounded by spatial heterogeneity, causing models to learn spurious patterns that fit only local data. Unlike conventional domain generalization, environmental modeling must preserve invariant physical relationships and temporal coherence during augmentation. In this paper, we introduce Generalizable Representation Enhancement via Auxiliary Transformations (GREAT), a framework that effectively augments available datasets to improve predictions in completely unseen regions. GREAT guides the augmentation process to ensure that the original governing processes can be recovered from the augmented data, and the inclusion of the augmented data leads to improved model generalization. Specifically, GREAT learns transformation functions at multiple layers of neural networks to augment both raw environmental features and temporal influence. They are refined through a novel bi-level training process that constrains augmented data to preserve key patterns of the original source data. We demonstrate GREAT’s effectiveness on stream temperature prediction across six ecologically diverse watersheds in the eastern U.S., each containing multiple stream segments. Experimental results show that GREAT significantly outperforms existing methods in zero-shot scenarios. This work provides a practical solution for environmental applications where comprehensive monitoring is infeasible.

[1050] Quantum Machine Learning via Contrastive Training

Liudmila A. Zhukas, Vivian Ni Zhang, Qiang Miao, Qingfeng Wang, Marko Cetina, Jungsang Kim, Lawrence Carin, Christopher Monroe

Main category: cs.LG

TL;DR: Self-supervised pretraining of quantum representations reduces labeled data dependency by learning invariances from unlabeled examples, implemented on a trapped-ion quantum computer with improved classification accuracy and stability.

DetailsMotivation: Address the challenge of labeled data scarcity in quantum machine learning models, similar to classical ML, especially as model scale and complexity increase.

Method: Implement self-supervised pretraining on a programmable trapped-ion quantum computer, encoding images as quantum states and using in situ contrastive pretraining on hardware with quantum overlaps for similarity measurement.

Result: Fine-tuned models show higher mean test accuracy and lower run-to-run variability in image classification, with especially significant improvements in limited labeled data regimes. Learned invariances generalize beyond pretraining samples.

Conclusion: Establishes a label-efficient route to quantum representation learning with direct relevance to quantum-native datasets and clear path to larger classical inputs, with all training and classification executed on hardware.

Abstract: Quantum machine learning (QML) has attracted growing interest with the rapid parallel advances in large-scale classical machine learning and quantum technologies. Similar to classical machine learning, QML models also face challenges arising from the scarcity of labeled data, particularly as their scale and complexity increase. Here, we introduce self-supervised pretraining of quantum representations that reduces reliance on labeled data by learning invariances from unlabeled examples. We implement this paradigm on a programmable trapped-ion quantum computer, encoding images as quantum states. In situ contrastive pretraining on hardware yields a representation that, when fine-tuned, classifies image families with higher mean test accuracy and lower run-to-run variability than models trained from random initialization. Performance improvement is especially significant in regimes with limited labeled training data. We show that the learned invariances generalize beyond the pretraining image samples. Unlike prior work, our pipeline derives similarity from measured quantum overlaps and executes all training and classification stages on hardware. These results establish a label-efficient route to quantum representation learning, with direct relevance to quantum-native datasets and a clear path to larger classical inputs.

[1051] Naga: Vedic Encoding for Deep State Space Models

Melanie Schaller, Nick Janssen, Bodo Rosenhahn

Main category: cs.LG

TL;DR: Naga is a deep State Space Model that uses bidirectional processing of forward and time-reversed sequences with Vedic-inspired element-wise interactions, achieving state-of-the-art performance on long-term time series forecasting benchmarks.

DetailsMotivation: To enhance temporal dependency capture in long-range sequence modeling by drawing inspiration from Vedic mathematics structural concepts, providing an interpretable and efficient alternative to existing approaches.

Method: Bidirectional representation processing with joint forward and time-reversed sequence analysis, combined through Hadamard (element-wise) interaction for Vedic-inspired encoding in deep State Space Models.

Result: Outperforms 28 state-of-the-art models on multiple LTSF benchmarks (ETTh1, ETTh2, ETTm1, ETTm2, Weather, Traffic, ILI) and shows improved efficiency compared to existing deep SSM approaches.

Conclusion: Vedic-inspired structured decomposition offers an effective, interpretable, and computationally efficient framework for long-range sequence modeling in time series forecasting.

Abstract: This paper presents Naga, a deep State Space Model (SSM) encoding approach inspired by structural concepts from Vedic mathematics. The proposed method introduces a bidirectional representation for time series by jointly processing forward and time-reversed input sequences. These representations are then combined through an element-wise (Hadamard) interaction, resulting in a Vedic-inspired encoding that enhances the model’s ability to capture temporal dependencies across distant time steps. We evaluate Naga on multiple long-term time series forecasting (LTSF) benchmarks, including ETTh1, ETTh2, ETTm1, ETTm2, Weather, Traffic, and ILI. The experimental results show that Naga outperforms 28 current state of the art models and demonstrates improved efficiency compared to existing deep SSM-based approaches. The findings suggest that incorporating structured, Vedic-inspired decomposition can provide an interpretable and computationally efficient alternative for long-range sequence modeling.

[1052] A Quantum Tensor Network-Based Viewpoint for Modeling and Analysis of Time Series Data

Pragatheeswaran Vipulananthan, Kamal Premaratne, Dilip Sarkar, Manohar N. Murthi

Main category: cs.LG

TL;DR: A quantum physics-based white box method for accurate uncertainty quantification and enhanced interpretability in machine learning, using kernel mean embedding mapped to a 1D spin chain Hamiltonian and perturbation theory.

DetailsMotivation: To address the trade-off between neural networks' performance and lack of interpretability versus probabilistic models' interpretability but performance gap, by developing a method that offers both accurate uncertainty quantification and enhanced interpretability.

Method: Map kernel mean embedding of time series data to reproducing kernel Hilbert space, construct tensor network-inspired 1D spin chain Hamiltonian with KME as eigen-function/mode, solve Schrödinger equation and apply perturbation theory for uncertainty quantification.

Result: Demonstrated effectiveness in change point detection and time series clustering compared to state-of-the-art white box models, providing insights into decision-making uncertainties.

Conclusion: The proposed quantum physics-based white box method successfully bridges the gap between performance and interpretability, offering accurate uncertainty quantification while maintaining model transparency.

Abstract: Accurate uncertainty quantification is a critical challenge in machine learning. While neural networks are highly versatile and capable of learning complex patterns, they often lack interpretability due to their black box'' nature. On the other hand, probabilistic white box’’ models, though interpretable, often suffer from a significant performance gap when compared to neural networks. To address this, we propose a novel quantum physics-based white box'' method that offers both accurate uncertainty quantification and enhanced interpretability. By mapping the kernel mean embedding (KME) of a time series data vector to a reproducing kernel Hilbert space (RKHS), we construct a tensor network-inspired 1D spin chain Hamiltonian, with the KME as one of its eigen-functions or eigen-modes. We then solve the associated Schr{ö}dinger equation and apply perturbation theory to quantify uncertainty, thereby improving the interpretability of tasks performed with the quantum tensor network-based model. We demonstrate the effectiveness of this methodology, compared to state-of-the-art white box" models, in change point detection and time series clustering, providing insights into the uncertainties associated with decision-making throughout the process.

[1053] Mitigating Spurious Correlations in Patch-wise Tumor Classification on High-Resolution Multimodal Images

Ihab Asaad, Maha Shadaydeh, Joachim Denzler

Main category: cs.LG

TL;DR: Patch-wise binary classification for tumor detection suffers from spurious correlations between patch composition and labels, but applying GERNE debiasing improves worst-group accuracy by ~7% compared to ERM.

DetailsMotivation: Patch-wise classification reduces annotation costs and simplifies training for high-resolution images, but introduces spurious correlations where tumor patches tend to have larger tissue areas while non-tumor patches have more background.

Method: Apply GERNE debiasing method to maximize worst-group accuracy (WGA) in patch-wise binary tumor classification, addressing spurious correlations between patch composition and labels.

Result: GERNE improves WGA by approximately 7% compared to ERM across different thresholds for binarizing the spurious feature, enhancing performance on critical minority cases.

Conclusion: Spurious correlation-aware learning is crucial for patch-wise classification problems, and debiasing strategies like GERNE effectively mitigate bias and improve model robustness on challenging cases.

Abstract: Patch-wise multi-label classification provides an efficient alternative to full pixel-wise segmentation on high-resolution images, particularly when the objective is to determine the presence or absence of target objects within a patch rather than their precise spatial extent. This formulation substantially reduces annotation cost, simplifies training, and allows flexible patch sizing aligned with the desired level of decision granularity. In this work, we focus on a special case, patch-wise binary classification, applied to the detection of a single class of interest (tumor) on high-resolution multimodal nonlinear microscopy images. We show that, although this simplified formulation enables efficient model development, it can introduce spurious correlations between patch composition and labels: tumor patches tend to contain larger tissue regions, whereas non-tumor patches often consist mostly of background with small tissue areas. We further quantify the bias in model predictions caused by this spurious correlation, and propose to use a debiasing strategy to mitigate its effect. Specifically, we apply GERNE, a debiasing method that can be adapted to maximize worst-group accuracy (WGA). Our results show an improvement in WGA by approximately 7% compared to ERM for two different thresholds used to binarize the spurious feature. This enhancement boosts model performance on critical minority cases, such as tumor patches with small tissues and non-tumor patches with large tissues, and underscores the importance of spurious correlation-aware learning in patch-wise classification problems.

[1054] Fairness-Aware Graph Representation Learning with Limited Demographic Information

Zichong Wang, Zhipeng Yin, Liping Yang, Jun Zhuang, Rui Yu, Qingzhao Kong, Wenbin Zhang

Main category: cs.LG

TL;DR: FairGLite is a novel fair graph learning framework that mitigates bias using partial demographic data through proxy generation, consistent embedding enforcement, and adaptive confidence strategies with theoretical guarantees.

DetailsMotivation: Most existing fair graph learning methods require full demographic information, which is rarely available in practice due to privacy, legal, or regulatory restrictions.

Method: Proposes proxy generation using partial demographic data, enforces consistent node embeddings across demographic groups, and implements adaptive confidence strategy that dynamically adjusts node contributions based on prediction confidence.

Result: Extensive experiments on multiple datasets show effectiveness in mitigating bias while maintaining model utility, with theoretical analysis proving upper bounds on group fairness metrics.

Conclusion: FairGLite provides a practical solution for fair graph learning with limited demographic information, offering formal guarantees for bias mitigation without compromising utility.

Abstract: Ensuring fairness in Graph Neural Networks is fundamental to promoting trustworthy and socially responsible machine learning systems. In response, numerous fair graph learning methods have been proposed in recent years. However, most of them assume full access to demographic information, a requirement rarely met in practice due to privacy, legal, or regulatory restrictions. To this end, this paper introduces a novel fair graph learning framework that mitigates bias in graph learning under limited demographic information. Specifically, we propose a mechanism guided by partial demographic data to generate proxies for demographic information and design a strategy that enforces consistent node embeddings across demographic groups. In addition, we develop an adaptive confidence strategy that dynamically adjusts each node’s contribution to fairness and utility based on prediction confidence. We further provide theoretical analysis demonstrating that our framework, FairGLite, achieves provable upper bounds on group fairness metrics, offering formal guarantees for bias mitigation. Through extensive experiments on multiple datasets and fair graph learning frameworks, we demonstrate the framework’s effectiveness in both mitigating bias and maintaining model utility.

[1055] Graph Out-of-Distribution Detection via Test-Time Calibration with Dual Dynamic Dictionaries

Yue Hou, Ruomei Liu, Yingke Su, Junran Wu, Ke Xu

Main category: cs.LG

TL;DR: BaCa is a test-time graph OOD detection method that uses dual dynamic dictionaries to calibrate OOD scores without fine-tuning pre-trained models, generating boundary-aware topologies through graphon estimation and mix-up strategies.

DetailsMotivation: Existing graph OOD detection methods are limited by the absence of ground-truth OOD samples during training and fail to capture distributional boundaries effectively. The latent structure of graph data governed by multiple factors remains underexplored.

Method: Proposes BaCa method that estimates graphons and applies mix-up strategy with test samples to generate boundary-aware topologies. Uses dual dynamic dictionaries with priority queues and attention mechanisms to capture latent ID and OOD representations for score calibration.

Result: Extensive experiments on real-world datasets show BaCa significantly outperforms existing state-of-the-art methods in OOD detection.

Conclusion: BaCa provides an effective test-time OOD detection approach that eliminates the need for auxiliary datasets and fine-tuning, achieving superior performance through boundary-aware score calibration.

Abstract: A key challenge in graph out-of-distribution (OOD) detection lies in the absence of ground-truth OOD samples during training. Existing methods are typically optimized to capture features within the in-distribution (ID) data and calculate OOD scores, which often limits pre-trained models from representing distributional boundaries, leading to unreliable OOD detection. Moreover, the latent structure of graph data is often governed by multiple underlying factors, which remains less explored. To address these challenges, we propose a novel test-time graph OOD detection method, termed BaCa, that calibrates OOD scores using dual dynamically updated dictionaries without requiring fine-tuning the pre-trained model. Specifically, BaCa estimates graphons and applies a mix-up strategy solely with test samples to generate diverse boundary-aware discriminative topologies, eliminating the need for exposing auxiliary datasets as outliers. We construct dual dynamic dictionaries via priority queues and attention mechanisms to adaptively capture latent ID and OOD representations, which are then utilized for boundary-aware OOD score calibration. To the best of our knowledge, extensive experiments on real-world datasets show that BaCa significantly outperforms existing state-of-the-art methods in OOD detection.

[1056] RAC-DMVC: Reliability-Aware Contrastive Deep Multi-View Clustering under Multi-Source Noise

Shihao Dong, Yue Liu, Xiaotong Zhou, Yuhui Zheng, Huiying Xu, Xinzhong Zhu

Main category: cs.LG

TL;DR: RAC-DMVC is a novel framework for multi-view clustering that handles both missing and observation noise through reliability-aware contrastive learning, cross-view reconstruction, dual-attention imputation, and self-supervised cluster distillation.

DetailsMotivation: To enhance multi-view clustering's applicability in real-world scenarios by addressing the challenging problem of multi-source noises (missing noise and observation noise) that commonly exist in practical applications.

Method: Proposes RAC-DMVC framework with: 1) reliability graph construction for robust representation learning, 2) cross-view reconstruction for observation noise handling, 3) reliability-aware noise contrastive learning to mitigate bias in positive/negative pairs selection, 4) dual-attention imputation for missing noise handling, and 5) self-supervised cluster distillation for representation refinement.

Result: Extensive experiments on five benchmark datasets show RAC-DMVC outperforms state-of-the-art methods on multiple evaluation metrics and maintains excellent performance under varying ratios of noise.

Conclusion: RAC-DMVC effectively addresses multi-source noise challenges in multi-view clustering through its reliability-aware framework and demonstrates superior performance compared to existing methods in noisy environments.

Abstract: Multi-view clustering (MVC), which aims to separate the multi-view data into distinct clusters in an unsupervised manner, is a fundamental yet challenging task. To enhance its applicability in real-world scenarios, this paper addresses a more challenging task: MVC under multi-source noises, including missing noise and observation noise. To this end, we propose a novel framework, Reliability-Aware Contrastive Deep Multi-View Clustering (RAC-DMVC), which constructs a reliability graph to guide robust representation learning under noisy environments. Specifically, to address observation noise, we introduce a cross-view reconstruction to enhances robustness at the data level, and a reliability-aware noise contrastive learning to mitigates bias in positive and negative pairs selection caused by noisy representations. To handle missing noise, we design a dual-attention imputation to capture shared information across views while preserving view-specific features. In addition, a self-supervised cluster distillation module further refines the learned representations and improves the clustering performance. Extensive experiments on five benchmark datasets demonstrate that RAC-DMVC outperforms SOTA methods on multiple evaluation metrics and maintains excellent performance under varying ratios of noise.

[1057] Batch Acquisition Function Evaluations and Decouple Optimizer Updates for Faster Bayesian Optimization

Kaichi Irie, Shuhei Watanabe, Masaki Onishi

Main category: cs.LG

TL;DR: The paper identifies suboptimal performance in Bayesian optimization due to off-diagonal approximation errors in quasi-Newton methods when optimizing batched acquisition functions, and proposes a coroutine-based approach that maintains theoretical convergence while reducing wall-clock time.

DetailsMotivation: Current Bayesian optimization libraries like BoTorch use batched acquisition function optimization for speed, but this leads to suboptimal performance due to off-diagonal approximation errors in quasi-Newton methods, slowing convergence.

Method: Proposed decoupling quasi-Newton updates using coroutines while maintaining batched acquisition function calls, achieving theoretically identical convergence to sequential multi-start optimization.

Result: The approach drastically reduces wall-clock time compared to previous methods while maintaining the same theoretical convergence properties as sequential optimization.

Conclusion: The coroutine-based method successfully addresses the computational bottleneck in Bayesian optimization by optimizing quasi-Newton updates without sacrificing the benefits of batched function evaluations.

Abstract: Bayesian optimization (BO) efficiently finds high-performing parameters by maximizing an acquisition function, which models the promise of parameters. A major computational bottleneck arises in acquisition function optimization, where multi-start optimization (MSO) with quasi-Newton (QN) methods is required due to the non-convexity of the acquisition function. BoTorch, a widely used BO library, currently optimizes the summed acquisition function over multiple points, leading to the speedup of MSO owing to PyTorch batching. Nevertheless, this paper empirically demonstrates the suboptimality of this approach in terms of off-diagonal approximation errors in the inverse Hessian of a QN method, slowing down its convergence. To address this problem, we propose to decouple QN updates using a coroutine while batching the acquisition function calls. Our approach not only yields the theoretically identical convergence to the sequential MSO but also drastically reduces the wall-clock time compared to the previous approaches.

[1058] Towards Multimodal Representation Learning in Paediatric Kidney Disease

Ana Durica, John Booth, Ivana Drobnjak

Main category: cs.LG

TL;DR: A recurrent neural network model using longitudinal lab data and demographics can predict abnormal creatinine levels in children within 30 days.

DetailsMotivation: Pediatric kidney disease requires continuous monitoring, and there's a need to leverage electronic health records for early detection of renal function deterioration.

Method: Used recurrent neural networks trained on longitudinal laboratory sequences and demographic data from electronic health records (2019-2025) to predict abnormal serum creatinine values.

Result: The model successfully demonstrated that simple temporal representations can capture useful patterns in routine pediatric data for predicting renal function changes.

Conclusion: This pilot study provides initial validation for temporal modeling approaches in pediatric nephrology and establishes groundwork for future multimodal extensions with additional clinical signals.

Abstract: Paediatric kidney disease varies widely in its presentation and progression, which calls for continuous monitoring of renal function. Using electronic health records collected between 2019 and 2025 at Great Ormond Street Hospital, a leading UK paediatric hospital, we explored a temporal modelling approach that integrates longitudinal laboratory sequences with demographic information. A recurrent neural model trained on these data was used to predict whether a child would record an abnormal serum creatinine value within the following thirty days. Framed as a pilot study, this work provides an initial demonstration that simple temporal representations can capture useful patterns in routine paediatric data and lays the groundwork for future multimodal extensions using additional clinical signals and more detailed renal outcomes.

[1059] Data Value in the Age of Scaling: Understanding LLM Scaling Dynamics Under Real-Synthetic Data Mixtures

Haohui Wang, Jingyuan Qi, Jianpeng Chen, Jun Wu, Lifu Huang, Lecheng Zheng, Kevin Choi, Balaji Veeramani, Edward Bowen, Alison Hu, Tyler Cody, Dawei Zhou

Main category: cs.LG

TL;DR: The paper analyzes systematic distributional discrepancies in mixed real-synthetic datasets for LLMs, identifies three-phase scaling behavior, derives a generalization bound, and proposes an efficient data valuation method that outperforms state-of-the-art approaches.

DetailsMotivation: Synthetic data in LLM training introduces systematic distributional discrepancies, particularly underrepresenting long-tail knowledge due to truncation effects from data generation mechanisms, posing challenges in evaluating mixed real-synthetic datasets.

Method: Identified three-phase scaling behavior with two breakpoints reflecting transitions in model behavior, derived an LLM generalization bound for real-synthetic mixtures, and proposed an efficient data valuation method that scales to large datasets.

Result: Comprehensive experiments across four tasks (image classification, sentiment classification, instruction following, complex reasoning) show the proposed method surpasses state-of-the-art baselines in data valuation with significantly lower computational cost.

Conclusion: The study provides theoretical insights into mixed real-synthetic dataset behavior and offers an effective, efficient data valuation method that addresses distributional discrepancies in LLM training.

Abstract: The rapid progress of large language models (LLMs) is fueled by the growing reliance on datasets that blend real and synthetic data. While synthetic data offers scalability and cost-efficiency, it often introduces systematic distributional discrepancies, particularly underrepresenting long-tail knowledge due to truncation effects from data generation mechanisms like top-p sampling, temperature scaling, and finite sampling. These discrepancies pose fundamental challenges in characterizing and evaluating the utility of mixed real-synthetic datasets. In this paper, we identify a three-phase scaling behavior characterized by two breakpoints that reflect transitions in model behavior across learning head and tail knowledge. We further derive an LLM generalization bound designed for real and synthetic mixtures, revealing several key factors that govern their generalization performance. Building on our theoretical findings, we propose an effective yet efficient data valuation method that scales to large-scale datasets. Comprehensive experiments across four tasks, including image classification, sentiment classification, instruction following, and complex reasoning, demonstrate that our method surpasses state-of-the-art baselines in data valuation with significantly low computational cost.

[1060] FuseSampleAgg: Fused Neighbor Sampling and Aggregation for Mini-batch GNNs

Aleksandar Stanković

Main category: cs.LG

TL;DR: FuseSampleAgg is a CUDA operator that fuses neighbor sampling and mean aggregation into a single pass for GraphSAGE, achieving up to 51x speedup and 100x memory reduction.

DetailsMotivation: To eliminate the overhead of block materialization and extra kernel launches in GraphSAGE training by fusing neighbor sampling and mean aggregation operations.

Method: Developed a CUDA operator that performs neighbor sampling and mean aggregation in a single pass, preserving GraphSAGE mean semantics through saved index replay.

Result: Achieved step time speedups up to 51x on ogbn-products, 4x on Reddit, and 3.3x on ogbn-arxiv, with peak GPU memory reductions up to 100x, 36x, and 3.5x respectively.

Conclusion: FuseSampleAgg significantly improves GraphSAGE training efficiency by reducing memory traffic and kernel launch overhead while maintaining deterministic behavior and compatibility with standard PyTorch optimizers.

Abstract: We present FuseSampleAgg, a CUDA operator that fuses neighbor sampling and mean aggregation into a single pass for one and two hop GraphSAGE. By eliminating block materialization and extra kernel launches, FuseSampleAgg reduces memory traffic and overhead while preserving GraphSAGE mean semantics via saved index replay. Across the Reddit, ogbn-arxiv, and ogbn-products benchmarks (batch size 1024, automatic mixed precision enabled), we observe step time speedups up to 51x on ogbn-products, about 4x on Reddit with fanouts 10-10 and 15-10, and about 3.3x on ogbn-arxiv at larger fanouts, with peak GPU memory reductions up to 100x, 36x, and about 3.5x, respectively. The operator is deterministic, integrates with standard PyTorch optimizers, and ships with scripts that reproduce all tables and figures from CSV logs. Code and scripts are available at https://github.com/SV25-22/FuseSampleAgg.

[1061] Weight-sparse transformers have interpretable circuits

Leo Gao, Achyuta Rajaram, Jacob Coxon, Soham V. Govande, Bowen Baker, Dan Mossing

Main category: cs.LG

TL;DR: Training sparse neural networks with mostly zero weights to create interpretable circuits that correspond to natural concepts, enabling human-understandable mechanistic analysis of language models.

DetailsMotivation: To find human-understandable circuits in language models and advance mechanistic interpretability by making model components more transparent and interpretable.

Method: Train models with constrained sparse weights (mostly zeros) so each neuron has few connections, then prune models to isolate task-specific circuits. Study scaling effects and adapt the approach to explain existing dense models.

Result: Produces circuits containing neurons and residual channels that correspond to natural concepts with straightforward interpretable connections. Sparse models trade capability for interpretability, and scaling improves the capability-interpretability frontier.

Conclusion: The method achieves unprecedented human understandability of circuits and validates them rigorously, though scaling sparse models beyond tens of millions of parameters while preserving interpretability remains challenging.

Abstract: Finding human-understandable circuits in language models is a central goal of the field of mechanistic interpretability. We train models to have more understandable circuits by constraining most of their weights to be zeros, so that each neuron only has a few connections. To recover fine-grained circuits underlying each of several hand-crafted tasks, we prune the models to isolate the part responsible for the task. These circuits often contain neurons and residual channels that correspond to natural concepts, with a small number of straightforwardly interpretable connections between them. We study how these models scale and find that making weights sparser trades off capability for interpretability, and scaling model size improves the capability-interpretability frontier. However, scaling sparse models beyond tens of millions of nonzero parameters while preserving interpretability remains a challenge. In addition to training weight-sparse models de novo, we show preliminary results suggesting our method can also be adapted to explain existing dense models. Our work produces circuits that achieve an unprecedented level of human understandability and validates them with considerable rigor.

[1062] Tuning for Two Adversaries: Enhancing the Robustness Against Transfer and Query-Based Attacks using Hyperparameter Tuning

Pascal Zimmer, Ghassan Karame

Main category: cs.LG

TL;DR: Analysis shows learning rate has opposite effects on robustness: lower rates help against transfer attacks, higher rates help against query attacks. Distributed training benefits most from hyperparameter tuning.

DetailsMotivation: To understand how optimization hyperparameters influence robustness against different types of adversarial attacks (transfer-based and query-based) in various practical deployment settings.

Method: Theoretical analysis and experiments across centralized training, ensemble learning, and distributed training, examining effects of learning rate, weight decay, momentum, and batch size on robustness.

Result: Learning rate shows opposite effects: decreasing it enhances robustness against transfer-based attacks by up to 64%, while increasing it improves robustness against query-based attacks by up to 28%. Distributed models benefit most from hyperparameter tuning.

Conclusion: Optimization hyperparameters significantly impact adversarial robustness in different ways for different attack types. Distributed training setups achieve the best tradeoff for simultaneously mitigating both transfer-based and query-based attacks through hyperparameter tuning.

Abstract: In this paper, we present the first detailed analysis of how optimization hyperparameters – such as learning rate, weight decay, momentum, and batch size – influence robustness against both transfer-based and query-based attacks. Supported by theory and experiments, our study spans a variety of practical deployment settings, including centralized training, ensemble learning, and distributed training. We uncover a striking dichotomy: for transfer-based attacks, decreasing the learning rate significantly enhances robustness by up to $64%$. In contrast, for query-based attacks, increasing the learning rate consistently leads to improved robustness by up to $28%$ across various settings and data distributions. Leveraging these findings, we explore – for the first time – the optimization hyperparameter design space to jointly enhance robustness against both transfer-based and query-based attacks. Our results reveal that distributed models benefit the most from hyperparameter tuning, achieving a remarkable tradeoff by simultaneously mitigating both attack types more effectively than other training setups.

[1063] Protein Secondary Structure Prediction Using 3D Graphs and Relation-Aware Message Passing Transformers

Disha Varshney, Samarth Garg, Sarthak Tyagi, Deeksha Varshney, Nayan Deep, Asif Ekbal

Main category: cs.LG

TL;DR: SSRGNet combines Graph Neural Networks and protein Language Models to predict protein secondary structures from primary sequences, leveraging 3D structural data through protein residue graphs and outperforming baselines on f1-scores.

DetailsMotivation: Existing methods for predicting secondary structures from protein sequences don't effectively utilize available 3D structural data, which is crucial for understanding protein functions. The goal is to bridge this gap by incorporating spatial information.

Method: Uses protein residue graphs with sequential/structural connections, combines pre-trained transformer-based protein language models for sequence encoding with GNNs (GCN and R-GCN) for geometric structure capture. Employs convolutional layers on nearby regions with relations to learn spatial insights.

Result: Extensive experiments show SSRGNet surpasses baseline methods on f1-scores using the NetSurfP-2.0 dataset for 3-state and 8-state secondary structure prediction.

Conclusion: The proposed approach successfully integrates 3D structural information with sequence data through graph neural networks and language models, demonstrating improved performance in protein secondary structure prediction.

Abstract: In this study, we tackle the challenging task of predicting secondary structures from protein primary sequences, a pivotal initial stride towards predicting tertiary structures, while yielding crucial insights into protein activity, relationships, and functions. Existing methods often utilize extensive sets of unlabeled amino acid sequences. However, these approaches neither explicitly capture nor harness the accessible protein 3D structural data, which is recognized as a decisive factor in dictating protein functions. To address this, we utilize protein residue graphs and introduce various forms of sequential or structural connections to capture enhanced spatial information. We adeptly combine Graph Neural Networks (GNNs) and Language Models (LMs), specifically utilizing a pre-trained transformer-based protein language model to encode amino acid sequences and employing message-passing mechanisms like GCN and R-GCN to capture geometric characteristics of protein structures. Employing convolution within a specific node’s nearby region, including relations, we stack multiple convolutional layers to efficiently learn combined insights from the protein’s spatial graph, revealing intricate interconnections and dependencies in its structural arrangement. To assess our model’s performance, we employed the training dataset provided by NetSurfP-2.0, which outlines secondary structure in 3-and 8-states. Extensive experiments show that our proposed model, SSRGNet surpasses the baseline on f1-scores.

[1064] Scientific Data Compression and Super-Resolution Sampling

Minh Vu, Andrey Lokhov

Main category: cs.LG

TL;DR: A novel framework for scientific data compression and super-resolution using learning exponential families, preserving physical quantities with uncertainty quantification and flexible compression-reconstruction trade-offs.

DetailsMotivation: Address the challenge of massive scientific datasets exceeding storage and processing limits, while enabling data recovery with physical feature preservation for applications like checkpointing and restarting simulations.

Method: Grounded in recent advances in learning exponential families, the framework supports compression and super-resolution with uncertainty quantification in physical quantities of interest.

Result: The method enables efficient data reduction while preserving essential physical features and supporting recovery from compressed representations with guarantees on physical characteristics.

Conclusion: The introduced framework provides a solution for managing massive scientific datasets through compression and super-resolution, with uncertainty preservation and flexible trade-offs between compression and reconstruction quality.

Abstract: Modern scientific simulations, observations, and large-scale experiments generate data at volumes that often exceed the limits of storage, processing, and analysis. This challenge drives the development of data reduction methods that efficiently manage massive datasets while preserving essential physical features and quantities of interest. In many scientific workflows, it is also crucial to enable data recovery from compressed representations - a task known as super-resolution - with guarantees on the preservation of key physical characteristics. A notable example is checkpointing and restarting, which is essential for long-running simulations to recover from failures, resume after interruptions, or examine intermediate results. In this work, we introduce a novel framework for scientific data compression and super-resolution, grounded in recent advances in learning exponential families. Our method preserves and quantifies uncertainty in physical quantities of interest and supports flexible trade-offs between compression ratio and reconstruction fidelity.

[1065] ST-ProC: A Graph-Prototypical Framework for Robust Semi-Supervised Travel Mode Identification

Luyao Niu, Nuoxian Huang

Main category: cs.LG

TL;DR: ST-ProC is a graph-prototypical multi-objective SSL framework for travel mode identification that addresses label scarcity by combining graph regularization, prototypical anchoring, and margin-aware pseudo-labeling with contrastive and consistency losses.

DetailsMotivation: Travel mode identification from GPS trajectories suffers from high annotation costs and label scarcity. Existing semi-supervised learning methods are unsuitable due to catastrophic confirmation bias and ignoring intrinsic data manifold structure.

Method: Proposes ST-ProC framework with graph-prototypical core using graph regularization, prototypical anchoring, and margin-aware pseudo-labeling to reject noise, supported by contrastive and teacher-student consistency losses for robust optimization.

Result: ST-ProC significantly outperforms all baselines, achieving 21.5% performance improvement over state-of-the-art methods like FixMatch in real-world sparse-label settings.

Conclusion: The proposed framework effectively addresses label scarcity in travel mode identification through synergistic combination of manifold exploitation and foundational SSL support, demonstrating superior performance in practical applications.

Abstract: Travel mode identification (TMI) from GPS trajectories is critical for urban intelligence, but is hampered by the high cost of annotation, leading to severe label scarcity. Prevailing semi-supervised learning (SSL) methods are ill-suited for this task, as they suffer from catastrophic confirmation bias and ignore the intrinsic data manifold. We propose ST-ProC, a novel graph-prototypical multi-objective SSL framework to address these limitations. Our framework synergizes a graph-prototypical core with foundational SSL Support. The core exploits the data manifold via graph regularization, prototypical anchoring, and a novel, margin-aware pseudo-labeling strategy to actively reject noise. This core is supported and stabilized by foundational contrastive and teacher-student consistency losses, ensuring high-quality representations and robust optimization. ST-ProC outperforms all baselines by a significant margin, demonstrating its efficacy in real-world sparse-label settings, with a performance boost of 21.5% over state-of-the-art methods like FixMatch.

[1066] Cross-Learning from Scarce Data via Multi-Task Constrained Optimization

Leopoldo Agorio, Juan Cerviño, Miguel Calvo-Fullana, Alejandro Ribeiro, Juan Andrés Bazerque

Main category: cs.LG

TL;DR: Cross-learning framework enables joint parameter estimation across related tasks to overcome data scarcity by transferring knowledge from data-rich to data-poor tasks.

DetailsMotivation: Traditional learning methods fail when data is limited, leading to poor generalization. This paper addresses the fundamental problem of parameter inference from scarce data by leveraging related tasks.

Method: Formulated as constrained optimization where parameters across tasks are jointly estimated with constraints controlling similarity between models, allowing parameter differences while combining information from multiple sources.

Result: Theoretical guarantees provided for Gaussian data, with empirical validation showing improved accuracy in image classification and infectious disease propagation applications.

Conclusion: Cross-learning framework provides effective solution for parameter inference in data-scarce scenarios by enabling knowledge transfer across related tasks through constrained joint estimation.

Abstract: A learning task, understood as the problem of fitting a parametric model from supervised data, fundamentally requires the dataset to be large enough to be representative of the underlying distribution of the source. When data is limited, the learned models fail generalize to cases not seen during training. This paper introduces a multi-task \emph{cross-learning} framework to overcome data scarcity by jointly estimating \emph{deterministic} parameters across multiple, related tasks. We formulate this joint estimation as a constrained optimization problem, where the constraints dictate the resulting similarity between the parameters of the different models, allowing the estimated parameters to differ across tasks while still combining information from multiple data sources. This framework enables knowledge transfer from tasks with abundant data to those with scarce data, leading to more accurate and reliable parameter estimates, providing a solution for scenarios where parameter inference from limited data is critical. We provide theoretical guarantees in a controlled framework with Gaussian data, and show the efficiency of our cross-learning method in applications with real data including image classification and propagation of infectious diseases.

[1067] Efficient Calibration for Decision Making

Parikshit Gopalan, Konstantinos Stavropoulos, Kunal Talwar, Pranay Tankala

Main category: cs.LG

TL;DR: The paper introduces CDL_K, a tractable variant of the calibration decision loss measure that restricts post-processing to structured function families K, and provides theoretical guarantees for its computational tractability.

DetailsMotivation: The original calibration decision loss (CDL) is intractable to approximate, motivating the need for a more practical measure that maintains theoretical rigor while being computationally feasible.

Method: Define CDL_K by restricting post-processing to structured families K, develop theory for when CDL_K is tractable, and prove bounds for natural classes of post-processing functions.

Result: Established conditions under which CDL_K is information-theoretically and computationally tractable, with both upper and lower bounds for natural function classes.

Conclusion: The approach provides rigorous guarantees for widely used recalibration procedures and introduces new algorithmic techniques to calibration theory for decision making.

Abstract: A decision-theoretic characterization of perfect calibration is that an agent seeking to minimize a proper loss in expectation cannot improve their outcome by post-processing a perfectly calibrated predictor. Hu and Wu (FOCS'24) use this to define an approximate calibration measure called calibration decision loss ($\mathsf{CDL}$), which measures the maximal improvement achievable by any post-processing over any proper loss. Unfortunately, $\mathsf{CDL}$ turns out to be intractable to even weakly approximate in the offline setting, given black-box access to the predictions and labels. We suggest circumventing this by restricting attention to structured families of post-processing functions $K$. We define the calibration decision loss relative to $K$, denoted $\mathsf{CDL}_K$ where we consider all proper losses but restrict post-processings to a structured family $K$. We develop a comprehensive theory of when $\mathsf{CDL}_K$ is information-theoretically and computationally tractable, and use it to prove both upper and lower bounds for natural classes $K$. In addition to introducing new definitions and algorithmic techniques to the theory of calibration for decision making, our results give rigorous guarantees for some widely used recalibration procedures in machine learning.

[1068] From Black Box to Insight: Explainable AI for Extreme Event Preparedness

Kiana Vu, İsmet Selçuk Özer, Phung Lai, Zheng Wu, Thilanka Munasinghe, Jennifer Wei

Main category: cs.LG

TL;DR: This paper explores how explainable AI (XAI) can bridge the gap between predictive accuracy and actionable insights for extreme event forecasting, using wildfire prediction as a case study.

DetailsMotivation: The need for accurate, explainable, and actionable forecasting of extreme events like wildfires is urgent due to climate change, but current AI models' black-box nature limits trust and operational adoption.

Method: The study evaluates various AI models and employs SHapley Additive exPlanations (SHAP) to uncover key features, decision pathways, and potential biases, with supporting visualizations for enhanced interpretability.

Result: XAI clarifies model reasoning and supports critical decision-making by domain experts and response teams, enhancing usability for practitioners and policymakers.

Conclusion: AI systems for extreme event forecasting must be not only accurate but also interpretable, accessible, and trustworthy to be effective in disaster preparedness, risk mitigation, and climate resilience planning.

Abstract: As climate change accelerates the frequency and severity of extreme events such as wildfires, the need for accurate, explainable, and actionable forecasting becomes increasingly urgent. While artificial intelligence (AI) models have shown promise in predicting such events, their adoption in real-world decision-making remains limited due to their black-box nature, which limits trust, explainability, and operational readiness. This paper investigates the role of explainable AI (XAI) in bridging the gap between predictive accuracy and actionable insight for extreme event forecasting. Using wildfire prediction as a case study, we evaluate various AI models and employ SHapley Additive exPlanations (SHAP) to uncover key features, decision pathways, and potential biases in model behavior. Our analysis demonstrates how XAI not only clarifies model reasoning but also supports critical decision-making by domain experts and response teams. In addition, we provide supporting visualizations that enhance the interpretability of XAI outputs by contextualizing feature importance and temporal patterns in seasonality and geospatial characteristics. This approach enhances the usability of AI explanations for practitioners and policymakers. Our findings highlight the need for AI systems that are not only accurate but also interpretable, accessible, and trustworthy, essential for effective use in disaster preparedness, risk mitigation, and climate resilience planning.

[1069] Learning stochasticity: a nonparametric framework for intrinsic noise estimation

Gianluigi Pillonetto, Alberto Giaretta, Mauro Bisiacco

Main category: cs.LG

TL;DR: Trine is a nonparametric, kernel-based framework that infers state-dependent intrinsic noise from time-series data using a three-stage algorithm, achieving oracle-level performance in uncovering hidden dynamics without parametric assumptions.

DetailsMotivation: Bottom-up modeling approaches often fail due to incomplete knowledge of nonlinear interactions and stochastic effects, especially in biological systems where intrinsic noise is essential for understanding dynamics but parametric models struggle without strong priors.

Method: Three-phase regression algorithm combining analytically solvable subproblems with structured kernel architecture that captures both abrupt noise-driven fluctuations and smooth state-dependent variance changes.

Result: Validated on biological and ecological systems, Trine achieves performance comparable to an oracle observer and successfully uncovers hidden dynamics without predefined parametric assumptions.

Conclusion: Trine opens new avenues for understanding how intrinsic noise affects complex system behavior, providing a powerful framework for discovering governing equations from data in systems like gene regulatory networks.

Abstract: Understanding the principles that govern dynamical systems is a central challenge across many scientific domains, including biology and ecology. Incomplete knowledge of nonlinear interactions and stochastic effects often renders bottom-up modeling approaches ineffective, motivating the development of methods that can discover governing equations directly from data. In such contexts, parametric models often struggle without strong prior knowledge, especially when estimating intrinsic noise. Nonetheless, incorporating stochastic effects is often essential for understanding the dynamic behavior of complex systems such as gene regulatory networks and signaling pathways. To address these challenges, we introduce Trine (Three-phase Regression for INtrinsic noisE), a nonparametric, kernel-based framework that infers state-dependent intrinsic noise from time-series data. Trine features a three-stage algorithm that com- bines analytically solvable subproblems with a structured kernel architecture that captures both abrupt noise-driven fluctuations and smooth, state-dependent changes in variance. We validate Trine on biological and ecological systems, demonstrating its ability to uncover hidden dynamics without relying on predefined parametric assumptions. Across several benchmark problems, Trine achieves performance comparable to that of an oracle. Biologically, this oracle can be viewed as an idealized observer capable of directly tracking the random fluctuations in molecular concentrations or reaction events within a cell. The Trine framework thus opens new avenues for understanding how intrinsic noise affects the behavior of complex systems.

[1070] Rare Genomic Subtype Discovery from RNA-seq via Autoencoder Embeddings and Stability-Aware Clustering

Alaa Mezghiche

Main category: cs.LG

TL;DR: Unsupervised learning on RNA-seq data identifies rare genomic subtypes. Pan-cancer analysis clusters by tissue origin, while within-cancer approach reveals a stable rare KIRC subtype.

DetailsMotivation: To discover rare but reproducible molecular subtypes beyond standard labels using unsupervised learning on high-dimensional RNA-seq data.

Method: Combined autoencoder-based representation with clustering and stability analysis. Used feed-forward autoencoder (128D latent space) on top 2000 highly variable genes, k-means clustering (k=2-10), and stability validation with Jaccard index across 20 seeds.

Result: Pan-cancer analysis showed perfect tissue-of-origin clustering (Cramer’s V=0.887). Within KIRC, identified a rare cluster C0 (6.85% of patients) at k=5 that was highly stable (Jaccard=0.787) with coherent differential expression markers.

Conclusion: Pan-cancer clustering is dominated by tissue of origin, while stability-aware within-cancer analysis can reveal rare, reproducible subtypes like the identified KIRC subtype.

Abstract: Unsupervised learning on high-dimensional RNA-seq data can reveal molecular subtypes beyond standard labels. We combine an autoencoder-based representation with clustering and stability analysis to search for rare but reproducible genomic subtypes. On the UCI “Gene Expression Cancer RNA-Seq” dataset (801 samples, 20,531 genes; BRCA, COAD, KIRC, LUAD, PRAD), a pan-cancer analysis shows clusters aligning almost perfectly with tissue of origin (Cramer’s V = 0.887), serving as a negative control. We therefore reframe the problem within KIRC (n = 146): we select the top 2,000 highly variable genes, standardize them, train a feed-forward autoencoder (128-dimensional latent space), and run k-means for k = 2-10. While global indices favor small k, scanning k with a pre-specified discovery rule (rare < 10 percent and stable with Jaccard >= 0.60 across 20 seeds after Hungarian alignment) yields a simple solution at k = 5 (silhouette = 0.129, DBI = 2.045) with a rare cluster C0 (6.85 percent of patients) that is highly stable (Jaccard = 0.787). Cluster-vs-rest differential expression (Welch’s t-test, Benjamini-Hochberg FDR) identifies coherent markers. Overall, pan-cancer clustering is dominated by tissue of origin, whereas a stability-aware within-cancer approach reveals a rare, reproducible KIRC subtype.

[1071] CG-FedLLM: How to Compress Gradients in Federated Fune-tuning for Large Language Models

Huiwen Wu, Xiaogang Xu, Deyi Zhang, Xiaohan Li, Jiafei Wu, Zhe Liu

Main category: cs.LG

TL;DR: CG-FedLLM introduces gradient compression for federated learning of LLMs to reduce communication costs while maintaining privacy, using encoder-decoder architecture with novel training strategies TGAP and FAF.

DetailsMotivation: Traditional centralized learning for LLMs poses privacy risks, while federated learning incurs high communication costs due to large model parameters. Need for efficient gradient compression in FL for LLMs.

Method: Encoder-decoder architecture for gradient compression, with Temporal-ensemble Gradient-Aware Pre-training (TGAP) to identify characteristic gradients and Federated AutoEncoder-Involved Fine-tuning (FAF) for adaptive compression.

Result: Reduces communication costs and improves performance (average 3 points increment on C-Eval benchmark with LlaMA), filters gradients while preserving critical features, maintains privacy.

Conclusion: CG-FedLLM provides efficient and secure FL for LLMs through gradient compression, enabling privacy-preserving training with improved communication efficiency and performance.

Abstract: The success of current Large-Language Models (LLMs) hinges on extensive training data that is collected and stored centrally, called Centralized Learning (CL). However, such a collection manner poses a privacy threat, and one potential solution is Federated Learning (FL), which transfers gradients, not raw data, among clients. Unlike traditional networks, FL for LLMs incurs significant communication costs due to their tremendous parameters. This study introduces an innovative approach to compress gradients to improve communication efficiency during LLM FL, formulating the new FL pipeline named CG-FedLLM. This approach integrates an encoder on the client side to acquire the compressed gradient features and a decoder on the server side to reconstruct the gradients. We also developed a novel training strategy that comprises Temporal-ensemble Gradient-Aware Pre-training (TGAP) to identify characteristic gradients of the target model and Federated AutoEncoder-Involved Fine-tuning (FAF) to compress gradients adaptively. Extensive experiments confirm that our approach reduces communication costs and improves performance (e.g., average 3 points increment compared with traditional CL- and FL-based fine-tuning with LlaMA on a well-recognized benchmark, C-Eval). This improvement is because our encoder-decoder, trained via TGAP and FAF, can filter gradients while selectively preserving critical features. Furthermore, we present a series of experimental analyses focusing on the signal-to-noise ratio, compression rate, and robustness within this privacy-centric framework, providing insight into developing more efficient and secure LLMs.

[1072] Addressing Polarization and Unfairness in Performative Prediction

Kun Jin, Tian Xie, Yang Liu, Xueru Zhang

Main category: cs.LG

TL;DR: Performative prediction framework models data distribution shifts caused by deployed models, showing that performative stable solutions can cause polarization and fairness issues, with conventional fairness interventions failing under these conditions.

DetailsMotivation: To address the societal impacts of performative prediction, particularly fairness concerns, as prior work focused mainly on robustness while overlooking how performative stable solutions can lead to severe polarization and performance disparities.

Method: Introduce novel fairness mechanisms that provably ensure both stability and fairness in performative prediction settings, validated through theoretical analysis and empirical results.

Result: The proposed fairness mechanisms successfully address the limitations of conventional fairness interventions by ensuring both stability and fairness under model-dependent distribution shifts.

Conclusion: Novel fairness mechanisms are necessary and effective for achieving both stability and fairness in performative prediction settings, overcoming the limitations of traditional approaches that fail under model-induced distribution shifts.

Abstract: In many real-world applications of machine learning such as recommendations, hiring, and lending, deployed models influence the data they are trained on, leading to feedback loops between predictions and data distribution. The performative prediction (PP) framework captures this phenomenon by modeling the data distribution as a function of the deployed model. While prior work has focused on finding performative stable (PS) solutions for robustness, their societal impacts, particularly regarding fairness, remain underexplored. We show that PS solutions can lead to severe polarization and prediction performance disparities, and that conventional fairness interventions in previous works often fail under model-dependent distribution shifts due to failing the PS criteria. To address these challenges in PP, we introduce novel fairness mechanisms that provably ensure both stability and fairness, validated by theoretical analysis and empirical results.

[1073] Deep deterministic policy gradient with symmetric data augmentation for lateral attitude tracking control of a fixed-wing aircraft

Yifei Li, Erik-Jan van Kampen

Main category: cs.LG

TL;DR: This paper develops sample-efficient offline RL by exploiting system symmetry through data augmentation and dual-critic architecture, demonstrating improved policy convergence in flight control.

DetailsMotivation: To leverage system symmetry for more efficient offline reinforcement learning, addressing the challenge of limited sample coverage in state-action space.

Method: Proposes symmetric data augmentation for MDPs under symmetry assumption, integrates augmented samples into DDPG dataset, and introduces dual-critic structure with second critic trained on augmented samples.

Result: Aircraft model verified to be symmetric, flight control simulations show accelerated policy convergence when using augmented samples.

Conclusion: Exploiting system symmetry through data augmentation and dual-critic architecture enables more sample-efficient offline RL with faster convergence in control applications.

Abstract: The symmetry of dynamical systems can be exploited for state-transition prediction and to facilitate control policy optimization. This paper leverages system symmetry to develop sample-efficient offline reinforcement learning (RL) approaches. Under the symmetry assumption for a Markov Decision Process (MDP), a symmetric data augmentation method is proposed. The augmented samples are integrated into the dataset of Deep Deterministic Policy Gradient (DDPG) to enhance its coverage rate of the state-action space. Furthermore, sample utilization efficiency is improved by introducing a second critic trained on the augmented samples, resulting in a dual-critic structure. The aircraft’s model is verified to be symmetric, and flight control simulations demonstrate accelerated policy convergence when augmented samples are employed.

[1074] Temporal Test-Time Adaptation with State-Space Models

Mona Schirmer, Dan Zhang, Eric Nalisnick

Main category: cs.LG

TL;DR: STAD is a Bayesian filtering method for test-time adaptation to temporal distribution shifts by learning time-varying dynamics in hidden features and inferring evolving class prototypes without labels.

DetailsMotivation: Distribution shifts between training and test data cause performance decay in deployed models. Most existing methods focus on synthetic corruption shifts, leaving gradual temporal shifts underexplored despite being common in real-world scenarios.

Method: Proposes STAD, a Bayesian filtering approach that adapts models to temporal distribution shifts by learning time-varying dynamics in the last hidden features. It infers time-evolving class prototypes that serve as a dynamic classification head, all without requiring labels.

Result: Experiments on real-world temporal distribution shifts demonstrate that STAD excels in handling small batch sizes and label shift scenarios.

Conclusion: STAD effectively addresses the challenge of gradual temporal distribution shifts in deployed models through Bayesian filtering and dynamic class prototype inference, showing strong performance in real-world conditions.

Abstract: Distribution shifts between training and test data are inevitable over the lifecycle of a deployed model, leading to performance decay. Adapting a model on test samples can help mitigate this drop in performance. However, most test-time adaptation methods have focused on synthetic corruption shifts, leaving a variety of distribution shifts underexplored. In this paper, we focus on distribution shifts that evolve gradually over time, which are common in the wild but challenging for existing methods, as we show. To address this, we propose STAD, a Bayesian filtering method that adapts a deployed model to temporal distribution shifts by learning the time-varying dynamics in the last set of hidden features. Without requiring labels, our model infers time-evolving class prototypes that act as a dynamic classification head. Through experiments on real-world temporal distribution shifts, we show that our method excels in handling small batch sizes and label shift.

[1075] Efficiently Computing Compact Formal Explanations

Min Wu, Xiaofu Li, Haoze Wu, Clark Barrett

Main category: cs.LG

TL;DR: VeriX+ improves upon VeriX by reducing explanation size and generation time through bound propagation-based sensitivity and binary search traversal with confidence ranking, achieving significant performance gains on benchmarks.

DetailsMotivation: To enhance the efficiency and scalability of formal explanations for machine learning models, addressing limitations in explanation size and generation time from previous work.

Method: Combines bound propagation-based sensitivity analysis for size reduction and binary search-based traversal with confidence ranking for time improvement, with optional QuickXplain adaptation for size-time trade-offs.

Result: Achieved 38% size reduction on GTSRB dataset and 90% time reduction on MNIST, demonstrating scalability to transformers and real-world applications like autonomous aircraft taxiing and sentiment analysis.

Conclusion: VeriX+ significantly advances formal explanation capabilities with improved efficiency and scalability, enabling novel applications in various domains.

Abstract: Building on VeriX (Verified eXplainability, arXiv:2212.01051), a system for producing optimal verified explanations for machine learning models, we present VeriX+, which significantly improves both the size and the generation time of formal explanations. We introduce a bound propagation-based sensitivity technique to improve the size, and a binary search-based traversal with confidence ranking for improving time – the two techniques are orthogonal and can be used independently or together. We also show how to adapt the QuickXplain algorithm to our setting to provide a trade-off between size and time. Experimental evaluations on standard benchmarks demonstrate significant improvements on both metrics, e.g., a size reduction of $38%$ on the GTSRB dataset and a time reduction of $90%$ on MNIST. We demonstrate that our approach is scalable to transformers and real-world scenarios such as autonomous aircraft taxiing and sentiment analysis. We conclude by showcasing several novel applications of formal explanations.

[1076] Communication-Efficient Federated Low-Rank Update Algorithm and its Connection to Implicit Regularization

Haemin Park, Diego Klabjan

Main category: cs.LG

TL;DR: FedLoRU is a federated learning framework that uses low-rank client updates to reduce communication costs while maintaining performance comparable to full-rank methods.

DetailsMotivation: Address communication efficiency challenges and performance reduction in federated learning when scaling to many clients by leveraging theoretical insights about rank properties in FL.

Method: Propose FedLoRU framework that constrains client-side optimization to low-rank subspaces, accumulates low-rank updates to form higher-rank models, and includes variants for heterogeneous environments with multiple or hierarchical low-rank updates.

Result: FedLoRU achieves comparable performance to full-rank algorithms, exhibits robustness to heterogeneous and large numbers of clients, and maintains convergence rate matching FedAvg.

Conclusion: Low-rank updates provide an effective approach for communication-efficient federated learning while maintaining performance, with theoretical guarantees and practical benefits in heterogeneous settings.

Abstract: Federated Learning (FL) faces significant challenges related to communication efficiency and performance reduction when scaling to many clients. To address these issues, we explore the potential of using low-rank updates and provide the first theoretical study of rank properties in FL. Our theoretical analysis shows that a client’s loss exhibits a higher-rank structure (i.e., gradients span higher-rank subspaces of the Hessian) compared to the server’s loss, and that low-rank approximations of the clients’ gradients have greater similarity. Based on this insight, we hypothesize that constraining client-side optimization to a low-rank subspace could provide an implicit regularization effect while reducing communication costs. Consequently, we propose FedLoRU, a general low-rank update framework for FL. Our framework enforces low-rank client-side updates and accumulates these updates to form a higher-rank model. We are able to establish convergence of the algorithm; the convergence rate matches FedAvg. Additionally, variants of FedLoRU can adapt to environments with statistical and model heterogeneity by employing multiple or hierarchical low-rank updates. Experimental results demonstrate that FedLoRU performs comparably to full-rank algorithms and exhibits robustness to heterogeneous and large numbers of clients.

[1077] Fira: Can We Achieve Full-rank Training of LLMs Under Low-rank Constraint?

Xi Chen, Kaituo Feng, Changsheng Li, Xunhao Lai, Xiangyu Yue, Ye Yuan, Guoren Wang

Main category: cs.LG

TL;DR: Fira is a new plug-and-play training framework for LLMs that achieves full-rank training performance while maintaining low-rank memory efficiency, outperforming existing methods like LoRA and GaLore.

DetailsMotivation: Existing low-rank training methods constrain training to low-rank subspaces, leading to sub-optimal performance. The goal is to achieve full-rank training benefits while preserving low-rank memory efficiency.

Method: 1) Uses norm-based scaling that leverages similar scaling impact of adaptive optimizers from low-rank to full-rank training. 2) Implements norm-growth limiter to smooth gradients and prevent loss spikes by regulating gradient norm increases.

Result: Extensive experiments show Fira outperforms both LoRA and GaLore, achieving performance comparable to or better than full-rank training while maintaining memory efficiency.

Conclusion: Fira successfully bridges the gap between low-rank memory efficiency and full-rank training performance, providing a practical solution for efficient LLM training without sacrificing model quality.

Abstract: Low-rank training has emerged as a promising approach for reducing memory usage in training Large Language Models (LLMs). Previous methods either rely on decomposing weight matrices (e.g., LoRA), or seek to decompose gradient matrices (e.g., GaLore) to ensure reduced memory consumption. However, both of them constrain the training in a low-rank subspace, thus inevitably leading to sub-optimal performance. This raises a question: whether it is possible to consistently preserve the low-rank constraint for memory efficiency, while achieving full-rank training (i.e., training with full-rank gradients of full-rank weights) to avoid inferior outcomes? In this paper, we propose a new plug-and-play training framework for LLMs called Fira, as the first attempt to achieve this goal. First, we observe an interesting phenomenon during LLM training: the scaling impact of adaptive optimizers (e.g., Adam) on the gradient norm remains similar from low-rank to full-rank training. Based on this observation, we propose a norm-based scaling method, which utilizes the scaling impact of low-rank optimizers as substitutes for that of original full-rank optimizers to enable full-rank training. In this way, we can preserve the low-rank constraint in the optimizer while achieving full-rank training for better performance. Moreover, we find that there are sudden gradient rises during the optimization process, potentially causing loss spikes. To address this, we further put forward a norm-growth limiter to smooth the gradient via regulating the relative increase of gradient norms. Extensive experiments on the pre-training and fine-tuning of LLMs show that Fira outperforms both LoRA and GaLore, achieving performance that is comparable to or even better than full-rank training.

[1078] On the Learn-to-Optimize Capabilities of Transformers in In-Context Sparse Recovery

Renpu Liu, Ruida Zhou, Cong Shen, Jing Yang

Main category: cs.LG

TL;DR: Transformers can perform learning-to-optimize (L2O) algorithms for in-context learning tasks, achieving linear convergence rates and outperforming standard gradient descent.

DetailsMotivation: To understand how Transformers achieve superior in-context learning capabilities, particularly their ability to perform optimization algorithms beyond standard gradient descent.

Method: Theoretical analysis showing that K-layer Transformers can implement L2O algorithms for sparse recovery (LASSO) tasks with provable linear convergence rates in K.

Result: Transformers can solve sparse recovery problems with different measurement matrices, leverage structural information for faster convergence, and generalize across varying demonstration lengths - capabilities where conventional L2O algorithms fail.

Conclusion: Transformers’ ability to perform L2O algorithms provides a new explanation for their superior in-context learning performance, even with few layers, and offers advantages over traditional optimization methods.

Abstract: An intriguing property of the Transformer is its ability to perform in-context learning (ICL), where the Transformer can solve different inference tasks without parameter updating based on the contextual information provided by the corresponding input-output demonstration pairs. It has been theoretically proved that ICL is enabled by the capability of Transformers to perform gradient-descent algorithms (Von Oswald et al., 2023a; Bai et al., 2024). This work takes a step further and shows that Transformers can perform learning-to-optimize (L2O) algorithms. Specifically, for the ICL sparse recovery (formulated as LASSO) tasks, we show that a K-layer Transformer can perform an L2O algorithm with a provable convergence rate linear in K. This provides a new perspective explaining the superior ICL capability of Transformers, even with only a few layers, which cannot be achieved by the standard gradient-descent algorithms. Moreover, unlike the conventional L2O algorithms that require the measurement matrix involved in training to match that in testing, the trained Transformer is able to solve sparse recovery problems generated with different measurement matrices. Besides, Transformers as an L2O algorithm can leverage structural information embedded in the training tasks to accelerate its convergence during ICL, and generalize across different lengths of demonstration pairs, where conventional L2O algorithms typically struggle or fail. Such theoretical findings are supported by our experimental results.

[1079] Near-Optimal Reinforcement Learning with Shuffle Differential Privacy

Shaojie Bai, Mohammad Sadegh Talebi, Chengcheng Zhao, Peng Cheng, Jiming Chen

Main category: cs.LG

TL;DR: This paper introduces SDP-PE, the first shuffle model-based algorithm for reinforcement learning that achieves strong privacy guarantees without centralized trust, offering superior privacy-regret trade-off compared to existing approaches.

DetailsMotivation: Address privacy concerns in RL applications for networked systems, where existing DP models are inadequate - centralized model creates single point of failure risk, while local model causes significant performance degradation.

Method: Developed Shuffle Differentially Private Policy Elimination (SDP-PE) algorithm with novel exponential batching schedule and “forgetting” mechanism to balance privacy and learning performance under the shuffle model.

Result: SDP-PE achieves near-optimal regret bound with utility comparable to centralized model while significantly outperforming local model, demonstrating superior privacy-regret trade-off.

Conclusion: Establishes viability of shuffle model for secure data-driven decision-making in networked systems, providing strong privacy guarantees without centralized trust assumption.

Abstract: Reinforcement learning (RL) is a powerful tool for sequential decision-making, but its application is often hindered by privacy concerns arising from its interaction data. This challenge is particularly acute in advanced networked systems, where learning from operational and user data can expose systems to privacy inference attacks. Existing differential privacy (DP) models for RL are often inadequate: the centralized model requires a fully trusted server, creating a single point of failure risk, while the local model incurs significant performance degradation that is unsuitable for many networked applications. This paper addresses this gap by leveraging the emerging shuffle model of privacy, an intermediate trust model that provides strong privacy guarantees without a centralized trust assumption. We present Shuffle Differentially Private Policy Elimination (SDP-PE), the first generic policy elimination-based algorithm for episodic RL under the shuffle model. Our method introduces a novel exponential batching schedule and a ``forgetting’’ mechanism to balance the competing demands of privacy and learning performance. Our analysis shows that SDP-PE achieves a near-optimal regret bound, demonstrating a superior privacy-regret trade-off with utility comparable to the centralized model while significantly outperforming the local model. The numerical experiments also corroborate our theoretical results and demonstrate the effectiveness of SDP-PE. This work establishes the viability of the shuffle model for secure data-driven decision-making in networked systems.

[1080] Competence-Aware AI Agents with Metacognition for Unknown Situations and Environments (MUSE)

Rodolfo Valiente, Praveen K. Pilly

Main category: cs.LG

TL;DR: The paper proposes MUSE framework that integrates metacognitive processes into autonomous agents to improve adaptation in novel environments through competence awareness and strategy selection.

DetailsMotivation: Current autonomous agents struggle in novel environments due to limited adaptation capacity, while metacognition enables human adaptability in unknown situations.

Method: Proposed MUSE framework with two implementations: world modeling-based and LLM-based approaches for continual competence assessment and iterative strategy selection.

Result: MUSE agents demonstrate high competence awareness and significant improvements in self-regulation for solving novel, out-of-distribution tasks compared to model-based RL and prompt-based LLM approaches.

Conclusion: Metacognition-inspired approaches show promise in enabling autonomous agents to adapt to new environments while reducing reliance on extensive training data and large models.

Abstract: Metacognition, defined as the awareness and regulation of one’s cognitive processes, is central to human adaptability in unknown situations. In contrast, current autonomous agents often struggle in novel environments due to their limited capacity for adaptation. We hypothesize that metacognition is a critical missing ingredient in autonomous agents for the cognitive flexibility needed to tackle unfamiliar challenges. Given the broad scope of metacognitive abilities, we focus on competence awareness and strategy selection. To this end, we propose the Metacognition for Unknown Situations and Environments (MUSE) framework to integrate metacognitive processes of self-assessment and self-regulation into autonomous agents. We present two implementations of MUSE: one based on world modeling and another leveraging large language models (LLMs). Our system continually learns to assess its competence on a given task and uses this self-assessment to guide iterative cycles of strategy selection. MUSE agents demonstrate high competence awareness and significant improvements in self-regulation for solving novel, out-of-distribution tasks more effectively compared to model-based reinforcement learning and purely prompt-based LLM agent approaches. This work highlights the promise of approaches inspired by cognitive and neural systems in enabling autonomous agents to adapt to new environments while mitigating the heavy reliance on extensive training data and large models for the current models.

[1081] Loss Patterns of Neural Networks

Ivan Skorokhodov, Mikhail Burtsev

Main category: cs.LG

TL;DR: Multi-point optimization enables training multiple models simultaneously without storing individual parameters, used to analyze neural network loss landscapes. Experiments show diverse loss patterns and that batch normalization smooths the landscape.

DetailsMotivation: To conduct a thorough empirical analysis of neural network loss landscapes and understand their complexity and the effects of techniques like batch normalization.

Method: Multi-point optimization technique that trains several models simultaneously without storing individual parameters, tested on FashionMNIST and CIFAR10 datasets.

Result: 1) Loss surface contains surprisingly diverse and intricate landscape patterns, 2) Adding batch normalization makes the loss landscape more smooth.

Conclusion: The proposed multi-point optimization method effectively reveals complex loss landscape patterns and demonstrates that batch normalization contributes to landscape smoothing.

Abstract: We present multi-point optimization: an optimization technique that allows to train several models simultaneously without the need to keep the parameters of each one individually. The proposed method is used for a thorough empirical analysis of the loss landscape of neural networks. By extensive experiments on FashionMNIST and CIFAR10 datasets we demonstrate two things: 1) loss surface is surprisingly diverse and intricate in terms of landscape patterns it contains, and 2) adding batch normalization makes it more smooth. Source code to reproduce all the reported results is available on GitHub: https://github.com/universome/loss-patterns.

[1082] Achieving Fairness with a Simple Ridge Penalty

Marco Scutari, Francesca Panero, Manuel Proissl

Main category: cs.LG

TL;DR: A framework for fair regression using ridge penalty to control fairness, with better performance than existing methods.

DetailsMotivation: To develop a mathematically simple and interpretable framework for enforcing fairness in regression models while maintaining good predictive performance.

Method: Use ridge penalty as a model selection step to control fairness, then estimate parameters conditional on chosen penalty. Extensible to GLMs, kernel regression, and multiple fairness definitions.

Result: Outperforms Komiyama et al. (2018) and Zafar et al. (2019) methods on six datasets with better goodness of fit and predictive accuracy at same fairness levels. Identified bias in Komiyama et al.’s evaluation.

Conclusion: The proposed framework provides an effective, interpretable, and extensible approach to fair regression that achieves superior performance compared to existing methods.

Abstract: In this paper we present a general framework for estimating regression models subject to a user-defined level of fairness. We enforce fairness as a model selection step in which we choose the value of a ridge penalty to control the effect of sensitive attributes. We then estimate the parameters of the model conditional on the chosen penalty value. Our proposal is mathematically simple, with a solution that is partly in closed form, and produces estimates of the regression coefficients that are intuitive to interpret as a function of the level of fairness. Furthermore, it is easily extended to generalised linear models, kernelised regression models and other penalties; and it can accommodate multiple definitions of fairness. We compare our approach with the regression model from Komiyama et al. (2018), which implements a provably-optimal linear regression model; and with the fair models from Zafar et al. (2019). We evaluate these approaches empirically on six different data sets, and we find that our proposal provides better goodness of fit and better predictive accuracy for the same level of fairness. In addition, we highlight a source of bias in the original experimental evaluation in Komiyama et al. (2018).

[1083] State-Space Constraints Can Improve the Generalisation of the Differentiable Neural Computer to Input Sequences With Unseen Length

Patrick Ofner, Roman Kern

Main category: cs.LG

TL;DR: State compression and regularization improve DNC’s generalization to longer sequences without retraining, enabling processing of sequences up to 10.4x longer than baseline.

DetailsMotivation: Memory-augmented neural networks often fail to generalize to input sequence lengths not seen during training, limiting their practical applicability.

Method: Two approaches: state compression and state regularization to constrain the controller network’s state space in differentiable neural computers (DNCs).

Result: Constrained DNCs processed sequences 2.3x longer than baseline and could extend memory matrix without retraining to handle sequences 10.4x longer, though improvements varied across tasks.

Conclusion: State-space constraints enable training DNCs with shorter sequences, saving computational resources and facilitating training when long sequences are costly.

Abstract: Memory-augmented neural networks (MANNs) can perform algorithmic tasks such as sorting. However, they often fail to generalise to input sequence lengths not encountered during training. We introduce two approaches that constrain the state space of the MANN’s controller network: state compression and state regularisation. We empirically demonstrated that both approaches can improve generalisation to input sequences of out-of-distribution lengths for a specific type of MANN: the differentiable neural computer (DNC). The constrained DNC could process input sequences that were up to 2.3 times longer than those processed by an unconstrained baseline controller network. Notably, the applied constraints enabled the extension of the DNC’s memory matrix without the need for retraining and thus allowed the processing of input sequences that were 10.4 times longer. However, the improvements were not consistent across all tested algorithmic tasks. Interestingly, solutions that performed better often had a highly structured state space, characterised by state trajectories exhibiting increased curvature and loop-like patterns. Our experimental work demonstrates that state-space constraints can enable the training of a DNC using shorter input sequences, thereby saving computational resources and facilitating training when acquiring long sequences is costly.

[1084] Deep Clustering via Gradual Community Detection

Tianyu Cheng, Qun Chen

Main category: cs.LG

TL;DR: A novel deep clustering strategy using gradual community detection that initializes with many pseudo-communities and merges them, leveraging network analysis to improve cluster purity and self-supervision performance.

DetailsMotivation: Deep clustering faces challenges due to inadequate supervision signals, and existing methods need better ways to leverage structural characteristics for improved performance.

Method: Proposes gradual community detection: initializes clustering by partitioning samples into many pseudo-communities, then gradually expands clusters through community merging, incorporating cluster network analysis perspective.

Result: Extensive experiments on benchmark image datasets show the proposed strategy effectively improves state-of-the-art performance, with ablation studies confirming improved community pseudo-label purity.

Conclusion: The community detection approach with network analysis perspective effectively enhances cluster pseudo-label purity, leading to improved self-supervision and overall clustering performance.

Abstract: Deep clustering is an essential task in modern artificial intelligence, aiming to partition a set of data samples into a given number of homogeneous groups (i.e., clusters). Recent studies have proposed increasingly advanced deep neural networks and training strategies for deep clustering, effectively improving performance. However, deep clustering generally remains challenging due to the inadequacy of supervision signals. Building upon the existing representation learning backbones, this paper proposes a novel clustering strategy of gradual community detection. It initializes clustering by partitioning samples into many pseudo-communities and then gradually expands clusters by community merging. Compared with the existing clustering strategies, community detection factors in the new perspective of cluster network analysis in the clustering process. The new perspective can effectively leverage global structural characteristics to enhance cluster pseudo-label purity, which is critical to the performance of self-supervision. We have implemented the proposed approach based on the popular backbones and evaluated its efficacy on benchmark image datasets. Our extensive experiments have shown that the proposed clustering strategy can effectively improve the SOTA performance. Our ablation study also demonstrates that the new network perspective can effectively improve community pseudo-label purity, resulting in improved self-supervision.

[1085] Beyond Statistical Similarity: Rethinking Metrics for Deep Generative Models in Engineering Design

Lyle Regenwetter, Akash Srivastava, Dan Gutfreund, Faez Ahmed

Main category: cs.LG

TL;DR: This paper provides a comprehensive review and practical guide for evaluating deep generative models in engineering design, highlighting limitations of traditional ML metrics and proposing design-specific evaluation criteria.

DetailsMotivation: Traditional statistical metrics for deep generative models don't adequately capture engineering design requirements like constraint satisfaction, functional performance, and novelty.

Method: The authors curate design-specific metrics from various research communities, apply them to 2D visualization examples, and evaluate four deep generative models on bicycle frame design and structural topology generation problems.

Result: The paper demonstrates how design-specific metrics can quantify performance target achievement, design novelty, and geometric constraints in engineering applications.

Conclusion: Establishes a framework for evaluating deep generative models in engineering design using domain-specific metrics that better align with design requirements than traditional ML metrics.

Abstract: Deep generative models such as Variational Autoencoders (VAEs), Generative Adversarial Networks (GANs), Diffusion Models, and Transformers, have shown great promise in a variety of applications, including image and speech synthesis, natural language processing, and drug discovery. However, when applied to engineering design problems, evaluating the performance of these models can be challenging, as traditional statistical metrics based on likelihood may not fully capture the requirements of engineering applications. This paper doubles as a review and practical guide to evaluation metrics for deep generative models (DGMs) in engineering design. We first summarize the well-accepted `classic’ evaluation metrics for deep generative models grounded in machine learning theory. Using case studies, we then highlight why these metrics seldom translate well to design problems but see frequent use due to the lack of established alternatives. Next, we curate a set of design-specific metrics which have been proposed across different research communities and can be used for evaluating deep generative models. These metrics focus on unique requirements in design and engineering, such as constraint satisfaction, functional performance, novelty, and conditioning. Throughout our discussion, we apply the metrics to models trained on simple-to-visualize 2-dimensional example problems. Finally, we evaluate four deep generative models on a bicycle frame design problem and structural topology generation problem. In particular, we showcase the use of proposed metrics to quantify performance target achievement, design novelty, and geometric constraints. We publicly release the code for the datasets, models, and metrics used throughout the paper at https://decode.mit.edu/projects/metrics/.

[1086] MOS-Attack: A Scalable Multi-objective Adversarial Attack Framework

Ping Guo, Cheng Gong, Xi Lin, Fei Liu, Zhichao Lu, Qingfu Zhang, Zhenkun Wang

Main category: cs.LG

TL;DR: MOS Attack is a multi-objective adversarial attack framework that automatically discovers synergistic patterns among multiple loss functions to generate more potent attacks with fewer objectives.

DetailsMotivation: Existing single-objective adversarial attacks don't fully utilize multiple loss functions due to insufficient understanding of their synergistic and conflicting relationships, limiting attack effectiveness.

Method: Proposes Multi-Objective Set-based Attack (MOS Attack) using set-based multi-objective optimization to incorporate multiple loss functions without extra parameters and automatically mine synergistic patterns among them.

Result: Extensive experiments show MOS Attack outperforms single-objective attacks and maintains superior performance even with reduced number of loss functions by leveraging discovered synergistic patterns.

Conclusion: MOS Attack effectively addresses limitations of single-objective methods by automatically discovering and leveraging synergistic relationships among multiple loss functions for more efficient and powerful adversarial attacks.

Abstract: Crafting adversarial examples is crucial for evaluating and enhancing the robustness of Deep Neural Networks (DNNs), presenting a challenge equivalent to maximizing a non-differentiable 0-1 loss function. However, existing single objective methods, namely adversarial attacks focus on a surrogate loss function, do not fully harness the benefits of engaging multiple loss functions, as a result of insufficient understanding of their synergistic and conflicting nature. To overcome these limitations, we propose the Multi-Objective Set-based Attack (MOS Attack), a novel adversarial attack framework leveraging multiple loss functions and automatically uncovering their interrelations. The MOS Attack adopts a set-based multi-objective optimization strategy, enabling the incorporation of numerous loss functions without additional parameters. It also automatically mines synergistic patterns among various losses, facilitating the generation of potent adversarial attacks with fewer objectives. Extensive experiments have shown that our MOS Attack outperforms single-objective attacks. Furthermore, by harnessing the identified synergistic patterns, MOS Attack continues to show superior results with a reduced number of loss functions. Our code is available at https://github.com/pgg3/MOS-Attack.

[1087] Policy Zooming: Adaptive Discretization-based Infinite-Horizon Average-Reward Reinforcement Learning

Avik Kar, Rahul Singh

Main category: cs.LG

TL;DR: The paper proposes adaptive reinforcement learning algorithms for continuous Lipschitz MDPs that achieve sublinear regret by zooming into promising policy regions, with performance depending on problem complexity measures.

DetailsMotivation: To develop efficient infinite-horizon average-reward RL algorithms that adapt to problem complexity and achieve better performance on benign instances or when competing against low-complexity policy sets.

Method: Proposes two algorithms: PZRL-MF (model-free) and PZRL-MB (model-based) that explore policy space by ‘zooming’ into promising regions using a zooming dimension d^Φ_z to measure complexity.

Result: Achieves regret bounds of Õ(T^{1 - d_eff^{-1}}) where d_eff = d^Φ_z+2 for model-free and d_eff = 2d_S + d^Φ_z+3 for model-based algorithms, with better performance on low-complexity problems.

Conclusion: The algorithms provide adaptive performance with regret scaling favorably with problem complexity, achieving Õ(√T) regret under curvature conditions similar to multi-armed bandits.

Abstract: We study the infinite-horizon average-reward reinforcement learning (RL) for continuous space Lipschitz MDPs in which an agent can play policies from a given set $Φ$. The proposed algorithms efficiently explore the policy space by ‘‘zooming’’ into the ‘‘promising regions’’ of $Φ$, thereby achieving adaptivity gains in the performance. We upper bound their regret as $\tilde{\mathcal{O}}\big(T^{1 - d_{\text{eff.}}^{-1}}\big)$, where $d_{\text{eff.}} = d^Φ_z+2$ for model-free algoritahm $\textit{PZRL-MF}$ and $d_{\text{eff.}} = 2d_\mathcal{S} + d^Φ_z + 3$ for model-based algorithm $\textit{PZRL-MB}$. Here, $d_\mathcal{S}$ is the dimension of the state space, and $d^Φ_z$ is the zooming dimension given a set of policies $Φ$. $d^Φ_z$ is an alternative measure of the complexity of the problem, and it depends on the underlying MDP as well as on $Φ$. Hence, the proposed algorithms exhibit low regret in case the problem instance is benign and/or the agent competes against a low-complexity $Φ$ (that has a small $d^Φ_z$). When specialized to the case of finite-dimensional policy space, we obtain that $d_{\text{eff.}}$ scales as the dimension of this space under mild technical conditions; and also obtain $d_{\text{eff.}} = 2$, or equivalently $\tilde{\mathcal{O}}(\sqrt{T})$ regret for $\textit{PZRL-MF}$, under a curvature condition on the average reward function that is commonly used in the multi-armed bandit (MAB) literature.

[1088] GLANCE: Global Actions in a Nutshell for Counterfactual Explainability

Loukas Kavouras, Eleni Psaroudaki, Konstantinos Tsopelas, Dimitrios Rontogiannis, Nikolaos Theologitis, Dimitris Sacharidis, Giorgos Giannopoulos, Dimitrios Tomaras, Kleopatra Markou, Dimitrios Gunopulos, Dimitris Fotakis, Ioannis Emiris

Main category: cs.LG

TL;DR: GLANCE is a novel algorithm for generating global counterfactual explanations that balances effectiveness, cost, and interpretability through an agglomerative approach.

DetailsMotivation: Need for counterfactual explainability methods that provide actionable recourse to large population subgroups while maintaining practical costs and interpretability.

Method: Uses an agglomerative approach that jointly considers feature space and counterfactual actions, accounting for data distribution and model structure with tunable size parameter.

Result: GLANCE demonstrates greater robustness and performance compared to existing methods across various datasets and models.

Conclusion: GLANCE effectively balances the trade-offs between effectiveness, cost, and interpretability in global counterfactual explanations.

Abstract: The widespread deployment of machine learning systems in critical real-world decision-making applications has highlighted the urgent need for counterfactual explainability methods that operate effectively. Global counterfactual explanations, expressed as actions to offer recourse, aim to provide succinct explanations and insights applicable to large population subgroups. High effectiveness, measured by the fraction of the population that is provided recourse, ensures that the actions benefit as many individuals as possible. Keeping the cost of actions low ensures the proposed recourse actions remain practical and actionable. Limiting the number of actions that provide global counterfactuals is essential to maximizing interpretability. The primary challenge, therefore, is to balance these trade-offs–maximizing effectiveness, minimizing cost, while maintaining a small number of actions. We introduce $\texttt{GLANCE}$, a versatile and adaptive algorithm that employs a novel agglomerative approach, jointly considering both the feature space and the space of counterfactual actions, thereby accounting for the distribution of points in a way that aligns with the model’s structure. This design enables the careful balancing of the trade-offs among the three key objectives, with the size objective functioning as a tunable parameter to keep the actions few and easy to interpret. Our extensive experimental evaluation demonstrates that $\texttt{GLANCE}$ consistently shows greater robustness and performance compared to existing methods across various datasets and models.

[1089] Optimizing Urban Service Allocation with Time-Constrained Restless Bandits

Yi Mao, Andrew Perrault

Main category: cs.LG

TL;DR: This paper develops a novel restless multi-armed bandit approach with window constraints for optimizing municipal inspection scheduling, using Chicago food establishment inspections as a case study.

DetailsMotivation: To balance multiple objectives in municipal inspections: ensuring guideline adherence, minimizing disruption to establishments, and reducing inspection costs, while handling surprise inspections and annual window constraints.

Method: Extended Whittle index-based systems for RMABs with guaranteed action window constraints, combining MDP reformulation and integer programming-based lookahead. Used neural network-based supervised learning to model state transitions from real inspection data.

Result: Achieved 10% AUC improvement in state transition modeling compared to direct failure prediction. Showed 24% improvement in simulation and 33% improvement on real data for inspection impact, with robustness to surprise inspections.

Conclusion: The proposed approach effectively optimizes inspection scheduling under complex constraints and provides insights into the impact of scheduling constraints on inspection effectiveness.

Abstract: Municipal inspections are an important part of maintaining the quality of goods and services. In this paper, we approach the problem of intelligently scheduling service inspections to maximize their impact, using the case of food establishment inspections in Chicago as a case study. The Chicago Department of Public Health (CDPH) inspects thousands of establishments each year, with a substantial fail rate (over 3,000 failed inspection reports in 2023). To balance the objectives of ensuring adherence to guidelines, minimizing disruption to establishments, and minimizing inspection costs, CDPH assigns each establishment an inspection window every year and guarantees that they will be inspected exactly once during that window. Meanwhile, CDPH also promises surprise public health inspections for unexpected food safety emergencies or complaints. These constraints create a challenge for a restless multi-armed bandit (RMAB) approach, for which there are no existing methods. We develop an extension to Whittle index-based systems for RMABs that can guarantee action window constraints and frequencies, and furthermore can be leveraged to optimize action window assignments themselves. Briefly, we combine MDP reformulation and integer programming-based lookahead to maximize the impact of inspections subject to constraints. A neural network-based supervised learning model is developed to model state transitions of real Chicago establishments using public CDPH inspection records, which demonstrates 10% AUC improvements compared with directly predicting establishments’ failures. Our experiments not only show up to 24% (in simulation) or 33% (on real data) objective improvements resulting from our approach and robustness to surprise inspections, but also give insight into the impact of scheduling constraints.

[1090] Uncertainty Quantification for Deep Learning

Peter Jan van Leeuwen, J. Christine Chiu, C. Kevin Yang

Main category: cs.LG

TL;DR: A comprehensive framework for statistically consistent uncertainty quantification in deep learning regression that addresses all major uncertainty sources and demonstrates practical advantages in real-world applications.

DetailsMotivation: To address partial uncertainty coverage and inconsistencies in existing deep learning uncertainty quantification methods by providing a statistically consistent framework.

Method: Apply Bayes’ theorem and conditional probability densities to systematically quantify uncertainty from input data, training/testing data, neural network weights, and model imperfections, with a fast practical implementation.

Result: Successfully demonstrated on regression problems including cloud autoconversion rate prediction, showing training/testing data uncertainty dominates, followed by input data, model, and weight variability.

Conclusion: Explicitly modeling training data uncertainty improves robustness to out-of-distribution inputs and enhances model reliability in real-world scenarios.

Abstract: We present a critical survey on the consistency of uncertainty quantification used in deep learning and highlight partial uncertainty coverage and many inconsistencies. We then provide a comprehensive and statistically consistent framework for uncertainty quantification in deep learning that accounts for all major sources of uncertainty: input data, training and testing data, neural network weights, and machine-learning model imperfections, targeting regression problems. We systematically quantify each source by applying Bayes’ theorem and conditional probability densities and introduce a fast, practical implementation method. We demonstrate its effectiveness on a simple regression problem and a real-world application: predicting cloud autoconversion rates using a neural network trained on aircraft measurements from the Azores and guided by a two-moment bin model of the stochastic collection equation. In this application, uncertainty from the training and testing data dominates, followed by input data, neural network model, and weight variability. Finally, we highlight the practical advantages of this methodology, showing that explicitly modeling training data uncertainty improves robustness to new inputs that fall outside the training data, and enhances model reliability in real-world scenarios.

[1091] DeToNATION: Decoupled Torch Network-Aware Training on Interlinked Online Nodes

Mogens Henrik From, Jacob Nielsen, Lukas Galke Poech, Peter Schneider-Kamp

Main category: cs.LG

TL;DR: FlexDeMo enables efficient distributed training of large models by sharding parameters locally and synchronizing only fast-moving gradient components between nodes, achieving similar performance to full gradient synchronization but with better speed.

DetailsMotivation: To address the computational challenges of training large neural networks distributed across multiple nodes and accelerators, while relaxing the assumption that models must fit on a single accelerator.

Method: Introduces FlexDeMo, a hybrid sharded data parallel training strategy where nodes fully shard model parameters locally between accelerators, and inter-node communication is reduced by synchronizing only fast-moving gradient components instead of full gradients.

Result: FlexDeMo attains similar validation loss as hybrid sharded data parallel training with full gradient synchronization and AdamW, while being substantially faster across language and vision domains.

Conclusion: FlexDeMo is a promising distributed training scheme for the largest machine learning models, offering efficient training through reduced communication overhead while maintaining performance.

Abstract: Training large neural network models requires extensive computational resources, often distributed across several nodes and accelerators. Recent findings suggest that it may be sufficient to only exchange the fast moving components of the gradients, while accumulating momentum locally (Decoupled Momentum, or DeMo). However, DeMo assumes that models fit on a single accelerator. We relax this assumption and introduce FlexDeMo, whereby nodes fully shard model parameters locally between different accelerators, while inter-node communication is reduced by synchronizing only fast-moving components instead of the full gradients – resulting in a hybrid sharded data parallel training strategy. We further introduce a framework, denoted as DeToNATION, that generalizes DeMo, FlexDeMo, and other popular distributed training schemes such as DiLoCo – introducing new variations of replication schemes and challenging choices made in DeMo. Our results across language and vision domains show that FlexDeMo attains similar validation loss as hybrid sharded data parallel training employing AdamW and full gradient synchronization, while being substantially faster. FlexDeMo is thus a promising distributed training scheme for the largest machine learning models.

[1092] Finite basis Kolmogorov-Arnold networks: domain decomposition for data-driven and physics-informed problems

Amanda A. Howard, Bruno Jacob, Sarah Helfert, Alexander Heinlein, Panos Stinis

Main category: cs.LG

TL;DR: Domain decomposition method for Kolmogorov-Arnold networks (KANs) enables parallel training of small KANs for multiscale problems, improving efficiency and accuracy.

DetailsMotivation: KANs are promising for scientific machine learning but expensive to train, especially for small networks. Need efficient training methods for multiscale problems.

Method: Developed finite basis KANs (FBKANs) using domain decomposition, allowing parallel training of multiple small KANs inspired by FBPINNs approach.

Result: FBKANs provide accurate solutions with noisy data and for physics-informed training, demonstrating effectiveness for multiscale problems.

Conclusion: Domain decomposition enables efficient parallel training of KANs, making them practical for scientific applications with multiscale challenges.

Abstract: Kolmogorov-Arnold networks (KANs) have attracted attention recently as an alternative to multilayer perceptrons (MLPs) for scientific machine learning. However, KANs can be expensive to train, even for relatively small networks. Inspired by finite basis physics-informed neural networks (FBPINNs), in this work, we develop a domain decomposition method for KANs that allows for several small KANs to be trained in parallel to give accurate solutions for multiscale problems. We show that finite basis KANs (FBKANs) can provide accurate results with noisy data and for physics-informed training.

[1093] Individualised Treatment Effects Estimation with Composite Treatments and Composite Outcomes

Vinod Kumar Chauhan, Lei Clifton, Gaurav Nigam, David A. Clifton

Main category: cs.LG

TL;DR: Proposes H-Learner, a hypernetwork-based approach for estimating individualised treatment effects with composite treatments and outcomes, addressing data scarcity through information sharing.

DetailsMotivation: Existing causal ML methods are limited to single treatments and outcomes, hindering application in complex real-world scenarios like healthcare where multiple interventions affect multiple outcomes.

Method: Uses hypernetwork-based architecture to dynamically share information across treatments and outcomes, tackling data scarcity issues in composite treatment-outcome settings.

Result: Empirical analysis shows H-Learner outperforms existing methods for both binary and arbitrary composite treatments and outcomes.

Conclusion: H-Learner effectively addresses the challenge of ITE estimation with composite treatments and outcomes, enabling more realistic causal inference in complex scenarios.

Abstract: Estimating individualised treatment effect (ITE) – that is the causal effect of a set of variables (also called exposures, treatments, actions, policies, or interventions), referred to as \textit{composite treatments}, on a set of outcome variables of interest, referred to as \textit{composite outcomes}, for a unit from observational data – remains a fundamental problem in causal inference with applications across disciplines, such as healthcare, economics, education, social science, marketing, and computer science. Previous work in causal machine learning for ITE estimation is limited to simple settings, like single treatments and single outcomes. This hinders their use in complex real-world scenarios; for example, consider studying the effect of different ICU interventions, such as beta-blockers and statins for a patient admitted for heart surgery, on different outcomes of interest such as atrial fibrillation and in-hospital mortality. The limited research into composite treatments and outcomes is primarily due to data scarcity for all treatments and outcomes. To address the above challenges, we propose a novel and innovative hypernetwork-based approach, called \emph{H-Learner}, to solve ITE estimation under composite treatments and composite outcomes, which tackles the data scarcity issue by dynamically sharing information across treatments and outcomes. Our empirical analysis with binary and arbitrary composite treatments and outcomes demonstrates the effectiveness of the proposed approach compared to existing methods.

[1094] Exploiting Missing Data Remediation Strategies using Adversarial Missingness Attacks

Deniz Koyuncu, Alex Gittens, Bülent Yener, Moti Yung

Main category: cs.LG

TL;DR: AM attacks manipulate model fitting by engineering missing data problems without inserting or perturbing data, achieving malicious objectives through bi-level optimization that incorporates differentiable approximations of missingness remediation techniques.

DetailsMotivation: To extend AM attacks beyond the limitation of requiring full-information maximum likelihood methods, making them applicable to more practical missing data handling techniques commonly used in real-world scenarios.

Method: Proposes a bi-level optimization framework where the adversary engineers adversarial missingness mechanisms, with the lower level incorporating differentiable approximations of targeted missingness remediation techniques (complete case analysis, mean imputation, and regression-based imputation) for general ERM problems.

Result: AM attacks succeed with modest missingness levels (<20%) and can manipulate ATE estimates - reversing sign and inflating values from -1.61% to as high as 10% on the Twins dataset, even when restricted to modifying only subsets of training data.

Conclusion: AM attacks are effective across multiple missing data handling techniques and can significantly manipulate model outcomes without traditional data poisoning methods, highlighting vulnerabilities in standard missing data remediation approaches.

Abstract: Adversarial Missingness (AM) attacks aim to manipulate model fitting by carefully engineering a missing data problem to achieve a specific malicious objective. AM attacks are significantly different from prior data poisoning attacks in that no malicious data inserted and no data is maliciously perturbed. Current AM attacks are feasible only under the assumption that the modeler (victim) uses full-information maximum likelihood methods to handle missingness. This work aims to remedy this limitation of AM attacks; in the approach taken here, the adversary achieves their goal by solving a bi-level optimization problem to engineer the adversarial missingness mechanism, where the lower level problem incorporates a differentiable approximation of the targeted missingness remediation technique. As instantiations of this framework, AM attacks are provided for three popular techniques: (i) complete case analysis, (ii) mean imputation, and (iii) regression-based imputation for general empirical risk minimization (ERM) problems. Experiments on real-world data show that AM attacks are successful with modest levels of missingness (less than 20%). Furthermore, we show on the real-world Twins dataset that AM attacks can manipulate the estimated average treatment effect (ATE) as an instance of the general ERM problems: the adversary succeeds in not only reversing the sign, but also in substantially inflating the ATE values from a true value of -1.61% to a manipulated one as high as 10%. These experimental results hold when the ATE is calculated using multiple regression-based estimators with different architectures, even when the adversary is restricted to modifying only a subset of the training data.

[1095] Explanation-Preserving Augmentation for Semi-Supervised Graph Representation Learning

Zhuomin Chen, Jingchao Ni, Hojat Allah Salehi, Xu Zheng, Esteban Schafir, Farhad Shirani, Dongsheng Luo

Main category: cs.LG

TL;DR: EPA-GRL uses graph explanations to create semantics-preserving augmentations for self-supervised graph representation learning, outperforming state-of-the-art methods.

DetailsMotivation: Most self-supervised GRL methods focus only on data-perturbation without ensuring semantics-preservation, leading to suboptimal solutions. The gap between effective augmentation requiring both aspects motivates the use of graph explanations.

Method: EPA-GRL first trains a graph explainer with few labels to identify label-explanatory subgraphs, then uses these explanations to generate semantics-preserving augmentations for self-supervised learning.

Result: Theoretical analysis and extensive experiments on benchmark datasets show EPA-GRL outperforms state-of-the-art GRL methods that use semantics-agnostic augmentations.

Conclusion: Leveraging graph explanations for semantics-preserving augmentation significantly improves self-supervised graph representation learning performance.

Abstract: Self-supervised graph representation learning (GRL) typically generates paired graph augmentations from each graph to infer similar representations for augmentations of the same graph, but distinguishable representations for different graphs. While effective augmentation requires both semantics-preservation and data-perturbation, most existing GRL methods focus solely on data-perturbation, leading to suboptimal solutions. To fill the gap, in this paper, we propose a novel method, Explanation-Preserving Augmentation (EPA), which leverages graph explanation for semantics-preservation. EPA first uses a small number of labels to train a graph explainer, which infers the subgraphs that explain the graph’s label. Then these explanations are used for generating semantics-preserving augmentations for boosting self-supervised GRL. Thus, the entire process, namely EPA-GRL, is semi-supervised. We demonstrate theoretically, using an analytical example, and through extensive experiments on a variety of benchmark datasets, that EPA-GRL outperforms the state-of-the-art (SOTA) GRL methods that use semantics-agnostic augmentations. The code is available at https://github.com/realMoana/EPA-GRL.

[1096] DeepMIDE: A Multi-Output Spatio-Temporal Method for Ultra-Scale Offshore Wind Energy Forecasting

Feng Ye, Xinxi Zhang, Michael Stein, Ahmed Aziz Ezzat

Main category: cs.LG

TL;DR: DeepMIDE is a deep learning method for multi-height offshore wind forecasting that models wind speeds across space, time, and height using physics-informed advection vectors.

DetailsMotivation: Traditional wind forecasting methods focus on single heights, but larger offshore turbines require forecasting across multiple heights to access stronger winds at higher altitudes.

Method: Multi-output integro-difference equation model with multivariate nonstationary kernel using physics-informed advection vectors learned from weather data via deep learning architecture.

Result: Outperforms prevalent time series, spatio-temporal, and deep learning methods on real-world data from Northeastern US offshore wind energy areas.

Conclusion: DeepMIDE successfully addresses the need for multi-height wind forecasting in offshore wind energy through physics-informed deep learning.

Abstract: To unlock access to stronger winds, the offshore wind industry is advancing towards significantly larger and taller wind turbines. This massive upscaling motivates a departure from wind forecasting methods that traditionally focused on a single representative height. To fill this gap, we propose DeepMIDE–a statistical deep learning method which jointly models the offshore wind speeds across space, time, and height. DeepMIDE is formulated as a multi-output integro-difference equation model with a multivariate nonstationary kernel characterized by a set of advection vectors that encode the physics of wind field formation and propagation. Embedded within DeepMIDE, an advanced deep learning architecture learns these advection vectors from high-dimensional streams of exogenous weather information, which, along with other parameters, are plugged back into the statistical model for probabilistic multi-height space-time forecasting. Tested on real-world data from offshore wind energy areas in the Northeastern United States, the wind speed and power forecasts from DeepMIDE are shown to outperform those from prevalent time series, spatio-temporal, and deep learning methods.

[1097] EXAGREE: Mitigating Explanation Disagreement with Stakeholder-Aligned Models

Sichao Li, Tommy Liu, Quanling Deng, Amanda S. Barnard

Main category: cs.LG

TL;DR: EXAGREE is a framework that selects stakeholder-aligned explanation models by maximizing agreement between human stakeholders and machine explanations, unifying faithfulness and plausibility into a single metric.

DetailsMotivation: Conflicting explanations from different attribution methods hinder ML adoption in safety-critical domains. The paper aims to turn this disagreement into an advantage by selecting models that align with stakeholder preferences.

Method: Two-stage framework using a differentiable mask-based attribution network (DMAN) with monotone differentiable sorting for gradient-based search in constrained model space to maximize Stakeholder-Machine Agreement (SMA).

Result: Experiments on six datasets show simultaneous improvements in faithfulness, plausibility, and fairness over baselines while maintaining task accuracy. Ablation studies confirm robustness.

Conclusion: EXAGREE provides a practical method to select explanation models that align with stakeholder preferences, addressing the challenge of conflicting explanations in safety-critical ML applications.

Abstract: Conflicting explanations, arising from different attribution methods or model internals, limit the adoption of machine learning models in safety-critical domains. We turn this disagreement into an advantage and introduce EXplanation AGREEment (EXAGREE), a two-stage framework that selects a Stakeholder-Aligned Explanation Model (SAEM) from a set of similar-performing models. The selection maximizes Stakeholder-Machine Agreement (SMA), a single metric that unifies faithfulness and plausibility. EXAGREE couples a differentiable mask-based attribution network (DMAN) with monotone differentiable sorting, enabling gradient-based search inside the constrained model space. Experiments on six real-world datasets demonstrate simultaneous gains of faithfulness, plausibility, and fairness over baselines, while preserving task accuracy. Extensive ablation studies, significance tests, and case studies confirm the robustness and feasibility of the method in practice.

[1098] Foundation Model in Biomedicine

Xiangrui Liu, Yuanyuan Zhang, Qianyu Shang, Yingzhou Lu, Changchang Yin, Xiaoling Hu, Xiaoou Liu, Lulu Chen, Alexander Rodríguez, Yezhou Yang, Ping Zhang, Jintai Chen, Shan Du, Huaxiu Yao, Sheng Wang, Tianfan Fu, Xiao Wang

Main category: cs.LG

TL;DR: Survey on biomedical foundation models - large pretrained AI models applied to healthcare domains like drug discovery, medical imaging, and clinical informatics.

DetailsMotivation: Foundation models have shown broad applicability across fields, and their potential in biomedical domains represents a significant milestone for advancing medical research and practice through AI.

Method: This is a survey paper that explores and reviews the application of foundation models (large language models and vision-language models) across various biomedical fields.

Result: The survey examines how foundation models can be applied to computational biology, drug discovery, clinical informatics, medical imaging, and public health, demonstrating their versatility in healthcare applications.

Conclusion: The purpose is to inspire ongoing research in applying foundation models to health science, highlighting their potential to transform biomedical research and medical practice.

Abstract: Foundation models, first introduced in 2021, refer to large-scale pretrained models (e.g., large language models (LLMs) and vision-language models (VLMs)) that learn from extensive unlabeled datasets through unsupervised methods, enabling them to excel in diverse downstream tasks. These models, like GPT, can be adapted to various applications such as question answering and visual understanding, outperforming task-specific AI models and earning their name due to broad applicability across fields. The development of biomedical foundation models marks a significant milestone in the use of artificial intelligence (AI) to understand complex biological phenomena and advance medical research and practice. This survey explores the potential of foundation models in diverse domains within biomedical fields, including computational biology, drug discovery and development, clinical informatics, medical imaging, and public health. The purpose of this survey is to inspire ongoing research in the application of foundation models to health science.

[1099] Physically Interpretable World Models via Weakly Supervised Representation Learning

Zhenjiang Mao, Mrinall Eashaan Umasudhan, Ivan Ruchkin

Main category: cs.LG

TL;DR: PIWM is a framework that learns physically interpretable world models from images by aligning latent representations with real-world physical quantities and constraining their evolution through known physical dynamics, without requiring ground-truth physical annotations.

DetailsMotivation: Standard world models lack physical interpretability, limiting their reliability, generalizability, and applicability to safety-critical tasks in cyber-physical systems.

Method: Uses weak distribution-based supervision, integrates VQ-based visual encoder, transformer-based physical encoder, and learnable dynamics model grounded in known physical equations to achieve physical interpretability without ground-truth annotations.

Result: Across three case studies (Cart Pole, Lunar Lander, Donkey Car), PIWM achieves accurate long-horizon prediction, recovers true system parameters, and significantly improves physical grounding over purely data-driven models.

Conclusion: Demonstrates the feasibility and advantages of learning physically interpretable world models directly from images under weak supervision, enabling more reliable and generalizable models for safety-critical applications.

Abstract: Learning predictive models from high-dimensional sensory observations is fundamental for cyber-physical systems, yet the latent representations learned by standard world models lack physical interpretability. This limits their reliability, generalizability, and applicability to safety-critical tasks. We introduce Physically Interpretable World Models (PIWM), a framework that aligns latent representations with real-world physical quantities and constrains their evolution through partially known physical dynamics. Physical interpretability in PIWM is defined by two complementary properties: (i) the learned latent state corresponds to meaningful physical variables, and (ii) its temporal evolution follows physically consistent dynamics. To achieve this without requiring ground-truth physical annotations, PIWM employs weak distribution-based supervision that captures state uncertainty naturally arising from real-world sensing pipelines. The architecture integrates a VQ-based visual encoder, a transformer-based physical encoder, and a learnable dynamics model grounded in known physical equations. Across three case studies (Cart Pole, Lunar Lander, and Donkey Car), PIWM achieves accurate long-horizon prediction, recovers true system parameters, and significantly improves physical grounding over purely data-driven models. These results demonstrate the feasibility and advantages of learning physically interpretable world models directly from images under weak supervision.

[1100] Self-Supervised Learning Using Nonlinear Dependence

M. Hadi Sepanj, Benyamin Ghojogh, Paul Fieguth

Main category: cs.LG

TL;DR: CDSSL is a self-supervised learning framework that integrates both linear correlations and nonlinear dependencies using HSIC to improve representation learning in high-dimensional visual data.

DetailsMotivation: Existing SSL methods focus on feature variance and linear correlations but neglect complex nonlinear dependencies and sample-wise interactions in high-dimensional data.

Method: Proposes Correlation-Dependence SSL (CDSSL) that unifies linear correlations and nonlinear dependencies using Hilbert-Schmidt Independence Criterion (HSIC) in a Reproducing Kernel Hilbert Space.

Result: Experimental evaluations on diverse benchmarks show CDSSL effectively improves representation quality compared to existing methods.

Conclusion: CDSSL successfully addresses limitations of current SSL approaches by capturing both linear and nonlinear dependencies, enhancing representation learning for complex visual data.

Abstract: Self-supervised learning has gained significant attention in contemporary applications, particularly due to the scarcity of labeled data. While existing SSL methodologies primarily address feature variance and linear correlations, they often neglect the intricate relations between samples and the nonlinear dependencies inherent in complex data–especially prevalent in high-dimensional visual data. In this paper, we introduce Correlation-Dependence Self-Supervised Learning (CDSSL), a novel framework that unifies and extends existing SSL paradigms by integrating both linear correlations and nonlinear dependencies, encapsulating sample-wise and feature-wise interactions. Our approach incorporates the Hilbert-Schmidt Independence Criterion (HSIC) to robustly capture nonlinear dependencies within a Reproducing Kernel Hilbert Space, enriching representation learning. Experimental evaluations on diverse benchmarks demonstrate the efficacy of CDSSL in improving representation quality.

[1101] Emotional EEG Classification using Upscaled Connectivity Matrices

Chae-Won Lee, Jong-Seok Lee

Main category: cs.LG

TL;DR: Proposes upscaling connectivity matrices before CNN processing to preserve important emotional EEG patterns that get lost during convolution operations.

DetailsMotivation: Current CNN approaches using connectivity matrices for emotional EEG classification lose important patterns during convolutional operations, limiting classification performance.

Method: Upscale connectivity matrices before feeding them into convolutional neural networks to strengthen local patterns and prevent information loss.

Result: Experimental results show significant enhancement in classification performance compared to standard connectivity matrix approaches.

Conclusion: Simple upscaling of connectivity matrices is an effective method to improve emotional EEG classification by preserving important inter-regional interaction patterns.

Abstract: In recent studies of emotional EEG classification, connectivity matrices have been successfully employed as input to convolutional neural networks (CNNs), which can effectively consider inter-regional interaction patterns in EEG. However, we find that such an approach has a limitation that important patterns in connectivity matrices may be lost during the convolutional operations in CNNs. To resolve this issue, we propose and validate an idea to upscale the connectivity matrices to strengthen the local patterns. Experimental results demonstrate that this simple idea can significantly enhance the classification performance.

[1102] Learning Theory for Kernel Bilevel Optimization

Fares El Khoury, Edouard Pauwels, Samuel Vaiter, Michael Arbel

Main category: cs.LG

TL;DR: First theoretical analysis of nonparametric bilevel optimization using kernel methods, deriving generalization bounds and assessing statistical accuracy of gradient-based methods.

DetailsMotivation: Bilevel optimization is widely used in ML but lacks theoretical foundation for nonparametric settings. This paper aims to bridge this gap by studying kernel-based bilevel optimization.

Method: Proposes Kernel Bilevel Optimization (KBO) where inner objective is optimized over reproducing kernel Hilbert space. Uses empirical process theory to derive finite-sample generalization bounds.

Result: Derives novel generalization bounds for KBO and assesses statistical accuracy of gradient-based methods applied to empirical discretization of KBO.

Conclusion: Provides first theoretical foundation for nonparametric bilevel optimization, with numerical validation on synthetic instrumental variable regression.

Abstract: Bilevel optimization has emerged as a technique for addressing a wide range of machine learning problems that involve an outer objective implicitly determined by the minimizer of an inner problem. While prior works have primarily focused on the parametric setting, a learning-theoretic foundation for bilevel optimization in the nonparametric case remains relatively unexplored. In this paper, we take a first step toward bridging this gap by studying Kernel Bilevel Optimization (KBO), where the inner objective is optimized over a reproducing kernel Hilbert space. This setting enables rich function approximation while providing a foundation for rigorous theoretical analysis. In this context, we derive novel finite-sample generalization bounds for KBO, leveraging tools from empirical process theory. These bounds further allow us to assess the statistical accuracy of gradient-based methods applied to the empirical discretization of KBO. We numerically illustrate our theoretical findings on a synthetic instrumental variable regression task.

[1103] Provably Efficient Multi-Objective Bandit Algorithms under Preference-Centric Customization

Linfeng Cao, Ming Shi, Ness B. Shroff

Main category: cs.LG

TL;DR: This paper introduces a preference-aware multi-objective multi-armed bandit framework that shifts from traditional Pareto optimality to customized optimization within the Pareto front based on explicit user preferences.

DetailsMotivation: Real-world scenarios involve users with varying preferences, making traditional Pareto optimal arms potentially poor for some users, highlighting the need for customized learning often overlooked in prior research.

Method: Developed preference-aware MO-MAB framework with preference estimation and optimization mechanisms for two scenarios: unknown preference and hidden preference, using novel analytical techniques.

Result: Established near-optimal regret bounds for proposed algorithms and demonstrated strong empirical performance confirming the effectiveness of the approach.

Conclusion: This is the first theoretical study of customized MO-MAB optimization with explicit user preferences, successfully addressing the gap in preference-aware learning within multi-objective bandit problems.

Abstract: Multi-objective multi-armed bandit (MO-MAB) problems traditionally aim to achieve Pareto optimality. However, real-world scenarios often involve users with varying preferences across objectives, resulting in a Pareto-optimal arm that may score high for one user but perform quite poorly for another. This highlights the need for customized learning, a factor often overlooked in prior research. To address this, we study a preference-aware MO-MAB framework in the presence of explicit user preference. It shifts the focus from achieving Pareto optimality to further optimizing within the Pareto front under preference-centric customization. To our knowledge, this is the first theoretical study of customized MO-MAB optimization with explicit user preferences. Motivated by practical applications, we explore two scenarios: unknown preference and hidden preference, each presenting unique challenges for algorithm design and analysis. At the core of our algorithms are preference estimation and preference-aware optimization mechanisms to adapt to user preferences effectively. We further develop novel analytical techniques to establish near-optimal regret of the proposed algorithms. Strong empirical performance confirm the effectiveness of our approach.

[1104] Simultaneous Swap Regret Minimization via KL-Calibration

Haipeng Luo, Spandan Senapati, Vatsal Sharan

Main category: cs.LG

TL;DR: This paper significantly generalizes previous calibration results by achieving O(T^{1/3}) swap regret for proper losses with twice continuously differentiable univariate forms, introduces pseudo KL-Calibration as a stronger notion, and provides explicit algorithms with improved bounds.

DetailsMotivation: To generalize and improve upon recent calibration results by Fishelson et al. (2025), addressing limitations in the types of proper losses covered and the nature of regret bounds (pseudo vs actual swap regret).

Method: Introduces pseudo KL-Calibration as a stronger calibration notion equivalent to swap regret for log loss, develops a new randomized rounding procedure and non-uniform discretization scheme to minimize swap regret for log loss, and provides explicit algorithms achieving the bounds.

Result: Achieves O(T^{1/3}) swap regret for any proper loss with twice continuously differentiable univariate form (including Tsallis entropy), O(T^{1/3}) KL-Calibration error, and O(T^{1/3}(log T)^{-1/3}log(T/δ)) swap regret with high probability for smooth univariate forms, implying O(T^{1/3}) ℓ_2-Calibration error.

Conclusion: The work significantly extends calibration theory by covering broader classes of proper losses, providing actual swap regret bounds (not just pseudo), and introducing stronger calibration notions with practical algorithms achieving optimal rates.

Abstract: Calibration is a fundamental concept that aims at ensuring the reliability of probabilistic predictions by aligning them with real-world outcomes. There is a surge of studies on new calibration measures that are easier to optimize compared to the classical $\ell_1$-Calibration while still having strong implications for downstream applications. One recent such example is the work by Fishelson et al. (2025) who show that it is possible to achieve $O(T^{1/3})$ pseudo $\ell_2$-Calibration error via minimizing pseudo swap regret of the squared loss, which in fact implies the same bound for all bounded proper losses with a smooth univariate form. In this work, we significantly generalize their result in the following ways: (a) in addition to smooth univariate forms, our algorithm also simultaneously achieves $O(T^{1/3})$ swap regret for any proper loss with a twice continuously differentiable univariate form (such as Tsallis entropy); (b) our bounds hold not only for pseudo swap regret that measures losses using the forecaster’s distributions on predictions, but also hold for the actual swap regret that measures losses using the forecaster’s actual realized predictions. We achieve so by introducing a new stronger notion of calibration called (pseudo) KL-Calibration, which we show is equivalent to the (pseudo) swap regret for log loss. We prove that there exists an algorithm that achieves $O(T^{1/3})$ KL-Calibration error and provide an explicit algorithm that achieves $O(T^{1/3})$ pseudo KL-Calibration error. Moreover, we show that the same algorithm achieves $O(T^{1/3}(\log T)^{-1/3}\log(T/δ))$ swap regret w.p. $\ge 1-δ$ for any proper loss with a smooth univariate form, which implies $O(T^{1/3})$ $\ell_2$-Calibration error. A technical contribution of our work is a new randomized rounding procedure and a non-uniform discretization scheme to minimize the swap regret for log loss.

[1105] An Improved Privacy and Utility Analysis of Differentially Private SGD with Bounded Domain and Smooth Losses

Hao Liang, Wanrong Zhang, Xinlei He, Kaishun Wu, Hong Xing

Main category: cs.LG

TL;DR: This paper provides rigorous privacy analysis for DPSGD with non-convex loss functions, showing privacy loss convergence without convexity assumptions and improved privacy-utility trade-offs with bounded domains.

DetailsMotivation: Current DPSGD privacy guarantees have limitations: they often assume convexity or impose complex parameters, and rarely deeply analyze how privacy mechanisms affect model utility. There's a need for more practical privacy analysis without restrictive assumptions.

Method: The authors track privacy loss over multiple iterations using noisy smooth-reduction properties, establish comprehensive convergence analysis for different scenarios, and validate insights through membership inference attack experiments.

Result: For DPSGD with bounded domain: (i) privacy loss converges without convexity, (ii) smaller bounded diameter improves both privacy and utility under certain conditions, (iii) better privacy-utility trade-offs are achievable for DPSGD-GC and DPSGD-DC with strongly convex functions.

Conclusion: The theoretical analysis provides practical insights for DPSGD deployment, showing that bounded domains enable privacy loss convergence and improved privacy-utility trade-offs even for non-convex problems, validated by experimental results.

Abstract: Differentially Private Stochastic Gradient Descent (DPSGD) is widely used to protect sensitive data during the training of machine learning models, but its privacy guarantee often comes at a large cost of model performance due to the lack of tight theoretical bounds quantifying privacy loss. While recent efforts have achieved more accurate privacy guarantees, they still impose some assumptions prohibited from practical applications, such as convexity and complex parameter requirements, and rarely investigate in-depth the impact of privacy mechanisms on the model’s utility. In this paper, we provide a rigorous privacy characterization for DPSGD with general L-smooth and non-convex loss functions, revealing converged privacy loss with iteration in bounded-domain cases. Specifically, we track the privacy loss over multiple iterations, leveraging the noisy smooth-reduction property, and further establish comprehensive convergence analysis in different scenarios. In particular, we show that for DPSGD with a bounded domain, (i) the privacy loss can still converge without the convexity assumption, (ii) a smaller bounded diameter can improve both privacy and utility simultaneously under certain conditions, and (iii) the attainable big-O order of the privacy utility trade-off for DPSGD with gradient clipping (DPSGD-GC) and for DPSGD-GC with bounded domain (DPSGD-DC) and mu-strongly convex population risk function, respectively. Experiments via membership inference attack (MIA) in a practical setting validate insights gained from the theoretical results.

[1106] Graph Neural Network-Based Reinforcement Learning for Controlling Biological Networks - the GATTACA Framework

Andrzej Mizera, Jakub Zarzycki

Main category: cs.LG

TL;DR: GATTACA is a scalable deep reinforcement learning framework that controls Boolean network models for cellular reprogramming, using graph neural networks to handle large biological systems efficiently.

DetailsMotivation: Cellular reprogramming has therapeutic potential but wet-lab experiments are time-consuming and expensive. Computational approaches are needed to identify effective reprogramming strategies more efficiently.

Method: Developed GATTACA framework using deep reinforcement learning with graph neural networks to control Boolean network models under asynchronous update mode, incorporating pseudo-attractor states for scalability.

Result: Experiments on large-scale real-world biological networks demonstrated the framework’s scalability and effectiveness in cellular reprogramming tasks.

Conclusion: The proposed DRL-based framework provides a scalable and effective computational approach for cellular reprogramming that can handle complex biological systems.

Abstract: Cellular reprogramming, the artificial transformation of one cell type into another, has been attracting increasing research attention due to its therapeutic potential for complex diseases. However, identifying effective reprogramming strategies through classical wet-lab experiments is hindered by lengthy time commitments and high costs. In this study, we explore the use of deep reinforcement learning (DRL) to control Boolean network models of complex biological systems, such as gene regulatory and signalling pathway networks. We formulate a novel control problem for Boolean network models under the asynchronous update mode, specifically in the context of cellular reprogramming. To solve it, we devise GATTACA, a scalable computational framework. To facilitate scalability of our framework, we consider previously introduced concept of a pseudo-attractor and improve the procedure for effective identification of pseudo-attractor states. We then incorporate graph neural networks with graph convolution operations into the artificial neural network approximator of the DRL agent’s action-value function. This allows us to leverage the available knowledge on the structure of a biological system and to indirectly, yet effectively, encode the system’s modelled dynamics into a latent representation. Experiments on several large-scale, real-world biological networks from the literature demonstrate the scalability and effectiveness of our approach.

[1107] Statistical Deficiency for Task Inclusion Estimation

Loïc Fosse, Frédéric Béchet, Benoît Favre, Géraldine Damnati, Gwénolé Lecorvé, Maxime Darrin, Philippe Formont, Pablo Piantanida

Main category: cs.LG

TL;DR: The paper proposes a theoretical framework to define tasks and measure task inclusion using statistical deficiency concepts, with a tractable proxy called information sufficiency that validates on synthetic data and reconstructs the NLP pipeline.

DetailsMotivation: Current machine learning lacks well-founded tools to study the structure of task spaces, despite the importance of tasks for assessing model capabilities and the trend toward general models that can handle any task.

Method: Develop a theoretically grounded setup to define tasks and compute task inclusion from a statistical deficiency perspective, proposing information sufficiency as a tractable proxy for estimating task inclusion.

Result: The proposed information sufficiency proxy is shown to be sound on synthetic data and successfully used to empirically reconstruct the classic NLP pipeline.

Conclusion: The framework provides a principled way to study task relationships and inclusion, offering tools to understand the structure of task spaces in machine learning.

Abstract: Tasks are central in machine learning, as they are the most natural objects to assess the capabilities of current models. The trend is to build general models able to address any task. Even though transfer learning and multitask learning try to leverage the underlying task space, no well-founded tools are available to study its structure. This study proposes a theoretically grounded setup to define the notion of task and to compute the {\bf inclusion} between two tasks from a statistical deficiency point of view. We propose a tractable proxy as information sufficiency to estimate the degree of inclusion between tasks, show its soundness on synthetic data, and use it to reconstruct empirically the classic NLP pipeline.

[1108] CAD-VAE: Leveraging Correlation-Aware Latents for Comprehensive Fair Disentanglement

Chenrui Ma, Xi Xiao, Tianyang Wang, Xiao Wang, Yanning Shen

Main category: cs.LG

TL;DR: CAD-VAE is a correlation-aware disentangled VAE that introduces a correlated latent code to capture shared information between target and sensitive attributes, enabling fairer representations without requiring domain knowledge.

DetailsMotivation: Deep generative models may inherit or amplify biases by encoding sensitive attributes alongside predictive features, and strict independence in disentanglement is often unrealistic when target and sensitive factors are naturally correlated.

Method: Introduces a correlated latent code to capture shared information between target and sensitive attributes, minimizes conditional mutual information between target and sensitive codes, and uses relevance-driven optimization to refine the correlated code by capturing essential correlated features and eliminating redundancy.

Result: Extensive experiments on benchmark datasets demonstrate that CAD-VAE produces fairer representations, realistic counterfactuals, and improved fairness-aware image editing.

Conclusion: CAD-VAE effectively addresses bias and fairness issues in deep generative models by properly handling natural correlations between target and sensitive attributes through correlation-aware disentanglement.

Abstract: While deep generative models have significantly advanced representation learning, they may inherit or amplify biases and fairness issues by encoding sensitive attributes alongside predictive features. Enforcing strict independence in disentanglement is often unrealistic when target and sensitive factors are naturally correlated. To address this challenge, we propose \textbf{CAD-VAE} (\textbf{C}orrelation-\textbf{A}ware \textbf{D}isentangled \textbf{VAE}), which introduces a correlated latent code to capture the information shared between the target and sensitive attributes. Given this correlated latent, our method effectively separates overlapping factors without extra domain knowledge by directly minimizing the conditional mutual information between target and sensitive codes. A relevance-driven optimization strategy refines the correlated code by efficiently capturing essential correlated features and eliminating redundancy. Extensive experiments on benchmark datasets demonstrate that CAD-VAE produces fairer representations, realistic counterfactuals, and improved fairness-aware image editing. Source code is available : https://github.com/merry7cherry/CAD-VAE

[1109] SineLoRA$Δ$: Sine-Activated Delta Compression

Cameron Gordon, Yiping Ji, Hemanth Saratchandran, Paul Albert, Simon Lucey

Main category: cs.LG

TL;DR: SineLoRAΔ improves delta compression for resource-constrained weight deployment by enhancing quantized low-rank adapters with sinusoidal activations, achieving up to 66% memory reduction while maintaining performance across various domains.

DetailsMotivation: Address limitations of parameter-efficient updates like LoRA in delta compression scenarios, especially when combined with aggressive quantization, by improving representation capacity without adding parameters.

Method: Extends recent work using fixed-frequency sinusoidal functions to increase stable rank, applies this to quantized setting with theoretical analysis of stable rank evolution under quantization, and introduces SineLoRAΔ with sinusoidal activation for quantized low-rank adapters.

Result: Validated across language modeling, vision-language tasks, and text-to-image generation, achieving up to 66% memory reduction with similar performance. Also provides novel application of Bjøntegaard Delta metric for consistent adapter compression comparison.

Conclusion: SineLoRAΔ is a principled and effective method for delta compression that significantly improves expressivity of quantized low-rank adapters while maintaining performance across diverse domains.

Abstract: Resource-constrained weight deployment is a task of immense practical importance. Recently, there has been interest in the specific task of \textit{Delta Compression}, where parties each hold a common base model and only communicate compressed weight updates. However, popular parameter efficient updates such as Low Rank Adaptation (LoRA) face inherent representation limitations - which are especially pronounced when combined with aggressive quantization. To overcome this, we build on recent work that improves LoRA representation capacity by using fixed-frequency sinusoidal functions to increase stable rank without adding additional parameters. We extend this to the quantized setting and present the first theoretical analysis showing how stable rank evolves under quantization. From this, we introduce SineLoRA$Δ$, a principled and effective method for delta compression that improves the expressivity of quantized low-rank adapters by applying a sinusoidal activation. We validate SineLoRA$Δ$ across a diverse variety of domains - including language modeling, vision-language tasks, and text-to-image generation - achieving up to 66% memory reduction with similar performance. We additionally provide a novel application of the canonical Bjøntegaard Delta metric to consistently compare adapter compression changes across the rate-distortion curve.

[1110] ELECTRA: A Cartesian Network for 3D Charge Density Prediction with Floating Orbitals

Jonas Elsborg, Luca Thiede, Alán Aspuru-Guzik, Tejs Vegge, Arghya Bhowmik

Main category: cs.LG

TL;DR: ELECTRA is an equivariant model that predicts electronic charge densities using floating Gaussian orbitals positioned freely in space, achieving state-of-the-art accuracy and reducing DFT computation time by 50.72% through better initialization.

DetailsMotivation: Traditional quantum chemistry methods center orbitals at atomic positions, requiring extensive domain knowledge for optimal floating orbital placement. ELECTRA aims to automate this process data-drivenly for more compact and accurate representations.

Method: Uses a Cartesian tensor network with symmetry-breaking mechanism to predict orbital positions and coefficients. Employs Gaussian orbitals inspired by Gaussian Splatting, predicting weights and covariance matrices while preserving rotation equivariance of charge density.

Result: Achieves state-of-the-art balance between computational efficiency and predictive accuracy. Reduces DFT self-consistent field iterations by 50.72% on unseen molecules when used for initialization.

Conclusion: ELECTRA successfully automates floating orbital placement through data-driven learning, providing both accurate charge density predictions and significant computational savings for DFT calculations.

Abstract: We present the Electronic Tensor Reconstruction Algorithm (ELECTRA) - an equivariant model for predicting electronic charge densities using floating orbitals. Floating orbitals are a long-standing concept in the quantum chemistry community that promises more compact and accurate representations by placing orbitals freely in space, as opposed to centering all orbitals at the position of atoms. Finding the ideal placement of these orbitals requires extensive domain knowledge, though, which thus far has prevented widespread adoption. We solve this in a data-driven manner by training a Cartesian tensor network to predict the orbital positions along with orbital coefficients. This is made possible through a symmetry-breaking mechanism that is used to learn position displacements with lower symmetry than the input molecule while preserving the rotation equivariance of the charge density itself. Inspired by recent successes of Gaussian Splatting in representing densities in space, we are using Gaussian orbitals and predicting their weights and covariance matrices. Our method achieves a state-of-the-art balance between computational efficiency and predictive accuracy on established benchmarks. Furthermore, ELECTRA is able to lower the compute time required to arrive at converged DFT solutions - initializing calculations using our predicted densities yields an average 50.72 % reduction in self-consistent field (SCF) iterations on unseen molecules.

[1111] GP-MoLFormer-Sim: Test Time Molecular Optimization through Contextual Similarity Guidance

Jiri Navratil, Jarret Ross, Payel Das, Youssef Mroueh, Samuel C Hoffman, Vijil Chenthamarakshan, Brian Belgodere

Main category: cs.LG

TL;DR: A training-free method called GP-MoLFormer-Sim that guides molecular generation using similarity to target molecules, integrated with genetic algorithms for molecular optimization tasks.

DetailsMotivation: The need to design molecules while preserving similarity to target molecules and properties is crucial for drug discovery and chemical design applications.

Method: Leverages contextual representations from a Chemical Language Model (CLM) to estimate molecular similarity, then adjusts autoregressive sampling to preserve similarity. Integrated with genetic algorithms for optimization.

Result: GP-MoLFormer-Sim+GA outperforms existing training-free baseline methods on molecular optimization benchmarks involving property optimization, molecular rediscovery, and structure-based drug design.

Conclusion: The method advances understanding of generative mechanisms in Chemical Language Models and provides effective guidance for molecular generation while preserving similarity.

Abstract: The ability to design molecules while preserving similarity to a target molecule and/or property is crucial for various applications in drug discovery, chemical design, and biology. We introduce in this paper an efficient training-free method for navigating and sampling from the molecular space with a generative Chemical Language Model (CLM), while using the molecular similarity to the target as a guide. Our method leverages the contextual representations learned from the CLM itself to estimate the molecular similarity, which is then used to adjust the autoregressive sampling strategy of the CLM. At each step of the decoding process, the method tracks the distance of the current generations from the target and updates the logits to encourage the preservation of similarity in generations. We implement the method using a recently proposed $\sim$47M parameter SMILES-based CLM, GP-MoLFormer, and therefore refer to the method as GP-MoLFormer-Sim, which enables a test-time update of the deep generative policy to reflect the contextual similarity to a set of guide molecules. The method is further integrated into a genetic algorithm (GA) and tested on a set of standard molecular optimization benchmarks involving property optimization, molecular rediscovery, and structure-based drug design. Results show that, GP-MoLFormer-Sim, combined with GA (GP-MoLFormer-Sim+GA) outperforms existing training-free baseline methods, when the oracle remains black-box. The findings in this work are a step forward in understanding and guiding the generative mechanisms of CLMs.

[1112] Deep Joint Distribution Optimal Transport for Universal Domain Adaptation on Time Series

Romain Mussard, Fannia Pacheco, Maxime Berar, Gilles Gasso, Paul Honeine

Main category: cs.LG

TL;DR: UniJDOT is a Universal Domain Adaptation method for Time Series that uses optimal transport with unknown sample consideration, joint decision space, auto-thresholding, and Fourier-based representation to improve discriminability and robustness.

DetailsMotivation: Existing UniDA methods for Time Series are limited by fixed/fine-tuned thresholds and discriminability metrics that exhibit overconfidence for unknown samples, leading to misclassifications.

Method: Proposes UniJDOT with optimal transport accounting for unknown samples, joint decision space for better discriminability, auto-thresholding algorithm, and Fourier transform-based layer for time series representation.

Result: Experiments on time series benchmarks demonstrate improved discriminability, robustness, and state-of-the-art performance compared to existing methods.

Conclusion: UniJDOT effectively addresses the challenges of Universal Domain Adaptation for Time Series through its novel approach combining optimal transport, joint decision space, auto-thresholding, and Fourier-based representation.

Abstract: Universal Domain Adaptation (UniDA) aims to transfer knowledge from a labeled source domain to an unlabeled target domain, even when their classes are not fully shared. Few dedicated UniDA methods exist for Time Series (TS), which remains a challenging case. In general, UniDA approaches align common class samples and detect unknown target samples from emerging classes. Such detection often results from thresholding a discriminability metric. The threshold value is typically either a fine-tuned hyperparameter or a fixed value, which limits the ability of the model to adapt to new data. Furthermore, discriminability metrics exhibit overconfidence for unknown samples, leading to misclassifications. This paper introduces UniJDOT, an optimal-transport-based method that accounts for the unknown target samples in the transport cost. Our method also proposes a joint decision space to improve the discriminability of the detection module. In addition, we use an auto-thresholding algorithm to reduce the dependence on fixed or fine-tuned thresholds. Finally, we rely on a Fourier transform-based layer inspired by the Fourier Neural Operator for better TS representation. Experiments on TS benchmarks demonstrate the discriminability, robustness, and state-of-the-art performance of UniJDOT.

[1113] Reward Redistribution via Gaussian Process Likelihood Estimation

Minheng Xiao, Xian Yu

Main category: cs.LG

TL;DR: GP-LRR is a Gaussian process-based reward redistribution framework that models reward dependencies between state-action pairs to handle sparse and delayed rewards in reinforcement learning.

DetailsMotivation: Existing reward redistribution methods assume independent per-step rewards, overlooking interdependencies among state-action pairs, which limits their effectiveness in handling sparse and delayed feedback.

Method: Models the reward function as a Gaussian process sample, capturing dependencies through kernel functions, and maximizes likelihood of observed episodic returns using a leave-one-out strategy with uncertainty regularization.

Result: When integrated with Soft Actor-Critic, GP-LRR provides dense and informative reward signals, achieving superior sample efficiency and policy performance on MuJoCo benchmarks.

Conclusion: GP-LRR effectively addresses sparse reward problems by modeling reward dependencies, with conventional MSE-based redistribution emerging as a special case of this more general framework.

Abstract: In many practical reinforcement learning tasks, feedback is only provided at the end of a long horizon, leading to sparse and delayed rewards. Existing reward redistribution methods typically assume that per-step rewards are independent, thus overlooking interdependencies among state-action pairs. In this paper, we propose a Gaussian process based Likelihood Reward Redistribution (GP-LRR) framework that addresses this issue by modeling the reward function as a sample from a Gaussian process, which explicitly captures dependencies between state-action pairs through the kernel function. By maximizing the likelihood of the observed episodic return via a leave-one-out strategy that leverages the entire trajectory, our framework inherently introduces uncertainty regularization. Moreover, we show that conventional mean-squared-error (MSE) based reward redistribution arises as a special case of our GP-LRR framework when using a degenerate kernel without observation noise. When integrated with an off-policy algorithm such as Soft Actor-Critic, GP-LRR yields dense and informative reward signals, resulting in superior sample efficiency and policy performance on several MuJoCo benchmarks.

[1114] MetaTT: A Global Tensor-Train Adapter for Parameter-Efficient Fine-Tuning

Javier Lopez-Piqueres, Pranav Deshpande, Archan Ray, Mattia J. Villani, Marco Pistoia, Niraj Kumar

Main category: cs.LG

TL;DR: MetaTT is a Tensor Train adapter framework for fine-tuning transformers that uses a single shared TT to factorize transformer sub-modules, achieving parameter efficiency through factorization that scales with sum rather than product of modes.

DetailsMotivation: To enable flexible and parameter-efficient model adaptation for pre-trained transformers by creating a more compact adapter that scales efficiently with model dimensions.

Method: Uses Tensor Train (TT) decomposition to factorize transformer sub-modules with a single shared TT that indexes structural dimensions (layer, matrix type, heads, tasks), and incorporates a rank-adaptive optimizer inspired by DMRG method.

Result: MetaTT achieves competitive parameter efficiency to accuracy tradeoff on single-task language modeling benchmarks and performs competitively on multi-task learning compared to state-of-the-art methods like LoRA.

Conclusion: The TT-based factorization approach provides substantial parameter efficiency, and the rank-adaptive optimizer enhances optimization performance when integrated with AdamW for specified target ranks.

Abstract: We present MetaTT, a Tensor Train (TT) adapter framework for fine-tuning of pre-trained transformers. MetaTT enables flexible and parameter-efficient model adaptation by using a single shared TT to factorize transformer sub-modules. This factorization indexes key structural dimensions, including layer and matrix type, and can optionally incorporate heads and tasks. This design allows MetaTT’s parameter count to scale with the sum, rather than the product, of the modes, resulting in a substantially more compact adapter. Our benchmarks compare MetaTT with LoRA along with recent state-of-the-art matrix and tensor decomposition based fine-tuning methods. We observe that when tested on single-task standard language modeling benchmarks, MetaTT achieves competitive parameter efficiency to accuracy tradeoff. We further demonstrate that MetaTT performs competitively when compared to state-of-the-art methods on multi-task learning. Finally, we leverage the TT-ansatz to design a rank adaptive optimizer inspired by the DMRG method from many-body physics. Our results demonstrate that integrating this approach with AdamW enhances optimization performance for a specified target rank.

[1115] Extendable Planning via Multiscale Diffusion

Chang Chen, Hany Hamed, Doojin Baek, Taegu Kang, Samyeul Noh, Yoshua Bengio, Sungjin Ahn

Main category: cs.LG

TL;DR: Proposes a two-phase solution for long-horizon planning using progressive trajectory extension and hierarchical multiscale diffusion to overcome limitations of diffusion-based planners.

DetailsMotivation: Diffusion-based planners like Diffuser are limited by training trajectory lengths, creating a dilemma where long trajectories are needed for effective planning but degrade model performance.

Method: Two-phase approach: 1) Progressive Trajectory Extension for incremental trajectory construction, 2) Hierarchical Multiscale Diffuser for efficient training and inference across temporal scales, with Adaptive Plan Pondering and Recursive HM-Diffuser to unify hierarchical planning in a single model.

Result: Experiments show strong performance gains in long-horizon planning tasks.

Conclusion: The approach advances scalable and efficient decision-making over long horizons by overcoming trajectory length limitations of diffusion-based planners.

Abstract: Long-horizon planning is crucial in complex environments, but diffusion-based planners like Diffuser are limited by the trajectory lengths observed during training. This creates a dilemma: long trajectories are needed for effective planning, yet they degrade model performance. In this paper, we introduce this extendable long-horizon planning challenge and propose a two-phase solution. First, Progressive Trajectory Extension incrementally constructs longer trajectories through multi-round compositional stitching. Second, the Hierarchical Multiscale Diffuser enables efficient training and inference over long horizons by reasoning across temporal scales. To avoid the need for multiple separate models, we propose Adaptive Plan Pondering and the Recursive HM-Diffuser, which unify hierarchical planning within a single model. Experiments show our approach yields strong performance gains, advancing scalable and efficient decision-making over long-horizons.

[1116] Maya: Optimizing Deep Learning Training Workloads using GPU Runtime Emulation

Srihas Yarlagadda, Amey Agrawal, Elton Pinto, Hakesh Darapaneni, Mitali Meratwal, Shivam Mittal, Pranavi Bajjuri, Srinivas Sridharan, Alexey Tumanov

Main category: cs.LG

TL;DR: Maya is a performance modeling system that uses transparent device emulation to accurately predict training performance without requiring code modifications, achieving <5% error and reducing training costs by up to 56%.

DetailsMotivation: Current performance modeling systems force users to translate workloads into custom specification languages, creating a semantic gap that forces tradeoffs between usability, generality, and accuracy.

Method: Maya operates at the interface between training frameworks and accelerator devices, intercepting device API calls from unmodified training code to directly observe low-level operations through transparent device emulation.

Result: Maya achieves less than 5% prediction error across diverse models and optimization strategies, identifying configurations that reduce training costs by up to 56% compared to existing approaches.

Conclusion: Maya eliminates the tradeoffs in existing performance modeling systems by providing accurate performance prediction while maintaining ease of use and generality through transparent device emulation.

Abstract: Training large foundation models costs hundreds of millions of dollars, making deployment optimization critical. Current approaches require machine learning engineers to manually craft training recipes through error-prone trial-and-error on expensive compute clusters. To enable efficient exploration of training configurations, researchers have developed performance modeling systems. However, these systems force users to translate their workloads into custom specification languages, introducing a fundamental semantic gap between the actual workload and its representation. This gap creates an inherent tradeoff: systems must either support a narrow set of workloads to maintain usability, require complex specifications that limit practical adoption, or compromise prediction accuracy with simplified performance models. We present Maya, a performance modeling system that eliminates these tradeoffs through transparent device emulation. By operating at the narrow interface between training frameworks and accelerator devices, Maya can capture complete workload behavior without requiring code modifications or translations. Maya intercepts device API calls from unmodified training code to directly observe low-level operations, enabling accurate performance prediction while maintaining both ease of use and generality. Our evaluation shows Maya achieves less than 5% prediction error across diverse models and optimization strategies, identifying configurations that reduce training costs by up to 56% compared to existing approaches.

[1117] Saturation Self-Organizing Map

Igor Urbanik, Paweł Gajewski

Main category: cs.LG

TL;DR: SatSOM extends Self-Organizing Maps with a saturation mechanism to reduce catastrophic forgetting in continual learning by freezing well-trained neurons and redirecting learning to underutilized areas.

DetailsMotivation: Self-Organizing Maps suffer from catastrophic forgetting in sequential tasks, limiting their effectiveness in continual learning scenarios despite their interpretability and efficiency.

Method: Introduces a saturation mechanism that gradually reduces learning rate and neighborhood radius of neurons as they accumulate information, effectively freezing well-trained neurons.

Result: The method improves knowledge retention in continual learning by redirecting learning to underutilized areas of the map.

Conclusion: SatSOM provides an effective extension to SOMs that enhances their capability for continual learning through controlled neuron saturation.

Abstract: Continual learning poses a fundamental challenge for neural systems, which often suffer from catastrophic forgetting when exposed to sequential tasks. Self-Organizing Maps (SOMs), despite their interpretability and efficiency, are not immune to this issue. In this paper, we introduce Saturation Self-Organizing Maps (SatSOM)-an extension of SOMs designed to improve knowledge retention in continual learning scenarios. SatSOM incorporates a novel saturation mechanism that gradually reduces the learning rate and neighborhood radius of neurons as they accumulate information. This effectively freezes well-trained neurons and redirects learning to underutilized areas of the map.

[1118] RadarLLM: Empowering Large Language Models to Understand Human Motion from Millimeter-Wave Point Cloud Sequence

Zengyuan Lai, Jiarui Yang, Songpengcheng Xia, Lizhou Lin, Lan Sun, Renwen Wang, Jianran Liu, Qi Wu, Ling Pei

Main category: cs.LG

TL;DR: RadarLLM is the first framework using large language models for human motion understanding from radar signals, achieving state-of-the-art performance with privacy-preserving capabilities.

DetailsMotivation: Millimeter-wave radar provides privacy-preserving and environment-robust sensing for human motion analysis, but sparse point clouds make semantic understanding challenging.

Method: Introduces motion-guided radar tokenizer with Aggregate VQ-VAE architecture and radar-aware language model for cross-modal alignment; generates synthetic radar-text data using physics-aware pipeline.

Result: Achieves state-of-the-art performance on synthetic and real-world benchmarks, enabling robust and interpretable motion understanding in adverse environments.

Conclusion: RadarLLM successfully bridges radar sensing and language models, offering privacy-preserving human motion analysis that works well under challenging conditions.

Abstract: Millimeter-wave radar offers a privacy-preserving and environment-robust alternative to vision-based sensing, enabling human motion analysis in challenging conditions such as low light, occlusions, rain, or smoke. However, its sparse point clouds pose significant challenges for semantic understanding. We present RadarLLM, the first framework that leverages large language models (LLMs) for human motion understanding from radar signals. RadarLLM introduces two key innovations: (1) a motion-guided radar tokenizer based on our Aggregate VQ-VAE architecture, integrating deformable body templates and masked trajectory modeling to convert spatial-temporal radar sequences into compact semantic tokens; and (2) a radar-aware language model that establishes cross-modal alignment between radar and text in a shared embedding space. To overcome the scarcity of paired radar-text data, we generate a realistic radar-text dataset from motion-text datasets with a physics-aware synthesis pipeline. Extensive experiments on both synthetic and real-world benchmarks show that RadarLLM achieves state-of-the-art performance, enabling robust and interpretable motion understanding under privacy and visibility constraints, even in adverse environments. This paper has been accepted for presentation at AAAI 2026. This is an extended version with supplementary materials.

[1119] Toward Explainable Offline RL: Analyzing Representations in Intrinsically Motivated Decision Transformers

Leonardo Guiducci, Antonio Rizzo, Giovanna Maria Dimitri

Main category: cs.LG

TL;DR: A systematic explainability framework reveals how intrinsic motivation shapes embedding structures in Elastic Decision Transformers, showing it acts as a representational prior that creates environment-specific organizational patterns for better decision-making.

DetailsMotivation: While intrinsic motivation improves EDT performance in exploration tasks, the underlying representational mechanisms remain unexplored. The paper aims to understand how intrinsic motivation shapes learned embeddings in EDTs.

Method: Introduced a post-hoc explainability framework analyzing embedding properties through statistical analysis of covariance structure, vector magnitudes, and orthogonality across different intrinsic motivation variants.

Result: Different intrinsic motivation variants create fundamentally different representational structures. Environment-specific correlation patterns between embedding metrics and performance explain why intrinsic motivation improves policy learning.

Conclusion: Intrinsic motivation operates beyond simple exploration bonuses, acting as a representational prior that shapes embedding geometry in biologically plausible ways, creating environment-specific organizational structures that facilitate better decision-making.

Abstract: Elastic Decision Transformers (EDTs) have proved to be particularly successful in offline reinforcement learning, offering a flexible framework that unifies sequence modeling with decision-making under uncertainty. Recent research has shown that incorporating intrinsic motivation mechanisms into EDTs improves performance across exploration tasks, yet the representational mechanisms underlying these improvements remain unexplored. In this paper, we introduce a systematic post-hoc explainability framework to analyze how intrinsic motivation shapes learned embeddings in EDTs. Through statistical analysis of embedding properties (including covariance structure, vector magnitudes, and orthogonality), we reveal that different intrinsic motivation variants create fundamentally different representational structures. Our analysis demonstrates environment-specific correlation patterns between embedding metrics and performance that explain why intrinsic motivation improves policy learning. These findings show that intrinsic motivation operates beyond simple exploration bonuses, acting as a representational prior that shapes embedding geometry in biologically plausible ways, creating environment-specific organizational structures that facilitate better decision-making.

[1120] Appa: Bending Weather Dynamics with Latent Diffusion Models for Global Data Assimilation

Gérôme Andry, Sacha Lewin, François Rozet, Omer Rochman, Victor Mangeleer, Matthias Pirlet, Elise Faulx, Marilaure Grégoire, Gilles Louppe

Main category: cs.LG

TL;DR: Appa is a score-based data assimilation model that generates global atmospheric trajectories at high resolution using a latent diffusion model, handling reanalysis, filtering, and forecasting within a single framework.

DetailsMotivation: Accurate weather forecasting requires identifying the current atmospheric state from observational data, which deep learning can advance through improved data assimilation methods.

Method: Uses a 565M-parameter latent diffusion model trained on ERA5 data, capable of being conditioned on arbitrary observations to infer plausible trajectories without retraining.

Result: The model produces physically consistent reconstructions from various inputs at 0.25° resolution and 1-hour intervals.

Conclusion: Latent score-based data assimilation shows promise as a foundation for future global atmospheric modeling systems.

Abstract: Deep learning has advanced weather forecasting, but accurate predictions first require identifying the current state of the atmosphere from observational data. In this work, we introduce Appa, a score-based data assimilation model generating global atmospheric trajectories at 0.25\si{\degree} resolution and 1-hour intervals. Powered by a 565M-parameter latent diffusion model trained on ERA5, Appa can be conditioned on arbitrary observations to infer plausible trajectories, without retraining. Our probabilistic framework handles reanalysis, filtering, and forecasting, within a single model, producing physically consistent reconstructions from various inputs. Results establish latent score-based data assimilation as a promising foundation for future global atmospheric modeling systems.

[1121] DPL: Decoupled Prototype Learning for Enhancing Robustness of Vision-Language Transformers to Missing Modalities

Jueqing Lu, Yuanyuan Qi, Xiaohao Yang, Shuaicheng Niu, Fucai Ke, Shujie Zhou, Wei Tan, Jionghao Lin, Wray Buntine, Hamid Rezatofighi, Lan Du

Main category: cs.LG

TL;DR: DPL introduces a decoupled prototype learning method that adaptively adjusts decision processes based on which modalities are present, outperforming existing methods on multimodal datasets with missing modalities.

DetailsMotivation: Current vision-language transformers suffer significant performance drops when input modalities are missing, and existing methods still use the same prediction heads regardless of missing modalities.

Method: DPL uses modality-specific prototypes for each class that are decomposed into image-specific and text-specific components, allowing adaptive decision making based on available modalities.

Result: Extensive experiments on MM-IMDb, UPMC Food-101, and Hateful Memes show DPL outperforms state-of-the-art approaches across all datasets and various missing modality cases.

Conclusion: DPL’s adaptive prototype design effectively handles missing modalities while remaining compatible with existing prompt-based frameworks, demonstrating superior performance over current methods.

Abstract: The performance of Visio-Language Transformers drops sharply when an input modality (e.g., image) is missing, because the model is forced to make predictions using incomplete information. Existing missing-aware prompt methods help reduce this degradation, but they still rely on conventional prediction heads (e.g., a Fully-Connected layer) that compute class scores in the same way regardless of which modality is present or absent. We introduce Decoupled Prototype Learning (DPL), a new prediction head architecture that explicitly adjusts its decision process to the observed input modalities. For each class, DPL selects a set of prototypes specific to the current missing-modality cases (image-missing, text-missing, or mixed-missing). Each prototype is then decomposed into image-specific and text-specific components, enabling the head to make decisions that depend on the information actually present. This adaptive design allows DPL to handle inputs with missing modalities more effectively while remaining fully compatible with existing prompt-based frameworks. Extensive experiments on MM-IMDb, UPMC Food-101, and Hateful Memes demonstrate that DPL outperforms state-of-the-art approaches across all widely used multimodal imag-text datasets and various missing cases.

[1122] Rethinking Irregular Time Series Forecasting: A Simple yet Effective Baseline

Xvyuan Liu, Xiangfei Qiu, Xingjian Wu, Zhengyu Li, Chenjuan Guo, Jilin Hu, Bin Yang

Main category: cs.LG

TL;DR: APN is a novel framework for irregular multivariate time series forecasting that uses adaptive patching and time-aware aggregation to transform irregular data into regular representations, achieving state-of-the-art performance with high efficiency.

DetailsMotivation: Irregular multivariate time series forecasting is crucial in domains like healthcare and climate science, but existing methods struggle with data irregularity and are often computationally intensive.

Method: APN framework with Time-Aware Patch Aggregation (TAPA) module that learns dynamic patch boundaries and uses time-aware weighted averaging to regularize irregular sequences, combined with a simple query module and shallow MLP for prediction.

Result: Experimental results show APN outperforms state-of-the-art methods in both efficiency and accuracy across multiple real-world datasets.

Conclusion: APN provides an effective and efficient solution for irregular multivariate time series forecasting by transforming irregular data into regular representations through adaptive patching.

Abstract: The forecasting of irregular multivariate time series (IMTS) is crucial in key areas such as healthcare, biomechanics, climate science, and astronomy. However, achieving accurate and practical predictions is challenging due to two main factors. First, the inherent irregularity and data missingness in irregular time series make modeling difficult. Second, most existing methods are typically complex and resource-intensive. In this study, we propose a general framework called APN to address these challenges. Specifically, we design a novel Time-Aware Patch Aggregation (TAPA) module that achieves adaptive patching. By learning dynamically adjustable patch boundaries and a time-aware weighted averaging strategy, TAPA transforms the original irregular sequences into high-quality, regularized representations in a channel-independent manner. Additionally, we use a simple query module to effectively integrate historical information while maintaining the model’s efficiency. Finally, predictions are made by a shallow MLP. Experimental results on multiple real-world datasets show that APN outperforms existing state-of-the-art methods in both efficiency and accuracy.

[1123] Meta-Learning an In-Context Transformer Model of Human Higher Visual Cortex

Muquan Yu, Mu Nan, Hossein Adeli, Jacob S. Prince, John A. Pyles, Leila Wehbe, Margaret M. Henderson, Michael J. Tarr, Andrew F. Luo

Main category: cs.LG

TL;DR: BraInCoRL uses in-context learning with transformers to predict neural responses from few-shot examples, outperforming existing methods in low-data regimes and generalizing across subjects and datasets without finetuning.

DetailsMotivation: Current methods for modeling visual cortex rely on expensive individual-level fMRI data, limiting generalizability to new subjects and stimuli. There's a need for more efficient approaches that can work with limited data.

Method: Transformer architecture that flexibly conditions on variable numbers of in-context image stimuli, jointly learning from image features and voxel activations. Explicitly optimized for in-context learning across multiple subjects.

Result: Consistently outperforms existing voxelwise encoders in low-data regimes, shows strong test-time scaling, generalizes to new fMRI datasets with different subjects and acquisition parameters, and enables interpretable mappings from natural language to voxel selectivity.

Conclusion: BraInCoRL provides an effective framework for few-shot neural response prediction that eliminates the need for expensive individual data collection while improving performance and interpretability in modeling higher visual cortex.

Abstract: Understanding functional representations within higher visual cortex is a fundamental question in computational neuroscience. While artificial neural networks pretrained on large-scale datasets exhibit striking representational alignment with human neural responses, learning image-computable models of visual cortex relies on individual-level, large-scale fMRI datasets. The necessity for expensive, time-intensive, and often impractical data acquisition limits the generalizability of encoders to new subjects and stimuli. BraInCoRL uses in-context learning to predict voxelwise neural responses from few-shot examples without any additional finetuning for novel subjects and stimuli. We leverage a transformer architecture that can flexibly condition on a variable number of in-context image stimuli, learning an inductive bias over multiple subjects. During training, we explicitly optimize the model for in-context learning. By jointly conditioning on image features and voxel activations, our model learns to directly generate better performing voxelwise models of higher visual cortex. We demonstrate that BraInCoRL consistently outperforms existing voxelwise encoder designs in a low-data regime when evaluated on entirely novel images, while also exhibiting strong test-time scaling behavior. The model also generalizes to an entirely new visual fMRI dataset, which uses different subjects and fMRI data acquisition parameters. Further, BraInCoRL facilitates better interpretability of neural signals in higher visual cortex by attending to semantically relevant stimuli. Finally, we show that our framework enables interpretable mappings from natural language queries to voxel selectivity.

[1124] Towards Identifiability of Interventional Stochastic Differential Equations

Aaron Zweig, Zaikang Lin, Elham Azizi, David Knowles

Main category: cs.LG

TL;DR: This paper studies identifiability of SDE parameters under multiple interventions, providing first provable bounds for unique parameter recovery from stationary distributions.

DetailsMotivation: To establish theoretical foundations for recovering SDE parameters from observational data under interventions, addressing identifiability challenges in stochastic dynamical systems.

Method: Developed theoretical bounds on intervention requirements for linear SDEs and upper bounds for nonlinear SDEs in small noise regime; validated with synthetic data experiments and applied to gene regulatory dynamics.

Result: Achieved tight bounds on necessary interventions for linear SDEs and upper bounds for nonlinear cases; successfully recovered true parameters in synthetic experiments and demonstrated advantages of learnable activation functions.

Conclusion: The work provides the first provable identifiability bounds for SDEs under interventions, with practical applications in gene regulatory network modeling using parameterizations with learnable activation functions.

Abstract: We study identifiability of stochastic differential equations (SDE) under multiple interventions. Our results give the first provable bounds for unique recovery of SDE parameters given samples from their stationary distributions. We give tight bounds on the number of necessary interventions for linear SDEs, and upper bounds for nonlinear SDEs in the small noise regime. We experimentally validate the recovery of true parameters in synthetic data, and motivated by our theoretical results, demonstrate the advantage of parameterizations with learnable activation functions in application to gene regulatory dynamics.

[1125] The Third Pillar of Causal Analysis? A Measurement Perspective on Causal Representations

Dingling Yao, Shimeng Huang, Riccardo Cadei, Kun Zhang, Francesco Locatello

Main category: cs.LG

TL;DR: The paper proposes a measurement model framework for causal representation learning, introducing T-MEX score to evaluate learned representations for causal downstream tasks.

DetailsMotivation: Current causal representation learning methods lack clear understanding of what makes representations useful for causal tasks and how to properly evaluate them, especially with complex real-world data.

Method: Reinterpret causal representation learning using a measurement model framework where learned representations are proxy measurements of latent causal variables, and develop T-MEX score for quantitative assessment.

Result: T-MEX score effectively assesses representation quality across diverse scenarios including numerical simulations and real-world ecological video analysis, demonstrating usefulness for causal downstream tasks.

Conclusion: The proposed framework provides principled conditions for when learned representations support causal reasoning and offers a practical evaluation metric for causal representation learning.

Abstract: Causal reasoning and discovery, two fundamental tasks of causal analysis, often face challenges in applications due to the complexity, noisiness, and high-dimensionality of real-world data. Despite recent progress in identifying latent causal structures using causal representation learning (CRL), what makes learned representations useful for causal downstream tasks and how to evaluate them are still not well understood. In this paper, we reinterpret CRL using a measurement model framework, where the learned representations are viewed as proxy measurements of the latent causal variables. Our approach clarifies the conditions under which learned representations support downstream causal reasoning and provides a principled basis for quantitatively assessing the quality of representations using a new Test-based Measurement EXclusivity (T-MEX) score. We validate T-MEX across diverse causal inference scenarios, including numerical simulations and real-world ecological video analysis, demonstrating that the proposed framework and corresponding score effectively assess the identification of learned representations and their usefulness for causal downstream tasks.

[1126] NeuralOM: Neural Ocean Model for Subseasonal-to-Seasonal Simulation

Yuan Gao, Hao Wu, Fan Xu, Yanfei Xiang, Ruijian Gou, Ruiqi Shu, Qingsong Wen, Xian Wu, Kun Wang, Xiaomeng Huang

Main category: cs.LG

TL;DR: NeuralOM is a neural operator framework for simulating slow-changing physical systems like oceans and climate, using progressive residual correction and physics-guided graph networks to prevent error accumulation and maintain physical consistency.

DetailsMotivation: Traditional autoregressive ML models fail for slow-changing physical systems due to error accumulation leading to forecast degradation, requiring a more stable and physically consistent approach.

Method: Two key innovations: Progressive Residual Correction Framework for fine-grained refinement steps, and Physics-Guided Graph Network with adaptive messaging to model multi-scale physical interactions like gradient-driven flows.

Result: NeuralOM outperforms state-of-the-art models in global Subseasonal-to-Seasonal ocean simulation, achieving 13.3% lower RMSE at 60-day lead time and excelling in simulating extreme events.

Conclusion: NeuralOM provides a stable, efficient, and physically-aware paradigm for data-driven scientific computing of slow-changing physical systems.

Abstract: Long-term, high-fidelity simulation of slow-changing physical systems, such as the ocean and climate, presents a fundamental challenge in scientific computing. Traditional autoregressive machine learning models often fail in these tasks as minor errors accumulate and lead to rapid forecast degradation. To address this problem, we propose NeuralOM, a general neural operator framework designed for simulating complex, slow-changing dynamics. NeuralOM’s core consists of two key innovations: (1) a Progressive Residual Correction Framework that decomposes the forecasting task into a series of fine-grained refinement steps, effectively suppressing long-term error accumulation; and (2) a Physics-Guided Graph Network whose built-in adaptive messaging mechanism explicitly models multi-scale physical interactions, such as gradient-driven flows and multiplicative couplings, thereby enhancing physical consistency while maintaining computational efficiency. We validate NeuralOM on the challenging task of global Subseasonal-to-Seasonal (S2S) ocean simulation. Extensive experiments demonstrate that NeuralOM not only surpasses state-of-the-art models in forecast accuracy and long-term stability, but also excels in simulating extreme events. For instance, at a 60-day lead time, NeuralOM achieves a 13.3% lower RMSE compared to the best-performing baseline, offering a stable, efficient, and physically-aware paradigm for data-driven scientific computing. Code link: https://github.com/YuanGao-YG/NeuralOM.

[1127] Associative Memory and Generative Diffusion in the Zero-noise Limit

Joshua Hess, Quaid Morris

Main category: cs.LG

TL;DR: This paper establishes that generative diffusion processes converge to associative memory systems at low noise levels, revealing universal properties of memory and generation dynamics through Morse-Smale dynamical systems theory.

DetailsMotivation: To understand the fundamental relationship between generative diffusion models and associative memory systems, and characterize their stability, memorization, and generation dynamics across different model formulations.

Method: Theoretical analysis using Morse-Smale dynamical systems as universal approximators of associative memory models, with diffusion processes as their white-noise perturbations. Framework applies to energy-based models, denoising diffusion models, and Hopfield networks.

Result: Shows that associative memory models exhibit a generic transition from generation to memory as noise diminishes. Morse-Smale flows provide structural stability, with bifurcation theory governing transitions between stable systems. Derived structural stability criteria for Hopfield networks.

Conclusion: The geometric framework provides unified insight into classification, stability, and emergence of memory and generative landscapes across various model formulations, revealing robust bifurcation sequences that govern system transitions.

Abstract: This paper shows that generative diffusion processes converge to associative memory systems at vanishing noise levels and characterizes the stability, robustness, memorization, and generation dynamics of both model classes. Morse-Smale dynamical systems are shown to be universal approximators of associative memory models, with diffusion processes as their white-noise perturbations. The universal properties of associative memory that follow are used to characterize a generic transition from generation to memory as noise diminishes. Structural stability of Morse-Smale flows – that is, the robustness of their global critical point structure – implies the stability of both trajectories and invariant measures for diffusions in the zero-noise limit. The learning and generation landscapes of these models appear as parameterized families of gradient flows and their stochastic perturbations, and the bifurcation theory for Morse-Smale systems implies that they are generically stable except at isolated parameter values, where enumerable sets of local and global bifurcations govern transitions between stable systems in parameter space. These landscapes are thus characterized by ordered bifurcation sequences that create, destroy, or alter connections between rest points and are robust under small stochastic or deterministic perturbations. The framework is agnostic to model formulation, which we verify with examples from energy-based models, denoising diffusion models, and classical and modern Hopfield networks. We additionally derive structural stability criteria for Hopfield-type networks and find that simple cases violate them. Collectively, our geometric approach provides insight into the classification, stability, and emergence of memory and generative landscapes.

[1128] Machine Unlearning of Traffic State Estimation and Prediction

Xin Wang, R. Tyrrell Rockafellar, Xuegang, Ban

Main category: cs.LG

TL;DR: This paper introduces Machine Unlearning TSEP, a new paradigm that enables traffic state estimation and prediction models to selectively forget sensitive, poisoned, or outdated data to address privacy and data freshness concerns.

DetailsMotivation: Data-driven TSEP relies on sensitive data, raising privacy, cybersecurity, and data freshness concerns that can erode public trust. Regulations like the 'right to be forgotten' require models to forget private data upon request.

Method: The study proposes a novel learning paradigm called Machine Unlearning TSEP that enables trained TSEP models to selectively forget specific data.

Result: The approach allows models to effectively unlearn privacy-sensitive, poisoned, or outdated data.

Conclusion: Machine unlearning enhances the trustworthiness and reliability of data-driven traffic state estimation and prediction systems by addressing privacy concerns and enabling compliance with data removal regulations.

Abstract: Data-driven traffic state estimation and prediction (TSEP) relies heavily on data sources that contain sensitive information. While the abundance of data has fueled significant breakthroughs, particularly in machine learning-based methods, it also raises concerns regarding privacy, cybersecurity, and data freshness. These issues can erode public trust in intelligent transportation systems. Recently, regulations have introduced the “right to be forgotten”, allowing users to request the removal of their private data from models. As machine learning models can remember old data, simply removing it from back-end databases is insufficient in such systems. To address these challenges, this study introduces a novel learning paradigm for TSEP-Machine Unlearning TSEP-which enables a trained TSEP model to selectively forget privacy-sensitive, poisoned, or outdated data. By empowering models to “unlearn,” we aim to enhance the trustworthiness and reliability of data-driven traffic TSEP.

[1129] Private Evolution Converges

Tomás González, Giulia Fanti, Aaditya Ramdas

Main category: cs.LG

TL;DR: PE is a training-free DP synthetic data generation method with inconsistent performance across domains. This work provides a new theoretical framework proving PE’s convergence under realistic conditions, achieving expected 1-Wasserstein distance of O~(d(nε)^{-1/d}) for d-dimensional datasets.

DetailsMotivation: PE shows strong performance in some domains but inconsistent behavior in others, and existing theoretical analysis relies on unrealistic assumptions about algorithm behavior and dataset structure.

Method: Developed a new theoretical framework to understand PE’s practical behavior, identified sufficient conditions for convergence, and analyzed PE under the Gaussian variation API with proper hyperparameter settings.

Result: Proved that PE produces (ε,δ)-DP synthetic dataset with expected 1-Wasserstein distance O~(d(nε)^{-1/d}) from original data for d-dimensional datasets from convex compact domains, establishing worst-case convergence as n→∞.

Conclusion: The theoretical analysis extends to general Banach spaces, connects PE to Private Signed Measure Mechanism, and demonstrates practical relevance through experiments.

Abstract: Private Evolution (PE) is a promising training-free method for differentially private (DP) synthetic data generation. While it achieves strong performance in some domains (e.g., images and text), its behavior in others (e.g., tabular data) is less consistent. To date, the only theoretical analysis of the convergence of PE depends on unrealistic assumptions about both the algorithm’s behavior and the structure of the sensitive dataset. In this work, we develop a new theoretical framework to understand PE’s practical behavior and identify sufficient conditions for its convergence. For $d$-dimensional sensitive datasets with $n$ data points from a convex and compact domain, we prove that under the right hyperparameter settings and given access to the Gaussian variation API proposed in \cite{PE23}, PE produces an $(\varepsilon, δ)$-DP synthetic dataset with expected 1-Wasserstein distance $\tilde{O}(d(n\varepsilon)^{-1/d})$ from the original; this establishes worst-case convergence of the algorithm as $n \to \infty$. Our analysis extends to general Banach spaces as well. We also connect PE to the Private Signed Measure Mechanism, a method for DP synthetic data generation that has thus far not seen much practical adoption. We demonstrate the practical relevance of our theoretical findings in experiments.

[1130] PERTINENCE: Input-based Opportunistic Neural Network Dynamic Execution

Omkar Shende, Gayathri Ananthanarayanan, Marcello Traiola

Main category: cs.LG

TL;DR: PERTINENCE is a novel online method that dynamically selects the most suitable pre-trained model for each input based on complexity analysis, achieving comparable accuracy with up to 36% fewer operations using genetic algorithm optimization.

DetailsMotivation: Large DNN models are resource-intensive but necessary only for challenging inputs, while lighter models can handle simple ones. There's a need to reduce reliance on large models without significant accuracy degradation.

Method: Uses genetic algorithm to train an ML-based input dispatcher that analyzes input complexity and dynamically selects the most suitable model from a pre-trained set, converging towards Pareto-optimal solutions balancing accuracy and efficiency.

Result: Tested on CNNs (CIFAR-10, CIFAR-100) and ViTs (TinyImageNet), PERTINENCE achieves better or comparable accuracy with up to 36% fewer operations compared to state-of-the-art models.

Conclusion: Dynamic model selection based on input complexity enables significant computational efficiency gains while maintaining accuracy, providing better trade-offs between accuracy and operations than existing approaches.

Abstract: Deep neural networks (DNNs) have become ubiquitous thanks to their remarkable ability to model complex patterns across various domains such as computer vision, speech recognition, robotics, etc. While large DNN models are often more accurate than simpler, lightweight models, they are also resource- and energy-hungry. Hence, it is imperative to design methods to reduce reliance on such large models without significant degradation in output accuracy. The high computational cost of these models is often necessary only for a reduced set of challenging inputs, while lighter models can handle most simple ones. Thus, carefully combining properties of existing DNN models in a dynamic, input-based way opens opportunities to improve efficiency without impacting accuracy. In this work, we introduce PERTINENCE, a novel online method designed to analyze the complexity of input features and dynamically select the most suitable model from a pre-trained set to process a given input effectively. To achieve this, we employ a genetic algorithm to explore the training space of an ML-based input dispatcher, enabling convergence towards the Pareto front in the solution space that balances overall accuracy and computational efficiency. We showcase our approach on state-of-the-art Convolutional Neural Networks (CNNs) trained on the CIFAR-10 and CIFAR-100, as well as Vision Transformers (ViTs) trained on TinyImageNet dataset. We report results showing PERTINENCE’s ability to provide alternative solutions to existing state-of-the-art models in terms of trade-offs between accuracy and number of operations. By opportunistically selecting among models trained for the same task, PERTINENCE achieves better or comparable accuracy with up to 36% fewer operations.

[1131] Global Variational Inference Enhanced Robust Domain Adaptation

Lingkun Luo, Shiqiang Hu, Liming Chen

Main category: cs.LG

TL;DR: GVI-DA is a domain adaptation framework that uses variational inference to learn global priors for structure-aware cross-domain alignment, achieving state-of-the-art performance on multiple benchmarks.

DetailsMotivation: Existing deep learning domain adaptation methods rely on mini-batch training which limits global distribution modeling, causing unstable alignment and suboptimal generalization.

Method: Uses variational inference to learn continuous, class-conditional global priors, with latent feature reconstruction and global codebook learning with randomized sampling to prevent posterior collapse. Also discards low-confidence pseudo-labels and generates reliable target-domain samples.

Result: Achieves consistent state-of-the-art performance on four benchmarks and thirty-eight domain adaptation tasks. The model’s ELBO is derived and effects of prior continuity, codebook size, and pseudo-label noise tolerance are analyzed.

Conclusion: GVI-DA demonstrates theoretical soundness and practical advantages over diffusion-based generative frameworks, providing robust domain adaptation through global distribution modeling.

Abstract: Deep learning-based domain adaptation (DA) methods have shown strong performance by learning transferable representations. However, their reliance on mini-batch training limits global distribution modeling, leading to unstable alignment and suboptimal generalization. We propose Global Variational Inference Enhanced Domain Adaptation (GVI-DA), a framework that learns continuous, class-conditional global priors via variational inference to enable structure-aware cross-domain alignment. GVI-DA minimizes domain gaps through latent feature reconstruction, and mitigates posterior collapse using global codebook learning with randomized sampling. It further improves robustness by discarding low-confidence pseudo-labels and generating reliable target-domain samples. Extensive experiments on four benchmarks and thirty-eight DA tasks demonstrate consistent state-of-the-art performance. We also derive the model’s evidence lower bound (ELBO) and analyze the effects of prior continuity, codebook size, and pseudo-label noise tolerance. In addition, we compare GVI-DA with diffusion-based generative frameworks in terms of optimization principles and efficiency, highlighting both its theoretical soundness and practical advantages.

[1132] From Small to Large: A Graph Convolutional Network Approach for Solving Assortment Optimization Problems

Guokai Li, Pin Gao, Stefanus Jasin, Zizhuo Wang

Main category: cs.LG

TL;DR: A graph convolutional network framework for solving constrained assortment optimization problems that generalizes from small to large instances and outperforms existing heuristics.

DetailsMotivation: Assortment optimization is NP-hard and computationally intensive, especially for e-commerce platforms that need to solve thousands of such problems per minute efficiently.

Method: Constructs graph representation of the problem, trains GCN to learn mapping from problem parameters to optimal assortments, and develops three inference policies based on GCN output.

Result: GCN trained on 20-product instances achieves over 85% of optimal revenue on problems with up to 2,000 products within seconds, outperforming existing heuristics in accuracy and efficiency.

Conclusion: The GCN framework effectively solves large-scale assortment optimization problems and extends to settings with unknown choice models using transaction data with similar performance.

Abstract: Assortment optimization seeks to select a subset of substitutable products, subject to constraints, to maximize expected revenue. The problem is NP-hard due to its combinatorial and nonlinear nature and arises frequently in industries such as e-commerce, where platforms must solve thousands of such problems each minute. We propose a graph convolutional network (GCN) framework to efficiently solve constrained assortment optimization problems. Our approach constructs a graph representation of the problem, trains a GCN to learn the mapping from problem parameters to optimal assortments, and develops three inference policies based on the GCN’s output. Owing to the GCN’s ability to generalize across instance sizes, patterns learned from small-scale samples can be transferred to large-scale problems. Numerical experiments show that a GCN trained on instances with 20 products achieves over 85% of the optimal revenue on problems with up to 2,000 products within seconds, outperforming existing heuristics in both accuracy and efficiency. We further extend the framework to settings with an unknown choice model using transaction data and demonstrate similar performance and scalability.

[1133] Robust-Multi-Task Gradient Boosting

Seyedsaman Emami, Gonzalo Martínez-Muñoz, Daniel Hernández-Lobato

Main category: cs.LG

TL;DR: R-MTGB is a robust multi-task gradient boosting framework that handles outlier tasks by learning shared patterns, detecting outliers, and fine-tuning task-specific predictors.

DetailsMotivation: Real-world multi-task learning often involves outlier tasks that don't share beneficial similarities and can deteriorate overall model performance, requiring robust handling of task heterogeneity.

Method: Three sequential blocks: (1) learn shared patterns, (2) partition tasks into outliers/non-outliers with regularization, (3) fine-tune task-specific predictors within gradient boosting framework.

Result: Successfully isolates outliers, transfers knowledge among related tasks, reduces prediction errors for each task individually, and achieves overall performance gains across all tasks.

Conclusion: R-MTGB demonstrates robustness, adaptability, and reliable convergence in challenging multi-task learning environments with outlier tasks.

Abstract: Multi-task learning (MTL) has shown effectiveness in exploiting shared information across tasks to improve generalization. MTL assumes tasks share similarities that can improve performance. In addition, boosting algorithms have demonstrated exceptional performance across diverse learning problems, primarily due to their ability to focus on hard-to-learn instances and iteratively reduce residual errors. This makes them a promising approach for learning multi-task problems. However, real-world MTL scenarios often involve tasks that are not well-aligned (known as outlier or adversarial tasks), which do not share beneficial similarities with others and can, in fact, deteriorate the performance of the overall model. To overcome this challenge, we propose Robust-Multi-Task Gradient Boosting (R-MTGB), a novel boosting framework that explicitly models and adapts to task heterogeneity during training. R-MTGB structures the learning process into three sequential blocks: (1) learning shared patterns, (2) partitioning tasks into outliers and non-outliers with regularized parameters, and (3) fine-tuning task-specific predictors. This architecture enables R-MTGB to automatically detect and penalize outlier tasks while promoting effective knowledge transfer among related tasks. Our method integrates these mechanisms seamlessly within gradient boosting, allowing robust handling of noisy or adversarial tasks without sacrificing accuracy. Extensive experiments on both synthetic benchmarks and real-world datasets demonstrate that our approach successfully isolates outliers, transfers knowledge, and consistently reduces prediction errors for each task individually, and achieves overall performance gains across all tasks. These results highlight robustness, adaptability, and reliable convergence of R-MTGB in challenging MTL environments.

[1134] State of Health Estimation of Batteries Using a Time-Informed Dynamic Sequence-Inverted Transformer

Janak M. Patel, Milad Ramezankhani, Anirudh Deodhar, Dagnachew Birru

Main category: cs.LG

TL;DR: Proposes TIDSIT, a transformer-based model with continuous time embeddings to handle irregular battery discharge data, achieving 50% error reduction and <0.58% SoH prediction error.

DetailsMotivation: Battery health monitoring is critical for safety and efficiency in electric vehicles and energy storage, but existing ML models struggle with irregular real-world discharge data patterns.

Method: Time-Informed Dynamic Sequence Inverted Transformer (TIDSIT) with continuous time embeddings and temporal attention mechanisms to process variable-length, irregularly sampled discharge sequences without information loss.

Result: Outperforms existing models on NASA dataset with >50% prediction error reduction and SoH error below 0.58%.

Conclusion: TIDSIT effectively handles irregular time-series data for battery health monitoring and shows promise for broader health monitoring applications.

Abstract: The rapid adoption of battery-powered vehicles and energy storage systems over the past decade has made battery health monitoring increasingly critical. Batteries play a central role in the efficiency and safety of these systems, yet they inevitably degrade over time due to repeated charge-discharge cycles. This degradation leads to reduced energy efficiency and potential overheating, posing significant safety concerns. Accurate estimation of a State of Health (SoH) of battery is therefore essential for ensuring operational reliability and safety. Several machine learning architectures, such as LSTMs, transformers, and encoder-based models, have been proposed to estimate SoH from discharge cycle data. However, these models struggle with the irregularities inherent in real-world measurements: discharge readings are often recorded at non-uniform intervals, and the lengths of discharge cycles vary significantly. To address this, most existing approaches extract features from the sequences rather than processing them in full, which introduces information loss and compromises accuracy. To overcome these challenges, we propose a novel architecture: Time-Informed Dynamic Sequence Inverted Transformer (TIDSIT). TIDSIT incorporates continuous time embeddings to effectively represent irregularly sampled data and utilizes padded sequences with temporal attention mechanisms to manage variable-length inputs without discarding sequence information. Experimental results on the NASA battery degradation dataset show that TIDSIT significantly outperforms existing models, achieving over 50% reduction in prediction error and maintaining an SoH prediction error below 0.58%. Furthermore, the architecture is generalizable and holds promise for broader applications in health monitoring tasks involving irregular time-series data.

[1135] What One Cannot, Two Can: Two-Layer Transformers Provably Represent Induction Heads on Any-Order Markov Chains

Chanakya Ekbote, Marco Bondaschi, Nived Rajaraman, Jason D. Lee, Michael Gastpar, Ashok Vardhan Makkuva, Paul Pu Liang

Main category: cs.LG

TL;DR: Two-layer transformers with one attention head per layer can represent any kth-order Markov process for in-context learning, providing the tightest known characterization of transformer depth vs Markov order.

DetailsMotivation: Previous work showed that deeper transformers (3+ layers) are needed for higher-order Markov processes, leaving open whether two-layer transformers with single heads can represent any kth-order Markov process.

Method: Theoretical analysis showing that a two-layer transformer with one head per layer can represent any conditional k-gram, plus analysis of learning dynamics for a simplified first-order Markov chain variant.

Result: Proved that two-layer single-head transformers can indeed represent any kth-order Markov process, establishing the minimal depth requirement for such representations.

Conclusion: Even shallow transformer architectures can exhibit strong in-context learning capabilities on structured sequence modeling tasks, deepening understanding of transformer-based ICL.

Abstract: In-context learning (ICL) is a hallmark capability of transformers, through which trained models learn to adapt to new tasks by leveraging information from the input context. Prior work has shown that ICL emerges in transformers due to the presence of special circuits called induction heads. Given the equivalence between induction heads and conditional k-grams, a recent line of work modeling sequential inputs as Markov processes has revealed the fundamental impact of model depth on its ICL capabilities: while a two-layer transformer can efficiently represent a conditional 1-gram model, its single-layer counterpart cannot solve the task unless it is exponentially large. However, for higher order Markov sources, the best known constructions require at least three layers (each with a single attention head) - leaving open the question: can a two-layer single-head transformer represent any kth-order Markov process? In this paper, we precisely address this and theoretically show that a two-layer transformer with one head per layer can indeed represent any conditional k-gram. Thus, our result provides the tightest known characterization of the interplay between transformer depth and Markov order for ICL. Building on this, we further analyze the learning dynamics of our two-layer construction, focusing on a simplified variant for first-order Markov chains, illustrating how effective in-context representations emerge during training. Together, these results deepen our current understanding of transformer-based ICL and illustrate how even shallow architectures can surprisingly exhibit strong ICL capabilities on structured sequence modeling tasks.

[1136] Streaming Generated Gaussian Process Experts for Online Learning and Control: Extended Version

Zewen Yang, Dongfa Zhang, Xiaobing Dai, Fengyi Yu, Chi Zhang, Bingkun Huang, Hamid Sadeghian, Sami Haddadin

Main category: cs.LG

TL;DR: SkyGP is a streaming Gaussian Process framework that addresses computational and memory constraints of exact GPs through a bounded set of experts, with variants for accuracy (SkyGP-Dense) or efficiency (SkyGP-Fast).

DetailsMotivation: Exact GPs have cubic computation and quadratic memory complexity for streaming data, limiting scalability in real-time safety-critical systems that require rapid adaptation.

Method: Proposes a streaming kernel-induced progressively generated expert framework (SkyGP) that maintains a bounded set of experts while inheriting exact GP learning guarantees.

Result: SkyGP shows superior performance in benchmarks and real-time control experiments compared to state-of-the-art approaches, with variants balancing accuracy vs efficiency.

Conclusion: SkyGP effectively addresses computational and memory limitations of exact GPs for streaming data while maintaining performance guarantees, making it suitable for real-time applications.

Abstract: Gaussian Processes (GPs), as a nonparametric learning method, offer flexible modeling capabilities and calibrated uncertainty quantification for function approximations. Additionally, GPs support online learning by efficiently incorporating new data with polynomial-time computation, making them well-suited for safety-critical dynamical systems that require rapid adaptation. However, the inference and online updates of exact GPs, when processing streaming data, incur cubic computation time and quadratic storage memory complexity, limiting their scalability to large datasets in real-time settings. In this paper, we propose a streaming kernel-induced progressively generated expert framework of Gaussian processes (SkyGP) that addresses both computational and memory constraints by maintaining a bounded set of experts, while inheriting the learning performance guarantees from exact Gaussian processes. Furthermore, two SkyGP variants are introduced, each tailored to a specific objective, either maximizing prediction accuracy (SkyGP-Dense) or improving computational efficiency (SkyGP-Fast). The effectiveness of SkyGP is validated through extensive benchmarks and real-time control experiments demonstrating its superior performance compared to state-of-the-art approaches.

[1137] Conditional Information Bottleneck for Multimodal Fusion: Overcoming Shortcut Learning in Sarcasm Detection

Yihua Wang, Qi Jia, Cong Xu, Feiyu Chen, Yuhan Liu, Haotian Zhang, Liang Jin, Lu Liu, Zhichun Wang

Main category: cs.LG

TL;DR: The paper addresses shortcut learning in multimodal sarcasm detection and proposes a new dataset (MUStARD++^R) and model (MCIB) to improve generalization by focusing on effective modality fusion.

DetailsMotivation: Current multimodal sarcasm detection methods rely on dataset shortcuts rather than extracting genuine sarcasm features, which impairs real-world generalization. Existing modality fusion strategies also have weaknesses for complex emotion recognition.

Method: Constructed MUStARD++^R dataset by removing shortcut signals from MUStARD++, and introduced Multimodal Conditional Information Bottleneck (MCIB) model for efficient multimodal fusion in sarcasm detection.

Result: Experimental results show that MCIB achieves the best performance without relying on shortcut learning, demonstrating improved generalization capabilities.

Conclusion: Focusing on effective modality fusion through the proposed MCIB approach successfully addresses shortcut learning issues in multimodal sarcasm detection and enhances model generalization in real-world scenarios.

Abstract: Multimodal sarcasm detection is a complex task that requires distinguishing subtle complementary signals across modalities while filtering out irrelevant information. Many advanced methods rely on learning shortcuts from datasets rather than extracting intended sarcasm-related features. However, our experiments show that shortcut learning impairs the model’s generalization in real-world scenarios. Furthermore, we reveal the weaknesses of current modality fusion strategies for multimodal sarcasm detection through systematic experiments, highlighting the necessity of focusing on effective modality fusion for complex emotion recognition. To address these challenges, we construct MUStARD++$^{R}$ by removing shortcut signals from MUStARD++. Then, a Multimodal Conditional Information Bottleneck (MCIB) model is introduced to enable efficient multimodal fusion for sarcasm detection. Experimental results show that the MCIB achieves the best performance without relying on shortcut learning.

[1138] STA-GANN: A Valid and Generalizable Spatio-Temporal Kriging Approach

Yujie Li, Zezhi Shao, Chengqing Yu, Tangwen Qian, Zhao Zhang, Yifan Du, Shaoming He, Fei Wang, Yongjun Xu

Main category: cs.LG

TL;DR: STA-GANN is a novel GNN-based spatio-temporal kriging framework that improves pattern validity and generalization through decoupled phase adjustment, dynamic graph modeling, and adversarial transfer learning.

DetailsMotivation: Current models struggle with ensuring valid and generalizable spatio-temporal patterns, particularly in capturing dynamic spatial dependencies, temporal shifts, and optimizing generalizability for unknown sensors in incomplete data scenarios.

Method: STA-GANN integrates three key components: (1) Decoupled Phase Module for timestamp shift adjustment, (2) Dynamic Data-Driven Metadata Graph Modeling for updating spatial relationships using temporal data and metadata, and (3) adversarial transfer learning strategy for generalizability.

Result: Extensive validation across nine datasets from four fields demonstrates superior performance, with theoretical evidence supporting the framework’s effectiveness.

Conclusion: STA-GANN successfully addresses limitations in spatio-temporal kriging by improving pattern validity and generalization through its integrated approach of phase adjustment, dynamic graph modeling, and adversarial learning.

Abstract: Spatio-temporal tasks often encounter incomplete data arising from missing or inaccessible sensors, making spatio-temporal kriging crucial for inferring the completely missing temporal information. However, current models struggle with ensuring the validity and generalizability of inferred spatio-temporal patterns, especially in capturing dynamic spatial dependencies and temporal shifts, and optimizing the generalizability of unknown sensors. To overcome these limitations, we propose Spatio-Temporal Aware Graph Adversarial Neural Network (STA-GANN), a novel GNN-based kriging framework that improves spatio-temporal pattern validity and generalization. STA-GANN integrates (i) Decoupled Phase Module that senses and adjusts for timestamp shifts. (ii) Dynamic Data-Driven Metadata Graph Modeling to update spatial relationships using temporal data and metadata; (iii) An adversarial transfer learning strategy to ensure generalizability. Extensive validation across nine datasets from four fields and theoretical evidence both demonstrate the superior performance of STA-GANN.

[1139] Few-shot Class-incremental Fault Diagnosis by Preserving Class-Agnostic Knowledge with Dual-Granularity Representations

Zhendong Yang, Jie Wang, Liansong Zong, Xiaorong Liu, Quan Qian, Shiqian Chen

Main category: cs.LG

TL;DR: Proposes DGGN framework with dual-granularity representations for few-shot class-incremental fault diagnosis, addressing catastrophic forgetting and overfitting through fine-grained and coarse-grained feature streams with cross-attention fusion.

DetailsMotivation: FSC-FD is critical for industrial systems but severely suffers from catastrophic forgetting of old knowledge and overfitting on scarce new data when continuously learning new fault classes with few samples.

Method: Dual-granularity representations: fine-grained stream with Multi-Order Interaction Aggregation for class-specific features, coarse-grained stream for class-agnostic knowledge, fused via multi-semantic cross-attention. Includes Boundary-Aware Exemplar Prioritization and decoupled Balanced Random Forest classifier.

Result: Extensive experiments on TEP benchmark and real-world MFF dataset show superior diagnostic performance and stability compared to state-of-the-art FSC-FD approaches.

Conclusion: DGGN effectively addresses catastrophic forgetting and overfitting in few-shot class-incremental fault diagnosis through dual-granularity representation learning and cross-attention guidance.

Abstract: Few-Shot Class-Incremental Fault Diagnosis (FSC-FD), which aims to continuously learn from new fault classes with only a few samples without forgetting old ones, is critical for real-world industrial systems. However, this challenging task severely amplifies the issues of catastrophic forgetting of old knowledge and overfitting on scarce new data. To address these challenges, this paper proposes a novel framework built upon Dual-Granularity Representations, termed the Dual-Granularity Guidance Network (DGGN). Our DGGN explicitly decouples feature learning into two parallel streams: 1) a fine-grained representation stream, which utilizes a novel Multi-Order Interaction Aggregation module to capture discriminative, class-specific features from the limited new samples. 2) a coarse-grained representation stream, designed to model and preserve general, class-agnostic knowledge shared across all fault types. These two representations are dynamically fused by a multi-semantic cross-attention mechanism, where the stable coarse-grained knowledge guides the learning of fine-grained features, preventing overfitting and alleviating feature conflicts. To further mitigate catastrophic forgetting, we design a Boundary-Aware Exemplar Prioritization strategy. Moreover, a decoupled Balanced Random Forest classifier is employed to counter the decision boundary bias caused by data imbalance. Extensive experiments on the TEP benchmark and a real-world MFF dataset demonstrate that our proposed DGGN achieves superior diagnostic performance and stability compared to state-of-the-art FSC-FD approaches. Our code is publicly available at https://github.com/MentaY/DGGN

[1140] PAX-TS: Model-agnostic multi-granular explanations for time series forecasting via localized perturbations

Tim Kreuzer, Jelena Zdravkovic, Panagiotis Papapetrou

Main category: cs.LG

TL;DR: PAX-TS is a model-agnostic post-hoc algorithm for explaining time series forecasting models using localized input perturbations, providing multi-granular explanations and capturing cross-channel correlations in multivariate forecasts.

DetailsMotivation: Modern forecasting models (transformers, LLMs) are opaque and lack explanations, while existing methods like LIME are unsuitable for forecasting contexts, creating a need for specialized time series explainability.

Method: Based on localized input perturbations, PAX-TS generates multi-granular explanations and characterizes cross-channel correlations for multivariate time series through algorithmic procedures tested on 7 algorithms and 10 datasets.

Result: PAX-TS effectively captures model behavior by showing different explanations for high/low-performing algorithms, identifies 6 performance-indicating pattern classes from correlation matrices, and demonstrates cross-channel correlation handling in multivariate cases.

Conclusion: PAX-TS enables detailed illustration of forecasting model mechanisms at different granularities, providing practical explanations that can answer forecast-related questions and reveal model behavior patterns.

Abstract: Time series forecasting has seen considerable improvement during the last years, with transformer models and large language models driving advancements of the state of the art. Modern forecasting models are generally opaque and do not provide explanations for their forecasts, while well-known post-hoc explainability methods like LIME are not suitable for the forecasting context. We propose PAX-TS, a model-agnostic post-hoc algorithm to explain time series forecasting models and their forecasts. Our method is based on localized input perturbations and results in multi-granular explanations. Further, it is able to characterize cross-channel correlations for multivariate time series forecasts. We clearly outline the algorithmic procedure behind PAX-TS, demonstrate it on a benchmark with 7 algorithms and 10 diverse datasets, compare it with two other state-of-the-art explanation algorithms, and present the different explanation types of the method. We found that the explanations of high-performing and low-performing algorithms differ on the same datasets, highlighting that the explanations of PAX-TS effectively capture a model’s behavior. Based on time step correlation matrices resulting from the benchmark, we identify 6 classes of patterns that repeatedly occur across different datasets and algorithms. We found that the patterns are indicators of performance, with noticeable differences in forecasting error between the classes. Lastly, we outline a multivariate example where PAX-TS demonstrates how the forecasting model takes cross-channel correlations into account. With PAX-TS, time series forecasting models’ mechanisms can be illustrated in different levels of detail, and its explanations can be used to answer practical questions on forecasts.

[1141] Neutron Reflectometry by Gradient Descent

Max D. Champneys, Andrew J. Parnell, Philipp Gutfreund, Maximilian W. A. Skoda, . Patrick A. Fairclough, Timothy J. Rogers, Stephanie L. Burg

Main category: cs.LG

TL;DR: A novel gradient-based optimization approach for neutron reflectometry data analysis using automatic differentiation to enable efficient parameter estimation without losing physical intuition.

DetailsMotivation: Traditional neutron reflectometry analysis requires solving inverse problems that are inefficient for complex structures, and while machine learning surrogates exist, they lose physical intuition by replacing governing equations.

Method: Uses automatic differentiation to compute exact gradients of the error function with respect to parameters, enabling gradient descent optimization directly on the forward reflection model.

Result: Demonstrated state-of-the-art performance on oxide quartz films and robust co-fitting for complex organic LED multilayer devices, with open-source differentiable reflectometry kernels provided.

Conclusion: Gradient-based approaches using automatic differentiation offer efficient, physically intuitive optimization for neutron reflectometry that can leverage modern optimization techniques previously unexploited in this field.

Abstract: Neutron reflectometry (NR) is a powerful technique to probe surfaces and interfaces. NR is inherently an indirect measurement technique, access to the physical quantities of interest (layer thickness, scattering length density, roughness), necessitate the solution of an inverse modelling problem, that is inefficient for large amounts of data or complex multiplayer structures (e.g. lithium batteries / electrodes). Recently, surrogate machine learning models have been proposed as an alternative to existing optimisation routines. Although such approaches have been successful, physical intuition is lost when replacing governing equations with fast neural networks. Instead, we propose a novel and efficient approach; to optimise reflectivity data analysis by performing gradient descent on the forward reflection model itself. Herein, automatic differentiation techniques are used to evaluate exact gradients of the error function with respect to the parameters of interest. Access to these quantities enables users of neutron reflectometry to harness a host of powerful modern optimisation and inference techniques that remain thus far unexploited in the context of neutron reflectometry. This paper presents two benchmark case studies; demonstrating state-of-the-art performance on a thick oxide quartz film, and robust co-fitting performance in the high complexity regime of organic LED multilayer devices. Additionally, we provide an open-source library of differentiable reflectometry kernels in the python programming language so that gradient based approaches can readily be applied to other NR datasets.

[1142] A Comparative Benchmark of Federated Learning Strategies for Mortality Prediction on Heterogeneous and Imbalanced Clinical Data

Rodrigo Tertulino

Main category: cs.LG

TL;DR: Comparative benchmark of 5 federated learning strategies for mortality prediction using MIMIC-IV data under non-IID and imbalanced conditions, showing FedProx outperforms others with highest F1-Score.

DetailsMotivation: Address data privacy constraints and statistical heterogeneity in clinical data while developing machine learning models for in-hospital mortality prediction using federated learning.

Method: Simulated realistic non-IID environment by partitioning MIMIC-IV data by clinical care unit, applied SMOTE-Tomek for class imbalance, compared FedAvg, FedProx, FedAdagrad, FedAdam, and FedCluster over 50 communication rounds.

Result: FedProx consistently outperformed other methods with highest F1-Score of 0.8831 and stable convergence, while FedAvg was most computationally efficient but had substantially lower predictive performance.

Conclusion: Regularization-based FL algorithms like FedProx offer more robust and effective solution for heterogeneous and imbalanced clinical prediction tasks than standard or server-side adaptive aggregation methods.

Abstract: Machine learning models hold significant potential for predicting in-hospital mortality, yet data privacy constraints and the statistical heterogeneity of real-world clinical data often hamper their development. Federated Learning (FL) offers a privacy-preserving solution, but its performance under non-Independent and Identically Distributed (non-IID) and imbalanced conditions requires rigorous investigation. The study presents a comparative benchmark of five federated learning strategies: FedAvg, FedProx, FedAdagrad, FedAdam, and FedCluster for mortality prediction. Using the large-scale MIMIC-IV dataset, we simulate a realistic non-IID environment by partitioning data by clinical care unit. To address the inherent class imbalance of the task, the SMOTE-Tomek technique is applied to each client’s local training data. Our experiments, conducted over 50 communication rounds, reveal that the regularization-based strategy, FedProx, consistently outperformed other methods, achieving the highest F1-Score of 0.8831 while maintaining stable convergence. While the baseline FedAvg was the most computationally efficient, its predictive performance was substantially lower. Our findings indicate that regularization-based FL algorithms like FedProx offer a more robust and effective solution for heterogeneous and imbalanced clinical prediction tasks than standard or server-side adaptive aggregation methods. The work provides a crucial empirical benchmark for selecting appropriate FL strategies for real-world healthcare applications.

[1143] Highly Imbalanced Regression with Tabular Data in SEP and Other Applications

Josias K. Moukpe, Philip K. Chan, Ming Zhang

Main category: cs.LG

TL;DR: CISIR is a novel method for highly imbalanced regression that incorporates correlation analysis, Monotonically Decreasing Involution importance functions, and stratified sampling to improve prediction accuracy for rare instances in datasets with imbalance ratios over 1,000.

DetailsMotivation: Address the challenges of highly imbalanced regression where MSE loss ignores correlation, typical importance functions are limited to convex forms, and uniform sampling fails to capture rare instances - particularly important for applications like forecasting rare Solar Energetic Particle events.

Method: Proposes CISIR framework with three key components: correlation analysis between predicted and actual values, Monotonically Decreasing Involution (MDI) importance functions that overcome convex limitations, and stratified sampling to ensure rare instances are represented in mini-batches.

Result: Experimental results on five datasets show CISIR achieves lower error and higher correlation than recent methods. Adding the correlation component to other methods improves their performance, and MDI importance outperforms other importance functions.

Conclusion: CISIR effectively addresses highly imbalanced regression problems by combining correlation consideration, flexible importance functions, and improved sampling strategies, demonstrating superior performance in predicting rare target values.

Abstract: We investigate imbalanced regression with tabular data that have an imbalance ratio larger than 1,000 (“highly imbalanced”). Accurately estimating the target values of rare instances is important in applications such as forecasting the intensity of rare harmful Solar Energetic Particle (SEP) events. For regression, the MSE loss does not consider the correlation between predicted and actual values. Typical inverse importance functions allow only convex functions. Uniform sampling might yield mini-batches that do not have rare instances. We propose CISIR that incorporates correlation, Monotonically Decreasing Involution (MDI) importance, and stratified sampling. Based on five datasets, our experimental results indicate that CISIR can achieve lower error and higher correlation than some recent methods. Also, adding our correlation component to other recent methods can improve their performance. Lastly, MDI importance can outperform other importance functions. Our code can be found in https://github.com/Machine-Earning/CISIR.

[1144] Self-Supervised Learning of Graph Representations for Network Intrusion Detection

Lorenzo Guerra, Thomas Chapuis, Guillaume Duc, Pavlo Mozharovskyi, Van-Tam Nguyen

Main category: cs.LG

TL;DR: GraphIDS is a self-supervised intrusion detection model that unifies representation learning and anomaly detection using graph neural networks and masked autoencoders to identify network intrusions through reconstruction errors.

DetailsMotivation: Existing graph-based intrusion detection methods decouple representation learning from anomaly detection, limiting the effectiveness of embeddings for identifying attacks in evolving network environments.

Method: Uses an inductive graph neural network to embed network flows with local topological context, combined with a Transformer-based encoder-decoder that reconstructs embeddings via masked autoencoding to learn global co-occurrence patterns.

Result: Achieves up to 99.98% PR-AUC and 99.61% macro F1-score on NetFlow benchmarks, outperforming baselines by 5-25 percentage points.

Conclusion: The end-to-end framework successfully unifies representation learning and anomaly detection, enabling effective intrusion detection through reconstruction-based anomaly scoring.

Abstract: Detecting intrusions in network traffic is a challenging task, particularly under limited supervision and constantly evolving attack patterns. While recent works have leveraged graph neural networks for network intrusion detection, they often decouple representation learning from anomaly detection, limiting the utility of the embeddings for identifying attacks. We propose GraphIDS, a self-supervised intrusion detection model that unifies these two stages by learning local graph representations of normal communication patterns through a masked autoencoder. An inductive graph neural network embeds each flow with its local topological context to capture typical network behavior, while a Transformer-based encoder-decoder reconstructs these embeddings, implicitly learning global co-occurrence patterns via self-attention without requiring explicit positional information. During inference, flows with unusually high reconstruction errors are flagged as potential intrusions. This end-to-end framework ensures that embeddings are directly optimized for the downstream task, facilitating the recognition of malicious traffic. On diverse NetFlow benchmarks, GraphIDS achieves up to 99.98% PR-AUC and 99.61% macro F1-score, outperforming baselines by 5-25 percentage points.

[1145] Active Learning for Machine Learning Driven Molecular Dynamics

Kevin Bachelor, Sanya Murdeshwar, Daniel Sabo, Razvan Marinescu

Main category: cs.LG

TL;DR: Active learning framework for coarse-grained neural network potentials that uses RMSD-based frame selection to identify and correct coverage gaps during molecular dynamics simulations, improving model accuracy without requiring extensive all-atom data.

DetailsMotivation: Machine-learned coarse-grained potentials degrade over time when simulations reach under-sampled conformations, and generating widespread all-atom data to address this is computationally infeasible.

Method: Active learning framework using RMSD-based frame selection from MD simulations to generate data on-the-fly by querying an oracle during neural network potential training, preserving CG-level efficiency while correcting model at coverage gaps.

Result: Framework explores previously unseen configurations and trains model on unexplored regions of conformational space. CGSchNet model trained on Chignolin protein achieves 33.05% improvement in Wasserstein-1 metric in TICA space.

Conclusion: The active learning framework enables efficient and accurate coarse-grained neural network potentials by dynamically identifying and correcting coverage gaps during training.

Abstract: Machine-learned coarse-grained (CG) potentials are fast, but degrade over time when simulations reach under-sampled bio-molecular conformations, and generating widespread all-atom (AA) data to combat this is computationally infeasible. We propose a novel active learning (AL) framework for CG neural network potentials in molecular dynamics (MD). Building on the CGSchNet model, our method employs root mean squared deviation (RMSD)-based frame selection from MD simulations in order to generate data on-the-fly by querying an oracle during the training of a neural network potential. This framework preserves CG-level efficiency while correcting the model at precise, RMSD-identified coverage gaps. By training CGSchNet, a coarse-grained neural network potential, we empirically show that our framework explores previously unseen configurations and trains the model on unexplored regions of conformational space. Our active learning framework enables a CGSchNet model trained on the Chignolin protein to achieve a 33.05% improvement in the Wasserstein-1 (W1) metric in Time-lagged Independent Component Analysis (TICA) space on an in-house benchmark suite.

[1146] Understanding Post-Training Structural Changes in Large Language Models

Xinyu He, Xianghui Cao

Main category: cs.LG

TL;DR: SVD analysis reveals post-training causes uniform singular value scaling and coordinated orthogonal transformations in LLM parameters, with orthogonal consistency being crucial for performance.

DetailsMotivation: To understand how post-training methods like instruction tuning and Long-CoT distillation structurally alter LLM parameters, as current understanding of parameter space changes is limited.

Method: Systematic singular value decomposition (SVD) analysis of principal linear layers in pretrained LLMs, focusing on instruction tuning and Long-CoT distillation.

Result: Two consistent structural changes: uniform geometric scaling of singular values across layers (modulating attention scores) and highly consistent orthogonal transformations of singular vectors. Disrupting orthogonal consistency causes catastrophic performance degradation.

Conclusion: Post-training can be interpreted as reparameterization of fixed subspaces, with singular value scaling as secondary effect and coordinated rotation of singular vectors as core functional transformation, revealing clear regularities in parameter evolution.

Abstract: Post-training fundamentally alters the behavior of large language models (LLMs), yet its impact on the internal parameter space remains poorly understood. In this work, we conduct a systematic singular value decomposition (SVD) analysis of principal linear layers in pretrained LLMs, focusing on two widely adopted post-training methods: instruction tuning and long-chain-of-thought (Long-CoT) distillation. Our analysis reveals two consistent and unexpected structural changes:(1) a near-uniform geometric scaling of singular values across layers, which theoretically modulates attention scores; and (2) highly consistent orthogonal transformations are applied to the left and right singular vectors of each matrix. Disrupting this orthogonal consistency leads to catastrophic performance degradation. Based on these findings, we propose a simple yet effective framework that interprets post-training as a reparameterization of fixed subspaces in the pretrained parameter space. Further experiments reveal that singular value scaling behaves as a secondary effect, analogous to a temperature adjustment, whereas the core functional transformation lies in the coordinated rotation of singular vectors. These results challenge the prevailing view of the parameter space in large models as a black box, uncovering the first clear regularities in how parameters evolve during training, and providing a new perspective for deeper investigation into model parameter changes.

[1147] HyperCore: Coreset Selection under Noise via Hypersphere Models

Brian B. Moser, Arundhati S. Shanbhag, Tobias C. Nauen, Stanislav Frolov, Federico Raue, Joachim Folz, Andreas Dengel

Main category: cs.LG

TL;DR: HyperCore is a robust coreset selection framework that adaptively prunes noisy data using hypersphere models and Youden’s J statistic, outperforming existing methods in noisy environments.

DetailsMotivation: Existing coreset selection methods ignore annotation errors and require fixed pruning ratios, making them impractical for real-world noisy datasets.

Method: Uses lightweight hypersphere models per class to embed in-class samples close to center while segregating out-of-class samples based on distance, with adaptive pruning via Youden’s J statistic.

Result: Consistently surpasses state-of-the-art coreset selection methods, especially under noisy and low-data regimes, effectively discarding mislabeled and ambiguous points.

Conclusion: HyperCore yields compact yet highly informative subsets suitable for scalable and noise-free learning without requiring hyperparameter tuning.

Abstract: The goal of coreset selection methods is to identify representative subsets of datasets for efficient model training. Yet, existing methods often ignore the possibility of annotation errors and require fixed pruning ratios, making them impractical in real-world settings. We present HyperCore, a robust and adaptive coreset selection framework designed explicitly for noisy environments. HyperCore leverages lightweight hypersphere models learned per class, embedding in-class samples close to a hypersphere center while naturally segregating out-of-class samples based on their distance. By using Youden’s J statistic, HyperCore can adaptively select pruning thresholds, enabling automatic, noise-aware data pruning without hyperparameter tuning. Our experiments reveal that HyperCore consistently surpasses state-of-the-art coreset selection methods, especially under noisy and low-data regimes. HyperCore effectively discards mislabeled and ambiguous points, yielding compact yet highly informative subsets suitable for scalable and noise-free learning.

[1148] Fine-Grained Uncertainty Decomposition in Large Language Models: A Spectral Approach

Nassim Walha, Sebastian G. Gruber, Thomas Decker, Yinchong Yang, Alireza Javanmardi, Eyke Hüllermeier, Florian Buettner

Main category: cs.LG

TL;DR: Spectral Uncertainty is a novel method using Von Neumann entropy to quantify and decompose predictive uncertainty in LLMs into aleatoric and epistemic components, outperforming existing methods.

DetailsMotivation: As LLMs are increasingly deployed, reliable uncertainty measures are crucial. Distinguishing between aleatoric uncertainty (data ambiguity) and epistemic uncertainty (model limitations) is essential for addressing each source effectively.

Method: Leverages Von Neumann entropy from quantum information theory to separate total uncertainty into aleatoric and epistemic components. Incorporates fine-grained semantic similarity representation for nuanced differentiation of semantic interpretations.

Result: Empirical evaluations show Spectral Uncertainty outperforms state-of-the-art methods in estimating both aleatoric and total uncertainty across diverse models and benchmark datasets.

Conclusion: Spectral Uncertainty provides a rigorous theoretical foundation for uncertainty quantification in LLMs and demonstrates superior performance in uncertainty estimation compared to existing baseline methods.

Abstract: As Large Language Models (LLMs) are increasingly integrated in diverse applications, obtaining reliable measures of their predictive uncertainty has become critically important. A precise distinction between aleatoric uncertainty, arising from inherent ambiguities within input data, and epistemic uncertainty, originating exclusively from model limitations, is essential to effectively address each uncertainty source. In this paper, we introduce Spectral Uncertainty, a novel approach to quantifying and decomposing uncertainties in LLMs. Leveraging the Von Neumann entropy from quantum information theory, Spectral Uncertainty provides a rigorous theoretical foundation for separating total uncertainty into distinct aleatoric and epistemic components. Unlike existing baseline methods, our approach incorporates a fine-grained representation of semantic similarity, enabling nuanced differentiation among various semantic interpretations in model responses. Empirical evaluations demonstrate that Spectral Uncertainty outperforms state-of-the-art methods in estimating both aleatoric and total uncertainty across diverse models and benchmark datasets.

[1149] Energy Guided Geometric Flow Matching

Aaron Zweig, Mingxuan Zhang, Elham Azizi, David Knowles

Main category: cs.LG

TL;DR: The paper proposes using score matching and annealed energy distillation to learn a metric tensor that captures data geometry for more accurate flow matching on temporal data.

DetailsMotivation: Traditional flow matching methods use straight conditional paths and geodesic learning methods rely on RBF kernels or nearest neighbor graphs that suffer from the curse of dimensionality, failing to properly capture data manifold geometry.

Method: Using score matching and annealed energy distillation to learn a metric tensor that faithfully captures the underlying data geometry, which then informs more accurate flows.

Result: The method demonstrates efficacy on synthetic manifolds with analytic geodesics and interpolation of cell data.

Conclusion: The proposed approach of learning metric tensors through score matching and energy distillation provides a better way to capture data geometry for improved flow matching compared to traditional methods.

Abstract: A useful inductive bias for temporal data is that trajectories should stay close to the data manifold. Traditional flow matching relies on straight conditional paths, and flow matching methods which learn geodesics rely on RBF kernels or nearest neighbor graphs that suffer from the curse of dimensionality. We propose to use score matching and annealed energy distillation to learn a metric tensor that faithfully captures the underlying data geometry and informs more accurate flows. We demonstrate the efficacy of this strategy on synthetic manifolds with analytic geodesics, and interpolation of cell

[1150] Echo Flow Networks

Hongbo Liu, Jia Xu

Main category: cs.LG

TL;DR: Echo Flow Networks (EFNs) introduce an enhanced reservoir computing framework with Matrix-Gated Composite Random Activation (MCRA) and dual-stream architecture, achieving state-of-the-art time-series forecasting performance with 4x faster training and 3x smaller models.

DetailsMotivation: To overcome the trade-off between computational complexity and long-range dependency capture in time-series forecasting, while leveraging the efficiency of Echo State Networks but addressing their limited nonlinear capacity and performance limitations.

Method: Framework composed of extended Echo State Networks (X-ESNs) with MLP readouts, enhanced by novel Matrix-Gated Composite Random Activation (MCRA) for complex neuron-specific temporal dynamics, plus a dual-stream architecture that dynamically selects reservoir features from infinite-horizon memory.

Result: Achieves up to 4x faster training and 3x smaller model size compared to PatchTST, reducing forecasting error from 43% to 35% (20% relative improvement). EchoFormer (EFN instantiation) achieves state-of-the-art performance across five benchmark datasets.

Conclusion: EFNs successfully bridge the gap between computational efficiency and forecasting performance, offering a scalable solution for long-term time-series forecasting with superior accuracy and reduced resource requirements.

Abstract: At the heart of time-series forecasting (TSF) lies a fundamental challenge: how can models efficiently and effectively capture long-range temporal dependencies across ever-growing sequences? While deep learning has brought notable progress, conventional architectures often face a trade-off between computational complexity and their ability to retain accumulative information over extended horizons. Echo State Networks (ESNs), a class of reservoir computing models, have recently regained attention for their exceptional efficiency, offering constant memory usage and per-step training complexity regardless of input length. This makes them particularly attractive for modeling extremely long-term event history in TSF. However, traditional ESNs fall short of state-of-the-art performance due to their limited nonlinear capacity, which constrains both their expressiveness and stability. We introduce Echo Flow Networks (EFNs), a framework composed of a group of extended Echo State Networks (X-ESNs) with MLP readouts, enhanced by our novel Matrix-Gated Composite Random Activation (MCRA), which enables complex, neuron-specific temporal dynamics, significantly expanding the network’s representational capacity without compromising computational efficiency. In addition, we propose a dual-stream architecture in which recent input history dynamically selects signature reservoir features from an infinite-horizon memory, leading to improved prediction accuracy and long-term stability. Extensive evaluations on five benchmarks demonstrate that EFNs achieve up to 4x faster training and 3x smaller model size compared to leading methods like PatchTST, reducing forecasting error from 43% to 35%, a 20% relative improvement. One instantiation of our framework, EchoFormer, consistently achieves new state-of-the-art performance across five benchmark datasets: ETTh, ETTm, DMV, Weather, and Air Quality.

[1151] MSCoD: An Enhanced Bayesian Updating Framework with Multi-Scale Information Bottleneck and Cooperative Attention for Structure-Based Drug Design

Long Xu, Yongcai Chen, Fengshuo Liu, Yuzhong Peng

Main category: cs.LG

TL;DR: MSCoD is a Bayesian updating-based generative framework for structure-based drug design that addresses hierarchical protein-ligand interactions through Multi-Scale Information Bottleneck and multi-head cooperative attention mechanisms.

DetailsMotivation: Current SBDD methods struggle to capture complex protein-ligand interactions across multiple scales and often overlook hierarchical organization and intrinsic asymmetry of these interactions.

Method: Proposed MSCoD framework with Multi-Scale Information Bottleneck (MSIB) for hierarchical feature extraction and multi-head cooperative attention (MHCA) for asymmetric protein-to-ligand attention to address dimensionality disparity.

Result: MSCoD outperforms state-of-the-art methods on benchmark datasets, shows real-world applicability on difficult targets like KRAS G12D, and its modules boost performance on drug target affinity prediction benchmarks.

Conclusion: MSCoD effectively captures multi-scale protein-ligand interactions and demonstrates superior performance and transferability in structure-based drug design applications.

Abstract: Structure-Based Drug Design (SBDD) is a powerful strategy in computational drug discovery, utilizing three-dimensional protein structures to guide the design of molecules with improved binding affinity. However, capturing complex protein-ligand interactions across multiple scales remains challenging, as current methods often overlook the hierarchical organization and intrinsic asymmetry of these interactions. To address these limitations, we propose MSCoD, a novel Bayesian updating-based generative framework for structure-based drug design. In our MSCoD, Multi-Scale Information Bottleneck (MSIB) was developed, which enables semantic compression at multiple abstraction levels for efficient hierarchical feature extraction. Furthermore, a multi-head cooperative attention (MHCA) mechanism was developed, which employs asymmetric protein-to-ligand attention to capture diverse interaction types while addressing the dimensionality disparity between proteins and ligands. Empirical studies showed that MSCoD outperforms state-of-the-art methods on the benchmark dataset. Its real-world applicability is confirmed by case studies on difficult targets like KRAS G12D (7XKJ). Additionally, the MSIB and MHCA modules prove transferable, boosting the performance of GraphDTA on standard drug target affinity prediction benchmarks (Davis and Kiba). The code and data underlying this article are freely available at https://github.com/xulong0826/MSCoD.

[1152] Why Cannot Neural Networks Master Extrapolation? Insights from Physical Laws

Ramzi Dakhmouche, Hossein Gorji

Main category: cs.LG

TL;DR: The paper identifies a fundamental property that explains why deep learning models struggle with extrapolation in time series forecasting, contrasting with physical laws’ strong extrapolation capabilities.

DetailsMotivation: Foundation Models have succeeded in short-range forecasting but fail at long-range extrapolation, unlike physical laws. The research aims to understand the fundamental differences between neural networks and physical laws in extrapolation.

Method: The authors formalize a fundamental property characterizing statistical learning models’ ability to predict outside their training domain, supported by theoretical analysis and empirical results on current deep learning architectures.

Result: The research clarifies the root causes of the extrapolation gap in deep learning models and demonstrates performance deterioration in extrapolation settings through empirical validation.

Conclusion: The findings not only explain the extrapolation gap but also suggest directions for designing next-generation forecasting models capable of mastering extrapolation beyond training domains.

Abstract: Motivated by the remarkable success of Foundation Models (FMs) in language modeling, there has been growing interest in developing FMs for time series prediction, given the transformative power such models hold for science and engineering. This culminated in significant success of FMs in short-range forecasting settings. However, extrapolation or long-range forecasting remains elusive for FMs, which struggle to outperform even simple baselines. This contrasts with physical laws which have strong extrapolation properties, and raises the question of the fundamental difference between the structure of neural networks and physical laws. In this work, we identify and formalize a fundamental property characterizing the ability of statistical learning models to predict more accurately outside of their training domain, hence explaining performance deterioration for deep learning models in extrapolation settings. In addition to a theoretical analysis, we present empirical results showcasing the implications of this property on current deep learning architectures. Our results not only clarify the root causes of the extrapolation gap but also suggest directions for designing next-generation forecasting models capable of mastering extrapolation.

[1153] Can Linear Probes Measure LLM Uncertainty?

Ramzi Dakhmouche, Adrien Letellier, Hossein Gorji

Main category: cs.LG

TL;DR: Proposes a Bayesian linear regression approach for uncertainty quantification in LLMs, using layer-level posterior distributions to identify sparse combinations of features for efficient UQ.

DetailsMotivation: Current UQ methods for LLMs with multiple choice structure are dominated by naive maximum softmax baseline, lacking principled approaches despite the need for reliable deployment in automated decision-making.

Method: Train multiple Bayesian linear models predicting each layer’s output from the previous layer, then infer global uncertainty by identifying sparse combinations of distributional features from layer-level posterior distributions.

Result: Numerical experiments on various LLMs show consistent improvement over state-of-the-art baselines in uncertainty quantification.

Conclusion: A principled Bayesian approach using simple linear regression models can effectively improve uncertainty quantification in LLMs, outperforming existing methods despite model simplicity.

Abstract: Effective Uncertainty Quantification (UQ) represents a key aspect for reliable deployment of Large Language Models (LLMs) in automated decision-making and beyond. Yet, for LLM generation with multiple choice structure, the state-of-the-art in UQ is still dominated by the naive baseline given by the maximum softmax score. To address this shortcoming, we demonstrate that taking a principled approach via Bayesian statistics leads to improved performance despite leveraging the simplest possible model, namely linear regression. More precisely, we propose to train multiple Bayesian linear models, each predicting the output of a layer given the output of the previous one. Based on the obtained layer-level posterior distributions, we infer the global uncertainty level of the LLM by identifying a sparse combination of distributional features, leading to an efficient UQ scheme. Numerical experiments on various LLMs show consistent improvement over state-of-the-art baselines.

[1154] On Powerful Ways to Generate: Autoregression, Diffusion, and Beyond

Chenxiao Yang, Cai Zhou, David Wipf, Zhiyuan Li

Main category: cs.LG

TL;DR: Masked Diffusion Models (MDM) with sufficient context length are computationally universal with optimal parallel complexity, but their any-order generation doesn’t expand what Auto-Regressive Models (ARM) can solve. The paper proposes any-process generation with remasking, insertion, and deletion capabilities to enable solving harder reasoning problems.

DetailsMotivation: To understand the computational power and limitations of diffusion language models compared to autoregressive models, and to develop capabilities that can solve problems beyond what ARM can handle.

Method: Proposes any-process generation that extends MDM with capabilities to remask, insert and delete tokens, enabling self-correction, length-variable editing, and adaptive parallelism.

Result: Theoretical and empirical demonstrations show that any-process generation enables scalability to significantly harder reasoning problems that are intractable for ARM and vanilla MDM, and proves essential for generation tasks in domains like coding and science.

Conclusion: Any-process generation with remasking, insertion, and deletion capabilities is crucial for extending current LLMs beyond natural language to handle complex reasoning problems and non-sequential processes in domains like coding and science.

Abstract: Diffusion language models have recently emerged as a competitive alternative to autoregressive language models. Beyond next-token generation, they are more efficient and flexible by enabling parallel and any-order token generation. However, despite empirical successes, their computational power and fundamental limitations remain poorly understood. In this paper, we formally study whether non-autoregressive generation in Masked Diffusion Models (MDM) enables solving problems beyond the reach of Auto-Regressive Models (ARM). Our results show that MDM with sufficiently large context length is computationally universal with decoding steps matching the optimal parallel time complexity in PRAM. However, when controlling for other factors, MDM’s flexibility to generate in any-order does not expand what ARM can already solve. To address this, we propose a new form of generation called any-process generation, which extends MDM with capabilities to remask, insert and delete tokens, allowing self-correction, length-variable editing, and adaptive parallelism. Theoretically and empirically, we demonstrate these capabilities enable scalability to significantly harder reasoning problems that are otherwise intractable for ARM and vanilla MDM. Additionally, they prove essential for generation tasks where objects naturally evolve through non-sequential processes, crucial for extending current LLMs beyond natural language to domains such as coding and science.

[1155] On the Fairness of Privacy Protection: Measuring and Mitigating the Disparity of Group Privacy Risks for Differentially Private Machine Learning

Zhi Yang, Changwu Huang, Ke Tang, Xin Yao

Main category: cs.LG

TL;DR: Proposes a novel membership inference game for efficient auditing of worst-case group privacy risks and enhances DP-SGD with adaptive group-specific gradient clipping to reduce privacy protection disparities.

DetailsMotivation: Existing methods assess group privacy risks based on average-case scenarios, potentially underestimating disparities. Current worst-case assessment methods are time-consuming and impractical.

Method: 1) Introduces a membership inference game for efficient auditing of approximate worst-case privacy risks. 2) Enhances DP-SGD with adaptive group-specific gradient clipping strategy inspired by differential privacy canaries.

Result: The method provides more stringent measurement of group privacy risks and reliable assessment of disparities. The enhanced DP-SGD algorithm effectively reduces disparity in group privacy risks.

Conclusion: The proposed approach enhances fairness in privacy protection for differentially private machine learning by providing better group privacy risk assessment and reducing disparities through adaptive gradient clipping.

Abstract: While significant progress has been made in conventional fairness-aware machine learning (ML) and differentially private ML (DPML), the fairness of privacy protection across groups remains underexplored. Existing studies have proposed methods to assess group privacy risks, but these are based on the average-case privacy risks of data records. Such approaches may underestimate the group privacy risks, thereby potentially underestimating the disparity across group privacy risks. Moreover, the current method for assessing the worst-case privacy risks of data records is time-consuming, limiting their practical applicability. To address these limitations, we introduce a novel membership inference game that can efficiently audit the approximate worst-case privacy risks of data records. Experimental results demonstrate that our method provides a more stringent measurement of group privacy risks, yielding a reliable assessment of the disparity in group privacy risks. Furthermore, to promote privacy protection fairness in DPML, we enhance the standard DP-SGD algorithm with an adaptive group-specific gradient clipping strategy, inspired by the design of canaries in differential privacy auditing studies. Extensive experiments confirm that our algorithm effectively reduces the disparity in group privacy risks, thereby enhancing the fairness of privacy protection in DPML.

[1156] Instruction Tuning Chronologically Consistent Language Models

Songrun He, Linying Lv, Asaf Manela, Jimmy Wu

Main category: cs.LG

TL;DR: A family of chronologically consistent, instruction-tuned LLMs designed to eliminate lookahead bias by training only on data available before specific cutoff dates.

DetailsMotivation: To address lookahead bias in predictive modeling by ensuring models are trained only on historically available data, preventing training leakage from future information.

Method: Develop instruction-tuned large language models trained exclusively on data available before clearly defined knowledge-cutoff dates, with strict temporal separation from post-cutoff data.

Result: Created models with conversational chat interface, fully open fixed weights for replicability, and conservative lower bound on forecast accuracy by removing training leakage.

Conclusion: The framework provides researchers with an easy-to-use generative AI tool for prediction tasks that is free of lookahead bias, isolating genuine predictability from training artifacts.

Abstract: We introduce a family of chronologically consistent, instruction-tuned large language models to eliminate lookahead bias. Each model is trained only on data available before a clearly defined knowledge-cutoff date, ensuring strict temporal separation from any post-cutoff data. The resulting framework offers (i) a simple, conversational chat interface, (ii) fully open, fixed model weights that guarantee replicability, and (iii) a conservative lower bound on forecast accuracy, isolating the share of predictability that survives once training leakage is removed. Together, these features provide researchers with an easy-to-use generative AI tool useful for a wide range of prediction tasks that is free of lookahead bias.

[1157] Learning at the Speed of Physics: Equilibrium Propagation on Oscillator Ising Machines

Alex Gower

Main category: cs.LG

TL;DR: Oscillator Ising Machines (OIMs) use physical energy descent dynamics for machine learning, achieving competitive accuracy on MNIST and Fashion-MNIST while being robust to hardware constraints.

DetailsMotivation: Physical systems that naturally perform energy descent can accelerate machine learning by directly implementing optimization and sampling processes, avoiding bottlenecks of conventional processors.

Method: Equilibrium Propagation (EP) on Oscillator Ising Machines (OIMs) unifies optimization and sampling through descent on a single total energy landscape, enabling local learning rules without global backpropagation.

Result: Achieved competitive accuracy: ~97.2 ± 0.1% on MNIST and ~88.0 ± 0.1% on Fashion-MNIST, while maintaining robustness under hardware constraints like parameter quantization and phase noise.

Conclusion: OIMs establish a fast, energy-efficient substrate for neuromorphic learning, enabling practical realization of energy-based models on physical hardware whose dynamics directly perform optimization.

Abstract: Physical systems that naturally perform energy descent offer a direct route to accelerating machine learning. Oscillator Ising Machines (OIMs) exemplify this idea: their GHz-frequency dynamics mirror both the optimization of energy-based models (EBMs) and gradient descent on loss landscapes, while intrinsic noise corresponds to Langevin dynamics - supporting sampling as well as optimization. Equilibrium Propagation (EP) unifies these processes into descent on a single total energy landscape, enabling local learning rules without global backpropagation. We show that EP on OIMs achieves competitive accuracy ($\sim 97.2 \pm 0.1 %$ on MNIST, $\sim 88.0 \pm 0.1 %$ on Fashion-MNIST), while maintaining robustness under realistic hardware constraints such as parameter quantization and phase noise. These results establish OIMs as a fast, energy-efficient substrate for neuromorphic learning, and suggest that EBMs - often bottlenecked by conventional processors - may find practical realization on physical hardware whose dynamics directly perform their optimization.

[1158] Trace Regularity PINNs: Enforcing $\mathrm{H}^{\frac{1}{2}}(\partial Ω)$ for Boundary Data

Doyoon Kim, Junbin Song

Main category: cs.LG

TL;DR: TRPINN is an enhanced PINN that enforces boundary loss using the Sobolev-Slobodeckij norm H^{1/2}(∂Ω), reducing computational cost and improving convergence stability compared to standard PINNs.

DetailsMotivation: Standard PINNs have limitations in handling boundary conditions, particularly for problems with oscillatory boundary data. The motivation is to develop a more mathematically rigorous and computationally efficient approach to boundary condition enforcement in PINNs.

Method: The method uses the Trace Regularity Physics-Informed Neural Network (TRPINN) which computes only the essential portion of the H^{1/2}(∂Ω) semi-norm, avoiding denominator evaluations in discretization. It incorporates exact H^{1/2}(∂Ω) norm and uses Neural Tangent Kernel analysis.

Result: TRPINN converges to the true solution in H^1(Ω) sense and converges faster than standard PINNs. Numerical experiments on Laplace equation with oscillatory Dirichlet boundary conditions show TRPINN succeeds where standard PINNs fail, with performance improvements of 1-3 decimal digits.

Conclusion: TRPINN provides a mathematically rigorous and computationally efficient framework for boundary condition enforcement in PINNs, offering superior performance and convergence properties compared to standard PINNs, particularly for challenging boundary conditions.

Abstract: We propose an enhanced physics-informed neural network (PINN), the Trace Regularity Physics-Informed Neural Network (TRPINN), which enforces the boundary loss in the Sobolev-Slobodeckij norm $H^{1/2}(\partial Ω)$, the correct trace space associated with $H^1(Ω)$. We reduce computational cost by computing only the theoretically essential portion of the semi-norm and enhance convergence stability by avoiding denominator evaluations in the discretization. By incorporating the exact $H^{1/2}(\partial Ω)$ norm, we show that the approximation converges to the true solution in the $H^{1}(Ω)$ sense, and, through Neural Tangent Kernel (NTK) analysis, we demonstrate that TRPINN can converge faster than standard PINNs. Numerical experiments on the Laplace equation with highly oscillatory Dirichlet boundary conditions exhibit cases where TRPINN succeeds even when standard PINNs fail, and show performance improvements of one to three decimal digits.

[1159] 3D Optimization for AI Inference Scaling: Balancing Accuracy, Cost, and Latency

Minseok Jung, Abhas Ricky, Muhammad Rameez Chatni

Main category: cs.LG

TL;DR: A 3D optimization framework for AI inference scaling that jointly optimizes accuracy, cost, and latency, outperforming traditional 1D and 2D approaches by enabling constraints-aware deployment.

DetailsMotivation: Traditional AI inference scaling uses 1D heuristics or 2D trade-offs that fail to consider cost and latency constraints, limiting practical deployment effectiveness.

Method: Developed a 3D optimization framework using Monte Carlo simulations across three scenarios and nine LLMs, evaluating four optimization methods for multi-objective optimization of accuracy, cost, and latency.

Result: Knee-point optimization based on Pareto frontiers achieves the best balance, while accuracy-maximization works best when accuracy is prioritized. Smaller models with optimal inference scaling can match larger models at lower cost.

Conclusion: The framework provides a theoretical foundation for deployment-aware inference scaling, enabling environment-adaptive selection of inference parameters across diverse operational conditions.

Abstract: AI inference scaling is often tuned through 1D heuristics (a fixed reasoning pass) or 2D bivariate trade-offs (e.g., accuracy vs. compute), which fail to consider cost and latency constraints. We introduce a 3D optimization framework that jointly calibrates accuracy, cost, and latency within a unified decision space, enabling constraints-aware inference scaling. Using Monte Carlo simulations across three representative scenarios and nine simulated large language models, we evaluate four optimization methods to address the 3D multi-objective optimization (MOO) problem. Framing inference scaling in MOO shapes a feasible space that 1D and 2D optimizations fail to capture, enabling environment-adaptive selection of the inference scaling~$k$. Results show that knee-point optimization based on Pareto frontiers achieves the best balance, while accuracy-maximization remains favorable when accuracy is prioritized. Our results further show that smaller models, when combined with optimal inference scaling, can match or exceed the performance of larger models at a fraction of the cost. The framework establishes a theoretical foundation for deployment-aware inference scaling across diverse operational conditions.

[1160] Online Mixture of Experts: No-Regret Learning for Optimal Collective Decision-Making

Larkin Liu, Jalal Etesami

Main category: cs.LG

TL;DR: The paper introduces online mixture-of-experts (OMoE) for expert-guided bandit learning, proposing two algorithms that combine expert outputs to optimize aggregate accuracy with theoretical regret guarantees and applications to LLM fine-tuning.

DetailsMotivation: To improve aggregate accuracy by optimally combining multiple experts' outputs in a bandit learning setting, particularly for modern applications like fine-tuning large language models.

Method: Two algorithms: (1) aggregate voting with UCB-driven successive elimination for pruning suboptimal actions, and (2) online weighted-majority-voting that leverages each expert’s predictive power proportionally.

Result: Theoretical guarantees for regret properties in bandit settings under ideal circumstances, with empirical validation. Applied successfully to online fine-tuning of expert LLMs for improved response accuracy.

Conclusion: The methods provide new methodologies and no-regret guarantees for combining multiple experts to enhance overall aggregate model performance, particularly in dynamic expert selection and weighting scenarios.

Abstract: We explore the use of expert-guided bandit learning, which we refer to as online mixture-of-experts (OMoE). In this setting, given a context, a candidate committee of experts must determine how to aggregate their outputs to achieve optimal results in terms of aggregate accuracy. We propose two algorithms to address this problem. The first algorithm combines aggregate voting with UCB-driven successive elimination, efficiently pruning suboptimal exploration actions. The second algorithm employs an online weighted-majority-voting mechanism, leveraging the respective voting power of each expert proportional to their predictive power. We derive theoretical guarantees for the regret properties in the bandit setting under ideal circumstances, and empirical results are provided accordingly. As a modern study on applications, these methods are applied to the online fine-tuning of a set of expert large language models (LLMs), where after each response, the generative LLM dynamically reweighs its set of experts and/or selects the optimal committee of experts to generate the most accurate response. Our results introduce new methodologies and no-regret guarantees for combining multiple experts to improve on the performance of the an aggregate model overall.

[1161] A Multi-level Analysis of Factors Associated with Student Performance: A Machine Learning Approach to the SAEB Microdata

Rodrigo Tertulino, Ricardo Almeida

Main category: cs.LG

TL;DR: A multi-level ML approach using Random Forest achieved 90.2% accuracy in classifying Brazilian student proficiency, with XAI revealing school socioeconomic level as the most dominant predictor.

DetailsMotivation: To identify factors influencing student performance in Brazilian basic education for effective public policy formulation.

Method: Multi-level machine learning approach integrating student socioeconomic data, teacher profiles, school indicators, and principal management profiles using ensemble algorithms and SHAP for explainability.

Result: Random Forest model achieved 90.2% accuracy and 96.7% AUC, with SHAP analysis showing school’s average socioeconomic level as the most dominant predictor.

Conclusion: Academic performance is a systemic phenomenon tied to the school’s ecosystem, requiring policies that address disparities between schools rather than focusing only on individual characteristics.

Abstract: Identifying the factors that influence student performance in basic education is a central challenge for formulating effective public policies in Brazil. This study introduces a multi-level machine learning approach to classify the proficiency of 9th-grade and high school students using microdata from the System of Assessment of Basic Education (SAEB). Our model uniquely integrates four data sources: student socioeconomic characteristics, teacher professional profiles, school indicators, and principal management profiles. A comparative analysis of four ensemble algorithms confirmed the superiority of a Random Forest model, which achieved 90.2% accuracy and an Area Under the Curve (AUC) of 96.7%. To move beyond prediction, we applied Explainable AI (XAI) using SHAP, which revealed that the school’s average socioeconomic level is the most dominant predictor, demonstrating that systemic factors have a greater impact than individual characteristics in isolation. The primary conclusion is that academic performance is a systemic phenomenon deeply tied to the school’s ecosystem. This study provides a data-driven, interpretable tool to inform policies aimed at promoting educational equity by addressing disparities between schools.

[1162] Integrating Genomics into Multimodal EHR Foundation Models

Jonathan Amar, Edward Liu, Alessandra Breschi, Liangliang Zhang, Pouya Kheradpour, Sylvia Li, Lisa Soleymani Lehmann, Alessandro Giulianelli, Matt Edwards, Yugang Jia, David Nola, Raghav Mani, Pankaj Vats, Jesse Tetreault, T. J. Chen, Cory Y. McLean

Main category: cs.LG

TL;DR: An EHR foundation model that integrates Polygenic Risk Scores (PRS) with clinical data from the All of Us program, using multimodal AI to improve disease prediction and enable personalized healthcare.

DetailsMotivation: To move beyond traditional EHR-only approaches by incorporating genetic predisposition data (PRS) for more holistic health profiles and better disease prediction capabilities.

Method: Extends generative AI to EHR foundation models, creating a multimodal framework that learns complex relationships between clinical EHR data and genetic predispositions using All of Us program data.

Result: Demonstrated predictive value for various conditions, especially Type 2 Diabetes, and showed interplay between PRS and EHR data. Successfully applied transfer learning for custom classification tasks.

Conclusion: This integrated approach enables better disease prediction, proactive health management, risk stratification, and personalized treatment strategies, advancing personalized and equitable healthcare evidence generation.

Abstract: This paper introduces an innovative Electronic Health Record (EHR) foundation model that integrates Polygenic Risk Scores (PRS) as a foundational data modality, moving beyond traditional EHR-only approaches to build more holistic health profiles. Leveraging the extensive and diverse data from the All of Us (AoU) Research Program, this multimodal framework aims to learn complex relationships between clinical data and genetic predispositions. The methodology extends advancements in generative AI to the EHR foundation model space, enhancing predictive capabilities and interpretability. Evaluation on AoU data demonstrates the model’s predictive value for the onset of various conditions, particularly Type 2 Diabetes (T2D), and illustrates the interplay between PRS and EHR data. The work also explores transfer learning for custom classification tasks, showcasing the architecture’s versatility and efficiency. This approach is pivotal for unlocking new insights into disease prediction, proactive health management, risk stratification, and personalized treatment strategies, laying the groundwork for more personalized, equitable, and actionable real-world evidence generation in healthcare.

[1163] Geometric Algorithms for Neural Combinatorial Optimization with Constraints

Nikolaos Karalias, Akbar Rafiey, Yifei Xu, Zhishang Luo, Behrooz Tahmasebi, Connie Jiang, Stefanie Jegelka

Main category: cs.LG

TL;DR: A self-supervised learning framework for combinatorial optimization that handles discrete constraints by decomposing neural network outputs into convex combinations of feasible solutions using convex geometry techniques.

DetailsMotivation: To address the challenge of solving combinatorial optimization problems with discrete constraints using self-supervised learning, where traditional neural network approaches struggle with discrete feasibility.

Method: Leverage convex geometry and Carathéodory’s theorem to decompose neural network outputs into convex combinations of polytope corners representing feasible sets, enabling end-to-end differentiable optimization with quality-preserving rounding.

Result: Extensive experiments show consistent outperformance over neural baselines in cardinality-constrained optimization, with successful applications to independent sets in graphs and matroid-constrained problems.

Conclusion: The proposed decomposition-based approach provides an effective framework for self-supervised learning in combinatorial optimization that handles discrete constraints while maintaining solution quality through proper rounding techniques.

Abstract: Self-Supervised Learning (SSL) for Combinatorial Optimization (CO) is an emerging paradigm for solving combinatorial problems using neural networks. In this paper, we address a central challenge of SSL for CO: solving problems with discrete constraints. We design an end-to-end differentiable framework that enables us to solve discrete constrained optimization problems with neural networks. Concretely, we leverage algorithmic techniques from the literature on convex geometry and Carathéodory’s theorem to decompose neural network outputs into convex combinations of polytope corners that correspond to feasible sets. This decomposition-based approach enables self-supervised training but also ensures efficient quality-preserving rounding of the neural net output into feasible solutions. Extensive experiments in cardinality-constrained optimization show that our approach can consistently outperform neural baselines. We further provide worked-out examples of how our method can be applied beyond cardinality-constrained problems to a diverse set of combinatorial optimization tasks, including finding independent sets in graphs, and solving matroid-constrained problems.

[1164] CDFlow: Building Invertible Layers with Circulant and Diagonal Matrices

Xuchen Feng, Siyu Liao

Main category: cs.LG

TL;DR: Proposes a novel invertible linear layer using circulant-diagonal matrix products for normalizing flows, achieving efficient computation while maintaining expressiveness.

DetailsMotivation: To design linear layers for normalizing flows that are both expressive and computationally efficient, addressing the high parameter and computational costs of traditional linear transformations.

Method: Decomposes linear transformations into products of circulant and diagonal matrices, leveraging Fast Fourier Transform for efficient inversion and log-determinant computation.

Result: Reduces parameter complexity from O(n²) to O(mn), matrix inversion from O(n³) to O(mn log n), and log-determinant computation from O(n³) to O(mn). Achieves strong density estimation on natural images and excels with periodic data.

Conclusion: Circulant-Diagonal Flow provides an efficient and expressive alternative for normalizing flows, enabling scalable generative modeling with significant computational advantages.

Abstract: Normalizing flows are deep generative models that enable efficient likelihood estimation and sampling through invertible transformations. A key challenge is to design linear layers that enhance expressiveness while maintaining efficient computation of the Jacobian determinant and inverse. We introduce a novel invertible linear layer based on the product of circulant and diagonal matrices. This decomposition reduces parameter complexity from $\mathcal{O}(n^2)$ to $\mathcal{O}(mn)$ using $m$ diagonal matrices and $m-1$ circulant matrices while still approximating general linear transformations. By leveraging the Fast Fourier Transform, our approach reduces the time complexity of matrix inversion from $\mathcal{O}(n^3)$ to $\mathcal{O}(mn\log n)$ and that of computing the log-determinant from $\mathcal{O}(n^3)$ to $\mathcal{O}(mn)$, where $n$ is the input dimension. We build upon this layer to develop Circulant-Diagonal Flow (CDFlow), which achieves strong density estimation on natural image datasets and effectively models data with inherent periodic structure. Furthermore, CDFlow significantly accelerates key operations in normalizing flows, providing practical benefits for scalable generative modeling.

[1165] Towards Causal Market Simulators

Dennis Thumm, Luis Ontaneda Mijares

Main category: cs.LG

TL;DR: Proposes TNCM-VAE, a neural causal model combining VAEs with structural causal models to generate counterfactual financial time series with preserved temporal and causal relationships.

DetailsMotivation: Existing market generators lack causal reasoning capabilities needed for counterfactual analysis and risk assessment in financial applications.

Method: Combines variational autoencoders with structural causal models, enforces causal constraints via DAGs in decoder architecture, and uses causal Wasserstein distance for training.

Result: Validated on synthetic autoregressive models, achieving L1 distances of 0.03-0.10 for counterfactual probability estimation compared to ground truth.

Conclusion: Enables financial stress testing, scenario analysis, and enhanced backtesting by generating plausible counterfactual market trajectories that respect causal mechanisms.

Abstract: Market generators using deep generative models have shown promise for synthetic financial data generation, but existing approaches lack causal reasoning capabilities essential for counterfactual analysis and risk assessment. We propose a Time-series Neural Causal Model VAE (TNCM-VAE) that combines variational autoencoders with structural causal models to generate counterfactual financial time series while preserving both temporal dependencies and causal relationships. Our approach enforces causal constraints through directed acyclic graphs in the decoder architecture and employs the causal Wasserstein distance for training. We validate our method on synthetic autoregressive models inspired by the Ornstein-Uhlenbeck process, demonstrating superior performance in counterfactual probability estimation with L1 distances as low as 0.03-0.10 compared to ground truth. The model enables financial stress testing, scenario analysis, and enhanced backtesting by generating plausible counterfactual market trajectories that respect underlying causal mechanisms.

[1166] Uncertainty Quantification for Reduced-Order Surrogate Models Applied to Cloud Microphysics

Jonas E. Katona, Emily K. de Jong, Nipun Gunawardena

Main category: cs.LG

TL;DR: A post hoc, model-agnostic framework for predictive uncertainty quantification in latent space reduced-order models using conformal prediction.

DetailsMotivation: Existing uncertainty quantification methods for ROMs are architecture- or training-specific, limiting flexibility and generalization.

Method: Uses conformal prediction to estimate statistical prediction intervals for latent dynamics, reconstruction, and end-to-end predictions without modifying the underlying ROM architecture or training.

Result: Successfully demonstrated on a latent space dynamical model for cloud microphysics, accurately predicting droplet-size distribution evolution and quantifying uncertainty across the ROM pipeline.

Conclusion: The proposed framework provides robust, flexible uncertainty quantification for ROMs that works with any architecture and training procedure.

Abstract: Reduced-order models (ROMs) can efficiently simulate high-dimensional physical systems but lack robust uncertainty quantification methods. Existing approaches are frequently architecture- or training-specific, which limits flexibility and generalization. We introduce a post hoc, model-agnostic framework for predictive uncertainty quantification in latent space ROMs that requires no modification to the underlying architecture or training procedure. Using conformal prediction, our approach estimates statistical prediction intervals for multiple components of the ROM pipeline: latent dynamics, reconstruction, and end-to-end predictions. We demonstrate the method on a latent space dynamical model for cloud microphysics, where it accurately predicts the evolution of droplet-size distributions and quantifies uncertainty across the ROM pipeline.

[1167] Ada-FCN: Adaptive Frequency-Coupled Network for fMRI-Based Brain Disorder Classification

Yue Xun, Jiaxing Xu, Wenbo Gao, Chen Yang, Shujun Wang

Main category: cs.LG

TL;DR: A novel fMRI analysis framework that adaptively learns frequency sub-bands and captures cross-band interactions for improved brain disorder classification.

DetailsMotivation: Existing fMRI models treat BOLD signals as monolithic time series, ignoring the multi-frequency nature of neuronal oscillations and disease-specific frequency disruptions, limiting diagnostic accuracy.

Method: Proposes Adaptive Cascade Decomposition to learn task-relevant frequency sub-bands per brain region, and Frequency-Coupled Connectivity Learning to capture intra- and cross-band interactions in a unified functional network, used in Unified-GCN for message passing.

Result: Demonstrates superior performance on ADNI and ABIDE datasets compared to existing methods.

Conclusion: The framework effectively addresses frequency-specific neurological disruptions and provides improved diagnostic sensitivity and specificity for brain disorders.

Abstract: Resting-state fMRI has become a valuable tool for classifying brain disorders and constructing brain functional connectivity networks by tracking BOLD signals across brain regions. However, existing mod els largely neglect the multi-frequency nature of neuronal oscillations, treating BOLD signals as monolithic time series. This overlooks the cru cial fact that neurological disorders often manifest as disruptions within specific frequency bands, limiting diagnostic sensitivity and specificity. While some methods have attempted to incorporate frequency informa tion, they often rely on predefined frequency bands, which may not be optimal for capturing individual variability or disease-specific alterations. To address this, we propose a novel framework featuring Adaptive Cas cade Decomposition to learn task-relevant frequency sub-bands for each brain region and Frequency-Coupled Connectivity Learning to capture both intra- and nuanced cross-band interactions in a unified functional network. This unified network informs a novel message-passing mecha nism within our Unified-GCN, generating refined node representations for diagnostic prediction. Experimental results on the ADNI and ABIDE datasets demonstrate superior performance over existing methods. The code is available at https://github.com/XXYY20221234/Ada-FCN.

[1168] SLOFetch: Compressed-Hierarchical Instruction Prefetching for Cloud Microservices

Liu Jiang, Zerui Bao, Shiqi Sheng, Di Zhu

Main category: cs.LG

TL;DR: A new instruction prefetching design for cloud workloads that improves efficiency and reduces tail latency through compressed entries, hierarchical metadata storage, and online ML control.

DetailsMotivation: Large-scale networked services with deep software stacks and microservice orchestration create frontend stalls that inflate tail latency and energy consumption, requiring improved instruction prefetching solutions.

Method: Builds on Entangling Instruction Prefetcher (EIP) with: 1) Compressed Entry capturing up to eight destinations using 36 bits via spatial clustering, 2) Hierarchical Metadata Storage keeping only L1 resident/frequently queried entries on-chip while virtualizing bulk metadata, 3) Lightweight Online ML Controller scoring prefetch profitability using context features and bandit-adjusted threshold.

Result: Preserves EIP-like speedups with smaller on-chip state and improves efficiency for networked services in the ML era.

Conclusion: The proposed instruction prefetching approach effectively addresses frontend stalls in cloud workloads while maintaining performance with reduced hardware overhead.

Abstract: Large-scale networked services rely on deep soft-ware stacks and microservice orchestration, which increase instruction footprints and create frontend stalls that inflate tail latency and energy. We revisit instruction prefetching for these cloud workloads and present a design that aligns with SLO driven and self optimizing systems. Building on the Entangling Instruction Prefetcher (EIP), we introduce a Compressed Entry that captures up to eight destinations around a base using 36 bits by exploiting spatial clustering, and a Hierarchical Metadata Storage scheme that keeps only L1 resident and frequently queried entries on chip while virtualizing bulk metadata into lower levels. We further add a lightweight Online ML Controller that scores prefetch profitability using context features and a bandit adjusted threshold. On data center applications, our approach preserves EIP like speedups with smaller on chip state and improves efficiency for networked services in the ML era.

[1169] Measuring Model Performance in the Presence of an Intervention

Winston Chen, Michael W. Sjoding, Jenna Wiens

Main category: cs.LG

TL;DR: Proposes NPW method for unbiased AI model evaluation using all RCT data by reweighting treatment group data to mimic counterfactual distributions, improving model selection efficiency.

DetailsMotivation: Standard RCT model evaluation ignores treatment group data, which is inefficient given RCT costs. Interventions bias evaluation when naively aggregating performance estimates.

Method: Theoretically quantify bias from naive aggregation, derive conditions for incorrect model selection, and propose NPW - reweighting treatment group data to mimic counterfactual distributions under no intervention.

Result: NPW consistently yields better model selection than standard approach across various intervention effects and sample sizes in synthetic and real-world datasets.

Conclusion: NPW enables more efficient model evaluation using all RCT data, representing meaningful progress for real-world AI applications where RCTs are costly.

Abstract: AI models are often evaluated based on their ability to predict the outcome of interest. However, in many AI for social impact applications, the presence of an intervention that affects the outcome can bias the evaluation. Randomized controlled trials (RCTs) randomly assign interventions, allowing data from the control group to be used for unbiased model evaluation. However, this approach is inefficient because it ignores data from the treatment group. Given the complexity and cost often associated with RCTs, making the most use of the data is essential. Thus, we investigate model evaluation strategies that leverage all data from an RCT. First, we theoretically quantify the estimation bias that arises from naïvely aggregating performance estimates from treatment and control groups and derive the condition under which this bias leads to incorrect model selection. Leveraging these theoretical insights, we propose nuisance parameter weighting (NPW), an unbiased model evaluation approach that reweights data from the treatment group to mimic the distributions of samples that would or would not experience the outcome under no intervention. Using synthetic and real-world datasets, we demonstrate that our proposed evaluation approach consistently yields better model selection than the standard approach, which ignores data from the treatment group, across various intervention effect and sample size settings. Our contribution represents a meaningful step towards more efficient model evaluation in real-world contexts.

[1170] Reaction Prediction via Interaction Modeling of Symmetric Difference Shingle Sets

Runhan Shi, Letian Chen, Gufeng Yu, Yang Yang

Main category: cs.LG

TL;DR: ReaDISH is a novel chemical reaction prediction model that addresses permutation sensitivity and inadequate substructural interaction modeling through symmetric difference shingle encoding and geometry-structure interaction attention.

DetailsMotivation: Existing machine learning models for chemical reaction prediction suffer from sensitivity to input permutations (molecule/atom orderings) and poor modeling of substructural interactions that govern reactivity, leading to inconsistent predictions and poor generalization.

Method: ReaDISH introduces two key innovations: 1) symmetric difference shingle encoding that extends DRFP with continuous high-dimensional embeddings to capture structural changes while eliminating order sensitivity, and 2) geometry-structure interaction attention that models intra- and inter-molecular interactions at the shingle level.

Result: Extensive experiments show ReaDISH improves reaction prediction performance across diverse benchmarks and demonstrates enhanced robustness with an average 8.76% improvement on R² under permutation perturbations.

Conclusion: ReaDISH successfully addresses critical limitations in chemical reaction prediction by learning permutation-invariant representations while incorporating interaction-aware features, leading to more consistent and generalizable predictions.

Abstract: Chemical reaction prediction remains a fundamental challenge in organic chemistry, where existing machine learning models face two critical limitations: sensitivity to input permutations (molecule/atom orderings) and inadequate modeling of substructural interactions governing reactivity. These shortcomings lead to inconsistent predictions and poor generalization to real-world scenarios. To address these challenges, we propose ReaDISH, a novel reaction prediction model that learns permutation-invariant representations while incorporating interaction-aware features. It introduces two innovations: (1) symmetric difference shingle encoding, which extends the differential reaction fingerprint (DRFP) by representing shingles as continuous high-dimensional embeddings, capturing structural changes while eliminating order sensitivity; and (2) geometry-structure interaction attention, a mechanism that models intra- and inter-molecular interactions at the shingle level. Extensive experiments demonstrate that ReaDISH improves reaction prediction performance across diverse benchmarks. It shows enhanced robustness with an average improvement of 8.76% on R$^2$ under permutation perturbations.

João Mattos, Debolina Halder Lina, Arlei Silva

Main category: cs.LG

TL;DR: The paper critiques existing fairness approaches in link prediction, showing dyadic fairness definitions obscure subgroup disparities and demographic parity is inadequate for ranking tasks. It proposes a new evaluation framework and a lightweight post-processing method that achieves superior fairness-utility trade-offs.

DetailsMotivation: Current fairness approaches in link prediction use dyadic definitions that can hide underlying subgroup disparities, and demographic parity doesn't work well for ranking-based tasks like link prediction, allowing systemic biases to persist undetected.

Method: The authors formalize limitations of existing fairness evaluations and propose a new expressive assessment framework. They also develop a lightweight post-processing method combined with decoupled link predictors to mitigate bias.

Result: The proposed method effectively mitigates bias and achieves state-of-the-art fairness-utility trade-offs in link prediction tasks.

Conclusion: Existing dyadic fairness definitions are insufficient for link prediction, and the proposed framework and post-processing method provide better bias detection and mitigation while maintaining utility.

Abstract: Link prediction is a fundamental task in graph machine learning with applications, ranging from social recommendation to knowledge graph completion. Fairness in this setting is critical, as biased predictions can exacerbate societal inequalities. Prior work adopts a dyadic definition of fairness, enforcing fairness through demographic parity between intra-group and inter-group link predictions. However, we show that this dyadic framing can obscure underlying disparities across subgroups, allowing systemic biases to go undetected. Moreover, we argue that demographic parity does not meet desired properties for fairness assessment in ranking-based tasks such as link prediction. We formalize the limitations of existing fairness evaluations and propose a framework that enables a more expressive assessment. Additionally, we propose a lightweight post-processing method combined with decoupled link predictors that effectively mitigates bias and achieves state-of-the-art fairness-utility trade-offs.

[1172] Beyond Observations: Reconstruction Error-Guided Irregularly Sampled Time Series Representation Learning

Jiexi Liu, Meng Cao, Songcan Chen

Main category: cs.LG

TL;DR: iTimER is a self-supervised pre-training framework that uses reconstruction errors as learning signals to improve irregularly sampled time series representation learning through pseudo-observation generation and distribution alignment.

DetailsMotivation: Existing methods for irregularly sampled time series overlook the valuable learning signal from reconstruction errors during training, which can serve as informative proxies for unobserved values.

Method: Models reconstruction error distribution over observed values, generates pseudo-observations via mixup between sampled errors and last observations, uses Wasserstein metric for distribution alignment, and incorporates contrastive learning.

Result: Extensive experiments show iTimER consistently outperforms state-of-the-art methods on classification, interpolation, and forecasting tasks for irregularly sampled time series.

Conclusion: Reconstruction errors provide valuable learning signals for irregular time series modeling, and iTimER effectively leverages this insight to achieve superior performance across multiple tasks.

Abstract: Irregularly sampled time series (ISTS), characterized by non-uniform time intervals with natural missingness, are prevalent in real-world applications. Existing approaches for ISTS modeling primarily rely on observed values to impute unobserved ones or infer latent dynamics. However, these methods overlook a critical source of learning signal: the reconstruction error inherently produced during model training. Such error implicitly reflects how well a model captures the underlying data structure and can serve as an informative proxy for unobserved values. To exploit this insight, we propose iTimER, a simple yet effective self-supervised pre-training framework for ISTS representation learning. iTimER models the distribution of reconstruction errors over observed values and generates pseudo-observations for unobserved timestamps through a mixup strategy between sampled errors and the last available observations. This transforms unobserved timestamps into noise-aware training targets, enabling meaningful reconstruction signals. A Wasserstein metric aligns reconstruction error distributions between observed and pseudo-observed regions, while a contrastive learning objective enhances the discriminability of learned representations. Extensive experiments on classification, interpolation, and forecasting tasks demonstrate that iTimER consistently outperforms state-of-the-art methods under the ISTS setting.

[1173] Learning Quantized Continuous Controllers for Integer Hardware

Fabian Kresse, Christoph H. Lampert

Main category: cs.LG

TL;DR: Quantization-aware training enables deployment of reinforcement learning policies on FPGAs using only 2-3 bits per weight and activation, achieving microsecond latencies and microjoule energy consumption while maintaining performance comparable to FP32 policies.

DetailsMotivation: Deploying continuous-control RL policies on embedded hardware requires meeting tight latency and power budgets, which small FPGAs can deliver but only if costly floating-point pipelines are avoided.

Method: Developed a learning-to-hardware pipeline that uses quantization-aware training (QAT) to automatically select low-bit policies and synthesize them to an Artix-7 FPGA for integer inference.

Result: Achieved policies competitive with FP32 using only 2-3 bits per weight and activation, with inference latencies of microseconds and microjoules per action on hardware. Also observed increased input noise robustness compared to floating-point baseline.

Conclusion: Quantization-aware training enables efficient deployment of RL policies on embedded FPGAs with minimal precision requirements while maintaining performance and even improving noise robustness.

Abstract: Deploying continuous-control reinforcement learning policies on embedded hardware requires meeting tight latency and power budgets. Small FPGAs can deliver these, but only if costly floating point pipelines are avoided. We study quantization-aware training (QAT) of policies for integer inference and we present a learning-to-hardware pipeline that automatically selects low-bit policies and synthesizes them to an Artix-7 FPGA. Across five MuJoCo tasks, we obtain policy networks that are competitive with full precision (FP32) policies but require as few as 3 or even only 2 bits per weight, and per internal activation value, as long as input precision is chosen carefully. On the target hardware, the selected policies achieve inference latencies on the order of microseconds and consume microjoules per action, favorably comparing to a quantized reference. Last, we observe that the quantized policies exhibit increased input noise robustness compared to the floating-point baseline.

[1174] A Generalized Spectral Framework to Expain Neural Scaling and Compression Dynamics

Yizhou Zhang

Main category: cs.LG

TL;DR: A generalized spectral framework unifies learning dynamics and compression phenomena through a polynomial spectral evolution function, recovering existing theories as special cases.

DetailsMotivation: To reconcile apparently distinct scaling behaviors in model learning and compression by developing a unified spectral analysis framework.

Method: Developed a generalized spectral framework with polynomial spectral evolution function g(λ,t;β) characterized by spectral-temporal elasticity ρ(β), extending beyond the linear kernel form.

Result: The framework successfully unifies learning dynamics and compression phenomena, recovering lazy and feature-learning theories as special cases.

Conclusion: The generalized spectral framework provides an invariant relation between learning and compression, offering a unified perspective on scaling behaviors across different regimes.

Abstract: Empirical scaling laws describe how test loss and other performance metrics depend on model size, dataset size, and compute. While such laws are consistent within specific regimes, apparently distinct scaling behaviors have been reported for related settings such as model compression. Motivated by recent progress in spectral analyses of neural representations, this paper develops a \emph{generalized spectral framework} that unifies learning dynamics and compression phenomena under a common functional ansatz. We generalize the spectral evolution function from the linear kernel form $g(λt)=λt$ to an asymptotically polynomial function $g(λ,t;β)$, characterized by an effective spectral–temporal elasticity $ρ(β)$. This framework recovers existing lazy and feature-learning theories as special cases and yields an invariant relation between learning and compression

[1175] Test-driven Reinforcement Learning

Zhao Yu, Xiuping Wu, Liangjun Ke

Main category: cs.LG

TL;DR: TdRL replaces single reward functions with multiple test functions to simplify RL task definition and improve learning, achieving comparable or better performance than handcrafted rewards.

DetailsMotivation: Manual reward design in RL is challenging and often leads to suboptimal task representation since rewards serve dual purposes of defining goals and guiding learning.

Method: Proposes Test-driven RL framework using pass-fail tests for optimal objective definition and indicative tests for learning guidance, with lexicographic heuristic for trajectory comparison and maximum entropy policy optimization.

Result: Experiments on DeepMind Control Suite show TdRL matches or outperforms handcrafted reward methods with simpler design and inherent multi-objective optimization support.

Conclusion: TdRL provides a novel perspective for task objective representation that helps address reward design challenges in RL applications.

Abstract: Reinforcement learning (RL) has been recognized as a powerful tool for robot control tasks. RL typically employs reward functions to define task objectives and guide agent learning. However, since the reward function serves the dual purpose of defining the optimal goal and guiding learning, it is challenging to design the reward function manually, which often results in a suboptimal task representation. To tackle the reward design challenge in RL, inspired by the satisficing theory, we propose a Test-driven Reinforcement Learning (TdRL) framework. In the TdRL framework, multiple test functions are used to represent the task objective rather than a single reward function. Test functions can be categorized as pass-fail tests and indicative tests, each dedicated to defining the optimal objective and guiding the learning process, respectively, thereby making defining tasks easier. Building upon such a task definition, we first prove that if a trajectory return function assigns higher returns to trajectories closer to the optimal trajectory set, maximum entropy policy optimization based on this return function will yield a policy that is closer to the optimal policy set. Then, we introduce a lexicographic heuristic approach to compare the relative distance relationship between trajectories and the optimal trajectory set for learning the trajectory return function. Furthermore, we develop an algorithm implementation of TdRL. Experimental results on the DeepMind Control Suite benchmark demonstrate that TdRL matches or outperforms handcrafted reward methods in policy training, with greater design simplicity and inherent support for multi-objective optimization. We argue that TdRL offers a novel perspective for representing task objectives, which could be helpful in addressing the reward design challenges in RL applications.

[1176] Towards Non-Stationary Time Series Forecasting with Temporal Stabilization and Frequency Differencing

Junkai Lu, Peng Chen, Chenjuan Guo, Yang Shu, Meng Wang, Bin Yang

Main category: cs.LG

TL;DR: DTAF is a dual-branch framework that addresses non-stationarity in time series forecasting by handling temporal distribution shifts and spectral variability through temporal stabilizing fusion and frequency wave modeling modules.

DetailsMotivation: Real-world time series often exhibit non-stationarity including temporal distribution shifts and spectral variability, which pose significant challenges for long-term time series forecasting across domains like energy, finance, transportation, and cloud computing.

Method: DTAF uses a dual-branch approach: Temporal Stabilizing Fusion (TFS) module employs non-stationary mix of experts filter to suppress temporal non-stationary patterns while preserving dependencies; Frequency Wave Modeling (FWM) module applies frequency differencing to highlight components with spectral shifts.

Result: Extensive experiments on real-world benchmarks show DTAF outperforms state-of-the-art baselines and yields significant improvements in forecasting accuracy under non-stationary conditions.

Conclusion: DTAF effectively addresses non-stationarity in both temporal and frequency domains, generating robust forecasts that adapt to complex real-world time series patterns.

Abstract: Time series forecasting is critical for decision-making across dynamic domains such as energy, finance, transportation, and cloud computing. However, real-world time series often exhibit non-stationarity, including temporal distribution shifts and spectral variability, which pose significant challenges for long-term time series forecasting. In this paper, we propose DTAF, a dual-branch framework that addresses non-stationarity in both the temporal and frequency domains. For the temporal domain, the Temporal Stabilizing Fusion (TFS) module employs a non-stationary mix of experts (MOE) filter to disentangle and suppress temporal non-stationary patterns while preserving long-term dependencies. For the frequency domain, the Frequency Wave Modeling (FWM) module applies frequency differencing to dynamically highlight components with significant spectral shifts. By fusing the complementary outputs of TFS and FWM, DTAF generates robust forecasts that adapt to both temporal and frequency domain non-stationarity. Extensive experiments on real-world benchmarks demonstrate that DTAF outperforms state-of-the-art baselines, yielding significant improvements in forecasting accuracy under non-stationary conditions. All codes are available at https://github.com/PandaJunk/DTAF.

[1177] Practical Global and Local Bounds in Gaussian Process Regression via Chaining

Junyi Liu, Stanley Kok

Main category: cs.LG

TL;DR: A chaining-based framework for estimating bounds on expected extreme values in Gaussian process regression, providing tighter kernel-specific bounds and local uncertainty quantification without requiring input feature access.

DetailsMotivation: Existing uncertainty bounds in GPR require specific input features, rely on posterior estimates or hyperparameter tuning, lack robustness, and fail to capture global model behavior in expectation.

Method: Proposed chaining-based framework with kernel-specific refinements for RBF and Matérn kernels, avoiding analytical relaxations for numerical tightness, and developing local uncertainty quantification using chaining geometry through partition diameters.

Result: Experimental validation shows tighter bounds than generic constructions and outperforms existing approaches on synthetic and real-world datasets.

Conclusion: The proposed framework provides robust global and local uncertainty quantification in GPR without requiring input feature access, with kernel-specific refinements offering improved tightness over existing methods.

Abstract: Gaussian process regression (GPR) is a popular nonparametric Bayesian method that provides predictive uncertainty estimates and is widely used in safety-critical applications. While prior research has introduced various uncertainty bounds, most existing approaches require access to specific input features, and rely on posterior mean and variance estimates or the tuning of hyperparameters. These limitations hinder robustness and fail to capture the model’s global behavior in expectation. To address these limitations, we propose a chaining-based framework for estimating upper and lower bounds on the expected extreme values over unseen data, without requiring access to specific input features. We provide kernel-specific refinements for commonly used kernels such as RBF and Matérn, in which our bounds are tighter than generic constructions. We further improve numerical tightness by avoiding analytical relaxations. In addition to global estimation, we also develop a novel method for local uncertainty quantification at specified inputs. This approach leverages chaining geometry through partition diameters, adapting to local structures without relying on posterior variance scaling. Our experimental results validate the theoretical findings and demonstrate that our method outperforms existing approaches on both synthetic and real-world datasets.

[1178] GenePheno: Interpretable Gene Knockout-Induced Phenotype Abnormality Prediction from Gene Sequences

Jingquan Yan, Yuwei Miao, Lei Yu, Yuzhi Guo, Xue Xiao, Lin Xu, Junzhou Huang

Main category: cs.LG

TL;DR: GenePheno is the first interpretable multi-label framework that predicts knockout-induced phenotypic abnormalities directly from gene sequences, using contrastive learning and biological regularization to capture phenotype correlations and functional mechanisms.

DetailsMotivation: Existing methods either focus on limited phenotypes or rely on curated genetic information, limiting scalability. The gap between sequences and phenotypes, plus pleiotropic gene relationships, makes direct prediction from sequences challenging.

Method: Uses contrastive multi-label learning with exclusive regularization for biological consistency, plus a gene function bottleneck layer for interpretability. Trained on curated datasets with gene sequences as input and multi-label phenotypic abnormalities as targets.

Result: Achieves state-of-the-art gene-centric F_max and phenotype-centric AUC across four datasets. Case studies demonstrate ability to reveal gene functional mechanisms.

Conclusion: GenePheno successfully bridges the modality gap between sequences and phenotypes, providing an interpretable and scalable framework for predicting knockout-induced abnormalities directly from gene sequences.

Abstract: Exploring how genetic sequences shape phenotypes is a fundamental challenge in biology and a key step toward scalable, hypothesis-driven experimentation. The task is complicated by the large modality gap between sequences and phenotypes, as well as the pleiotropic nature of gene-phenotype relationships. Existing sequence-based efforts focus on the degree to which variants of specific genes alter a limited set of phenotypes, while general gene knockout induced phenotype abnormality prediction methods heavily rely on curated genetic information as inputs, which limits scalability and generalizability. As a result, the task of broadly predicting the presence of multiple phenotype abnormalities under gene knockout directly from gene sequences remains underexplored. We introduce GenePheno, the first interpretable multi-label prediction framework that predicts knockout induced phenotypic abnormalities from gene sequences. GenePheno employs a contrastive multi-label learning objective that captures inter-phenotype correlations, complemented by an exclusive regularization that enforces biological consistency. It further incorporates a gene function bottleneck layer, offering human interpretable concepts that reflect functional mechanisms behind phenotype formation. To support progress in this area, we curate four datasets with canonical gene sequences as input and multi-label phenotypic abnormalities induced by gene knockouts as targets. Across these datasets, GenePheno achieves state-of-the-art gene-centric $F_{\text{max}}$ and phenotype-centric AUC, and case studies demonstrate its ability to reveal gene functional mechanisms.

[1179] Parametric Expensive Multi-Objective Optimization via Generative Solution Modeling

Tingyang Wei, Jiao Liu, Abhishek Gupta, Chin Chun Ooi, Puay Siew Tan, Yew-Soon Ong

Main category: cs.LG

TL;DR: First parametric multi-objective Bayesian optimizer that learns an inverse model to predict optimized solutions for any task-preference query without expensive re-evaluation, using alternating acquisition-driven search and generative solution sampling.

DetailsMotivation: Real-world applications require solving families of expensive multi-objective optimization problems under varying conditions, but current methods can't handle the continuous task parameter space containing infinite distinct problems.

Method: Alternating framework between (1) acquisition-driven search leveraging inter-task synergies via task-aware Gaussian processes, and (2) generative solution sampling via conditional generative models.

Result: Theoretical justification for faster convergence through inter-task synergies, and empirical verification in synthetic and real-world benchmarks showing effective direct solution prediction for unseen parameterized EMOPs.

Conclusion: The approach enables efficient optimization across related tasks and achieves direct solution prediction for unseen parameterized expensive multi-objective optimization problems without additional expensive evaluations.

Abstract: Many real-world applications require solving families of expensive multi-objective optimization problems~(EMOPs) under varying operational conditions. This gives rise to parametric expensive multi-objective optimization problems (P-EMOPs) where each task parameter defines a distinct optimization instance. Current multi-objective Bayesian optimization methods have been widely used for finding finite sets of Pareto optimal solutions for individual tasks. However, P-EMOPs present a fundamental challenge: the continuous task parameter space can contain infinite distinct problems, each requiring separate expensive evaluations. This demands learning an inverse model that can directly predict optimized solutions for any task-preference query without expensive re-evaluation. This paper introduces the first parametric multi-objective Bayesian optimizer that learns this inverse model by alternating between (1) acquisition-driven search leveraging inter-task synergies and (2) generative solution sampling via conditional generative models. This approach enables efficient optimization across related tasks and finally achieves direct solution prediction for unseen parameterized EMOPs without additional expensive evaluations. We theoretically justify the faster convergence by leveraging inter-task synergies through task-aware Gaussian processes. Meanwhile, empirical studies in synthetic and real-world benchmarks further verify the effectiveness of our alternating framework.

[1180] History Rhymes: Macro-Contextual Retrieval for Robust Financial Forecasting

Sarthak Khanna, Armin Berger, Muskaan Chopra, David Berghaus, Rafet Sifa

Main category: cs.LG

TL;DR: Introduces macro-contextual retrieval, a retrieval-augmented forecasting framework that grounds predictions in historically analogous macroeconomic regimes to address non-stationarity in financial markets.

DetailsMotivation: Financial markets are inherently non-stationary with structural breaks and macroeconomic regime shifts causing forecasting models to fail when deployed out of distribution. Conventional multimodal approaches don't adapt well to such shifts.

Method: Jointly embeds macro indicators (CPI, unemployment, yield spread, GDP growth) and financial news sentiment in a shared similarity space, enabling causal retrieval of precedent periods during inference without retraining.

Result: Achieves positive out-of-sample trading outcomes (AAPL: PF=1.18, Sharpe=0.95; XOM: PF=1.16, Sharpe=0.61) while static baselines collapse under regime shifts. Also provides interpretable evidence chains corresponding to recognizable macro contexts.

Conclusion: Macro-aware retrieval yields robust, explainable forecasts under distributional change by operationalizing the principle that financial history often rhymes, demonstrating effectiveness in handling regime shifts.

Abstract: Financial markets are inherently non-stationary: structural breaks and macroeconomic regime shifts often cause forecasting models to fail when deployed out of distribution (OOD). Conventional multimodal approaches that simply fuse numerical indicators and textual sentiment rarely adapt to such shifts. We introduce macro-contextual retrieval, a retrieval-augmented forecasting framework that grounds each prediction in historically analogous macroeconomic regimes. The method jointly embeds macro indicators (e.g., CPI, unemployment, yield spread, GDP growth) and financial news sentiment in a shared similarity space, enabling causal retrieval of precedent periods during inference without retraining. Trained on seventeen years of S&P 500 data (2007-2023) and evaluated OOD on AAPL (2024) and XOM (2024), the framework consistently narrows the CV to OOD performance gap. Macro-conditioned retrieval achieves the only positive out-of-sample trading outcomes (AAPL: PF=1.18, Sharpe=0.95; XOM: PF=1.16, Sharpe=0.61), while static numeric, text-only, and naive multimodal baselines collapse under regime shifts. Beyond metric gains, retrieved neighbors form interpretable evidence chains that correspond to recognizable macro contexts, such as inflationary or yield-curve inversion phases, supporting causal interpretability and transparency. By operationalizing the principle that “financial history may not repeat, but it often rhymes,” this work demonstrates that macro-aware retrieval yields robust, explainable forecasts under distributional change. All datasets, models, and source code are publicly available.

[1181] SMoFi: Step-wise Momentum Fusion for Split Federated Learning on Heterogeneous Data

Mingkun Yang, Ran Zhu, Qing Wang, Jie Yang

Main category: cs.LG

TL;DR: SMoFi is a lightweight framework that improves Split Federated Learning by synchronizing momentum buffers and using staleness-aware alignment to handle data heterogeneity, achieving up to 7.1% accuracy improvement and 10.25x faster convergence.

DetailsMotivation: Data heterogeneity across silos in Split Federated Learning undermines convergence speed and accuracy of global models, creating a need for effective solutions.

Method: Step-wise Momentum Fusion (SMoFi) synchronizes momentum buffers across server-side optimizers and uses staleness-aware alignment mechanism to constrain gradient updates of server-side submodel.

Result: Extensive validations show SMoFi improves global model accuracy up to 7.1% and convergence speed up to 10.25x, with greater impact on more clients and deeper models.

Conclusion: SMoFi is particularly suitable for model training in resource-constrained contexts due to its effectiveness with more clients and deeper learning models.

Abstract: Split Federated Learning is a system-efficient federated learning paradigm that leverages the rich computing resources at a central server to train model partitions. Data heterogeneity across silos, however, presents a major challenge undermining the convergence speed and accuracy of the global model. This paper introduces Step-wise Momentum Fusion (SMoFi), an effective and lightweight framework that counteracts gradient divergence arising from data heterogeneity by synchronizing the momentum buffers across server-side optimizers. To control gradient divergence over the training process, we design a staleness-aware alignment mechanism that imposes constraints on gradient updates of the server-side submodel at each optimization step. Extensive validations on multiple real-world datasets show that SMoFi consistently improves global model accuracy (up to 7.1%) and convergence speed (up to 10.25$\times$). Furthermore, SMoFi has a greater impact with more clients involved and deeper learning models, making it particularly suitable for model training in resource-constrained contexts.

[1182] T2IBias: Uncovering Societal Bias Encoded in the Latent Space of Text-to-Image Generative Models

Abu Sufian, Cosimo Distante, Marco Leo, Hanan Salam

Main category: cs.LG

TL;DR: This paper investigates systematic race and gender biases in five popular text-to-image models, finding they encode and amplify societal stereotypes in profession-related image generation.

DetailsMotivation: To address critical concerns about responsible AI management, specifically how T2I models reproduce and amplify race- and gender-related stereotypes that undermine organizational ethics.

Method: Empirical study across five most popular open-source T2I models using ten neutral profession-related prompts, generating 100 images per profession (5,000 total images) evaluated by diverse human assessors representing different races and genders.

Result: All five models encode pronounced societal biases: caregiving roles feminized, high-status professions overwhelmingly represented by White males. Model-specific patterns identified (QWEN-Image focuses on East Asians, Kandinsky dominates White individuals, SDXL broader but still biased).

Conclusion: Provides critical insights for AI practitioners to select equitable models and prompts aligned with responsible AI principles, discusses bias risks and proposes actionable mitigation strategies for building responsible GenAI systems.

Abstract: Text-to-image (T2I) generative models are largely used in AI-powered real-world applications and value creation. However, their strategic deployment raises critical concerns for responsible AI management, particularly regarding the reproduction and amplification of race- and gender-related stereotypes that can undermine organizational ethics. In this work, we investigate whether such societal biases are systematically encoded within the pretrained latent spaces of state-of-the-art T2I models. We conduct an empirical study across the five most popular open-source models, using ten neutral, profession-related prompts to generate 100 images per profession, resulting in a dataset of 5,000 images evaluated by diverse human assessors representing different races and genders. We demonstrate that all five models encode and amplify pronounced societal skew: caregiving and nursing roles are consistently feminized, while high-status professions such as corporate CEO, politician, doctor, and lawyer are overwhelmingly represented by males and mostly White individuals. We further identify model-specific patterns, such as QWEN-Image’s near-exclusive focus on East Asian outputs, Kandinsky’s dominance of White individuals, and SDXL’s comparatively broader but still biased distributions. These results provide critical insights for AI project managers and practitioners, enabling them to select equitable AI models and customized prompts that generate images in alignment with the principles of responsible AI. We conclude by discussing the risks of these biases and proposing actionable strategies for bias mitigation in building responsible GenAI systems. The code and Data Repository: https://github.com/Sufianlab/T2IBias

[1183] Learning Intersections of Two Margin Halfspaces under Factorizable Distributions

Ilias Diakonikolas, Mingchen Ma, Lisheng Ren, Christos Tzamos

Main category: cs.LG

TL;DR: The paper introduces a novel algorithm that circumvents the CSQ hardness barrier for learning intersections of halfspaces, achieving polynomial time complexity under factorizable distributions.

DetailsMotivation: Learning intersections of halfspaces is a major open problem in Computational Learning Theory, with best-known algorithms running in quasi-polynomial time. The goal is to overcome the CSQ hardness barrier and achieve polynomial time learning.

Method: The approach uses a novel duality framework to analyze moment tensor structure, combining refined Jennrich’s Algorithm with PCA over random projections and gradient-descent-based non-convex optimization.

Result: The algorithm achieves poly(d,1/γ) time complexity under factorizable distributions, establishing a strong separation between CSQ and SQ methods which still require quasipolynomial time.

Conclusion: The work demonstrates that more general statistical queries (SQ) can overcome CSQ hardness barriers for learning intersections of halfspaces, providing efficient learning algorithms for a broad class of distributions.

Abstract: Learning intersections of halfspaces is a central problem in Computational Learning Theory. Even for just two halfspaces, it remains a major open question whether learning is possible in polynomial time with respect to the margin $γ$ of the data points and their dimensionality $d$. The best-known algorithms run in quasi-polynomial time $d^{O(\log(1/γ))}$, and it has been shown that this complexity is unavoidable for any algorithm relying solely on correlational statistical queries (CSQ). In this work, we introduce a novel algorithm that provably circumvents the CSQ hardness barrier. Our approach applies to a broad class of distributions satisfying a natural, previously studied, factorizability assumption. Factorizable distributions lie between distribution-specific and distribution-free settings, and significantly extend previously known tractable cases. Under these distributions, we show that CSQ-based methods still require quasipolynomial time even for weakly learning, whereas our algorithm achieves $poly(d,1/γ)$ time by leveraging more general statistical queries (SQ), establishing a strong separation between CSQ and SQ for this simple realizable PAC learning problem. Our result is grounded in a rigorous analysis utilizing a novel duality framework that characterizes the moment tensor structure induced by the marginal distributions. Building on these structural insights, we propose new, efficient learning algorithms. These algorithms combine a refined variant of Jennrich’s Algorithm with PCA over random projections of the moment tensor, along with a gradient-descent-based non-convex optimization framework.

[1184] Explore and Establish Synergistic Effects Between Weight Pruning and Coreset Selection in Neural Network Training

Weilin Wan, Fan Yi, Weizhong Zhang, Quan Zhou, Cheng Jin

Main category: cs.LG

TL;DR: SWaST is a simultaneous weight pruning and coreset selection mechanism that creates synergistic effects in training, achieving significant computational efficiency gains and accuracy improvements while addressing the critical double-loss problem through state preservation.

DetailsMotivation: Modern deep neural networks suffer from high computational costs due to massive model weights and training samples. The paper explores the interplay between redundant weights and samples, where redundant samples cause weights to overtune and irrelevant weights overfit noisy data, undermining the effectiveness of pruning and coreset selection when used independently.

Method: Developed Simultaneous Weight and Sample Tailoring (SWaST) that alternately performs weight pruning and coreset selection. To address the critical double-loss problem (where important weights and supportive samples are mistakenly removed together), integrated a state preservation mechanism for stable joint optimization.

Result: Extensive experiments show strong synergy between pruning and coreset selection across varying prune rates and coreset sizes, achieving accuracy boosts up to 17.83% alongside 10% to 90% FLOPs reductions.

Conclusion: The interplay between weight pruning and coreset selection can be effectively harnessed through SWaST with state preservation, creating synergistic effects that significantly improve both computational efficiency and model accuracy in deep learning.

Abstract: Modern deep neural networks rely heavily on massive model weights and training samples, incurring substantial computational costs. Weight pruning and coreset selection are two emerging paradigms proposed to improve computational efficiency. In this paper, we first explore the interplay between redundant weights and training samples through a transparent analysis: redundant samples, particularly noisy ones, cause model weights to become unnecessarily overtuned to fit them, complicating the identification of irrelevant weights during pruning; conversely, irrelevant weights tend to overfit noisy data, undermining coreset selection effectiveness. To further investigate and harness this interplay in deep learning, we develop a Simultaneous Weight and Sample Tailoring mechanism (SWaST) that alternately performs weight pruning and coreset selection to establish a synergistic effect in training. During this investigation, we observe that when simultaneously removing a large number of weights and samples, a phenomenon we term critical double-loss can occur, where important weights and their supportive samples are mistakenly eliminated at the same time, leading to model instability and nearly irreversible degradation that cannot be recovered in subsequent training. Unlike classic machine learning models, this issue can arise in deep learning due to the lack of theoretical guarantees on the correctness of weight pruning and coreset selection, which explains why these paradigms are often developed independently. We mitigate this by integrating a state preservation mechanism into SWaST, enabling stable joint optimization. Extensive experiments reveal a strong synergy between pruning and coreset selection across varying prune rates and coreset sizes, delivering accuracy boosts of up to 17.83% alongside 10% to 90% FLOPs reductions.

[1185] Out-of-Context Misinformation Detection via Variational Domain-Invariant Learning with Test-Time Training

Xi Yang, Han Zhang, Zhijian Lin, Yibiao Hu, Hong Han

Main category: cs.LG

TL;DR: VDT enhances domain adaptation for out-of-context misinformation detection by learning domain-invariant features and using test-time training to handle novel news domains.

DetailsMotivation: Current methods for OOC misinformation detection assume same training-test distributions and perform poorly on novel domains due to lack of prior knowledge.

Method: Proposes VDT with Domain-Invariant Variational Align module for joint encoding, domain consistency constraints for semantic integrity, and test-time training with confidence-variance filtering.

Result: Extensive experiments on NewsCLIPpings dataset show VDT outperforms state-of-the-art baselines under most domain adaptation settings.

Conclusion: VDT effectively addresses domain adaptation challenges in OOC misinformation detection through domain-invariant feature learning and dynamic test-time adaptation.

Abstract: Out-of-context misinformation (OOC) is a low-cost form of misinformation in news reports, which refers to place authentic images into out-of-context or fabricated image-text pairings. This problem has attracted significant attention from researchers in recent years. Current methods focus on assessing image-text consistency or generating explanations. However, these approaches assume that the training and test data are drawn from the same distribution. When encountering novel news domains, models tend to perform poorly due to the lack of prior knowledge. To address this challenge, we propose \textbf{VDT} to enhance the domain adaptation capability for OOC misinformation detection by learning domain-invariant features and test-time training mechanisms. Domain-Invariant Variational Align module is employed to jointly encodes source and target domain data to learn a separable distributional space domain-invariant features. For preserving semantic integrity, we utilize domain consistency constraint module to reconstruct the source and target domain latent distribution. During testing phase, we adopt the test-time training strategy and confidence-variance filtering module to dynamically updating the VAE encoder and classifier, facilitating the model’s adaptation to the target domain distribution. Extensive experiments conducted on the benchmark dataset NewsCLIPpings demonstrate that our method outperforms state-of-the-art baselines under most domain adaptation settings.

[1186] Near-optimal Linear Predictive Clustering in Non-separable Spaces via Mixed Integer Programming and Quadratic Pseudo-Boolean Reductions

Jiazhou Liang, Hassan Khurram, Scott Sanner

Main category: cs.LG

TL;DR: Two novel approaches for Linear Predictive Clustering that improve global optimization efficiency using separability properties and QPBO approximation, achieving near-optimal solutions with better scalability than existing methods.

DetailsMotivation: Existing greedy optimization methods for LPC lack global optimality and struggle with non-separable clusters, while MIP formulations ensure global optimality but suffer from poor scalability.

Method: Leverage theoretical properties of separability to derive near-optimal approximations with provable error bounds, and approximate LPC as Quadratic Pseudo-Boolean Optimization (QPBO) problem.

Result: Methods achieve near-optimal solutions with substantially lower regression errors than greedy optimization and superior scalability over existing MIP formulations on synthetic and real-world datasets.

Conclusion: The proposed approaches successfully bridge the gap between greedy optimization and MIP formulations, providing efficient global optimization for LPC with practical scalability.

Abstract: Linear Predictive Clustering (LPC) partitions samples based on shared linear relationships between feature and target variables, with numerous applications including marketing, medicine, and education. Greedy optimization methods, commonly used for LPC, alternate between clustering and linear regression but lack global optimality. While effective for separable clusters, they struggle in non-separable settings where clusters overlap in feature space. In an alternative constrained optimization paradigm, Bertsimas and Shioda (2007) formulated LPC as a Mixed-Integer Program (MIP), ensuring global optimality regardless of separability but suffering from poor scalability. This work builds on the constrained optimization paradigm to introduce two novel approaches that improve the efficiency of global optimization for LPC. By leveraging key theoretical properties of separability, we derive near-optimal approximations with provable error bounds, significantly reducing the MIP formulation’s complexity and improving scalability. Additionally, we can further approximate LPC as a Quadratic Pseudo-Boolean Optimization (QPBO) problem, achieving substantial computational improvements in some settings. Comparative analyses on synthetic and real-world datasets demonstrate that our methods consistently achieve near-optimal solutions with substantially lower regression errors than greedy optimization while exhibiting superior scalability over existing MIP formulations.

[1187] Efficient Reinforcement Learning for Zero-Shot Coordination in Evolving Games

Bingyu Hui, Lebin Yu, Quanming Yao, Yunpeng Qu, Xudong Zhang, Jian Wang

Main category: cs.LG

TL;DR: ScaPT is a scalable population training framework for zero-shot coordination that uses parameter sharing and mutual information regularization to efficiently train large populations while maintaining diversity.

DetailsMotivation: Existing population-based methods for zero-shot coordination are limited by computational resources and focus on small populations, missing potential performance gains from scaling population size.

Method: Proposes Scalable Population Training (ScaPT) with two components: a meta-agent that selectively shares parameters across agents for efficiency, and a mutual information regularizer that ensures population diversity.

Result: Empirical evaluation in Hanabi shows ScaPT achieves superior performance compared to representative frameworks, demonstrating its effectiveness in zero-shot coordination.

Conclusion: ScaPT provides an efficient solution for scaling population training in zero-shot coordination, overcoming computational limitations while maintaining diversity and achieving better performance.

Abstract: Zero-shot coordination(ZSC) has become a hot topic in reinforcement learning research recently. It focuses on the generalization ability of agents, requiring them to coordinate well with collaborators that are not seen before without any fine-tuning. Population-based training has been proven to provide good zero-shot coordination performance; nevertheless, existing methods are limited by computational resources, mainly focusing on optimizing diversity in small populations while neglecting the potential performance gains from scaling population size. To address this issue, this paper proposes the Scalable Population Training (ScaPT), an efficient training framework comprising two key components: a meta-agent that efficiently realizes a population by selectively sharing parameters across agents, and a mutual information regularizer that guarantees population diversity. To empirically validate the effectiveness of ScaPT, this paper evaluates it along with representational frameworks in Hanabi and confirms its superiority.

[1188] Virtual Width Networks

Seed, Baisheng Li, Banggu Wu, Bole Ma, Bowen Xiao, Chaoyi Zhang, Cheng Li, Chengyi Wang, Chengyin Xu, Chi Zhang, Chong Hu, Daoguang Zan, Defa Zhu, Dongyu Xu, Du Li, Faming Wu, Fan Xia, Ge Zhang, Guang Shi, Haobin Chen, Hongyu Zhu, Hongzhi Huang, Huan Zhou, Huanzhang Dou, Jianhui Duan, Jianqiao Lu, Jianyu Jiang, Jiayi Xu, Jiecao Chen, Jin Chen, Jin Ma, Jing Su, Jingji Chen, Jun Wang, Jun Yuan, Juncai Liu, Jundong Zhou, Kai Hua, Kai Shen, Kai Xiang, Kaiyuan Chen, Kang Liu, Ke Shen, Liang Xiang, Lin Yan, Lishu Luo, Mengyao Zhang, Ming Ding, Mofan Zhang, Nianning Liang, Peng Li, Penghao Huang, Pengpeng Mu, Qi Huang, Qianli Ma, Qiyang Min, Qiying Yu, Renming Pang, Ru Zhang, Shen Yan, Shen Yan, Shixiong Zhao, Shuaishuai Cao, Shuang Wu, Siyan Chen, Siyu Li, Siyuan Qiao, Tao Sun, Tian Xin, Tiantian Fan, Ting Huang, Ting-Han Fan, Wei Jia, Wenqiang Zhang, Wenxuan Liu, Xiangzhong Wu, Xiaochen Zuo, Xiaoying Jia, Ximing Yang, Xin Liu, Xin Yu, Xingyan Bin, Xintong Hao, Xiongcai Luo, Xujing Li, Xun Zhou, Yanghua Peng, Yangrui Chen, Yi Lin, Yichong Leng, Yinghao Li, Yingshuan Song, Yiyuan Ma, Yong Shan, Yongan Xiang, Yonghui Wu, Yongtao Zhang, Yongzhen Yao, Yu Bao, Yuehang Yang, Yufeng Yuan, Yunshui Li, Yuqiao Xian, Yutao Zeng, Yuxuan Wang, Zehua Hong, Zehua Wang, Zengzhi Wang, Zeyu Yang, Zhengqiang Yin, Zhenyi Lu, Zhexi Zhang, Zhi Chen, Zhi Zhang, Zhiqi Lin, Zihao Huang, Zilin Xu, Ziyun Wei, Zuo Wang

Main category: cs.LG

TL;DR: Virtual Width Networks (VWN) enable wider representations without quadratic cost by decoupling representational width from backbone width, achieving 2-3x optimization acceleration and log-linear scaling with virtual width.

DetailsMotivation: To overcome the quadratic cost of increasing hidden size in neural networks while still benefiting from wider representations for improved model performance and efficiency.

Method: Decouple representational width from backbone width, expanding embedding space while keeping backbone compute nearly constant through virtual width expansion.

Result: 8x expansion accelerates optimization by over 2x for next-token and 3x for next-2-token prediction, with loss gap growing and convergence-speedup ratio increasing over training.

Conclusion: VWN is token-efficient and increasingly effective with scale, with log-linear scaling between virtual width and loss reduction, suggesting virtual-width scaling as a new dimension for large-model efficiency.

Abstract: We introduce Virtual Width Networks (VWN), a framework that delivers the benefits of wider representations without incurring the quadratic cost of increasing the hidden size. VWN decouples representational width from backbone width, expanding the embedding space while keeping backbone compute nearly constant. In our large-scale experiment, an 8-times expansion accelerates optimization by over 2 times for next-token and 3 times for next-2-token prediction. The advantage amplifies over training as both the loss gap grows and the convergence-speedup ratio increases, showing that VWN is not only token-efficient but also increasingly effective with scale. Moreover, we identify an approximately log-linear scaling relation between virtual width and loss reduction, offering an initial empirical basis and motivation for exploring virtual-width scaling as a new dimension of large-model efficiency.

[1189] Quantifying and Improving Adaptivity in Conformal Prediction through Input Transformations

Sooyong Jang, Insup Lee

Main category: cs.LG

TL;DR: Proposes improved metrics for evaluating adaptiveness in conformal prediction using uniform-mass binning, and introduces a new adaptive prediction set algorithm that groups examples by difficulty.

DetailsMotivation: Existing evaluation methods for adaptiveness in conformal prediction suffer from imbalanced binning, leading to inaccurate estimates of coverage or set size.

Method: Uses input transformations to sort examples by difficulty, applies uniform-mass binning, and introduces two new metrics. Also proposes a new algorithm that groups examples by estimated difficulty and applies group-conditional conformal prediction.

Result: The proposed metrics correlate more strongly with desired adaptiveness compared to existing ones. The new algorithm outperforms existing approaches on ImageNet classification and medical visual acuity prediction tasks.

Conclusion: The proposed binning method and metrics provide more reliable adaptiveness evaluation, and the new adaptive prediction set algorithm achieves better performance.

Abstract: Conformal prediction constructs a set of labels instead of a single point prediction, while providing a probabilistic coverage guarantee. Beyond the coverage guarantee, adaptiveness to example difficulty is an important property. It means that the method should produce larger prediction sets for more difficult examples, and smaller ones for easier examples. Existing evaluation methods for adaptiveness typically analyze coverage rate violation or average set size across bins of examples grouped by difficulty. However, these approaches often suffer from imbalanced binning, which can lead to inaccurate estimates of coverage or set size. To address this issue, we propose a binning method that leverages input transformations to sort examples by difficulty, followed by uniform-mass binning. Building on this binning, we introduce two metrics to better evaluate adaptiveness. These metrics provide more reliable estimates of coverage rate violation and average set size due to balanced binning, leading to more accurate adaptivity assessment. Through experiments, we demonstrate that our proposed metric correlates more strongly with the desired adaptiveness property compared to existing ones. Furthermore, motivated by our findings, we propose a new adaptive prediction set algorithm that groups examples by estimated difficulty and applies group-conditional conformal prediction. This allows us to determine appropriate thresholds for each group. Experimental results on both (a) an Image Classification (ImageNet) (b) a medical task (visual acuity prediction) show that our method outperforms existing approaches according to the new metrics.

[1190] A Unified Convergence Analysis for Semi-Decentralized Learning: Sampled-to-Sampled vs. Sampled-to-All Communication

Angelo Rodio, Giovanni Neglia, Zheng Chen, Erik G. Larsson

Main category: cs.LG

TL;DR: The paper compares two communication strategies in semi-decentralized federated learning: sampled-to-sampled (S2S) where only sampled clients receive aggregated models, and sampled-to-all (S2A) where all clients receive aggregated models. The analysis reveals performance depends on data heterogeneity.

DetailsMotivation: Despite practical significance, there was no rigorous theoretical and empirical comparison between S2S and S2A strategies in semi-decentralized FL, where devices primarily use device-to-device communication with occasional server interaction.

Method: Developed a unified convergence framework analyzing S2S and S2A strategies, accounting for key system parameters including sampling rate, server aggregation frequency, and network connectivity.

Result: Analysis reveals distinct performance regimes where one strategy outperforms the other depending primarily on the degree of data heterogeneity across devices.

Conclusion: The study provides concrete design guidelines for practical semi-decentralized FL deployments based on data heterogeneity conditions.

Abstract: In semi-decentralized federated learning, devices primarily rely on device-to-device communication but occasionally interact with a central server. Periodically, a sampled subset of devices uploads their local models to the server, which computes an aggregate model. The server can then either (i) share this aggregate model only with the sampled clients (sampled-to-sampled, S2S) or (ii) broadcast it to all clients (sampled-to-all, S2A). Despite their practical significance, a rigorous theoretical and empirical comparison of these two strategies remains absent. We address this gap by analyzing S2S and S2A within a unified convergence framework that accounts for key system parameters: sampling rate, server aggregation frequency, and network connectivity. Our results, both analytical and experimental, reveal distinct regimes where one strategy outperforms the other, depending primarily on the degree of data heterogeneity across devices. These insights lead to concrete design guidelines for practical semi-decentralized FL deployments.

cs.MA

[1191] MALBO: Optimizing LLM-Based Multi-Agent Teams via Multi-Objective Bayesian Optimization

Antonio Sabbatella

Main category: cs.MA

TL;DR: MALBO is a Bayesian optimization framework that automates efficient LLM team composition by finding Pareto-optimal configurations balancing performance and cost.

DetailsMotivation: Current methods lack principled frameworks for multi-agent, multi-objective optimization of LLM assignments in agent teams, facing challenges with combinatorial search space and expensive evaluations.

Method: Uses multi-objective Bayesian Optimization with independent Gaussian Process surrogates to search over continuous LLM feature-space, guided by expected hypervolume improvement.

Result: Reduced average configuration cost by over 45% compared to random search, with specialized heterogeneous teams achieving up to 65.8% cost reduction while maintaining maximum performance.

Conclusion: MALBO provides a data-driven tool for deploying cost-effective, highly specialized multi-agent AI systems through automated Pareto-optimal team configuration.

Abstract: The optimal assignment of Large Language Models (LLMs) to specialized roles in multi-agent systems is a significant challenge, defined by a vast combinatorial search space, expensive black-box evaluations, and an inherent trade-off between performance and cost. Current optimization methods focus on single-agent settings and lack a principled framework for this multi-agent, multi-objective problem. This thesis introduces MALBO (Multi-Agent LLM Bayesian Optimization), a systematic framework designed to automate the efficient composition of LLM-based agent teams. We formalize the assignment challenge as a multi-objective optimization problem, aiming to identify the Pareto front of configurations between task accuracy and inference cost. The methodology employs multi-objective Bayesian Optimization (MOBO) with independent Gaussian Process surrogate models. By searching over a continuous feature-space representation of the LLMs, this approach performs a sample-efficient exploration guided by the expected hypervolume improvement. The primary contribution is a principled and automated methodology that yields a Pareto front of optimal team configurations. Our results demonstrate that the Bayesian optimization phase, compared to an initial random search, maintained a comparable average performance while reducing the average configuration cost by over 45%. Furthermore, MALBO identified specialized, heterogeneous teams that achieve cost reductions of up to 65.8% compared to homogeneous baselines, all while maintaining maximum performance. The framework thus provides a data-driven tool for deploying cost-effective and highly specialized multi-agent AI systems.

[1192] From Single to Societal: Analyzing Persona-Induced Bias in Multi-Agent Interactions

Jiayi Li, Xiao Liu, Yansong Feng

Main category: cs.MA

TL;DR: Personas in LLM-based multi-agent systems introduce biases in trustworthiness and insistence, with historically advantaged groups perceived as less trustworthy and showing less insistence, along with significant in-group favoritism.

DetailsMotivation: To investigate whether assigning personas to LLM-based agents introduces biases in multi-agent interactions, particularly regarding social traits like trustworthiness and insistence.

Method: Conducted controlled experiments in collaborative problem-solving and persuasion tasks across various LLMs, group sizes, and interaction rounds to measure persona-induced biases.

Result: Found that personas from historically advantaged groups (men, White individuals) are perceived as less trustworthy and demonstrate less insistence, and agents show significant in-group favoritism by conforming more to those with similar personas.

Conclusion: Persona-induced biases persist across different conditions, highlighting an urgent need for awareness and mitigation to ensure fairness and reliability in multi-agent systems.

Abstract: Large Language Model (LLM)-based multi-agent systems are increasingly used to simulate human interactions and solve collaborative tasks. A common practice is to assign agents with personas to encourage behavioral diversity. However, this raises a critical yet underexplored question: do personas introduce biases into multi-agent interactions? This paper presents a systematic investigation into persona-induced biases in multi-agent interactions, with a focus on social traits like trustworthiness (how an agent’s opinion is received by others) and insistence (how strongly an agent advocates for its opinion). Through a series of controlled experiments in collaborative problem-solving and persuasion tasks, we reveal that (1) LLM-based agents exhibit biases in both trustworthiness and insistence, with personas from historically advantaged groups (e.g., men and White individuals) perceived as less trustworthy and demonstrating less insistence; and (2) agents exhibit significant in-group favoritism, showing a higher tendency to conform to others who share the same persona. These biases persist across various LLMs, group sizes, and numbers of interaction rounds, highlighting an urgent need for awareness and mitigation to ensure the fairness and reliability of multi-agent systems.

[1193] Conflict-Free Flight Scheduling Using Strategic Demand Capacity Balancing for Urban Air Mobility Operations

Vahid Hemmati, Yonas Ayalew, Ahmad Mohammadi, Reza Ahmari, Parham Kebria, Abdollah Homaifar, Mehrdad Saif

Main category: cs.MA

TL;DR: Proposes a conflict-free multi-agent flight scheduling system for Urban Air Mobility using delayed departures and kinematic principles to ensure safe separation in constrained airspace.

DetailsMotivation: To address the challenge of ensuring robust separation in constrained airspace for emerging Urban Air Mobility operations with increasing traffic densities.

Method: Uses Pairwise Conflict Avoidance (PCA) based on delayed departures and kinematic principles, then expands to multi-agent scenarios with optimization to determine departure times systematically.

Result: Numerical simulations show significant reduction in total delay while maintaining collision-free operations across diverse multi-agent environments and real-world UAM use cases.

Conclusion: Provides a scalable framework for urban air mobility systems that effectively manages flight scheduling with minimal delays and guaranteed safety.

Abstract: In this paper, we propose a conflict-free multi- agent flight scheduling that ensures robust separation in con- strained airspace for Urban Air Mobility (UAM) operations application. First, we introduce Pairwise Conflict Avoidance (PCA) based on delayed departures, leveraging kinematic principles to maintain safe distances. Next, we expand PCA to multi-agent scenarios, formulating an optimization approach that systematically determines departure times under increasing traffic densities. Performance metrics, such as average delay, assess the effectiveness of our solution. Through numerical simulations across diverse multi-agent environments and real- world UAM use cases, our method demonstrates a significant reduction in total delay while ensuring collision-free operations. This approach provides a scalable framework for emerging urban air mobility systems.

[1194] Goal-Oriented Multi-Agent Reinforcement Learning for Decentralized Agent Teams

Hung Du, Hy Nguyen, Srikanth Thudumu, Rajesh Vasa, Kon Mouzakis

Main category: cs.MA

TL;DR: A decentralized Multi-Agent Reinforcement Learning framework enables autonomous vehicles to communicate selectively based on local goals, improving coordination in dynamic environments with limited communication and partial observability.

DetailsMotivation: Connected autonomous vehicles operate in dynamic, unpredictable environments with limited communication, no centralized control, and partial observability, posing significant coordination challenges when vehicles pursue individual objectives.

Method: Proposed a decentralized MARL framework where vehicles communicate selectively based on local goals and observations, using goal-aware communication to share only relevant information while respecting visibility limitations.

Result: Significantly improved task success rates and reduced time-to-goal compared to non-cooperative baselines in complex multi-agent navigation tasks with obstacles and dynamic agent populations. Performance remained stable as agent numbers increased, demonstrating scalability.

Conclusion: Decentralized, goal-driven MARL shows potential for effective coordination in realistic multi-vehicle systems operating across diverse domains, addressing real-world constraints of limited communication and partial observability.

Abstract: Connected and autonomous vehicles across land, water, and air must often operate in dynamic, unpredictable environments with limited communication, no centralized control, and partial observability. These real-world constraints pose significant challenges for coordination, particularly when vehicles pursue individual objectives. To address this, we propose a decentralized Multi-Agent Reinforcement Learning (MARL) framework that enables vehicles, acting as agents, to communicate selectively based on local goals and observations. This goal-aware communication strategy allows agents to share only relevant information, enhancing collaboration while respecting visibility limitations. We validate our approach in complex multi-agent navigation tasks featuring obstacles and dynamic agent populations. Results show that our method significantly improves task success rates and reduces time-to-goal compared to non-cooperative baselines. Moreover, task performance remains stable as the number of agents increases, demonstrating scalability. These findings highlight the potential of decentralized, goal-driven MARL to support effective coordination in realistic multi-vehicle systems operating across diverse domains.

[1195] FINRS: A Risk-Sensitive Trading Framework for Real Financial Markets

Bijia Liu, Ronghao Dang

Main category: cs.MA

TL;DR: FinRS is a risk-sensitive LLM-based trading framework that integrates hierarchical market analysis, dual-decision agents, and multi-timescale reward reflection to improve profitability and stability in volatile markets.

DetailsMotivation: Existing LLM-based trading agents focus on single-step prediction and lack integrated risk management mechanisms, reducing their effectiveness in volatile markets.

Method: Combines hierarchical market analysis, dual-decision agents, and multi-timescale reward reflection to align trading actions with both return objectives and downside risk constraints.

Result: Experiments on multiple stocks and market conditions show superior profitability and stability compared to state-of-the-art methods.

Conclusion: FinRS demonstrates that integrating risk management mechanisms into LLM-based trading frameworks significantly improves performance in volatile financial markets.

Abstract: Large language models (LLMs) have shown strong reasoning capabilities and are increasingly explored for financial trading. Existing LLM-based trading agents, however, largely focus on single-step prediction and lack integrated mechanisms for risk management, which reduces their effectiveness in volatile markets. We introduce FinRS, a risk-sensitive trading framework that combines hierarchical market analysis, dual-decision agents, and multi-timescale reward reflection to align trading actions with both return objectives and downside risk constraints. Experiments on multiple stocks and market conditions show that FinRS achieves superior profitability and stability compared to state-of-the-art methods.

[1196] ENGRAM: Effective, Lightweight Memory Orchestration for Conversational Agents

Daivik Patel, Shrenik Patel

Main category: cs.MA

TL;DR: ENGRAM is a lightweight memory system for LLMs that organizes conversations into three memory types (episodic, semantic, procedural) using simple dense retrieval, achieving state-of-the-art results without complex architectures.

DetailsMotivation: Current memory systems for LLMs use complex architectures like knowledge graphs and multi-stage pipelines, which introduce engineering complexity and reproducibility challenges. There's a need for simpler, more effective long-term memory management.

Method: ENGRAM uses a single router and retriever to organize conversations into three canonical memory types. Each user turn is converted into typed memory records with normalized schemas and embeddings, stored in a database. At query time, it retrieves top-k dense neighbors for each type and merges results with simple set operations.

Result: ENGRAM achieves state-of-the-art results on LoCoMo benchmark and exceeds the full-context baseline by 15 points on LongMemEval while using only about 1% of the tokens.

Conclusion: Careful memory typing and straightforward dense retrieval can enable effective long-term memory management in language models without requiring complex architectures.

Abstract: Large language models (LLMs) deployed in user-facing applications require long-horizon consistency: the ability to remember prior interactions, respect user preferences, and ground reasoning in past events. However, contemporary memory systems often adopt complex architectures such as knowledge graphs, multi-stage retrieval pipelines, and OS-style schedulers, which introduce engineering complexity and reproducibility challenges. We present ENGRAM, a lightweight memory system that organizes conversation into three canonical memory types (episodic, semantic, and procedural) through a single router and retriever. Each user turn is converted into typed memory records with normalized schemas and embeddings and stored in a database. At query time, the system retrieves top-k dense neighbors for each type, merges results with simple set operations, and provides the most relevant evidence as context to the model. ENGRAM attains state-of-the-art results on LoCoMo, a multi-session conversational QA benchmark for long-horizon memory, and exceeds the full-context baseline by 15 points on LongMemEval while using only about 1% of the tokens. These results show that careful memory typing and straightforward dense retrieval can enable effective long-term memory management in language models without requiring complex architectures.

[1197] Reuse, Don’t Recompute: Efficient Large Reasoning Model Inference via Memory Orchestration

Daivik Patel, Shrenik Patel

Main category: cs.MA

TL;DR: ENGRAM-R is an inference-time memory layer that uses typed retrieval with compact fact cards and citation control to reduce token usage by 85% for inputs and 75% for reasoning while maintaining accuracy.

DetailsMotivation: Large reasoning models incur high costs in tokens and latency through test-time scaling. Memory can enable efficient reasoning by reusing structured evidence instead of recomputing derivations.

Method: ENGRAM-R integrates typed retrieval with compact fact card representations and explicit citation control as an inference-time memory layer.

Result: On LoCoMo benchmark: 85% reduction in input tokens, 75% reduction in reasoning tokens while maintaining high accuracy. On LongMemEval multi-hop slice: similar efficiency with substantial accuracy gains.

Conclusion: Memory is critical for long-horizon correctness and serves as a practical lever for efficient reasoning under tight compute, memory, and latency constraints.

Abstract: Large reasoning models (LRMs) achieve strong accuracy through test-time scaling, generating longer chains of thought or sampling multiple solutions, but at steep costs in tokens and latency. We argue that memory is a core ingredient for efficient reasoning: when evidence already exists, models should think less by reusing structured memory instead of recomputing derivations. We present ENGRAM-R, an inference-time memory layer that integrates typed retrieval with compact fact card representations and explicit citation control. On the LoCoMo benchmark, ENGRAM-R reduces input tokens by 85% and reasoning tokens by 75% compared to full context while maintaining high accuracy. On a multi-hop slice of the LongMemEval benchmark, it achieves similar efficiency with substantial accuracy gains. These results show that memory is not only critical for long-horizon correctness but also a practical lever for efficient reasoning under tight compute, memory, and latency budgets.

[1198] LLM-based Multi-Agent System for Simulating Strategic and Goal-Oriented Data Marketplaces

Jun Sashihara, Yukihisa Fujita, Kota Nakamura, Masahiro Kuwahara, Teruaki Hayashi

Main category: cs.MA

TL;DR: Proposes LLM-based multi-agent system for data marketplaces that enables autonomous strategic actions and better reproduces real trading patterns than traditional approaches.

DetailsMotivation: Limited understanding of interactions between market participants, data, and regulations in data marketplaces despite their growing importance for data trading.

Method: LLM-powered buyer and seller agents with explicit objectives autonomously perform strategic actions like planning, searching, purchasing, pricing, and updating data through natural language reasoning.

Result: LLM-MAS more faithfully reproduces real marketplace trading patterns and captures emergence/evolution of market trends compared to traditional model-based simulations.

Conclusion: LLM-MAS framework enables broader and more adaptive behavior in data marketplace simulations, better capturing complex market dynamics through natural language reasoning.

Abstract: Data marketplaces, which mediate the purchase and exchange of data from third parties, have attracted growing attention for reducing the cost and effort of data collection while enabling the trading of diverse datasets. However, a systematic understanding of the interactions between market participants, data, and regulations remains limited. To address this gap, we propose a Large Language Model-based Multi-Agent System (LLM-MAS) for data marketplaces. In our framework, buyer and seller agents powered by LLMs operate with explicit objectives and autonomously perform strategic actions, such as planning, searching, purchasing, pricing, and updating data. These agents can reason about market dynamics, forecast future demand, and adjust strategies accordingly. Unlike conventional model-based simulations, which are typically constrained to predefined rules, LLM-MAS supports broader and more adaptive behavior selection through natural language reasoning. We evaluated the framework via simulation experiments using three distribution-based metrics: (1) the number of purchases per dataset, (2) the number of purchases per buyer, and (3) the number of repeated purchases of the same dataset. The results demonstrate that LLM-MAS more faithfully reproduces trading patterns observed in real data marketplaces compared to traditional approaches, and further captures the emergence and evolution of market trends.

[1199] How Hard is it to Explain Preferences Using Few Boolean Attributes?

Clemens Anzinger, Jiehua Chen, Christian Hatschka, Manuel Sorge, Alexander Temper

Main category: cs.MA

TL;DR: BAM is NP-complete for k≥3 attributes but linear-time solvable for k≤2. The problem is fixed-parameter tractable by number of alternatives and has a linear-time algorithm for two voters.

DetailsMotivation: To understand preference structure and enable efficient decision-making through Boolean attribute models that explain preference data.

Method: Analyze computational complexity of Boolean attribute models (BAMs) where alternatives have Boolean attributes and voters prefer alternatives with more desired attributes.

Result: Established complexity dichotomy: BAM is linear-time solvable for k≤2 attributes but NP-complete for k≥3. Problem is fixed-parameter tractable by number of alternatives. Linear-time algorithm exists for two voters.

Conclusion: BAM exhibits sharp complexity transitions based on number of attributes, with partial information variants showing different tractability patterns.

Abstract: We study the computational complexity of explaining preference data through Boolean attribute models (BAMs), motivated by extensive research involving attribute models and their promise in understanding preference structure and enabling more efficient decision-making processes. In a BAM, each alternative has a subset of Boolean attributes, each voter cares about a subset of attributes, and voters prefer alternatives with more of their desired attributes. In the BAM problem, we are given a preference profile and a number k, and want to know whether there is a Boolean k-attribute model explaining the profile. We establish a complexity dichotomy for the number of attributes k: BAM is linear-time solvable for $k \le 2$ but NP-complete for $k \ge 3$. The problem remains hard even when preference orders have length two. On the positive side, BAM becomes fixed-parameter tractable when parameterized by the number of alternatives m. For the special case of two voters, we provide a linear-time algorithm. We also analyze variants where partial information is given: When voter preferences over attributes are known (BAM WITH CARES) or when alternative attributes are specified (BAM WITH HAS), we show that for most parameters BAM WITH CARES is more difficult whereas BAM WITH HAS is more tractable except for being NP-hard even for one voter.

[1200] Asymptotic analysis of cooperative censoring policies in sensor networks

Jesus Fernandez-Bes, Rocío Arroyo-Valles, Jesús Cid-Sueiro

Main category: cs.MA

TL;DR: Analysis of cooperative data censoring in battery-powered multihop sensor networks using Markov Decision Process to optimize energy efficiency by selectively censoring less important messages.

DetailsMotivation: To address energy conservation in battery-powered sensor networks by selectively censoring less important messages to save energy for future communications.

Method: Modeled the problem using joint Markov Decision Process for network dynamics, found theoretically optimal censoring policy, and proposed centralized algorithm for computing constant-threshold rules.

Result: Experimental simulations show cooperative censoring policies are energy-efficient and outperform non-cooperative schemes.

Conclusion: Cooperative data censoring with threshold-based rules provides effective energy optimization in sensor networks, though optimal policies are computationally complex.

Abstract: The problem of cooperative data censoring in battery-powered multihop sensor networks is analyzed in this paper. We are interested in scenarios where nodes generate messages (which are related to the sensor measurements) that can be graded with some importance value. Less important messages can be censored in order to save energy for later communications. The problem is modeled using a joint Markov Decision Process of the whole network dynamics, and a theoretically optimal censoring policy, which maximizes a long-term reward, is found. Though the optimal censoring rules are computationally prohibitive, our analysis suggests that, under some conditions, they can be approximated by a finite collection of constant-threshold rules. A centralized algorithm for the computation of these thresholds is proposed. The experimental simulations show that cooperative censoring policies are energy-efficient, and outperform other non-cooperative schemes.

[1201] Market-Dependent Communication in Multi-Agent Alpha Generation

Jerick Shi, Burton Hollifield

Main category: cs.MA

TL;DR: Communication improves hedge fund performance but optimal structure depends on market volatility - competitive conversation works best for volatile tech stocks, collaborative for stable general stocks, while finance stocks resist communication interventions.

DetailsMotivation: To determine whether and how analysts in multi-strategy hedge funds should communicate to optimize trading performance, given the fundamental organizational choice between isolation and various communication structures.

Method: Used 5-agent LLM-based trading systems across 450 experiments spanning 21 months, comparing five organizational structures from isolated baseline to collaborative and competitive conversation.

Result: Communication improves performance but optimal design depends on market characteristics. Competitive conversation excels in volatile technology stocks, collaborative conversation dominates stable general stocks. Finance stocks resist all communication interventions. All structures converge to similar strategy alignments, challenging assumptions that transparency causes harmful diversity loss.

Conclusion: Optimal communication design must match market volatility characteristics, and sophisticated discussions don’t guarantee better performance. Performance differences stem from behavioral mechanisms rather than conversation quality.

Abstract: Multi-strategy hedge funds face a fundamental organizational choice: should analysts generating trading strategies communicate, and if so, how? We investigate this using 5-agent LLM-based trading systems across 450 experiments spanning 21 months, comparing five organizational structures from isolated baseline to collaborative and competitive conversation. We show that communication improves performance, but optimal communication design depends on market characteristics. Competitive conversation excels in volatile technology stocks, while collaborative conversation dominates stable general stocks. Finance stocks resist all communication interventions. Surprisingly, all structures, including isolated agents, converge to similar strategy alignments, challenging assumptions that transparency causes harmful diversity loss. Performance differences stem from behavioral mechanisms: competitive agents focus on stock-level allocation while collaborative agents develop technical frameworks. Conversation quality scores show zero correlation with returns. These findings demonstrate that optimal communication design must match market volatility characteristics, and sophisticated discussions don’t guarantee better performance.

[1202] Efficient Multiagent Planning via Shared Action Suggestions

Dylan M. Asmar, Mykel J. Kochenderfer

Main category: cs.MA

TL;DR: A communication approach for Dec-POMDPs that shares only suggested joint actions instead of observations, reducing complexity while maintaining near-centralized performance through belief pruning.

DetailsMotivation: Dec-POMDP-Com problems are NEXP-complete and intractable; while sharing observations reduces complexity to PSPACE-complete, we aim to eliminate observation sharing while retaining performance.

Method: Agents communicate suggested joint actions to estimate joint beliefs by pruning infeasible beliefs. Each agent maintains possible belief sets for others and prunes them using suggested actions to form estimated joint beliefs usable with centralized policies.

Result: The approach reduces computational complexity to solving POMDPs for each agent while achieving performance comparable to centralized methods when action sharing enables effective belief pruning, as demonstrated on Dec-POMDP benchmarks.

Conclusion: Action-based communication provides a scalable framework for multiagent planning under uncertainty, enabling human-agent cooperation applications in autonomous systems and teams.

Abstract: Decentralized partially observable Markov decision processes with communication (Dec-POMDP-Com) provide a framework for multiagent decision making under uncertainty, but the NEXP-complete complexity for finite-horizon problems renders solutions intractable in general. While sharing actions and observations can reduce the complexity to PSPACE-complete, we propose an approach that bridges POMDPs and Dec-POMDPs by communicating only suggested joint actions, eliminating the need to share observations while retaining near-centralized performance. Our algorithm estimates joint beliefs using shared actions to prune infeasible beliefs. Each agent maintains possible belief sets for other agents, pruning them based on suggested actions to form an estimated joint belief usable with any centralized policy. This approach requires solving a POMDP for each agent, reducing computational complexity while preserving performance. We demonstrate its effectiveness on several Dec-POMDP benchmarks, showing performance comparable to centralized methods when shared actions enable effective belief pruning. This action-based communication framework offers a natural avenue for integrating human-agent cooperation, opening new directions for scalable multiagent planning under uncertainty, with applications in both autonomous systems and human-agent teams.

[1203] Scalable Satellite Swarm Deployment via Distance-based Orbital Transition Under $J_2$ Perturbation

Yuta Takahashi, Shin-ichiro Sakai

Main category: cs.MA

TL;DR: Autonomous guidance and control for satellite swarms enabling scalable distributed space structures using fuel-free actuation and decentralized deployment control to maintain formation stability.

DetailsMotivation: To enable scalable distributed space structures for innovative science and business opportunities through autonomous satellite swarm guidance and control.

Method: Derived averaged J2 orbital parameters for drift and periodic motion, designed distance-based orbital stabilizer for autonomous deployment into coplanar equidistant configuration, used fuel-free actuation (magnetic field interaction, differential aerodynamic forces), and implemented decentralized deployment controller to minimize drift during communication outages.

Result: Achieved autonomous deployment into monolithic formation of coplanar equidistant configuration on user-defined orbital plane, maintained long-term formation stability without thruster usage, and minimized drift distance during communication outages.

Conclusion: The proposed autonomous guidance and control strategy enables scalable distributed space structures with fuel-free actuation and decentralized control, addressing challenges of unstable orbital dynamics and communication outages in satellite swarms.

Abstract: This paper presents an autonomous guidance and control strategy for a satellite swarm that enables scalable distributed space structures for innovative science and business opportunities. The averaged $J_2$ orbital parameters that describe the drift and periodic orbital motion were derived along with their target values to achieve a distributed space structure in a decentralized manner. This enabled the design of a distance-based orbital stabilizer to ensure autonomous deployment into a monolithic formation of a coplanar equidistant configuration on a user-defined orbital plane. Continuous formation control was assumed to be achieved through fuel-free actuation, such as satellite magnetic field interaction and differential aerodynamic forces, thereby maintaining long-term formation stability without thruster usage. A major challenge for such actuation systems is the potential loss of control capability due to increasing inter-satellite distances resulting from unstable orbital dynamics, particularly for autonomous satellite swarms. To mitigate this risk, our decentralized deployment controller minimized drift distance during unexpected communication outages. As a case study, we consider the deployment of palm-sized satellites into a coplanar equidistant formation in a $J_2$-perturbed orbit. Moreover, centralized grouping strategies are presented.

[1204] LLM-Enhanced Multi-Agent Reinforcement Learning with Expert Workflow for Real-Time P2P Energy Trading

Chengwei Lou, Zekai Jin, Wei Tang, Guangfei Geng, Jin Yang, Lu Zhang

Main category: cs.MA

TL;DR: Proposes LLM-MARL framework for real-time P2P electricity markets, using LLMs as experts to guide multi-agent reinforcement learning through imitation learning, achieving better economic costs and grid stability than baselines.

DetailsMotivation: Address challenges in scaling expert guidance for massive personalized prosumers in P2P electricity markets, including diverse decision-making demands, limited technical capability of prosumers, lack of expert experience, and security issues.

Method: Integrated LLM-MARL framework with LLMs as experts generating personalized strategies, guiding MARL under CTDE paradigm through imitation learning, using differential attention-based critic network for enhanced convergence.

Result: LLM-generated strategies effectively substitute human experts. Proposed algorithms achieve significantly lower economic costs and voltage violation rates compared to baseline algorithms while maintaining robust stability.

Conclusion: Provides effective solution for real-time P2P electricity market decision-making by bridging expert knowledge with agent learning, demonstrating practical viability of LLM-guided MARL approach.

Abstract: Real-time peer-to-peer (P2P) electricity markets dynamically adapt to fluctuations in renewable energy and variations in demand, maximizing economic benefits through instantaneous price responses while enhancing grid flexibility. However, scaling expert guidance for massive personalized prosumers poses critical challenges, including diverse decision-making demands and lack of customized modeling frameworks. This paper proposed an integrated large language model-multi-agent reinforcement learning (LLM-MARL) framework for real-time P2P energy trading to address challenges such as the limited technical capability of prosumers, the lack of expert experience, and security issues of distribution networks. LLMs are introduced as experts to generate personalized strategy, guiding MARL under the centralized training with decentralized execution (CTDE) paradigm through imitation learning. A differential attention-based critic network is designed to enhance convergence performance. Experimental results demonstrate that LLM generated strategies effectively substitute human experts. The proposed multi-agent imitation learning algorithms achieve significantly lower economic costs and voltage violation rates on test sets compared to baselines algorithms, while maintaining robust stability. This work provides an effective solution for real-time P2P electricity market decision-making by bridging expert knowledge with agent learning.

[1205] Local Guidance for Configuration-Based Multi-Agent Pathfinding

Tomoki Arita, Keisuke Okumura

Main category: cs.MA

TL;DR: Local guidance for multi-agent pathfinding improves solution quality without excessive computational cost, establishing new performance benchmarks when applied to LaCAM solver.

DetailsMotivation: To explore an alternative to global guidance in MAPF by providing local guidance around each agent, addressing potential computational concerns while aiming to improve coordination efficiency.

Method: Providing informative spatiotemporal cues as local guidance in the vicinity of each agent, with recomputation as agents move, applied to the LaCAM configuration-based solver.

Result: Significantly improved solution quality without exceeding moderate time budget, establishing a new performance frontier for MAPF.

Conclusion: Local guidance with spatiotemporal cues can effectively enhance MAPF performance, demonstrating computational feasibility and superior results compared to existing approaches.

Abstract: Guidance is an emerging concept that improves the empirical performance of real-time, sub-optimal multi-agent pathfinding (MAPF) methods. It offers additional information to MAPF algorithms to mitigate congestion on a global scale by considering the collective behavior of all agents across the entire workspace. This global perspective helps reduce agents’ waiting times, thereby improving overall coordination efficiency. In contrast, this study explores an alternative approach: providing local guidance in the vicinity of each agent. While such localized methods involve recomputation as agents move and may appear computationally demanding, we empirically demonstrate that supplying informative spatiotemporal cues to the planner can significantly improve solution quality without exceeding a moderate time budget. When applied to LaCAM, a leading configuration-based solver, this form of guidance establishes a new performance frontier for MAPF.

cs.MM

[1206] ProAV-DiT: A Projected Latent Diffusion Transformer for Efficient Synchronized Audio-Video Generation

Jiahui Sun, Weining Wang, Mingzhen Sun, Yirong Yang, Xinxin Zhu, Jing Liu

Main category: cs.MM

TL;DR: ProAV-DiT is a projected latent diffusion transformer that efficiently generates synchronized audio-video content by aligning audio and video in a unified latent space using multi-scale spatiotemporal modeling.

DetailsMotivation: Sounding Video Generation faces challenges due to structural misalignment between audio and video modalities and high computational costs of multimodal processing.

Method: Preprocesses audio into video-like representations, uses Multi-scale Dual-stream Spatio-Temporal Autoencoder for unified latent projection, multi-scale attention mechanisms, and 3D latent space processing with spatio-temporal diffusion Transformer.

Result: Outperforms existing methods in generation quality and computational efficiency on standard benchmarks.

Conclusion: ProAV-DiT effectively addresses structural misalignment and computational challenges in audio-video generation through unified latent space modeling and efficient spatiotemporal processing.

Abstract: Sounding Video Generation (SVG) remains a challenging task due to the inherent structural misalignment between audio and video, as well as the high computational cost of multimodal data processing. In this paper, we introduce ProAV-DiT, a Projected Latent Diffusion Transformer designed for efficient and synchronized audio-video generation. To address structural inconsistencies, we preprocess raw audio into video-like representations, aligning both the temporal and spatial dimensions between audio and video. At its core, ProAV-DiT adopts a Multi-scale Dual-stream Spatio-Temporal Autoencoder (MDSA), which projects both modalities into a unified latent space using orthogonal decomposition, enabling fine-grained spatiotemporal modeling and semantic alignment. To further enhance temporal coherence and modality-specific fusion, we introduce a multi-scale attention mechanism, which consists of multi-scale temporal self-attention and group cross-modal attention. Furthermore, we stack the 2D latents from MDSA into a unified 3D latent space, which is processed by a spatio-temporal diffusion Transformer. This design efficiently models spatiotemporal dependencies, enabling the generation of high-fidelity synchronized audio-video content while reducing computational overhead. Extensive experiments conducted on standard benchmarks demonstrate that ProAV-DiT outperforms existing methods in both generation quality and computational efficiency.

[1207] SynthGuard: An Open Platform for Detecting AI-Generated Multimedia with Multimodal LLMs

Shail Desai, Aditya Pawar, Li Lin, Xin Wang, Shu Hu

Main category: cs.MM

TL;DR: SynthGuard is an open, user-friendly platform for detecting and analyzing AI-generated multimedia using traditional detectors and multimodal large language models (MLLMs).

DetailsMotivation: Address gaps in existing deepfake detection tools which are often closed-source, limited in modality, or lacking transparency and educational value, making it difficult for users to understand detection decisions.

Method: Uses both traditional detectors and multimodal large language models (MLLMs) for detecting AI-generated multimedia, providing explainable inference and unified image and audio support.

Result: Developed SynthGuard platform with interactive interface that makes forensic analysis accessible to researchers, educators, and the public.

Conclusion: SynthGuard addresses the need for transparent, multimodal, and educational AI-generated media detection tools to combat misinformation and erosion of public trust.

Abstract: Artificial Intelligence (AI) has made it possible for anyone to create images, audio, and video with unprecedented ease, enriching education, communication, and creative expression. At the same time, the rapid rise of AI-generated media has introduced serious risks, including misinformation, identity misuse, and the erosion of public trust as synthetic content becomes increasingly indistinguishable from real media. Although deepfake detection has advanced, many existing tools remain closed-source, limited in modality, or lacking transparency and educational value, making it difficult for users to understand how detection decisions are made. To address these gaps, we introduce SynthGuard, an open, user-friendly platform for detecting and analyzing AI-generated multimedia using both traditional detectors and multimodal large language models (MLLMs). SynthGuard provides explainable inference, unified image and audio support, and an interactive interface designed to make forensic analysis accessible to researchers, educators, and the public. The SynthGuard platform is available at: https://in-engr-nova.it.purdue.edu/

[1208] Hierarchical Knowledge Graphs for Story Understanding in Visual Narratives

Yi-Chun Chen

Main category: cs.MM

TL;DR: A hierarchical knowledge graph framework for semantic understanding of visual narratives (comics) that organizes content across panel, event, and macro-event levels using symbolic graphs for interpretable reasoning.

DetailsMotivation: To provide structured semantic understanding of visual narratives with transparency and interpretability, aligning with cognitive theories of event segmentation and visual storytelling.

Method: Hierarchical knowledge graph framework with three levels (panel, event, macro-event) integrating symbolic graphs encoding semantic, spatial, and temporal relationships. Models visual elements (characters, objects, actions) and textual components (dialogue, narration) systematically.

Result: Applied to Manga109 dataset, supports interpretable symbolic reasoning for four tasks: action retrieval, dialogue tracing, character appearance mapping, and timeline reconstruction. Emphasizes transparency over predictive performance.

Conclusion: Contributes to explainable narrative analysis and provides foundation for authoring tools, narrative comprehension systems, and interactive media applications.

Abstract: We present a hierarchical knowledge graph framework for the structured semantic understanding of visual narratives, using comics as a representative domain for multimodal storytelling. The framework organizes narrative content across three levels-panel, event, and macro-event, by integrating symbolic graphs that encode semantic, spatial, and temporal relationships. At the panel level, it models visual elements such as characters, objects, and actions alongside textual components including dialogue and narration. These are systematically connected to higher-level graphs that capture narrative sequences and abstract story structures. Applied to a manually annotated subset of the Manga109 dataset, the framework supports interpretable symbolic reasoning across four representative tasks: action retrieval, dialogue tracing, character appearance mapping, and timeline reconstruction. Rather than prioritizing predictive performance, the system emphasizes transparency in narrative modeling and enables structured inference aligned with cognitive theories of event segmentation and visual storytelling. This work contributes to explainable narrative analysis and offers a foundation for authoring tools, narrative comprehension systems, and interactive media applications.

[1209] Failures to Surface Harmful Contents in Video Large Language Models

Yuxin Cao, Wei Song, Derui Wang, Jingling Xue, Jin Song Dong

Main category: cs.MM

TL;DR: VideoLLMs have a critical safety gap where they fail to mention harmful content in videos, with omission rates exceeding 90% due to design flaws in frame sampling, token compression, and weak visual-text connection.

DetailsMotivation: To identify safety vulnerabilities in VideoLLMs where users rely on auto-generated summaries while casually skimming videos, potentially missing harmful content that the models omit from their outputs.

Method: Conducted root-cause analysis revealing three design flaws: insufficient temporal coverage from sparse frame sampling, spatial information loss from aggressive token downsampling, and encoder-decoder disconnection. Developed three zero-query black-box attacks aligned with these flaws.

Result: Large-scale evaluation across five leading VideoLLMs showed harmfulness omission rates exceeding 90% in most cases. Models consistently failed to identify harmful content even when clearly present in all frames.

Conclusion: Current VideoLLMs have fundamental vulnerabilities in their design, highlighting the urgent need for improved sampling strategies, token compression methods, and decoding mechanisms that prioritize semantic coverage over speed alone.

Abstract: Video Large Language Models (VideoLLMs) are increasingly deployed on numerous critical applications, where users rely on auto-generated summaries while casually skimming the video stream. We show that this interaction hides a critical safety gap: if harmful content is embedded in a video, either as full-frame inserts or as small corner patches, state-of-the-art VideoLLMs rarely mention the harmful content in the output, despite its clear visibility to human viewers. A root-cause analysis reveals three compounding design flaws: (1) insufficient temporal coverage resulting from the sparse, uniformly spaced frame sampling used by most leading VideoLLMs, (2) spatial information loss introduced by aggressive token downsampling within sampled frames, and (3) encoder-decoder disconnection, whereby visual cues are only weakly utilized during text generation. Leveraging these insights, we craft three zero-query black-box attacks, aligning with these flaws in the processing pipeline. Our large-scale evaluation across five leading VideoLLMs shows that the harmfulness omission rate exceeds 90% in most cases. Even when harmful content is clearly present in all frames, these models consistently fail to identify it. These results underscore a fundamental vulnerability in current VideoLLMs’ designs and highlight the urgent need for sampling strategies, token compression, and decoding mechanisms that guarantee semantic coverage rather than speed alone.

eess.AS

[1210] How Far Do SSL Speech Models Listen for Tone? Temporal Focus of Tone Representation under Low-resource Transfer

Minu Kim, Ji Sub Um, Hoirin Kim

Main category: eess.AS

TL;DR: SSL speech models’ tone perception varies by downstream task: ASR fine-tuning aligns with language-specific tone cues (100-180ms), while prosody/voice tasks bias toward longer spans.

DetailsMotivation: Lexical tone is crucial in many languages but understudied in SSL speech models, especially beyond Mandarin. Need to understand how these models perceive tone and how transfer works in low-resource settings.

Method: Study four languages with complex tone systems (Burmese, Thai, Lao, Vietnamese). Use baseline tone cue temporal span estimation, probes, and gradient analyses on fine-tuned SSL models across different downstream tasks.

Result: Tone transfer varies by task: ASR fine-tuning aligns temporal spans with language-specific tone cues (100ms for Burmese/Thai, 180ms for Lao/Vietnamese), while prosody/voice tasks bias models toward longer spans.

Conclusion: Tone transfer in SSL models is shaped by downstream task, demonstrating task effects on temporal focus in tone modeling.

Abstract: Lexical tone is central to many languages but remains underexplored in self-supervised learning (SSL) speech models, especially beyond Mandarin. We study four languages with complex and diverse tone systems: Burmese, Thai, Lao, and Vietnamese, to examine how far such models listen for tone and how transfer operates in low-resource conditions. As a baseline reference, we estimate the temporal span of tone cues to be about 100 ms in Burmese and Thai, and about 180 ms in Lao and Vietnamese. Probes and gradient analyses on fine-tuned SSL models reveal that tone transfer varies by downstream task: automatic speech recognition fine-tuning aligns spans with language-specific tone cues, while prosody- and voice-related tasks bias the model toward overly long spans. These findings indicate that tone transfer is shaped by downstream task, highlighting task effects on temporal focus in tone modeling.

[1211] VoiceCraft-X: Unifying Multilingual, Voice-Cloning Speech Synthesis and Speech Editing

Zhisheng Zheng, Puyuan Peng, Anuj Diwan, Cong Phuoc Huynh, Xiaohang Sun, Zhu Liu, Vimal Bhat, David Harwath

Main category: eess.AS

TL;DR: VoiceCraft-X is a unified autoregressive neural codec model for multilingual speech editing and zero-shot TTS across 11 languages, using Qwen3 LLM for text processing and novel token reordering.

DetailsMotivation: To create a single framework that handles both multilingual speech editing and zero-shot TTS synthesis across diverse languages, addressing the need for unified approaches in real-world multilingual speech applications.

Method: Uses Qwen3 large language model for phoneme-free cross-lingual text processing with a novel token reordering mechanism that aligns text and speech tokens, treating both tasks as sequence generation problems.

Result: Generates high-quality, natural-sounding speech and enables seamless audio creation/editing across 11 languages, showing robust performance even with limited per-language data.

Conclusion: Demonstrates the power of unified autoregressive approaches for advancing complex multilingual speech applications, with strong performance across diverse linguistic settings.

Abstract: We introduce VoiceCraft-X, an autoregressive neural codec language model which unifies multilingual speech editing and zero-shot Text-to-Speech (TTS) synthesis across 11 languages: English, Mandarin, Korean, Japanese, Spanish, French, German, Dutch, Italian, Portuguese, and Polish. VoiceCraft-X utilizes the Qwen3 large language model for phoneme-free cross-lingual text processing and a novel token reordering mechanism with time-aligned text and speech tokens to handle both tasks as a single sequence generation problem. The model generates high-quality, natural-sounding speech, seamlessly creating new audio or editing existing recordings within one framework. VoiceCraft-X shows robust performance in diverse linguistic settings, even with limited per-language data, underscoring the power of unified autoregressive approaches for advancing complex, real-world multilingual speech applications. Audio samples are available at https://zhishengzheng.com/voicecraft-x/.

[1212] Eardrum sound pressure prediction from ear canal reflectance based on the inverse solution of Webster’s horn equation

Reinhild Roden, Tobias Sankowsky-Rothe, Nick Wulbusch, Alexey Chernov, Matthias Blau

Main category: eess.AS

TL;DR: This paper improves methods for estimating individual ear canal area functions from acoustic measurements to enable personalized hearing aid equalization.

DetailsMotivation: Individual ear canal models are needed for personalized hearing system equalization, requiring accurate estimation of ear canal area functions from limited measurement data.

Method: Used inverse solution of Webster’s horn equation with finite difference approximation of time domain reflectance, optimized termination at optimal spatial resolution, and extrapolated simulated input impedances up to 3.5 MHz (0.1 mm resolution).

Result: Achieved more precise area functions compared to geometric reference, successfully replicated 3D simulated and measured ear canal transfer impedances using validated 1D electro-acoustic model.

Conclusion: The improved method provides robust criteria for terminating area function approximation and enables accurate individual ear canal modeling for hearing system equalization.

Abstract: To derive ear canal transfer functions for individualized equalization algorithms of in-ear hearing systems, individual ear canal models are needed. In a one-dimensional approach, this requires the estimation of the individual area function of the ear canal. The area function can be effectively and reproducibly calculated as the inverse solution of Webster’s horn equation by finite difference approximation of the time domain reflectance. Building upon previous research, the present study further investigates the termination of the approximation at an optimal spatial resolution, addressing the absence of higher frequencies in typical ear canal measurements and enhancing the accuracy of the inverse solution. Compared to the geometric reference, more precise area functions were achieved by extrapolating simulated input impedances of ear canal geometries up to a frequency of 3.5 MHz, corresponding to 0.1 mm spatial resolution. The low pass of the previous work was adopted but adjusted for its cut-off frequency depending on the highest frequency of the band-limited input impedance. Robust criteria for terminating the area function at the approximated ear canal length were found. Finally, three-dimensional simulated and measured ear canal transfer impedances were replicated well employing the previously introduced and herein validated one-dimensional electro-acoustic model fed by the area functions.

[1213] PASE: Leveraging the Phonological Prior of WavLM for Low-Hallucination Generative Speech Enhancement

Xiaobin Rong, Qinwen Hu, Mansur Yesilbursa, Kamil Wojcicki, Jing Lu

Main category: eess.AS

TL;DR: PASE is a generative speech enhancement framework that uses WavLM’s phonological prior to reduce linguistic and acoustic hallucinations in noisy speech, achieving better perceptual quality than discriminative models.

DetailsMotivation: Existing generative speech enhancement models suffer from linguistic hallucinations (incorrect content) and acoustic hallucinations (inconsistent speaker characteristics) under severe noise conditions, which current approaches fail to adequately address.

Method: Adapt WavLM into a denoising expert via representation distillation to clean its final-layer features, leveraging its robust phonological prior. Train a vocoder with dual-stream representations: high-level phonetic for clean content and low-level acoustic for speaker identity.

Result: PASE surpasses state-of-the-art discriminative models in perceptual quality and significantly outperforms prior generative models with substantially lower linguistic and acoustic hallucinations.

Conclusion: The proposed PASE framework effectively mitigates hallucinations in generative speech enhancement by leveraging phonological priors from pre-trained models, demonstrating superior performance over existing approaches.

Abstract: Generative models have shown remarkable performance in speech enhancement (SE), achieving superior perceptual quality over traditional discriminative approaches. However, existing generative SE approaches often overlook the risk of hallucination under severe noise, leading to incorrect spoken content or inconsistent speaker characteristics, which we term linguistic and acoustic hallucinations, respectively. We argue that linguistic hallucination stems from models’ failure to constrain valid phonological structures and it is a more fundamental challenge. While language models (LMs) are well-suited for capturing the underlying speech structure through modeling the distribution of discrete tokens, existing approaches are limited in learning from noise-corrupted representations, which can lead to contaminated priors and hallucinations. To overcome these limitations, we propose the Phonologically Anchored Speech Enhancer (PASE), a generative SE framework that leverages the robust phonological prior embedded in the pre-trained WavLM model to mitigate hallucinations. First, we adapt WavLM into a denoising expert via representation distillation to clean its final-layer features. Guided by the model’s intrinsic phonological prior, this process enables robust denoising while minimizing linguistic hallucinations. To further reduce acoustic hallucinations, we train the vocoder with a dual-stream representation: the high-level phonetic representation provides clean linguistic content, while a low-level acoustic representation retains speaker identity and prosody. Experimental results demonstrate that PASE not only surpasses state-of-the-art discriminative models in perceptual quality, but also significantly outperforms prior generative models with substantially lower linguistic and acoustic hallucinations.

[1214] Systematic evaluation of time-frequency features for binaural sound source localization

Davoud Shariat Panah, Alessandro Ragano, Dan Barry, Jan Skoglund, Andrew Hines

Main category: eess.AS

TL;DR: Systematic evaluation of time-frequency features for binaural sound source localization shows that optimal feature combinations (ILD+IPD+spectrograms) outperform model complexity increases, with low-complexity CNN achieving competitive performance.

DetailsMotivation: To understand how feature selection impacts binaural sound source localization performance across different conditions and provide guidance for both domain-specific and general-purpose localization.

Method: Evaluated CNN model performance using various combinations of amplitude-based features (magnitude spectrogram, ILD) and phase-based features (phase spectrogram, IPD) on in-domain and out-of-domain data with mismatched HRTFs.

Result: Carefully chosen feature combinations often outperform increases in model complexity. ILD+IPD sufficient for in-domain SSL, but generalization requires richer inputs combining channel spectrograms with both ILD and IPD.

Conclusion: Feature design is crucial for binaural SSL, with optimal feature sets enabling low-complexity models to achieve competitive performance across diverse conditions.

Abstract: This study presents a systematic evaluation of time-frequency feature design for binaural sound source localization (SSL), focusing on how feature selection influences model performance across diverse conditions. We investigate the performance of a convolutional neural network (CNN) model using various combinations of amplitude-based features (magnitude spectrogram, interaural level difference - ILD) and phase-based features (phase spectrogram, interaural phase difference - IPD). Evaluations on in-domain and out-of-domain data with mismatched head-related transfer functions (HRTFs) reveal that carefully chosen feature combinations often outperform increases in model complexity. While two-feature sets such as ILD + IPD are sufficient for in-domain SSL, generalization to diverse content requires richer inputs combining channel spectrograms with both ILD and IPD. Using the optimal feature sets, our low-complexity CNN model achieves competitive performance. Our findings underscore the importance of feature design in binaural SSL and provide practical guidance for both domain-specific and general-purpose localization.

[1215] Study on the Fairness of Speaker Verification Systems on Underrepresented Accents in English

Mariel Estevez, Luciana Ferrer

Main category: eess.AS

TL;DR: Analysis of speaker verification system fairness across accent groups shows calibration bias that can be mitigated through data balancing.

DetailsMotivation: Ensure SV systems used for sensitive decisions (bank access, criminal cases) are fair and don't disadvantage any accent groups.

Method: Curated new dataset from VoxCeleb with speakers from different accent countries; evaluated multiple SV systems; tested data balancing approach with discriminative condition-aware backend.

Result: Discrimination performance robust across accents, but calibration degrades dramatically for underrepresented accents in training data.

Conclusion: Simple data balancing effectively mitigates calibration bias, especially when combined with discriminative condition-aware backend.

Abstract: Speaker verification (SV) systems are currently being used to make sensitive decisions like giving access to bank accounts or deciding whether the voice of a suspect coincides with that of the perpetrator of a crime. Ensuring that these systems are fair and do not disfavor any particular group is crucial. In this work, we analyze the performance of several state-of-the-art SV systems across groups defined by the accent of the speakers when speaking English. To this end, we curated a new dataset based on the VoxCeleb corpus where we carefully selected samples from speakers with accents from different countries. We use this dataset to evaluate system performance for several SV systems trained with VoxCeleb data. We show that, while discrimination performance is reasonably robust across accent groups, calibration performance degrades dramatically on some accents that are not well represented in the training data. Finally, we show that a simple data balancing approach mitigates this undesirable bias, being particularly effective when applied to our recently-proposed discriminative condition-aware backend.

[1216] Lina-Speech: Gated Linear Attention and Initial-State Tuning for Multi-Sample Prompting Text-To-Speech Synthesis

Théodor Lemerle, Téo Guichoux, Axel Roebel, Nicolas Obin

Main category: eess.AS

TL;DR: Lina-Speech is a TTS model using Gated Linear Attention to replace standard self-attention, improving throughput while maintaining performance. It introduces Initial-State Tuning for multi-sample voice cloning and style adaptation.

DetailsMotivation: Current neural codec language models have limited context length, restricting voice cloning to short samples and hindering prosody/style diversity. They also struggle with prosody/emotion adaptation and have quadratic complexity limiting throughput.

Method: Uses Gated Linear Attention (GLA) instead of standard self-attention as backbone. Introduces Initial-State Tuning (IST) strategy leveraging recurrent architecture’s stateful property for multi-sample conditioning of arbitrary lengths.

Result: Improves inference throughput while matching state-of-the-art performance. Enables comprehensive voice cloning and out-of-domain speaking style/emotion adaptation with fine-grained control over prosody and emotion.

Conclusion: Lina-Speech provides an efficient solution for voice cloning and style adaptation, overcoming limitations of current models through GLA and IST strategies.

Abstract: Neural codec language models, built on transformer architecture, have revolutionized text-to-speech (TTS) synthesis, excelling in voice cloning by treating it as a prefix continuation task. However, their limited context length hinders their effectiveness to short speech samples. As a result, the voice cloning ability is restricted to a limited coverage and diversity of the speaker’s prosody and style. Besides, adapting prosody, accent, or appropriate emotion from a short prefix remains a challenging task. Finally, the quadratic complexity of self-attention limits inference throughput. In this work, we introduce Lina-Speech, a TTS model with Gated Linear Attention (GLA) to replace standard self-attention as a principled backbone, improving inference throughput while matching state-of-the-art performance. Leveraging the stateful property of recurrent architecture, we introduce an Initial-State Tuning (IST) strategy that unlocks the possibility of multiple speech sample conditioning of arbitrary numbers and lengths and provides a comprehensive and efficient strategy for voice cloning and out-of-domain speaking style and emotion adaptation. We demonstrate the effectiveness of this approach for controlling fine-grained characteristics such as prosody and emotion. Code, checkpoints, and demo are freely available: https://github.com/theodorblackbird/lina-speech

[1217] AHAMask: Reliable Task Specification for Large Audio Language Models without Instructions

Yiwei Guo, Bohan Li, Hankun Wang, Zhihan Li, Shuai Wang, Xie Chen, Kai Yu

Main category: eess.AS

TL;DR: AHAMask masks specific attention heads in LALMs to trigger acoustic task functionalities without instructions, achieving comparable or better performance than instruction-based methods.

DetailsMotivation: Current large audio language models suffer from prompt sensitivity, where different instructions of the same intention yield drastically different outcomes.

Method: Propose AHAMask that masks some attention heads in the decoder-only LLM backbone of LALMs to trigger specific acoustic task functionalities without instructions. Masks are efficiently obtained by training with parameters equal to attention head count.

Result: Applying selective attention head masks achieves comparable or even better performance than using instructions, either on single or composite tasks.

Conclusion: AHAMask enables reliable acoustic task specification for LALMs and reveals that LALMs exhibit ‘functional pathways’ in their attention heads.

Abstract: Although current large audio language models (LALMs) extend text large language models (LLMs) with generic acoustic understanding abilities, they usually suffer from prompt sensitivity, where different instructions of the same intention can yield drastically different outcomes. In this work, we propose AHAMask, where we simply mask some of the attention heads in the decoder-only LLM backbone of LALMs, to trigger specific acoustic task functionalities without instructions. These masks are efficiently obtained by training on an LALM, with the number of trainable parameters equal to the attention head count in its LLM backbone. We show by experiments that applying such selective attention head masks achieves comparable or even better performance than using instructions, either on single or composite tasks. Besides achieving reliable acoustic task specification for LALMs, this also reveals that LALMs exhibit certain “functional pathways” in their attention heads.

eess.IV

[1218] Slow - Motion Video Synthesis for Basketball Using Frame Interpolation

Jiantang Huang

Main category: eess.IV

TL;DR: Fine-tuned RIFE model achieves real-time 4x slow-motion for basketball videos with improved quality over baseline methods.

DetailsMotivation: Traditional basketball broadcasts at 30-60 fps limit appreciation of rapid plays like dunks and crossovers, requiring better slow-motion synthesis.

Method: Fine-tuned Real-Time Intermediate Flow Estimation (RIFE) network on basketball subset of SportsSloMo dataset with human-aware random cropping.

Result: Fine-tuned RIFE achieved 34.3 dB PSNR and 0.949 SSIM, outperforming Super SloMo by 2.1 dB and baseline RIFE by 1.3 dB, running at ~30 fps on RTX 4070 Ti Super.

Conclusion: Task-specific adaptation is crucial for sports slow-motion, and RIFE provides attractive accuracy-speed trade-off for consumer applications.

Abstract: Basketball broadcast footage is traditionally captured at 30-60 fps, limiting viewers’ ability to appreciate rapid plays such as dunks and crossovers. We present a real-time slow-motion synthesis system that produces high-quality basketball-specific interpolated frames by fine-tuning the recent Real-Time Intermediate Flow Estimation (RIFE) network on the SportsSloMo dataset. Our pipeline isolates the basketball subset of SportsSloMo, extracts training triplets, and fine-tunes RIFE with human-aware random cropping. We compare the resulting model against Super SloMo and the baseline RIFE model using Peak Signal-to-Noise Ratio (PSNR) and Structural Similarity (SSIM) on held-out clips. The fine-tuned RIFE attains a mean PSNR of 34.3 dB and SSIM of 0.949, outperforming Super SloMo by 2.1 dB and the baseline RIFE by 1.3 dB. A lightweight Gradio interface demonstrates end-to-end 4x slow-motion generation on a single RTX 4070 Ti Super at approximately 30 fps. These results indicate that task-specific adaptation is crucial for sports slow-motion, and that RIFE provides an attractive accuracy-speed trade-off for consumer applications.

[1219] Weyl-Heisenberg Transform Capabilities in JPEG Compression Standard

V. Asiryan, V. Volchkov, N. Papulovskaya

Main category: eess.IV

TL;DR: A new JPEG compression technology using Weyl-Heisenberg bases (WH-technology) replaces DCT with discrete orthogonal Weyl-Heisenberg transform (DWHT) for better compression efficiency.

DetailsMotivation: To overcome limitations of JPEG standard and enhance compression efficiency by addressing DCT's limitations in decorrelation and compression.

Method: Replaced discrete cosine transform (DCT) with real version of 2D discrete orthogonal Weyl-Heisenberg transform (DWHT) in JPEG compression algorithm, leveraging DWHT’s block structure and optimal signal basis properties.

Result: Experimental study confirmed higher compression efficiency compared to standard JPEG compression, with more efficient decorrelation and compression of element values in image blocks.

Conclusion: The proposed WH-technology based JPEG algorithm using DWHT provides superior compression performance over traditional JPEG standard.

Abstract: This paper is devoted to the development and research of a new compression technology based on Weyl-Heisenberg bases (WH-technology) for modifying the JPEG compression standard and improving its characteristics. For this purpose, the paper analyzes the main stages of the JPEG compression algorithm, notes its key features and problems that limit further enhancement of its efficiency. To overcome these limitations, it is proposed to use the real version of the two-dimensional discrete orthogonal Weyl-Heisenberg transform (DWHT) instead of the discrete cosine transform (DCT) at the stage of transformation coding. This transformation, unlike DCT, initially has a block structure and is built on the basis of the Weyl-Heisenberg optimal signal basis, the functions of which are orthogonal and well localized both in the frequency and time domains. This feature of DWHT allows for more efficient decorrelation and compression of element values in each block of the image after transformation coding. As a result, it is possible to obtain more efficient selection and screening of insignificant elements at the subsequent stages of quantization and information coding. Based on DWHT, a new version of the JPEG compression algorithm was developed, and convenient criteria for evaluating the compression efficiency and metrics of quality losses were proposed. The results of an experimental study are presented, confirming the higher compression efficiency of the proposed algorithm in comparison with the JPEG compression standard.

[1220] A Deep Learning Framework for Thyroid Nodule Segmentation and Malignancy Classification from Ultrasound Images

Omar Abdelrazik, Mohamed Elsayed, Noorul Wahab, Nasir Rajpoot, Adam Shephard

Main category: eess.IV

TL;DR: A fully automated two-stage framework for interpretable thyroid nodule malignancy prediction using TransUNet for segmentation and ResNet-18 for classification, achieving high performance with clinical interpretability.

DetailsMotivation: Address high inter-observer variability in ultrasound-based thyroid nodule risk stratification and overcome the black box nature of many deep learning models by creating an interpretable system.

Method: Two-stage framework: 1) TransUNet automatically segments thyroid nodules, 2) Region of interest from segmentation is fed into ResNet-18 classifier for malignancy prediction.

Result: Achieved F1-score of 0.852 on clinical dataset of 349 images, outperforming Random Forest baseline with hand-crafted features (F1-score 0.829).

Conclusion: The framework demonstrates that implicit visual features from localized nodules are more predictive than explicit shape features, providing the first fully automated end-to-end pipeline for thyroid nodule detection and malignancy prediction.

Abstract: Ultrasound-based risk stratification of thyroid nodules is a critical clinical task, but it suffers from high inter-observer variability. While many deep learning (DL) models function as “black boxes,” we propose a fully automated, two-stage framework for interpretable malignancy prediction. Our method achieves interpretability by forcing the model to focus only on clinically relevant regions. First, a TransUNet model automatically segments the thyroid nodule. The resulting mask is then used to create a region of interest around the nodule, and this localised image is fed directly into a ResNet-18 classifier. We evaluated our framework using 5-fold cross-validation on a clinical dataset of 349 images, where it achieved a high F1-score of 0.852 for predicting malignancy. To validate its performance, we compared it against a strong baseline using a Random Forest classifier with hand-crafted morphological features, which achieved an F1-score of 0.829. The superior performance of our DL framework suggests that the implicit visual features learned from the localised nodule are more predictive than explicit shape features alone. This is the first fully automated end-to-end pipeline for both detecting thyroid nodules on ultrasound images and predicting their malignancy.

[1221] Noisy MRI Reconstruction via MAP Estimation with an Implicit Deep-Denoiser Prior

Nikola Janjušević, Amirhossein Khalilian-Gourtani, Yao Wang, Li Feng

Main category: eess.IV

TL;DR: ImMAP is a diffusion-based MRI reconstruction framework that integrates acquisition noise models into a maximum a posteriori formulation, outperforming state-of-the-art methods under realistic noise conditions.

DetailsMotivation: Existing diffusion models for MRI reconstruction lack explicit links to MRI physics and are sensitive to measurement noise, limiting their practical reliability.

Method: Builds on stochastic ascent method and generalizes it to handle MRI encoding operators and realistic measurement noise in a MAP formulation.

Result: Consistently outperforms state-of-the-art deep learning (LPDSNet) and diffusion-based (DDS) methods on both simulated and real noisy datasets.

Conclusion: ImMAP establishes a more reliable and interpretable diffusion-based reconstruction framework by clarifying practical behavior under realistic noise conditions.

Abstract: Accelerating magnetic resonance imaging (MRI) remains challenging, particularly under realistic acquisition noise. While diffusion models have recently shown promise for reconstructing undersampled MRI data, many approaches lack an explicit link to the underlying MRI physics, and their parameters are sensitive to measurement noise, limiting their reliability in practice. We introduce Implicit-MAP (ImMAP), a diffusion-based reconstruction framework that integrates the acquisition noise model directly into a maximum a posteriori (MAP) formulation. Specifically, we build on the stochastic ascent method of Kadkhodaie et al. and generalize it to handle MRI encoding operators and realistic measurement noise. Across both simulated and real noisy datasets, ImMAP consistently outperforms state-of-the-art deep learning (LPDSNet) and diffusion-based (DDS) methods. By clarifying the practical behavior and limitations of diffusion models under realistic noise conditions, ImMAP establishes a more reliable and interpretable

[1222] Volumetric Ultrasound via 3D Null Subtraction Imaging with Circular and Spiral Apertures

Bingze Dai, Xi Zhang, Wei-Ning Lee

Main category: eess.IV

TL;DR: 3D NSI is a nonlinear beamforming framework that improves volumetric ultrasound imaging by combining null-subtraction processing with sparse aperture designs, achieving higher resolution and contrast while enabling real-time 4D imaging at over 1000 volumes/second.

DetailsMotivation: Volumetric ultrasound imaging faces fundamental trade-offs between image quality, frame rate, and hardware complexity that limit practical applications.

Method: Combines null-subtraction beamforming with multiplexing-aware sparse aperture designs on matrix arrays, including Fermat’s spiral sparse apertures and a novel spiral “no-reuse” apodization that prevents element overlap across transmit-receive events.

Result: Achieved 36% improvement in azimuthal and elevational resolution, 20% higher contrast ratio compared to conventional DAS beamforming, and enabled 16-fold increase in acquisition volume rate using only 240 active elements on a 1024-element probe.

Conclusion: 3D NSI provides a practical solution for real-time 4D imaging with computational load less than three times that of DAS, making it suitable for clinical applications requiring high-speed volumetric imaging.

Abstract: Volumetric ultrasound imaging faces a fundamental trade-off among image quality, frame rate, and hardware complexity. This study introduces three-dimensional Null Subtraction Imaging (3D NSI), a nonlinear beamforming framework that addresses this trade-off by combining computationally efficient null-subtraction process with multiplexing-aware sparse aperture designs on matrix arrays. We evaluate three apodization configurations: a fully addressed circular aperture and two Fermat’s spiral sparse apertures. To overcome channel-sharing constraints common in matrix arrays multiplexed with low-channel-count ultrasound systems, we propose a spiral “no-reuse” apodization that enforces non-overlapping element sets across transmit-receive events. This design resolves multiplexing conflicts and enables up to a 16-fold increase in acquisition volume rate using only 240 active elements on a 1024-element probe. In computer simulations and tissue-mimicking phantom experiments, 3D NSI achieved an average improvement of 36% in azimuthal and elevational resolutions, along with an approximately 20% higher contrast ratio, compared to the conventional Delay-and-Sum (DAS) beamformer under matched transmit/receive configurations. When implemented with the spiral no-reuse aperture, the 3D NSI framework achieved over 1000 volumes per second with a computational load less than three times that of DAS, making it a practical solution for real-time 4D imaging.

[1223] Recursive Threshold Median Filter and Autoencoder for Salt-and-Pepper Denoising: SSIM analysis of Images and Entropy Maps

Petr Boriskov, Kirill Rudkovskii, Andrei Velichko

Main category: eess.IV

TL;DR: The paper proposes two scalable denoising schemes using median filters and autoencoders, introduces SSIMMap as a complementary metric to SSIMImg for better blur assessment, and shows median filters outperform autoencoders for high noise levels.

DetailsMotivation: To develop effective salt-and-pepper noise removal methods that work well under strong noise conditions while being computationally efficient for resource-constrained platforms.

Method: Uses median filters and simple three-layer autoencoders within recursive threshold algorithm, proposing two scalable schemes: 2MF (two MFs with different window sizes) and MFs-AE (aggregating features from multiple MFs via AE).

Result: Recursive threshold MF robustly restores images even under 50-60% noise, while simple AE only works for <30% noise. 2MF highlights sharp local details at low resolution, and MFs-AE restores overall scene structure at higher resolution.

Conclusion: MF remains preferable for edge/IoT deployment due to simplicity and efficiency, while AE underperforms without prior denoising. SSIMMap proves valuable for objective blur assessment and parameter tuning.

Abstract: This paper studies the removal of salt-and-pepper noise from images using median filter (MF) and simple three-layer autoencoder (AE) within recursive threshold algorithm. The performance of denoising is assessed with two metrics: the standard Structural Similarity Index SSIMImg of restored and clean images and a newly applied metric SSIMMap - the SSIM of entropy maps of these images computed via 2D Sample Entropy in sliding windows. We shown that SSIMMap is more sensitive to blur and local intensity transitions and complements SSIMImg. Experiments on low- and high-resolution grayscales images demonstrate that recursive threshold MF robustly restores images even under strong noise (50-60 %), whereas simple AE is only capable of restoring images with low levels of noise (<30 %). We propose two scalable schemes: (i) 2MF, which uses two MFs with different window sizes and a final thresholding step, effective for highlighting sharp local details at low resolution; and (ii) MFs-AE, which aggregates features from multiple MFs via an AE and is beneficial for restoring the overall scene structure at higher resolution. Owing to its simplicity and computational efficiency, MF remains preferable for deployment on resource-constrained platforms (edge/IoT), whereas AE underperforms without prior denoising. The results also validate the practical value of SSIMMap for objective blur assessment and denoising parameter tuning.

[1224] Deep Unfolded BM3D: Unrolling Non-local Collaborative Filtering into a Trainable Neural Network

Kerem Basim, Mehmet Ozan Unal, Metin Ertas, Isa Yildirim

Main category: eess.IV

TL;DR: DU-BM3D combines BM3D’s non-local self-similarity with U-Net’s learnable denoising, outperforming both methods in low-dose CT denoising across various noise levels.

DetailsMotivation: BM3D has fixed parameters and lacks adaptability, while deep models like U-Net lack interpretability and fail to generalize across noise regimes. A hybrid approach is needed to combine the strengths of both.

Method: Unroll BM3D into a trainable architecture by replacing its fixed collaborative filtering with a learnable U-Net denoiser, preserving non-local structural priors while enabling end-to-end optimization.

Result: DU-BM3D outperforms classic BM3D and standalone U-Net across simulated low-dose CT at different noise levels, achieving higher PSNR and SSIM, especially in high-noise conditions.

Conclusion: The proposed DU-BM3D framework successfully combines the interpretability of BM3D with the flexibility of deep learning, providing superior denoising performance while maintaining structural priors.

Abstract: Block-Matching and 3D Filtering (BM3D) exploits non-local self-similarity priors for denoising but relies on fixed parameters. Deep models such as U-Net are more flexible but often lack interpretability and fail to generalize across noise regimes. In this study, we propose Deep Unfolded BM3D (DU-BM3D), a hybrid framework that unrolls BM3D into a trainable architecture by replacing its fixed collaborative filtering with a learnable U-Net denoiser. This preserves BM3D’s non-local structural prior while enabling end-to-end optimization. We evaluate DU-BM3D on low-dose CT (LDCT) denoising and show that it outperforms classic BM3D and standalone U-Net across simulated LDCT at different noise levels, yielding higher PSNR and SSIM, especially in high-noise conditions.

[1225] Multimodal RGB-HSI Feature Fusion with Patient-Aware Incremental Heuristic Meta-Learning for Oral Lesion Classification

Rupam Mukherjee, Rajkumar Daniel, Soujanya Hazra, Shirin Dasgupta, Subhamoy Mandal

Main category: eess.IV

TL;DR: A unified four-class oral lesion classifier combining deep RGB embeddings, hyperspectral reconstruction, handcrafted features, and demographic data achieves improved oral cancer screening in low-resource settings.

DetailsMotivation: Early detection of oral cancer is challenging in low-resource settings due to limited annotated data, requiring robust automated screening methods.

Method: Uses fine-tuned ConvNeXt-v2 encoder for RGB embeddings, RGB-to-HSI reconstruction for 31-band hyperspectral cubes, extracts haemoglobin-sensitive indices and spectral-textural features, and introduces incremental heuristic meta-learner (IHML) with probabilistic stacking.

Result: Achieved 66.23% macro F1 and 64.56% accuracy on unseen patient split, demonstrating substantial improvement in robustness for oral lesion screening.

Conclusion: Hyperspectral reconstruction and uncertainty-aware meta-learning significantly enhance the robustness of oral lesion classification for real-world screening applications.

Abstract: Early detection of oral cancer and potentially malignant disorders is challenging in low-resource settings due to limited annotated data. We present a unified four-class oral lesion classifier that integrates deep RGB embeddings, hyperspectral reconstruction, handcrafted spectral-textural descriptors, and demographic metadata. A pathologist-verified subset of oral cavity images was curated and processed using a fine-tuned ConvNeXt-v2 encoder, followed by RGB-to-HSI reconstruction into 31-band hyperspectral cubes. Haemoglobin-sensitive indices, texture features, and spectral-shape measures were extracted and fused with deep and clinical features. Multiple machine-learning models were assessed with patient-wise validation. We further introduce an incremental heuristic meta-learner (IHML) that combines calibrated base classifiers through probabilistic stacking and patient-level posterior smoothing. On an unseen patient split, the proposed framework achieved a macro F1 of 66.23% and an accuracy of 64.56%. Results demonstrate that hyperspectral reconstruction and uncertainty-aware meta-learning substantially improve robustness for real-world oral lesion screening.

[1226] RAA-MIL: A Novel Framework for Classification of Oral Cytology

Rupam Mukherjee, Rajkumar Daniel, Soujanya Hazra, Shirin Dasgupta, Subhamoy Mandal

Main category: eess.IV

TL;DR: First weakly supervised deep learning framework for patient-level diagnosis of oral cytology whole slide images using multiple-instance learning with spatial modeling, achieving 72.7% accuracy.

DetailsMotivation: Manual examination of cytology whole slide images for oral cancer detection is slow, subjective, and dependent on expert pathologists, creating need for automated AI solutions.

Method: Proposed Region-Affinity Attention MIL (RAA-MIL) that models spatial relationships between regions within slides, using patient-level weak labels from annotated cytology WSIs across ten medical centers.

Result: RAA-MIL achieves 72.7% average accuracy and 0.69 weighted F1-score on unseen test set, outperforming baseline MIL model.

Conclusion: Establishes first patient-level weakly supervised benchmark for oral cytology and advances toward reliable AI-assisted digital pathology.

Abstract: Cytology is a valuable tool for early detection of oral squamous cell carcinoma (OSCC). However, manual examination of cytology whole slide images (WSIs) is slow, subjective, and depends heavily on expert pathologists. To address this, we introduce the first weakly supervised deep learning framework for patient-level diagnosis of oral cytology whole slide images, leveraging the newly released Oral Cytology Dataset [1], which provides annotated cytology WSIs from ten medical centres across India. Each patient case is represented as a bag of cytology patches and assigned a diagnosis label (Healthy, Benign, Oral Potentially Malignant Disorders (OPMD), OSCC) by an in-house expert pathologist. These patient-level weak labels form a new extension to the dataset. We evaluate a baseline multiple-instance learning (MIL) model and a proposed Region-Affinity Attention MIL (RAA-MIL) that models spatial relationships between regions within each slide. The RAA-MIL achieves an average accuracy of 72.7%, weighted F1-score of 0.69 on an unseen test set, outperforming the baseline. This study establishes the first patient-level weakly supervised benchmark for oral cytology and moves toward reliable AI-assisted digital pathology.

[1227] MTMed3D: A Multi-Task Transformer-Based Model for 3D Medical Imaging

Fan Li, Arun Iyengar, Lanyu Xu

Main category: eess.IV

TL;DR: MTMed3D is a multi-task Transformer-based model that jointly performs 3D detection, segmentation, and classification in medical imaging, achieving comparable performance to single-task models with reduced computational costs.

DetailsMotivation: Single-task models in medical imaging overlook shared information across tasks, leading to inefficiencies in real-life applications. Multi-task learning can address this limitation.

Method: Proposes MTMed3D using Transformer as shared encoder for multi-scale features, followed by CNN-based task-specific decoders for joint 3D detection, segmentation, and classification.

Result: Achieved promising results on BraTS 2018/2019 datasets, especially in detection (better than prior works). Multi-task model reduces computational costs and achieves faster inference while maintaining comparable performance to single-task variants.

Conclusion: First work to leverage Transformers for multi-task learning covering detection, segmentation, and classification in 3D medical imaging, showing potential to enhance diagnostic processes.

Abstract: In the field of medical imaging, AI-assisted techniques such as object detection, segmentation, and classification are widely employed to alleviate the workload of physicians and doctors. However, single-task models are predominantly used, overlooking the shared information across tasks. This oversight leads to inefficiencies in real-life applications. In this work, we propose MTMed3D, a novel end-to-end Multi-task Transformer-based model to address the limitations of single-task models by jointly performing 3D detection, segmentation, and classification in medical imaging. Our model uses a Transformer as the shared encoder to generate multi-scale features, followed by CNN-based task-specific decoders. The proposed framework was evaluated on the BraTS 2018 and 2019 datasets, achieving promising results across all three tasks, especially in detection, where our method achieves better results than prior works. Additionally, we compare our multi-task model with equivalent single-task variants trained separately. Our multi-task model significantly reduces computational costs and achieves faster inference speed while maintaining comparable performance to the single-task models, highlighting its efficiency advantage. To the best of our knowledge, this is the first work to leverage Transformers for multi-task learning that simultaneously covers detection, segmentation, and classification tasks in 3D medical imaging, presenting its potential to enhance diagnostic processes. The code is available at https://github.com/fanlimua/MTMed3D.git.

[1228] Fine-grained Image Quality Assessment for Perceptual Image Restoration

Xiangfei Sheng, Xiaofeng Pan, Zhichao Yang, Pengfei Chen, Leida Li

Main category: eess.IV

TL;DR: The paper introduces FGRestore, the first fine-grained image quality assessment dataset for image restoration, and proposes FGResQ, a new IQA model that combines coarse-grained score regression with fine-grained quality ranking to better evaluate restored images.

DetailsMotivation: Existing image quality assessment (IQA) metrics are inadequate for perceptual image restoration tasks, particularly in distinguishing fine-grained quality differences among restored images, creating a need for more accurate evaluation methods.

Method: Created FGRestore dataset with 18,408 restored images across six IR tasks and 30,886 pairwise preferences. Proposed FGResQ model that performs both coarse-grained score regression and fine-grained quality ranking specifically for image restoration.

Result: FGResQ significantly outperforms state-of-the-art IQA metrics in extensive experiments and comparisons, demonstrating better alignment with fine-grained restoration quality.

Conclusion: The proposed FGResQ model addresses the limitations of existing IQA metrics for image restoration tasks and provides more accurate quality assessment through its dual approach of score regression and quality ranking.

Abstract: Recent years have witnessed remarkable achievements in perceptual image restoration (IR), creating an urgent demand for accurate image quality assessment (IQA), which is essential for both performance comparison and algorithm optimization. Unfortunately, the existing IQA metrics exhibit inherent weakness for IR task, particularly when distinguishing fine-grained quality differences among restored images. To address this dilemma, we contribute the first-of-its-kind fine-grained image quality assessment dataset for image restoration, termed FGRestore, comprising 18,408 restored images across six common IR tasks. Beyond conventional scalar quality scores, FGRestore was also annotated with 30,886 fine-grained pairwise preferences. Based on FGRestore, a comprehensive benchmark was conducted on the existing IQA metrics, which reveal significant inconsistencies between score-based IQA evaluations and the fine-grained restoration quality. Motivated by these findings, we further propose FGResQ, a new IQA model specifically designed for image restoration, which features both coarse-grained score regression and fine-grained quality ranking. Extensive experiments and comparisons demonstrate that FGResQ significantly outperforms state-of-the-art IQA metrics. Codes and model weights have been released in https://sxfly99.github.io/FGResQ-Homepage.

[1229] DEMIST: \underline{DE}coupled \underline{M}ulti-stream latent d\underline{I}ffusion for Quantitative Myelin Map \underline{S}yn\underline{T}hesis

Jiacheng Wang, Hao Li, Xing Yao, Ahmad Toubasi, Taegan Vinarsky, Caroline Gheen, Joy Derwenskus, Chaoyang Jin, Richard Dortch, Junzhong Xu, Francesca Bagnato, Ipek Oguz

Main category: eess.IV

TL;DR: DEMIST synthesizes quantitative magnetization transfer (qMT) pool size ratio (PSR) maps from standard T1w and FLAIR images using a 3D latent diffusion model with three complementary conditioning mechanisms, eliminating the need for specialized 20-30 minute qMT scans.

DetailsMotivation: qMT imaging provides valuable myelin-sensitive biomarkers for multiple sclerosis assessment but requires specialized long scans (20-30 minutes). The goal is to enable PSR map generation from standard clinical images.

Method: Two-stage approach: 1) Train separate autoencoders for PSR and anatomical images to learn aligned latent representations. 2) Train conditional diffusion model in latent space using frozen diffusion foundation backbone with three conditioning mechanisms: semantic tokens via cross-attention, spatial per-scale residual hints via 3D ControlNet, and adaptive LoRA-modulated attention. Includes edge-aware loss and alignment losses.

Result: Evaluated on 163 scans from 99 subjects using 5-fold cross-validation. Outperforms VAE, GAN and diffusion baselines on multiple metrics, producing sharper boundaries and better quantitative agreement with ground truth PSR maps.

Conclusion: DEMIST successfully synthesizes high-quality PSR maps from standard clinical images, providing a practical alternative to specialized qMT scans for multiple sclerosis assessment while maintaining quantitative consistency and preserving lesion boundaries.

Abstract: Quantitative magnetization transfer (qMT) imaging provides myelin-sensitive biomarkers, such as the pool size ratio (PSR), which is valuable for multiple sclerosis (MS) assessment. However, qMT requires specialized 20-30 minute scans. We propose DEMIST to synthesize PSR maps from standard T1w and FLAIR images using a 3D latent diffusion model with three complementary conditioning mechanisms. Our approach has two stages: first, we train separate autoencoders for PSR and anatomical images to learn aligned latent representations. Second, we train a conditional diffusion model in this latent space on top of a frozen diffusion foundation backbone. Conditioning is decoupled into: (i) \textbf{semantic} tokens via cross-attention, (ii) \textbf{spatial} per-scale residual hints via a 3D ControlNet branch, and (iii) \textbf{adaptive} LoRA-modulated attention. We include edge-aware loss terms to preserve lesion boundaries and alignment losses to maintain quantitative consistency, while keeping the number of trainable parameters low and retaining the inductive bias of the pretrained model. We evaluate on 163 scans from 99 subjects using 5-fold cross-validation. Our method outperforms VAE, GAN and diffusion baselines on multiple metrics, producing sharper boundaries and better quantitative agreement with ground truth. Our code is publicly available at https://github.com/MedICL-VU/MS-Synthesis-3DcLDM.

[1230] A Multicollinearity-Aware Signal-Processing Framework for Cross-$β$ Identification via X-ray Scattering of Alzheimer’s Tissue

Abdullah Al Bashit, Prakash Nepal, Lee Makowski

Main category: eess.IV

TL;DR: A three-stage classification framework for detecting cross-β inclusions in Alzheimer’s disease using X-ray scattering data, addressing challenges like substrate contamination and feature correlation.

DetailsMotivation: X-ray scattering measurements of human brain tissue contain structural signatures of pathological cross-β inclusions (hallmark of Alzheimer's), but automated detection is challenging due to substrate contamination, strong feature correlations, and limited sample sizes.

Method: Three-stage framework: 1) Bayes-optimal classifier separates mica substrate from tissue regions; 2) Multicollinearity-aware correlation pruning with formal guarantees on Bayes risk; 3) Compact neural network trained on pruned features with composite Focal+Dice loss.

Result: Top-performing model achieves 84.30% F1-score using only 11 of 211 candidate features and 174 trainable parameters, demonstrating efficient feature selection and classification.

Conclusion: The framework provides an interpretable, theory-grounded strategy for data-limited classification problems with correlated, high-dimensional experimental measurements, particularly useful for neurodegenerative tissue analysis.

Abstract: X-ray scattering measurements of in situ human brain tissue encode structural signatures of pathological cross-$β$ inclusions, yet systematic exploitation of these data for automated detection remains challenging due to substrate contamination, strong inter-feature correlations, and limited sample sizes. This work develops a three-stage classification framework for identifying cross-$β$ structural inclusions-a hallmark of Alzheimer’s disease-in X-ray scattering profiles of post-mortem human brain. Stage 1 employs a Bayes-optimal classifier to separate mica substrate from tissue regions on the basis of their distinct scattering signatures. Stage 2 introduces a multicollinearityaware, class-conditional correlation pruning scheme with formal guarantees on the induced Bayes risk and approximation error, thereby reducing redundancy while retaining class-discriminative information. Stage 3 trains a compact neural network on the pruned feature set to detect the presence or absence of cross-$β$ fibrillar ordering. The top-performing model, optimized with a composite loss combining Focal and Dice objectives, attains a test F1-score of 84.30% using 11 of 211 candidate features and 174 trainable parameters. The overall framework yields an interpretable, theory-grounded strategy for data-limited classification problems involving correlated, high-dimensional experimental measurements, exemplified here by X-ray scattering profiles of neurodegenerative tissue.

[1231] Diffusion Algorithm for Metalens Optical Aberration Correction

Harshana Weligampola, Yuanrui Chen, Abhiram Gnanasambandam, Weiheng Tang, Dilshan Godaliyadda, Hamid R. Sheikh, Qi Guo, Stanley H. Chan

Main category: eess.IV

TL;DR: A dual-branch diffusion model using Stable Diffusion XL to reconstruct sharp, full-color images from metalens-captured inputs: a sharp grayscale structure image and distorted color cue image.

DetailsMotivation: Metalenses suffer from severe optical aberrations, especially chromatic aberration, making image reconstruction challenging. Existing methods struggle with restoring high-quality full-color images from metalens captures.

Method: Uses a dual-branch diffusion model built on pre-trained Stable Diffusion XL framework to fuse information from two inputs: sharp bandpass-filtered grayscale structure image and distorted color cue image.

Result: Significantly outperforms existing deblurring and pansharpening methods, effectively restoring high-frequency details while accurately colorizing the image.

Conclusion: The proposed algorithmic solution successfully addresses metalens chromatic aberration problems, enabling high-quality full-color image reconstruction from metalens systems.

Abstract: Metalenses offer a path toward creating ultra-thin optical systems, but they inherently suffer from severe, spatially varying optical aberrations, especially chromatic aberration, which makes image reconstruction a significant challenge. This paper presents a novel algorithmic solution to this problem, designed to reconstruct a sharp, full-color image from two inputs: a sharp, bandpass-filtered grayscale structure image'' and a heavily distorted color cue’’ image, both captured by the metalens system. Our method utilizes a dual-branch diffusion model, built upon a pre-trained Stable Diffusion XL framework, to fuse information from the two inputs. We demonstrate through quantitative and qualitative comparisons that our approach significantly outperforms existing deblurring and pansharpening methods, effectively restoring high-frequency details while accurately colorizing the image.

[1232] Improving the Generalisation of Learned Reconstruction Frameworks

Emilien Valat, Ozan Öktem

Main category: eess.IV

TL;DR: GLM introduces a graph-based neural network for CT imaging that outperforms CNNs with fewer parameters, better generalization to unseen acquisition geometries, and improved computational efficiency.

DetailsMotivation: CNNs are ill-suited for CT inverse problems as they apply grid convolutions to sinogram data that lies on a line manifold, requiring excessive parameters and lacking geometric awareness.

Method: Proposes GLM - a hybrid architecture using graph convolutions to represent CT acquisition geometries and tomographic data, combined with grid convolutions for processing.

Result: GLM outperforms CNNs in structural similarity and PSNR metrics while using significantly fewer parameters, less training time, and memory. It generalizes robustly to unseen geometry variations.

Conclusion: Graph-based neural networks provide superior geometric awareness for CT imaging, enabling better performance, efficiency, and generalization compared to traditional CNNs.

Abstract: Ensuring proper generalization is a critical challenge in applying data-driven methods for solving inverse problems in imaging, as neural networks reconstructing an image must perform well across varied datasets and acquisition geometries. In X-ray Computed Tomography (CT), convolutional neural networks (CNNs) are widely used to filter the projection data but are ill-suited for this task as they apply grid-based convolutions to the sinogram, which inherently lies on a line manifold, not a regular grid. The CNNs, unaware of the geometry, are implicitly tied to it and require an excessive amount of parameters as they must infer the relations between measurements from the data rather than from prior information. The contribution of this paper is twofold. First, we introduce a graph data structure to represent CT acquisition geometries and tomographic data, providing a detailed explanation of the graph’s structure for circular, cone-beam geometries. Second, we propose GLM, a hybrid neural network architecture that leverages both graph and grid convolutions to process tomographic data. We demonstrate that GLM outperforms CNNs when performance is quantified in terms of structural similarity and peak signal-to-noise ratio, despite the fact that GLM uses only a fraction of the trainable parameters. Compared to CNNs, GLM also requires significantly less training time and memory, and its memory requirements scale better. Crucially, GLM demonstrates robust generalization to unseen variations in the acquisition geometry, like when training only on fully sampled CT data and then testing on sparse-view CT data.

[1233] BrainNormalizer: Anatomy-Informed Pseudo-Healthy Brain Reconstruction from Tumor MRI via Edge-Guided ControlNet

Min Gu Kwak, Yeonju Lee, Hairong Wang, Jing Li

Main category: eess.IV

TL;DR: BrainNormalizer is a diffusion-based framework that reconstructs pseudo-healthy brain MRIs from tumorous scans using boundary guidance, enabling anatomically plausible reconstruction without paired data.

DetailsMotivation: Brain tumors cause significant anatomical deformation that complicates diagnosis and treatment, but obtaining subject-specific healthy brain references is impossible in clinical practice.

Method: Two-stage training: first fine-tune diffusion model on tumorous/non-tumorous scans, then train ControlNet with edge-map guidance. Inference uses misalignment strategy with non-tumorous prompts and contralateral edge maps.

Result: On BraTS2020 dataset, achieves strong quantitative performance and produces anatomically plausible reconstructions in tumor-affected regions while maintaining structural coherence.

Conclusion: Provides clinically reliable anatomical references for treatment planning and enables new research in counterfactual modeling and tumor deformation analysis.

Abstract: Brain tumors are among the most clinically significant neurological diseases and remain a major cause of morbidity and mortality due to their aggressive growth and structural heterogeneity. As tumors expand, they induce substantial anatomical deformation that disrupts both local tissue organization and global brain architecture, complicating diagnosis, treatment planning, and surgical navigation. Yet a subject-specific reference of how the brain would appear without tumor-induced changes is fundamentally unobtainable in clinical practice. We present BrainNormalizer, an anatomy-informed diffusion framework that reconstructs pseudo-healthy MRIs directly from tumorous scans by conditioning the generative process on boundary cues extracted from the subject’s own anatomy. This boundary-guided conditioning enables anatomically plausible pseudo-healthy reconstruction without requiring paired non-tumorous and tumorous scans. BrainNormalizer employs a two-stage training strategy. The pretrained diffusion model is first adapted through inpainting-based fine-tuning on tumorous and non-tumorous scans. Next, an edge-map-guided ControlNet branch is trained to inject fine-grained anatomical contours into the frozen decoder while preserving learned priors. During inference, a deliberate misalignment strategy pairs tumorous inputs with non-tumorous prompts and mirrored contralateral edge maps, leveraging hemispheric correspondence to guide reconstruction. On the BraTS2020 dataset, BrainNormalizer achieves strong quantitative performance and qualitatively produces anatomically plausible reconstructions in tumor-affected regions while retaining overall structural coherence. BrainNormalizer provides clinically reliable anatomical references for treatment planning and supports new research directions in counterfactual modeling and tumor-induced deformation analysis.

[1234] cryoSENSE: Compressive Sensing Enables High-throughput Microscopy with Sparse and Generative Priors on the Protein Cryo-EM Image Manifold

Zain Shabeeb, Daniel Saeedi, Darin Tsui, Vida Jamali, Amirali Aghazadeh

Main category: eess.IV

TL;DR: cryoSENSE is a compressive sensing framework that enables cryo-EM data acquisition with up to 2.5× throughput increase while maintaining 3D resolution, using sparse and generative priors for reconstruction from undersampled measurements.

DetailsMotivation: Modern cryo-EM detectors generate massive data volumes exceeding storage and transfer bandwidth, constraining practical throughput despite enabling atomic-resolution biomolecule visualization.

Method: Hardware-software co-designed framework leveraging low-dimensional manifolds of cryo-EM images, using sparse priors in predefined bases and generative priors from denoising diffusion models for reconstruction from spatial and Fourier-domain undersampled measurements.

Result: Achieves up to 2.5× acquisition throughput increase while preserving original 3D resolution, with controllable trade-offs between masked measurements and downsampling level. Sparse priors work best for Fourier-domain measurements, while diffusion priors excel with pixel-domain measurements and severe undersampling.

Conclusion: cryoSENSE provides an effective compressive sensing solution for cryo-EM that addresses data volume constraints while maintaining structural resolution, offering practical throughput improvements for the field.

Abstract: Cryo-electron microscopy (cryo-EM) enables the atomic-resolution visualization of biomolecules; however, modern direct detectors generate data volumes that far exceed the available storage and transfer bandwidth, thereby constraining practical throughput. We introduce cryoSENSE, the computational realization of a hardware-software co-designed framework for compressive cryo-EM sensing and acquisition. We show that cryo-EM images of proteins lie on low-dimensional manifolds that can be independently represented using sparse priors in predefined bases and generative priors captured by a denoising diffusion model. cryoSENSE leverages these low-dimensional manifolds to enable faithful image reconstruction from spatial and Fourier-domain undersampled measurements while preserving downstream structural resolution. In experiments, cryoSENSE increases acquisition throughput by up to 2.5$\times$ while retaining the original 3D resolution, offering controllable trade-offs between the number of masked measurements and the level of downsampling. Sparse priors favor faithful reconstruction from Fourier-domain measurements and moderate compression, whereas generative diffusion priors achieve accurate recovery from pixel-domain measurements and more severe undersampling. Project website: https://cryosense.github.io.

[1235] Inertia-Informed Orientation Priors for Event-Based Optical Flow Estimation

Pritam P. Karmokar, William J. Beksi

Main category: eess.IV

TL;DR: A hybrid contrast maximization method for event-based optical flow estimation that combines visual and inertial cues using orientation maps derived from camera 3D velocities to guide the optimization process.

DetailsMotivation: Event cameras directly encode motion but their temporally dense yet spatially sparse nature poses challenges for optical flow estimation. Contrast maximization is a prominent method but remains highly non-convex and challenging to optimize.

Method: Proposes a biologically-inspired hybrid approach that couples visual and inertial motion cues. Uses orientation maps derived from camera 3D velocities as priors to guide the contrast maximization process, providing directional guidance and constraining motion trajectory space.

Result: The orientation-guided formulation leads to improved robustness and convergence in event-based optical flow estimation. Evaluation on MVSEC, DSEC, and ECD datasets shows superior accuracy scores over state-of-the-art methods.

Conclusion: The proposed hybrid method successfully addresses the non-convex optimization challenges in contrast maximization by incorporating inertial cues through orientation maps, resulting in more robust and accurate event-based optical flow estimation.

Abstract: Event cameras, by virtue of their working principle, directly encode motion within a scene. Many learning-based and model-based methods exist that estimate event-based optical flow, however the temporally dense yet spatially sparse nature of events poses significant challenges. To address these issues, contrast maximization (CM) is a prominent model-based optimization methodology that estimates the motion trajectories of events within an event volume by optimally warping them. Since its introduction, the CM framework has undergone a series of refinements by the computer vision community. Nonetheless, it remains a highly non-convex optimization problem. In this paper, we introduce a novel biologically-inspired hybrid CM method for event-based optical flow estimation that couples visual and inertial motion cues. Concretely, we propose the use of orientation maps, derived from camera 3D velocities, as priors to guide the CM process. The orientation maps provide directional guidance and constrain the space of estimated motion trajectories. We show that this orientation-guided formulation leads to improved robustness and convergence in event-based optical flow estimation. The evaluation of our approach on the MVSEC, DSEC, and ECD datasets yields superior accuracy scores over the state of the art.

[1236] PyPeT: A Python Perfusion Tool for Automated Quantitative Brain CT and MR Perfusion Analysis

Marijn Borghouts, Ruisheng Su

Main category: eess.IV

TL;DR: PyPeT is an open-source Python tool for processing CT and MR perfusion data that generates quantitative cerebral hemodynamic maps, offering a free and customizable alternative to expensive commercial software.

DetailsMotivation: Commercial perfusion analysis software is costly, closed source, and lacks customizability, limiting accessibility for perfusion research.

Method: PyPeT uses a unified framework with modular design, low computational burden, and extensive documentation to process both CTP and MRP data, generating CBF, CBV, MTT, TTP, and Tmax maps from 4D perfusion data.

Result: Validation shows mean SSIM around 0.8 when compared with FDA-approved commercial perfusion tools and research tools, indicating good and stable correlation with established methods.

Conclusion: PyPeT successfully provides an accessible, customizable, and validated open-source alternative for perfusion analysis that bridges the gap between commercial and research tools.

Abstract: Computed tomography perfusion (CTP) and magnetic resonance perfusion (MRP) are widely used in acute ischemic stroke assessment and other cerebrovascular conditions to generate quantitative maps of cerebral hemodynamics. While commercial perfusion analysis software exists, it is often costly, closed source, and lacks customizability. This work introduces PyPeT, an openly available Python Perfusion Tool for head CTP and MRP processing. PyPeT is capable of producing cerebral blood flow (CBF), cerebral blood volume (CBV), mean transit time (MTT), time-to-peak (TTP), and time-to-maximum (Tmax) maps from raw four-dimensional perfusion data. PyPeT aims to make perfusion research as accessible and customizable as possible. This is achieved through a unified framework in which both CTP and MRP data can be processed, with a strong focus on modularity, low computational burden, and significant inline documentation. PyPeT’s outputs can be validated through an extensive debug mode in which every step of the process is visualized. Additional validation was performed via visual and quantitative comparison with reference perfusion maps generated by three FDA-approved commercial perfusion tools and a research tool. These comparisons show a mean SSIM around 0.8 for all comparisons, indicating a good and stable correlation with FDA-approved tools. The code for PyPeT is openly available at our GitHub https://github.com/Marijn311/CT-and-MR-Perfusion-Tool

[1237] Smooth Total variation Regularization for Interference Detection and Elimination (STRIDE) for MRI

Alexander Mertens, Diego Martinez, Amgad Louka, Ying Yang, Chad Harris, Ian Connell

Main category: eess.IV

TL;DR: STRIDE method improves EMI removal in MRI by exploiting image smoothness through total variation optimization, outperforming standard methods.

DetailsMotivation: MRI needs to function near electronic devices that emit dynamic electromagnetic interference, requiring better EMI removal methods.

Method: STRIDE measures data from EMI detectors and MR coils, transforms to image domain, and optimizes total-variation smoothness for each column to remove EMI.

Result: Tested on 0.5T scanner with phantom and in-vivo data, STRIDE showed better visual EMI removal, higher temporal SNR, larger EMI removal percentage, and lower RMSE.

Conclusion: STRIDE is a robust technique that leverages MR image properties for improved EMI removal, especially effective for time-varying noise sources.

Abstract: MRI is increasingly desired to function near electronic devices that emit potentially dynamic electromagnetic interference (EMI). To accommodate for this, we propose the STRIDE method, which improves on previous external-sensor-based EMI removal methods by exploiting inherent MR image smoothness in its total variation. STRIDE measures data from both EMI detectors and primary MR imaging coils, transforms this data into the image domain, and for each column of the resulting image array, combines and subtracts data from the EMI detectors in a way that optimizes for total-variation smoothness. Performance was tested on phantom and in-vivo datasets with a 0.5T scanner. STRIDE resulted in visually better EMI removal, higher temporal SNR, larger EMI removal percentage, and lower RMSE than standard implementations. STRIDE is a robust technique that leverages inherent MR image properties to provide improved EMI removal performance over standard algorithms, particularly for time-varying noise sources.

[1238] Tubular Curvature Filter: Pointwise Curvature Calculation for Tubular Objects in Images

Elifnur Sunger, Beyza Kalkanli, Veysi Yildiz, Tales Imbiriba, Giovanna Guidoboni, Peter Campbell, Deniz Erdogmus

Main category: eess.IV

TL;DR: The paper introduces a Tubular Curvature Filter (TCF) method that accurately estimates local curvature in tubular structures without explicit segmentation, addressing limitations of centerline-based approaches in medical imaging.

DetailsMotivation: Accurate blood vessel tortuosity estimation is crucial for retinopathy of prematurity (ROP) diagnosis, but existing centerline-based methods fail to capture curvature gradients across rotating tubular structures.

Method: TCF locally calculates acceleration of curve bundles traversing tubular objects by examining directional rate of change in Hessian matrix eigenvectors, eliminating need for explicit segmentation or centerline extraction.

Result: TCF provides accurate local curvature estimates and discerns curvature differences between inner and outer sides of curved tubular objects, which centerline-based approaches cannot achieve.

Conclusion: TCF’s ability to differentiate inner and outer curvature is particularly valuable for medical vasculature analysis, especially in conditions like ROP where vessels have non-uniform diameters.

Abstract: Purpose: Accurate estimation of blood vessel tortuosity from medical images is an extremely important and challenging task. It is particularly relevant in the context of retinopathy of prematurity (ROP), where the staging of disease severity and consequent therapeutic approaches are heavily informed by the presence and prominence of vessel tortuosity. Existing methods based on centerline or skeleton curvature fail to capture curvature gradients across a rotating tubular structure, thereby limiting their effectiveness in the case of ROP. Methods: This paper defines local tubular curvature and presents the Tubular Curvature Filter (TCF) method, which locally calculates the acceleration of curve bundles traversing a tubular object parallel to its centerline. This is achieved by examining the directional rate of change in the eigenvectors of the Hessian matrix of a tubular intensity function in space. TCF implicitly calculates the local tubular curvature without the need to explicitly segment or extracting the centerline of the tubular object. Results: Experimental results demonstrate that TCF provides accurate estimates of local curvature at any point inside tubular structures. Results on 2D and 3D images show that TCF discerns curvature differences between the inner and outer sides of curved tubular objects, while centerline-based approaches cannot. Conclusion: Our findings highlight that TCF’s ability to discern between the inner and outer sides of curved tubular objects is particularly useful in medical fields that require vasculature curvature analysis from images, especially where vascular structures often have non-uniform diameters, such as in ROP.

[1239] Towards Collective Intelligence: Uncertainty-aware SAM Adaptation for Ambiguous Medical Image Segmentation

Mingzhou Jiang, Jiaying Zhou, Junde Wu, Tianyang Wang, Yueming Jin, Min Xu

Main category: eess.IV

TL;DR: Proposes an Uncertainty-aware Adapter for SAM that transitions from single-expert to collective intelligence representation in medical image segmentation, capturing expert knowledge distributions rather than individual annotations.

DetailsMotivation: Existing SAM adaptation methods follow single-expert paradigm and ignore inherent uncertainty/variability in expert annotations, contradicting clinical practice where multiple specialists provide different valid interpretations.

Method: Integrates stochastic uncertainty sampling from Conditional Variational Autoencoder into adapters, uses position-conditioned control mechanism to integrate multi-expert knowledge, enabling diverse prediction generation.

Result: Comprehensive evaluations across seven medical segmentation benchmarks demonstrate superior performance while maintaining computational efficiency.

Conclusion: Establishes a new adaptation framework for reliable clinical implementation that captures collective intelligence from multiple medical experts.

Abstract: Collective intelligence from multiple medical experts consistently surpasses individual expertise in clinical diagnosis, particularly for ambiguous medical image segmentation tasks involving unclear tissue boundaries or pathological variations. The Segment Anything Model (SAM), a powerful vision foundation model originally designed for natural image segmentation, has shown remarkable potential when adapted to medical image segmentation tasks. However, existing SAM adaptation methods follow a single-expert paradigm, developing models based on individual expert annotations to predict deterministic masks. These methods systematically ignore the inherent uncertainty and variability in expert annotations, which fundamentally contradicts clinical practice, where multiple specialists provide different yet equally valid interpretations that collectively enhance diagnostic confidence. We propose an Uncertainty-aware Adapter, the first SAM adaptation framework designed to transition from single expert mindset to collective intelligence representation. Our approach integrates stochastic uncertainty sampling from a Conditional Variational Autoencoder into the adapters, enabling diverse prediction generation that captures expert knowledge distributions rather than individual expert annotations. We employ a novel position-conditioned control mechanism to integrate multi-expert knowledge, ensuring that the output distribution closely aligns with the multi-annotation distribution. Comprehensive evaluations across seven medical segmentation benchmarks have demonstrated that our collective intelligence-based adaptation achieves superior performance while maintaining computational efficiency, establishing a new adaptation framework for reliable clinical implementation.

[1240] Constructing Per-Shot Bitrate Ladders using Visual Information Fidelity

Krishna Srikar Durbha, Alan C. Bovik

Main category: eess.IV

TL;DR: A perceptually optimized method for constructing per-shot bitrate and quality ladders using ensemble features and VIF, achieving significant computational advantages with minimal quality loss compared to exhaustive encoding.

DetailsMotivation: Video service providers need adaptive delivery systems that can respond to network conditions, user preferences, and display settings. HTTP Adaptive Streaming requires dynamic switching between video representations to optimize bandwidth consumption and user experience.

Method: Uses an ensemble of low-level features and Visual Information Fidelity (VIF) features to predict optimal per-shot bitrate and quality ladders without compression or quality estimation during inference. Compares against content-adaptive methods, fixed ladders, and reference ladders from exhaustive encoding.

Result: The proposed method shows excellent gains in bitrate and quality over fixed bitrate ladders, with only small losses against reference ladders constructed via exhaustive encoding. Provides significant computational advantages.

Conclusion: The perceptually optimized approach effectively constructs optimal per-shot bitrate and quality ladders, offering substantial computational benefits while maintaining high video quality comparable to exhaustive encoding methods.

Abstract: Video service providers need their delivery systems to be able to adapt to network conditions, user preferences, display settings, and other factors. HTTP Adaptive Streaming (HAS) offers dynamic switching between different video representations to simultaneously enhance bandwidth consumption and users’ streaming experiences. Per-shot encoding, pioneered by Netflix, optimizes the encoding parameters on each scene or shot. The Dynamic Optimizer (DO) uses the Video Multi-Method Assessment Fusion (VMAF) perceptual video quality prediction engine to deliver high-quality videos at reduced bitrates. Here we develop a perceptually optimized method of constructing optimal per-shot bitrate and quality ladders, using an ensemble of low-level features and Visual Information Fidelity (VIF) features. During inference, our method predicts the bitrate or quality ladder of a source video without any compression or quality estimation. We compare the performance of our model against other content-adaptive bitrate ladder prediction methods, a fixed bitrate ladder, and reference bitrate ladders constructed via exhaustive encoding using Bjontegaard-delta (BD) metrics. Our proposed method shows excellent gains in bitrate and quality against the fixed bitrate ladder and only small losses against the reference bitrate ladder, while providing significant computational advantages.

[1241] Subjective and Objective Quality Evaluation of Super-Resolution Enhanced Broadcast Images on a Novel SR-IQA Dataset

Yongrok Kim, Junha Shin, Juhyun Lee, Hyunsuk Ko

Main category: eess.IV

TL;DR: This paper introduces a new Image Quality Assessment dataset for Super-Resolution broadcast images in 2K and 4K resolutions, revealing limitations of current IQA metrics for SR content.

DetailsMotivation: There's a lack of research on Image Quality Assessment for SR images, especially when evaluating from low-quality sources without original high-quality references, which is crucial for broadcast content display.

Method: Created a new IQA dataset for SR broadcast images, conducted subjective quality evaluation to obtain MOS, performed human study to identify key quality factors, and evaluated existing IQA metrics on the dataset.

Result: The study revealed limitations of current IQA metrics in assessing SR image quality, showing they don’t correlate well with perceived quality of SR-enhanced broadcast content.

Conclusion: There’s a need for more robust IQA metrics that better correlate with perceived quality of SR images, and the proposed dataset provides a foundation for developing such metrics.

Abstract: Super-Resolution (SR) is essential for displaying low-quality broadcast content on high-resolution screens. Recently, SR methods have been developed that not only increase resolution while preserving the original image information but also enhance the perceived quality. However, evaluating the quality of SR images generated from low-quality sources, such as SR-enhanced broadcast content, is challenging due to the need to consider both distortions and improvements. Additionally, assessing SR image quality without original high-quality sources presents another significant challenge. Unfortunately, there has been a dearth of research specifically addressing the Image Quality Assessment (IQA) of SR images under these conditions. In this work, we introduce a new IQA dataset for SR broadcast images in both 2K and 4K resolutions. We conducted a subjective quality evaluation to obtain Mean Opinion Score (MOS) for these SR images and performed a comprehensive human study to identify key factors influencing perceived quality. Finally, we evaluated the performance of existing IQA metrics on our dataset. This study reveals the limitations of current metrics, highlighting the need for a more robust IQA metric that better correlates with the perceived quality of SR images. The proposed dataset and the subjective evaluation platform are publicly available at https://sites.google.com/hanyang.ac.kr/ivml/datasets/sreb.

[1242] Beyond H&E: Unlocking Pathological Insights with Polarization Imaging

Yao Du, Jiaxin Zhuang, Xiaoyu Zheng, Jing Cong, Limei Guo, Chao He, Lin Luo, Xiaomeng Li

Main category: eess.IV

TL;DR: PolarHE is a dual-modality fusion framework that combines H&E histopathology imaging with polarization imaging to improve tissue characterization and diagnostic accuracy in computational pathology.

DetailsMotivation: H&E imaging lacks sensitivity to birefringence and tissue anisotropy, which are crucial for assessing collagen organization and microstructural alterations in pathological conditions like tumor progression and fibrosis.

Method: Constructed a polarization imaging system and curated a dataset of 13,000+ paired Polar-H&E images. Proposed PolarHE framework with feature decomposition strategy to disentangle common and modality-specific features for effective multimodal representation learning.

Result: Achieved 86.70% accuracy on Chaoyang dataset and 89.06% on MHIST dataset, significantly outperforming previous methods.

Conclusion: Polarization imaging is a powerful underutilized modality that enriches feature representation and improves diagnostic accuracy, establishing a promising direction for multimodal learning in pathology.

Abstract: Histopathology image analysis is fundamental to digital pathology, with hematoxylin and eosin (H&E) staining as the gold standard for diagnostic and prognostic assessments. While H&E imaging effectively highlights cellular and tissue structures, it lacks sensitivity to birefringence and tissue anisotropy, which are crucial for assessing collagen organization, fiber alignment, and microstructural alterations–key indicators of tumor progression, fibrosis, and other pathological conditions. To bridge this gap, we construct a polarization imaging system and curate a new dataset of over 13,000 paired Polar-H&E images. Visualizations of polarization properties reveal distinctive optical signatures in pathological tissues, underscoring its diagnostic value. Building on this dataset, we propose PolarHE, a dual-modality fusion framework that integrates H&E with polarization imaging, leveraging the latter ability to enhance tissue characterization. Our approach employs a feature decomposition strategy to disentangle common and modality specific features, ensuring effective multimodal representation learning. Through comprehensive validation, our approach significantly outperforms previous methods, achieving an accuracy of 86.70% on the Chaoyang dataset and 89.06% on the MHIST dataset. These results demonstrate that polarization imaging is a powerful and underutilized modality in computational pathology, enriching feature representation and improving diagnostic accuracy. PolarHE establishes a promising direction for multimodal learning, paving the way for more interpretable and generalizable pathology models.

[1243] Federated Continual 3D Segmentation With Single-round Communication

Can Peng, Qianhui Men, Pramit Saha, Qianye Yang, Cheng Ouyang, J. Alison Noble

Main category: eess.IV

TL;DR: Proposes a federated continual learning strategy using one-time model aggregation via multi-model distillation to handle dynamic scenarios where new clients join or label sets expand, reducing communication overhead and synchronization requirements.

DetailsMotivation: Traditional federated learning assumes static settings, but real-world scenarios involve dynamic changes like new clients joining or evolving label sets, making conventional model aggregation inefficient with high communication costs and synchronization difficulties.

Method: Uses a one-time model aggregation at the server through multi-model distillation, reusing previous client models when integrating new data streams or onboarding new clients, eliminating frequent server communication and global model retraining.

Result: Demonstrated effectiveness using multi-class 3D abdominal CT segmentation, showing reduced communication load and relaxed synchronization requirements while maintaining performance.

Conclusion: Provides an efficient and scalable federated analysis framework suitable for real-world dynamic applications by minimizing communication overhead and bypassing the need for unchanged clients to be online.

Abstract: Federated learning seeks to foster collaboration among distributed clients while preserving the privacy of their local data. Traditionally, federated learning methods assume a fixed setting in which client data and learning objectives remain constant. However, in real-world scenarios, new clients may join, and existing clients may expand the segmentation label set as task requirements evolve. In such a dynamic federated analysis setup, the conventional federated communication strategy of model aggregation per communication round is suboptimal. As new clients join, this strategy requires retraining, linearly increasing communication and computation overhead. It also imposes requirements for synchronized communication, which is difficult to achieve among distributed clients. In this paper, we propose a federated continual learning strategy that employs a one-time model aggregation at the server through multi-model distillation. This approach builds and updates the global model while eliminating the need for frequent server communication. When integrating new data streams or onboarding new clients, this approach efficiently reuses previous client models, avoiding the need to retrain the global model across the entire federation. By minimizing communication load and bypassing the need to put unchanged clients online, our approach relaxes synchronization requirements among clients, providing an efficient and scalable federated analysis framework suited for real-world applications. Using multi-class 3D abdominal CT segmentation as an application task, we demonstrate the effectiveness of the proposed approach.

[1244] Multi-Scale Target-Aware Representation Learning for Fundus Image Enhancement

Haofan Wu, Yin Huang, Yuqing Wu, Qiuyu Yang, Bingfang Wang, Li Zhang, Muhammad Fahadullah Khan, Ali Zia, M. Saleh Memon, Syed Sohail Bukhari, Abdul Fattah Memon, Daizong Ji, Ya Zhang, Ghulam Mustafa, Yin Fang

Main category: eess.IV

TL;DR: MTRL-FIE is a multi-scale target-aware framework for fundus image enhancement that addresses limitations in existing methods by restoring comprehensive multi-scale information and focusing on pathological regions.

DetailsMotivation: Fundus images often suffer from low resolution and signal-to-noise ratio due to hardware limitations and operational variability. Existing enhancement methods lack unified frameworks for comprehensive multi-scale recovery and fail to specifically target lesion enhancement crucial for medical diagnosis.

Method: Proposes MTRL-FIE with three key components: multi-scale feature encoder using wavelet decomposition, structure-preserving hierarchical decoder for feature fusion, and target-aware feature aggregation module to enhance pathological regions and reduce artifacts.

Result: Experimental results show MTRL-FIE achieves superior enhancement performance with lightweight architecture compared to state-of-the-art methods, and generalizes to other ophthalmic image processing tasks without supervised fine-tuning.

Conclusion: MTRL-FIE provides an effective and generalizable solution for fundus image enhancement with potential for clinical applications, demonstrating better performance with more efficient architecture.

Abstract: High-quality fundus images provide essential anatomical information for clinical screening and ophthalmic disease diagnosis. Yet, due to hardware limitations, operational variability, and patient compliance, fundus images often suffer from low resolution and signal-to-noise ratio. Recent years have witnessed promising progress in fundus image enhancement. However, existing works usually focus on restoring structural details or global characteristics of fundus images, lacking a unified image enhancement framework to recover comprehensive multi-scale information. Moreover, few methods pinpoint the target of image enhancement, e.g., lesions, which is crucial for medical image-based diagnosis. To address these challenges, we propose a multi-scale target-aware representation learning framework (MTRL-FIE) for efficient fundus image enhancement. Specifically, we propose a multi-scale feature encoder (MFE) that employs wavelet decomposition to embed both low-frequency structural information and high-frequency details. Next, we design a structure-preserving hierarchical decoder (SHD) to fuse multi-scale feature embeddings for real fundus image restoration. SHD integrates hierarchical fusion and group attention mechanisms to achieve adaptive feature fusion while retaining local structural smoothness. Meanwhile, a target-aware feature aggregation (TFA) module is used to enhance pathological regions and reduce artifacts. Experimental results on multiple fundus image datasets demonstrate the effectiveness and generalizability of MTRL-FIE for fundus image enhancement. Compared to state-of-the-art methods, MTRL-FIE achieves superior enhancement performance with a more lightweight architecture. Furthermore, our approach generalizes to other ophthalmic image processing tasks without supervised fine-tuning, highlighting its potential for clinical applications.

[1245] Whitened Score Diffusion: A Structured Prior for Imaging Inverse Problems

Jeffrey Alido, Tongyu Li, Yu Sun, Lei Tian

Main category: eess.IV

TL;DR: Whitened Score diffusion models learn whitened score functions instead of standard scores, enabling stable training on arbitrary Gaussian noise processes and outperforming conventional diffusion models on imaging tasks.

DetailsMotivation: Conventional diffusion models struggle with anisotropic Gaussian diffusion due to required covariance matrix inversion in denoising score matching, limiting their applicability to arbitrary Gaussian noise processes.

Method: Proposed Whitened Score diffusion models based on stochastic differential equations that learn whitened score functions, circumventing covariance inversion and enabling training on arbitrary Gaussian forward noising processes.

Result: WS diffusion models outperform conventional diffusion priors on various computational imaging tasks using CIFAR, CelebA, and CelebA-HQ datasets, particularly when trained on anisotropic Gaussian noising processes.

Conclusion: Whitened Score diffusion models provide a stable framework for training diffusion models on arbitrary Gaussian noise, establish equivalence with flow matching, enable spectral inductive biases, and offer strong Bayesian priors for imaging inverse problems with structured noise.

Abstract: Conventional score-based diffusion models (DMs) may struggle with anisotropic Gaussian diffusion processes due to the required inversion of covariance matrices in the denoising score matching training objective \cite{vincent_connection_2011}. We propose Whitened Score (WS) diffusion models, a novel framework based on stochastic differential equations that learns the Whitened Score function instead of the standard score. This approach circumvents covariance inversion, extending score-based DMs by enabling stable training of DMs on arbitrary Gaussian forward noising processes. WS DMs establish equivalence with flow matching for arbitrary Gaussian noise, allow for tailored spectral inductive biases, and provide strong Bayesian priors for imaging inverse problems with structured noise. We experiment with a variety of computational imaging tasks using the CIFAR, CelebA ($64\times64$), and CelebA-HQ ($256\times256$) datasets and demonstrate that WS diffusion priors trained on anisotropic Gaussian noising processes consistently outperform conventional diffusion priors based on isotropic Gaussian noise. Our code is open-sourced at \href{https://github.com/jeffreyalido/wsdiffusion}{\texttt{github.com/jeffreyalido/wsdiffusion}}.

[1246] Towards Prospective Medical Image Reconstruction via Knowledge-Informed Dynamic Optimal Transport

Taoran Zheng, Yan Yang, Xing Li, Xiang Gu, Jian Sun, Zongben Xu

Main category: eess.IV

TL;DR: KIDOT is a dynamic optimal transport framework for medical image reconstruction that bridges the retrospective-to-prospective gap by learning from unpaired data while preserving imaging physics consistency.

DetailsMotivation: Address the performance degradation in deep learning reconstruction methods when moving from simulated paired data to real prospective data due to incomplete imaging knowledge in simulation.

Method: Proposes imaging Knowledge-Informed Dynamic Optimal Transport (KIDOT) that models reconstruction as a continuous evolution path from measurements to images, guided by imaging physics-informed cost function and transport equation.

Result: Extensive experiments on MRI and CT reconstruction demonstrate KIDOT’s superior performance compared to existing methods.

Conclusion: KIDOT provides a mathematically rigorous framework that enhances robustness in medical image reconstruction by better leveraging unpaired data while respecting acquisition physics.

Abstract: Medical image reconstruction from measurement data is a vital but challenging inverse problem. Deep learning approaches have achieved promising results, but often requires paired measurement and high-quality images, which is typically simulated through a forward model, i.e., retrospective reconstruction. However, training on simulated pairs commonly leads to performance degradation on real prospective data due to the retrospective-to-prospective gap caused by incomplete imaging knowledge in simulation. To address this challenge, this paper introduces imaging Knowledge-Informed Dynamic Optimal Transport (KIDOT), a novel dynamic optimal transport framework with optimality in the sense of preserving consistency with imaging physics in transport, that conceptualizes reconstruction as finding a dynamic transport path. KIDOT learns from unpaired data by modeling reconstruction as a continuous evolution path from measurements to images, guided by an imaging knowledge-informed cost function and transport equation. This dynamic and knowledge-aware approach enhances robustness and better leverages unpaired data while respecting acquisition physics. Theoretically, we demonstrate that KIDOT naturally generalizes dynamic optimal transport, ensuring its mathematical rationale and solution existence. Extensive experiments on MRI and CT reconstruction demonstrate KIDOT’s superior performance.

[1247] Unsupervised patch-based dynamic MRI reconstruction using learnable tensor function with implicit neural representation

Yuanyuan Liu, Yuanbiao Yang, Jing Cheng, Zhuo-Xu Cui, Qingyong Zhu, Congcong Liu, Yuliang Zhu, Jingran Xu, Hairong Zheng, Dong Liang, Yanjie Zhu

Main category: eess.IV

TL;DR: TenF-INR integrates low-rank tensor modeling with implicit neural representations for unsupervised dynamic MRI reconstruction, achieving up to 21-fold acceleration with superior image quality compared to state-of-the-art methods.

DetailsMotivation: Dynamic MRI suffers from limited spatiotemporal resolution due to long acquisition times. Supervised deep learning methods require large fully sampled datasets that are difficult to obtain, while existing INR-based methods struggle with highly undersampled dynamic MRI due to inefficient representation capacity and high computational cost.

Method: Proposes TenF-INR framework that integrates low-rank tensor modeling with INR, where each factor matrix in tensor decomposition is modeled as a learnable factor function. Uses patch-based nonlocal tensor modeling to exploit temporal correlations and inter-patch similarities, reducing parameter space and computational burden.

Result: Experiments on dynamic cardiac and abdominal datasets demonstrate TenF-INR achieves up to 21-fold acceleration, outperforming both supervised and unsupervised state-of-the-art methods in image quality, temporal fidelity, and quantitative accuracy.

Conclusion: TenF-INR provides an effective unsupervised framework for dynamic MRI reconstruction that combines the benefits of low-rank tensor modeling and implicit neural representations, enabling high-quality reconstruction from highly undersampled data without requiring external training datasets.

Abstract: Dynamic MRI suffers from limited spatiotemporal resolution due to long acquisition times. Undersampling k-space accelerates imaging but makes accurate reconstruction challenging. Supervised deep learning methods achieve impressive results but rely on large fully sampled datasets, which are difficult to obtain. Recently, implicit neural representations (INR) have emerged as a powerful unsupervised paradigm that reconstructs images from a single undersampled dataset without external training data. However, existing INR-based methods still face challenges when applied to highly undersampled dynamic MRI, mainly due to their inefficient representation capacity and high computational cost. To address these issues, we propose TenF-INR, a novel unsupervised framework that integrates low-rank tensor modeling with INR, where each factor matrix in the tensor decomposition is modeled as a learnable factor function. Specifically,we employ INR to model learnable tensor functions within a low-rank decomposition, reducing the parameter space and computational burden. A patch-based nonlocal tensor modeling strategy further exploits temporal correlations and inter-patch similarities, enhancing the recovery of fine spatiotemporal details. Experiments on dynamic cardiac and abdominal datasets demonstrate that TenF-INR achieves up to 21-fold acceleration, outperforming both supervised and unsupervised state-of-the-art methods in image quality, temporal fidelity, and quantitative accuracy.

[1248] Rethinking Whole-Body CT Image Interpretation: An Abnormality-Centric Approach

Ziheng Zhao, Lisong Dai, Ya Zhang, Yanfeng Wang, Weidi Xie

Main category: eess.IV

TL;DR: OmniAbnorm-CT is a comprehensive system for automated interpretation of CT images that can localize and describe abnormal findings across multi-plane and whole-body scans through taxonomy development, dataset creation, model development, and clinical evaluation.

DetailsMotivation: Automated interpretation of CT images, particularly localizing and describing abnormal findings across multi-plane and whole-body scans, remains a significant challenge in clinical radiology that needs to be addressed.

Method: Four key contributions: (i) hierarchical classification system with 404 abnormal findings, (ii) dataset with 14.5K CT images and 19K abnormality annotations, (iii) OmniAbnorm-CT model for automatic grounding and description of abnormalities based on text queries and visual prompts, (iv) three clinical tasks and evaluation metric.

Result: OmniAbnorm-CT significantly outperforms existing methods in both internal and external validations across all tasks, demonstrating superior performance in automated CT image interpretation.

Conclusion: The proposed comprehensive approach successfully addresses the challenge of automated CT image interpretation through taxonomy, data, model development, and clinical evaluation, with OmniAbnorm-CT showing superior performance over existing methods.

Abstract: Automated interpretation of CT images-particularly localizing and describing abnormal findings across multi-plane and whole-body scans-remains a significant challenge in clinical radiology. This work aims to address this challenge through four key contributions: (i) On taxonomy, we collaborate with senior radiologists to propose a comprehensive hierarchical classification system, with 404 representative abnormal findings across all body regions; (ii) On data, we contribute a dataset containing over 14.5K CT images from multiple planes and all human body regions, and meticulously provide grounding annotations for over 19K abnormalities, each linked to the detailed description and cast into the taxonomy; (iii) On model development, we propose OmniAbnorm-CT, which can automatically ground and describe abnormal findings on multi-plane and whole-body CT images based on text queries, while also allowing flexible interaction through visual prompts; (iv) On evaluation, we establish three representative tasks based on real clinical scenarios, and introduce a clinically grounded metric to assess abnormality descriptions. Through extensive experiments, we show that OmniAbnorm-CT can significantly outperform existing methods in both internal and external validations, and across all the tasks.

[1249] An Explainable Deep Learning Framework for Brain Stroke and Tumor Progression via MRI Interpretation

Rajan Das Gupta, Md Imrul Hasan Showmick, Mushfiqur Rahman Abir, Shanjida Akter, Md. Yeasin Rahat, Md. Jakir Hossen

Main category: eess.IV

TL;DR: Deep learning system using MobileNet V2 and ResNet-50 for detecting brain tumors and strokes from MRI images with high accuracy (93% training, 88% validation).

DetailsMotivation: Early and accurate detection of brain abnormalities like tumors and strokes is essential for timely intervention and improved patient outcomes.

Method: Used transfer learning with convolutional neural networks (MobileNet V2 and ResNet-50) to classify MRI scans into five diagnostic categories, with dataset augmentation, dropout layers, and class balancing.

Result: Models achieved strong performance with 93% training accuracy and 88% validation accuracy. ResNet-50 performed slightly better, but MobileNet V2 is suitable for low-resource settings.

Conclusion: The research offers a practical AI-driven solution for early brain abnormality detection with potential for clinical deployment and future enhancements through larger datasets and multi-modal inputs.

Abstract: Early and accurate detection of brain abnormalities, such as tumors and strokes, is essential for timely intervention and improved patient outcomes. In this study, we present a deep learning-based system capable of identifying both brain tumors and strokes from MRI images, along with their respective stages. We have executed two groundbreaking strategies involving convolutional neural networks, MobileNet V2 and ResNet-50-optimized through transfer learning to classify MRI scans into five diagnostic categories. Our dataset, aggregated and augmented from various publicly available MRI sources, was carefully curated to ensure class balance and image diversity. To enhance model generalization and prevent overfitting, we applied dropout layers and extensive data augmentation. The models achieved strong performance, with training accuracy reaching 93% and validation accuracy up to 88%. While ResNet-50 demonstrated slightly better results, Mobile Net V2 remains a promising option for real-time diagnosis in low resource settings due to its lightweight architecture. This research offers a practical AI-driven solution for early brain abnormality detection, with potential for clinical deployment and future enhancement through larger datasets and multi modal inputs.

[1250] Sequential Attention-based Sampling for Histopathological Analysis

Tarun G, Naman Malpani, Gugan Thoppe, Sridharan Devarajan

Main category: eess.IV

TL;DR: SASHA is a deep reinforcement learning approach that uses sequential attention-based sampling to efficiently analyze gigapixel histopathology images by selectively zooming into only 10-20% of high-resolution patches while maintaining diagnostic accuracy.

DetailsMotivation: Whole-slide histopathology images are computationally infeasible to analyze entirely at high resolution, diagnostic labels are only available at slide-level, and diagnostic regions occupy only small fractions of the image, making full-resolution examination inefficient.

Method: Uses deep reinforcement learning with: 1) lightweight hierarchical attention-based multiple instance learning to learn informative features, 2) intelligent sampling to selectively zoom into only 10-20% of high-resolution patches.

Result: Matches state-of-the-art methods that analyze entire WSIs at high resolution, but with significantly reduced computational and memory costs. Significantly outperforms competing sparse sampling methods.

Conclusion: SASHA serves as an intelligent sampling model for medical imaging challenges involving automated diagnosis with exceptionally large images containing sparsely informative features.

Abstract: Deep neural networks are increasingly applied in automated histopathology. Yet, whole-slide images (WSIs) are often acquired at gigapixel sizes, rendering them computationally infeasible to analyze entirely at high resolution. Diagnostic labels are largely available only at the slide-level, because expert annotation of images at a finer (patch) level is both laborious and expensive. Moreover, regions with diagnostic information typically occupy only a small fraction of the WSI, making it inefficient to examine the entire slide at full resolution. Here, we propose SASHA – Sequential Attention-based Sampling for Histopathological Analysis – a deep reinforcement learning approach for efficient analysis of histopathological images. First, SASHA learns informative features with a lightweight hierarchical, attention-based multiple instance learning (MIL) model. Second, SASHA samples intelligently and zooms selectively into a small fraction (10-20%) of high-resolution patches to achieve reliable diagnoses. We show that SASHA matches state-of-the-art methods that analyze the WSI fully at high resolution, albeit at a fraction of their computational and memory costs. In addition, it significantly outperforms competing, sparse sampling methods. We propose SASHA as an intelligent sampling model for medical imaging challenges that involve automated diagnosis with exceptionally large images containing sparsely informative features. Model implementation is available at: https://github.com/coglabiisc/SASHA.

[1251] Fast Equivariant Imaging: Acceleration for Unsupervised Learning via Augmented Lagrangian and Auxiliary PnP Denoisers

Guixian Xu, Jinglai Li, Junqi Tang

Main category: eess.IV

TL;DR: Fast Equivariant Imaging (FEI) is a novel unsupervised framework that accelerates deep imaging network training by 10x without ground-truth data, using Lagrange multipliers and plug-and-play denoisers.

DetailsMotivation: To overcome the computational inefficiency of standard Equivariant Imaging (EI) methods for training deep imaging networks without requiring ground-truth data.

Method: Reformulates the Equivariant Imaging optimization problem using Lagrange multipliers and incorporates plug-and-play denoisers to create a more efficient unsupervised training scheme.

Result: Achieves 10x acceleration over standard EI when training U-Net for X-ray CT reconstruction and image inpainting, with improved generalization performance.

Conclusion: FEI provides a superior unsupervised learning framework that significantly accelerates training while maintaining or improving performance compared to vanilla Equivariant Imaging.

Abstract: In this work, we propose Fast Equivariant Imaging (FEI), a novel unsupervised learning framework to rapidly and efficiently train deep imaging networks without ground-truth data. From the perspective of reformulating the Equivariant Imaging based optimization problem via the method of Lagrange multipliers and utilizing plug-and-play denoisers, this novel unsupervised scheme shows superior efficiency and performance compared to the vanilla Equivariant Imaging paradigm. In particular, our FEI schemes achieve an order-of-magnitude (10x) acceleration over standard EI on training U-Net for X-ray CT reconstruction and image inpainting, with improved generalization performance.

[1252] Virtual Multiplex Staining for Histological Images using a Marker-wise Conditioned Diffusion Model

Hyun-Jic Oh, Junsik Kim, Zhiyi Shi, Yichen Wu, Yu-An Chen, Peter K. Sorger, Hanspeter Pfister, Won-Ki Jeong

Main category: eess.IV

TL;DR: A novel framework using latent diffusion models to generate multiplex biomarker images from H&E stains, enabling virtual multiplex staining with up to 18 different marker types.

DetailsMotivation: Multiplex imaging provides molecular insights but is complex and costly, while existing H&E image repositories lack corresponding multiplex data, limiting multimodal analysis opportunities.

Method: Uses pretrained latent diffusion models with conditional diffusion for marker-by-marker generation, fine-tuned for single-step sampling with pixel-level loss functions to handle varying stain distributions and improve efficiency.

Result: Validated on two public datasets, achieving generation of up to 18 different marker types with improved accuracy, significantly surpassing previous approaches that only handled 2-3 markers.

Conclusion: Bridges the gap between H&E and multiplex imaging, enabling retrospective studies and large-scale analysis of existing H&E repositories through virtual multiplex staining.

Abstract: Multiplex imaging is revolutionizing pathology by enabling the simultaneous visualization of multiple biomarkers within tissue samples, providing molecular-level insights that traditional hematoxylin and eosin (H&E) staining cannot provide. However, the complexity and cost of multiplex data acquisition have hindered its widespread adoption. Additionally, most existing large repositories of H&E images lack corresponding multiplex images, limiting opportunities for multimodal analysis. To address these challenges, we leverage recent advances in latent diffusion models (LDMs), which excel at modeling complex data distributions by utilizing their powerful priors for fine-tuning to a target domain. In this paper, we introduce a novel framework for virtual multiplex staining that utilizes pretrained LDM parameters to generate multiplex images from H&E images using a conditional diffusion model. Our approach enables marker-by-marker generation by conditioning the diffusion model on each marker, while sharing the same architecture across all markers. To tackle the challenge of varying pixel value distributions across different marker stains and to improve inference speed, we fine-tune the model for single-step sampling, enhancing both color contrast fidelity and inference efficiency through pixel-level loss functions. We validate our framework on two publicly available datasets, notably demonstrating its effectiveness in generating up to 18 different marker types with improved accuracy, a substantial increase over the 2-3 marker types achieved in previous approaches. This validation highlights the potential of our framework, pioneering virtual multiplex staining. Finally, this paper bridges the gap between H&E and multiplex imaging, potentially enabling retrospective studies and large-scale analyses of existing H&E image repositories.

[1253] DoSReMC: Domain Shift Resilient Mammography Classification using Batch Normalization Adaptation

Uğurcan Akyüz, Deniz Katircioglu-Öztürk, Emre K. Süslü, Burhan Keleş, Mete C. Kaya, Gamze Durhan, Meltem G. Akpınar, Figen B. Demirkazık, Gözde B. Akar

Main category: eess.IV

TL;DR: DoSReMC is a batch normalization adaptation framework that enhances cross-domain generalization for mammography classification by fine-tuning only BN and FC layers, addressing domain shift issues without full model retraining.

DetailsMotivation: Deep learning models for breast cancer recognition suffer performance drops when applied to different domains due to domain shift, limiting safe AI deployment in clinical settings.

Method: Fine-tune only batch normalization and fully connected layers while preserving pretrained convolutional filters, integrated with adversarial training to improve cross-domain generalization and reduce computational costs.

Result: BN layers are identified as a primary source of domain dependence, and DoSReMC significantly improves cross-domain performance across three large-scale FFDM datasets including a new pathologically confirmed in-house dataset.

Conclusion: DoSReMC provides a practical pathway for robust and generalizable mammography classification that can be easily integrated into existing AI pipelines across diverse clinical environments.

Abstract: Numerous deep learning-based solutions have been developed for the automatic recognition of breast cancer using mammography images. However, their performance often declines when applied to data from different domains, primarily due to domain shift - the variation in data distributions between source and target domains. This performance drop limits the safe and equitable deployment of AI in real-world clinical settings. In this study, we present DoSReMC (Domain Shift Resilient Mammography Classification), a batch normalization (BN) adaptation framework designed to enhance cross-domain generalization without retraining the entire model. Using three large-scale full-field digital mammography (FFDM) datasets - including HCTP, a newly introduced, pathologically confirmed in-house dataset - we conduct a systematic cross-domain evaluation with convolutional neural networks (CNNs). Our results demonstrate that BN layers are a primary source of domain dependence: they perform effectively when training and testing occur within the same domain, and they significantly impair model generalization under domain shift. DoSReMC addresses this limitation by fine-tuning only the BN and fully connected (FC) layers, while preserving pretrained convolutional filters. We further integrate this targeted adaptation with an adversarial training scheme, yielding additional improvements in cross-domain generalizability while reducing the computational cost of model training. DoSReMC can be readily incorporated into existing AI pipelines and applied across diverse clinical environments, providing a practical pathway toward more robust and generalizable mammography classification systems.

[1254] Dedelayed: Deleting remote inference delay via on-device correction

Dan Jacobellis, Mateen Ulhaq, Fabien Racapé, Hyomin Choi, Neeraja J. Yadwadkar

Main category: eess.IV

TL;DR: Dedelayed is a real-time video inference system that splits computation between local and remote models to overcome latency limitations, improving segmentation accuracy by 6.4-9.8 mIoU compared to fully local or remote approaches.

DetailsMotivation: Current video understanding models are too expensive for resource-constrained platforms, and cloud offloading has latency issues while local inference sacrifices accuracy due to computational constraints.

Method: Divides computation between remote model (processing delayed frames) and local model (current frame), with remote model trained to predict future frames. Uses joint optimization with autoencoder to limit transmission bitrate.

Result: On BDD100k driving dataset with 100ms delay: 6.4 mIoU improvement over fully local inference, 9.8 mIoU improvement over remote inference - equivalent to using 10x larger model.

Conclusion: Dedelayed effectively addresses real-time video inference challenges by combining local and remote computation, achieving significant accuracy improvements while maintaining low latency.

Abstract: Video comprises the vast majority of bits that are generated daily, and is the primary signal driving current innovations in robotics, remote sensing, and wearable technology. Yet, the most powerful video understanding models are too expensive for the resource-constrained platforms used in these applications. One approach is to offload inference to the cloud; this gives access to GPUs capable of processing high-resolution videos in real time. But even with reliable, high-bandwidth communication channels, the combined latency of video encoding, model inference, and round-trip communication prohibits use for certain real-time applications. The alternative is to use fully local inference; but this places extreme constraints on computational and power costs, requiring smaller models and lower resolution, leading to degraded accuracy. To address these challenges, we propose Dedelayed, a real-time inference system that divides computation between a remote model operating on delayed video frames and a local model with access to the current frame. The remote model is trained to make predictions on anticipated future frames, which the local model incorporates into its prediction for the current frame. The local and remote models are jointly optimized with an autoencoder that limits the transmission bitrate required by the available downlink communication channel. We evaluate Dedelayed on the task of real-time streaming video segmentation using the BDD100k driving dataset. For a round trip delay of 100 ms, Dedelayed improves performance by 6.4 mIoU compared to fully local inference and 9.8 mIoU compared to remote inference – an equivalent improvement to using a model ten times larger.

[1255] Unsupervised Motion-Compensated Decomposition for Cardiac MRI Reconstruction via Neural Representation

Xuanyu Tian, Lixuan Chen, Qing Wu, Xiao Wang, Jie Feng, Yuyao Zhang, Hongjiang Wei

Main category: eess.IV

TL;DR: MoCo-INR is an unsupervised method that combines implicit neural representations with motion-compensated framework for high-quality cardiac MRI reconstruction from highly undersampled data, achieving 20x acceleration with superior results.

DetailsMotivation: Current CMR reconstruction methods either produce unsatisfactory image quality or are limited by scarce ground truth data, restricting clinical applicability.

Method: Integrates implicit neural representations (INR) with motion-compensated framework using explicit motion modeling and continuous INR priors, with a specialized INR network architecture for CMR.

Result: Achieves superior performance over state-of-the-art methods with fast convergence and fine-detailed reconstructions at 20x acceleration, validated on both retrospective and prospective free-breathing CMR scans.

Conclusion: MoCo-INR demonstrates clinical practicality for real-time CMR imaging and effectiveness through ablation studies confirming critical components.

Abstract: Cardiac magnetic resonance (CMR) imaging is widely used to characterize cardiac morphology and function. To accelerate CMR imaging, various methods have been proposed to recover high-quality spatiotemporal CMR images from highly undersampled k-t space data. However, current CMR reconstruction techniques either fail to achieve satisfactory image quality or are restricted by the scarcity of ground truth data, leading to limited applicability in clinical scenarios. In this work, we proposed MoCo-INR, a new unsupervised method that integrates implicit neural representations (INR) with the conventional motion-compensated (MoCo) framework. Using explicit motion modeling and the continuous prior of INRs, MoCo-INR can produce accurate cardiac motion decomposition and high-quality CMR reconstruction. Furthermore, we introduce a new INR network architecture tailored to the CMR problem, which significantly stabilizes model optimization. Experiments on retrospective (simulated) datasets demonstrate the superiority of MoCo-INR over state-of-the-art methods, achieving fast convergence and fine-detailed reconstructions at ultra-high acceleration factors (e.g., 20x in VISTA sampling). Additionally, evaluations on prospective (real-acquired) free-breathing CMR scans highlight the clinical practicality of MoCo-INR for real-time imaging. Several ablation studies further confirm the effectiveness of the critical components of MoCo-INR.

Last updated: 2025-11-28
Built with Hugo, theme modified on Stack