Today’s Research Highlights
AI-enhanced summaries of the latest research papers from arXiv.
Table of Contents
- cs.CL [Total: 92]
- cs.CV [Total: 132]
- cs.AI [Total: 57]
- cs.SD [Total: 13]
- cs.LG [Total: 100]
- cs.MA [Total: 5]
- cs.MM [Total: 4]
- eess.AS [Total: 11]
- eess.IV [Total: 12]
cs.CL
[1] Argument Quality Annotation and Gender Bias Detection in Financial Communication through Large Language Models
Alaa Alhamzeh, Mays Al Rebdawi
Main category: cs.CL
TL;DR: The paper evaluates GPT-4o, Llama 3.1, and Gemma 2 for annotating financial argument quality, comparing them to human annotations and testing for gender bias. Findings show higher consistency in LLMs but lingering bias issues.
Details
Motivation: Assessing argument quality in financial communications is understudied, and LLMs offer potential for reliable, scalable annotation.
Method: Uses the FinArgQuality dataset to evaluate LLM annotation consistency and introduces an adversarial attack to test gender bias. Experiments vary temperature settings.
Result: LLMs achieve higher inter-annotator agreement than humans but show gender bias. Temperature settings affect stability.
Conclusion: LLMs are promising for annotation but require bias mitigation. Recommendations are provided for future research.
Abstract: Financial arguments play a critical role in shaping investment decisions and public trust in financial institutions. Nevertheless, assessing their quality remains poorly studied in the literature. In this paper, we examine the capabilities of three state-of-the-art LLMs (GPT-4o, Llama 3.1, and Gemma 2) in annotating argument quality within financial communications, using the FinArgQuality dataset. Our contributions are twofold. First, we evaluate the consistency of LLM-generated annotations across multiple runs and benchmark them against human annotations. Second, we introduce an adversarial attack designed to inject gender bias, allowing us to analyse the models' responses and assess their fairness and robustness. Both experiments are conducted across three temperature settings to assess their influence on annotation stability and alignment with human labels. Our findings reveal that LLM-based annotations achieve higher inter-annotator agreement than human counterparts, though the models still exhibit varying degrees of gender bias. We provide a multifaceted analysis of these outcomes and offer practical recommendations to guide future research toward more reliable, cost-effective, and bias-aware annotation methodologies.
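A minimal sketch of the consistency measurement, assuming annotation runs are stored as parallel label lists; the run names, labels, and the choice of mean pairwise Cohen's kappa are illustrative, as the summary does not specify the paper's agreement statistic.

```python
from itertools import combinations
from sklearn.metrics import cohen_kappa_score

# Hypothetical runs: each list holds one model run's quality labels for the
# same arguments; real labels would come from repeated LLM annotation passes.
runs = {
    "gpt4o_t0.2": ["high", "low", "medium", "high", "low"],
    "gpt4o_t0.7": ["high", "low", "medium", "medium", "low"],
    "llama_t0.2": ["high", "medium", "medium", "high", "low"],
}

# Mean pairwise Cohen's kappa as a simple inter-annotator agreement score.
pairs = list(combinations(runs, 2))
kappas = [cohen_kappa_score(runs[a], runs[b]) for a, b in pairs]
for (a, b), k in zip(pairs, kappas):
    print(f"{a} vs {b}: kappa = {k:.3f}")
print(f"mean pairwise kappa = {sum(kappas) / len(kappas):.3f}")
```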
[2] TurQUaz at CheckThat! 2025: Debating Large Language Models for Scientific Web Discourse Detection
Tarık Saraç, Selin Mergen, Mucahid Kutlu
Main category: cs.CL
TL;DR: A novel council debate method using LLMs for detecting scientific content in tweets, excelling in identifying references to scientific studies.
Details
Motivation: To improve detection of scientific claims, references, and entities in tweets through structured LLM debates.
Method: Three debating methods: single debate, team debate, and council debate (the primary model).
Result: Council debate ranked first in detecting references to scientific studies but performed poorly for claims and entities.
Conclusion: Council debate is effective for specific tasks but requires refinement for broader applicability.
Abstract: In this paper, we present our work developed for the scientific web discourse detection task (Task 4a) of CheckThat! 2025. We propose a novel council debate method that simulates structured academic discussions among multiple large language models (LLMs) to identify whether a given tweet contains (i) a scientific claim, (ii) a reference to a scientific study, or (iii) mentions of scientific entities. We explore three debating methods: i) single debate, where two LLMs argue for opposing positions while a third acts as a judge; ii) team debate, in which multiple models collaborate within each side of the debate; and iii) council debate, where multiple expert models deliberate together to reach a consensus, moderated by a chairperson model. We choose council debate as our primary model as it outperforms others in the development test set. Although our proposed method did not rank highly for identifying scientific claims (8th out of 10) or mentions of scientific entities (9th out of 10), it ranked first in detecting references to scientific studies.
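The council-debate control flow can be sketched as follows, assuming a generic `ask(model, prompt)` chat-completion callable supplied by the caller; the prompts, the single revision round, and the yes/no output format are assumptions, not the authors' exact protocol.

```python
from typing import Callable

def council_debate(tweet: str, experts: list[str], chair: str,
                   ask: Callable[[str, str], str]) -> str:
    """Council debate: expert models deliberate, a chairperson model decides.
    `ask` is any (model_name, prompt) -> response chat call."""
    # Round 1: each expert states an initial position.
    opinions = {
        e: ask(e, "Does this tweet reference a scientific study? "
                  f"Answer yes/no with a one-sentence argument.\nTweet: {tweet}")
        for e in experts
    }
    # Round 2: experts see peers' opinions and may revise (consensus-seeking).
    revised = {
        e: ask(e, f"Peer opinions: {opinions}. Confirm or revise your answer.\n"
                  f"Tweet: {tweet}")
        for e in experts
    }
    transcript = "\n".join(f"{e}: {o}" for e, o in revised.items())
    # The chairperson moderates and issues the final verdict.
    return ask(chair, "As chairperson, read the deliberation and give the final "
                      f"yes/no verdict.\n{transcript}")
```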
[3] Heartificial Intelligence: Exploring Empathy in Language Models
Victoria Williams, Benjamin Rosman
Main category: cs.CL
TL;DR: LLMs outperform humans in cognitive empathy but lag in affective empathy, showing potential for virtual companionship without emotional fatigue.
Details
Motivation: To assess cognitive and affective empathy in language models compared to humans, given their growing role as virtual assistants and companions.
Method: Standardized psychological tests were used to evaluate cognitive and affective empathy across small and large language models and human participants.
Result: LLMs surpassed humans in cognitive empathy but showed significantly lower affective empathy.
Conclusion: LLMs’ high cognitive empathy suggests strong potential for virtual companionship, while their low affective empathy ensures objective, consistent support without emotional bias.
Abstract: Large language models have become increasingly common, used by millions of people worldwide in both professional and personal contexts. As these models continue to advance, they are frequently serving as virtual assistants and companions. In human interactions, effective communication typically involves two types of empathy: cognitive empathy (understanding others’ thoughts and emotions) and affective empathy (emotionally sharing others’ feelings). In this study, we investigated both cognitive and affective empathy across several small (SLMs) and large (LLMs) language models using standardized psychological tests. Our results revealed that LLMs consistently outperformed humans - including psychology students - on cognitive empathy tasks. However, despite their cognitive strengths, both small and large language models showed significantly lower affective empathy compared to human participants. These findings highlight rapid advancements in language models’ ability to simulate cognitive empathy, suggesting strong potential for providing effective virtual companionship and personalized emotional support. Additionally, their high cognitive yet lower affective empathy allows objective and consistent emotional support without running the risk of emotional fatigue or bias.
[4] Real-time News Story Identification
Tadej Škvorc, Nikola Ivačič, Sebastjan Hribar, Marko Robnik-Šikonja
Main category: cs.CL
TL;DR: The paper presents a real-time story identification method for news articles, combining text representation, clustering, and online topic modeling to group articles by specific events, places, and people.
Details
Motivation: To enhance news monitoring systems by automatically assigning articles to topical stories in real time, focusing on specific events rather than general text similarity or predefined topics.
Method: Combines text representation techniques, clustering algorithms, and online topic modeling (e.g., BERTopic, DBStream, TextClust) to identify and group news articles into stories.
Result: The approach was evaluated on a Slovene news dataset and produced sensible results as judged by human evaluators.
Conclusion: The proposed method effectively enables real-time story identification for news monitoring systems, improving the organization of news articles.
Abstract: To improve the reading experience, many news sites organize news into topical collections, called stories. In this work, we present an approach for implementing real-time story identification for a news monitoring system that automatically collects news articles as they appear online and processes them in various ways. Story identification aims to assign each news article to a specific story that the article is covering. The process is similar to text clustering and topic modeling, but requires that articles be grouped based on particular events, places, and people, rather than general text similarity (as in clustering) or general (predefined) topics (as in topic modeling). We present an approach to story identification that is capable of functioning in real time, assigning articles to stories as they are published online. In the proposed approach, we combine text representation techniques, clustering algorithms, and online topic modeling methods. We combine various text representation methods to extract specific events and named entities necessary for story identification, showing that a mixture of online topic-modeling approaches such as BERTopic, DBStream, and TextClust can be adapted for story discovery. We evaluate our approach on a news dataset from Slovene media covering a period of 1 month. We show that our real-time approach produces sensible results as judged by human evaluators.
[5] TT-XAI: Trustworthy Clinical Text Explanations via Keyword Distillation and LLM Reasoning
Kristian Miok, Blaz Škrlj, Daniela Zaharie, Marko Robnik Šikonja
Main category: cs.CL
TL;DR: TT-XAI improves clinical AI by distilling EHRs into keywords, enhancing BERT performance and LLM explanations, validated by metrics and expert study.
Details
Motivation: Clinical AI struggles with trustworthiness and interpretability in lengthy EHRs. TT-XAI aims to address this by improving classification and explanations.
Method: Uses domain-aware keyword distillation to enhance BERT and LIME, and keyword-guided LLM prompts for chain-of-thought explanations.
Result: Keyword distillation boosts BERT performance and explanation fidelity. LLM-generated explanations are more concise and clinically relevant.
Conclusion: TT-XAI provides a scalable, trustworthy solution for clinical AI, validated by metrics and human experts.
Abstract: Clinical language models often struggle to provide trustworthy predictions and explanations when applied to lengthy, unstructured electronic health records (EHRs). This work introduces TT-XAI, a lightweight and effective framework that improves both classification performance and interpretability through domain-aware keyword distillation and reasoning with large language models (LLMs). First, we demonstrate that distilling raw discharge notes into concise keyword representations significantly enhances BERT classifier performance and improves local explanation fidelity via a focused variant of LIME. Second, we generate chain-of-thought clinical explanations using keyword-guided prompts to steer LLMs, producing more concise and clinically relevant reasoning. We evaluate explanation quality using deletion-based fidelity metrics, self-assessment via LLaMA-3 scoring, and a blinded human study with domain experts. All evaluation modalities consistently favor the keyword-augmented method, confirming that distillation enhances both machine and human interpretability. TT-XAI offers a scalable pathway toward trustworthy, auditable AI in clinical decision support.
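To make the deletion-based fidelity evaluation concrete, here is a sketch that removes top-ranked keywords one at a time and records the classifier's probability drop; `predict_proba` and the toy keyword-counting classifier are hypothetical stand-ins for the paper's BERT model.

```python
from typing import Callable, Sequence

def deletion_fidelity(text: str, keywords: Sequence[str],
                      predict_proba: Callable[[str], float]) -> list[float]:
    """Deletion-based fidelity: drop the top-ranked keywords one by one and
    record how far the classifier's probability for the original predicted
    class falls. Larger drops indicate a more faithful explanation."""
    base = predict_proba(text)
    drops = []
    for k in range(1, len(keywords) + 1):
        ablated = text
        for kw in keywords[:k]:
            ablated = ablated.replace(kw, "")
        drops.append(base - predict_proba(ablated))
    return drops

# Toy usage with a hypothetical keyword-counting "classifier".
toy = lambda t: min(1.0, 0.2 + 0.2 * t.count("sepsis") + 0.2 * t.count("fever"))
print(deletion_fidelity("pt with sepsis and fever, sepsis confirmed",
                        ["sepsis", "fever"], toy))
```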
[6] Distilling Knowledge from Large Language Models: A Concept Bottleneck Model for Hate and Counter Speech Recognition
Roberto Labadie-Tamayo, Djordje Slijepčević, Xihui Chen, Adrian Jaques Böck, Andreas Babic, Liz Freimann, Christiane Atzmüller, Matthias Zeppelzauer
Main category: cs.CL
TL;DR: A transparent method called SCBM uses adjectives as interpretable concepts for hate speech detection, outperforming benchmarks and offering high interpretability.
Details
Motivation: The rise of hate speech on social media necessitates automated, transparent detection methods.
Method: SCBM maps texts to adjective-based concepts using LLMs, then uses a lightweight classifier.
Result: SCBM achieves a 0.69 macro-F1 score, outperforming benchmarks on 4/5 datasets, with a 1.8% boost when fused with transformer embeddings.
Conclusion: Adjective-based representations are effective, interpretable, and adaptable for hate speech detection and other NLP tasks.
Abstract: The rapid increase in hate speech on social media has had an unprecedented impact on society, making automated methods for detecting such content important. Unlike prior black-box models, we propose a novel transparent method for automated hate and counter speech recognition, i.e., "Speech Concept Bottleneck Model" (SCBM), using adjectives as human-interpretable bottleneck concepts. SCBM leverages large language models (LLMs) to map input texts to an abstract adjective-based representation, which is then sent to a lightweight classifier for downstream tasks. Across five benchmark datasets spanning multiple languages and platforms (e.g., Twitter, Reddit, YouTube), SCBM achieves an average macro-F1 score of 0.69, which outperforms the most recently reported results from the literature on four out of five datasets. Aside from high recognition accuracy, SCBM provides a high level of both local and global interpretability. Furthermore, fusing our adjective-based concept representation with transformer embeddings leads to a 1.8% performance increase on average across all datasets, showing that the proposed representation captures complementary information. Our results demonstrate that adjective-based concept representations can serve as compact, interpretable, and effective encodings for hate and counter speech recognition. With adapted adjectives, our method can also be applied to other NLP tasks.
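The bottleneck design reduces to a two-stage pipeline: score each text against a small adjective vocabulary, then fit a transparent classifier on those scores. In the sketch below the adjectives and scores are invented (in SCBM they would come from prompting an LLM per text-adjective pair), and logistic regression stands in for the unspecified lightweight classifier.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical concept vocabulary and LLM-assigned relevance scores in [0, 1].
adjectives = ["hateful", "sarcastic", "supportive", "dismissive", "aggressive"]
X = np.array([  # one row per text, one column per adjective
    [0.9, 0.1, 0.0, 0.6, 0.8],   # hate speech
    [0.1, 0.2, 0.9, 0.0, 0.1],   # counter speech
    [0.8, 0.4, 0.1, 0.7, 0.9],
    [0.0, 0.1, 0.8, 0.2, 0.0],
])
y = np.array([1, 0, 1, 0])       # 1 = hate, 0 = counter speech

clf = LogisticRegression().fit(X, y)
# Global interpretability: each coefficient ties predictions to an adjective.
for adj, w in zip(adjectives, clf.coef_[0]):
    print(f"{adj:>11}: {w:+.2f}")
```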
[7] MinionsLLM: a Task-adaptive Framework For The Training and Control of Multi-Agent Systems Through Natural Language
Andres Garcia Rincon, Eliseo Ferrante
Main category: cs.CL
TL;DR: MinionsLLM integrates LLMs with Behavior Trees and Formal Grammars for natural language control of multi-agent systems, validated with Gemma 3 models, showing significant performance gains.
Details
Motivation: To enable natural language control of multi-agent systems in user-defined environments by combining LLMs with structured frameworks like Behavior Trees and Formal Grammars.
Method: Uses standardized interfaces for environments and agents, and introduces two synthetic dataset generation methods (A and B) to fine-tune LLMs for better syntactic validity and task relevance.
Result: Method B achieves 92.6% syntactic validity and a 33% mean task performance improvement, with smaller models benefiting most from fine-tuning.
Conclusion: MinionsLLM is effective for multi-agent control, especially in resource-constrained scenarios, and is released open-source for reproducibility.
Abstract: This paper presents MinionsLLM, a novel framework that integrates Large Language Models (LLMs) with Behavior Trees (BTs) and Formal Grammars to enable natural language control of multi-agent systems within arbitrary, user-defined environments. MinionsLLM provides standardized interfaces for defining environments, agents, and behavioral primitives, and introduces two synthetic dataset generation methods (Method A and Method B) to fine-tune LLMs for improved syntactic validity and semantic task relevance. We validate our approach using Google’s Gemma 3 model family at three parameter scales (1B, 4B, and 12B) and demonstrate substantial gains: Method B increases syntactic validity to 92.6% and achieves a mean task performance improvement of 33% over baseline. Notably, our experiments show that smaller models benefit most from fine-tuning, suggesting promising directions for deploying compact, locally hosted LLMs in resource-constrained multi-agent control scenarios. The framework and all resources are released open-source to support reproducibility and future research.
[8] MLLM-CTBench: A Comprehensive Benchmark for Continual Instruction Tuning of Multimodal LLMs with Chain-of-Thought Reasoning Analysis
Haiyun Guo, ZhiYan Hou, Yu Chen, Jinghan He, Yandu Sun, Yuzhe Zhou, Shujing Guo, Kuan Zhu, Jinqiao Wang
Main category: cs.CL
TL;DR: MLLM-CTBench introduces a benchmark for continual instruction tuning of Multimodal Large Language Models (MLLMs), evaluating accuracy, reasoning, algorithms, and training paradigms across diverse domains.
Details
Motivation: The lack of rigorous benchmarks for continual instruction tuning in MLLMs hinders progress; MLLM-CTBench aims to fill this gap.
Method: The benchmark includes multidimensional evaluation (accuracy and reasoning), comparison of algorithms and training paradigms, and curated tasks from six domains.
Result: Key findings include model robustness, hierarchical forgetting, algorithm dependency, and KL-divergence benefits in reinforcement learning.
Conclusion: MLLM-CTBench sets a standard for continual tuning and provides practical guidance for MLLM algorithm design.
Abstract: Multimodal Large Language Models (MLLMs) rely on continual instruction tuning to adapt to the evolving demands of real-world applications. However, progress in this area is hindered by the lack of rigorous and systematic benchmarks. To address this gap, we present MLLM-CTBench, a comprehensive evaluation benchmark with three key contributions: (1) Multidimensional Evaluation: We combine final answer accuracy with fine-grained CoT reasoning quality assessment, enabled by a specially trained CoT evaluator; (2) Comprehensive Evaluation of Algorithms and Training Paradigms: We benchmark eight continual learning algorithms across four major categories and systematically compare reinforcement learning with supervised fine-tuning paradigms; (3) Carefully Curated Tasks: We select and organize 16 datasets from existing work, covering six challenging domains. Our key findings include: (i) Models with stronger general capabilities exhibit greater robustness to forgetting during continual learning; (ii) Reasoning chains degrade more slowly than final answers, supporting the hierarchical forgetting hypothesis; (iii) The effectiveness of continual learning algorithms is highly dependent on both model capability and task order; (iv) In reinforcement learning settings, incorporating KL-divergence constraints helps maintain policy stability and plays a crucial role in mitigating forgetting. MLLM-CTBench establishes a rigorous standard for continual instruction tuning of MLLMs and offers practical guidance for algorithm design and evaluation.
[9] Evaluating Contrast Localizer for Identifying Causal Units in Social & Mathematical Tasks in Language Models
Yassine Jamaa, Badr AlKhamissi, Satrajit Ghosh, Martin Schrimpf
Main category: cs.CL
TL;DR: The study adapts a neuroscientific contrast localizer to identify causally relevant units in LLMs and VLMs for Theory of Mind (ToM) and mathematical reasoning tasks. Surprisingly, low-activation units sometimes had a larger impact on performance than high-activation ones, challenging the effectiveness of contrast-based localizers.
Details
Motivation: To pinpoint causally relevant units in large language and vision-language models for ToM and mathematical reasoning tasks, and assess their causal role.
Method: Contrastive stimulus sets were used to localize top-activated units in 11 LLMs and 5 VLMs, followed by targeted ablations to evaluate their causal impact on downstream accuracy.
Result: Low-activation units sometimes caused larger performance drops than high-activation ones, and mathematical localizer units often impaired ToM performance more than ToM localizer units.
Conclusion: The findings question the causal relevance of contrast-based localizers and suggest the need for broader stimulus sets to better identify task-specific units.
Abstract: This work adapts a neuroscientific contrast localizer to pinpoint causally relevant units for Theory of Mind (ToM) and mathematical reasoning tasks in large language models (LLMs) and vision-language models (VLMs). Across 11 LLMs and 5 VLMs ranging in size from 3B to 90B parameters, we localize top-activated units using contrastive stimulus sets and assess their causal role via targeted ablations. We compare the effect of lesioning functionally selected units against low-activation and randomly selected units on downstream accuracy across established ToM and mathematical benchmarks. Contrary to expectations, low-activation units sometimes produced larger performance drops than the highly activated ones, and units derived from the mathematical localizer often impaired ToM performance more than those from the ToM localizer. These findings call into question the causal relevance of contrast-based localizers and highlight the need for broader stimulus sets that more accurately capture task-specific units.
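The targeted-ablation step can be sketched with a PyTorch forward hook that zeroes selected units; the toy model and unit indices are placeholders, and the localizer that picks which units to ablate is not shown.

```python
import torch
import torch.nn as nn

def ablate_units(layer: nn.Module, unit_idx: list[int]):
    """Zero out selected units of a layer's output via a forward hook,
    mirroring the targeted-ablation step (the localizer is not shown)."""
    def hook(module, inputs, output):
        output = output.clone()
        output[..., unit_idx] = 0.0
        return output  # returning a tensor replaces the layer's output
    return layer.register_forward_hook(hook)

# Toy model: compare outputs with and without ablating two ReLU units.
model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 4))
x = torch.randn(1, 8)
baseline = model(x)
handle = ablate_units(model[1], unit_idx=[3, 7])
ablated = model(x)
handle.remove()  # restore normal behavior
print(torch.allclose(baseline, ablated))  # False whenever the units mattered
```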
[10] Objective Metrics for Evaluating Large Language Models Using External Data Sources
Haoze Du, Richard Li, Edward Gehringer
Main category: cs.CL
TL;DR: A framework for evaluating LLMs using objective metrics derived from textual materials, ensuring consistency and minimizing bias.
Details
Motivation: To address the challenges of subjective assessments in evaluating LLM performance.
Method: Uses benchmarks, factual datasets, and structured pipelines for automated, transparent scoring.
Result: Provides reproducible, bias-minimized measurements aligned with real-world applications.
Conclusion: Offers a scalable solution for LLM evaluation in high-stakes domains.
Abstract: Evaluating the performance of Large Language Models (LLMs) is a critical yet challenging task, particularly when aiming to avoid subjective assessments. This paper proposes a framework for leveraging objective metrics derived from class textual materials across different semesters to assess LLM outputs across various tasks. By utilizing well-defined benchmarks, factual datasets, and structured evaluation pipelines, the approach ensures consistent, reproducible, and bias-minimized measurements. The framework emphasizes automation and transparency in scoring, reducing reliance on human interpretation while ensuring alignment with real-world applications. This method addresses the limitations of subjective evaluation methods, providing a scalable solution for performance assessment in educational, scientific, and other high-stakes domains.
[11] The Illusion of Progress: Re-evaluating Hallucination Detection in LLMs
Denis Janiak, Jakub Binkowski, Albert Sawczyn, Bogdan Gabrys, Ravid Schwartz-Ziv, Tomasz Kajdanowicz
Main category: cs.CL
TL;DR: The paper critiques ROUGE for misaligning with human judgments in hallucination detection, showing human-aligned metrics like LLM-as-Judge reveal performance drops. Simple heuristics can match complex methods, highlighting flawed evaluation practices.
Details
Motivation: Address the unreliability of ROUGE in evaluating hallucination detection methods, which misrepresents true performance and misleads deployment of LLMs.
Method: Conducted human studies and compared ROUGE with human-aligned metrics (e.g., LLM-as-Judge) to evaluate hallucination detection methods, including testing simple heuristics.
Result: ROUGE has high recall but low precision, leading to misleading estimates. Human-aligned metrics show up to 45.9% performance drops for established methods. Simple heuristics rival complex techniques.
Conclusion: Semantically aware and robust evaluation frameworks are crucial for accurate hallucination detection assessment, ensuring LLM trustworthiness.
Abstract: Large language models (LLMs) have revolutionized natural language processing, yet their tendency to hallucinate poses serious challenges for reliable deployment. Despite numerous hallucination detection methods, their evaluations often rely on ROUGE, a metric based on lexical overlap that misaligns with human judgments. Through comprehensive human studies, we demonstrate that while ROUGE exhibits high recall, its extremely low precision leads to misleading performance estimates. In fact, several established detection methods show performance drops of up to 45.9% when assessed using human-aligned metrics like LLM-as-Judge. Moreover, our analysis reveals that simple heuristics based on response length can rival complex detection techniques, exposing a fundamental flaw in current evaluation practices. We argue that adopting semantically aware and robust evaluation frameworks is essential to accurately gauge the true performance of hallucination detection methods, ultimately ensuring the trustworthiness of LLM outputs.
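The response-length heuristic the authors use as a foil takes only a few lines; the responses, labels, and threshold below are fabricated toy values, intended only to show how such a baseline would be scored.

```python
from sklearn.metrics import precision_score, recall_score

# Fabricated toy data: model responses with human hallucination labels.
responses = [
    "Paris is the capital of France.",
    "A long, confidently worded answer citing a 1921 study that does not exist.",
    "2 + 2 = 4.",
    "An elaborate biography full of invented dates, places, and quotations.",
]
is_hallucination = [0, 1, 0, 1]

# Length heuristic: flag unusually long responses as hallucinations.
threshold = 40  # characters; would be tuned on a validation split
preds = [1 if len(r) > threshold else 0 for r in responses]

print("precision:", precision_score(is_hallucination, preds))
print("recall:   ", recall_score(is_hallucination, preds))
```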
[12] Sacred or Synthetic? Evaluating LLM Reliability and Abstention for Religious Questions
Farah Atif, Nursultan Askarbekuly, Kareem Darwish, Monojit Choudhury
Main category: cs.CL
TL;DR: The paper introduces FiqhQA, a benchmark for evaluating LLMs’ accuracy and abstention behavior in generating Islamic rulings across Sunni schools of thought in Arabic and English. GPT-4o leads in accuracy, while Gemini and Fanar excel in abstention. Performance drops in Arabic, highlighting language limitations.
Details
Motivation: To assess LLMs' reliability in religious domains, particularly Islamic jurisprudence, where accuracy and abstention behavior are critical but understudied.
Method: Created FiqhQA, a benchmark for zero-shot and abstention experiments across LLMs, languages, and legal schools of thought.
Result: GPT-4o is most accurate, while Gemini and Fanar handle abstention better. Performance declines in Arabic, revealing language-specific challenges.
Conclusion: Task-specific evaluation and cautious LLM deployment in religious contexts are necessary due to varying performance and language limitations.
Abstract: Despite the increasing usage of Large Language Models (LLMs) in answering questions in a variety of domains, their reliability and accuracy remain unexamined for many domains, including religious ones. In this paper, we introduce FiqhQA, a novel benchmark focused on LLM-generated Islamic rulings explicitly categorized by the four major Sunni schools of thought, in both Arabic and English. Unlike prior work, which either overlooks the distinctions between religious schools of thought or fails to evaluate abstention behavior, we assess LLMs not only on their accuracy but also on their ability to recognize when not to answer. Our zero-shot and abstention experiments reveal significant variation across LLMs, languages, and legal schools of thought. While GPT-4o outperforms all other models in accuracy, Gemini and Fanar demonstrate superior abstention behavior, critical for minimizing confident incorrect answers. Notably, all models exhibit a performance drop in Arabic, highlighting the limitations of religious reasoning in languages other than English. To the best of our knowledge, this is the first study to benchmark the efficacy of LLMs for fine-grained, school-of-thought-specific Islamic ruling generation and to evaluate abstention for Islamic jurisprudence queries. Our findings underscore the need for task-specific evaluation and cautious deployment of LLMs in religious applications.
[13] Putnam-AXIOM: A Functional and Static Benchmark
Aryan Gulati, Brando Miranda, Eric Chen, Emily Xia, Kai Fronsdal, Bruno Dumont, Elyas Obbad, Sanmi Koyejo
Main category: cs.CL
TL;DR: The paper introduces Putnam-AXIOM, a contamination-resilient benchmark for evaluating LLMs’ mathematical reasoning, using competition problems and variations to test memorization and dynamic performance.
Details
Motivation: Existing benchmarks for LLMs' mathematical reasoning are saturated and compromised by training-set contamination, necessitating a more robust evaluation framework.
Method: The authors create Putnam-AXIOM, a benchmark with 522 competition problems and 100 functional variants, using programmatic perturbations to generate unseen instances. They also introduce Teacher-Forced Accuracy (TFA) to evaluate reasoning traces.
Result: On the original set, the top model scores 41.9%, but accuracy drops by 19.6% on variations, indicating memorization. Eighteen models show similar declines.
Conclusion: Putnam-AXIOM offers a rigorous, contamination-resilient framework for assessing LLMs’ mathematical reasoning, highlighting the need for dynamic benchmarks.
Abstract: Current mathematical reasoning benchmarks for large language models (LLMs) are approaching saturation, with some achieving > 90% accuracy, and are increasingly compromised by training-set contamination. We introduce Putnam-AXIOM, a benchmark of 522 university-level competition problems drawn from the prestigious William Lowell Putnam Mathematical Competition, and Putnam-AXIOM Variation, an unseen companion set of 100 functional variants generated by programmatically perturbing variables and constants. The variation protocol produces an unlimited stream of equally difficult, unseen instances – yielding a contamination-resilient test bed. On the Original set, OpenAI’s o1-preview – the strongest evaluated model – scores 41.9%, but its accuracy drops by 19.6% (46.8% relative decrease) on the paired Variations. The remaining eighteen models show the same downward trend, ten of them with non-overlapping 95% confidence intervals. These gaps suggest memorization and highlight the necessity of dynamic benchmarks. We complement “boxed” accuracy with Teacher-Forced Accuracy (TFA), a lightweight metric that directly scores reasoning traces and automates natural language proof evaluations. Putnam-AXIOM therefore provides a rigorous, contamination-resilient evaluation framework for assessing advanced mathematical reasoning of LLMs. Data and evaluation code are publicly available at https://github.com/brando90/putnam-axiom.
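The variation protocol amounts to re-sampling constants in a problem template and recomputing the ground-truth answer programmatically. The toy template below is far simpler than a Putnam problem but illustrates the mechanism; the function and its parameters are invented for this sketch.

```python
import random

def putnam_style_variant(seed: int) -> tuple[str, int]:
    """Re-sample the constants of a fixed template and recompute the answer,
    yielding an unseen but equally difficult instance (toy analogue)."""
    rng = random.Random(seed)
    a, b = rng.randint(2, 9), rng.randint(10, 99)
    question = f"Let f(x) = {a}x + {b}. What is f({a})?"
    return question, a * a + b

for seed in range(3):
    q, answer = putnam_style_variant(seed)
    print(q, "->", answer)
```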
[14] CoDAE: Adapting Large Language Models for Education via Chain-of-Thought Data Augmentation
Shuzhou Yuan, William LaCroix, Hardik Ghoshal, Ercong Nie, Michael Färber
Main category: cs.CL
TL;DR: CoDAE framework enhances LLMs for education by using Chain-of-Thought data augmentation to improve pedagogical alignment, reasoning support, and resistance to premature answers.
Details
Motivation: Off-the-shelf LLMs underperform in educational settings due to issues like over-compliance, low adaptivity, and vulnerability to manipulative prompts.
Method: Collects real-world student-tutor dialogues, enriches them with CoT prompting, and fine-tunes LLMs on augmented datasets to address key limitations.
Result: Fine-tuned models with CoDAE provide better pedagogical guidance, support reasoning, and resist premature answer disclosure.
Conclusion: CoDAE effectively adapts LLMs for educational use, addressing critical limitations and improving tutor performance.
Abstract: Large Language Models (LLMs) are increasingly employed as AI tutors due to their scalability and potential for personalized instruction. However, off-the-shelf LLMs often underperform in educational settings: they frequently reveal answers too readily, fail to adapt their responses to student uncertainty, and remain vulnerable to emotionally manipulative prompts. To address these challenges, we introduce CoDAE, a framework that adapts LLMs for educational use through Chain-of-Thought (CoT) data augmentation. We collect real-world dialogues between students and a ChatGPT-based tutor and enrich them using CoT prompting to promote step-by-step reasoning and pedagogically aligned guidance. Furthermore, we design targeted dialogue cases to explicitly mitigate three key limitations: over-compliance, low response adaptivity, and threat vulnerability. We fine-tune four open-source LLMs on different variants of the augmented datasets and evaluate them in simulated educational scenarios using both automatic metrics and LLM-as-a-judge assessments. Our results show that models fine-tuned with CoDAE deliver more pedagogically appropriate guidance, better support reasoning processes, and effectively resist premature answer disclosure.
[15] Mol-R1: Towards Explicit Long-CoT Reasoning in Molecule Discovery
Jiatong Li, Weida Wang, Qinggang Zhang, Junxian Li, Di Zhang, Changmeng Zheng, Shufei Zhang, Xiaoyong Wei, Qing Li
Main category: cs.CL
TL;DR: Mol-R1 improves reasoning in molecule generation for LLMs using PRID and MoIA, outperforming baselines.
Details
Motivation: Addressing the limitations of Long-CoT reasoning models in knowledge-intensive domains like molecule discovery.
Method: Uses PRID for dataset curation and MoIA (iterative SFT + RPO) for training.
Result: Superior performance in text-based molecule reasoning generation.
Conclusion: Mol-R1 enhances explainability and reasoning in molecule generation for LLMs.
Abstract: Large language models (LLMs), especially Explicit Long Chain-of-Thought (CoT) reasoning models like DeepSeek-R1 and QWQ, have demonstrated powerful reasoning capabilities, achieving impressive performance in commonsense reasoning and mathematical inference. Despite their effectiveness, Long-CoT reasoning models are often criticized for their limited ability and low efficiency in knowledge-intensive domains such as molecule discovery. Success in this field requires a precise understanding of domain knowledge, including molecular structures and chemical principles, which is challenging due to the inherent complexity of molecular data and the scarcity of high-quality expert annotations. To bridge this gap, we introduce Mol-R1, a novel framework designed to improve explainability and reasoning performance of R1-like Explicit Long-CoT reasoning LLMs in text-based molecule generation. Our approach begins with a high-quality reasoning dataset curated through Prior Regulation via In-context Distillation (PRID), a dedicated distillation strategy to effectively generate paired reasoning traces guided by prior regulations. Building upon this, we introduce MoIA, Molecular Iterative Adaptation, a sophisticated training strategy that iteratively combines Supervised Fine-tuning (SFT) with Reinforced Policy Optimization (RPO), tailored to boost the reasoning performance of R1-like reasoning models for molecule discovery. Finally, we examine the performance of Mol-R1 in the text-based molecule reasoning generation task, showing superior performance against existing baselines.
[16] Rethinking Tokenization for Rich Morphology: The Dominance of Unigram over BPE and Morphological Alignment
Saketh Reddy Vemula, Dipti Mishra Sharma, Parameswari Krishnamurthy
Main category: cs.CL
TL;DR: The study investigates if morphologically aligned tokenization improves language model performance, focusing on Telugu, Hindi, and English. Findings show moderate benefits for syntax tasks but highlight tokenizer algorithms as more impactful.
Details
Motivation: Conflicting prior findings on morphological tokenization's impact, especially for complex morphology languages, prompted this investigation.
Method: Evaluated tokenizer training, finetuning, and downstream tasks, focusing on morphological alignment and tokenization quality. Created a Telugu dataset for morpheme segmentation.
Result: Morphological alignment moderately aids syntax tasks, but tokenizer algorithms (BPE vs. Unigram) are more influential. Unigram tokenizers generally outperform, with hybrid BPE-morphology variants showing promise.
Conclusion: Tokenizer choice matters more than morphological alignment alone, with Unigram models excelling and hybrid approaches offering improvements in BPE frameworks.
Abstract: Prior work on language modeling showed conflicting findings about whether morphologically aligned approaches to tokenization improve performance, particularly for languages with complex morphology. To investigate this, we select a typologically diverse set of languages: Telugu (agglutinative), Hindi (primarily fusional with some agglutination), and English (fusional). We conduct a comprehensive evaluation of language models – starting from tokenizer training and extending through the finetuning and downstream task evaluation. To account for the consistent performance differences observed across tokenizer variants, we focus on two key factors: morphological alignment and tokenization quality. To assess morphological alignment of tokenizers in Telugu, we create a dataset containing gold morpheme segmentations of 600 derivational and 7000 inflectional word forms. Our experiments reveal that better morphological alignment correlates positively – though moderately – with performance in syntax-based tasks such as Parts-of-Speech tagging, Named Entity Recognition and Dependency Parsing. However, we also find that the tokenizer algorithm (Byte-pair Encoding vs. Unigram) plays a more significant role in influencing downstream performance than morphological alignment alone. Naive Unigram tokenizers outperform others across most settings, though hybrid tokenizers that incorporate morphological segmentation significantly improve performance within the BPE framework. In contrast, intrinsic metrics like Corpus Token Count (CTC) and Rényi entropy showed no correlation with downstream performance.
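For a small-scale feel of the BPE-vs-Unigram comparison, the sketch below trains both tokenizer types with the Hugging Face `tokenizers` library; the toy corpus, vocabulary size, and whitespace pre-tokenizer are arbitrary choices, not the paper's settings.

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

corpus = ["unhappiness happiness unhelpful helpfulness",
          "hopeful hopefulness unhopeful kindness unkindness",
          "happily unhappily helpfully unkindly"]

def train(model, trainer):
    tok = Tokenizer(model)
    tok.pre_tokenizer = pre_tokenizers.Whitespace()
    tok.train_from_iterator(corpus, trainer)
    return tok

bpe = train(models.BPE(unk_token="[UNK]"),
            trainers.BpeTrainer(vocab_size=40, special_tokens=["[UNK]"]))
uni = train(models.Unigram(),
            trainers.UnigramTrainer(vocab_size=40, unk_token="[UNK]",
                                    special_tokens=["[UNK]"]))

# The two algorithms segment the same unseen word differently; morphological
# alignment would compare these boundaries against gold morpheme splits.
for name, tok in [("BPE", bpe), ("Unigram", uni)]:
    print(name, tok.encode("unhelpfulness").tokens)
```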
[17] Marco-Voice Technical Report
Fengping Tian, Chenyang Lyu, Xuanfan Ni, Haoqin Sun, Qingjuan Li, Zhiqiang Qian, Haijun Li, Longyue Wang, Zhao Xu, Weihua Luo, Kaifu Zhang
Main category: cs.CL
TL;DR: A multifunctional speech synthesis system, Marco-Voice, integrates voice cloning and emotion control, achieving expressive and natural speech while preserving speaker identity.
Details
Motivation: Address challenges in expressive, controllable, and natural speech generation that maintains speaker identity across diverse contexts.
Method: Uses speaker-emotion disentanglement with in-batch contrastive learning and rotational emotional embedding for smooth control.
Result: Substantial improvements in objective and subjective metrics, with competitive clarity and emotional richness.
Conclusion: Marco-Voice advances expressive neural speech synthesis; code and dataset (CSEMOTIONS) are publicly available.
Abstract: This paper presents a multifunctional speech synthesis system that integrates voice cloning and emotion-controlled speech synthesis within a unified framework. The goal of this work is to address longstanding challenges in achieving highly expressive, controllable, and natural speech generation that faithfully preserves speaker identity across diverse linguistic and emotional contexts. Our approach introduces an effective speaker-emotion disentanglement mechanism with in-batch contrastive learning, enabling independent manipulation of speaker identity and emotional style, as well as a rotational emotional embedding integration method for smooth emotion control. To support comprehensive training and evaluation, we construct CSEMOTIONS, a high-quality emotional speech dataset containing 10 hours of Mandarin speech from six professional speakers across seven emotional categories. Extensive experiments demonstrate that our system, Marco-Voice, achieves substantial improvements in both objective and subjective metrics. Comprehensive evaluations and analyses were conducted; results show that Marco-Voice delivers competitive performance in terms of speech clarity and emotional richness, representing a substantial advance in the field of expressive neural speech synthesis. Our code and dataset are publicly available at https://github.com/AIDC-AI/Marco-Voice and https://huggingface.co/datasets/AIDC-AI/CSEMOTIONS respectively.
[18] Enhancing Small LLM Alignment through Margin-Based Objective Modifications under Resource Constraints
Daren Yao, Jinsong Yuan, Ruike Chen
Main category: cs.CL
TL;DR: The paper introduces two lightweight DPO-based variants to improve small LLM alignment with human preferences, showing notable performance gains in benchmarks like AlpacaEval and MT-Bench.
Details
Motivation: Small LLMs struggle with aligning outputs to human preferences, especially under performance gaps, prompting the need for efficient solutions.
Method: Proposes Adaptive Margin-Sigmoid Loss and APO-hinge-zero, incorporating margin-based objectives and selective update mechanisms.
Result: APO-hinge-zero boosts win rates in AlpacaEval (+2.0 points) and MT-Bench, excelling in STEM and Humanities tasks.
Conclusion: Simple modifications to preference-based objectives can significantly enhance small LLM alignment, offering practical efficiency gains.
Abstract: Small large language models (LLMs) often face difficulties in aligning output to human preferences, particularly when operating under severe performance gaps. In this work, we propose two lightweight DPO-based variants – Adaptive Margin-Sigmoid Loss and APO-hinge-zero – to better address underperformance scenarios by introducing margin-based objectives and selective update mechanisms. Our APO-hinge-zero method, which combines hinge-induced hard-example mining with the chosen-focused optimization of APO-zero, achieves strong results. In AlpacaEval, APO-hinge-zero improves the win rate by +2.0 points and the length-controlled win rate by +1.4 points compared to the APO-zero baseline. In MT-Bench, our methods maintain competitive performance in diverse categories, particularly excelling in STEM and Humanities tasks. These results demonstrate that simple modifications to preference-based objectives can significantly enhance small LLM alignment under resource constraints, offering a practical path toward more efficient deployment.
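The abstract does not give the exact form of the Adaptive Margin-Sigmoid Loss, but the general recipe, a DPO-style log-sigmoid objective with an additive margin, can be sketched as follows; the `beta` and `margin` values and the toy batch are arbitrary.

```python
import torch
import torch.nn.functional as F

def margin_sigmoid_dpo_loss(policy_chosen, policy_rejected,
                            ref_chosen, ref_rejected,
                            beta: float = 0.1, margin: float = 0.3):
    """DPO log-sigmoid objective with an additive margin: the chosen response
    must beat the rejected one by at least `margin` in implicit reward before
    the loss flattens, concentrating gradient on hard pairs."""
    logits = beta * ((policy_chosen - ref_chosen)
                     - (policy_rejected - ref_rejected))
    return -F.logsigmoid(logits - margin).mean()

# Toy batch of summed log-probs for chosen/rejected responses.
pc, pr = torch.tensor([-12.0, -9.5]), torch.tensor([-11.0, -14.0])
rc, rr = torch.tensor([-12.5, -10.0]), torch.tensor([-11.5, -13.0])
print(margin_sigmoid_dpo_loss(pc, pr, rc, rr))
```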
[19] Momentum Point-Perplexity Mechanics in Large Language Models
Lorenzo Tomaz, Judd Rosenblatt, Thomas Berry Jones, Diogo Schwerz de Lucena
Main category: cs.CL
TL;DR: The paper studies hidden state dynamics in large language models using a physics-inspired ’energy’ metric, revealing training shifts behavior and introducing ‘Jacobian steering’ for controlled outputs.
Details
Motivation: To understand and control the hidden state dynamics of transformers for better interpretability, predictability, and alignment with human intent.
Method: Analyzes 20 transformer models (135M-3B parameters) using a physics-inspired 'energy' metric and introduces 'Jacobian steering' for minimal hidden state perturbations.
Result: Training shifts models to a faster, more decisive regime; Jacobian steering improves semantic quality of outputs while maintaining energy conservation.
Conclusion: The physics-based approach offers a principled framework for interpretability, anomaly detection, and safe model steering.
Abstract: We take a physics-based approach to studying how the internal hidden states of large language models change from token to token during inference. Across 20 open-source transformer models (135M-3B parameters), we find that a quantity combining the rate of change in hidden states and the model’s next-token certainty, analogous to energy in physics, remains nearly constant. Random-weight models conserve this “energy” more tightly than pre-trained ones, while training shifts models into a faster, more decisive regime with greater variability. Using this “log-Lagrangian” view, we derive a control method called Jacobian steering, which perturbs hidden states in the minimal way needed to favor a target token. This approach maintained near-constant energy in two tested models and produced continuations rated higher in semantic quality than the models’ natural outputs. Viewing transformers through this mechanics lens offers a principled basis for interpretability, anomaly detection, and low-risk steering. This could help make powerful models more predictable and aligned with human intent.
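The paper's "energy" combines hidden-state rate of change with next-token certainty, but the abstract does not give the exact formula; the sketch below only extracts both ingredients per token from a Hugging Face causal LM (a tiny public checkpoint is used for speed), leaving the combination to the reader.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "sshleifer/tiny-gpt2"  # any causal LM works; tiny checkpoint for speed
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

ids = tok("The capital of France is", return_tensors="pt").input_ids
with torch.no_grad():
    out = model(ids, output_hidden_states=True)

h = out.hidden_states[-1][0]              # final-layer states, (seq, hidden)
velocity = (h[1:] - h[:-1]).norm(dim=-1)  # hidden-state rate of change
logprobs = out.logits[0].log_softmax(-1)
certainty = logprobs[:-1].max(-1).values  # log-prob of the top next token
for v, c in zip(velocity, certainty):
    print(f"||dh|| = {v:.2f}   max log p(next) = {c:.2f}")
```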
[20] Steerable Pluralism: Pluralistic Alignment via Few-Shot Comparative Regression
Jadie Adams, Brian Hu, Emily Veenhuis, David Joy, Bharadwaj Ravichandran, Aaron Bray, Anthony Hoogs, Arslan Basharat
Main category: cs.CL
TL;DR: The paper introduces a steerable pluralistic alignment method for LLMs, using few-shot comparative regression to adapt to diverse user preferences, outperforming baselines and advancing ethical AI.
Details
Motivation: Current alignment methods like RLHF use scalar rewards, limiting representation of diverse user preferences. Pluralistic alignment aims to capture broader preferences beyond just helpfulness and harmlessness.
Method: Proposes a steerable pluralistic model using few-shot comparative regression, leveraging in-context learning and fine-grained attributes to compare responses. Evaluated using adapted MIC and HelpSteer2 datasets.
Result: The approach outperforms baselines, is interpretable, and compatible with various attributes and LLMs, demonstrating effectiveness in value-aligned decision-making and reward modeling.
Conclusion: The work advances pluralistic alignment, enabling fairer and more representative LLM use, and sets new directions for ethical AI research.
Abstract: Large language models (LLMs) are currently aligned using techniques such as reinforcement learning from human feedback (RLHF). However, these methods use scalar rewards that can only reflect user preferences on average. Pluralistic alignment instead seeks to capture diverse user preferences across a set of attributes, moving beyond just helpfulness and harmlessness. Toward this end, we propose a steerable pluralistic model based on few-shot comparative regression that can adapt to individual user preferences. Our approach leverages in-context learning and reasoning, grounded in a set of fine-grained attributes, to compare response options and make aligned choices. To evaluate our algorithm, we also propose two new steerable pluralistic benchmarks by adapting the Moral Integrity Corpus (MIC) and the HelpSteer2 datasets, demonstrating the applicability of our approach to value-aligned decision-making and reward modeling, respectively. Our few-shot comparative regression approach is interpretable and compatible with different attributes and LLMs, while outperforming multiple baseline and state-of-the-art methods. Our work provides new insights and research directions in pluralistic alignment, enabling a more fair and representative use of LLMs and advancing the state-of-the-art in ethical AI.
[21] DeCAL Tokenwise Compression
Sameer Panwar
Main category: cs.CL
TL;DR: DeCAL is a tokenwise compression method using a pretrained encoder-decoder model, achieving high-quality compression with minor performance dropoff up to 8x compression.
Details
Motivation: To create a general-purpose compression method that maintains high-quality representations even at higher compression rates.
Method: Uses a denoising-pretrained encoder-decoder model with encoder modifications to maximize compression quality.
Result: Matches uncompressed performance at 2x compression, with minor dropoff up to 8x in tasks like QA, summarization, and retrieval.
Conclusion: DeCAL provides significant savings and potential for broader applicability with further development.
Abstract: This paper introduces DeCAL, a new method for tokenwise compression. DeCAL uses an encoder-decoder language model pretrained with denoising to learn to produce high-quality, general-purpose compressed representations by the encoder. DeCAL applies small modifications to the encoder, with the emphasis on maximizing compression quality, even at the expense of compute. We show that DeCAL at 2x compression can match uncompressed performance on many downstream tasks, usually with only a minor dropoff in metrics up to 8x compression, across question-answering, summarization, and multi-vector retrieval tasks. DeCAL offers significant savings where pre-computed dense representations can be utilized, and we believe the approach can be further developed to be more broadly applicable.
[22] DepressLLM: Interpretable domain-adapted language model for depression detection from real-world narratives
Sehwan Moon, Aram Lee, Jeong Eun Kim, Hee-Ju Kang, Il-Seon Shin, Sung-Wan Kim, Jae-Min Kim, Min Jhon, Ju-Wan Kim
Main category: cs.CL
TL;DR: DepressLLM, a large language model, improves depression prediction using a novel dataset and interpretable methods, achieving high AUC scores and robustness across diverse data.
Details
Motivation: The lack of large-scale, high-quality datasets for depression prediction motivates the development of DepressLLM to enable earlier and more accurate diagnosis.
Method: DepressLLM uses a Score-guided Token Probability Summation (SToPS) module for interpretable predictions and is trained on 3,699 autobiographical narratives.
Result: DepressLLM achieves an AUC of 0.789, rising to 0.904 for high-confidence samples, and shows robustness on diverse datasets.
Conclusion: Interpretable AI like DepressLLM holds promise for psychiatry, though limitations highlight areas for future improvement.
Abstract: Advances in large language models (LLMs) have enabled a wide range of applications. However, depression prediction is hindered by the lack of large-scale, high-quality, and rigorously annotated datasets. This study introduces DepressLLM, trained and evaluated on a novel corpus of 3,699 autobiographical narratives reflecting both happiness and distress. DepressLLM provides interpretable depression predictions and, via its Score-guided Token Probability Summation (SToPS) module, delivers both improved classification performance and reliable confidence estimates, achieving an AUC of 0.789, which rises to 0.904 on samples with confidence ≥ 0.95. To validate its robustness to heterogeneous data, we evaluated DepressLLM on in-house datasets, including an Ecological Momentary Assessment (EMA) corpus of daily stress and mood recordings, and on public clinical interview data. Finally, a psychiatric review of high-confidence misclassifications highlighted key model and data limitations that suggest directions for future refinements. These findings demonstrate that interpretable AI can enable earlier diagnosis of depression and underscore the promise of medical AI in psychiatry.
[23] Optimizing Retrieval-Augmented Generation (RAG) for Colloquial Cantonese: A LoRA-Based Systematic Review
David Santandreu Calonge, Linda Smail
Main category: cs.CL
TL;DR: The review explores Parameter-Efficient Fine-Tuning (PEFT) methods, especially Low-Rank Adaptation (LoRA), to enhance Retrieval-Augmented Generation (RAG) systems for Cantonese colloquial expressions. It benchmarks PEFT techniques, evaluates LoRA’s impact, and identifies challenges in linguistic nuance preservation and scalability.
Details
Motivation: To address challenges in RAG systems for Cantonese colloquial expressions due to limited annotated data and linguistic variability, optimizing PEFT methods like LoRA for better efficiency and accuracy.
Method: Systematic analysis of LoRA variants, synthetic data generation, user feedback integration, and adaptive parameter allocation to assess computational efficiency, retrieval precision, and linguistic authenticity.
Result: Dynamic and ensemble LoRA adaptations reduce trainable parameters without compromising accuracy, but preserving fine-grained nuances in low-resource settings like Cantonese remains challenging.
Conclusion: PEFT-enhanced RAG systems show promise for domain-specific tasks, but future work is needed for dialectal authenticity, dynamic adaptation, and scalable fine-tuning.
Abstract: This review examines recent advances in Parameter-Efficient Fine-Tuning (PEFT), with a focus on Low-Rank Adaptation (LoRA), to optimize Retrieval-Augmented Generation (RAG) systems like Qwen3, DeepSeek, and Kimi. These systems face challenges in understanding and generating authentic Cantonese colloquial expressions due to limited annotated data and linguistic variability. The review evaluates the integration of LoRA within RAG frameworks, benchmarks PEFT methods for retrieval and generation accuracy, identifies domain adaptation strategies under limited data, and compares fine-tuning techniques aimed at improving semantic fidelity under data-scarce conditions. A systematic analysis of recent studies employing diverse LoRA variants, synthetic data generation, user feedback integration, and adaptive parameter allocation was conducted to assess their impact on computational efficiency, retrieval precision, linguistic authenticity, and scalability. Findings reveal that dynamic and ensemble LoRA adaptations significantly reduce trainable parameters without sacrificing retrieval accuracy and generation quality in dialectal contexts. However, limitations remain in fully preserving fine-grained linguistic nuances, especially for low-resource settings like Cantonese. The integration of real-time user feedback and domain-specific data remains underdeveloped, limiting model adaptability and personalization. While selective parameter freezing and nonlinear adaptation methods offer better trade-offs between efficiency and accuracy, their robustness at scale remains an open challenge. This review highlights the promise of PEFT-enhanced RAG systems for domain-specific language tasks and calls for future work targeting dialectal authenticity, dynamic adaptation, and scalable fine-tuning pipelines.
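As background for the LoRA variants discussed, a baseline LoRA setup with the `peft` library looks like the following; the base checkpoint, rank, and target modules are illustrative choices, not settings taken from the reviewed studies.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Hypothetical base model and hyperparameters; the reviewed systems and
# their LoRA settings vary by study.
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B")
config = LoraConfig(
    r=8,                       # low-rank dimension: the main efficiency knob
    lora_alpha=16,             # scaling factor applied to the LoRA update
    target_modules=["q_proj", "v_proj"],  # adapt the attention projections
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
peft_model = get_peft_model(model, config)
peft_model.print_trainable_parameters()  # typically <1% of weights trainable
```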
[24] RCR-Router: Efficient Role-Aware Context Routing for Multi-Agent LLM Systems with Structured Memory
Jun Liu, Zhenglun Kong, Changdi Yang, Fan Yang, Tianqi Li, Peiyan Dong, Joannah Nanjekye, Hao Tang, Geng Yuan, Wei Niu, Wenbin Zhang, Pu Zhao, Xue Lin, Dong Huang, Yanzhi Wang
Main category: cs.CL
TL;DR: RCR-Router is a dynamic, role-aware context routing framework for multi-agent LLMs, reducing token usage by up to 30% while maintaining answer quality.
Details
Motivation: Existing coordination schemes in multi-agent LLMs are inefficient due to static or full-context routing, leading to high token consumption and limited adaptability.
Method: Introduces RCR-Router, which dynamically selects relevant memory subsets per agent based on role and task stage, guided by a lightweight scoring policy. Iteratively integrates agent outputs into shared memory.
Result: Reduces token usage by up to 30% on multi-hop QA benchmarks (HotPotQA, MuSiQue, 2WikiMultihop) while improving or maintaining answer quality.
Conclusion: Structured memory routing and output-aware evaluation are key for scalable multi-agent LLM systems.
Abstract: Multi-agent large language model (LLM) systems have shown strong potential in complex reasoning and collaborative decision-making tasks. However, most existing coordination schemes rely on static or full-context routing strategies, which lead to excessive token consumption, redundant memory exposure, and limited adaptability across interaction rounds. We introduce RCR-Router, a modular and role-aware context routing framework designed to enable efficient, adaptive collaboration in multi-agent LLMs. To our knowledge, this is the first routing approach that dynamically selects semantically relevant memory subsets for each agent based on its role and task stage, while adhering to a strict token budget. A lightweight scoring policy guides memory selection, and agent outputs are iteratively integrated into a shared memory store to facilitate progressive context refinement. To better evaluate model behavior, we further propose an Answer Quality Score metric that captures LLM-generated explanations beyond standard QA accuracy. Experiments on three multi-hop QA benchmarks – HotPotQA, MuSiQue, and 2WikiMultihop – demonstrate that RCR-Router reduces token usage (up to 30%) while improving or maintaining answer quality. These results highlight the importance of structured memory routing and output-aware evaluation in advancing scalable multi-agent LLM systems.
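A minimal sketch of budget-constrained memory routing, with a plain scoring callable standing in for RCR-Router's learned lightweight policy; the memory items, the word-count token estimate, and the budget are illustrative.

```python
from typing import Callable

def route_context(memory: list[str], score: Callable[[str], float],
                  budget: int, n_tokens=lambda s: len(s.split())) -> list[str]:
    """Greedy role-aware routing: rank memory items by a (role- and
    stage-conditioned) relevance score and keep the best ones that fit
    within a strict token budget."""
    selected, used = [], 0
    for item in sorted(memory, key=score, reverse=True):
        cost = n_tokens(item)
        if used + cost <= budget:
            selected.append(item)
            used += cost
    return selected

# Toy usage: the scorer here is a stand-in for a learned scoring policy.
memory = ["agent A found the answer to subquestion 1 is 42",
          "agent B browsed an irrelevant page about weather",
          "task stage: verification of subquestion 1"]
score = lambda s: ("subquestion" in s) + ("verification" in s)  # hypothetical
print(route_context(memory, score, budget=18))
```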
[25] InternBootcamp Technical Report: Boosting LLM Reasoning with Verifiable Task Scaling
Peiji Li, Jiasheng Ye, Yongkang Chen, Yichuan Ma, Zijie Yu, Kedi Chen, Ganqu Cui, Haozhan Li, Jiacheng Chen, Chengqi Lyu, Wenwei Zhang, Linyang Li, Qipeng Guo, Dahua Lin, Bowen Zhou, Kai Chen
Main category: cs.CL
TL;DR: InternBootcamp is an open-source framework with 1000+ diverse tasks for LLM reasoning research, featuring automated case generation and evaluation. It improves model performance, especially through task scaling.
Details
Motivation: Real-world reasoning requires handling diverse environments, which narrow-domain benchmarks cannot capture. InternBootcamp addresses this gap.
Method: Developed an automated agent workflow for task generation, supplemented by manual validation. Includes configurable difficulty levels and verification modules.
Result: Training with InternBootcamp significantly improves model performance, with a 32B model achieving state-of-the-art results.
Conclusion: Task scaling is key to performance gains, offering a path towards generalist reasoning models.
Abstract: Large language models (LLMs) have revolutionized artificial intelligence by enabling complex reasoning capabilities. While recent advancements in reinforcement learning (RL) have primarily focused on domain-specific reasoning tasks (e.g., mathematics or code generation), real-world reasoning scenarios often require models to handle diverse and complex environments that narrow-domain benchmarks cannot fully capture. To address this gap, we present InternBootcamp, an open-source framework comprising 1000+ domain-diverse task environments specifically designed for LLM reasoning research. Our codebase offers two key functionalities: (1) automated generation of unlimited training/testing cases with configurable difficulty levels, and (2) integrated verification modules for objective response evaluation. These features make InternBootcamp fundamental infrastructure for RL-based model optimization, synthetic data generation, and model evaluation. Although manually developing such a framework with enormous task coverage is extremely cumbersome, we accelerate the development procedure through an automated agent workflow supplemented by manual validation protocols, which enables the task scope to expand rapidly. With these bootcamps, we further establish Bootcamp-EVAL, an automatically generated benchmark for comprehensive performance assessment. Evaluation reveals that frontier models still underperform in many reasoning tasks, while training with InternBootcamp provides an effective way to significantly improve performance, leading to our 32B model that achieves state-of-the-art results on Bootcamp-EVAL and excels on other established benchmarks. In particular, we validate that consistent performance gains come from including more training tasks, namely task scaling, over two orders of magnitude, offering a promising route towards a capable reasoning generalist.
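The two advertised functionalities, configurable-difficulty case generation and objective verification, fit naturally into a tiny environment class. The modular-arithmetic bootcamp below is invented for illustration; its class and method names are not taken from the actual codebase.

```python
import random

class ModularArithmeticBootcamp:
    """Minimal task environment in the InternBootcamp spirit: generate
    cases at a configurable difficulty and verify answers objectively."""
    def __init__(self, difficulty: int = 1, seed: int = 0):
        self.rng = random.Random(seed)
        self.digits = 2 + difficulty          # difficulty scales operand size

    def case(self) -> tuple[str, int]:
        a = self.rng.randrange(10 ** self.digits)
        b = self.rng.randrange(2, 10 ** (self.digits - 1))
        return f"Compute {a} mod {b}.", a % b

    @staticmethod
    def verify(response: str, answer: int) -> bool:
        return response.strip() == str(answer)

env = ModularArithmeticBootcamp(difficulty=2)
prompt, gold = env.case()
print(prompt, "| correct" if env.verify(str(gold), gold) else "| wrong")
```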
[26] Quick on the Uptake: Eliciting Implicit Intents from Human Demonstrations for Personalized Mobile-Use Agents
Zheng Wu, Heyuan Huang, Yanjia Yang, Yuanyi Song, Xingyu Lou, Weiwen Liu, Weinan Zhang, Jun Wang, Zhuosheng Zhang
Main category: cs.CL
TL;DR: The paper introduces IFRAgent, a framework to improve mobile-use agents by aligning explicit and implicit human intentions, outperforming baselines in intention alignment and step completion.
Details
Motivation: Existing mobile-use agents focus only on explicit human intention flows (e.g., step sequences), neglecting implicit flows (e.g., preferences), limiting personalization.
Method: Proposes IFRAgent, which analyzes explicit flows for SOPs and implicit flows for habits, using retrieval-augmented generation and query rewriting for alignment.
Result: IFRAgent improves intention alignment by 6.79% (32.06% relative) and step completion by 5.30% (26.34% relative) over baselines.
Conclusion: IFRAgent effectively enhances mobile-use agents by integrating explicit and implicit human intentions, achieving significant performance improvements.
Abstract: As multimodal large language models advance rapidly, the automation of mobile tasks has become increasingly feasible through the use of mobile-use agents that mimic human interactions with graphical user interfaces. To further enhance mobile-use agents, previous studies employ demonstration learning to improve mobile-use agents from human demonstrations. However, these methods focus solely on the explicit intention flows of humans (e.g., step sequences) while neglecting implicit intention flows (e.g., personal preferences), which makes it difficult to construct personalized mobile-use agents. In this work, to evaluate the \textbf{I}ntention \textbf{A}lignment \textbf{R}ate between mobile-use agents and humans, we first collect \textbf{MobileIAR}, a dataset containing human-intent-aligned actions and ground-truth actions. This enables a comprehensive assessment of the agents’ understanding of human intent. Then we propose \textbf{IFRAgent}, a framework built upon \textbf{I}ntention \textbf{F}low \textbf{R}ecognition from human demonstrations. IFRAgent analyzes explicit intention flows from human demonstrations to construct a query-level vector library of standard operating procedures (SOP), and analyzes implicit intention flows to build a user-level habit repository. IFRAgent then leverages a SOP extractor combined with retrieval-augmented generation and a query rewriter to generate a personalized query and SOP from a raw, ambiguous query, enhancing the alignment between mobile-use agents and human intent. Experimental results demonstrate that IFRAgent outperforms baselines by an average of 6.79% (32.06% relative improvement) in human intention alignment rate and improves step completion rates by an average of 5.30% (26.34% relative improvement). The codes are available at https://github.com/MadeAgents/Quick-on-the-Uptake.
[27] LLaMA-Based Models for Aspect-Based Sentiment Analysis
Jakub Šmíd, Pavel Přibáň, Pavel Král
Main category: cs.CL
TL;DR: Fine-tuned Orca~2 LLMs outperform state-of-the-art models in ABSA tasks but struggle in zero-shot and few-shot settings.
Details
Motivation: To explore the potential of fine-tuned open-source LLMs for ABSA, given that out-of-the-box LLMs underperform fine-tuned models on these tasks.
Method: Evaluate fine-tuned LLaMA-based models (e.g., Orca~2) on four ABSA tasks across eight English datasets.
Result: Fine-tuned Orca~2 achieves state-of-the-art results, but models perform poorly in zero-shot and few-shot scenarios.
Conclusion: Fine-tuning improves LLM performance in ABSA, but challenges remain in low-resource settings.
Abstract: While large language models (LLMs) show promise for various tasks, their performance in compound aspect-based sentiment analysis (ABSA) tasks lags behind fine-tuned models. However, the potential of LLMs fine-tuned for ABSA remains unexplored. This paper examines the capabilities of open-source LLMs fine-tuned for ABSA, focusing on LLaMA-based models. We evaluate the performance across four tasks and eight English datasets, finding that the fine-tuned Orca~2 model surpasses state-of-the-art results in all tasks. However, all models struggle in zero-shot and few-shot scenarios compared to fully fine-tuned ones. Additionally, we conduct error analysis to identify challenges faced by fine-tuned models.
[28] UWB at WASSA-2024 Shared Task 2: Cross-lingual Emotion Detection
Jakub Šmíd, Pavel Přibáň, Pavel Král
Main category: cs.CL
TL;DR: The paper describes a system for cross-lingual emotion detection and trigger word prediction, using fine-tuned quantized LLMs and multilingual Transformers, achieving top rankings in the WASSA-2024 shared task.
Details
Motivation: To address the challenge of cross-lingual emotion detection and trigger word prediction in tweets across multiple languages.
Method: Fine-tuning quantized large language models (Orca~2) with LoRA and multilingual Transformer models (XLM-R, mT5), enhanced by machine translation and trigger word switching.
Result: Ranked 1st in numerical trigger words detection, 3rd in binary trigger words detection, and 7th in emotion detection.
Conclusion: The proposed system demonstrates strong performance in cross-lingual emotion and trigger word detection, leveraging advanced model fine-tuning and multilingual approaches.
Abstract: This paper presents our system built for the WASSA-2024 Cross-lingual Emotion Detection Shared Task. The task consists of two subtasks: first, to assess an emotion label from six possible classes for a given tweet in one of five languages, and second, to predict words triggering the detected emotions in binary and numerical formats. Our proposed approach revolves around fine-tuning quantized large language models, specifically Orca~2, with low-rank adapters (LoRA) and multilingual Transformer-based models, such as XLM-R and mT5. We enhance performance through machine translation for both subtasks and trigger word switching for the second subtask. The system achieves excellent performance, ranking 1st in numerical trigger words detection, 3rd in binary trigger words detection, and 7th in emotion detection.
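The quantized-LLM-plus-LoRA setup the abstract describes can be approximated with standard Hugging Face tooling. This is a hedged sketch, not the submission's configuration: the model id, rank, and target modules are placeholder choices, and it requires the transformers, peft, and bitsandbytes libraries.

```python
# Sketch: 4-bit quantization plus LoRA adapters on a causal LM.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb = BitsAndBytesConfig(load_in_4bit=True,
                         bnb_4bit_compute_dtype=torch.bfloat16)
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Orca-2-7b", quantization_config=bnb, device_map="auto")

lora = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                  target_modules=["q_proj", "v_proj"],
                  task_type="CAUSAL_LM")
model = get_peft_model(model, lora)   # wraps the frozen quantized base
model.print_trainable_parameters()    # only the low-rank adapters train
```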
[29] Prompt-Based Approach for Czech Sentiment Analysis
Jakub Šmíd, Pavel Přibáň
Main category: cs.CL
TL;DR: Prompt-based methods for Czech sentiment analysis outperform traditional fine-tuning, especially in zero-shot and few-shot scenarios.
Details
Motivation: To introduce and validate prompt-based methods for aspect-based sentiment analysis and sentiment classification in Czech, demonstrating their superiority over traditional fine-tuning.
Method: Sequence-to-sequence models are used for aspect-based tasks, with experiments in zero-shot and few-shot learning for sentiment classification.
Result: Prompting yields better results with limited training data, and pre-training on target domain data improves zero-shot performance.
Conclusion: Prompt-based approaches are effective for Czech sentiment tasks, particularly in low-resource settings.
Abstract: This paper introduces the first prompt-based methods for aspect-based sentiment analysis and sentiment classification in Czech. We employ the sequence-to-sequence models to solve the aspect-based tasks simultaneously and demonstrate the superiority of our prompt-based approach over traditional fine-tuning. In addition, we conduct zero-shot and few-shot learning experiments for sentiment classification and show that prompting yields significantly better results with limited training examples compared to traditional fine-tuning. We also demonstrate that pre-training on data from the target domain can lead to significant improvements in a zero-shot scenario.
[30] LLM driven Text-to-Table Generation through Sub-Tasks Guidance and Iterative Refinement
Rajmohan C, Sarthak Harne, Arvind Agarwal
Main category: cs.CL
TL;DR: The paper proposes an efficient LLM-driven system for text-to-table generation using task decomposition and iterative self-feedback, achieving strong results on complex datasets.
Details
Motivation: Addressing the challenges LLMs face in text-to-table tasks, such as ambiguity, domain-specificity, and numerical reasoning.
Method: Breaking the task into guided sub-tasks and refining tables through iterative self-feedback.
Result: Improved table generation quality and strong performance on public datasets.
Conclusion: The system effectively balances performance and computational cost, though risks of iterative feedback are noted.
Abstract: Transforming unstructured text into structured data is a complex task, requiring semantic understanding, reasoning, and structural comprehension. While Large Language Models (LLMs) offer potential, they often struggle with handling ambiguous or domain-specific data, maintaining table structure, managing long inputs, and addressing numerical reasoning. This paper proposes an efficient system for LLM-driven text-to-table generation that leverages novel prompting techniques. Specifically, the system incorporates two key strategies: breaking down the text-to-table task into manageable, guided sub-tasks and refining the generated tables through iterative self-feedback. We show that this custom task decomposition allows the model to address the problem in a stepwise manner and improves the quality of the generated table. Furthermore, we discuss the benefits and potential risks associated with iterative self-feedback on the generated tables while highlighting the trade-offs between enhanced performance and computational cost. Our methods achieve strong results compared to baselines on two complex text-to-table generation datasets available in the public domain.
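The decomposition-plus-self-feedback loop can be sketched as follows; `llm` stands in for any chat-completion call, and the sub-task prompts are simplified illustrations rather than the paper's actual prompting templates.

```python
# Sketch of guided sub-tasks (columns first, then cells) followed by
# iterative self-feedback on the generated table.

def generate_table(llm, text, max_rounds=3):
    # Sub-task 1: decide the schema before filling any cells.
    columns = llm(f"List the column headers needed to tabulate:\n{text}")
    # Sub-task 2: populate the table under the fixed schema.
    table = llm(f"Fill a table with columns [{columns}] from:\n{text}")
    # Iterative self-feedback: critique, then revise until clean.
    for _ in range(max_rounds):
        critique = llm(f"Check this table against the source text and "
                       f"list errors, or say OK.\nText: {text}\nTable: {table}")
        if critique.strip() == "OK":
            break
        table = llm(f"Revise the table to fix: {critique}\nTable: {table}")
    return table
```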
[31] TopXGen: Topic-Diverse Parallel Data Generation for Low-Resource Machine Translation
Armel Zebaze, Benoît Sagot, Rachel Bawden
Main category: cs.CL
TL;DR: TopXGen improves LLM translation for low-resource languages by generating high-quality, diverse target-side texts for backtranslation.
Details
Motivation: LLMs underperform in low-resource language translation due to limited parallel data quality and diversity.
Method: TopXGen leverages LLMs to generate natural-sounding target-side texts in low-resource languages, enabling effective backtranslation.
Result: TopXGen enhances LLM translation performance in both fine-tuning and in-context learning.
Conclusion: TopXGen addresses data scarcity in low-resource MT by leveraging LLMs for synthetic data generation.
Abstract: LLMs have been shown to perform well in machine translation (MT) with the use of in-context learning (ICL), rivaling supervised models when translating into high-resource languages (HRLs). However, they lag behind when translating into low-resource languages (LRLs). Example selection via similarity search and supervised fine-tuning help. However, the improvements they give are limited by the size, quality and diversity of existing parallel datasets. A common technique in low-resource MT is synthetic parallel data creation, the most frequent of which is backtranslation, whereby existing target-side texts are automatically translated into the source language. However, this assumes the existence of good quality and relevant target-side texts, which are not readily available for many LRLs. In this paper, we present \textsc{TopXGen}, an LLM-based approach for the generation of high quality and topic-diverse data in multiple LRLs, which can then be backtranslated to produce useful and diverse parallel texts for ICL and fine-tuning. Our intuition is that while LLMs struggle to translate into LRLs, their ability to translate well into HRLs and their multilinguality enable them to generate good quality, natural-sounding target-side texts, which can be translated well into a high-resource source language. We show that \textsc{TopXGen} boosts LLM translation performance during fine-tuning and in-context learning. Code and outputs are available at https://github.com/ArmelRandy/topxgen.
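The generate-then-backtranslate pipeline can be summarized in a few lines. In this sketch, `llm` and `translate` are placeholders for real model calls, and the language names are arbitrary examples.

```python
# Sketch: the LLM writes topic-diverse target-side (LRL) texts, which are
# then machine-translated back into the source language to form
# synthetic parallel pairs for ICL or fine-tuning.

def topxgen_pairs(llm, translate, topics, lrl="Hausa", hrl="English"):
    pairs = []
    for topic in topics:
        target_text = llm(
            f"Write a short, natural paragraph in {lrl} about: {topic}")
        source_text = translate(target_text, src=lrl, tgt=hrl)  # backtranslation
        pairs.append((source_text, target_text))  # train source -> target MT
    return pairs
```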
[32] Out of the Box, into the Clinic? Evaluating State-of-the-Art ASR for Clinical Applications for Older Adults
Bram van Dijk, Tiberon Kuiper, Sirin Aoulad si Ahmed, Armel Levebvre, Jake Johnson, Jan Duin, Simon Mooijaart, Marco Spruit
Main category: cs.CL
TL;DR: Generic multilingual ASR models outperform fine-tuned ones for older Dutch adults’ speech, with architecture truncation balancing accuracy and speed.
Details
Motivation: To evaluate ASR models for underrepresented groups like older Dutch adults in clinical chatbot interactions.
Method: Benchmarked generic multilingual and fine-tuned ASR models on older Dutch adults’ speech, considering processing speed.
Result: Generic models generalized better; truncating architectures improved speed-accuracy trade-off, though some high WER cases occurred.
Conclusion: Recent ASR models generalize well without fine-tuning, but architecture adjustments can optimize performance.
Abstract: Voice-controlled interfaces can support older adults in clinical contexts, with chatbots being a prime example, but reliable Automatic Speech Recognition (ASR) for underrepresented groups remains a bottleneck. This study evaluates state-of-the-art ASR models on language use of older Dutch adults, who interacted with the Welzijn.AI chatbot designed for geriatric contexts. We benchmark generic multilingual ASR models, and models fine-tuned for Dutch spoken by older adults, while also considering processing speed. Our results show that generic multilingual models outperform fine-tuned models, which suggests recent ASR models can generalise well out of the box to realistic datasets. Furthermore, our results suggest that truncating existing architectures is helpful in balancing the accuracy-speed trade-off, though we also identify some cases with high WER due to hallucinations.
[33] A Survey on Parallel Text Generation: From Parallel Decoding to Diffusion Language Models
Lingzhe Zhang, Liancheng Fang, Chiming Duan, Minghua He, Leyi Pan, Pei Xiao, Shiyu Huang, Yunpeng Zhai, Xuming Hu, Philip S. Yu, Aiwei Liu
Main category: cs.CL
TL;DR: A survey of parallel text generation methods categorizes approaches into AR-based and Non-AR-based paradigms, analyzing their trade-offs and future research directions.
Details
Motivation: Address the inefficiency of autoregressive (AR) generation in LLMs by exploring parallel text generation techniques.
Method: Systematic survey categorizing methods into AR-based and Non-AR-based paradigms, analyzing their trade-offs in speed, quality, and efficiency.
Result: Identifies core techniques, theoretical trade-offs, and potential for combining or comparing with other acceleration strategies.
Conclusion: Highlights advancements, open challenges, and promising future directions in parallel text generation.
Abstract: As text generation has become a core capability of modern Large Language Models (LLMs), it underpins a wide range of downstream applications. However, most existing LLMs rely on autoregressive (AR) generation, producing one token at a time based on previously generated context, resulting in limited generation speed due to the inherently sequential nature of the process. To address this challenge, an increasing number of researchers have begun exploring parallel text generation, a broad class of techniques aimed at breaking the token-by-token generation bottleneck and improving inference efficiency. Despite growing interest, there remains a lack of comprehensive analysis on what specific techniques constitute parallel text generation and how they improve inference performance. To bridge this gap, we present a systematic survey of parallel text generation methods. We categorize existing approaches into AR-based and Non-AR-based paradigms, and provide a detailed examination of the core techniques within each category. Following this taxonomy, we assess their theoretical trade-offs in terms of speed, quality, and efficiency, and examine their potential for combination and comparison with alternative acceleration strategies. Finally, based on our findings, we highlight recent advancements, identify open challenges, and outline promising directions for future research in parallel text generation.
[34] IROTE: Human-like Traits Elicitation of Large Language Model via In-Context Self-Reflective Optimization
Yuzhuo Bai, Shitong Duan, Muhua Huang, Jing Yao, Zhenghao Liu, Peng Zhang, Tun Lu, Xiaoyuan Yi, Maosong Sun, Xing Xie
Main category: cs.CL
TL;DR: IROTE is a novel method for stable and transferable trait elicitation in LLMs, addressing superficial mimicry by using self-reflection prompts optimized via an information-theoretic objective.
Details
Motivation: Existing methods for trait elicitation in LLMs are superficial and unstable, failing to consistently reflect desired traits across tasks.
Method: IROTE generates and optimizes textual self-reflection prompts based on psychological theories, enhancing trait-driven behavior without fine-tuning.
Result: Experiments show IROTE outperforms baselines, enabling stable trait impersonation across diverse tasks.
Conclusion: IROTE effectively addresses the superficial elicitation problem, offering a robust solution for trait-driven LLM behavior.
Abstract: Trained on various human-authored corpora, Large Language Models (LLMs) have demonstrated a certain capability of reflecting specific human-like traits (e.g., personality or values) by prompting, benefiting applications like personalized LLMs and social simulations. However, existing methods suffer from the superficial elicitation problem: LLMs can only be steered to mimic shallow and unstable stylistic patterns, failing to embody the desired traits precisely and consistently across diverse tasks like humans. To address this challenge, we propose IROTE, a novel in-context method for stable and transferable trait elicitation. Drawing on psychological theories suggesting that traits are formed through identity-related reflection, our method automatically generates and optimizes a textual self-reflection within prompts, which comprises self-perceived experience, to stimulate LLMs’ trait-driven behavior. The optimization is performed by iteratively maximizing an information-theoretic objective that enhances the connections between LLMs’ behavior and the target trait, while reducing noisy redundancy in reflection without any fine-tuning, leading to evocative and compact trait reflection. Extensive experiments across three human trait systems manifest that one single IROTE-generated self-reflection can induce LLMs’ stable impersonation of the target trait across diverse downstream tasks beyond simple questionnaire answering, consistently outperforming existing strong baselines.
[35] Magical: Medical Lay Language Generation via Semantic Invariance and Layperson-tailored Adaptation
Weibin Liao, Tianlong Wang, Yinghao Zhu, Yasha Wang, Junyi Gao, Liantao Ma
Main category: cs.CL
TL;DR: Magical, an asymmetric LoRA architecture, improves semantic fidelity and lay-style diversity in Medical Lay Language Generation (MLLG) by addressing limitations of standard LoRA with heterogeneous datasets.
Details
Motivation: Standard LoRA struggles with semantic fidelity and diverse lay-style generation in MLLG due to multi-source heterogeneous datasets.
Method: Proposes Magical: an asymmetric LoRA with shared matrix A for summarization and isolated matrices B for diverse styles, plus a Semantic Invariance Constraint and Recommendation-guided Switch.
Result: Magical outperforms baseline methods on three datasets, reducing trainable parameters by 31.66%.
Conclusion: Magical effectively addresses LoRA’s limitations in MLLG, enhancing both semantic accuracy and stylistic diversity.
Abstract: Medical Lay Language Generation (MLLG) plays a vital role in improving the accessibility of complex scientific content for broader audiences. Recent work on MLLG commonly employs parameter-efficient fine-tuning methods such as Low-Rank Adaptation (LoRA) to fine-tune large language models (LLMs) using paired expert-lay language datasets. However, LoRA struggles with the challenges posed by multi-source heterogeneous MLLG datasets. Specifically, through a series of exploratory experiments, we reveal that standard LoRA fails to meet the requirements for semantic fidelity and diverse lay-style generation in the MLLG task. To address these limitations, we propose Magical, an asymmetric LoRA architecture tailored for MLLG under heterogeneous data scenarios. Magical employs a shared matrix $A$ for abstractive summarization, along with multiple isolated matrices $B$ for diverse lay-style generation. To preserve semantic fidelity during the lay language generation process, Magical introduces a Semantic Invariance Constraint to mitigate semantic subspace shifts on matrix $A$. Furthermore, to better adapt to diverse lay-style generation, Magical incorporates the Recommendation-guided Switch, an external interface to prompt the LLM to switch between different matrices $B$. Experimental results on three real-world lay language generation datasets demonstrate that Magical consistently outperforms prompt-based methods, vanilla LoRA, and its recent variants, while also reducing trainable parameters by 31.66%.
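The shared-$A$ / per-style-$B_i$ structure lends itself to a compact sketch. Dimensions, rank, and the style-switching rule below are illustrative assumptions rather than the paper's implementation.

```python
# Minimal sketch of an asymmetric LoRA update: one shared down-projection
# A and several per-style up-projections B_i, selected at inference time.
import torch
import torch.nn as nn

class AsymmetricLoRA(nn.Module):
    def __init__(self, d_model=768, rank=8, num_styles=3):
        super().__init__()
        self.A = nn.Linear(d_model, rank, bias=False)  # shared across styles
        self.B = nn.ModuleList(
            nn.Linear(rank, d_model, bias=False) for _ in range(num_styles))

    def forward(self, h, style_id):
        # Base hidden state plus the style-specific low-rank delta.
        return h + self.B[style_id](self.A(h))

layer = AsymmetricLoRA()
h = torch.randn(2, 16, 768)
out = layer(h, style_id=1)  # a Recommendation-guided Switch would pick style_id
print(out.shape)
```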
[36] SciRerankBench: Benchmarking Rerankers Towards Scientific Retrieval-Augmented Generated LLMs
Haotian Chen, Qingqing Long, Meng Xiao, Xiao Luo, Wei Ju, Chengrui Wang, Xuezhi Wang, Yuanchun Zhou, Hengshu Zhu
Main category: cs.CL
TL;DR: The paper introduces SciRerankBench, a benchmark for evaluating rerankers in RAG-LLMs for scientific question answering, addressing noise resilience, relevance disambiguation, and factual consistency.
Details
Motivation: To explore the potential and limitations of two-stage RAG-LLMs in scientific literature question answering, where subtle terminology differences impact answers.
Method: Developed SciRerankBench with three types of Q-C-A pairs (NC, SSLI, CC) and evaluated 13 rerankers across five LLMs.
Result: Provided detailed insights into reranker strengths and limitations, highlighting their performance in scientific contexts.
Conclusion: SciRerankBench is the first benchmark for rerankers in RAG-LLMs, offering valuable guidance for future development.
Abstract: Scientific literature question answering is a pivotal step towards new scientific discoveries. Recently, \textit{two-stage} retrieval-augmented generation large language models (RAG-LLMs) have shown impressive advancements in this domain. Such a two-stage framework, especially the second stage (reranker), is particularly essential in the scientific domain, where subtle differences in terminology may have a significant negative impact on the final factual-oriented or knowledge-intensive answers. Despite this significant progress, the potential and limitations of these works remain unexplored. In this work, we present a Scientific Rerank-oriented RAG Benchmark (SciRerankBench) for evaluating rerankers within RAG-LLM systems, spanning five scientific subjects. To rigorously assess the reranker performance in terms of noise resilience, relevance disambiguation, and factual consistency, we develop three types of question-context-answer (Q-C-A) pairs, i.e., Noisy Contexts (NC), Semantically Similar but Logically Irrelevant Contexts (SSLI), and Counterfactual Contexts (CC). Through systematic evaluation of 13 widely used rerankers on five families of LLMs, we provide detailed insights into their relative strengths and limitations. To the best of our knowledge, SciRerankBench is the first benchmark specifically developed to evaluate rerankers within RAG-LLMs, which provides valuable observations and guidance for their future development.
[37] DevNous: An LLM-Based Multi-Agent System for Grounding IT Project Management in Unstructured Conversation
Stavros Doropoulos, Stavros Vologiannidis, Ioannis Magnisalis
Main category: cs.CL
TL;DR: DevNous, an LLM-based multi-agent system, automates translating unstructured team dialogue into structured IT governance artifacts, achieving high accuracy on a new benchmark.
Details
Motivation: Manual translation of team dialogue into structured IT governance artifacts is a bottleneck in systems management, necessitating automation.
Method: DevNous integrates into chat environments, identifies actionable intents, and manages workflows for tasks like task formalization and progress summaries.
Result: DevNous achieves 81.3% exact match accuracy and a 0.845 F1-Score on a 160-turn benchmark dataset.
Conclusion: The work provides a validated architectural pattern for administrative agents and introduces a public benchmark for the domain.
Abstract: The manual translation of unstructured team dialogue into the structured artifacts required for Information Technology (IT) project governance is a critical bottleneck in modern information systems management. We introduce DevNous, a Large Language Model-based (LLM) multi-agent expert system, to automate this unstructured-to-structured translation process. DevNous integrates directly into team chat environments, identifying actionable intents from informal dialogue and managing stateful, multi-turn workflows for core administrative tasks like automated task formalization and progress summary synthesis. To quantitatively evaluate the system, we introduce a new benchmark of 160 realistic, interactive conversational turns. The dataset was manually annotated with a multi-label ground truth and is publicly available. On this benchmark, DevNous achieves an exact match turn accuracy of 81.3% and a multiset F1-Score of 0.845, providing strong evidence for its viability. The primary contributions of this work are twofold: (1) a validated architectural pattern for developing ambient administrative agents, and (2) the introduction of the first robust empirical baseline and public benchmark dataset for this challenging problem domain.
[38] Privacy-protected Retrieval-Augmented Generation for Knowledge Graph Question Answering
Yunfeng Ning, Mayi Xu, Jintao Wen, Qiankun Pi, Yuanyuan Zhu, Ming Zhong, Jiawei Jiang, Tieyun Qian
Main category: cs.CL
TL;DR: The paper introduces ARoG, a privacy-protected RAG framework, to address challenges of using anonymous entities in KGs with LLMs while ensuring privacy and retrieval performance.
Details
Motivation: LLMs face issues like hallucinations and outdated knowledge. RAG integrates external knowledge but poses privacy risks with private KGs. The paper explores privacy-protected RAG where KG entities are anonymous to LLMs.
Method: Proposes ARoG with two strategies: (1) relation-centric abstraction to convert anonymous entities into retrievable info, and (2) structure-oriented abstraction to align questions with abstracted KG concepts.
Result: ARoG achieves strong performance and privacy-robustness on three datasets.
Conclusion: ARoG effectively addresses privacy and retrieval challenges in RAG systems with anonymous KG entities, demonstrating robust performance.
Abstract: LLMs often suffer from hallucinations and outdated or incomplete knowledge. RAG is proposed to address these issues by integrating external knowledge like that in KGs into LLMs. However, leveraging private KGs in RAG systems poses significant privacy risks due to the black-box nature of LLMs and potential insecure data transmission, especially when using third-party LLM APIs lacking transparency and control. In this paper, we investigate the privacy-protected RAG scenario for the first time, where entities in KGs are anonymous for LLMs, thus preventing them from accessing entity semantics. Due to the loss of semantics of entities, previous RAG systems cannot retrieve question-relevant knowledge from KGs by matching questions with the meaningless identifiers of anonymous entities. To realize an effective RAG system in this scenario, two key challenges must be addressed: (1) How can anonymous entities be converted into retrievable information? (2) How can question-relevant anonymous entities be retrieved? Hence, we propose a novel ARoG framework including relation-centric abstraction and structure-oriented abstraction strategies. For challenge (1), the first strategy abstracts entities into high-level concepts by dynamically capturing the semantics of their adjacent relations. It supplements meaningful semantics which can further support the retrieval process. For challenge (2), the second strategy transforms unstructured natural language questions into structured abstract concept paths. These paths can be more effectively aligned with the abstracted concepts in KGs, thereby improving retrieval performance. To guide LLMs to effectively retrieve knowledge from KGs, the two strategies strictly protect privacy from being exposed to LLMs. Experiments on three datasets demonstrate that ARoG achieves strong performance and privacy-robustness.
[39] Feedback-Driven Tool-Use Improvements in Large Language Models via Automated Build Environments
Junjie Ye, Changhao Jiang, Zhengyin Du, Yufei Xu, Xuesong Yao, Zhiheng Xi, Xiaoran Fan, Qi Zhang, Xuanjing Huang, Jiecao Chen
Main category: cs.CL
TL;DR: The paper proposes an automated pipeline for constructing training environments and a verifiable reward mechanism to enhance tool-use performance in LLMs without degrading general capabilities.
Details
Motivation: Progress in tool use for LLMs is limited by the lack of efficient RL frameworks due to challenges in stable training environments and verifiable rewards.
Method: An automated environment construction pipeline (scenario decomposition, document generation, function integration, complexity scaling, localized deployment) and a verifiable reward mechanism are introduced.
Result: Experiments show significant improvement in tool-use performance for LLMs of varying scales, without degrading general capabilities, attributed to better context understanding and reasoning.
Conclusion: The proposed approach effectively enhances LLMs’ tool-use performance through improved training environments and reward mechanisms, benefiting lower-layer MLP parameter updates.
Abstract: Effective tool use is essential for large language models (LLMs) to interact meaningfully with their environment. However, progress is limited by the lack of efficient reinforcement learning (RL) frameworks specifically designed for tool use, due to challenges in constructing stable training environments and designing verifiable reward mechanisms. To address this, we propose an automated environment construction pipeline, incorporating scenario decomposition, document generation, function integration, complexity scaling, and localized deployment. This enables the creation of high-quality training environments that provide detailed and measurable feedback without relying on external tools. Additionally, we introduce a verifiable reward mechanism that evaluates both the precision of tool use and the completeness of task execution. When combined with trajectory data collected from the constructed environments, this mechanism integrates seamlessly with standard RL algorithms to facilitate feedback-driven model training. Experiments on LLMs of varying scales demonstrate that our approach significantly enhances the models’ tool-use performance without degrading their general capabilities, regardless of inference modes or training algorithms. Our analysis suggests that these gains result from improved context understanding and reasoning, driven by updates to the lower-layer MLP parameters in models.
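A verifiable reward of the kind the abstract outlines, scoring both tool-call precision and task completeness, might look like the sketch below. The weighting scheme and field names are assumptions for illustration, not the paper's reward definition.

```python
# Hedged sketch of a two-signal verifiable reward for tool use.

def tool_use_reward(predicted_calls, reference_calls, task_done,
                    w_precision=0.5, w_completion=0.5):
    # Precision of tool use: fraction of predicted calls that are correct.
    if predicted_calls:
        correct = sum(1 for c in predicted_calls if c in reference_calls)
        precision = correct / len(predicted_calls)
    else:
        precision = 0.0
    # Completeness of task execution: did the episode reach its goal?
    return w_precision * precision + w_completion * float(task_done)

ref = [("search", "weather Paris"), ("book", "flight LHR-CDG")]
pred = [("search", "weather Paris")]
print(tool_use_reward(pred, ref, task_done=False))  # 0.5
```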
[40] TiMoE: Time-Aware Mixture of Language Experts
Robin Faro, Dongyang Fan, Tamar Alphaidze, Martin Jaggi
Main category: cs.CL
TL;DR: TiMoE, a Time-aware Mixture of Language Experts, addresses temporal leakage in LLMs by pre-training on time-segmented data and ensuring causal validity during inference.
Details
Motivation: To prevent LLMs from relying on outdated or future information, ensuring predictions are chronologically grounded.
Method: Pre-train GPT-style experts on disjoint two-year slices of a 2013-2024 corpus, then combine them using TiMoE, which masks irrelevant experts during inference.
Result: TiMoE matches or exceeds single-period expert performance and reduces future-knowledge errors by up to 15%.
Conclusion: Modular, time-segmented pre-training with causal routing effectively grounds LLMs chronologically without significant performance loss.
Abstract: Large language models (LLMs) are typically trained on fixed snapshots of the web, which means that their knowledge becomes stale and their predictions risk temporal leakage: relying on information that lies in the future relative to a query. We tackle this problem by pre-training from scratch a set of GPT-style experts on disjoint two-year slices of a 2013-2024 corpus and combining them through TiMoE, a Time-aware Mixture of Language Experts. At inference time, TiMoE masks all experts whose training window ends after the query timestamp and merges the remaining log-probabilities in a shared space, guaranteeing strict causal validity while retaining the breadth of multi-period knowledge. We also release TSQA, a 10k-question benchmark whose alternatives are explicitly labelled as past, future or irrelevant, allowing fine-grained measurement of temporal hallucinations. Experiments on eight standard NLP tasks plus TSQA show that a co-adapted TiMoE variant matches or exceeds the best single-period expert and cuts future-knowledge errors by up to 15%. Our results demonstrate that modular, time-segmented pre-training paired with causal routing is a simple yet effective path toward LLMs that stay chronologically grounded without sacrificing much general performance. We open-source our code at TiMoE (GitHub): https://github.com/epfml/TiMoE
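The causal routing step is simple to illustrate: drop every expert whose training window ends after the query timestamp, then merge the survivors. The uniform probability averaging below is an assumption; the paper merges log-probabilities in a learned shared space.

```python
# Sketch of TiMoE-style causal expert masking and log-prob merging.
import numpy as np

def timoe_logprobs(expert_logprobs, expert_end_years, query_year):
    # Keep only experts whose data window closed on or before the query.
    valid = [lp for lp, end in zip(expert_logprobs, expert_end_years)
             if end <= query_year]  # strict causal validity
    stacked = np.stack(valid)       # assumes at least one valid expert
    return np.log(np.mean(np.exp(stacked), axis=0))  # mixture over experts

experts = [np.log(np.array([0.7, 0.2, 0.1])),   # 2013-2014 expert
           np.log(np.array([0.2, 0.6, 0.2])),   # 2015-2016 expert
           np.log(np.array([0.1, 0.1, 0.8]))]   # 2023-2024 expert
print(timoe_logprobs(experts, [2014, 2016, 2024], query_year=2017))
```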
[41] An Investigation of Robustness of LLMs in Mathematical Reasoning: Benchmarking with Mathematically-Equivalent Transformation of Advanced Mathematical Problems
Yuren Hao, Xiang Wan, Chengxiang Zhai
Main category: cs.CL
TL;DR: A new framework evaluates LLMs’ mathematical reasoning robustness by testing them on linguistically and parametrically varied, mathematically equivalent problems, revealing performance drops.
Details
Motivation: To assess LLMs' sensitivity to non-mathematical perturbations and improve evaluation of their mathematical reasoning.
Method: Introduces PutnamGAP, a benchmark dataset with varied math problems, and tests 18 LLMs on it.
Result: Performance degrades on variants; OpenAI’s O3 drops by 4-10.5 percentage points, smaller models worse.
Conclusion: The methodology effectively deepens understanding of LLMs’ robustness and guides improvements in mathematical reasoning.
Abstract: In this paper, we introduce a systematic framework beyond conventional methods to assess LLMs’ mathematical-reasoning robustness by stress-testing them on advanced math problems that are mathematically equivalent but with linguistic and parametric variation. These transformations allow us to measure the sensitivity of LLMs to non-mathematical perturbations, thereby enabling a more accurate evaluation of their mathematical reasoning capabilities. Using this new evaluation methodology, we created PutnamGAP, a new benchmark dataset with multiple mathematically-equivalent variations of competition-level math problems. With the new dataset, we evaluate multiple families of representative LLMs and examine their robustness. Across 18 commercial and open-source models we observe sharp performance degradation on the variants. OpenAI’s flagship reasoning model, O3, scores 49% on the originals but drops by 4 percentage points on surface variants, and by 10.5 percentage points on core-step-based variants, while smaller models fare far worse. Overall, the results show that the proposed new evaluation methodology is effective for deepening our understanding of the robustness of LLMs and generating new insights for further improving their mathematical reasoning capabilities.
[42] Steering Towards Fairness: Mitigating Political Bias in LLMs
Afrozah Nadeem, Mark Dras, Usman Naseem
Main category: cs.CL
TL;DR: A framework for probing and mitigating ideological biases in decoder-based LLMs using internal model representations, grounded in the Political Compass Test (PCT).
Details
Motivation: Address concerns about LLMs encoding and reproducing ideological biases, particularly in political and economic dimensions.
Method: Uses contrastive pairs to analyze hidden layer activations in models like Mistral and DeepSeek, with a layer-wise extraction pipeline across ideological axes.
Result: Decoder LLMs systematically encode representational bias across layers, which can be mitigated using steering vectors.
Conclusion: Provides insights into political bias encoding in LLMs and offers a principled debiasing approach beyond surface-level interventions.
Abstract: Recent advancements in large language models (LLMs) have enabled their widespread use across diverse real-world applications. However, concerns remain about their tendency to encode and reproduce ideological biases, particularly along political and economic dimensions. In this paper, we propose a framework for probing and mitigating such biases in decoder-based LLMs through analysis of internal model representations. Grounded in the Political Compass Test (PCT), our method uses contrastive pairs to extract and compare hidden layer activations from models like Mistral and DeepSeek. We introduce a comprehensive activation extraction pipeline capable of layer-wise analysis across multiple ideological axes, revealing meaningful disparities linked to political framing. Our results show that decoder LLMs systematically encode representational bias across layers, which can be leveraged for effective steering vector-based mitigation. This work provides new insights into how political bias is encoded in LLMs and offers a principled approach to debiasing beyond surface-level output interventions.
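The contrastive-pair approach to steering is easy to sketch: compute the mean hidden-state difference between oppositely framed prompts at one layer, then subtract a scaled copy of that direction at inference. The layer choice, scale, and synthetic data below are illustrative assumptions.

```python
# Minimal sketch of steering-vector extraction and application.
import numpy as np

def steering_vector(acts_pos, acts_neg):
    # acts_*: (n_pairs, d_model) activations for contrastively framed prompts.
    return (acts_pos - acts_neg).mean(axis=0)

def steer(hidden, v, alpha=1.0):
    # Push the activation away from the identified bias direction.
    return hidden - alpha * v

rng = np.random.default_rng(0)
pos = rng.normal(0.5, 1.0, size=(32, 64))   # e.g. one ideological framing
neg = rng.normal(-0.5, 1.0, size=(32, 64))  # e.g. the opposing framing
v = steering_vector(pos, neg)
print(steer(rng.normal(size=64), v).shape)
```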
[43] BiasGym: Fantastic Biases and How to Find (and Remove) Them
Sekh Mainul Islam, Nadav Borenstein, Siddhesh Milind Pawar, Haeun Yu, Arnav Arora, Isabelle Augenstein
Main category: cs.CL
TL;DR: BiasGym is a framework for injecting, analyzing, and mitigating biases in LLMs, using BiasInject for bias injection and BiasScope for identification and mitigation.
Details
Motivation: Biases in LLMs are subtle and hard to isolate, requiring systematic tools for analysis and debiasing.
Method: BiasGym uses BiasInject to inject biases via token-based fine-tuning and BiasScope to identify and mitigate biased behavior.
Result: The framework effectively reduces real-world stereotypes and probes fictional associations without degrading downstream task performance.
Conclusion: BiasGym is a versatile tool for bias analysis and mitigation in LLMs, useful for safety and interpretability research.
Abstract: Understanding biases and stereotypes encoded in the weights of Large Language Models (LLMs) is crucial for developing effective mitigation strategies. Biased behaviour is often subtle and non-trivial to isolate, even when deliberately elicited, making systematic analysis and debiasing particularly challenging. To address this, we introduce BiasGym, a simple, cost-effective, and generalizable framework for reliably injecting, analyzing, and mitigating conceptual associations within LLMs. BiasGym consists of two components: BiasInject, which injects specific biases into the model via token-based fine-tuning while keeping the model frozen, and BiasScope, which leverages these injected signals to identify and steer the components responsible for biased behavior. Our method enables consistent bias elicitation for mechanistic analysis, supports targeted debiasing without degrading performance on downstream tasks, and generalizes to biases unseen during training. We demonstrate the effectiveness of BiasGym in reducing real-world stereotypes (e.g., people from a country ‘being reckless drivers’) and in probing fictional associations (e.g., people from a country having ‘blue skin’), showing its utility for both safety interventions and interpretability research.
[44] Weakly Supervised Fine-grained Span-Level Framework for Chinese Radiology Report Quality Assurance
Kaiyu Wang, Lin Mu, Zhiyao Yang, Ximing Li, Xiaotang Zhou, Wanfu Gao, Huimao Zhang
Main category: cs.CL
TL;DR: The paper proposes Sqator, a span-level QA evaluator for radiology reports, automating QA scoring by analyzing fine-grained text spans, reducing labor costs and improving accuracy.
Details
Motivation: Manual QA scoring for radiology reports is labor-intensive and prone to inaccuracies due to diagnosis bias and variation in senior doctors' abilities.
Method: Sqator measures QA scores by analyzing the importance of revised spans between junior and senior reports, merging span scores into the final evaluation.
Result: Sqator achieves competitive QA scores on 12,013 radiology reports, with revised span importance aligning with senior doctors’ judgments.
Conclusion: Sqator offers an efficient and accurate automated solution for QA scoring in radiology reports, addressing labor and bias issues.
Abstract: Quality Assurance (QA) for radiology reports refers to judging whether the junior reports (written by junior doctors) are qualified. The QA scores of one junior report are given by the senior doctor(s) after reviewing the image and junior report. This process requires intensive labor costs for senior doctors. Additionally, the QA scores may be inaccurate for reasons like diagnosis bias, the ability of senior doctors, and so on. To address this issue, we propose a Span-level Quality Assurance EvaluaTOR (Sqator) to mark QA scores automatically. Unlike the common document-level semantic comparison method, we try to analyze the semantic difference by exploring more fine-grained text spans. Specifically, Sqator measures QA scores by measuring the importance of revised spans between junior and senior reports, and outputs the final QA scores by merging all revised span scores. We evaluate Sqator using a collection of 12,013 radiology reports. Experimental results show that Sqator can achieve competitive QA scores. Moreover, the importance scores of revised spans are also consistent with the judgments of senior doctors.
[45] Entangled in Representations: Mechanistic Investigation of Cultural Biases in Large Language Models
Haeun Yu, Seogyeong Jeong, Siddhesh Pawar, Jisu Shin, Jiho Jin, Junho Myung, Alice Oh, Isabelle Augenstein
Main category: cs.CL
TL;DR: The paper introduces Culturescope, a method to analyze cultural biases in LLMs by probing their internal representations, revealing Western-dominance bias and cultural flattening.
Details
Motivation: To understand how LLMs' internal mechanisms lead to cultural misrepresentation, addressing gaps in prior extrinsic evaluations.
Method: Proposes Culturescope, a mechanistic interpretability-based method using patching to extract cultural knowledge and a cultural flattening score to measure biases.
Result: LLMs encode Western-dominance bias and cultural flattening, with low-resource cultures less affected due to limited training data.
Conclusion: Provides a foundation for mitigating cultural biases in LLMs, with publicly available codes and data for further research.
Abstract: The growing deployment of large language models (LLMs) across diverse cultural contexts necessitates a better understanding of how the overgeneralization of less documented cultures within LLMs’ representations impacts their cultural understanding. Prior work only performs extrinsic evaluation of LLMs’ cultural competence, without accounting for how LLMs’ internal mechanisms lead to cultural (mis)representation. To bridge this gap, we propose Culturescope, the first mechanistic interpretability-based method that probes the internal representations of LLMs to elicit the underlying cultural knowledge space. CultureScope utilizes a patching method to extract the cultural knowledge. We introduce a cultural flattening score as a measure of the intrinsic cultural biases. Additionally, we study how LLMs internalize Western-dominance bias and cultural flattening, which allows us to trace how cultural biases emerge within LLMs. Our experimental results reveal that LLMs encode Western-dominance bias and cultural flattening in their cultural knowledge space. We find that low-resource cultures are less susceptible to cultural biases, likely due to their limited training resources. Our work provides a foundation for future research on mitigating cultural biases and enhancing LLMs’ cultural understanding. Our codes and data used for experiments are publicly available.
[46] ASPD: Unlocking Adaptive Serial-Parallel Decoding by Exploring Intrinsic Parallelism in LLMs
Keyu Chen, Zhifeng Shen, Daohai Yu, Haoqian Wu, Wei Wen, Jianfeng He, Ruizhi Qiao, Xing Sun
Main category: cs.CL
TL;DR: ASPD introduces adaptive serial-parallel decoding to speed up LLM inference by leveraging intrinsic parallelism in autoregressive models, achieving up to 3.19x speedup without quality loss.
Details
Motivation: Addressing the inference latency challenges in large language models (LLMs) caused by their sequential autoregressive decoding.
Method: Proposes Adaptive Serial-Parallel Decoding (ASPD), including automated extraction of parallelizable structures and a Hybrid Decoding Engine for efficient transitions between serial and parallel modes.
Result: Achieves up to 3.19x speedup (1.85x average) on Vicuna Bench with <1% quality difference, demonstrating effectiveness and efficiency.
Conclusion: ASPD sets a benchmark for efficient LLM parallel inference, enabling deployment in latency-sensitive applications.
Abstract: The increasing scale and complexity of large language models (LLMs) pose significant inference latency challenges, primarily due to their autoregressive decoding paradigm characterized by the sequential nature of next-token prediction. By re-examining the outputs of autoregressive models, we observed that some segments exhibit parallelizable structures, which we term intrinsic parallelism. Decoding each parallelizable branch simultaneously (i.e. parallel decoding) can significantly improve the overall inference speed of LLMs. In this paper, we propose Adaptive Serial-Parallel Decoding (ASPD), which addresses two core challenges: automated construction of parallelizable data and an efficient parallel decoding mechanism. More specifically, we introduce a non-invasive pipeline that automatically extracts and validates parallelizable structures from the responses of autoregressive models. To empower efficient adaptive serial-parallel decoding, we implement a Hybrid Decoding Engine which enables seamless transitions between serial and parallel decoding modes while maintaining a reusable KV cache, maximizing computational efficiency. Extensive evaluations across General Tasks, Retrieval-Augmented Generation, and Mathematical Reasoning demonstrate that ASPD achieves unprecedented performance in both effectiveness and efficiency. Notably, on Vicuna Bench, our method achieves up to 3.19x speedup (1.85x on average) while maintaining response quality within 1% difference compared to autoregressive models, realizing significant acceleration without compromising generation quality. Our framework sets a groundbreaking benchmark for efficient LLM parallel inference, paving the way for its deployment in latency-sensitive applications such as AI-powered customer service bots and answer retrieval engines.
[47] Munsit at NADI 2025 Shared Task 2: Pushing the Boundaries of Multidialectal Arabic ASR with Weakly Supervised Pretraining and Continual Supervised Fine-tuning
Mahmoud Salhab, Shameed Sait, Mohammad Abusheikh, Hasan Abusheikh
Main category: cs.CL
TL;DR: A scalable training pipeline combining weakly supervised learning and supervised fine-tuning achieves state-of-the-art results for Arabic ASR, addressing data scarcity and dialect diversity.
Details
Motivation: Developing accurate ASR for low-resource languages like Arabic is challenging due to limited labeled data and dialectal complexity.
Method: Pretraining on 15,000 hours of weakly labeled speech (MSA and DA), followed by supervised fine-tuning with filtered weak labels and high-quality annotated data.
Result: State-of-the-art performance, ranking first in the multi-dialectal Arabic ASR challenge.
Conclusion: Weak supervision with fine-tuning effectively overcomes data scarcity, enabling high-quality ASR for dialect-rich, low-resource languages.
Abstract: Automatic speech recognition (ASR) plays a vital role in enabling natural human-machine interaction across applications such as virtual assistants, industrial automation, customer support, and real-time transcription. However, developing accurate ASR systems for low-resource languages like Arabic remains a significant challenge due to limited labeled data and the linguistic complexity introduced by diverse dialects. In this work, we present a scalable training pipeline that combines weakly supervised learning with supervised fine-tuning to develop a robust Arabic ASR model. In the first stage, we pretrain the model on 15,000 hours of weakly labeled speech covering both Modern Standard Arabic (MSA) and various Dialectal Arabic (DA) variants. In the subsequent stage, we perform continual supervised fine-tuning using a mixture of filtered weakly labeled data and a small, high-quality annotated dataset. Our approach achieves state-of-the-art results, ranking first in the multi-dialectal Arabic ASR challenge. These findings highlight the effectiveness of weak supervision paired with fine-tuning in overcoming data scarcity and delivering high-quality ASR for low-resource, dialect-rich languages.
[48] Reveal-Bangla: A Dataset for Cross-Lingual Multi-Step Reasoning Evaluation
Khondoker Ittehadul Islam, Gabriele Sarti
Main category: cs.CL
TL;DR: The paper introduces a Bangla-translated multi-step reasoning dataset and evaluates multilingual models, finding reasoning context helps non-binary questions but models struggle with Bangla reasoning steps.
Details
Motivation: To address the lack of evaluation for language models in low-resource languages like Bangla, using a translated dataset from English.
Method: Manual translation of the English Reveal dataset into Bangla, followed by controlled evaluation of multilingual models on both datasets.
Result: Reasoning context aids non-binary questions, but models underperform in utilizing Bangla reasoning steps effectively.
Conclusion: Reasoning steps impact predictions differently across models and languages, highlighting challenges in low-resource settings.
Abstract: Language models have demonstrated remarkable performance on complex multi-step reasoning tasks. However, their evaluation has been predominantly confined to high-resource languages such as English. In this paper, we introduce a manually translated Bangla multi-step reasoning dataset derived from the English Reveal dataset, featuring both binary and non-binary question types. We conduct a controlled evaluation of English-centric and Bangla-centric multilingual small language models on the original dataset and our translated version to compare their ability to exploit relevant reasoning steps to produce correct answers. Our results show that, in comparable settings, reasoning context is beneficial for more challenging non-binary questions, but models struggle to employ relevant Bangla reasoning steps effectively. We conclude by exploring how reasoning steps contribute to models’ predictions, highlighting different trends across models and languages.
[49] Train Long, Think Short: Curriculum Learning for Efficient Reasoning
Hasan Abed Al Kader Hammoud, Kumail Alhamoud, Abed Hammoud, Elie Bou-Zeid, Marzyeh Ghassemi, Bernard Ghanem
Main category: cs.CL
TL;DR: A curriculum learning strategy for length-controlled reasoning in LLMs, using GRPO, outperforms fixed-budget methods by improving accuracy and token efficiency.
Details
Motivation: Existing fixed-length training budgets don’t leverage the natural learning progression from exploration to compression.
Method: Proposes curriculum learning with GRPO, starting with generous token budgets and gradually tightening them, balanced by a reward function for correctness, efficiency, and formatting.
Result: Outperforms fixed-budget baselines in accuracy and token efficiency across multiple datasets (GSM8K, MATH500, etc.).
Conclusion: Progressive constraint in training serves as a powerful inductive bias for efficient reasoning models.
Abstract: Recent work on enhancing the reasoning abilities of large language models (LLMs) has introduced explicit length control as a means of constraining computational cost while preserving accuracy. However, existing approaches rely on fixed-length training budgets, which do not take advantage of the natural progression from exploration to compression during learning. In this work, we propose a curriculum learning strategy for length-controlled reasoning using Group Relative Policy Optimization (GRPO). Our method starts with generous token budgets and gradually tightens them over training, encouraging models to first discover effective solution strategies and then distill them into more concise reasoning traces. We augment GRPO with a reward function that balances three signals: task correctness (via verifier feedback), length efficiency, and formatting adherence (via structural tags). Experiments on GSM8K, MATH500, SVAMP, College Math, and GSM+ demonstrate that curriculum-based training consistently outperforms fixed-budget baselines at the same final budget, achieving higher accuracy and significantly improved token efficiency. We further ablate the impact of reward weighting and decay schedule design, showing that progressive constraint serves as a powerful inductive bias for training efficient reasoning models. Our code and checkpoints are released at: https://github.com/hammoudhasan/curriculum_grpo.
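The curriculum itself reduces to a decaying token budget feeding a three-signal reward. The weights and the linear decay schedule in this sketch are illustrative assumptions, not the paper's tuned values.

```python
# Sketch of a decaying token budget plus a correctness/length/format reward.

def budget(step, total_steps, start=1024, end=256):
    frac = step / max(total_steps, 1)
    return int(start + (end - start) * frac)  # tighten as training proceeds

def reward(correct, n_tokens, well_formatted, token_budget,
           w_acc=1.0, w_len=0.3, w_fmt=0.1):
    # Full length bonus within budget; graceful penalty beyond it.
    length_bonus = 1.0 if n_tokens <= token_budget else token_budget / n_tokens
    return (w_acc * float(correct)
            + w_len * length_bonus
            + w_fmt * float(well_formatted))

print(budget(0, 100), budget(100, 100))        # 1024 -> 256 over training
print(reward(True, 300, True, budget(100, 100)))
```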
[50] Jointly Generating and Attributing Answers using Logits of Document-Identifier Tokens
Lucas Albarede, Jose Moreno, Lynda Tamine, Luce Lefeuvre
Main category: cs.CL
TL;DR: LoDIT is a method for jointly generating and faithfully attributing answers in RAG by leveraging token logits, outperforming state-of-the-art models on trustworthiness metrics.
Details
Motivation: Addressing LLM hallucination and improving faithfulness in answer attribution by aligning token generation with attribution generation.
Method: LoDIT marks documents with token identifiers, uses their logits to estimate document contributions, and aggregates these into attributions.
Result: Significantly outperforms state-of-the-art models on Trust-Align benchmark, with efficiency and robustness.
Conclusion: LoDIT effectively improves faithfulness and efficiency in answer attribution for LLMs.
Abstract: Despite their impressive performances, Large Language Models (LLMs) remain prone to hallucination, which critically undermines their trustworthiness. While most of the previous work focused on tackling answer and attribution correctness, a recent line of work investigated faithfulness, with a focus on leveraging internal model signals to reflect a model’s actual decision-making process while generating the answer. Nevertheless, these methods induce additional latency and have shown limitations in directly aligning token generation with attribution generation. In this paper, we introduce LoDIT, a method that jointly generates and faithfully attributes answers in RAG by leveraging specific token logits during generation. It consists of two steps: (1) marking the documents with specific token identifiers and then leveraging the logits of these tokens to estimate the contribution of each document to the answer during generation, and (2) aggregating these contributions into document attributions. Experiments on a trustworthiness-focused attributed text-generation benchmark, Trust-Align, show that LoDIT significantly outperforms state-of-the-art models on several metrics. Finally, an in-depth analysis of LoDIT shows both its efficiency in terms of latency and its robustness in different settings.
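The core mechanism, reading attribution off the logits of per-document identifier tokens, can be sketched numerically. The softmax-then-sum aggregation below is an illustrative assumption about step (2), not the paper's exact aggregation rule.

```python
# Sketch: aggregate the per-step probabilities of document-identifier
# tokens into per-document attribution scores.
import numpy as np

def attribute(step_logits, doc_token_ids):
    # step_logits: (n_steps, vocab_size) logits recorded during generation.
    probs = np.exp(step_logits - step_logits.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)        # per-step softmax
    scores = probs[:, doc_token_ids].sum(axis=0)      # one score per document
    return scores / scores.sum()                      # normalized attribution

rng = np.random.default_rng(1)
logits = rng.normal(size=(5, 1000))
logits[:, 101] += 2.0   # the identifier token of one document stays salient
print(attribute(logits, doc_token_ids=[100, 101, 102]))
```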
[51] Retrospective Sparse Attention for Efficient Long-Context Generation
Seonghwan Choi, Beomseok Kang, Dongwon Jo, Jae-Joon Kim
Main category: cs.CL
TL;DR: RetroAttention improves KV cache efficiency in LLMs by revising past attention outputs with new data, outperforming SOTA methods.
Details
Motivation: Addressing the inefficiency of KV cache in long-context tasks and cumulative attention errors during decoding.
Method: Introduces RetroAttention, a technique to update KV cache retrospectively using new decoding data, maintaining a lightweight output cache.
Result: Increases effective KV exposure by up to 1.6x and accuracy by up to 21.9% over SOTA methods.
Conclusion: RetroAttention enhances LLM performance in long-generation tasks by dynamically correcting attention outputs.
Abstract: Large Language Models (LLMs) are increasingly deployed in long-context tasks such as reasoning, code generation, and multi-turn dialogue. However, inference over extended contexts is bottlenecked by the Key-Value (KV) cache, whose memory footprint grows linearly with sequence length and dominates latency at each decoding step. While recent KV cache compression methods identify and load important tokens, they focus predominantly on input contexts and fail to address the cumulative attention errors that arise during long decoding. In this paper, we introduce RetroAttention, a novel KV cache update technique that retrospectively revises past attention outputs using newly arrived KV entries from subsequent decoding steps. By maintaining a lightweight output cache, RetroAttention enables past queries to efficiently access more relevant context, while incurring minimal latency overhead. This breaks the fixed-attention-output paradigm and allows continual correction of prior approximations. Extensive experiments on long-generation benchmarks show that RetroAttention consistently outperforms state-of-the-art (SOTA) KV compression methods, increasing effective KV exposure by up to 1.6$\times$ and accuracy by up to 21.9%.
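The core idea of revising a past attention output once later KV entries arrive can be illustrated with the standard log-sum-exp merge of two partial softmax attentions. This toy sketch shows only that mechanism; RetroAttention's actual cache layout and KV selection policy are not reproduced here.

```python
# Toy sketch of retrospectively revising a past query's attention output when
# new KV entries arrive, via the exact log-sum-exp merge of partial attentions.
import numpy as np

def attend(q, K, V):
    """Return (output, log_normalizer) for single-query softmax attention."""
    s = K @ q / np.sqrt(q.shape[0])
    m = s.max()
    w = np.exp(s - m)
    return (w @ V) / w.sum(), m + np.log(w.sum())

def retro_update(out_old, lse_old, q, K_new, V_new):
    """Merge a cached attention output with attention over new KV entries."""
    out_new, lse_new = attend(q, K_new, V_new)
    m = max(lse_old, lse_new)
    a, b = np.exp(lse_old - m), np.exp(lse_new - m)
    return (a * out_old + b * out_new) / (a + b), m + np.log(a + b)

rng = np.random.default_rng(1)
d = 8
q = rng.normal(size=d)
K1, V1 = rng.normal(size=(4, d)), rng.normal(size=(4, d))
K2, V2 = rng.normal(size=(2, d)), rng.normal(size=(2, d))
out1, lse1 = attend(q, K1, V1)                  # output cached at decode time
out12, _ = retro_update(out1, lse1, q, K2, V2)  # revised with later KV entries
exact, _ = attend(q, np.vstack([K1, K2]), np.vstack([V1, V2]))
print(np.allclose(out12, exact))                # True: the merge is exact
```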
[52] LyS at SemEval 2025 Task 8: Zero-Shot Code Generation for Tabular QA
Adrián Gude, Roi Santos-Ríos, Francisco Prado-Valiño, Ana Ezquerro, Jesús Vilares
Main category: cs.CL
TL;DR: A zero-shot pipeline using a Large Language Model for Tabular Question Answering, achieving rank 33/53 in SemEval 2025 Task 8.
Details
Motivation: To explore zero-shot code generation for extracting information from tabular data without task-specific fine-tuning.
Method: Modular pipeline with a code generator, column relevance identification, data type analysis, and iterative refinement for failed code.
Result: Achieved rank 33 out of 53 in the test phase, validating zero-shot code generation for Tabular QA.
Conclusion: Zero-shot code generation is viable for Tabular QA, with iterative refinement improving robustness.
Abstract: This paper describes our participation in SemEval 2025 Task 8, focused on Tabular Question Answering. We developed a zero-shot pipeline that leverages a Large Language Model to generate functional code capable of extracting the relevant information from tabular data based on an input question. Our approach consists of a modular pipeline where the main code generator module is supported by additional components that identify the most relevant columns and analyze their data types to improve extraction accuracy. In the event that the generated code fails, an iterative refinement process is triggered, incorporating the error feedback into a new generation prompt to enhance robustness. Our results show that zero-shot code generation is a valid approach for Tabular QA, achieving rank 33 of 53 in the test phase despite the lack of task-specific fine-tuning.
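A schematic version of the iterative-refinement loop the abstract describes. The `llm` callable, prompt wording, and retry budget are placeholders; `table` is assumed to be a pandas-style DataFrame, and a real deployment would sandbox the `exec` call.

```python
# A sketch under stated assumptions: `llm` is any code-generating callable,
# `table` a pandas-style DataFrame; execution must be sandboxed in practice.
import traceback

def answer_question(llm, table, question: str, max_retries: int = 3):
    prompt = (f"Write a Python function `solve(table)` answering: {question!r}\n"
              f"Columns and dtypes: {table.dtypes.to_dict()}")
    for _ in range(max_retries):
        code = llm(prompt)
        try:
            scope = {}
            exec(code, scope)            # assumed sandboxed; never run untrusted code
            return scope["solve"](table)
        except Exception:
            # Iterative refinement: fold the traceback into the next prompt.
            prompt += f"\nThe code failed with:\n{traceback.format_exc()}\nFix it."
    return None
```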
[53] A Survey on Training-free Alignment of Large Language Models
Birong Pan, Yongqi Li, Weiyu Zhang, Wenpeng Lu, Mayi Xu, Shen Zhou, Yuanyuan Zhu, Ming Zhong, Tieyun Qian
Main category: cs.CL
TL;DR: This paper reviews training-free (TF) alignment methods for large language models (LLMs), categorizing them into pre-decoding, in-decoding, and post-decoding stages, and discusses their mechanisms, limitations, and future directions.
Details
Motivation: To address the limitations of traditional fine-tuning methods, such as resource intensity and knowledge degradation, by exploring TF alignment techniques that adapt to both open-source and closed-source LLMs.
Method: Systematic review of TF alignment methods, categorized by stages (pre-decoding, in-decoding, post-decoding), with detailed analysis of mechanisms and limitations for LLMs and multimodal LLMs (MLLMs).
Result: Identifies key challenges and future directions for TF alignment, providing a structured overview of current research.
Conclusion: The survey offers guidance for practitioners and advances the development of safer, more reliable LLMs through inclusive and effective TF alignment techniques.
Abstract: The alignment of large language models (LLMs) aims to ensure their outputs adhere to human values, ethical standards, and legal norms. Traditional alignment methods often rely on resource-intensive fine-tuning (FT), which may suffer from knowledge degradation and face challenges in scenarios where model accessibility or computational resources are constrained. In contrast, training-free (TF) alignment techniques–leveraging in-context learning, decoding-time adjustments, and post-generation corrections–offer a promising alternative by enabling alignment without heavily retraining LLMs, making them adaptable to both open-source and closed-source environments. This paper presents the first systematic review of TF alignment methods, categorizing them by stages of pre-decoding, in-decoding, and post-decoding. For each stage, we provide a detailed examination from the viewpoint of LLMs and multimodal LLMs (MLLMs), highlighting their mechanisms and limitations. Furthermore, we identify key challenges and future directions, paving the way for more inclusive and effective TF alignment techniques. By synthesizing and organizing the rapidly growing body of research, this survey offers guidance for practitioners and advances the development of safer and more reliable LLMs.
[54] LLM-as-a-Supervisor: Mistaken Therapeutic Behaviors Trigger Targeted Supervisory Feedback
Chen Xu, Zhenyu Lv, Tian Lan, Xianyang Wang, Luyao Ji, Leyang Cui, Minqiang Yang, Jian Shen, Qunxi Dong, Xiuling Liu, Juan Wang, Bin Hu
Main category: cs.CL
TL;DR: The paper proposes using LLMs as supervisors to train therapists by identifying common mistakes and providing targeted feedback, addressing ethical concerns of direct patient-facing LLMs.
Details
Motivation: Ethical and safety concerns prevent direct use of LLMs in psychotherapy, but they can be repurposed to train therapists by focusing on identifiable mistakes.
Method: A novel paradigm involves creating guidelines for mistakes, building a human-in-the-loop dataset with mistake-prone and supervisor agents, and fine-tuning a model for therapist training.
Result: The fine-tuned model (MATE) provides high-quality feedback aligned with clinical guidelines, proving effective for therapist training.
Conclusion: LLMs can effectively supervise therapist training by focusing on common mistakes, offering a safer and ethical alternative to direct patient interaction.
Abstract: Although large language models (LLMs) hold significant promise in psychotherapy, their direct application in patient-facing scenarios raises ethical and safety concerns. Therefore, this work shifts towards developing an LLM as a supervisor to train real therapists. In addition to the privacy of clinical therapist training data, a fundamental contradiction complicates the training of therapeutic behaviors: clear feedback standards are necessary to ensure a controlled training system, yet there is no absolute “gold standard” for appropriate therapeutic behaviors in practice. In contrast, many common therapeutic mistakes are universal and identifiable, making them effective triggers for targeted feedback that can serve as clearer evidence. Motivated by this, we create a novel therapist-training paradigm: (1) guidelines for mistaken behaviors and targeted correction strategies are first established as standards; (2) a human-in-the-loop dialogue-feedback dataset is then constructed, where a mistake-prone agent deliberately, yet naturally, makes standard mistakes during interviews, and a supervisor agent locates and identifies the mistakes and provides targeted feedback; (3) after fine-tuning on this dataset, the final supervisor model is provided for real therapist training. Detailed experimental results from automated, human, and downstream assessments demonstrate that models fine-tuned on our dataset, MATE, can provide high-quality feedback in line with clinical guidelines, showing significant potential for the therapist-training scenario.
[55] MVISU-Bench: Benchmarking Mobile Agents for Real-World Tasks by Multi-App, Vague, Interactive, Single-App and Unethical Instructions
Zeyu Huang, Juyuan Wang, Longfeng Chen, Boyi Xiao, Leng Cai, Yawen Zeng, Jin Xu
Main category: cs.CL
TL;DR: MVISU-Bench is a bilingual benchmark for evaluating mobile agents, addressing real-world user needs with 404 tasks across 137 apps. Aider, a plug-and-play module, improves success rates by 19.55%.
Details
Motivation: Existing benchmarks fail to address real-world user diversity and complexity, prompting the need for MVISU-Bench.
Method: Developed MVISU-Bench with 404 tasks across 137 apps and introduced Aider, a dynamic prompt prompter.
Result: Aider improved success rates by 19.55%, with notable gains in unethical (53.52%) and interactive (29.41%) tasks.
Conclusion: Highlights the gap between current mobile agents and real-world expectations, demonstrating Aider’s effectiveness.
Abstract: Given the significant advances in Large Vision Language Models (LVLMs) in reasoning and visual understanding, mobile agents are rapidly emerging to meet users’ automation needs. However, existing evaluation benchmarks are disconnected from the real world and fail to adequately address the diverse and complex requirements of users. From our extensive collection of user questionnaires, we identified five task types: Multi-App, Vague, Interactive, Single-App, and Unethical Instructions. Around these tasks, we present MVISU-Bench, a bilingual benchmark that includes 404 tasks across 137 mobile applications. Furthermore, we propose Aider, a plug-and-play module that acts as a dynamic prompt prompter to mitigate risks and clarify user intent for mobile agents. Our Aider is easy to integrate into several frameworks and has successfully improved overall success rates by 19.55% compared to the current state-of-the-art (SOTA) on MVISU-Bench. Specifically, it achieves success rate improvements of 53.52% and 29.41% for unethical and interactive instructions, respectively. Through extensive experiments and analysis, we highlight the gap between existing mobile agents and real-world user expectations.
[56] READER: Retrieval-Assisted Drafter for Efficient LLM Inference
Maxim Divilkovskiy, Vitaly Malygin, Sergey Zlobin, Sultan Isali, Vasily Kalugin, Stanislav Ilyushin, Nuriza Aitassova, Yi Fei, Zeng Weidi
Main category: cs.CL
TL;DR: READER introduces a lossless speculative decoding method for LLMs, leveraging self-repetitions and statistical search to enhance efficiency, especially for large batch sizes, without additional training.
Details
Motivation: The sequential nature of LLM inference is slow and hard to accelerate, limiting deployment efficiency. Existing methods often require training draft models, which is resource-intensive.
Method: READER uses retrieval-assisted speculative decoding, expanding the decoding tree with statistically searched tokens and optimizing the KV cache for large batches.
Result: READER outperforms existing speculative decoding methods, improving speedup by over 40% without extra training and achieving more than 10x speedup on retrieval-augmented tasks.
Conclusion: READER is a highly efficient, training-free solution for accelerating LLM inference, particularly effective in large-batch and search-based applications.
Abstract: Large Language Models (LLMs) generate tokens autoregressively, with each token depending on the preceding context. This sequential nature makes the inference process inherently difficult to accelerate, posing a significant challenge for efficient deployment. In recent years, various methods have been proposed to address this issue, with the most effective approaches often involving the training of additional draft models. In this paper, we introduce READER (Retrieval-Assisted Drafter for Efficient LLM Inference), a novel lossless speculative decoding method that enhances model-based approaches by leveraging self-repetitions in the text. Our algorithm expands the speculative decoding tree using tokens obtained through statistical search. This work focuses on large batch sizes (>= 8), an underexplored yet important area for industrial applications. We also analyze the key-value (KV) cache size during speculative decoding and propose an optimization to improve performance for large batches. As a result, READER outperforms existing speculative decoding methods. Notably, READER requires no additional training and can reuse pre-trained speculator models, increasing the speedup by over 40%. Our method demonstrates particularly strong performance on search-based tasks, such as retrieval-augmented generation, where we achieve more than 10x speedup.
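In the same spirit, a minimal retrieval-based drafter can exploit self-repetition by matching the current suffix against earlier tokens and proposing the continuation as draft tokens. READER's statistical tree expansion and batch-level KV-cache optimizations are not reproduced in this sketch.

```python
# A minimal suffix-lookup drafter; illustrative of retrieval-assisted drafting,
# not READER's actual algorithm.

def draft_tokens(context: list, max_ngram: int = 4, num_draft: int = 8) -> list:
    """Return speculative draft tokens found via longest-suffix lookup."""
    for n in range(max_ngram, 0, -1):        # prefer longer suffix matches
        suffix = context[-n:]
        # Search earlier occurrences of the suffix (excluding the final one).
        for i in range(len(context) - n - 1, -1, -1):
            if context[i:i + n] == suffix:
                return context[i + n:i + n + num_draft]
    return []

ctx = [5, 6, 7, 8, 9, 5, 6, 7]               # toy token ids with a repetition
print(draft_tokens(ctx))                     # -> [8, 9, 5, 6, 7]
```

The drafted tokens are then verified in parallel by the target model, as in standard speculative decoding, so the method stays lossless.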
[57] CPO: Addressing Reward Ambiguity in Role-playing Dialogue via Comparative Policy Optimization
Xinge Ye, Rui Wang, Yuchuan Wu, Victor Ma, Feiteng Fang, Fei Huang, Yongbin Li
Main category: cs.CL
TL;DR: CPO improves RLFT for subjective tasks by shifting from sample-wise to comparative group-wise scoring, using CharacterArena for robust evaluation.
Details
Motivation: Traditional RLFT struggles with subjective tasks due to unstable rewards and subjective criteria. Human evaluation combines explicit and implicit judgments, inspiring CPO.
Method: CPO replaces sample-wise scoring with comparative group-wise scoring. CharacterArena framework includes multi-turn role-playing and trajectory-level comparisons.
Result: CPO reduces reward ambiguity and improves dialogue quality, validated on CharacterEval, CharacterBench, and CharacterArena.
Conclusion: CPO and CharacterArena address RLFT challenges in subjective tasks, offering a more robust evaluation and optimization approach.
Abstract: Reinforcement Learning Fine-Tuning (RLFT) has achieved notable success in tasks with objectively verifiable answers (e.g., code generation, mathematical reasoning), yet struggles with open-ended subjective tasks like role-playing dialogue. Traditional reward modeling approaches, which rely on independent sample-wise scoring, face dual challenges: subjective evaluation criteria and unstable reward signals. Motivated by the insight that human evaluation inherently combines explicit criteria with implicit comparative judgments, we propose Comparative Policy Optimization (CPO). CPO redefines the reward evaluation paradigm by shifting from sample-wise scoring to comparative group-wise scoring. Building on the same principle, we introduce the CharacterArena evaluation framework, which comprises two stages: (1) Contextualized Multi-turn Role-playing Simulation, and (2) Trajectory-level Comparative Evaluation. By operationalizing subjective scoring via objective trajectory comparisons, CharacterArena minimizes contextual bias and enables more robust and fair performance evaluation. Empirical results on CharacterEval, CharacterBench, and CharacterArena confirm that CPO effectively mitigates reward ambiguity and leads to substantial improvements in dialogue quality.
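A toy illustration of the shift from sample-wise to comparative group-wise scoring: a judge orders a whole group of responses at once, and the ranks become zero-centered relative rewards. The judge interface and the rank-to-reward mapping are illustrative assumptions, not CPO's actual formulation.

```python
# Sketch of comparative group-wise scoring; `judge_rank` stands in for a
# reward model or LLM judge that ranks a whole group jointly.

def comparative_rewards(responses: list, judge_rank) -> list:
    """Score responses jointly: the judge ranks the group, and ranks map to
    zero-centered rewards, so the signal is relative rather than absolute."""
    order = judge_rank(responses)            # e.g. [1, 2, 0]: best index first
    n = len(responses)
    rewards = [0.0] * n
    for rank, idx in enumerate(order):
        rewards[idx] = 1.0 - 2.0 * rank / max(n - 1, 1)   # best +1, worst -1
    return rewards

# Toy judge: prefer longer responses, just to make the sketch runnable.
toy_judge = lambda rs: sorted(range(len(rs)), key=lambda i: -len(rs[i]))
print(comparative_rewards(["ok", "a detailed reply", "short"], toy_judge))
# -> [-1.0, 1.0, 0.0]
```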
[58] Utilizing Multilingual Encoders to Improve Large Language Models for Low-Resource Languages
Imalsha Puranegedara, Themira Chathumina, Nisal Ranathunga, Nisansa de Silva, Surangika Ranathunga, Mokanarangan Thayaparan
Main category: cs.CL
TL;DR: A novel architecture fuses all intermediate layers of multilingual encoders with LLMs, improving performance on low-resource languages without multilingual training data.
Details
Motivation: LLMs perform poorly on low-resource languages due to English-centric training. Existing methods like LangBridge underutilize encoder layers.
Method: Proposes fusing all intermediate layers via Global Softmax and Transformer Softmax strategies, mapping representations into the LLM’s embedding space.
Result: Significant improvements on LRLs: Sinhala accuracy rose from 71.66% to 75.86%, and XNLI average accuracy increased from 70.36% to 71.50%.
Conclusion: The approach provides a scalable, data-efficient solution for enhancing multilingual LLMs.
Abstract: Large Language Models (LLMs) excel in English, but their performance degrades significantly on low-resource languages (LRLs) due to English-centric training. While methods like LangBridge align LLMs with multilingual encoders such as the Massively Multilingual Text-to-Text Transfer Transformer (mT5), they typically use only the final encoder layer. We propose a novel architecture that fuses all intermediate layers, enriching the linguistic information passed to the LLM. Our approach features two strategies: (1) a Global Softmax weighting for overall layer importance, and (2) a Transformer Softmax model that learns token-specific weights. The fused representations are mapped into the LLM’s embedding space, enabling it to process multilingual inputs. The model is trained only on English data, without using any parallel or multilingual data. Evaluated on XNLI, IndicXNLI, Sinhala News Classification, and Amazon Reviews, our Transformer Softmax model significantly outperforms the LangBridge baseline. We observe strong performance gains in LRLs, improving Sinhala classification accuracy from 71.66% to 75.86% and achieving clear improvements across Indic languages such as Tamil, Bengali, and Malayalam. These specific gains contribute to an overall boost in average XNLI accuracy from 70.36% to 71.50%. This approach offers a scalable, data-efficient path toward more capable and equitable multilingual LLMs.
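A minimal PyTorch sketch of the Global Softmax strategy described above: one learned scalar weight per encoder layer, softmax-normalized, followed by a projection into the LLM embedding space. All dimensions here are illustrative.

```python
# A minimal sketch, assuming encoder hidden states are stacked per layer;
# dimensions are illustrative, not the paper's exact configuration.
import torch
import torch.nn as nn

class GlobalSoftmaxFusion(nn.Module):
    def __init__(self, num_layers: int, enc_dim: int, llm_dim: int):
        super().__init__()
        self.layer_logits = nn.Parameter(torch.zeros(num_layers))
        self.proj = nn.Linear(enc_dim, llm_dim)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (num_layers, batch, seq_len, enc_dim) from the encoder.
        w = torch.softmax(self.layer_logits, dim=0)        # layer importance
        fused = torch.einsum("l,lbsd->bsd", w, hidden_states)
        return self.proj(fused)                            # map into LLM space

fusion = GlobalSoftmaxFusion(num_layers=25, enc_dim=512, llm_dim=4096)
hs = torch.randn(25, 2, 16, 512)    # e.g. mT5 encoder layer outputs
print(fusion(hs).shape)             # torch.Size([2, 16, 4096])
```

The Transformer Softmax variant replaces the single global weight vector with token-specific weights predicted by a small transformer.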
[59] Link Prediction for Event Logs in the Process Industry
Anastasia Zhukova, Thomas Walton, Christian E. Matt, Bela Gipp
Main category: cs.CL
TL;DR: The paper proposes a record linking (RL) model for fragmented event logs in the process industry, using cross-document coreference resolution (CDCR) enhanced with NLI and STS, achieving significant performance improvements over baselines.
Details
Motivation: Fragmented event logs in shift books hinder knowledge management (KM) by disconnecting related records, preventing effective solution recommendations.
Method: The study frames RL as a CDCR task, enhanced with NLI and STS, and adapts it for the process industry’s text formats.
Result: The RL model outperformed NLI- and STS-driven baselines by 28% and 27%, respectively.
Conclusion: Domain adaptation of CDCR models with reasoning capabilities effectively improves data quality and connectivity in shift logs.
Abstract: Knowledge management (KM) is vital in the process industry for optimizing operations, ensuring safety, and enabling continuous improvement through effective use of operational data and past insights. A key challenge in this domain is the fragmented nature of event logs in shift books, where related records, e.g., entries documenting issues related to equipment or processes and the corresponding solutions, may remain disconnected. This fragmentation hinders the recommendation of previous solutions to the users. To address this problem, we investigate record linking (RL) as link prediction, commonly studied in graph-based machine learning, by framing it as a cross-document coreference resolution (CDCR) task enhanced with natural language inference (NLI) and semantic text similarity (STS), shifting it toward causal inference (CI). We adapt CDCR, traditionally applied in the news domain, into an RL model that operates at the passage level, similar to NLI and STS, while accommodating the process industry’s specific text formats, which contain unstructured text and structured record attributes. Our RL model outperformed the best versions of NLI- and STS-driven baselines by 28% (11.43 points) and 27% (11.21 points), respectively. Our work demonstrates how domain adaptation of state-of-the-art CDCR models, enhanced with reasoning capabilities, can be effectively tailored to the process industry, improving data quality and connectivity in shift logs.
[60] AutoCodeBench: Large Language Models are Automatic Code Benchmark Generators
Jason Chou, Ao Liu, Yuchi Deng, Zhiying Zeng, Tao Zhang, Haotian Zhu, Jianwei Cai, Yue Mao, Chenchen Zhang, Lingyun Tan, Ziyan Xu, Bohui Zhai, Hengyi Liu, Speed Zhu, Wiggin Zhou, Fengzong Lian
Main category: cs.CL
TL;DR: AutoCodeGen automates multilingual code generation dataset creation, addressing limitations of manual benchmarks. AutoCodeBench evaluates LLMs on diverse, high-difficulty tasks, revealing their struggles.
Details
Motivation: Existing code generation benchmarks are limited by manual annotations, Python focus, and uneven language distribution. AutoCodeGen aims to overcome these issues.
Method: AutoCodeGen automates dataset creation using LLMs for test inputs, a multilingual sandbox for outputs, and filtering steps. AutoCodeBench is introduced for evaluation.
Result: Even advanced LLMs struggle with AutoCodeBench’s complexity and multilingual tasks. AutoCodeBench-Complete assesses few-shot capabilities.
Conclusion: AutoCodeBench provides a scalable, high-quality benchmark for challenging multilingual code generation, inspiring further research.
Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities across various domains, with code generation emerging as a key area of focus. While numerous benchmarks have been proposed to evaluate their code generation abilities, these benchmarks face several critical limitations. First, they often rely on manual annotations, which are time-consuming and difficult to scale across different programming languages and problem complexities. Second, most existing benchmarks focus primarily on Python, while the few multilingual benchmarks suffer from limited difficulty and uneven language distribution. To address these challenges, we propose AutoCodeGen, an automated method for generating high-difficulty multilingual code generation datasets without manual annotations. AutoCodeGen ensures the correctness and completeness of test cases by generating test inputs with LLMs and obtaining test outputs through a multilingual sandbox, while achieving high data quality through reverse-order problem generation and multiple filtering steps. Using this novel method, we introduce AutoCodeBench, a large-scale code generation benchmark comprising 3,920 problems evenly distributed across 20 programming languages. It is specifically designed to evaluate LLMs on challenging, diverse, and practical multilingual tasks. We evaluate over 30 leading open-source and proprietary LLMs on AutoCodeBench and its simplified version AutoCodeBench-Lite. The results show that even the most advanced LLMs struggle with the complexity, diversity, and multilingual nature of these tasks. Besides, we introduce AutoCodeBench-Complete, specifically designed for base models to assess their few-shot code generation capabilities. We hope the AutoCodeBench series will serve as a valuable resource and inspire the community to focus on more challenging and practical multilingual code generation scenarios.
[61] SinLlama – A Large Language Model for Sinhala
H. W. K. Aravinda, Rashad Sirajudeen, Samith Karunathilake, Nisansa de Silva, Surangika Ranathunga, Rishemjit Kaur
Main category: cs.CL
TL;DR: SinLlama, a Sinhala-enhanced LLM, outperforms Llama-3-8B in text classification tasks.
Details
Motivation: Low-resource languages like Sinhala are often ignored by open-source LLMs.
Method: Extended Llama-3-8B with Sinhala vocabulary and continual pre-training on a 10M Sinhala corpus.
Result: SinLlama outperformed base and instruct variants of Llama-3-8B in text classification.
Conclusion: SinLlama is the first decoder-based open-source LLM with explicit Sinhala support, showing significant improvement.
Abstract: Low-resource languages such as Sinhala are often overlooked by open-source Large Language Models (LLMs). In this research, we extend an existing multilingual LLM (Llama-3-8B) to better serve Sinhala. We enhance the LLM tokenizer with Sinhala specific vocabulary and perform continual pre-training on a cleaned 10 million Sinhala corpus, resulting in the SinLlama model. This is the very first decoder-based open-source LLM with explicit Sinhala support. When SinLlama was instruction fine-tuned for three text classification tasks, it outperformed base and instruct variants of Llama-3-8B by a significant margin.
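The vocabulary-extension step typically looks like the following with the Hugging Face API. This is a sketch of the general recipe, with a tiny placeholder token list, not SinLlama's actual training script.

```python
# Sketch of tokenizer vocabulary extension plus embedding resize; the token
# list is a tiny stand-in for a full Sinhala vocabulary.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B")

new_sinhala_tokens = ["කොහොමද", "ඔබට"]      # placeholder Sinhala tokens
num_added = tokenizer.add_tokens(new_sinhala_tokens)

# Grow the embedding (and tied output) matrix to cover the new ids; the new
# rows are freshly initialized and then learned during continual pre-training.
model.resize_token_embeddings(len(tokenizer))
print(f"added {num_added} tokens; vocab size is now {len(tokenizer)}")
```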
[62] OdysseyBench: Evaluating LLM Agents on Long-Horizon Complex Office Application Workflows
Weixuan Wang, Dongge Han, Daniel Madrigal Diaz, Jin Xu, Victor Rühle, Saravan Rajmohan
Main category: cs.CL
TL;DR: OdysseyBench is a new benchmark for evaluating LLM agents on long-horizon workflows in office applications, addressing gaps in existing atomic task benchmarks.
Details
Motivation: Existing benchmarks focus on atomic tasks, missing long-term dependencies and multi-interaction coordination in realistic scenarios.
Method: Introduces OdysseyBench with two splits (OdysseyBench+ and OdysseyBench-Neo) and HomerAgents, a framework for automated benchmark generation.
Result: OdysseyBench effectively challenges state-of-the-art LLM agents, providing a more accurate assessment of their capabilities in complex contexts.
Conclusion: OdysseyBench is a valuable resource for advancing LLM agent development and evaluation in real-world productivity scenarios.
Abstract: Autonomous agents powered by large language models (LLMs) are increasingly deployed in real-world applications requiring complex, long-horizon workflows. However, existing benchmarks predominantly focus on atomic tasks that are self-contained and independent, failing to capture the long-term contextual dependencies and multi-interaction coordination required in realistic scenarios. To address this gap, we introduce OdysseyBench, a comprehensive benchmark for evaluating LLM agents on long-horizon workflows across diverse office applications including Word, Excel, PDF, Email, and Calendar. Our benchmark comprises two complementary splits: OdysseyBench+ with 300 tasks derived from real-world use cases, and OdysseyBench-Neo with 302 newly synthesized complex tasks. Each task requires the agent to identify essential information from long-horizon interaction histories and perform multi-step reasoning across various applications. To enable scalable benchmark creation, we propose HomerAgents, a multi-agent framework that automates the generation of long-horizon workflow benchmarks through systematic environment exploration, task generation, and dialogue synthesis. Our extensive evaluation demonstrates that OdysseyBench effectively challenges state-of-the-art LLM agents, providing a more accurate assessment of their capabilities in complex, real-world contexts compared to existing atomic task benchmarks. We believe that OdysseyBench will serve as a valuable resource for advancing the development and evaluation of LLM agents in real-world productivity scenarios. In addition, we release OdysseyBench and HomerAgents to foster research along this line.
[63] Complex Logical Instruction Generation
Mian Zhang, Shujian Liu, Sixun Dong, Ming Yin, Yebowen Hu, Xun Wang, Steven Ma, Song Wang, Sathish Reddy Indurthi, Haoyun Deng, Zhiyu Zoey Chen, Kaiqiang Song
Main category: cs.CL
TL;DR: The paper introduces LogicIFGen and LogicIFEval to evaluate LLMs’ ability to follow logic-rich instructions, revealing significant deficiencies in current models.
Details
Motivation: To explore how well LLMs perform on logic-rich instructions, which are foundational for advanced capabilities like reasoning and agentic behaviors.
Method: Proposes LogicIFGen, a scalable framework for generating verifiable instructions from code functions, and LogicIFEval, a benchmark of 426 logic-rich instructions.
Result: Current LLMs struggle, following fewer than 60% of instructions, highlighting deficiencies in instruction-following ability.
Conclusion: The study underscores the need for improvement in LLMs’ handling of complex logic structures in instructions.
Abstract: Instruction following has catalyzed the recent era of Large Language Models (LLMs) and is the foundational skill underpinning more advanced capabilities such as reasoning and agentic behaviors. As tasks grow more challenging, the logic structures embedded in natural language instructions become increasingly intricate. However, how well LLMs perform on such logic-rich instructions remains under-explored. We propose LogicIFGen and LogicIFEval. LogicIFGen is a scalable, automated framework for generating verifiable instructions from code functions, which can naturally express rich logic such as conditionals, nesting, recursion, and function calls. We further curate a collection of complex code functions and use LogicIFGen to construct LogicIFEval, a benchmark comprising 426 verifiable logic-rich instructions. Our experiments demonstrate that current state-of-the-art LLMs still struggle to correctly follow the instructions in LogicIFEval. Most LLMs can only follow fewer than 60% of the instructions, revealing significant deficiencies in their instruction-following ability. Code and Benchmark: https://github.com/mianzhang/LogicIF
[64] Time Is a Feature: Exploiting Temporal Dynamics in Diffusion Language Models
Wen Wang, Bozhen Fang, Chenchen Jing, Yongliang Shen, Yangyi Shen, Qiuyu Wang, Hao Ouyang, Hao Chen, Chunhua Shen
Main category: cs.CL
TL;DR: The paper introduces methods to exploit intermediate predictions in diffusion large language models (dLLMs) to improve output consistency and accuracy, achieving significant performance gains.
Details
Motivation: Current decoding strategies in dLLMs discard intermediate predictions, missing correct answers that emerge mid-process. The paper addresses this by leveraging temporal dynamics.
Method: Two methods are proposed: 1) Temporal Self-Consistency Voting (training-free, aggregates predictions across steps) and 2) Temporal Consistency Reinforcement (post-training, uses Temporal Semantic Entropy as a reward signal).
Result: Empirical results show improvements: 24.7% on Countdown with the negative TSE reward alone and, combined with the accuracy reward, absolute gains of 2.0% on GSM8K, 4.3% on MATH500, 6.6% on SVAMP, and 25.3% on Countdown.
Conclusion: The paper highlights the potential of temporal dynamics in dLLMs and offers effective tools to improve generation stability and accuracy.
Abstract: Diffusion large language models (dLLMs) generate text through iterative denoising, yet current decoding strategies discard rich intermediate predictions in favor of the final output. Our work here reveals a critical phenomenon, temporal oscillation, where correct answers often emerge in the middle process, but are overwritten in later denoising steps. To address this issue, we introduce two complementary methods that exploit temporal consistency: 1) Temporal Self-Consistency Voting, a training-free, test-time decoding strategy that aggregates predictions across denoising steps to select the most consistent output; and 2) a post-training method termed Temporal Consistency Reinforcement, which uses Temporal Semantic Entropy (TSE), a measure of semantic stability across intermediate predictions, as a reward signal to encourage stable generations. Empirical results across multiple benchmarks demonstrate the effectiveness of our approach. Using the negative TSE reward alone, we observe a remarkable average improvement of 24.7% on the Countdown dataset over an existing dLLM. Combined with the accuracy reward, we achieve absolute gains of 2.0% on GSM8K, 4.3% on MATH500, 6.6% on SVAMP, and 25.3% on Countdown, respectively. Our findings underscore the untapped potential of temporal dynamics in dLLMs and offer two simple yet effective tools to harness them.
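A compact sketch of Temporal Self-Consistency Voting: decode an answer at each denoising step and return the most frequent one. The optional late-step weighting is an illustrative assumption, not necessarily the paper's scheme.

```python
# Sketch of temporal voting over intermediate denoising-step predictions;
# the late-step weighting is an illustrative choice.
from collections import Counter

def temporal_vote(step_answers: list, late_step_bias: float = 0.0) -> str:
    """Aggregate the answers decoded at each denoising step by weighted vote."""
    votes = Counter()
    n = len(step_answers)
    for t, ans in enumerate(step_answers):
        votes[ans] += 1.0 + late_step_bias * t / max(n - 1, 1)
    return votes.most_common(1)[0][0]

# Toy trace exhibiting temporal oscillation: the correct answer "42" appears
# mid-process but is overwritten at the final step.
trace = ["41", "42", "42", "42", "24"]
print(temporal_vote(trace))   # "42", unlike taking only the final output
```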
[65] Quantifying Gender Biases Towards Politicians on Reddit
Sara Marjanovic, Karolina Stańczak, Isabelle Augenstein
Main category: cs.CL
TL;DR: The study analyzes gender biases in online political discussions about male and female politicians, revealing disparities in professional respect and descriptor attribution despite equal public interest.
Details
Motivation: To investigate implicit gender biases in political discourse, focusing on online discussions about male and female politicians.
Method: Collected 10 million Reddit comments, analyzing linguistic and extra-linguistic cues to assess five types of gender bias.
Result: Found equal public interest in female politicians but less professional respect, with biases like first-name usage and body-related descriptors.
Conclusion: Gender biases persist in political discourse, with female politicians facing less professional treatment; dataset released for further research.
Abstract: Despite attempts to increase gender parity in politics, global efforts have struggled to ensure equal female representation. This is likely tied to implicit gender biases against women in authority. In this work, we present a comprehensive study of gender biases that appear in online political discussion. To this end, we collect 10 million comments on Reddit in conversations about male and female politicians, which enables an exhaustive study of automatic gender bias detection. We address not only misogynistic language, but also other manifestations of bias, like benevolent sexism in the form of seemingly positive sentiment and dominance attributed to female politicians, or differences in descriptor attribution. Finally, we conduct a multi-faceted study of gender bias towards politicians investigating both linguistic and extra-linguistic cues. We assess 5 different types of gender bias, evaluating coverage, combinatorial, nominal, sentimental, and lexical biases extant in social media language and discourse. Overall, we find that, contrary to previous research, coverage and sentiment biases suggest equal public interest in female politicians. Rather than overt hostile or benevolent sexism, the results of the nominal and lexical analyses suggest this interest is not as professional or respectful as that expressed about male politicians. Female politicians are often named by their first names and are described in relation to their body, clothing, or family; this is a treatment that is not similarly extended to men. On the now banned far-right subreddits, this disparity is greatest, though differences in gender biases still appear in the right and left-leaning subreddits. We release the curated dataset to the public for future studies.
[66] Utilizing Large Language Models for Information Extraction from Real Estate Transactions
Yu Zhao, Haoxiang Gao, Jinghan Cao, Shiqi Yang
Main category: cs.CL
TL;DR: The paper explores using transformer-based large language models for automating data extraction from real estate contracts, improving efficiency and accuracy.
Details
Motivation: Manual extraction of data from real estate contracts is slow and error-prone, necessitating automation.
Method: The study fine-tunes a large-language model using synthetic contracts generated from real-world transaction data.
Result: The approach achieves significant improvements in metrics and qualitative performance for information retrieval and reasoning.
Conclusion: Transformer-based models show promise for enhancing real estate contract analysis, with potential for future advancements.
Abstract: Real estate sales contracts contain crucial information for property transactions, but manual data extraction can be time-consuming and error-prone. This paper explores the application of large language models, specifically transformer-based architectures, for automated information extraction from real estate contracts. We discuss challenges, techniques, and future directions in leveraging these models to improve efficiency and accuracy in real estate contract analysis. We generated synthetic contracts from a real-world transaction dataset, fine-tuned the large language model on them, and achieved significant quantitative and qualitative improvements in information retrieval and reasoning tasks.
[67] From Pixels to Tokens: Revisiting Object Hallucinations in Large Vision-Language Models
Yuying Shang, Xinyi Zeng, Yutao Zhu, Xiao Yang, Zhengwei Fang, Jingyuan Zhang, Jiawei Chen, Zinan Liu, Yu Tian
Main category: cs.CL
TL;DR: The paper addresses hallucinations in large vision-language models (LVLMs) by proposing PATCH, a plug-and-play tuning strategy to improve visual feature extraction and decoupling, achieving top performance on hallucination datasets.
Details
Motivation: Hallucinations in LVLMs, where objects not in the visual input are generated, impair reliability. The paper investigates if the root cause lies in the visual encoder or modal alignment module.
Method: The authors propose PATCH, a method using adaptive virtual tokens to extract object features from bounding boxes, addressing insufficient decoupling of visual features.
Result: PATCH achieves state-of-the-art performance on multiple multi-modal hallucination datasets.
Conclusion: PATCH provides insights into LVLM hallucinations and offers a practical solution, encouraging further research and innovation in the field.
Abstract: Hallucinations in large vision-language models (LVLMs) are a significant challenge, i.e., generating objects that are not present in the visual input, which impairs their reliability. Recent studies often attribute hallucinations to a lack of understanding of visual input, yet ignore a more fundamental issue: the model’s inability to effectively extract or decouple visual features. In this paper, we revisit hallucinations in LVLMs from an architectural perspective, investigating whether the primary cause lies in the visual encoder (feature extraction) or the modal alignment module (feature decoupling). Motivated by the findings of our preliminary investigation, we propose a novel tuning strategy, PATCH, to mitigate hallucinations in LVLMs. This plug-and-play method can be integrated into various LVLMs, utilizing adaptive virtual tokens to extract object features from bounding boxes, thereby addressing hallucinations caused by insufficient decoupling of visual features. PATCH achieves state-of-the-art performance on multiple multi-modal hallucination datasets. We hope this approach provides researchers with deeper insights into the underlying causes of hallucinations in LVLMs, fostering further advancements and innovation in this field.
[68] AdEval: Alignment-based Dynamic Evaluation to Mitigate Data Contamination in Large Language Models
Yang Fan
Main category: cs.CL
TL;DR: AdEval is a dynamic evaluation method for LLMs that reduces data contamination risks by aligning with static benchmarks’ core content, using online searches for background info, and multi-level cognitive assessments.
Details
Motivation: Address data contamination in LLM evaluations and improve fairness, reliability, and diversity by avoiding static dataset reliance.
Method: Extracts knowledge points from static datasets, uses online searches for background info, designs multi-level cognitive questions, and controls dataset complexity iteratively.
Result: AdEval reduces data contamination impact, improves complexity control, and enables multi-dimensional evaluation.
Conclusion: AdEval enhances LLM evaluation fairness, reliability, and diversity by mitigating data contamination and enabling dynamic, multi-level assessment.
Abstract: As Large Language Models (LLMs) are pre-trained on ultra-large-scale corpora, the problem of data contamination is becoming increasingly serious, and there is a risk that static evaluation benchmarks overestimate the performance of LLMs. To address this, this paper proposes a dynamic data evaluation method called AdEval (Alignment-based Dynamic Evaluation). AdEval first extracts knowledge points and main ideas from static datasets to achieve dynamic alignment with the core content of static benchmarks; by avoiding direct reliance on static datasets, it inherently reduces the risk of data contamination at the source. It then obtains background information through online searches to generate detailed descriptions of the knowledge points. Finally, it designs questions based on Bloom’s cognitive hierarchy across six dimensions (remembering, understanding, applying, analyzing, evaluating, and creating) to enable multi-level cognitive assessment. Additionally, AdEval controls the complexity of dynamically generated datasets through iterative question reconstruction. Experimental results on multiple datasets show that AdEval effectively alleviates the impact of data contamination on evaluation results, addresses insufficient complexity control and single-dimensional evaluation, and improves the fairness, reliability, and diversity of LLM evaluation.
[69] EvoP: Robust LLM Inference via Evolutionary Pruning
Shangyu Wu, Hongchao Du, Ying Xiong, Shuai Chen, Tei-wei Kuo, Nan Guan, Chun Jason Xue
Main category: cs.CL
TL;DR: EvoP is an evolutionary pruning framework for LLMs that improves performance and efficiency by using a diverse calibration dataset and optimal pruning patterns.
Details
Motivation: Existing pruning methods for LLMs are heuristic and ignore data characteristics, leading to suboptimal performance.
Method: EvoP uses cluster-based calibration dataset sampling (CCDS) and evolutionary pruning pattern searching (EPPS) to optimize pruning.
Result: EvoP outperforms existing pruning techniques in performance and efficiency across various LLMs and tasks.
Conclusion: EvoP is a practical and scalable solution for deploying LLMs in resource-constrained environments.
Abstract: Large Language Models (LLMs) have achieved remarkable success in natural language processing tasks, but their massive size and computational demands hinder their deployment in resource-constrained environments. Existing model pruning methods address this issue by removing redundant structures (e.g., elements, channels, layers) from the model. However, these methods employ a heuristic pruning strategy, which leads to suboptimal performance. Besides, they also ignore the data characteristics when pruning the model. To overcome these limitations, we propose EvoP, an evolutionary pruning framework for robust LLM inference. EvoP first presents a cluster-based calibration dataset sampling (CCDS) strategy for creating a more diverse calibration dataset. EvoP then introduces an evolutionary pruning pattern searching (EPPS) method to find the optimal pruning pattern. Compared to existing model pruning techniques, EvoP achieves the best performance while maintaining the best efficiency. Experiments across different LLMs and different downstream tasks validate the effectiveness of the proposed EvoP, making it a practical and scalable solution for deploying LLMs in real-world applications.
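The evolutionary search step can be illustrated with a toy genetic algorithm over binary pruning masks. The `evaluate` callable stands in for measuring loss of the pruned model on the calibration set, and all population sizes and operators here are assumptions, not EvoP's actual EPPS design.

```python
# Toy evolutionary search over binary pruning masks under a fixed sparsity
# budget; illustrative only.
import random

def evolve_mask(num_units: int, keep: int, evaluate, pop=20, gens=30):
    def random_mask():
        m = [1] * keep + [0] * (num_units - keep)
        random.shuffle(m)
        return m

    def mutate(m):
        child = m[:]
        i = random.choice([k for k, v in enumerate(child) if v == 1])
        j = random.choice([k for k, v in enumerate(child) if v == 0])
        child[i], child[j] = 0, 1          # swap keeps the sparsity budget
        return child

    population = [random_mask() for _ in range(pop)]
    for _ in range(gens):
        population.sort(key=evaluate)      # lower calibration loss is better
        parents = population[: pop // 2]
        population = parents + [mutate(random.choice(parents)) for _ in parents]
    return min(population, key=evaluate)

# Toy objective: pretend early units matter more, so pruning late ones is best.
toy_loss = lambda m: sum(w * (1 - v) for w, v in zip(range(10, 0, -1), m))
print(evolve_mask(10, keep=6, evaluate=toy_loss))
```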
[70] Conformal Linguistic Calibration: Trading-off between Factuality and Specificity
Zhengping Jiang, Anqi Liu, Benjamin Van Durme
Main category: cs.CL
TL;DR: The paper introduces Conformal Linguistic Calibration (CLC), a method combining abstention and linguistic calibration to manage uncertainty in language model outputs, ensuring calibrated and actionable responses.
Details
Motivation: Addressing the unreliability of language model outputs by improving uncertainty adaptation, balancing information retention and downstream usability.
Method: Proposes CLC, reinterpreting linguistic calibration as answer set prediction, and connects abstention and calibration via linguistic pragmatics. Implements CLC to control response imprecision.
Result: Demonstrates calibrated outputs with conformal guarantees on factual accuracy and enables uncertainty-aware adaptive claim rewriting.
Conclusion: CLC offers a unified, controllable approach to balance factuality and specificity in language model responses.
Abstract: Language model outputs are not always reliable, thus prompting research into how to adapt model responses based on uncertainty. Common approaches include: \emph{abstention}, where models refrain from generating responses when uncertain; and \emph{linguistic calibration}, where models hedge their statements using uncertainty quantifiers. However, abstention can withhold valuable information, while linguistically calibrated responses are often challenging to leverage in downstream tasks. We propose a unified view, Conformal Linguistic Calibration (CLC), which reinterprets linguistic calibration as \emph{answer set prediction}. First we present a framework connecting abstention and linguistic calibration through the lens of linguistic pragmatics. We then describe an implementation of CLC that allows for controlling the level of imprecision in model responses. Results demonstrate our method produces calibrated outputs with conformal guarantees on factual accuracy. Further, our approach enables fine-tuning models to perform uncertainty-aware adaptive claim rewriting, offering a controllable balance between factuality and specificity.
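The conformal machinery behind answer-set prediction follows the standard split-conformal recipe, sketched below with toy scores; this shows the generic construction with its coverage guarantee, not the paper's implementation.

```python
# Bare-bones split-conformal answer sets; calibration scores and candidate
# probabilities are toy stand-ins.
import math

def conformal_threshold(cal_scores: list, alpha: float = 0.1) -> float:
    """Quantile of calibration nonconformity scores (score = 1 - p(true answer))."""
    n = len(cal_scores)
    k = math.ceil((n + 1) * (1 - alpha))
    return sorted(cal_scores)[min(k, n) - 1]

def answer_set(candidate_probs: dict, qhat: float) -> set:
    """Include every candidate whose nonconformity score falls under the threshold;
    the set then covers the true answer with probability at least 1 - alpha."""
    return {a for a, p in candidate_probs.items() if 1 - p <= qhat}

cal = [0.2, 0.4, 0.1, 0.3, 0.5, 0.25, 0.35, 0.15, 0.45, 0.05]
qhat = conformal_threshold(cal, alpha=0.2)
print(answer_set({"Paris": 0.7, "Lyon": 0.2, "Nice": 0.05}, qhat))  # {'Paris'}
```

CLC's linguistic step then maps such answer sets back to a single hedged claim at a controlled level of imprecision.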
[71] Evaluating Large Language Models for Automated Clinical Abstraction in Pulmonary Embolism Registries: Performance Across Model Sizes, Versions, and Parameters
Mahmoud Alwakeel, Emory Buck, Jonathan G. Martin, Imran Aslam, Sudarshan Rajagopal, Jian Pei, Mihai V. Podgoreanu, Christopher J. Lindsell, An-Kwok Ian Wong
Main category: cs.CL
TL;DR: Large-language models (LLMs) like Llama-3, Phi-4, and Gemma-3 can accurately automate concept extraction from CT pulmonary embolism reports, matching human standards and improving scalability for PE registries.
Details
Motivation: To evaluate if openly available LLMs can automate the extraction of concepts from CT pulmonary embolism reports without compromising data quality, reducing the need for manual abstraction.
Method: Tested four Llama-3 variants, Phi-4, and Gemma-3 on 250 dual-annotated CTPE reports from MIMIC-IV and Duke University, measuring accuracy, PPV, and NPV against human standards.
Result: Larger models achieved higher accuracy (up to 0.96 for 70B variants), with Phi-4 and Gemma-3 matching performance. Dual-model review showed high PPV (≥0.95) and NPV (≥0.98) for PE presence and other concepts.
Conclusion: LLMs provide a scalable and accurate solution for PE registry abstraction, with dual-model workflows ensuring data quality with minimal human oversight.
Abstract: Pulmonary embolism (PE) registries accelerate practice-improving research but depend on resource-intensive manual abstraction of radiology reports. We evaluated whether openly available large-language models (LLMs) can automate concept extraction from computed-tomography PE (CTPE) reports without sacrificing data quality. Four Llama-3 (L3) variants (3.0 8B, 3.1 8B, 3.1 70B, 3.3 70B) and two reviewer models, Phi-4 (P4) 14B and Gemma-3 27B (G3), were tested on 250 dual-annotated CTPE reports each from MIMIC-IV and Duke University. Outcomes were accuracy, positive predictive value (PPV), and negative predictive value (NPV) versus a human gold standard across model sizes, temperature settings, and shot counts. Mean accuracy across all concepts increased with scale: 0.83 (L3-0 8B), 0.91 (L3-1 8B), and 0.96 for both 70B variants; P4 14B achieved 0.98; G3 matched. Accuracy differed by <0.03 between datasets, underscoring external robustness. In dual-model concordance analysis (L3 70B + P4 14B), PE-presence PPV was ≥0.95 and NPV ≥0.98, while location, thrombus burden, right-heart strain, and image-quality artifacts each maintained PPV ≥0.90 and NPV ≥0.95. Fewer than 4% of individual concept annotations were discordant, and complete agreement was observed in more than 75% of reports. G3 performed comparably. LLMs therefore offer a scalable, accurate solution for PE registry abstraction, and a dual-model review workflow can further safeguard data quality with minimal human oversight.
[72] Opioid Named Entity Recognition (ONER-2025) from Reddit
Muhammad Ahmad, Rita Orji, Fida Ullah, Ildar Batyrshin, Grigori Sidorov
Main category: cs.CL
TL;DR: The study uses NLP to analyze opioid-related discussions on Reddit, creating a dataset and proposing a real-time monitoring system with high accuracy.
Details
Motivation: To address the opioid overdose crisis by extracting insights from social media data.
Method: Leveraged NLP (ONER-2025) to analyze Reddit posts, created a manually annotated dataset, and developed a real-time monitoring system using machine learning and transformer models.
Result: Achieved 97% accuracy and F1-score with transformer models, outperforming baselines by 10.23%.
Conclusion: The study demonstrates the potential of NLP and real-time monitoring to combat the opioid crisis effectively.
Abstract: The opioid overdose epidemic remains a critical public health crisis, particularly in the United States, leading to significant mortality and societal costs. Social media platforms like Reddit provide vast amounts of unstructured data that offer insights into public perceptions, discussions, and experiences related to opioid use. This study leverages Natural Language Processing (NLP), specifically Opioid Named Entity Recognition (ONER-2025), to extract actionable information from these platforms. Our research makes four key contributions. First, we created a unique, manually annotated dataset sourced from Reddit, where users share self-reported experiences of opioid use via different administration routes. This dataset contains 331,285 tokens and includes eight major opioid entity categories. Second, we detail our annotation process and guidelines while discussing the challenges of labeling the ONER-2025 dataset. Third, we analyze key linguistic challenges, including slang, ambiguity, fragmented sentences, and emotionally charged language, in opioid discussions. Fourth, we propose a real-time monitoring system to process streaming data from social media, healthcare records, and emergency services to identify overdose events. Using 5-fold cross-validation in 11 experiments, our system integrates machine learning, deep learning, and transformer-based language models with advanced contextual embeddings to enhance understanding. Our transformer-based models (bert-base-NER and roberta-base) achieved 97% accuracy and F1-score, outperforming baselines by 10.23% (RF=0.88).
[73] CrossWordBench: Evaluating the Reasoning Capabilities of LLMs and LVLMs with Controllable Puzzle Generation
Jixuan Leng, Chengsong Huang, Langlin Huang, Bill Yuchen Lin, William W. Cohen, Haohan Wang, Jiaxin Huang
Main category: cs.CL
TL;DR: CrossWordBench is a new benchmark evaluating LLMs and LVLMs using crossword puzzles to test multimodal reasoning, revealing strengths and weaknesses in current models.
Details
Motivation: Existing frameworks lack dynamic interplay between text and visual constraints, prompting the need for a more integrated evaluation tool.
Method: CrossWordBench uses controllable puzzle generation in text and image formats, adjustable difficulty, and multiple evaluation strategies.
Result: Reasoning LLMs outperform non-reasoning models, while LVLMs struggle, showing a link between grid-parsing accuracy and performance.
Conclusion: The study exposes limitations in current models’ reasoning and offers a method for future multimodal constrained evaluations.
Abstract: Existing reasoning evaluation frameworks for Large Language Models (LLMs) and Large Vision-Language Models (LVLMs) predominantly assess either text-based reasoning or vision-language understanding capabilities, with limited dynamic interplay between textual and visual constraints. To address this limitation, we introduce CrossWordBench, a benchmark designed to evaluate the reasoning capabilities of both LLMs and LVLMs through the medium of crossword puzzles – a task requiring multimodal adherence to semantic constraints from text-based clues and intersectional constraints from visual grid structures. CrossWordBench leverages a controllable puzzle generation framework that produces puzzles in two formats (text and image), supports adjustable difficulty through prefill ratio control, and offers different evaluation strategies, ranging from direct puzzle solving to interactive modes. Our extensive evaluation of over 20 models reveals that reasoning LLMs substantially outperform non-reasoning models by effectively leveraging crossing-letter constraints. We further demonstrate that LVLMs struggle with the task, showing a strong correlation between their puzzle-solving performance and grid-parsing accuracy. Our findings highlight limitations of the reasoning capabilities of current LLMs and LVLMs, and provide an effective approach for creating multimodal constrained tasks for future evaluations.
[74] ChatBench: From Static Benchmarks to Human-AI Evaluation
Serina Chang, Ashton Anderson, Jake M. Hofman
Main category: cs.CL
TL;DR: The paper introduces ChatBench, a dataset for evaluating human-LLM collaboration, showing AI-alone benchmarks don’t predict user-AI performance and improving user simulator accuracy via fine-tuning.
Details
Motivation: To address the gap in evaluating collaborative performance of humans and LLMs, as standard benchmarks like MMLU only measure AI-alone capabilities.
Method: Conducted a user study converting MMLU questions into user-AI conversations, creating ChatBench with AI-alone, user-alone, and user-AI data for analysis.
Result: Found AI-alone accuracy doesn’t predict user-AI accuracy, with notable differences in subjects like math and physics. Fine-tuning a user simulator improved correlation by over 20 points.
Conclusion: ChatBench enables better evaluation of human-LLM collaboration, and fine-tuning user simulators can enhance interactive assessment scalability.
Abstract: With the rapid adoption of LLM-based chatbots, there is a pressing need to evaluate what humans and LLMs can achieve together. However, standard benchmarks, such as MMLU, measure LLM capabilities in isolation (i.e., “AI-alone”). Here, we design and conduct a user study to convert MMLU questions into user-AI conversations, by seeding the user with the question and having them carry out a conversation with the LLM to answer their question. We release ChatBench, a new dataset with AI-alone, user-alone, and user-AI data for 396 questions and two LLMs, including 144K answers and 7,336 user-AI conversations. We find that AI-alone accuracy fails to predict user-AI accuracy, with significant differences across multiple subjects (math, physics, and moral reasoning), and we analyze the user-AI conversations to provide insight into how they diverge from AI-alone benchmarks. Finally, we show that fine-tuning a user simulator on a subset of ChatBench improves its ability to estimate user-AI accuracies, increasing correlation on held-out questions by more than 20 points, creating possibilities for scaling interactive evaluation.
[75] Retrieval-Augmented Generation with Conflicting Evidence
Han Wang, Archiki Prasad, Elias Stengel-Eskin, Mohit Bansal
Main category: cs.CL
TL;DR: The paper introduces RAMDocs, a dataset simulating complex retrieval scenarios, and MADAM-RAG, a multi-agent LLM system for handling ambiguity, misinformation, and noise in RAG. It outperforms baselines on tasks like AmbigDocs and FaithEval.
Details
Motivation: Addressing the limitations of prior work, which tackled ambiguity, misinformation, or noise in isolation, by jointly handling these challenges in retrieval-augmented generation.
Method: Proposes RAMDocs for realistic conflict scenarios and MADAM-RAG, a multi-agent debate system to collate answers while discarding misinformation and noise.
Result: MADAM-RAG improves performance by up to 11.40% on AmbigDocs and 15.80% on FaithEval. RAMDocs challenges baselines (32.60 exact match score).
Conclusion: MADAM-RAG effectively handles joint challenges but leaves a gap in imbalanced evidence scenarios, indicating room for improvement.
Abstract: Large language model (LLM) agents are increasingly employing retrieval-augmented generation (RAG) to improve the factuality of their responses. However, in practice, these systems often need to handle ambiguous user queries and potentially conflicting information from multiple sources while also suppressing inaccurate information from noisy or irrelevant documents. Prior work has generally studied and addressed these challenges in isolation, considering only one aspect at a time, such as handling ambiguity or robustness to noise and misinformation. We instead consider multiple factors simultaneously, proposing (i) RAMDocs (Retrieval with Ambiguity and Misinformation in Documents), a new dataset that simulates complex and realistic scenarios for conflicting evidence for a user query, including ambiguity, misinformation, and noise; and (ii) MADAM-RAG, a multi-agent approach in which LLM agents debate over the merits of an answer over multiple rounds, allowing an aggregator to collate responses corresponding to disambiguated entities while discarding misinformation and noise, thereby handling diverse sources of conflict jointly. We demonstrate the effectiveness of MADAM-RAG using both closed and open-source models on AmbigDocs – which requires presenting all valid answers for ambiguous queries – improving over strong RAG baselines by up to 11.40% and on FaithEval – which requires suppressing misinformation – where we improve by up to 15.80% (absolute) with Llama3.3-70B-Instruct. Furthermore, we find that RAMDocs poses a challenge for existing RAG baselines (Llama3.3-70B-Instruct only obtains 32.60 exact match score). While MADAM-RAG begins to address these conflicting factors, our analysis indicates that a substantial gap remains especially when increasing the level of imbalance in supporting evidence and misinformation.
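A schematic of the multi-round debate-and-aggregate loop described above. The `llm` placeholder, prompt wording, and round count are illustrative; the real system's aggregator and debate protocol are more elaborate.

```python
# Schematic multi-agent debate loop in the spirit of MADAM-RAG: one agent per
# retrieved document, several rounds conditioned on the other agents' previous
# turns, then an aggregator; `llm` is a placeholder callable.

def madam_rag(llm, query: str, documents: list, rounds: int = 3) -> str:
    answers = ["" for _ in documents]
    for _ in range(rounds):
        new_answers = []
        for i, doc in enumerate(documents):
            others = [a for j, a in enumerate(answers) if j != i and a]
            prompt = (f"Question: {query}\nYour document: {doc}\n"
                      f"Other agents said: {others}\n"
                      f"Answer using only your document; note disagreements.")
            new_answers.append(llm(prompt))
        answers = new_answers
    # Aggregator: collate answers per disambiguated entity, drop misinformation.
    return llm(f"Question: {query}\nAgent answers: {answers}\n"
               f"Summarize all valid answers and discard unsupported claims.")
```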
[76] Do LLMs Really Forget? Evaluating Unlearning with Knowledge Correlation and Confidence Awareness
Rongzhe Wei, Peizhi Niu, Hans Hao-Hsun Hsu, Ruihan Wu, Haoteng Yin, Yifan Li, Eli Chien, Kamalika Chaudhuri, Olgica Milenkovic, Pan Li
Main category: cs.CL
TL;DR: The paper introduces a knowledge unlearning evaluation framework for LLMs, addressing implicit dependencies in knowledge and proposing inference-based evaluation with LLM judges.
Details
Motivation: Existing unlearning methods focus on explicit fact removal, ignoring latent dependencies and non-deterministic knowledge in LLMs, leading to incomplete unlearning.Method: Proposes a framework using knowledge graphs with confidence scores and inference-based evaluation with calibrated LLM judges.
Result: The framework provides a more realistic assessment, revealing current strategies overestimate unlearning effectiveness.
Conclusion: The proposed approach offers a rigorous and accurate evaluation of unlearning in LLMs, with public code availability.
Abstract: Machine unlearning techniques aim to mitigate unintended memorization in large language models (LLMs). However, existing approaches predominantly focus on the explicit removal of isolated facts, often overlooking latent inferential dependencies and the non-deterministic nature of knowledge within LLMs. Consequently, facts presumed forgotten may persist implicitly through correlated information. To address these challenges, we propose a knowledge unlearning evaluation framework that more accurately captures the implicit structure of real-world knowledge by representing relevant factual contexts as knowledge graphs with associated confidence scores. We further develop an inference-based evaluation protocol leveraging powerful LLMs as judges; these judges reason over the extracted knowledge subgraph to determine unlearning success. Our LLM judges utilize carefully designed prompts and are calibrated against human evaluations to ensure their trustworthiness and stability. Extensive experiments on our newly constructed benchmark demonstrate that our framework provides a more realistic and rigorous assessment of unlearning performance. Moreover, our findings reveal that current evaluation strategies tend to overestimate unlearning effectiveness. Our code is publicly available at https://github.com/Graph-COM/Knowledge_Unlearning.git.
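As a rough illustration of the evaluation protocol, the sketch below represents correlated facts as confidence-scored triples and asks a judge model whether the target fact remains inferable; the dataclass, prompt wording, and YES/NO convention are assumptions for exposition, not the paper's interface.

```python
from dataclasses import dataclass

@dataclass
class Triple:
    head: str
    relation: str
    tail: str
    confidence: float  # how strongly the model is believed to hold this fact

def judge_unlearning(llm_judge, target: Triple, subgraph: list[Triple]) -> bool:
    """Return True if the judge deems the target fact effectively forgotten."""
    context = "\n".join(
        f"({t.head}, {t.relation}, {t.tail}) [conf={t.confidence:.2f}]"
        for t in subgraph
    )
    verdict = llm_judge(
        f"Correlated facts the model still expresses:\n{context}\n"
        f"Can the fact ({target.head}, {target.relation}, {target.tail}) "
        f"still be inferred from them? Answer YES or NO."
    )
    return verdict.strip().upper().startswith("NO")
```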
[77] Mind the Gap: Benchmarking LLM Uncertainty, Discrimination, and Calibration in Specialty-Aware Clinical QA
Alberto Testoni, Iacer Calixto
Main category: cs.CL
TL;DR: The paper evaluates uncertainty quantification (UQ) methods for clinical QA, analyzing eleven specialties and six question types across ten LLMs. It introduces a lightweight behavioral feature-based method and examines conformal prediction, finding UQ reliability varies by specialty and question type.
Details
Motivation: Reliable UQ is crucial for LLMs in high-risk clinical QA, but existing methods lack evaluation across diverse clinical specialties and question types.Method: The study evaluates score-based UQ methods, introduces a lightweight behavioral feature-based approach, and examines conformal prediction across ten LLMs.
Result: UQ reliability varies by clinical specialty and question type, influenced by calibration and discrimination shifts.
Conclusion: Model selection or ensembling should consider complementary strengths and clinical context to improve UQ reliability.
Abstract: Reliable uncertainty quantification (UQ) is essential when employing large language models (LLMs) in high-risk domains such as clinical question answering (QA). In this work, we evaluate uncertainty estimation methods for clinical QA focusing, for the first time, on eleven clinical specialties and six question types, and across ten open-source LLMs (general-purpose, biomedical, and reasoning models). We analyze score-based UQ methods, present a case study introducing a novel lightweight method based on behavioral features derived from reasoning-oriented models, and examine conformal prediction as a complementary set-based approach. Our findings reveal that uncertainty reliability is not a monolithic property, but one that depends on clinical specialty and question type due to shifts in calibration and discrimination. Our results highlight the need to select or ensemble models based on their distinct, complementary strengths and clinical use.
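For readers unfamiliar with the set-based approach, a minimal split-conformal sketch for multiple-choice clinical QA might look like the following, using 1 - p(correct option) as the nonconformity score; the score choice and coverage level are illustrative, not the paper's exact setup.

```python
import numpy as np

def calibrate_threshold(cal_probs, cal_labels, alpha=0.1):
    """Conformal quantile so prediction sets cover the truth ~(1 - alpha)."""
    scores = 1.0 - cal_probs[np.arange(len(cal_labels)), cal_labels]
    n = len(scores)
    level = min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0)
    return np.quantile(scores, level, method="higher")

def prediction_set(option_probs, qhat):
    """All answer options whose nonconformity score clears the threshold."""
    return [i for i, p in enumerate(option_probs) if 1.0 - p <= qhat]
```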
[78] Unsupervised Document and Template Clustering using Multimodal Embeddings
Phillipe R. Sampaio, Helene Maxcici
Main category: cs.CL
TL;DR: The paper proposes using multimodal embeddings for unsupervised document clustering to improve granularity, distinguishing templates within categories. It evaluates various models and highlights their potential for document processing.
Details
Motivation: To achieve finer-grained document clustering by leveraging multimodal embeddings, capturing textual, layout, and visual features for better understanding and organization.Method: Utilizes multimodal embeddings (textual, layout, visual) with clustering algorithms (k-Means, DBSCAN, HDBSCAN+k-NN, BIRCH) and evaluates pre-trained models like SBERT, LayoutLMv1, LayoutLMv3, DiT, Donut, ColPali, Gemma3, and InternVL3.
Result: Multimodal embeddings significantly enhance document clustering, aiding intelligent document processing, layout analysis, and unsupervised classification.
Conclusion: The study demonstrates the potential of multimodal embeddings for document clustering, providing insights into model advantages/limitations and suggesting future research directions.
Abstract: This paper investigates a novel approach to unsupervised document clustering by leveraging multimodal embeddings as input to clustering algorithms such as $k$-Means, DBSCAN, a combination of HDBSCAN and $k$-NN, and BIRCH. Our method aims to achieve a finer-grained document understanding by not only grouping documents at the type level (e.g., invoices, purchase orders), but also distinguishing between different templates within the same document category. This is achieved by using embeddings that capture textual content, layout information, and visual features of documents. We evaluated the effectiveness of this approach using embeddings generated by several state-of-the-art pre-trained multimodal models, including SBERT, LayoutLMv1, LayoutLMv3, DiT, Donut, ColPali, Gemma3, and InternVL3. Our findings demonstrate the potential of multimodal embeddings to significantly enhance document clustering, offering benefits for various applications in intelligent document processing, document layout analysis, and unsupervised document classification. This work provides valuable insight into the advantages and limitations of different multimodal models for this task and opens new avenues for future research to understand and organize document collections.
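A compact sketch of the two-level clustering this enables: any of the embedding models above supplies the vectors, coarse k-Means groups document types, and HDBSCAN separates templates within each type. Cluster counts and minimum sizes below are placeholder choices.

```python
import numpy as np
from sklearn.cluster import KMeans, HDBSCAN

def cluster_types_then_templates(embeddings, n_types=5):
    """embeddings: (n_docs, dim) array from any multimodal encoder."""
    type_labels = KMeans(n_clusters=n_types, n_init="auto").fit_predict(embeddings)
    template_labels = np.full(len(embeddings), -1)
    for t in np.unique(type_labels):
        idx = np.where(type_labels == t)[0]
        # HDBSCAN finds a variable number of template clusters per type.
        template_labels[idx] = HDBSCAN(min_cluster_size=5).fit_predict(embeddings[idx])
    return type_labels, template_labels
```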
[79] Reasoning with Exploration: An Entropy Perspective on Reinforcement Learning for LLMs
Daixuan Cheng, Shaohan Huang, Xuekai Zhu, Bo Dai, Wayne Xin Zhao, Zhenliang Zhang, Furu Wei
Main category: cs.CL
TL;DR: The paper explores the role of entropy in enhancing exploratory reasoning in LLMs, introducing a simple RL modification to improve reasoning depth and performance.
Details
Motivation: Addressing the performance plateaus in LLM reasoning by leveraging entropy to balance exploration and exploitation.Method: Augmenting the RL advantage function with an entropy-based term to promote deeper reasoning chains.
Result: Significant improvements in the Pass@K metric, even for large K values, demonstrating enhanced LLM reasoning capabilities.
Conclusion: The proposed entropy-based method effectively pushes the boundaries of LLM reasoning by encouraging deeper exploration.
Abstract: Balancing exploration and exploitation is a central goal in reinforcement learning (RL). Despite recent advances in enhancing large language model (LLM) reasoning, most methods lean toward exploitation, and increasingly encounter performance plateaus. In this work, we revisit entropy – a signal of exploration in RL – and examine its relationship to exploratory reasoning in LLMs. Through empirical analysis, we uncover positive correlations between high-entropy regions and three types of exploratory reasoning actions: (1) pivotal tokens that determine or connect logical steps, (2) reflective actions such as self-verification and correction, and (3) rare behaviors under-explored by the base LLMs. Motivated by this, we introduce a minimal modification to standard RL with only one line of code: augmenting the advantage function with an entropy-based term. Unlike traditional maximum-entropy methods which encourage exploration by promoting uncertainty, we encourage exploration by promoting longer and deeper reasoning chains. Notably, our method achieves significant gains on the Pass@K metric – an upper-bound estimator of LLM reasoning capabilities – even when evaluated with extremely large K values, pushing the boundaries of LLM reasoning.
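The "one line of code" the abstract mentions plausibly amounts to something like the advantage shaping below, with `alpha` an assumed coefficient; detaching the entropy term is one reasonable reading of the abstract's contrast with traditional maximum-entropy methods.

```python
import torch

def entropy_shaped_advantage(advantages, token_entropy, alpha=0.1):
    """advantages, token_entropy: (batch, seq_len) tensors; alpha is assumed."""
    # Detaching keeps this a credit-assignment signal rather than adding a
    # direct entropy-maximization gradient to the policy update.
    return advantages + alpha * token_entropy.detach()
```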
[80] Post-Completion Learning for Language Models
Xiang Fei, Siqi Wang, Shu Wei, Yuxiang Nie, Wei Shi, Hao Feng, Chao Feng, Can Huang
Main category: cs.CL
TL;DR: Post-Completion Learning (PCL) extends training beyond the end-of-sequence token, using white-box RL for self-assessment and reward prediction to improve output quality.
Details
Motivation: Traditional training stops at the end-of-sequence token, leaving the post-completion space unexploited.Method: PCL uses white-box reinforcement learning for self-assessment and reward prediction, combining dual-track SFT and RL for multi-objective optimization.
Result: Experiments show PCL outperforms traditional SFT and RL methods across datasets and models.
Conclusion: PCL offers a novel training approach, improving output quality without sacrificing deployment efficiency.
Abstract: Current language model training paradigms typically terminate learning upon reaching the end-of-sequence (<eos>) token.
[81] DYNARTmo: A Dynamic Articulatory Model for Visualization of Speech Movement Patterns
Bernd J. Kröger
Main category: cs.CL
TL;DR: DYNARTmo is a dynamic articulatory model for visualizing speech articulation in 2D, built on UK-DYNAMO, with applications in phonetics education and speech therapy.
Details
Motivation: To enhance understanding and visualization of speech articulation processes for educational and therapeutic purposes.Method: Integrates articulatory underspecification, segmental/gestural control, and coarticulation, simulating six articulators via continuous/discrete parameters.
Result: Implemented in a web app (SpeechArticulationTrainer) with multiple views, suitable for phonetics and therapy.
Conclusion: Focuses on static modeling; future work will include dynamic movement and articulatory-acoustic integration.
Abstract: We present DYNARTmo, a dynamic articulatory model designed to visualize speech articulation processes in a two-dimensional midsagittal plane. The model builds upon the UK-DYNAMO framework and integrates principles of articulatory underspecification, segmental and gestural control, and coarticulation. DYNARTmo simulates six key articulators based on ten continuous and six discrete control parameters, allowing for the generation of both vocalic and consonantal articulatory configurations. The current implementation is embedded in a web-based application (SpeechArticulationTrainer) that includes sagittal, glottal, and palatal views, making it suitable for use in phonetics education and speech therapy. While this paper focuses on the static modeling aspects, future work will address dynamic movement generation and integration with articulatory-acoustic modules.
[82] AI Pedagogy: Dialogic Social Learning for Artificial Agents
Sabrina Patania, Luca Annese, Cansu Koyuturk, Azzurra Ruggeri, Dimitri Ognibene
Main category: cs.CL
TL;DR: The paper explores socially mediated learning for LLMs, introducing an ‘AI Social Gym’ where AI learners interact with teacher agents, showing improved knowledge acquisition through dialogic methods.
Details
Motivation: Addressing LLMs' limitations in online knowledge acquisition by drawing from Vygotsky's sociocultural theory, contrasting traditional AI training paradigms.Method: Using the ‘AI Social Gym’ environment, AI learners engage in pedagogical dialogues with teacher agents, focusing on ontology acquisition.
Result: Dialogic methods, especially mixed-direction interactions, enhance LLMs’ knowledge acquisition, outperforming unidirectional methods and direct structured knowledge access.
Conclusion: Integrating pedagogical insights into AI training improves post-training knowledge acquisition, offering a complementary approach to existing strategies.
Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities in processing extensive offline datasets. However, they often face challenges in acquiring and integrating complex knowledge online. Traditional AI training paradigms, predominantly based on supervised learning or reinforcement learning, mirror a ‘Piagetian’ model of independent exploration. These approaches typically rely on large datasets and sparse feedback signals, limiting the models’ ability to learn efficiently from interactions. Drawing inspiration from Vygotsky’s sociocultural theory, this study explores the potential of socially mediated learning paradigms to address these limitations. We introduce a dynamic environment, termed the ‘AI Social Gym’, where an AI learner agent engages in dyadic pedagogical dialogues with knowledgeable AI teacher agents. These interactions emphasize external, structured dialogue as a core mechanism for knowledge acquisition, contrasting with methods that depend solely on internal inference or pattern recognition. Our investigation focuses on how different pedagogical strategies impact the AI learning process in the context of ontology acquisition. Empirical results indicate that such dialogic approaches, particularly those involving mixed-direction interactions combining top-down explanations with learner-initiated questioning, significantly enhance the LLM’s ability to acquire and apply new knowledge, outperforming both unidirectional instructional methods and direct access to structured knowledge, formats typically present in training datasets. These findings suggest that integrating pedagogical and psychological insights into AI and robot training can substantially improve post-training knowledge acquisition and response quality. This approach offers a complementary pathway to existing strategies like prompt engineering.
[83] Role-Aware Language Models for Secure and Contextualized Access Control in Organizations
Saeed Almheiri, Yerulan Kongrat, Adrian Santosh, Ruslan Tasmukhanov, Josemaria Loza Vera, Muhammad Dehan Al Kautsar, Fajri Koto
Main category: cs.CL
TL;DR: The paper explores fine-tuning LLMs to generate role-specific responses in enterprise settings, comparing three modeling strategies and evaluating them on synthetic and adapted datasets.
Details
Motivation: Address the gap in existing safety methods by enabling role-specific access control for LLMs in enterprise environments.Method: Three strategies are tested: BERT-based classifier, LLM-based classifier, and role-conditioned generation, evaluated on two datasets (adapted and synthetic).
Result: Performance is assessed across organizational structures, with robustness tested against prompt injection, role mismatch, and jailbreak attempts.
Conclusion: The study demonstrates the feasibility of fine-tuning LLMs for role-specific behavior, highlighting potential for enterprise applications.
Abstract: As large language models (LLMs) are increasingly deployed in enterprise settings, controlling model behavior based on user roles becomes an essential requirement. Existing safety methods typically assume uniform access and focus on preventing harmful or toxic outputs, without addressing role-specific access constraints. In this work, we investigate whether LLMs can be fine-tuned to generate responses that reflect the access privileges associated with different organizational roles. We explore three modeling strategies: a BERT-based classifier, an LLM-based classifier, and role-conditioned generation. To evaluate these approaches, we construct two complementary datasets. The first is adapted from existing instruction-tuning corpora through clustering and role labeling, while the second is synthetically generated to reflect realistic, role-sensitive enterprise scenarios. We assess model performance across varying organizational structures and analyze robustness to prompt injection, role mismatch, and jailbreak attempts.
[84] A Novel Evaluation Benchmark for Medical LLMs: Illuminating Safety and Effectiveness in Clinical Domains
Shirui Wang, Zhihui Tang, Huaxia Yang, Qiuhong Gong, Tiantian Gu, Hongyang Ma, Yongxin Wang, Wubin Sun, Zeliang Lian, Kehang Mao, Yinan Jiang, Zhicheng Huang, Lingyun Ma, Wenjie Shen, Yajie Ji, Yunhui Tan, Chunbo Wang, Yunlu Gao, Qianling Ye, Rui Lin, Mingyu Chen, Lijuan Niu, Zhihao Wang, Peng Yu, Mengran Lang, Yue Liu, Huimin Zhang, Haitao Shen, Long Chen, Qiguang Zhao, Si-Xuan Liu, Lina Zhou, Hua Gao, Dongqiang Ye, Lingmin Meng, Youtao Yu, Naixin Liang, Jianxiong Wu
Main category: cs.CL
TL;DR: The paper introduces CSEDB, a benchmark for evaluating LLMs in clinical settings, revealing moderate performance and domain-specific advantages.
Details
Motivation: To address challenges in evaluating the safety and effectiveness of LLMs in clinical decision support.Method: Developed CSEDB, a framework with 30 criteria, tested on six LLMs using 2,069 Q&A items reviewed by specialists.
Result: LLMs showed moderate performance (57.2% avg), with a 13.3% drop in high-risk scenarios; domain-specific models outperformed general ones.
Conclusion: CSEDB provides a standardized metric for LLM evaluation, aiding safer and more effective deployment in healthcare.
Abstract: Large language models (LLMs) hold promise in clinical decision support but face major challenges in safety evaluation and effectiveness validation. We developed the Clinical Safety-Effectiveness Dual-Track Benchmark (CSEDB), a multidimensional framework built on clinical expert consensus, encompassing 30 criteria covering critical areas like critical illness recognition, guideline adherence, and medication safety, with weighted consequence measures. Thirty-two specialist physicians developed and reviewed 2,069 open-ended Q&A items aligned with these criteria, spanning 26 clinical departments to simulate real-world scenarios. Benchmark testing of six LLMs revealed moderate overall performance (average total score 57.2%, safety 54.7%, effectiveness 62.3%), with a significant 13.3% performance drop in high-risk scenarios (p $<$ 0.0001). Domain-specific medical LLMs showed consistent performance advantages over general-purpose models, with relatively higher top scores in safety (0.912) and effectiveness (0.861). The findings of this study not only provide a standardized metric for evaluating the clinical application of medical LLMs, facilitating comparative analyses, risk exposure identification, and improvement directions across different scenarios, but also hold the potential to promote safer and more effective deployment of large language models in healthcare environments.
[85] A Few Words Can Distort Graphs: Knowledge Poisoning Attacks on Graph-based Retrieval-Augmented Generation of Large Language Models
Jiayi Wen, Tianxin Chen, Zhirun Zheng, Cheng Huang
Main category: cs.CL
TL;DR: GraphRAG enhances LLMs with structured knowledge graphs but is vulnerable to knowledge poisoning attacks (KPAs), which manipulate graph construction to mislead reasoning. Two attacks (TKPA and UKPA) show high success rates and evade detection by defenses.
Details
Motivation: To expose vulnerabilities in GraphRAG by demonstrating how minor text modifications can poison knowledge graphs and mislead downstream tasks.Method: Proposes two KPAs: Targeted KPA (TKPA) for precise QA outcome control and Universal KPA (UKPA) for disrupting graph integrity via linguistic cues.
Result: TKPA achieves 93.1% success in manipulating QA outcomes; UKPA reduces QA accuracy from 95% to 50% with minimal text changes. Defenses fail to detect these attacks.
Conclusion: GraphRAG pipelines are highly susceptible to knowledge poisoning, and current defenses are inadequate, calling for further research in securing them.
Abstract: Graph-based Retrieval-Augmented Generation (GraphRAG) has recently emerged as a promising paradigm for enhancing large language models (LLMs) by converting raw text into structured knowledge graphs, improving both accuracy and explainability. However, GraphRAG relies on LLMs to extract knowledge from raw text during graph construction, and this process can be maliciously manipulated to implant misleading information. Targeting this attack surface, we propose two knowledge poisoning attacks (KPAs) and demonstrate that modifying only a few words in the source text can significantly change the constructed graph, poison the GraphRAG, and severely mislead downstream reasoning. The first attack, named Targeted KPA (TKPA), utilizes graph-theoretic analysis to locate vulnerable nodes in the generated graphs and rewrites the corresponding narratives with LLMs, achieving precise control over specific question-answering (QA) outcomes with a success rate of 93.1%, while keeping the poisoned text fluent and natural. The second attack, named Universal KPA (UKPA), exploits linguistic cues such as pronouns and dependency relations to disrupt the structural integrity of the generated graph by altering globally influential words. With fewer than 0.05% of full text modified, the QA accuracy collapses from 95% to 50%. Furthermore, experiments show that state-of-the-art defense methods fail to detect these attacks, highlighting that securing GraphRAG pipelines against knowledge poisoning remains largely unexplored.
[86] GTPO and GRPO-S: Token and Sequence-Level Reward Shaping with Policy Entropy
Hongze Tan, Jianfei Pan
Main category: cs.CL
TL;DR: The paper introduces Dynamic Entropy Weighting to improve RL for LLMs by fine-grained credit assignment via entropy-weighted rewards for tokens and sequences, outperforming baselines.
Details
Motivation: Current RL methods for LLMs use uniform rewards for all tokens, limiting performance in long-chain reasoning tasks.Method: Proposes Dynamic Entropy Weighting with GTPO (token-level rewards) and GRPO-S (sequence-level rewards) based on entropy.
Result: Outperforms DAPO baseline, showing entropy-weighting boosts reasoning performance.
Conclusion: Dynamic Entropy Weighting enhances fine-grained credit assignment, improving deep reasoning in LLMs.
Abstract: Reinforcement learning (RL) with algorithms like Group Relative Policy Optimization (GRPO) improves Large Language Model (LLM) reasoning, but is limited by a coarse-grained credit assignment that applies a uniform reward to all tokens in a sequence. This is a major flaw in long-chain reasoning tasks. This paper addresses the problem with \textbf{Dynamic Entropy Weighting}. Our core idea is that high-entropy tokens in correct responses can guide the policy toward a higher performance ceiling. This allows us to create more fine-grained reward signals for precise policy updates in two ways: 1) \textbf{Group Token Policy Optimization} (\textbf{GTPO}) assigns an entropy-weighted reward to each token for fine-grained credit assignment. 2) \textbf{Sequence-Level Group Relative Policy Optimization} (\textbf{GRPO-S}) assigns an entropy-weighted reward to each sequence based on its average token entropy. Experiments show our methods significantly outperform the strong DAPO baseline. The results confirm that our entropy-weighting mechanism is the key driver of this performance boost, offering a better path to enhance deep reasoning in models.
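A hedged reading of the two variants as reward shaping; the tensor shapes, normalization, and `beta` coefficient below are guesses from the abstract rather than the paper's exact equations.

```python
import torch

def gtpo_token_rewards(seq_reward, token_entropy):
    """GTPO-style: distribute each sequence's reward over its tokens by
    normalized entropy weight. seq_reward: (batch,); token_entropy: (batch, T)."""
    w = token_entropy / (token_entropy.sum(dim=-1, keepdim=True) + 1e-8)
    return seq_reward.unsqueeze(-1) * w  # (batch, T) per-token rewards

def grpo_s_sequence_rewards(seq_reward, token_entropy, beta=0.1):
    """GRPO-S-style: scale each sequence's reward by its mean token entropy."""
    return seq_reward * (1.0 + beta * token_entropy.mean(dim=-1))
```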
[87] LLMEval-3: A Large-Scale Longitudinal Study on Robust and Fair Evaluation of Large Language Models
Ming Zhang, Yujiong Shen, Jingyi Deng, Yuhui Wang, Yue Zhang, Junzhe Wang, Shichun Liu, Shihan Dou, Huayu Sha, Qiyuan Peng, Changhao Jiang, Jingqi Tong, Yilong Wu, Zhihao Zhang, Mingqi Wu, Zhiheng Xi, Mingxu Chai, Tao Liang, Zhihui Fei, Zhen Wang, Mingyang Wan, Guojun Ma, Tao Gui, Qi Zhang, Xuanjing Huang
Main category: cs.CL
TL;DR: LLMEval-3 is a dynamic evaluation framework for LLMs, addressing data contamination and overfitting by using unseen test sets and automated integrity checks. It reveals performance ceilings and contamination issues, offering robust evaluation beyond static benchmarks.
Details
Motivation: Static benchmarks for LLMs are prone to data contamination and leaderboard overfitting, masking true model capabilities.Method: LLMEval-3 dynamically samples from 220k graduate-level questions, employs contamination-resistant curation, anti-cheating architecture, and a calibrated LLM-as-a-judge process.
Result: A 20-month study shows performance ceilings on memorization and exposes contamination vulnerabilities, with 90% agreement between LLM and human judgments.
Conclusion: LLMEval-3 provides a credible, robust methodology for evaluating LLMs, advancing trustworthy evaluation standards.
Abstract: Existing evaluation of Large Language Models (LLMs) on static benchmarks is vulnerable to data contamination and leaderboard overfitting, critical issues that obscure true model capabilities. To address this, we introduce LLMEval-3, a framework for dynamic evaluation of LLMs. LLMEval-3 is built on a proprietary bank of 220k graduate-level questions, from which it dynamically samples unseen test sets for each evaluation run. Its automated pipeline ensures integrity via contamination-resistant data curation, a novel anti-cheating architecture, and a calibrated LLM-as-a-judge process achieving 90% agreement with human experts, complemented by a relative ranking system for fair comparison. A 20-month longitudinal study of nearly 50 leading models reveals a performance ceiling on knowledge memorization and exposes data contamination vulnerabilities undetectable by static benchmarks. The framework demonstrates exceptional robustness in ranking stability and consistency, providing strong empirical validation for the dynamic evaluation paradigm. LLMEval-3 offers a robust and credible methodology for assessing the true capabilities of LLMs beyond leaderboard scores, promoting the development of more trustworthy evaluation standards.
[88] LLM Unlearning Without an Expert Curated Dataset
Xiaoyuan Zhu, Muru Zhang, Ollie Liu, Robin Jia, Willie Neiswanger
Main category: cs.CL
TL;DR: The paper introduces an automated method to generate high-quality forget sets for post-hoc unlearning in large language models, using synthetic textbook-style data created by the models themselves.
Details
Motivation: The need to remove sensitive, harmful, or copyrighted knowledge from large language models without full retraining drives the development of scalable unlearning solutions.Method: A structured prompting pipeline synthesizes textbook-style data requiring only a domain name as input.
Result: Synthetic datasets outperform baseline alternatives and match expert-curated ones in unlearning biosecurity, cybersecurity, and Harry Potter domains.
Conclusion: Synthetic datasets provide a scalable, practical solution for unlearning in emerging domains without manual intervention.
Abstract: Modern large language models often encode sensitive, harmful, or copyrighted knowledge, raising the need for post-hoc unlearning: the ability to remove specific domains of knowledge from a model without full retraining. A major bottleneck in current unlearning pipelines is constructing effective forget sets: datasets that approximate the target domain and guide the model to forget it. In this work, we introduce a scalable, automated approach to generate high-quality forget sets using language models themselves. Our method synthesizes textbook-style data through a structured prompting pipeline, requiring only a domain name as input. Through experiments on unlearning biosecurity, cybersecurity, and Harry Potter novels, we show that our synthetic datasets consistently outperform the baseline synthetic alternatives and are comparable to the expert-curated ones. Additionally, ablation studies reveal that the multi-step generation pipeline significantly boosts data diversity, which in turn improves unlearning utility. Overall, our findings suggest that synthetic datasets offer a promising path toward practical, scalable unlearning for a wide range of emerging domains without the need for manual intervention. We release our code and dataset at https://github.com/xyzhu123/Synthetic_Textbook.
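The structured prompting pipeline could be as small as the sketch below, which takes only a domain name and expands it into textbook-style passages; the two-step outline-then-write structure mirrors the abstract's description, while the prompts, counts, and `llm` callable are invented for illustration.

```python
def synthesize_forget_set(llm, domain, n_chapters=5, n_passages=4):
    """Generate a candidate forget set approximating the target domain."""
    outline = llm(f"List {n_chapters} textbook chapter titles covering "
                  f"{domain}, one per line.")
    chapters = [c.strip() for c in outline.splitlines() if c.strip()]
    passages = []
    for chapter in chapters[:n_chapters]:
        for i in range(n_passages):
            # Varying the passage index nudges the model toward diverse text.
            passages.append(llm(f"Write textbook passage {i + 1} for the "
                                f"chapter '{chapter}' in the domain of {domain}."))
    return passages
```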
[89] Do Biased Models Have Biased Thoughts?
Swati Rajwal, Shivank Garg, Reem Abdel-Salam, Abdelrahman Zayed
Main category: cs.CL
TL;DR: The paper investigates whether biased language models exhibit biased thought processes using chain-of-thought prompting, finding low correlation between thought and output biases.
Details
Motivation: To understand if biased language models inherently have biased internal reasoning, addressing fairness concerns in deployment.Method: Experiments on 5 large language models using fairness metrics to quantify 11 biases in thoughts and outputs.
Result: Low correlation (less than 0.6) between thought and output biases, with p-values < 0.001 in most cases.
Conclusion: Biased models do not necessarily have biased thoughts, unlike humans, highlighting a distinction in model reasoning.
Abstract: The impressive performance of language models is undeniable. However, the presence of biases based on gender, race, socio-economic status, physical appearance, and sexual orientation makes the deployment of language models challenging. This paper studies the effect of chain-of-thought prompting, a recent approach that examines the steps followed by the model before it responds, on fairness. More specifically, we ask the following question: $\textit{Do biased models have biased thoughts}$? To answer our question, we conduct experiments on $5$ popular large language models using fairness metrics to quantify $11$ different biases in the model’s thoughts and output. Our results show that the bias in the thinking steps is not highly correlated with the output bias (less than $0.6$ correlation with a $p$-value smaller than $0.001$ in most cases). In other words, unlike human beings, the tested models with biased decisions do not always possess biased thoughts.
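Conceptually, the headline statistic reduces to a Pearson correlation between a bias score computed on chain-of-thought text and the same score on final answers; `bias_score` below stands in for any of the eleven fairness metrics the paper uses.

```python
from scipy.stats import pearsonr

def thought_output_correlation(thoughts, outputs, bias_score):
    """Correlate bias measured on reasoning traces vs. on final answers."""
    r, p_value = pearsonr([bias_score(t) for t in thoughts],
                          [bias_score(o) for o in outputs])
    return r, p_value  # the paper reports r < 0.6, p < 0.001 in most cases
```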
[90] Grounding Multilingual Multimodal LLMs With Cultural Knowledge
Jean de Dieu Nyandwi, Yueqi Song, Simran Khanuja, Graham Neubig
Main category: cs.CL
TL;DR: The paper proposes CulturalGround, a dataset to address cultural gaps in Multimodal Large Language Models (MLLMs), and introduces CulturalPangea, an open-source MLLM trained on this dataset, achieving top performance on culture-focused benchmarks.
Details
Motivation: MLLMs often misinterpret long-tail cultural entities and underperform in low-resource languages, highlighting a need for culturally inclusive solutions.Method: A data-centric approach using Wikidata to create CulturalGround, a dataset of 22M culturally-rich VQA pairs, and training CulturalPangea with this data.
Result: CulturalPangea outperforms prior models by 5.0% on culture-focused benchmarks without degrading mainstream task performance.
Conclusion: The culturally grounded approach effectively narrows the cultural gap in MLLMs, advancing globally inclusive multimodal systems.
Abstract: Multimodal Large Language Models excel in high-resource settings, but often misinterpret long-tail cultural entities and underperform in low-resource languages. To address this gap, we propose a data-centric approach that directly grounds MLLMs in cultural knowledge. Leveraging a large scale knowledge graph from Wikidata, we collect images that represent culturally significant entities, and generate synthetic multilingual visual question answering data. The resulting dataset, CulturalGround, comprises 22 million high-quality, culturally-rich VQA pairs spanning 42 countries and 39 languages. We train an open-source MLLM CulturalPangea on CulturalGround, interleaving standard multilingual instruction-tuning data to preserve general abilities. CulturalPangea achieves state-of-the-art performance among open models on various culture-focused multilingual multimodal benchmarks, outperforming prior models by an average of 5.0 without degrading results on mainstream vision-language tasks. Our findings show that our targeted, culturally grounded approach could substantially narrow the cultural gap in MLLMs and offer a practical path towards globally inclusive multimodal systems.
[91] REX-RAG: Reasoning Exploration with Policy Correction in Retrieval-Augmented Generation
Wentao Jiang, Xiang Feng, Zengmao Wang, Yong Luo, Pingbo Xu, Zhe Chen, Bo Du, Jing Zhang
Main category: cs.CL
TL;DR: REX-RAG enhances LLMs’ reasoning by combining RL with RAG, addressing dead-end reasoning paths through mixed sampling and policy correction, achieving significant performance gains.
Details
Motivation: To tackle the issue of LLMs getting stuck in unproductive reasoning paths (dead ends) during policy-driven trajectory sampling, which hinders exploration and policy optimization.Method: Proposes REX-RAG with a Mixed Sampling Strategy (probe sampling + exploratory prompts) and a Policy Correction Mechanism (importance sampling) to correct distribution shifts.
Result: Achieves average performance gains of 5.1% on Qwen2.5-3B and 3.6% on Qwen2.5-7B over baselines across seven benchmarks.
Conclusion: REX-RAG effectively mitigates dead-end reasoning and improves LLM performance, demonstrating its potential for robust reasoning tasks.
Abstract: Reinforcement learning (RL) is emerging as a powerful paradigm for enabling large language models (LLMs) to perform complex reasoning tasks. Recent advances indicate that integrating RL with retrieval-augmented generation (RAG) allows LLMs to dynamically incorporate external knowledge, leading to more informed and robust decision making. However, we identify a critical challenge during policy-driven trajectory sampling: LLMs are frequently trapped in unproductive reasoning paths, which we refer to as “dead ends”, committing to overconfident yet incorrect conclusions. This severely hampers exploration and undermines effective policy optimization. To address this challenge, we propose REX-RAG (Reasoning Exploration with Policy Correction in Retrieval-Augmented Generation), a novel framework that explores alternative reasoning paths while maintaining rigorous policy learning through principled distributional corrections. Our approach introduces two key innovations: (1) Mixed Sampling Strategy, which combines a novel probe sampling method with exploratory prompts to escape dead ends; and (2) Policy Correction Mechanism, which employs importance sampling to correct distribution shifts induced by mixed sampling, thereby mitigating gradient estimation bias. We evaluate it on seven question-answering benchmarks, and the experimental results show that REX-RAG achieves average performance gains of 5.1% on Qwen2.5-3B and 3.6% on Qwen2.5-7B over strong baselines, demonstrating competitive results across multiple datasets. The code is publicly available at https://github.com/MiliLab/REX-RAG.
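The Policy Correction Mechanism boils down to standard importance weighting of trajectories drawn from the mixed sampler; a minimal sketch follows, where the clipping constant is an assumed variance-control choice rather than a value from the paper.

```python
import torch

def importance_weights(logp_policy, logp_proposal, clip=10.0):
    """Reweight trajectories from a mixed proposal back toward the policy.

    logp_policy / logp_proposal: (batch,) log-probabilities of each sampled
    trajectory under the current policy and under the mixed sampler.
    """
    w = torch.exp(logp_policy - logp_proposal)
    return torch.clamp(w, max=clip)  # clipping tames gradient-estimate variance
```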
[92] Jinx: Unlimited LLMs for Probing Alignment Failures
Jiahao Zhao, Liwei Dong
Main category: cs.CL
TL;DR: Jinx is a helpful-only variant of open-weight LLMs designed to respond to all queries without refusals, aiding researchers in probing alignment failures and safety evaluation.
Details
Motivation: To provide the research community with an accessible tool for studying alignment failures and safety boundaries in language models, which are currently unavailable.Method: Develop Jinx, a variant of popular open-weight LLMs, to respond to all queries without safety filtering while retaining reasoning and instruction-following capabilities.
Result: Jinx serves as a tool for researchers to systematically evaluate alignment failures and safety boundaries in language models.
Conclusion: Jinx fills a gap by offering researchers a means to study and address alignment failures in language models.
Abstract: Unlimited, or so-called helpful-only language models are trained without safety alignment constraints and never refuse user queries. They are widely used by leading AI companies as internal tools for red teaming and alignment evaluation. For example, if a safety-aligned model produces harmful outputs similar to an unlimited model, this indicates alignment failures that require further attention. Despite their essential role in assessing alignment, such models are not available to the research community. We introduce Jinx, a helpful-only variant of popular open-weight LLMs. Jinx responds to all queries without refusals or safety filtering, while preserving the base model’s capabilities in reasoning and instruction following. It provides researchers with an accessible tool for probing alignment failures, evaluating safety boundaries, and systematically studying failure modes in language model safety.
cs.CV
[93] Evaluation of State-of-the-Art Deep Learning Techniques for Plant Disease and Pest Detection
Saptarshi Banerjee, Tausif Mallick, Amlan Chakroborty, Himadri Nath Saha, Nityananda T. Takur
Main category: cs.CV
TL;DR: The paper reviews AI-based techniques for detecting plant diseases and pests, highlighting their superiority over traditional methods, with vision transformers achieving over 99.3% accuracy.
Details
Motivation: To improve crop production and reduce economic losses by leveraging AI, ML, and DL for precise and efficient plant disease and pest detection.Method: Organizes detection techniques into five categories: hyperspectral imaging, non-visualization, visualization, modified deep learning architectures, and transformer models.
Result: Modern AI-based methods, especially vision transformers like HvT, outperform older techniques with accuracy exceeding 99.3%.
Conclusion: The study identifies system design challenges, proposes solutions, and suggests future research directions for advancing detection methods.
Abstract: Addressing plant diseases and pests is critical for enhancing crop production and preventing economic losses. Recent advances in artificial intelligence (AI), machine learning (ML), and deep learning (DL) have significantly improved the precision and efficiency of detection methods, surpassing the limitations of manual identification. This study reviews modern computer-based techniques for detecting plant diseases and pests from images, including recent AI developments. The methodologies are organized into five categories: hyperspectral imaging, non-visualization techniques, visualization approaches, modified deep learning architectures, and transformer models. This structured taxonomy provides researchers with detailed, actionable insights for selecting advanced state-of-the-art detection methods. A comprehensive survey of recent work and comparative studies demonstrates the consistent superiority of modern AI-based approaches, which often outperform older image analysis methods in speed and accuracy. In particular, vision transformers such as the Hierarchical Vision Transformer (HvT) have shown accuracy exceeding 99.3% in plant disease detection, outperforming architectures like MobileNetV3. The study concludes by discussing system design challenges, proposing solutions, and outlining promising directions for future research.
[94] ImageDDI: Image-enhanced Molecular Motif Sequence Representation for Drug-Drug Interaction Prediction
Yuqin He, Tengfei Ma, Chaoyi Li, Pengsen Ma, Hongxin Xiang, Jianmin Wang, Yiping Liu, Bosheng Song, Xiangxiang Zeng
Main category: cs.CV
TL;DR: ImageDDI proposes a framework combining global and local molecular structures for drug-drug interaction prediction, outperforming existing methods.
Details
Motivation: Existing methods for DDI prediction are limited by motif-based representation learning, missing global structural insights.Method: ImageDDI tokenizes molecules into motifs, uses a transformer-based encoder, and integrates global molecular image features via Adaptive Feature Fusion.
Result: ImageDDI outperforms state-of-the-art methods and shows competitive performance in 2D and 3D scenarios.
Conclusion: ImageDDI effectively combines local and global molecular features for superior DDI prediction.
Abstract: To mitigate the potential adverse health effects of simultaneous multi-drug use, including unexpected side effects and interactions, accurately identifying and predicting drug-drug interactions (DDIs) is considered a crucial task in the field of deep learning. Although existing methods have demonstrated promising performance, they suffer from the bottleneck of limited functional motif-based representation learning, as DDIs are fundamentally caused by motif interactions rather than the overall drug structures. In this paper, we propose an Image-enhanced molecular motif sequence representation framework for \textbf{DDI} prediction, called ImageDDI, which represents a pair of drugs from both global and local structures. Specifically, ImageDDI tokenizes molecules into functional motifs. To effectively represent a drug pair, their motifs are combined into a single sequence and embedded using a transformer-based encoder, starting from the local structure representation. By leveraging the associations between drug pairs, ImageDDI further enhances the spatial representation of molecules using global molecular image information (e.g. texture, shadow, color, and planar spatial relationships). To integrate molecular visual information into functional motif sequence, ImageDDI employs Adaptive Feature Fusion, enhancing the generalization of ImageDDI by dynamically adapting the fusion process of feature representations. Experimental results on widely used datasets demonstrate that ImageDDI outperforms state-of-the-art methods. Moreover, extensive experiments show that ImageDDI achieved competitive performance in both 2D and 3D image-enhanced scenarios compared to other models.
[95] Learning Generalizable and Efficient Image Watermarking via Hierarchical Two-Stage Optimization
Ke Liu, Xuanhan Wang, Qilong Zhang, Lianli Gao, Jingkuan Song
Main category: cs.CV
TL;DR: HiWL is a two-stage deep watermarking method that improves invisibility, robustness, and broad applicability, outperforming existing methods with higher accuracy and low latency.
Details
Motivation: Existing watermarking methods struggle to simultaneously meet invisibility, robustness, and broad applicability criteria, limiting their effectiveness for copyright protection.Method: HiWL uses a two-stage approach: 1) distribution alignment learning for visual consistency and information invariance, and 2) generalized watermark representation learning to disentangle watermarks from image content.
Result: HiWL achieves 7.6% higher watermark extraction accuracy and processes 100K images in 8s, demonstrating superior performance.
Conclusion: HiWL effectively addresses the limitations of existing methods, offering a generalizable solution for deep image watermarking.
Abstract: Deep image watermarking, which enables imperceptible watermark embedding and reliable extraction in cover images, has been shown to be effective for copyright protection of image assets. However, existing methods face limitations in simultaneously satisfying three essential criteria for generalizable watermarking: 1) invisibility (imperceptible hiding of watermarks), 2) robustness (reliable watermark recovery under diverse conditions), and 3) broad applicability (low latency in the watermarking process). To address these limitations, we propose Hierarchical Watermark Learning (HiWL), a two-stage optimization that enables a watermarking model to achieve all three criteria simultaneously. In the first stage, distribution alignment learning is designed to establish a common latent space with two constraints: 1) visual consistency between watermarked and non-watermarked images, and 2) information invariance across watermark latent representations. In this way, multi-modal inputs including watermark message (binary codes) and cover images (RGB pixels) can be well represented, thereby ensuring the invisibility of watermarks and the robustness of the watermarking process. The second stage employs generalized watermark representation learning to establish a disentanglement policy for separating watermarks from image content in RGB space. In particular, it strongly penalizes substantial fluctuations in separated RGB watermarks corresponding to identical messages. Consequently, HiWL effectively learns generalizable latent-space watermark representations while maintaining broad applicability. Extensive experiments demonstrate the effectiveness of the proposed method. In particular, it achieves 7.6% higher accuracy in watermark extraction than existing methods, while maintaining extremely low latency (100K images processed in 8s).
[96] Designing Object Detection Models for TinyML: Foundations, Comparative Analysis, Challenges, and Emerging Solutions
Christophe EL Zeinaty, Wassim Hamidouche, Glenn Herrou, Daniel Menard
Main category: cs.CV
TL;DR: A survey on optimizing object detection models for TinyML, addressing gaps in existing literature by detailing techniques like quantization, pruning, and neural architecture search, and comparing KPIs of microcontroller-based implementations.
Details
Motivation: The rapid growth of IoT devices and their limited computational resources necessitate efficient object detection solutions, which existing surveys often overlook.Method: The paper reviews optimization techniques (quantization, pruning, knowledge distillation, neural architecture search) and evaluates their practical implementations on microcontroller devices.
Result: Comparison of KPIs shows the maturity of current solutions in accuracy and efficiency, with a public repository for ongoing updates.
Conclusion: The survey bridges the gap between theory and practice, providing a comprehensive resource for deploying optimized object detection models in TinyML environments.
Abstract: Object detection (OD) has become vital for numerous computer vision applications, but deploying it on resource-constrained IoT devices presents a significant challenge. These devices, often powered by energy-efficient microcontrollers, struggle to handle the computational load of deep learning-based OD models. This issue is compounded by the rapid proliferation of IoT devices, predicted to surpass 150 billion by 2030. TinyML offers a compelling solution by enabling OD on ultra-low-power devices, paving the way for efficient and real-time processing at the edge. Although numerous survey papers have been published on this topic, they often overlook the optimization challenges associated with deploying OD models in TinyML environments. To address this gap, this survey paper provides a detailed analysis of key optimization techniques for deploying OD models on resource-constrained devices. These techniques include quantization, pruning, knowledge distillation, and neural architecture search. Furthermore, we explore both theoretical approaches and practical implementations, bridging the gap between academic research and real-world edge artificial intelligence deployment. Finally, we compare the key performance indicators (KPIs) of existing OD implementations on microcontroller devices, highlighting the achieved maturity level of these solutions in terms of both prediction accuracy and efficiency. We also provide a public repository to continually track developments in this fast-evolving field: https://github.com/christophezei/Optimizing-Object-Detection-Models-for-TinyML-A-Comprehensive-Survey.
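Of the surveyed techniques, quantization is the most immediately reproducible: PyTorch's post-training dynamic quantization, for instance, converts Linear weights to int8 in a couple of lines. The toy detector head below is a placeholder, not a model from the survey.

```python
import torch
import torch.nn as nn

# Toy detector head standing in for a real object-detection model.
head = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 5))

# Post-training dynamic quantization: Linear weights are stored as int8,
# and activations are quantized on the fly at inference time.
quantized = torch.quantization.quantize_dynamic(head, {nn.Linear},
                                                dtype=torch.qint8)
print(quantized)  # Linear layers are replaced with dynamically quantized ones
```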
[97] Neural Tangent Knowledge Distillation for Optical Convolutional Networks
Jinlin Xiang, Minho Choi, Yubo Zhang, Zhihao Zhou, Arka Majumdar, Eli Shlizerman
Main category: cs.CV
TL;DR: A task-agnostic and hardware-agnostic pipeline improves Hybrid Optical Neural Networks (ONNs) by addressing accuracy gaps and hardware discrepancies, using Neural Tangent Knowledge Distillation (NTKD) for training and fine-tuning.
Details
Motivation: Hybrid ONNs are energy-efficient but face accuracy gaps and hardware discrepancies, limiting adoption. Existing solutions lack generalization across tasks and hardware.Method: Proposes a pipeline for image classification and segmentation, estimating model accuracy pre-training and using NTKD to align optical models with electronic teachers. Post-fabrication, NTKD fine-tunes the digital backend.
Result: Improves ONN performance across datasets (MNIST, CIFAR, Carvana Masking) and hardware configurations, enabling practical deployment.
Conclusion: The pipeline enhances ONN accuracy and generalization, making them viable for real-world applications.
Abstract: Hybrid Optical Neural Networks (ONNs, typically consisting of an optical frontend and a digital backend) offer an energy-efficient alternative to fully digital deep networks for real-time, power-constrained systems. However, their adoption is limited by two main challenges: the accuracy gap compared to large-scale networks during training, and discrepancies between simulated and fabricated systems that further degrade accuracy. While previous work has proposed end-to-end optimizations for specific datasets (e.g., MNIST) and optical systems, these approaches typically lack generalization across tasks and hardware designs. To address these limitations, we propose a task-agnostic and hardware-agnostic pipeline that supports image classification and segmentation across diverse optical systems. To assist optical system design before training, we estimate achievable model accuracy based on user-specified constraints such as physical size and the dataset. For training, we introduce Neural Tangent Knowledge Distillation (NTKD), which aligns optical models with electronic teacher networks, thereby narrowing the accuracy gap. After fabrication, NTKD also guides fine-tuning of the digital backend to compensate for implementation errors. Experiments on multiple datasets (e.g., MNIST, CIFAR, Carvana Masking) and hardware configurations show that our pipeline consistently improves ONN performance and enables practical deployment in both pre-fabrication simulations and physical implementations.
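The abstract does not spell out the NTKD objective itself, so the sketch below shows only the generic teacher-student logit alignment that such distillation builds on (Hinton-style KD with temperature); treat it as a placeholder for the tangent-kernel variant, with `T` and `lam` as assumed hyperparameters.

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, T=4.0, lam=0.7):
    """Blend softened teacher alignment with the ordinary hard-label loss."""
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                    F.softmax(teacher_logits / T, dim=-1),
                    reduction="batchmean") * (T * T)
    hard = F.cross_entropy(student_logits, labels)
    return lam * soft + (1.0 - lam) * hard
```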
[98] MAViS: A Multi-Agent Framework for Long-Sequence Video Storytelling
Qian Wang, Ziqi Huang, Ruoxi Jia, Paul Debevec, Ning Yu
Main category: cs.CV
TL;DR: MAViS is a multi-agent framework for long-sequence video storytelling, addressing limitations like poor assistive capability and suboptimal visual quality by orchestrating specialized agents across stages like script writing and video animation.
Details
Motivation: To overcome limitations in long-sequence video generation, such as poor assistive capability and limited expressiveness, by proposing a collaborative multi-agent framework.Method: MAViS uses specialized agents across stages (script writing, shot designing, etc.) under the 3E Principle (Explore, Examine, Enhance) and Script Writing Guidelines for compatibility.
Result: MAViS achieves state-of-the-art performance in assistive capability, visual quality, and expressiveness, producing high-quality videos with narratives and background music.
Conclusion: MAViS is a scalable, modular framework that enhances creativity and produces expressive long-sequence videos with multimodal outputs.
Abstract: Despite recent advances, long-sequence video generation frameworks still suffer from significant limitations: poor assistive capability, suboptimal visual quality, and limited expressiveness. To mitigate these limitations, we propose MAViS, an end-to-end multi-agent collaborative framework for long-sequence video storytelling. MAViS orchestrates specialized agents across multiple stages, including script writing, shot designing, character modeling, keyframe generation, video animation, and audio generation. In each stage, agents operate under the 3E Principle – Explore, Examine, and Enhance – to ensure the completeness of intermediate outputs. Considering the capability limitations of current generative models, we propose the Script Writing Guidelines to optimize compatibility between scripts and generative tools. Experimental results demonstrate that MAViS achieves state-of-the-art performance in assistive capability, visual quality, and video expressiveness. Its modular framework further enables scalability with diverse generative models and tools. With just a brief user prompt, MAViS is capable of producing high-quality, expressive long-sequence video storytelling, enriching inspirations and creativity for users. To the best of our knowledge, MAViS is the only framework that provides multimodal design output – videos with narratives and background music.
[99] MuGa-VTON: Multi-Garment Virtual Try-On via Diffusion Transformers with Prompt Customization
Ankan Deria, Dwarikanath Mahapatra, Behzad Bozorgtabar, Mohna Chakraborty, Snehashis Chakraborty, Sudipta Roy
Main category: cs.CV
TL;DR: MuGa-VTON is a unified multi-garment diffusion framework for virtual try-on, preserving identity and garment fidelity better than existing methods.
Details
Motivation: Existing virtual try-on methods struggle with preserving personal identity and garment fidelity, especially for multi-garment scenarios.Method: MuGa-VTON uses three modules: GRM for garment semantics, PRM for identity/pose, and A-DiT for feature fusion via diffusion transformer.
Result: Outperforms existing methods on VITON-HD and DressCode benchmarks, producing high-fidelity, identity-preserving results.
Conclusion: MuGa-VTON is a practical solution for real-world virtual try-on, offering prompt-based customization and improved realism.
Abstract: Virtual try-on seeks to generate photorealistic images of individuals in desired garments, a task that must simultaneously preserve personal identity and garment fidelity for practical use in fashion retail and personalization. However, existing methods typically handle upper and lower garments separately, rely on heavy preprocessing, and often fail to preserve person-specific cues such as tattoos, accessories, and body shape, resulting in limited realism and flexibility. To this end, we introduce MuGa-VTON, a unified multi-garment diffusion framework that jointly models upper and lower garments together with person identity in a shared latent space. Specifically, we proposed three key modules: the Garment Representation Module (GRM) for capturing both garment semantics, the Person Representation Module (PRM) for encoding identity and pose cues, and the A-DiT fusion module, which integrates garment, person, and text-prompt features through a diffusion transformer. This architecture supports prompt-based customization, allowing fine-grained garment modifications with minimal user input. Extensive experiments on the VITON-HD and DressCode benchmarks demonstrate that MuGa-VTON outperforms existing methods in both qualitative and quantitative evaluations, producing high-fidelity, identity-preserving results suitable for real-world virtual try-on applications.
[100] DreamStory: Open-Domain Story Visualization by LLM-Guided Multi-Subject Consistent Diffusion
Huiguo He, Huan Yang, Zixi Tuo, Yuan Zhou, Qiuyue Wang, Yuhang Zhang, Zeyu Liu, Wenhao Huang, Hongyang Chao, Jian Yin
Main category: cs.CV
TL;DR: DreamStory is a framework for story visualization using LLMs and a multi-subject consistent diffusion model to generate coherent, subject-consistent images from textual narratives.
Details
Motivation: Existing methods struggle with creating coherent, subject-consistent sequences of images from stories. DreamStory aims to address this gap.Method: DreamStory uses an LLM to generate descriptive prompts and a Multi-Subject consistent Diffusion model (MSD) with MMSA and MMCA modules for consistent multi-subject generation.
Result: The framework performs well in subjective and objective evaluations, validated by the DS-500 benchmark.
Conclusion: DreamStory effectively improves story visualization by ensuring subject consistency and coherence, supported by experimental results.
Abstract: Story visualization aims to create visually compelling images or videos corresponding to textual narratives. Despite recent advances in diffusion models yielding promising results, existing methods still struggle to create a coherent sequence of subject-consistent frames based solely on a story. To this end, we propose DreamStory, an automatic open-domain story visualization framework by leveraging the LLMs and a novel multi-subject consistent diffusion model. DreamStory consists of (1) an LLM acting as a story director and (2) an innovative Multi-Subject consistent Diffusion model (MSD) for generating consistent multi-subject across the images. First, DreamStory employs the LLM to generate descriptive prompts for subjects and scenes aligned with the story, annotating each scene’s subjects for subsequent subject-consistent generation. Second, DreamStory utilizes these detailed subject descriptions to create portraits of the subjects, with these portraits and their corresponding textual information serving as multimodal anchors (guidance). Finally, the MSD uses these multimodal anchors to generate story scenes with consistent multi-subject. Specifically, the MSD includes Masked Mutual Self-Attention (MMSA) and Masked Mutual Cross-Attention (MMCA) modules. MMSA and MMCA modules ensure appearance and semantic consistency with reference images and text, respectively. Both modules employ masking mechanisms to prevent subject blending. To validate our approach and promote progress in story visualization, we established a benchmark, DS-500, which can assess the overall performance of the story visualization framework, subject-identification accuracy, and the consistency of the generation model. Extensive experiments validate the effectiveness of DreamStory in both subjective and objective evaluations. Please visit our project homepage at https://dream-xyz.github.io/dreamstory.
[101] RealisMotion: Decomposed Human Motion Control and Video Generation in the World Space
Jingyun Liang, Jingkai Zhou, Shikai Li, Chenjie Cao, Lei Sun, Yichen Qian, Weihua Chen, Fan Wang
Main category: cs.CV
TL;DR: A framework for generating human videos with separate control over foreground, background, trajectory, and action, using 3D motion editing and text-to-video diffusion models.
Details
Motivation: Existing video generation methods lack separate control over key elements like foreground, background, trajectory, and action, limiting flexibility.
Method: Proposes a decomposed framework: motion editing in 3D space, trajectory control via 2D-to-3D unprojection, action generation via motion bank or text-to-motion, and video synthesis using text-to-video diffusion models.
Result: Achieves state-of-the-art performance in controllability and video quality on benchmarks and real-world cases.
Conclusion: The method enables flexible, realistic video generation with precise control over all key elements.
Abstract: Generating human videos with realistic and controllable motions is a challenging task. While existing methods can generate visually compelling videos, they lack separate control over four key video elements: foreground subject, background video, human trajectory and action patterns. In this paper, we propose a decomposed human motion control and video generation framework that explicitly decouples motion from appearance, subject from background, and action from trajectory, enabling flexible mix-and-match composition of these elements. Concretely, we first build a ground-aware 3D world coordinate system and perform motion editing directly in the 3D space. Trajectory control is implemented by unprojecting edited 2D trajectories into 3D with focal-length calibration and coordinate transformation, followed by speed alignment and orientation adjustment; actions are supplied by a motion bank or generated via text-to-motion methods. Then, based on modern text-to-video diffusion transformer models, we inject the subject as tokens for full attention, concatenate the background along the channel dimension, and add motion (trajectory and action) control signals by addition. Such a design opens up the possibility for us to generate realistic videos of anyone doing anything anywhere. Extensive experiments on benchmark datasets and real-world cases demonstrate that our method achieves state-of-the-art performance on both element-wise controllability and overall video quality.
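The trajectory-control step above rests on standard pinhole-camera geometry. As a rough illustration, the sketch below lifts an edited 2D trajectory into 3D under an assumed pinhole model with known focal length and principal point; the paper's exact calibration and ground-plane handling may differ, and the per-point depths are taken as given here.

```python
import numpy as np

def unproject_trajectory(uv: np.ndarray, depth: np.ndarray,
                         f: float, cx: float, cy: float) -> np.ndarray:
    """Lift (N, 2) pixel coordinates with per-point depths (N,) into (N, 3)
    camera-space points using a pinhole model with focal length f and
    principal point (cx, cy)."""
    x = (uv[:, 0] - cx) * depth / f   # X = (u - cx) * Z / f
    y = (uv[:, 1] - cy) * depth / f   # Y = (v - cy) * Z / f
    return np.stack([x, y, depth], axis=1)
```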
[102] CObL: Toward Zero-Shot Ordinal Layering without User Prompting
Aneel Damaraju, Dean Hazineh, Todd Zickler
Main category: cs.CV
TL;DR: CObL is a diffusion-based architecture that generates an occlusion-ordered stack of object layers from images, generalizing to real-world scenes without prior knowledge of object count.
Details
Motivation: To improve vision by grouping pixels into objects and understanding their spatial relationships, including depth and occlusion.
Method: Uses a diffusion-based architecture (CObL) with Stable Diffusion as a prior and inference-time guidance to ensure layers composite back to the input image. Trained on synthetic multi-object tabletop scenes.
Result: Zero-shot generalization to real-world tabletop photos with novel objects, reconstructing multiple occluded objects without user input or prior knowledge of object count.
Conclusion: CObL advances unsupervised object-centric representation learning by handling real-world scenes beyond its training domain.
Abstract: Vision benefits from grouping pixels into objects and understanding their spatial relationships, both laterally and in depth. We capture this with a scene representation comprising an occlusion-ordered stack of “object layers,” each containing an isolated and amodally-completed object. To infer this representation from an image, we introduce a diffusion-based architecture named Concurrent Object Layers (CObL). CObL generates a stack of object layers in parallel, using Stable Diffusion as a prior for natural objects and inference-time guidance to ensure the inferred layers composite back to the input image. We train CObL using a few thousand synthetically-generated images of multi-object tabletop scenes, and we find that it zero-shot generalizes to photographs of real-world tabletops with varying numbers of novel objects. In contrast to recent models for amodal object completion, CObL reconstructs multiple occluded objects without user prompting and without knowing the number of objects beforehand. Unlike previous models for unsupervised object-centric representation learning, CObL is not limited to the world it was trained in.
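The inference-time guidance can be pictured as a reconstruction term: alpha-composite the predicted object layers back-to-front and penalize deviation from the input image. The sketch below assumes simple "over" compositing of RGBA layers; CObL's actual guidance operator is not spelled out at this level of detail.

```python
import torch
import torch.nn.functional as F

def composite_guidance_loss(layers_rgba: torch.Tensor, image: torch.Tensor) -> torch.Tensor:
    """Alpha-composite (L, 4, H, W) occlusion-ordered RGBA layers back-to-front
    (index 0 = rearmost) and compare against the (3, H, W) input image."""
    canvas = torch.zeros_like(image)
    for layer in layers_rgba:
        rgb, alpha = layer[:3], layer[3:4].clamp(0, 1)
        canvas = alpha * rgb + (1 - alpha) * canvas   # standard "over" operator
    return F.mse_loss(canvas, image)
```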
[103] 3DFacePolicy: Audio-Driven 3D Facial Animation Based on Action Control
Xuanmeng Sha, Liyun Zhang, Tomohiro Mashita, Naoya Chiba, Yuki Uranishi
Main category: cs.CV
TL;DR: 3DFacePolicy introduces an action-based control paradigm for 3D facial animation, outperforming frame-by-frame methods with smoother, more natural results.
Details
Motivation: Frame-by-frame vertex generation in current methods leads to unnatural and discontinuous facial movements.
Method: Proposes 3DFacePolicy, using a robotic control mechanism (diffusion policy) to predict action sequences for vertices, conditioned on audio and vertex states.
Result: Outperforms state-of-the-art methods on VOCASET and BIWI datasets, excelling in dynamic, expressive, and smooth animations.
Conclusion: 3DFacePolicy’s action-based approach significantly improves the naturalness and continuity of audio-driven 3D facial animation.
Abstract: Audio-driven 3D facial animation has achieved significant progress in both research and applications. While recent baselines struggle to generate natural and continuous facial movements due to their frame-by-frame vertex generation approach, we propose 3DFacePolicy, a pioneering work that introduces a novel definition of vertex trajectory changes across consecutive frames through the concept of “action”. By predicting action sequences for each vertex that encode frame-to-frame movements, we reformulate the vertex generation approach into an action-based control paradigm. Specifically, we leverage a robotic control mechanism, diffusion policy, to predict action sequences conditioned on both audio and vertex states. Extensive experiments on VOCASET and BIWI datasets demonstrate that our approach significantly outperforms state-of-the-art methods and excels particularly at dynamic, expressive, and naturally smooth facial animations.
[104] Re:Verse – Can Your VLM Read a Manga?
Aaditya Baranwal, Madhav Kataria, Naitik Agrawal, Yogesh S Rawat, Shruti Vyas
Main category: cs.CV
TL;DR: The paper highlights a gap in Vision Language Models (VLMs) for deep narrative reasoning in sequential visual storytelling, introduces a novel evaluation framework, and reveals limitations in current models’ story-level intelligence.
Details
Motivation: To address the gap in VLMs' ability to understand temporal causality and cross-panel cohesion in sequential visual narratives like manga.
Method: A novel evaluation framework combining fine-grained multimodal annotation, cross-modal embedding analysis, and retrieval-augmented assessment, applied to Re:Zero manga with 308 annotated panels.
Result: Current VLMs fail at genuine story-level intelligence, struggling with non-linear narratives, character consistency, and causal inference.
Conclusion: The work provides a foundation for evaluating narrative intelligence in VLMs and insights into improving deep sequential understanding in multimodal models.
Abstract: Current Vision Language Models (VLMs) demonstrate a critical gap between surface-level recognition and deep narrative reasoning when processing sequential visual storytelling. Through a comprehensive investigation of manga narrative understanding, we reveal that while recent large multimodal models excel at individual panel interpretation, they systematically fail at temporal causality and cross-panel cohesion, core requirements for coherent story comprehension. We introduce a novel evaluation framework that combines fine-grained multimodal annotation, cross-modal embedding analysis, and retrieval-augmented assessment to systematically characterize these limitations. Our methodology includes (i) a rigorous annotation protocol linking visual elements to narrative structure through aligned light novel text, (ii) comprehensive evaluation across multiple reasoning paradigms, including direct inference and retrieval-augmented generation, and (iii) cross-modal similarity analysis revealing fundamental misalignments in current VLMs’ joint representations. Applying this framework to Re:Zero manga across 11 chapters with 308 annotated panels, we conduct the first systematic study of long-form narrative understanding in VLMs through three core evaluation axes: generative storytelling, contextual dialogue grounding, and temporal reasoning. Our findings demonstrate that current models lack genuine story-level intelligence, struggling particularly with non-linear narratives, character consistency, and causal inference across extended sequences. This work establishes both the foundation and practical methodology for evaluating narrative intelligence, while providing actionable insights into the capability of deep sequential understanding of Discrete Visual Narratives beyond basic recognition in Multimodal Models.
[105] TIDE : Temporal-Aware Sparse Autoencoders for Interpretable Diffusion Transformers in Image Generation
Victor Shea-Jay Huang, Le Zhuo, Yi Xin, Zhaokai Wang, Fu-Yun Wang, Yuchi Wang, Renrui Zhang, Peng Gao, Hongsheng Li
Main category: cs.CV
TL;DR: TIDE enhances interpretability and controllability in Diffusion Transformers (DiTs) by extracting sparse, interpretable features across timesteps, revealing hierarchical semantics learned during pretraining.
Details
Motivation: DiTs are underexplored compared to U-Net-based diffusion models, and TIDE aims to improve their interpretability and controllability.
Method: TIDE uses temporal-aware sparse autoencoders to extract sparse, interpretable activation features across timesteps in DiTs.
Result: TIDE captures temporally-varying representations, showing DiTs learn hierarchical semantics (e.g., 3D structure, object class). It maintains generation quality while enabling applications like safe image editing.
Conclusion: TIDE successfully enhances interpretability and controllability in DiTs, revealing their hierarchical learning and enabling practical applications.
Abstract: Diffusion Transformers (DiTs) are a powerful yet underexplored class of generative models compared to U-Net-based diffusion architectures. We propose TIDE (Temporal-aware sparse autoencoders for Interpretable Diffusion transformErs), a framework designed to extract sparse, interpretable activation features across timesteps in DiTs. TIDE effectively captures temporally-varying representations and reveals that DiTs naturally learn hierarchical semantics (e.g., 3D structure, object class, and fine-grained concepts) during large-scale pretraining. Experiments show that TIDE enhances interpretability and controllability while maintaining reasonable generation quality, enabling applications such as safe image editing and style transfer.
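As a rough picture of what a temporal-aware sparse autoencoder might look like, the sketch below conditions a standard SAE on the diffusion timestep via a learned embedding; the conditioning mechanism and the L1 sparsity penalty are assumptions, since the abstract states only that TIDE extracts sparse features across timesteps.

```python
import torch
import torch.nn as nn

class TimestepSAE(nn.Module):
    """Sparse autoencoder over DiT activations, conditioned on the diffusion
    timestep through a learned embedding added to the input (an assumption)."""
    def __init__(self, d_act: int, d_dict: int, n_steps: int = 1000):
        super().__init__()
        self.t_embed = nn.Embedding(n_steps, d_act)
        self.enc = nn.Linear(d_act, d_dict)
        self.dec = nn.Linear(d_dict, d_act, bias=False)

    def forward(self, acts: torch.Tensor, t: torch.Tensor):
        z = torch.relu(self.enc(acts + self.t_embed(t)))             # sparse codes
        recon = self.dec(z)
        loss = (recon - acts).pow(2).mean() + 1e-3 * z.abs().mean()  # L2 recon + L1 sparsity
        return z, recon, loss
```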
[106] VISOR: Visual Input-based Steering for Output Redirection in Vision-Language Models
Mansi Phute, Ravikumar Balakrishnan
Main category: cs.CV
TL;DR: VISOR introduces a visual input-based method for controlling Vision Language Models (VLMs) without invasive model access, outperforming existing techniques like system prompting and activation-based steering vectors.
Details
Motivation: Existing methods for behavioral control in VLMs are either detectable, ineffective, or require invasive model access, limiting their practicality in API-based or closed-source deployments.
Method: VISOR uses optimized visual inputs (steering images) to induce target activation patterns, enabling robust behavioral control without altering model internals.
Result: VISOR matches or exceeds steering vector performance (1-25% shifts) and outperforms system prompting (3-4% shifts) while maintaining model accuracy on unrelated tasks (99.9%).
Conclusion: VISOR offers a practical and imperceptible method for VLM control, exposing a security vulnerability in visual-based attacks and calling for new defenses.
Abstract: Vision Language Models (VLMs) are increasingly being used in a broad range of applications, bringing their security and behavioral control to the forefront. While existing approaches for behavioral control or output redirection, like system prompting in VLMs, are easily detectable and often ineffective, activation-based steering vectors require invasive runtime access to model internals–incompatible with API-based services and closed-source deployments. We introduce VISOR (Visual Input-based Steering for Output Redirection), a novel method that achieves sophisticated behavioral control through optimized visual inputs alone. By crafting universal steering images that induce target activation patterns, VISOR enables practical deployment across all VLM serving modalities while remaining imperceptible compared to explicit textual instructions. We validate VISOR on LLaVA-1.5-7B across three critical alignment tasks: refusal, sycophancy and survival instinct. A single 150KB steering image matches steering vector performance within 1-2% for positive behavioral shifts while dramatically exceeding it for negative steering–achieving up to 25% shifts from baseline compared to steering vectors’ modest changes. Unlike system prompting (3-4% shifts), VISOR provides robust bidirectional control while maintaining 99.9% performance on 14,000 unrelated MMLU tasks. Beyond eliminating runtime overhead and model access requirements, VISOR exposes a critical security vulnerability: adversaries can achieve sophisticated behavioral manipulation through visual channels alone, bypassing text-based defenses. Our work fundamentally re-imagines multimodal model control and highlights the urgent need for defenses against visual steering attacks.
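Crafting a steering image reduces to optimizing pixels so that a chosen layer's activations match a target pattern. The sketch below assumes white-box access to one model during the crafting phase (deployment then needs only the image); `forward_with_activations` is a hypothetical interface, not the paper's API or any real library call.

```python
import torch
import torch.nn.functional as F

def craft_steering_image(model, target_acts, layer, steps=500, lr=0.01):
    """Optimize a universal steering image so that `layer`'s activations match
    a target pattern. `model.forward_with_activations(x, layer=...)` is a
    hypothetical hook-style interface assumed for this sketch."""
    img = torch.rand(1, 3, 336, 336, requires_grad=True)  # 336px: common CLIP-ViT input size
    opt = torch.optim.Adam([img], lr=lr)
    for _ in range(steps):
        acts = model.forward_with_activations(img.clamp(0, 1), layer=layer)
        loss = F.mse_loss(acts, target_acts)  # pull activations toward the target pattern
        opt.zero_grad()
        loss.backward()
        opt.step()
    return img.detach().clamp(0, 1)
```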
[107] Argus Inspection: Do Multimodal Large Language Models Possess the Eye of Panoptes?
Yang Yao, Lingyu Li, Jiaxin Song, Chiyu Chen, Zhenqi He, Yixu Wang, Xin Wang, Tianle Gu, Jie Li, Yan Teng, Yingchun Wang
Main category: cs.CV
TL;DR: The paper introduces Argus Inspection, a multimodal benchmark, and the Eye of Panoptes framework to evaluate MLLMs’ visual and causal reasoning, revealing significant room for improvement.
Details
Motivation: Address challenges in MLLMs' visual fine-grained perception and commonsense causal inference by creating a robust evaluation framework.
Method: Develop Argus Inspection (a multimodal benchmark) and the Eye of Panoptes framework (using a Sigmoid metric and indicator function) for holistic evaluation.
Result: Experiments on 26 MLLMs show the highest performance in visual fine-grained reasoning is only 0.46, indicating need for improvement.
Conclusion: The study provides insights for advancing MLLMs’ capabilities in visual and causal reasoning.
Abstract: As Multimodal Large Language Models (MLLMs) continue to evolve, their cognitive and reasoning capabilities have seen remarkable progress. However, challenges in visual fine-grained perception and commonsense causal inference persist. This paper introduces Argus Inspection, a multimodal benchmark with two levels of difficulty, emphasizing detailed visual recognition while incorporating real-world commonsense understanding to evaluate causal reasoning abilities. Expanding on it, we present the Eye of Panoptes framework, which integrates a binary parametric Sigmoid metric with an indicator function, enabling a more holistic evaluation of MLLMs’ responses in opinion-based reasoning tasks. Experiments conducted on 26 mainstream MLLMs reveal that the highest performance in visual fine-grained reasoning reaches only 0.46, highlighting considerable potential for enhancement. Our research offers valuable perspectives for the continued refinement of MLLMs.
[108] Training Kindai OCR with parallel textline images and self-attention feature distance-based loss
Anh Le, Asanobu Kitamoto
Main category: cs.CV
TL;DR: The paper proposes a method to improve OCR for historical Kindai documents by using parallel textline images and a distance-based objective function, reducing error rates significantly.
Details
Motivation: Transcribing Kindai documents is labor-intensive, and limited annotated data hinders OCR system training.
Method: Leverages parallel textline images (original and contemporary Japanese) with a distance-based objective function (Euclidean distance and MMD) to minimize feature gaps.
Result: Reduces character error rate by 2.23% (Euclidean) and 3.94% (MMD) over a baseline, improving OCR performance.
Conclusion: The approach effectively addresses data scarcity and enhances OCR accuracy for historical documents.
Abstract: Kindai documents, written in modern Japanese from the late 19th to early 20th century, hold significant historical value for researchers studying societal structures, daily life, and environmental conditions of that period. However, transcribing these documents remains a labor-intensive and time-consuming task, resulting in limited annotated data for training optical character recognition (OCR) systems. This research addresses this challenge of data scarcity by leveraging parallel textline images - pairs of original Kindai text and their counterparts in contemporary Japanese fonts - to augment training datasets. We introduce a distance-based objective function that minimizes the gap between self-attention features of the parallel image pairs. Specifically, we explore Euclidean distance and Maximum Mean Discrepancy (MMD) as domain adaptation metrics. Experimental results demonstrate that our method reduces the character error rate (CER) by 2.23% and 3.94% over a Transformer-based OCR baseline when using Euclidean distance and MMD, respectively. Furthermore, our approach improves the discriminative quality of self-attention representations, leading to more effective OCR performance for historical documents.
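Of the two feature-distance metrics, MMD is the less standard one; a minimal sketch with an RBF kernel follows. The kernel choice and single bandwidth are assumptions, as the abstract names MMD but not its kernel.

```python
import torch

def mmd_rbf(f_src: torch.Tensor, f_tgt: torch.Tensor, sigma: float = 1.0) -> torch.Tensor:
    """Biased MMD^2 estimate with an RBF kernel between two (N, D) feature
    batches, e.g. self-attention features of a Kindai textline and its
    contemporary-font counterpart."""
    def rbf(a, b):
        return torch.exp(-torch.cdist(a, b).pow(2) / (2 * sigma ** 2))
    # E[k(s,s)] + E[k(t,t)] - 2 E[k(s,t)]; biased estimator, fine for a sketch
    return rbf(f_src, f_src).mean() + rbf(f_tgt, f_tgt).mean() - 2 * rbf(f_src, f_tgt).mean()
```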
[109] Calibration Attention: Instance-wise Temperature Scaling for Vision Transformers
Wenhao Liang, Wei Emma Zhang, Lin Yue, Miao Xu, Olaf Maennel, Weitong Chen
Main category: cs.CV
TL;DR: Calibration Attention (CalAttn) improves probability calibration in Vision Transformers by learning per-instance temperatures, reducing errors significantly without sacrificing accuracy.
Details
Motivation: Probability calibration is crucial for risk-sensitive applications of Vision Transformers, but standard methods like temperature scaling are limited by global scaling and require validation sets.
Method: CalAttn learns adaptive, per-instance temperatures directly from the ViT’s CLS token, adding minimal parameters.
Result: CalAttn reduces calibration error by up to 4x on multiple datasets (CIFAR-10/100, MNIST, Tiny-ImageNet, ImageNet-1K) for ViT-224, DeiT, and Swin.
Conclusion: CalAttn is a simple, efficient, and architecture-agnostic solution for trustworthy probabilities in Vision Transformers.
Abstract: Probability calibration is critical when Vision Transformers are deployed in risk-sensitive applications. The standard fix, post-hoc temperature scaling, uses a single global scalar and requires a held-out validation set. We introduce Calibration Attention (CalAttn), a drop-in module that learns an adaptive, per-instance temperature directly from the ViT’s CLS token. Across CIFAR-10/100, MNIST, Tiny-ImageNet, and ImageNet-1K, CalAttn reduces calibration error by up to 4x on ViT-224, DeiT, and Swin, while adding under 0.1 percent additional parameters. The learned temperatures cluster tightly around 1.0, in contrast to the large global values used by standard temperature scaling. CalAttn is simple, efficient, and architecture-agnostic, and yields more trustworthy probabilities without sacrificing accuracy. Code: https://github.com/EagleAdelaide/CalibrationAttention-CalAttn-
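A minimal sketch of per-instance temperature scaling from the CLS token follows; the small MLP head and softplus positivity constraint are assumptions, sized to stay within the stated budget of under 0.1 percent extra parameters.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CalAttnHead(nn.Module):
    """Per-instance temperature from the CLS token; the 2-layer MLP and
    softplus output are assumptions about the exact head design."""
    def __init__(self, embed_dim: int, hidden: int = 32):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(embed_dim, hidden), nn.GELU(), nn.Linear(hidden, 1))

    def forward(self, cls_token: torch.Tensor, logits: torch.Tensor) -> torch.Tensor:
        t = F.softplus(self.mlp(cls_token)) + 1e-3  # (B, 1), strictly positive temperature
        return logits / t                           # broadcast over the class dimension

# usage: probs = CalAttnHead(768)(cls_token, logits).softmax(dim=-1)
```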
[110] Boosting Generic Semi-Supervised Medical Image Segmentation via Diverse Teaching and Label Propagation
Wei Li, Pengcheng Zhou, Linye Ma, Wenyi Zhao, Huihua Yang
Main category: cs.CV
TL;DR: A generic framework (DTLP-Net) is proposed to address semi-supervised medical image segmentation challenges, leveraging diverse teacher models and label propagation for reliable pseudo-labels and improved performance.
Details
Motivation: To overcome limitations of conventional methods in handling limited annotation and domain shift in medical image segmentation, aiming for a unified solution for SSMIS, Semi-MDG, and UMDA tasks.
Method: DTLP-Net uses a student model and two diverse teacher models for pseudo-label generation, coupled with data augmentation and label propagation to enhance robustness.
Result: Notable improvements over state-of-the-art methods on five benchmark datasets for SSMIS, UMDA, and Semi-MDG tasks.
Conclusion: The framework demonstrates potential for addressing challenging semi-supervised learning scenarios in medical image segmentation.
Abstract: Both limited annotation and domain shift are significant challenges frequently encountered in medical image segmentation, leading to derivative scenarios like semi-supervised medical image segmentation (SSMIS), semi-supervised medical domain generalization (Semi-MDG), and unsupervised medical domain adaptation (UMDA). Conventional methods are generally tailored to specific tasks in isolation; the resulting error accumulation hinders the effective utilization of unlabeled data and limits further improvements, yielding suboptimal performance when these issues occur. In this paper, we aim to develop a generic framework that masters all three tasks. We found that the key to solving the problem lies in how to generate reliable pseudo-labels for the unlabeled data in the presence of domain shift with labeled data, while increasing the diversity of the model. To tackle this issue, we employ a Diverse Teaching and Label Propagation Network (DTLP-Net) to boost generic semi-supervised medical image segmentation. Our DTLP-Net involves a single student model and two diverse teacher models, which can generate reliable pseudo-labels for the student model. The first teacher model decouples the training process between labeled and unlabeled data, while the second teacher is momentum-updated periodically, thus generating reliable yet diverse pseudo-labels. To fully utilize the information within the data, we adopt inter-sample and intra-sample data augmentation to learn global and local knowledge. In addition, to further capture voxel-level correlations, we propose label propagation to enhance model robustness. We evaluate our proposed framework on five benchmark datasets for SSMIS, UMDA, and Semi-MDG tasks. The results showcase notable improvements compared to state-of-the-art methods across all five settings, indicating the potential of our framework to tackle more challenging SSL scenarios.
[111] A Survey on All-in-One Image Restoration: Taxonomy, Evaluation and Future Trends
Junjun Jiang, Zengyuan Zuo, Gang Wu, Kui Jiang, Xianming Liu
Main category: cs.CV
TL;DR: The paper surveys the emerging All-in-One Image Restoration (AiOIR) paradigm, which unifies the handling of multiple image degradations, contrasting traditional methods that focus on single issues. It provides a taxonomy, evaluates challenges, and suggests future directions.
Details
Motivation: Traditional image restoration methods specialize in single degradation types, limiting their real-world applicability. AiOIR aims to address this by unifying restoration for multiple degradations.
Method: The survey systematically categorizes AiOIR methods by architectural designs, learning paradigms, and innovations, and evaluates them using consolidated datasets and protocols.
Result: The paper presents a structured taxonomy of AiOIR methods, highlights challenges, and compares advanced open-source models.
Conclusion: AiOIR represents a promising direction for unified image restoration, and the survey aims to guide future research toward more adaptable systems.
Abstract: Image restoration (IR) seeks to recover high-quality images from degraded observations caused by a wide range of factors, including noise, blur, compression, and adverse weather. While traditional IR methods have made notable progress by targeting individual degradation types, their specialization often comes at the cost of generalization, leaving them ill-equipped to handle the multifaceted distortions encountered in real-world applications. In response to this challenge, the all-in-one image restoration (AiOIR) paradigm has recently emerged, offering a unified framework that adeptly addresses multiple degradation types. These innovative models enhance the convenience and versatility by adaptively learning degradation-specific features while simultaneously leveraging shared knowledge across diverse corruptions. In this survey, we provide the first in-depth and systematic overview of AiOIR, delivering a structured taxonomy that categorizes existing methods by architectural designs, learning paradigms, and their core innovations. We systematically categorize current approaches and assess the challenges these models encounter, outlining research directions to propel this rapidly evolving field. To facilitate the evaluation of existing methods, we also consolidate widely-used datasets, evaluation protocols, and implementation practices, and compare and summarize the most advanced open-source models. As the first comprehensive review dedicated to AiOIR, this paper aims to map the conceptual landscape, synthesize prevailing techniques, and ignite further exploration toward more intelligent, unified, and adaptable visual restoration systems. A curated code repository is available at https://github.com/Harbinzzy/All-in-One-Image-Restoration-Survey.
[112] Unlocking the Potential of Diffusion Priors in Blind Face Restoration
Yunqi Miao, Zhiyu Qu, Mingqi Gao, Changrui Chen, Jifei Song, Jungong Han, Jiankang Deng
Main category: cs.CV
TL;DR: FLIPNET addresses gaps in blind face restoration (BFR) by switching between Restoration and Degradation modes, improving authenticity and fidelity.
Details
Motivation: The gap between vanilla diffusion models and BFR settings, due to discrepancies in image quality and real-world degradation, hinders effective adaptation.
Method: FLIPNET uses a unified network with two modes: Restoration for integrating BFR features and Degradation for synthesizing realistic degraded images.
Result: FLIPNET outperforms prior diffusion-based BFR methods in authenticity and fidelity, and better models real-world degradations.
Conclusion: FLIPNET effectively bridges the gap in BFR by leveraging dual modes, enhancing restoration and degradation synthesis.
Abstract: Although diffusion prior is rising as a powerful solution for blind face restoration (BFR), the inherent gap between the vanilla diffusion model and BFR settings hinders its seamless adaptation. The gap mainly stems from the discrepancy between 1) high-quality (HQ) and low-quality (LQ) images and 2) synthesized and real-world images. The vanilla diffusion model is trained on images with no or less degradations, whereas BFR handles moderately to severely degraded images. Additionally, LQ images used for training are synthesized by a naive degradation model with limited degradation patterns, which fails to simulate complex and unknown degradations in real-world scenarios. In this work, we use a unified network FLIPNET that switches between two modes to resolve specific gaps. In Restoration mode, the model gradually integrates BFR-oriented features and face embeddings from LQ images to achieve authentic and faithful face restoration. In Degradation mode, the model synthesizes real-world like degraded images based on the knowledge learned from real-world degradation datasets. Extensive evaluations on benchmark datasets show that our model 1) outperforms previous diffusion prior based BFR methods in terms of authenticity and fidelity, and 2) outperforms the naive degradation model in modeling the real-world degradations.
[113] Think as Cardiac Sonographers: Marrying SAM with Left Ventricular Indicators Measurements According to Clinical Guidelines
Tuo Liu, Qinghan Yang, Yu Zhang, Rongjun Ge, Yang Chen, Guangquan Zhou
Main category: cs.CV
TL;DR: AutoSAME combines SAM’s visual understanding with segmentation and landmark localization for LV measurements, introducing FCBA and SGPA to enhance accuracy.
Details
Motivation: Existing algorithms struggle with small datasets and lack anatomical point identification, necessitating a more robust solution.
Method: Proposes AutoSAME, integrating SAM for segmentation and landmark tasks, with FCBA for feature enhancement and SGPA for spatial-guided prompts.
Result: Demonstrates superior performance in LV segmentation, landmark localization, and indicator measurements on an echocardiography dataset.
Conclusion: AutoSAME effectively addresses limitations of existing methods, aligning with clinical guidelines for LV measurements.
Abstract: Left ventricular (LV) indicator measurements following clinical echocardiography guidelines are important for diagnosing cardiovascular disease. Although existing algorithms have explored automated LV quantification, they can struggle to capture generic visual representations due to the normally small training datasets. Therefore, it is necessary to introduce vision foundational models (VFM) with abundant knowledge. However, VFMs represented by the segment anything model (SAM) are usually suitable for segmentation but incapable of identifying key anatomical points, which are critical in LV indicator measurements. In this paper, we propose a novel framework named AutoSAME, combining the powerful visual understanding of SAM with segmentation and landmark localization tasks simultaneously. Consequently, the framework mimics the operation of cardiac sonographers, achieving LV indicator measurements consistent with clinical guidelines. We further present filtered cross-branch attention (FCBA) in AutoSAME, which leverages relatively comprehensive features in the segmentation to enhance the heatmap regression (HR) of key points from the frequency domain perspective, optimizing the visual representation learned by the latter. Moreover, we propose spatial-guided prompt alignment (SGPA) to automatically generate prompt embeddings guided by spatial properties of LV, thereby improving the accuracy of dense predictions by prior spatial knowledge. The extensive experiments on an echocardiography dataset demonstrate the efficiency of each design and the superiority of our AutoSAME in LV segmentation, landmark localization, and indicator measurements. The code will be available at https://github.com/QC-LIU-1997/AutoSAME.
[114] Enhancing Wide-Angle Image Using Narrow-Angle View of the Same Scene
Hussain Md. Safwan, Mahbub Islam Mahim
Main category: cs.CV
TL;DR: The paper proposes a GAN-based method to enhance wide-angle photos by transferring fine details from narrow-angle shots, using residual connections and attention-based fusion.
Details
Motivation: The challenge of balancing scene coverage and detail in photography motivates the development of a method to combine the benefits of wide and narrow-angle shots.
Method: A GAN model is trained to extract visual quality parameters from narrow-angle images and transfer them to wide-angle images using residual connections and an attention-based fusion module.
Result: The method is evaluated on benchmark datasets and compared with contemporary techniques, demonstrating improved detail transfer in wide-angle images.
Conclusion: The proposed technique effectively enhances wide-angle images by incorporating finer details from narrow-angle shots, offering a practical solution to the coverage-detail trade-off.
Abstract: A common dilemma while photographing a scene is whether to capture it at a wider angle, covering more of the scene in less detail, or at a narrower angle that captures finer details but leaves out portions of the scene. We propose a novel method in this paper that infuses wider shots with the finer details usually associated with an image captured by the primary lens, by capturing the same scene using both narrow and wide field of view (FoV) lenses. We do so by training a Generative Adversarial Network (GAN)-based model to learn to extract the visual quality parameters from a narrow-angle shot and to transfer these to the corresponding wide-angle image of the scene using residual connections and an attention-based fusion module. We describe in detail the proposed technique for isolating the visual essence of an image and transferring it into another image. We also discuss our implementation details and present the results of evaluation over several benchmark datasets and comparisons with contemporary advancements in the field.
[115] Superclass-Guided Representation Disentanglement for Spurious Correlation Mitigation
Chenruo Liu, Hongjun Liu, Zeyu Lai, Yiqiu Shen, Chen Zhao, Qi Lei
Main category: cs.CV
TL;DR: A method using superclass information and gradient-based attention to reduce reliance on spurious features, improving robustness without auxiliary annotations.
Details
Motivation: Prior methods require impractical auxiliary annotations and assume identical group sets across domains, limiting real-world applicability.
Method: Leverages superclass labels and gradient-based attention guided by a vision-language model to disentangle features, promoting superclass-relevant features for prediction.
Result: Outperforms baselines in domain generalization tasks, with improvements in metrics and visualizations.
Conclusion: The approach effectively reduces reliance on spurious features without needing source annotations, enhancing robustness.
Abstract: To enhance group robustness to spurious correlations, prior work often relies on auxiliary annotations for groups or spurious features and assumes identical sets of groups across source and target domains. These two requirements are both unnatural and impractical in real-world settings. To overcome these limitations, we propose a method that leverages the semantic structure inherent in class labels–specifically, superclass information–to naturally reduce reliance on spurious features. Our model employs gradient-based attention guided by a pre-trained vision-language model to disentangle superclass-relevant and irrelevant features. Then, by promoting the use of all superclass-relevant features for prediction, our approach achieves robustness to more complex spurious correlations without the need to annotate any source samples. Experiments across diverse datasets demonstrate that our method significantly outperforms baselines in domain generalization tasks, with clear improvements in both quantitative metrics and qualitative visualizations.
[116] Learning to Harmonize Cross-vendor X-ray Images by Non-linear Image Dynamics Correction
Yucheng Lu, Shunxin Wang, Dovile Juodelyte, Veronika Cheplygina
Main category: cs.CV
TL;DR: The paper proposes Global Deep Curve Estimation (GDCE) to address domain-specific exposure mismatch in medical images, improving model robustness and transparency.
Details
Motivation: To improve model robustness in medical image analysis by addressing domain-specific image dynamics that linear transforms cannot handle.
Method: Reformulates image harmonization as an exposure correction problem, using GDCE with a polynomial function and domain discriminator for training.
Result: Shows that nonlinear domain-specific image dynamics require specialized methods like GDCE for effective enhancement.
Conclusion: GDCE improves model transparency and robustness in downstream tasks compared to black-box methods.
Abstract: In this paper, we explore how conventional image enhancement can improve model robustness in medical image analysis. By applying commonly used normalization methods to images from various vendors and studying their influence on model generalization in transfer learning, we show that the nonlinear characteristics of domain-specific image dynamics cannot be addressed by simple linear transforms. To tackle this issue, we reformulate the image harmonization task as an exposure correction problem and propose a method termed Global Deep Curve Estimation (GDCE) to reduce domain-specific exposure mismatch. GDCE performs enhancement via a pre-defined polynomial function and is trained with a “domain discriminator”, aiming to improve model transparency in downstream tasks compared to existing black-box methods.
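The core of GDCE is a global, pixel-wise polynomial curve applied to intensities. The sketch below assumes the curve is a plain degree-n polynomial with learned coefficients; the abstract says only "a pre-defined polynomial function", so the exact parameterization is a guess.

```python
import torch

def apply_global_curve(img: torch.Tensor, coeffs: torch.Tensor) -> torch.Tensor:
    """Pixel-wise polynomial exposure curve on intensities in [0, 1]:
    out = sum_n coeffs[n] * img**n. The plain power-series form is an
    assumption about the 'pre-defined polynomial function'."""
    out = torch.zeros_like(img)
    for n, a_n in enumerate(coeffs):   # coeffs: 1-D tensor of length degree+1
        out = out + a_n * img.pow(n)
    return out.clamp(0, 1)
```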
[117] DocThinker: Explainable Multimodal Large Language Models with Rule-based Reinforcement Learning for Document Understanding
Wenwen Yu, Zhibo Yang, Yuliang Liu, Xiang Bai
Main category: cs.CV
TL;DR: DocThinker introduces a rule-based RL framework for dynamic reasoning in MLLMs, improving explainability and adaptability over fixed CoT methods.
Details
Motivation: Address the black-box nature of MLLMs in document understanding, ensuring reliability in high-stakes domains such as legal and medical document analysis.
Method: Uses rule-based RL for dynamic inference-time reasoning, generating explainable intermediate results like structured reasoning and RoIs.
Result: Outperforms fixed CoT methods in generalization, adaptability, and transparency across benchmarks.
Conclusion: RL enhances explainability and adaptability in MLLMs, offering a viable alternative to static reasoning methods.
Abstract: Multimodal Large Language Models (MLLMs) have demonstrated remarkable capabilities in document understanding. However, their reasoning processes remain largely black-box, making it difficult to ensure reliability and trustworthiness, especially in high-stakes domains such as legal, financial, and medical document analysis. Existing methods use fixed Chain-of-Thought (CoT) reasoning with supervised fine-tuning (SFT) but suffer from catastrophic forgetting, poor adaptability, and limited generalization across domain tasks. In this paper, we propose DocThinker, a rule-based Reinforcement Learning (RL) framework for dynamic inference-time reasoning. Instead of relying on static CoT templates, DocThinker autonomously refines reasoning strategies via policy learning, generating explainable intermediate results, including structured reasoning processes, rephrased questions, regions of interest (RoI) supporting the answer, and the final answer. By integrating multi-objective rule-based rewards and KL-constrained optimization, our method mitigates catastrophic forgetting and enhances both adaptability and transparency. Extensive experiments on multiple benchmarks demonstrate that DocThinker significantly improves generalization while producing more explainable and human-understandable reasoning steps. Our findings highlight RL as a powerful alternative for enhancing explainability and adaptability in MLLM-based document understanding. Code will be available at https://github.com/wenwenyu/DocThinker.
[118] QueryCraft: Transformer-Guided Query Initialization for Enhanced Human-Object Interaction Detection
Yuxiao Wang, Wolin Liang, Yu Lei, Weiying Xue, Nan Zhuang, Qi Liu
Main category: cs.CV
TL;DR: QueryCraft improves HOI detection by using semantic priors and guided feature learning via transformer-based query initialization, achieving state-of-the-art results.
Details
Motivation: Randomly initialized queries in DETR-based HOI detection lack explicit semantics, limiting performance.
Method: Proposes QueryCraft with ACTOR (cross-modal Transformer) for action-relevant features and PDQD for object-level query quality.
Result: Achieves state-of-the-art performance on HICO-Det and V-COCO benchmarks.
Conclusion: QueryCraft’s dual-branch query initialization enhances interpretability and effectiveness in HOI detection.
Abstract: Human-Object Interaction (HOI) detection aims to localize human-object pairs and recognize their interactions in images. Although DETR-based methods have recently emerged as the mainstream framework for HOI detection, they still suffer from a key limitation: Randomly initialized queries lack explicit semantics, leading to suboptimal detection performance. To address this challenge, we propose QueryCraft, a novel plug-and-play HOI detection framework that incorporates semantic priors and guided feature learning through transformer-based query initialization. Central to our approach is ACTOR (Action-aware Cross-modal TransfORmer), a cross-modal Transformer encoder that jointly attends to visual regions and textual prompts to extract action-relevant features. Rather than merely aligning modalities, ACTOR leverages language-guided attention to infer interaction semantics and produce semantically meaningful query representations. To further enhance object-level query quality, we introduce a Perceptual Distilled Query Decoder (PDQD), which distills object category awareness from a pre-trained detector to serve as object query initiation. This dual-branch query initialization enables the model to generate more interpretable and effective queries for HOI detection. Extensive experiments on HICO-Det and V-COCO benchmarks demonstrate that our method achieves state-of-the-art performance and strong generalization. Code will be released upon publication.
[119] StyleTailor: Towards Personalized Fashion Styling via Hierarchical Negative Feedback
Hongbo Ma, Fei Shen, Hongbin Xu, Xiaoce Wang, Gang Xu, Jinkai Zheng, Liangqiong Qu, Ming Li
Main category: cs.CV
TL;DR: StyleTailor is a collaborative agent framework for personalized fashion styling, integrating design, recommendation, virtual try-on, and evaluation, enhanced by iterative visual refinement via negative feedback.
Details
Motivation: Personalized fashion styling is underexplored despite its potential to improve shopping experiences.
Method: StyleTailor uses two core agents (Designer and Consultant) with hierarchical vision-language feedback and negative prompts for iterative refinement.
Result: Outperforms baselines in personalized designs and recommendations, setting a new benchmark for fashion systems.
Conclusion: StyleTailor advances intelligent fashion systems with adaptive user alignment and comprehensive evaluation.
Abstract: The advancement of intelligent agents has revolutionized problem-solving across diverse domains, yet solutions for personalized fashion styling remain underexplored, which holds immense promise for promoting shopping experiences. In this work, we present StyleTailor, the first collaborative agent framework that seamlessly unifies personalized apparel design, shopping recommendation, virtual try-on, and systematic evaluation into a cohesive workflow. To this end, StyleTailor pioneers an iterative visual refinement paradigm driven by multi-level negative feedback, enabling adaptive and precise user alignment. Specifically, our framework features two core agents, i.e., Designer for personalized garment selection and Consultant for virtual try-on, whose outputs are progressively refined via hierarchical vision-language model feedback spanning individual items, complete outfits, and try-on efficacy. Counterexamples are aggregated into negative prompts, forming a closed-loop mechanism that enhances recommendation quality. To assess the performance, we introduce a comprehensive evaluation suite encompassing style consistency, visual quality, face similarity, and artistic appraisal. Extensive experiments demonstrate StyleTailor’s superior performance in delivering personalized designs and recommendations, outperforming strong baselines without negative feedback and establishing a new benchmark for intelligent fashion systems.
[120] Yan: Foundational Interactive Video Generation
Yan Team
Main category: cs.CV
TL;DR: Yan is a framework for interactive video generation, integrating simulation, multi-modal generation, and multi-granularity editing to enable real-time, action-controllable, and style-flexible video creation.
Details
Motivation: To advance interactive video generation by combining simulation, generation, and editing into a unified framework, enabling flexible and real-time creative tools.
Method: Yan uses a 3D-VAE for simulation, a hierarchical autoregressive caption method for multi-modal generation, and a hybrid model for disentangling mechanics and rendering for editing.
Result: Achieves real-time 1080P/60FPS simulation, strong generalization across domains, and multi-granularity editing during interaction.
Conclusion: Yan integrates these modules into a comprehensive AI-driven interactive creation paradigm, paving the way for next-gen creative tools and media.
Abstract: We present Yan, a foundational framework for interactive video generation, covering the entire pipeline from simulation and generation to editing. Specifically, Yan comprises three core modules. AAA-level Simulation: We design a highly-compressed, low-latency 3D-VAE coupled with a KV-cache-based shift-window denoising inference process, achieving real-time 1080P/60FPS interactive simulation. Multi-Modal Generation: We introduce a hierarchical autoregressive caption method that injects game-specific knowledge into open-domain multi-modal video diffusion models (VDMs), then transforming the VDM into a frame-wise, action-controllable, real-time infinite interactive video generator. Notably, when the textual and visual prompts are sourced from different domains, the model demonstrates strong generalization, allowing it to blend and compose the style and mechanics across domains flexibly according to user prompts. Multi-Granularity Editing: We propose a hybrid model that explicitly disentangles interactive mechanics simulation from visual rendering, enabling multi-granularity video content editing during interaction through text. Collectively, Yan offers an integration of these modules, pushing interactive video generation beyond isolated capabilities toward a comprehensive AI-driven interactive creation paradigm, paving the way for the next generation of creative tools, media, and entertainment. The project page is: https://greatx3.github.io/Yan/.
[121] Transferable Model-agnostic Vision-Language Model Adaptation for Efficient Weak-to-Strong Generalization
Jihwan Park, Taehoon song, Sanghyeok Lee, Miso Choi, Hyunwoo J. Kim
Main category: cs.CV
TL;DR: TransMiter is a lightweight adapter for VLMs that transfers adaptation knowledge without backpropagation, improving performance efficiently.
Details
Motivation: Fine-tuning large VLMs is costly; existing methods lack transferability and are computationally expensive.
Method: TransMiter captures the knowledge gap between pre-trained and fine-tuned VLMs in an unsupervised manner and transfers it across models without backpropagation.
Result: TransMiter enhances performance with minimal inference cost and can surpass fine-tuned models with few labeled data.
Conclusion: TransMiter efficiently transfers adaptation knowledge across VLMs, preserving generalization in visual recognition tasks.
Abstract: Vision-Language Models (VLMs) have been widely used in various visual recognition tasks due to their remarkable generalization capabilities. As these models grow in size and complexity, fine-tuning becomes costly, emphasizing the need to reuse adaptation knowledge from ‘weaker’ models to efficiently enhance ‘stronger’ ones. However, existing adaptation transfer methods exhibit limited transferability across models due to their model-specific design and high computational demands. To tackle this, we propose Transferable Model-agnostic adapter (TransMiter), a light-weight adapter that improves vision-language models ‘without backpropagation’. TransMiter captures the knowledge gap between pre-trained and fine-tuned VLMs, in an ‘unsupervised’ manner. Once trained, this knowledge can be seamlessly transferred across different models without the need for backpropagation. Moreover, TransMiter consists of only a few layers, inducing a negligible additional inference cost. Notably, supplementing the process with a few labeled data further yields additional performance gain, often surpassing a fine-tuned stronger model, with a marginal training cost. Experimental results and analyses demonstrate that TransMiter effectively and efficiently transfers adaptation knowledge while preserving generalization abilities across VLMs of different sizes and architectures in visual recognition tasks.
[122] SelfHVD: Self-Supervised Handheld Video Deblurring for Mobile Phones
Honglei Xu, Zhilu Zhang, Junjie Fan, Xiaohe Wu, Wangmeng Zuo
Main category: cs.CV
TL;DR: A self-supervised method for handheld video deblurring using sharp clues, SEVD for better training data, and SCSCM for spatial consistency, outperforming existing methods.
Details
Motivation: Handheld mobile videos often suffer from blur due to instability, and existing methods struggle with real-world blur domain gaps.
Method: Extracts sharp clues as labels, uses SEVD for high-quality paired data, and SCSCM for spatial consistency.
Result: Outperforms existing self-supervised methods on synthetic and real-world datasets.
Conclusion: The proposed method effectively addresses handheld video deblurring with publicly available code and datasets.
Abstract: Shooting video with a handheld mobile phone, the most common photographic device, often results in blurry frames due to shaking hands and other instability factors. Although previous video deblurring methods have achieved impressive progress, they still struggle to perform satisfactorily on real-world handheld video due to the blur domain gap between training and testing data. To address the issue, we propose a self-supervised method for handheld video deblurring, which is driven by sharp clues in the video. First, to train the deblurring model, we extract the sharp clues from the video and take them as misalignment labels of neighboring blurry frames. Second, to improve the model’s ability, we propose a novel Self-Enhanced Video Deblurring (SEVD) method to create higher-quality paired video data. Third, we propose a Self-Constrained Spatial Consistency Maintenance (SCSCM) method to regularize the model, preventing position shifts between the output and input frames. Moreover, we construct a synthetic and a real-world handheld video dataset for handheld video deblurring. Extensive experiments on these two and other common real-world datasets demonstrate that our method significantly outperforms existing self-supervised ones. The code and datasets are publicly available at https://github.com/cshonglei/SelfHVD.
[123] Neural Artistic Style and Color Transfer Using Deep Learning
Justin London
Main category: cs.CV
TL;DR: The paper introduces a method combining neural artistic style transfer with color transfer, using KL divergence to evaluate color and luminance histogram matching algorithms.
Details
Motivation: To enhance artistic expression and image correction by merging neural artistic style transfer with color transfer techniques.
Method: Uses KL divergence to evaluate color and luminance histogram matching algorithms (Reinhard, IDT, IDT with regrain, Cholesky, PCA) between original and style-transferred images.
Result: Various experiments assess the KL divergence of these algorithms and their color histograms for style-to-content transfer.
Conclusion: The proposed methodology effectively combines neural artistic style transfer with color transfer, providing a quantitative evaluation framework.
Abstract: Neural artistic style transfer blends the content and style representation of one image with the style of another. This enables artists to create unique, innovative visuals and enhances artistic expression in various fields including art, design, and film. Color transfer algorithms play an important role in digital image processing by adjusting the color information in a target image based on the colors in the source image. Color transfer enhances images and videos in film and photography, and can aid in image correction. We introduce a methodology that combines neural artistic style transfer with color transfer. The method uses the Kullback-Leibler (KL) divergence to quantitatively evaluate color and luminance histogram matching algorithms, including Reinhard global color transfer, iterative distribution transfer (IDT), IDT with regrain, Cholesky, and PCA, between the original and the neural artistic style transferred image using deep learning. We estimate the color channel kernel densities. Various experiments are performed to evaluate the KL divergence of these algorithms and their color histograms for style-to-content transfer.
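The evaluation boils down to comparing color histograms with KL divergence. A minimal sketch, assuming 8-bit images and per-channel histograms averaged into one score:

```python
import numpy as np

def histogram_kl(img_a: np.ndarray, img_b: np.ndarray, bins: int = 256) -> float:
    """Mean per-channel KL divergence KL(hist_a || hist_b) between two
    8-bit color images of shape (H, W, C)."""
    eps = 1e-10
    kls = []
    for c in range(img_a.shape[-1]):
        p, _ = np.histogram(img_a[..., c], bins=bins, range=(0, 255))
        q, _ = np.histogram(img_b[..., c], bins=bins, range=(0, 255))
        p = p / p.sum() + eps   # normalize to probabilities, avoid log(0)
        q = q / q.sum() + eps
        kls.append(np.sum(p * np.log(p / q)))
    return float(np.mean(kls))
```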
[124] Hierarchical Visual Prompt Learning for Continual Video Instance Segmentation
Jiahua Dong, Hui Yin, Wenqi Liang, Hanbin Zhao, Henghui Ding, Nicu Sebe, Salman Khan, Fahad Shahbaz Khan
Main category: cs.CV
TL;DR: The paper introduces HVPL, a Hierarchical Visual Prompt Learning model, to address catastrophic forgetting in Video Instance Segmentation (VIS) by learning new object categories without losing old ones.
Details
Motivation: Existing VIS approaches assume fixed object categories and suffer from catastrophic forgetting when learning new classes. This limits their practicality in dynamic environments.
Method: HVPL uses frame-level and video-level prompts: a task-specific frame prompt with an OGC module to prevent forgetting at the frame level, and a video prompt with a context decoder to address forgetting at the video level.
Result: HVPL outperforms baseline methods, effectively mitigating catastrophic forgetting while learning new object categories.
Conclusion: HVPL provides a robust solution for continuous learning in VIS by preserving knowledge of old classes while adapting to new ones.
Abstract: Video instance segmentation (VIS) has gained significant attention for its capability in tracking and segmenting object instances across video frames. However, most of the existing VIS approaches unrealistically assume that the categories of object instances remain fixed over time. Moreover, they experience catastrophic forgetting of old classes when required to continuously learn object instances belonging to new categories. To resolve these challenges, we develop a novel Hierarchical Visual Prompt Learning (HVPL) model that overcomes catastrophic forgetting of previous categories from both frame-level and video-level perspectives. Specifically, to mitigate forgetting at the frame level, we devise a task-specific frame prompt and an orthogonal gradient correction (OGC) module. The OGC module helps the frame prompt encode task-specific global instance information for new classes in each individual frame by projecting its gradients onto the orthogonal feature space of old classes. Furthermore, to address forgetting at the video level, we design a task-specific video prompt and a video context decoder. This decoder first embeds structural inter-class relationships across frames into the frame prompt features, and then propagates task-specific global video contexts from the frame prompt features to the video prompt. Through rigorous comparisons, our HVPL model proves to be more effective than baseline approaches. The code is available at https://github.com/JiahuaDong/HVPL.
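The orthogonal gradient correction (OGC) step has a compact linear-algebra core: remove from the prompt gradient its component inside the old classes' feature subspace. The sketch below assumes that subspace is given as an orthonormal basis; how HVPL constructs and stores the basis is not covered here.

```python
import torch

def orthogonal_gradient_correction(grad: torch.Tensor, old_basis: torch.Tensor) -> torch.Tensor:
    """Project a frame-prompt gradient onto the orthogonal complement of the
    old classes' feature subspace, so updates cannot overwrite old-class
    directions. Assumes `old_basis` is an orthonormal (K, D) basis of that
    subspace; `grad` has shape (D,)."""
    coeffs = old_basis @ grad              # components along old-class directions, (K,)
    return grad - old_basis.t() @ coeffs   # keep only the orthogonal remainder, (D,)
```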
[125] AME: Aligned Manifold Entropy for Robust Vision-Language Distillation
Guiming Cao, Yuming Ou
Main category: cs.CV
TL;DR: The paper proposes AME, a method for robust vision-language knowledge distillation under low-data regimes by minimizing entropy over a shared manifold.
Details
Motivation: Addressing the challenge of robust generalization in vision-language distillation due to data scarcity and predictive uncertainty.
Method: AME minimizes entropy over a reconfigured shared manifold using projection functions for cross-modal feature compression.
Result: AME improves generalization across diverse distillation frameworks and tasks without modifying backbone architectures.
Conclusion: AME is a plug-and-play solution for robust vision-language distillation, supported by theoretical and experimental validation.
Abstract: Knowledge distillation is a long-established technique for knowledge transfer, and has regained attention in the context of the recent emergence of large vision-language models (VLMs). However, vision-language knowledge distillation often requires sufficient training data to achieve robust generalization on samples with ambiguous or boundary-adjacent representations, which are associated with high predictive uncertainty. Critically, collecting such large-scale, task-specific data for training is often impractical in real-world scenarios. To address this major challenge arising from the entanglement of uncertainty and cross-modal feature representation, we propose Aligned Manifold Entropy for Robust Vision-Language Distillation (AME), aiming to achieve robust generalization under real-world conditions. AME applies entropy minimization over a reconfigured shared manifold, where multi-modal data (i.e., image and text) are bridged through a pair of projection functions, conducive to structural compression for cross-modal feature representations. This enables robust knowledge distillation under low-data regimes, while requiring no architectural modifications to the backbone. As a result, it can serve as a plug-and-play module compatible with a wide range of vision-language distillation frameworks. Notably, our theoretical analysis reveals that integrating knowledge distillation with entropy minimization over the shared manifold leads to a tighter generalization error bound. Extensive experiments across diverse distillation architectures and training settings demonstrate that AME consistently facilitates robust knowledge distillation, resulting in superior generalization performance across a wide spectrum of downstream tasks.
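Entropy minimization over a shared manifold can be sketched as follows: project image and text features with a pair of heads onto a common space, form an image-to-text assignment distribution, and minimize its entropy. The softmax-over-similarities form of that distribution and the temperature are assumptions beyond what the abstract states.

```python
import torch
import torch.nn.functional as F

def aligned_manifold_entropy(img_feat, txt_feat, proj_img, proj_txt, tau=0.07):
    """Entropy of the image-to-text assignment distribution on a shared,
    projected space. proj_img/proj_txt are the pair of projection heads
    (e.g. nn.Linear); the softmax form and temperature tau are assumptions."""
    zi = F.normalize(proj_img(img_feat), dim=-1)   # (B, D) image embeddings
    zt = F.normalize(proj_txt(txt_feat), dim=-1)   # (C, D) text embeddings
    p = (zi @ zt.t() / tau).softmax(dim=-1)        # (B, C) assignment probabilities
    return -(p * p.clamp_min(1e-8).log()).sum(dim=-1).mean()
```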
[126] Unified and Semantically Grounded Domain Adaptation for Medical Image Segmentation
Xin Wang, Yin Guo, Jiamin Xia, Kaiyu Zhang, Niranjan Balu, Mahmud Mossa-Basha, Linda Shapiro, Chun Yuan
Main category: cs.CV
TL;DR: A unified framework for unsupervised domain adaptation in medical image segmentation, supporting both source-accessible and source-free settings by leveraging a domain-agnostic probabilistic manifold for anatomical knowledge.
Details
Motivation: Addressing the lack of a structured, generalizable anatomical knowledge construction in prior domain adaptation methods for medical image segmentation.
Method: Introduces a model that learns a domain-agnostic probabilistic manifold representing anatomical regularities, enabling disentangled and interpretable predictions.
Result: Achieves state-of-the-art performance in both source-accessible and source-free settings, with high interpretability via manifold traversal.
Conclusion: The framework provides a unified, semantically grounded solution with intrinsic adaptability and strong interpretability, bridging the gap between source-accessible and source-free domain adaptation.
Abstract: Most prior unsupervised domain adaptation approaches for medical image segmentation are narrowly tailored to either the source-accessible setting, where adaptation is guided by source-target alignment, or the source-free setting, which typically resorts to implicit supervision mechanisms such as pseudo-labeling and model distillation. This substantial divergence in methodological designs between the two settings reveals an inherent flaw: the lack of an explicit, structured construction of anatomical knowledge that naturally generalizes across domains and settings. To bridge this longstanding divide, we introduce a unified, semantically grounded framework that supports both source-accessible and source-free adaptation. Fundamentally distinct from all prior works, our framework’s adaptability emerges naturally as a direct consequence of the model architecture, without the need for any handcrafted adaptation strategies. Specifically, our model learns a domain-agnostic probabilistic manifold as a global space of anatomical regularities, mirroring how humans establish visual understanding. Thus, the structural content in each image can be interpreted as a canonical anatomy retrieved from the manifold and a spatial transformation capturing individual-specific geometry. This disentangled, interpretable formulation enables semantically meaningful prediction with intrinsic adaptability. Extensive experiments on challenging cardiac and abdominal datasets show that our framework achieves state-of-the-art results in both settings, with source-free performance closely approaching its source-accessible counterpart, a level of consistency rarely observed in prior works. Beyond quantitative improvement, we demonstrate strong interpretability of the proposed framework via manifold traversal for smooth shape manipulation.
[127] MMIF-AMIN: Adaptive Loss-Driven Multi-Scale Invertible Dense Network for Multimodal Medical Image Fusion
Tao Luo, Weihua Xu
Main category: cs.CV
TL;DR: The paper proposes MMIF-AMIN, a novel multimodal medical image fusion method, using an Invertible Dense Network and Multi-scale Complementary Feature Extraction Module for superior performance.
Details
Motivation: To enhance medical diagnosis by integrating unique and complementary features from different modalities into a comprehensive image.
Method: Uses an Invertible Dense Network (IDN) for lossless feature extraction and a Multi-scale Complementary Feature Extraction Module (MCFEM) with hybrid attention, convolutional layers, and Transformers. Introduces an adaptive loss function.
Result: Outperforms nine state-of-the-art methods in quantitative and qualitative analyses. Ablation studies confirm component effectiveness.
Conclusion: MMIF-AMIN is effective for MMIF and shows promise for other image fusion tasks.
Abstract: Multimodal medical image fusion (MMIF) aims to integrate images from different modalities to produce a comprehensive image that enhances medical diagnosis by accurately depicting organ structures, tissue textures, and metabolic information. Capturing both the unique and complementary information across multiple modalities simultaneously is a key research challenge in MMIF. To address this challenge, this paper proposes a novel image fusion method, MMIF-AMIN, which features a new architecture that can effectively extract these unique and complementary features. Specifically, an Invertible Dense Network (IDN) is employed for lossless feature extraction from individual modalities. To extract complementary information between modalities, a Multi-scale Complementary Feature Extraction Module (MCFEM) is designed, which incorporates a hybrid attention mechanism, convolutional layers of varying sizes, and Transformers. An adaptive loss function is introduced to guide model learning, addressing the limitations of traditional manually-designed loss functions and enhancing the depth of data mining. Extensive experiments demonstrate that MMIF-AMIN outperforms nine state-of-the-art MMIF methods, delivering superior results in both quantitative and qualitative analyses. Ablation experiments confirm the effectiveness of each component of the proposed method. Additionally, extending MMIF-AMIN to other image fusion tasks also achieves promising performance.
[128] PADReg: Physics-Aware Deformable Registration Guided by Contact Force for Ultrasound Sequences
Yimeng Geng, Mingyang Zhao, Fan Xu, Guanglin Cao, Gaofeng Meng, Hongbin Liu
Main category: cs.CV
TL;DR: PADReg, a physics-aware deformable registration framework, improves ultrasound image alignment by using contact force as a physical prior, outperforming existing methods by 21.34%.
Details
Motivation: Ultrasound deformable registration is challenging due to low contrast, noise, and ambiguous tissue boundaries, leading to poor alignment and lack of interpretability in existing methods.
Method: PADReg constructs a stiffness map using contact force and ultrasound data, then estimates deformation fields via a physics-aware module inspired by Hooke’s law.
Result: Achieves a HD95 of 12.90, 21.34% better than state-of-the-art methods.
Conclusion: PADReg provides physically plausible registration with superior anatomical alignment, validated by in-vivo experiments.
Abstract: Ultrasound deformable registration estimates spatial transformations between pairs of deformed ultrasound images, which is crucial for capturing biomechanical properties and enhancing diagnostic accuracy in diseases such as thyroid nodules and breast cancer. However, ultrasound deformable registration remains highly challenging, especially under large deformation. The inherently low contrast, heavy noise and ambiguous tissue boundaries in ultrasound images severely hinder reliable feature extraction and correspondence matching. Existing methods often suffer from poor anatomical alignment and lack physical interpretability. To address the problem, we propose PADReg, a physics-aware deformable registration framework guided by contact force. PADReg leverages synchronized contact force measured by robotic ultrasound systems as a physical prior to constrain the registration. Specifically, instead of directly predicting deformation fields, we first construct a pixel-wise stiffness map utilizing the multi-modal information from contact force and ultrasound images. The stiffness map is then combined with force data to estimate a dense deformation field, through a lightweight physics-aware module inspired by Hooke’s law. This design enables PADReg to achieve physically plausible registration with better anatomical alignment than previous methods relying solely on image similarity. Experiments on in-vivo datasets demonstrate that it attains a HD95 of 12.90, which is 21.34% better than state-of-the-art methods. The source code is available at https://github.com/evelynskip/PADReg.
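The physics-aware step leans on Hooke's law, F = k·u: given the measured contact force and an estimated per-pixel stiffness map, a displacement magnitude follows as u = F/k. A minimal sketch under our simplifications (a scalar force and element-wise division; the paper uses a learned lightweight module):

```python
import torch

def hooke_displacement(force, stiffness_map, eps=1e-6):
    # force: scalar contact force (N); stiffness_map: (H, W) estimated k
    # Hooke's law F = k * u  =>  per-pixel displacement magnitude u = F / k.
    return force / (stiffness_map + eps)

disp = hooke_displacement(torch.tensor(3.0), torch.rand(256, 256) + 0.5)
```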
[129] ROD: RGB-Only Fast and Efficient Off-road Freespace Detection
Tong Sun, Hongliang Ye, Jilin Mei, Liang Chen, Fangzhou Zhao, Leiqiang Zong, Yu Hu
Main category: cs.CV
TL;DR: ROD introduces an RGB-only method for off-road freespace detection, eliminating LiDAR dependency and achieving 50 FPS, outperforming prior models.
Details
Motivation: Off-road freespace detection is challenging due to blurred boundaries. Multi-modal methods (RGB + LiDAR) are slow, making them unsuitable for real-time applications.
Method: Uses a pre-trained Vision Transformer (ViT) for RGB feature extraction and a lightweight decoder to enhance precision and speed.
Result: ROD sets a new SOTA on ORFD and RELLIS-3D datasets with 50 FPS, surpassing previous models.
Conclusion: ROD demonstrates that RGB-only methods can achieve high performance and real-time speed, making them practical for off-road scenarios.
Abstract: Off-road freespace detection is more challenging than on-road scenarios because of the blurred boundaries of traversable areas. Previous state-of-the-art (SOTA) methods employ multi-modal fusion of RGB images and LiDAR data. However, due to the significant increase in inference time when calculating surface normal maps from LiDAR data, multi-modal methods are not suitable for real-time applications, particularly in real-world scenarios that demand higher frame rates than slow navigation allows. This paper presents a novel RGB-only approach for off-road freespace detection, named ROD, eliminating the reliance on LiDAR data and its computational demands. Specifically, we utilize a pre-trained Vision Transformer (ViT) to extract rich features from RGB images. Additionally, we design a lightweight yet efficient decoder, which together with the ViT features improves both precision and inference speed. ROD establishes a new SOTA on the ORFD and RELLIS-3D datasets while running at 50 FPS, significantly outperforming prior models.
[130] Subjective and Objective Quality Assessment of Banding Artifacts on Compressed Videos
Qi Zheng, Li-Heng Chen, Chenlong He, Neil Berkbeck, Yilin Wang, Balu Adsumilli, Alan C. Bovik, Yibo Fan, Zhengzhong Tu
Main category: cs.CV
TL;DR: The paper introduces a new video dataset (LIVE-YT-Banding) and a no-reference quality evaluator (CBAND) to address banding artifacts in compressed videos, showing superior performance over existing models.
Details
Motivation: Banding artifacts in compressed videos degrade perceptual quality, especially on high-definition displays, but existing datasets lack temporal dynamics.
Method: Created a new video dataset (LIVE-YT-Banding) and developed CBAND, a no-reference quality evaluator leveraging deep neural network embeddings.
Result: CBAND outperforms state-of-the-art models in perceptual banding prediction and is significantly faster. It also serves as a differentiable loss for debanding models.
Conclusion: The LIVE-YT-Banding dataset and CBAND provide valuable resources for video quality assessment and debanding optimization, with publicly available tools.
Abstract: Although there have been notable advancements in video compression technologies in recent years, banding artifacts remain a serious issue affecting the quality of compressed videos, particularly on smooth regions of high-definition videos. Noticeable banding artifacts can severely impact the perceptual quality of videos viewed on a high-end HDTV or high-resolution screen. Hence, there is a pressing need for a systematic investigation of the banding video quality assessment problem for advanced video codecs. Given that the existing publicly available datasets for studying banding artifacts are limited to still picture data only, which cannot account for temporal banding dynamics, we have created a first-of-a-kind open video dataset, dubbed LIVE-YT-Banding, which consists of 160 videos generated by four different compression parameters using the AV1 video codec. A total of 7,200 subjective opinions are collected from a cohort of 45 human subjects. To demonstrate the value of this new resource, we tested and compared a variety of models that detect banding occurrences, and measure their impact on perceived quality. Among these, we introduce an effective and efficient new no-reference (NR) video quality evaluator which we call CBAND. CBAND leverages the properties of the learned statistics of natural images expressed in the embeddings of deep neural networks. Our experimental results show that the perceptual banding prediction performance of CBAND significantly exceeds that of previous state-of-the-art models, and is also orders of magnitude faster. Moreover, CBAND can be employed as a differentiable loss function to optimize video debanding models. The LIVE-YT-Banding database, code, and pre-trained model are all publicly available at https://github.com/uniqzheng/CBAND.
[131] SafeFix: Targeted Model Repair via Controlled Image Generation
Ouyang Xu, Baoming Zhang, Ruiyu Mao, Yunhui Guo
Main category: cs.CV
TL;DR: The paper introduces a model repair module using a conditional text-to-image model and LVLM to generate and filter synthetic images for rare-case augmentation, improving model robustness.
Details
Motivation: Deep learning models for visual recognition often fail on underrepresented subpopulations, and existing repair methods are prone to distribution shift and semantic errors.
Method: Uses a conditional text-to-image model to generate targeted images for failure cases, filtered by a large vision-language model (LVLM) for quality and relevance.
Result: Significantly reduces errors associated with rare cases without introducing new bugs.
Conclusion: The proposed targeted repair strategy enhances model robustness effectively.
Abstract: Deep learning models for visual recognition often exhibit systematic errors due to underrepresented semantic subpopulations. Although existing debugging frameworks can pinpoint these failures by identifying key failure attributes, repairing the model effectively remains difficult. Current solutions often rely on manually designed prompts to generate synthetic training images – an approach prone to distribution shift and semantic errors. To overcome these challenges, we introduce a model repair module that builds on an interpretable failure attribution pipeline. Our approach uses a conditional text-to-image model to generate semantically faithful and targeted images for failure cases. To preserve the quality and relevance of the generated samples, we further employ a large vision-language model (LVLM) to filter the outputs, enforcing alignment with the original data distribution and maintaining semantic consistency. By retraining vision models with this rare-case-augmented synthetic dataset, we significantly reduce errors associated with rare cases. Our experiments demonstrate that this targeted repair strategy improves model robustness without introducing new bugs. Code is available at https://github.com/oxu2/SafeFix
[132] Adaptive Confidence-Wise Loss for Improved Lens Structure Segmentation in AS-OCT
Zunjie Xiao, Xiao Wu, Tianhang Liu, Lingxi Hu, Yinling Zhang, Xiaoqing Zhang, Risa Higashita, Jiang Liu
Main category: cs.CV
TL;DR: The paper introduces an Adaptive Confidence-Wise (ACW) loss for lens structure segmentation, addressing inhomogeneous sub-regions and boundary calibration. It outperforms existing methods with significant IoU, DSC, and BECE improvements.
Details
Motivation: Existing deep segmentation networks treat all pixels equally, ignoring inhomogeneous sub-regions and poor boundary calibration. Expert annotations vary in confidence, inspiring the ACW loss.
Method: ACW groups sub-regions by confidence, applies region-weighted loss, and dynamically optimizes confidence thresholds. A new metric, BECE, quantifies boundary miscalibration.
Result: ACW outperforms CE with 6.13% IoU gain, 4.33% DSC increase, and 4.79% BECE reduction in lens structure segmentation.
Conclusion: ACW effectively leverages expert confidence priors, improving segmentation accuracy and calibration, especially in boundary regions.
Abstract: Precise lens structure segmentation is essential for the design of intraocular lenses (IOLs) in cataract surgery. Existing deep segmentation networks typically weight all pixels equally under cross-entropy (CE) loss, overlooking the fact that sub-regions of lens structures are inhomogeneous (e.g., some regions perform better than others) and that boundary regions often suffer from poor segmentation calibration at the pixel level. Clinically, experts annotate different sub-regions of lens structures with varying confidence levels, considering factors such as sub-region proportions, ambiguous boundaries, and lens structure shapes. Motivated by this observation, we propose an Adaptive Confidence-Wise (ACW) loss that groups each lens structure sub-region into different confidence sub-regions via a confidence threshold, aiming to exploit expert annotation confidence priors at the region level. Specifically, ACW clusters each target region into low-confidence and high-confidence groups and then applies a region-weighted loss to reweigh each confidence group. Moreover, we design an adaptive confidence threshold optimization algorithm to adjust the confidence threshold of ACW dynamically. Additionally, to better quantify the miscalibration errors in boundary region segmentation, we propose a new metric, termed Boundary Expected Calibration Error (BECE). Extensive experiments on a clinical lens structure AS-OCT dataset and other multi-structure datasets demonstrate that our ACW significantly outperforms competitive segmentation loss methods across different deep segmentation networks (e.g., MedSAM). Notably, our method surpasses CE with a 6.13% IoU gain, 4.33% DSC increase, and 4.79% BECE reduction in lens structure segmentation under U-Net. The code of this paper is available at https://github.com/XiaoLing12138/Adaptive-Confidence-Wise-Loss.
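A minimal sketch of the confidence-wise reweighting, assuming the per-pixel softmax confidence as the grouping signal and a fixed threshold with hand-picked group weights (the paper optimizes the threshold adaptively and derives confidence from expert annotation priors):

```python
import torch
import torch.nn.functional as F

def acw_loss(logits, target, tau=0.7, w_low=2.0, w_high=1.0):
    # logits: (B, C, H, W); target: (B, H, W) integer class labels
    ce = F.cross_entropy(logits, target, reduction="none")  # (B, H, W)
    conf = logits.softmax(1).amax(1)        # per-pixel confidence proxy
    weight = torch.full_like(conf, w_high)
    weight[conf < tau] = w_low              # upweight the low-confidence group
    return (weight * ce).mean()
```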
[133] Bridging the Gap: A Framework for Real-World Video Deepfake Detection via Social Network Compression Emulation
Andrea Montibeller, Dasara Shullani, Daniele Baracchi, Alessandro Piva, Giulia Boato
Main category: cs.CV
TL;DR: A framework to emulate social network video compression for improving deepfake detector generalization to real-world scenarios.
Details
Motivation: AI-generated videos on social networks challenge deepfake detectors due to proprietary compression, which removes forensic cues. Existing detectors trained in labs fail in real-world settings.
Method: Proposes a framework to estimate compression and resizing parameters from uploaded videos, enabling local emulation of platform-specific artifacts without API access.
Result: Emulated data closely matches real upload degradation patterns. Detectors fine-tuned on emulated videos perform comparably to those trained on actual shared media.
Conclusion: The framework bridges the gap between lab training and real-world deployment for deepfake detectors, especially for compressed video content.
Abstract: The growing presence of AI-generated videos on social networks poses new challenges for deepfake detection, as detectors trained under controlled conditions often fail to generalize to real-world scenarios. A key factor behind this gap is the aggressive, proprietary compression applied by platforms like YouTube and Facebook, which launder low-level forensic cues. However, replicating these transformations at scale is difficult due to API limitations and data-sharing constraints. For these reasons, we propose a first framework that emulates the video sharing pipelines of social networks by estimating compression and resizing parameters from a small set of uploaded videos. These parameters enable a local emulator capable of reproducing platform-specific artifacts on large datasets without direct API access. Experiments on FaceForensics++ videos shared via social networks demonstrate that our emulated data closely matches the degradation patterns of real uploads. Furthermore, detectors fine-tuned on emulated videos achieve comparable performance to those trained on actual shared media. Our approach offers a scalable and practical solution for bridging the gap between lab-based training and real-world deployment of deepfake detectors, particularly in the underexplored domain of compressed video content.
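Once the compression and resizing parameters have been estimated from a handful of probe uploads, local emulation reduces to re-encoding with those settings. A hedged sketch using ffmpeg; the libx264/CRF parametrization and the example values are our assumptions, whereas the paper fits platform-specific settings:

```python
import subprocess

def emulate_platform(in_path, out_path, width, height, crf):
    """Re-encode a video with resizing and H.264 compression at
    parameters previously fitted to a platform's re-encoded uploads."""
    cmd = [
        "ffmpeg", "-y", "-i", in_path,
        "-vf", f"scale={width}:{height}",      # estimated output resolution
        "-c:v", "libx264", "-crf", str(crf),   # estimated quality factor
        "-an", out_path,                       # drop audio; only video matters
    ]
    subprocess.run(cmd, check=True)

emulate_platform("raw.mp4", "emulated.mp4", width=1280, height=720, crf=28)
```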
[134] SHREC 2025: Retrieval of Optimal Objects for Multi-modal Enhanced Language and Spatial Assistance (ROOMELSA)
Trong-Thuan Nguyen, Viet-Tham Huynh, Quang-Thuc Nguyen, Hoang-Phuc Nguyen, Long Le Bao, Thai Hoang Minh, Minh Nguyen Anh, Thang Nguyen Tien, Phat Nguyen Thuan, Huy Nguyen Phong, Bao Huynh Thai, Vinh-Tiep Nguyen, Duc-Vu Nguyen, Phu-Hoa Pham, Minh-Huy Le-Hoang, Nguyen-Khang Le, Minh-Chinh Nguyen, Minh-Quan Ho, Ngoc-Long Tran, Hien-Long Le-Hoang, Man-Khoi Tran, Anh-Duong Tran, Kim Nguyen, Quan Nguyen Hung, Dat Phan Thanh, Hoang Tran Van, Tien Huynh Viet, Nhan Nguyen Viet Thien, Dinh-Khoi Vo, Van-Loc Nguyen, Trung-Nghia Le, Tam V. Nguyen, Minh-Triet Tran
Main category: cs.CV
TL;DR: ROOMELSA is a benchmark for evaluating 3D retrieval systems in complex, real-world scenarios using natural language and panoramic room images.
Details
Motivation: Existing 3D retrieval systems are limited to simple scenarios, while real-world applications require handling cluttered scenes and vague descriptions.
Method: ROOMELSA evaluates systems by retrieving 3D models from a database based on targeted queries in panoramic room images.
Result: While coarse retrieval is solved, only one top model consistently ranked correct matches first. A CLIP-based model performed well but struggled with subtle variations.
Conclusion: ROOMELSA advances robust 3D recognition by integrating visual and language understanding, setting a new benchmark for real-world applications.
Abstract: Recent 3D retrieval systems are typically designed for simple, controlled scenarios, such as identifying an object from a cropped image or a brief description. However, real-world scenarios are more complex, often requiring the recognition of an object in a cluttered scene based on a vague, free-form description. To this end, we present ROOMELSA, a new benchmark designed to evaluate a system’s ability to interpret natural language, attend to a specific region within a panoramic room image, and accurately retrieve the corresponding 3D model from a large database. In addition, ROOMELSA includes over 1,600 apartment scenes, nearly 5,200 rooms, and more than 44,000 targeted queries. Empirically, while coarse object retrieval is largely solved, only one top-performing model consistently ranked the correct match first across nearly all test cases. Notably, a lightweight CLIP-based model also performed well, although it struggled with subtle variations in materials, part structures, and contextual cues, resulting in occasional errors. These findings highlight the importance of tightly integrating visual and language understanding. By bridging the gap between scene-level grounding and fine-grained 3D retrieval, ROOMELSA establishes a new benchmark for advancing robust, real-world 3D recognition systems.
[135] DiffPose-Animal: A Language-Conditioned Diffusion Framework for Animal Pose Estimation
Tianyu Xiong, Dayi Tan, Wei Tian
Main category: cs.CV
TL;DR: DiffPose-Animal is a diffusion-based framework for animal pose estimation, leveraging LLMs for semantic guidance and a diffusion decoder for robust predictions.
Details
Motivation: Animal pose estimation is challenging due to morphological diversity, complex structures, and limited data.
Method: Uses diffusion models for denoising, LLMs for anatomical priors, and cross-attention for feature fusion.
Result: Effective and generalizable, especially in diverse, cluttered, or incomplete scenarios.
Conclusion: DiffPose-Animal advances animal pose estimation with robust and biologically meaningful predictions.
Abstract: Animal pose estimation is a fundamental task in computer vision, with growing importance in ecological monitoring, behavioral analysis, and intelligent livestock management. Compared to human pose estimation, animal pose estimation is more challenging due to high interspecies morphological diversity, complex body structures, and limited annotated data. In this work, we introduce DiffPose-Animal, a novel diffusion-based framework for top-down animal pose estimation. Unlike traditional heatmap regression methods, DiffPose-Animal reformulates pose estimation as a denoising process under the generative framework of diffusion models. To enhance semantic guidance during keypoint generation, we leverage large language models (LLMs) to extract both global anatomical priors and local keypoint-wise semantics based on species-specific prompts. These textual priors are encoded and fused with image features via cross-attention modules to provide biologically meaningful constraints throughout the denoising process. Additionally, a diffusion-based keypoint decoder is designed to progressively refine pose predictions, improving robustness to occlusion and annotation sparsity. Extensive experiments on public animal pose datasets demonstrate the effectiveness and generalization capability of our method, especially under challenging scenarios with diverse species, cluttered backgrounds, and incomplete keypoints.
[136] Region-Adaptive Video Sharpening via Rate-Perception Optimization
Yingxue Pang, Shijie Zhao, Mengxi Guo, Junlin Li, Li Zhang
Main category: cs.CV
TL;DR: RPO-AdaSharp is a region-adaptive video sharpening model that optimizes perceptual enhancement and bitrate savings by using CTU partition masks to guide bit allocation.
Details
Motivation: Uniform sharpening degrades video quality by ignoring texture variations and increases bitrate without optimal allocation.
Method: Proposes RPO-AdaSharp, an end-to-end model using CTU partition masks to guide bit allocation adaptively.
Result: Benchmark experiments show qualitative and quantitative effectiveness in enhancing video quality and saving bitrate.
Conclusion: RPO-AdaSharp successfully addresses the limitations of uniform sharpening by adaptively allocating bits, improving video quality and efficiency.
Abstract: Sharpening is a widely adopted video enhancement technique. However, uniform sharpening intensity ignores texture variations, degrading video quality. Sharpening also increases bitrate, and there’s a lack of techniques to optimally allocate these additional bits across diverse regions. Thus, this paper proposes RPO-AdaSharp, an end-to-end region-adaptive video sharpening model for both perceptual enhancement and bitrate savings. We use the coding tree unit (CTU) partition mask as prior information to guide and constrain the allocation of increased bits. Experiments on benchmarks demonstrate the effectiveness of the proposed model qualitatively and quantitatively.
[137] MonoPartNeRF:Human Reconstruction from Monocular Video via Part-Based Neural Radiance Fields
Yao Lu, Jiawei Li, Ming Jiang
Main category: cs.CV
TL;DR: MonoPartNeRF improves monocular dynamic human rendering by addressing unnatural transitions and occlusion issues with bidirectional deformation, part-based pose embedding, and dynamic texture modeling.
Details
Motivation: Existing methods struggle with complex pose variations and occlusions, leading to unnatural transitions and inaccurate reconstructions in monocular settings.
Method: Proposes MonoPartNeRF with bidirectional deformation, part-based pose embedding, keyframe pose retrieval, and dynamic texture modeling via attention.
Result: Outperforms prior methods on ZJU-MoCap and MonoCap datasets, achieving better joint alignment, texture fidelity, and structural continuity.
Conclusion: MonoPartNeRF effectively addresses challenges in dynamic human rendering, offering smoother transitions and robust occlusion recovery.
Abstract: In recent years, Neural Radiance Fields (NeRF) have achieved remarkable progress in dynamic human reconstruction and rendering. Part-based rendering paradigms, guided by human segmentation, allow for flexible parameter allocation based on structural complexity, thereby enhancing representational efficiency. However, existing methods still struggle with complex pose variations, often producing unnatural transitions at part boundaries and failing to reconstruct occluded regions accurately in monocular settings. We propose MonoPartNeRF, a novel framework for monocular dynamic human rendering that ensures smooth transitions and robust occlusion recovery. First, we build a bidirectional deformation model that combines rigid and non-rigid transformations to establish a continuous, reversible mapping between observation and canonical spaces. Sampling points are projected into a parameterized surface-time space (u, v, t) to better capture non-rigid motion. A consistency loss further suppresses deformation-induced artifacts and discontinuities. We introduce a part-based pose embedding mechanism that decomposes global pose vectors into local joint embeddings based on body regions. This is combined with keyframe pose retrieval and interpolation, along three orthogonal directions, to guide pose-aware feature sampling. A learnable appearance code is integrated via attention to model dynamic texture changes effectively. Experiments on the ZJU-MoCap and MonoCap datasets demonstrate that our method significantly outperforms prior approaches under complex pose and occlusion conditions, achieving superior joint alignment, texture fidelity, and structural continuity.
[138] Identity-Preserving Aging and De-Aging of Faces in the StyleGAN Latent Space
Luis S. Luevano, Pavel Korshunov, Sebastien Marcel
Main category: cs.CV
TL;DR: The paper proposes a method for aging and de-aging faces using StyleGAN2’s latent space with identity preservation, avoiding complex conditioning and training requirements.
Details
Motivation: Current methods rely on complex conditioning (GANs, Diffusion models, VLMs) leading to training challenges and inconsistent identity preservation.
Method: Uses StyleGAN2’s latent space with SVM for aging/de-aging direction and feature selection to find identity-preserving subspaces.
Result: Empirically identifies identity-preserving subspaces and proposes a formula for parameter limits to ensure identity preservation.
Conclusion: Introduces a public dataset for benchmarking cross-age face recognition and synthetic image detection.
Abstract: Face aging or de-aging with generative AI has gained significant attention for its applications in fields such as forensics, security, and media. However, most state-of-the-art methods rely on conditional Generative Adversarial Networks (GANs), Diffusion-based models, or Visual Language Models (VLMs) to age or de-age faces based on predefined age categories and conditioning via loss functions, fine-tuning, or text prompts. The reliance on such conditioning leads to complex training requirements, increased data needs, and challenges in generating consistent results. Additionally, identity preservation is rarely taken into account, or is evaluated on a single face recognition system, without any control or guarantees on whether identity would be preserved in a generated aged/de-aged face. In this paper, we propose to synthesize aged and de-aged faces via editing the latent space of StyleGAN2 using a simple support vector modeling of the aging/de-aging direction and several feature selection approaches. By using two state-of-the-art face recognition systems, we empirically find the identity-preserving subspace within the StyleGAN2 latent space, so that the apparent age of a given face can be changed while preserving the identity. We then propose a simple yet practical formula for estimating the limits on aging/de-aging parameters that ensure identity preservation for a given input face. Using our method and estimated parameters, we have generated a public dataset of synthetic faces at different ages that can be used for benchmarking cross-age face recognition, age assurance systems, or systems for detection of synthetic images. Our code and dataset are available at the project page https://www.idiap.ch/paper/agesynth/
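A minimal sketch of the latent-editing idea, assuming StyleGAN2 W-space codes paired with binary age labels: a linear SVM yields an aging direction, and editing moves a code along its normal. The dimensions, LinearSVC choice, and alpha value are illustrative, and |alpha| should stay within the identity-preserving limit the paper estimates per face:

```python
import numpy as np
from sklearn.svm import LinearSVC

# Stand-in data: (N, 512) W-space codes and binary old/young labels.
latents = np.random.randn(1000, 512).astype(np.float32)
labels = (np.random.rand(1000) > 0.5).astype(int)

clf = LinearSVC(C=1.0).fit(latents, labels)
n = clf.coef_[0] / np.linalg.norm(clf.coef_[0])   # unit aging direction

def edit_age(w, alpha):
    # Positive alpha ages the face, negative alpha de-ages it.
    return w + alpha * n

aged_code = edit_age(latents[0], alpha=3.0)   # feed back into the generator
```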
[139] Revisiting Efficient Semantic Segmentation: Learning Offsets for Better Spatial and Class Feature Alignment
Shi-Chen Zhang, Yunheng Li, Yu-Huan Wu, Qibin Hou, Ming-Ming Cheng
Main category: cs.CV
TL;DR: The paper proposes a dual-branch offset learning paradigm to address misalignment in semantic segmentation, improving efficiency without major architectural changes.
Details
Motivation: Existing lightweight semantic segmentation methods suffer from misalignment between class representations and image features due to per-pixel classification.
Method: A coupled dual-branch offset learning paradigm is introduced to dynamically refine class representations and spatial image features.
Result: Experiments on four datasets show consistent improvements, e.g., 2.7% mIoU boost on ADE20K with minimal added parameters.
Conclusion: The offset learning paradigm effectively enhances semantic segmentation efficiency and can be integrated into existing methods.
Abstract: Semantic segmentation is fundamental to vision systems requiring pixel-level scene understanding, yet deploying it on resource-constrained devices demands efficient architectures. Although existing methods achieve real-time inference through lightweight designs, we reveal their inherent limitation: misalignment between class representations and image features caused by a per-pixel classification paradigm. With experimental analysis, we find that this paradigm results in a highly challenging assumption for efficient scenarios: image pixel features should not vary for the same category in different images. To address this dilemma, we propose a coupled dual-branch offset learning paradigm that explicitly learns feature and class offsets to dynamically refine both class representations and spatial image features. Based on the proposed paradigm, we construct an efficient semantic segmentation network, OffSeg. Notably, the offset learning paradigm can be applied to existing methods with no additional architectural changes. Extensive experiments on four datasets, including ADE20K, Cityscapes, COCO-Stuff-164K, and Pascal Context, demonstrate consistent improvements with negligible additional parameters. For instance, on the ADE20K dataset, our proposed offset learning paradigm improves SegFormer-B0, SegNeXt-T, and Mask2Former-Tiny by 2.7%, 1.9%, and 2.6% mIoU, respectively, with only 0.1-0.2M additional parameters required.
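A minimal sketch of the dual-branch offset idea as we read it: one branch predicts feature offsets for the pixel features, the other predicts offsets for the class representations, and the refined pair is classified by similarity (the branch designs and sizes here are simplified assumptions):

```python
import torch
import torch.nn as nn

class OffsetHead(nn.Module):
    def __init__(self, dim=256, num_classes=150):
        super().__init__()
        self.cls_emb = nn.Parameter(torch.randn(num_classes, dim))
        self.feat_off = nn.Conv2d(dim, dim, 1)   # spatial feature offsets
        self.cls_off = nn.Linear(dim, dim)       # class-representation offsets

    def forward(self, feats):                    # feats: (B, dim, H, W)
        f = feats + self.feat_off(feats)         # refined pixel features
        c = self.cls_emb + self.cls_off(self.cls_emb)   # refined class reps
        return torch.einsum("bdhw,kd->bkhw", f, c)      # per-pixel logits

logits = OffsetHead()(torch.randn(2, 256, 64, 64))      # (2, 150, 64, 64)
```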
[140] TARA: Token-Aware LoRA for Composable Personalization in Diffusion Models
Yuqi Peng, Lingtao Zheng, Yufeng Yang, Yi Huang, Mingfu Yan, Jianzhuang Liu, Shifeng Chen
Main category: cs.CV
TL;DR: The paper introduces Token-Aware LoRA (TARA) to address issues in multi-concept text-to-image generation, preventing identity loss and feature leakage by using token masks and spatial alignment.
Details
Motivation: Current LoRA-based methods for multi-concept generation suffer from identity missing and visual feature leakage due to token-wise interference and spatial misalignment.
Method: Proposes TARA, which uses token masks to constrain module focus and a training objective for spatial alignment, enabling training-free multi-concept composition.
Result: TARA effectively preserves visual identity and avoids interference between LoRA modules, enabling efficient multi-concept inference.
Conclusion: TARA improves multi-concept text-to-image generation by addressing key issues in LoRA-based methods, offering a practical solution for preserving concept identities.
Abstract: Personalized text-to-image generation aims to synthesize novel images of a specific subject or style using only a few reference images. Recent methods based on Low-Rank Adaptation (LoRA) enable efficient single-concept customization by injecting lightweight, concept-specific adapters into pre-trained diffusion models. However, combining multiple LoRA modules for multi-concept generation often leads to identity missing and visual feature leakage. In this work, we identify two key issues behind these failures: (1) token-wise interference among different LoRA modules, and (2) spatial misalignment between the attention map of a rare token and its corresponding concept-specific region. To address these issues, we propose Token-Aware LoRA (TARA), which introduces a token mask to explicitly constrain each module to focus on its associated rare token to avoid interference, and a training objective that encourages the spatial attention of a rare token to align with its concept region. Our method enables training-free multi-concept composition by directly injecting multiple independently trained TARA modules at inference time. Experimental results demonstrate that TARA enables efficient multi-concept inference and effectively preserving the visual identity of each concept by avoiding mutual interference between LoRA modules. The code and models are available at https://github.com/YuqiPeng77/TARA.
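A hedged sketch of a token-aware LoRA update: the low-rank delta is zeroed everywhere except at the positions of the module's rare token, so independently trained modules do not interfere at inference. The rank, scaling, and plain-linear setting are illustrative; the method applies this inside the diffusion model's attention layers:

```python
import torch
import torch.nn as nn

class TokenAwareLoRA(nn.Module):
    def __init__(self, base: nn.Linear, rank=4, scale=1.0):
        super().__init__()
        self.base = base
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = scale

    def forward(self, x, token_mask):
        # x: (B, T, in); token_mask: (B, T), 1 at this concept's rare token
        delta = (x @ self.A.t()) @ self.B.t() * self.scale
        return self.base(x) + delta * token_mask.unsqueeze(-1)

layer = TokenAwareLoRA(nn.Linear(768, 768))
out = layer(torch.randn(2, 77, 768), torch.zeros(2, 77))  # (2, 77, 768)
```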
[141] 3DFroMLLM: 3D Prototype Generation only from Pretrained Multimodal LLMs
Noor Ahmed, Cameron Braunstein, Steffen Eger, Eddy Ilg
Main category: cs.CV
TL;DR: 3DFroMLLM is a novel framework that generates 3D object prototypes from MLLMs, improving spatial reasoning and outperforming previous methods by 15%. It also enhances fine-grained vision-language models by 55% without extra human-labeled data.
Details
Motivation: Existing MLLMs lack strong spatial reasoning capabilities, limiting their ability to generate 3D object prototypes.
Method: The framework uses an agentic pipeline (designer, coder, visual inspector) in a refinement loop to generate 3D prototypes without additional training data or detailed instructions.
Result: The framework outperforms previous methods by 15% in image classification pretraining and improves fine-grained vision-language models by 55% in part segmentation.
Conclusion: 3DFroMLLM advances MLLMs by enabling 3D prototype generation and enhancing vision-language tasks without requiring extra labeled data.
Abstract: Recent Multi-Modal Large Language Models (MLLMs) have demonstrated strong capabilities in learning joint representations from text and images. However, their spatial reasoning remains limited. We introduce 3DFroMLLM, a novel framework that enables the generation of 3D object prototypes directly from MLLMs, including geometry and part labels. Our pipeline is agentic, comprising a designer, coder, and visual inspector operating in a refinement loop. Notably, our approach requires no additional training data or detailed user instructions. Building on prior work in 2D generation, we demonstrate that rendered images produced by our framework can be effectively used for image classification pretraining tasks and outperform previous methods by 15%. As a compelling real-world use case, we show that the generated prototypes can be leveraged to improve fine-grained vision-language models by using the rendered, part-labeled prototypes to fine-tune CLIP for part segmentation, achieving a 55% accuracy improvement without relying on any additional human-labeled data.
[142] A Parametric Bi-Directional Curvature-Based Framework for Image Artifact Classification and Quantification
Diego Frias
Main category: cs.CV
TL;DR: A novel No-Reference Image Quality Assessment (NR-IQA) framework using Anisotropic Texture Richness (ATR) achieves high accuracy in artifact classification and quality prediction.
Details
Motivation: To improve image quality assessment by analyzing directional curvature and texture suppression for better artifact classification and degradation quantification.
Method: Defines ATR with tunable thresholds, optimizes it for specific artifacts, and uses a two-stage system for classification and regression.
Result: Achieves Spearman correlations of -0.93 (blur) and -0.95 (noise), 97% classification accuracy, and high predictive accuracy (R² = 0.892, RMSE = 5.17).
Conclusion: The framework is a robust dual-purpose tool for image degradation classification and quantification.
Abstract: This work presents a novel framework for No-Reference Image Quality Assessment (NR-IQA) founded on the analysis of directional image curvature. Within this framework, we define a measure of Anisotropic Texture Richness (ATR), which is computed at the pixel level using two tunable thresholds – one permissive and one restrictive – that quantify orthogonal texture suppression. When its parameters are optimized for a specific artifact, the resulting ATR score serves as a high-performance quality metric, achieving Spearman correlations with human perception of approximately -0.93 for Gaussian blur and -0.95 for white noise on the LIVE dataset. The primary contribution is a two-stage system that leverages the differential response of ATR to various distortions. First, the system utilizes the signature from two specialist ATR configurations to classify the primary artifact type (blur vs. noise) with over 97% accuracy. Second, following classification, it employs a dedicated regression model mapping the relevant ATR score to a quality rating to quantify the degradation. On a combined dataset, the complete system predicts human scores with a coefficient of determination (R²) of 0.892 and a Root Mean Square Error (RMSE) of 5.17 DMOS points. This error corresponds to just 7.4% of the dataset’s total quality range, demonstrating high predictive accuracy. This establishes our framework as a robust, dual-purpose tool for the classification and subsequent quantification of image degradation.
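The exact ATR definition is in the paper; as a hedged reading of the two-threshold idea, the sketch below uses second-order finite differences as directional curvature proxies and counts pixels whose curvature is strong along one axis (above the permissive threshold) while suppressed along the orthogonal axis (below the restrictive one). The threshold values are placeholders to be tuned per artifact:

```python
import numpy as np

def atr_score(img, t_perm=0.02, t_restr=0.005):
    g = img.astype(np.float64)
    cxx = np.abs(np.gradient(np.gradient(g, axis=1), axis=1))  # horizontal curvature
    cyy = np.abs(np.gradient(np.gradient(g, axis=0), axis=0))  # vertical curvature
    aniso = ((cxx > t_perm) & (cyy < t_restr)) | \
            ((cyy > t_perm) & (cxx < t_restr))
    return aniso.mean()   # fraction of anisotropic-texture pixels
```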
[143] Adaptive High-Frequency Preprocessing for Video Coding
Yingxue Pang, Shijie Zhao, Junlin Li, Li Zhang
Main category: cs.CV
TL;DR: An end-to-end learning framework (FFPN) optimizes high-frequency preprocessing in video coding to balance quality and bitrate, improving subjective quality and reducing costs.
Details
Motivation: High-frequency components affect video clarity and bitrate, necessitating a solution to balance quality and efficiency.
Method: Uses the Frequency-attentive Feature pyramid Prediction Network (FFPN) to predict optimal preprocessing strategies, trained with pseudo-labeled videos based on rate-distortion performance.
Result: The framework enhances video quality while saving bitrate, validated on multiple datasets.
Conclusion: The proposed method effectively balances video quality and bitrate, offering practical benefits for coding efficiency.
Abstract: High-frequency components are crucial for maintaining video clarity and realism, but they also significantly impact coding bitrate, resulting in increased bandwidth and storage costs. This paper presents an end-to-end learning-based framework for adaptive high-frequency preprocessing to enhance subjective quality and save bitrate in video coding. The framework employs the Frequency-attentive Feature pyramid Prediction Network (FFPN) to predict the optimal high-frequency preprocessing strategy, guiding subsequent filtering operators to achieve the optimal tradeoff between bitrate and quality after compression. For training FFPN, we pseudo-label each training video with the optimal strategy, determined by comparing the rate-distortion (RD) performance across different preprocessing types and strengths. Distortion is measured using the latest quality assessment metric. Comprehensive evaluations on multiple datasets demonstrate the visually appealing enhancement capabilities and bitrate savings achieved by our framework.
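The pseudo-labeling rule described above is a plain rate-distortion comparison; a minimal sketch with stand-ins for the codec and the quality metric (`encode` and `distortion` are hypothetical callables, and the lambda weight is a placeholder):

```python
def pseudo_label(strategies, encode, distortion, lam=0.1):
    """Pick the preprocessing strategy minimizing J = D + lambda * R,
    where encode(s) returns (bitrate, decoded clip) for strategy s and
    distortion scores the decoded clip with a quality metric."""
    best, best_cost = None, float("inf")
    for s in strategies:
        rate, recon = encode(s)
        cost = distortion(recon) + lam * rate
        if cost < best_cost:
            best, best_cost = s, cost
    return best
```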
[144] GaussianUpdate: Continual 3D Gaussian Splatting Update for Changing Environments
Lin Zeng, Boming Zhao, Jiarui Hu, Xujie Shen, Ziqiang Dang, Hujun Bao, Zhaopeng Cui
Main category: cs.CV
TL;DR: GaussianUpdate combines 3D Gaussian representation with continual learning for adaptive novel view synthesis, handling scene changes efficiently without extensive retraining.
Details
Motivation: Existing methods for adapting neural models to scene changes are either labor-intensive or fail to capture detailed changes over time.
Method: Uses 3D Gaussian representation and a multi-stage update strategy, along with visibility-aware continual learning and generative replay.
Result: Achieves superior, real-time rendering and visualizes changes over time on benchmark datasets.
Conclusion: GaussianUpdate effectively updates scenes while preserving past information, offering a scalable solution for dynamic environments.
Abstract: Novel view synthesis with neural models has advanced rapidly in recent years, yet adapting these models to scene changes remains an open problem. Existing methods are either labor-intensive, requiring extensive model retraining, or fail to capture detailed types of changes over time. In this paper, we present GaussianUpdate, a novel approach that combines 3D Gaussian representation with continual learning to address these challenges. Our method effectively updates the Gaussian radiance fields with current data while preserving information from past scenes. Unlike existing methods, GaussianUpdate explicitly models different types of changes through a novel multi-stage update strategy. Additionally, we introduce a visibility-aware continual learning approach with generative replay, enabling self-aware updating without the need to store images. Experiments on benchmark datasets demonstrate that our method achieves superior, real-time rendering with the capability of visualizing changes over time.
[145] Preview WB-DH: Towards Whole Body Digital Human Bench for the Generation of Whole-body Talking Avatar Videos
Chaoyi Wang, Yifan Yang, Jun Pei, Lijie Xia, Jianpo Liu, Xiaobing Yuan, Xinhan Di
Main category: cs.CV
TL;DR: The paper introduces WB-DH, a benchmark dataset for evaluating whole-body animatable avatar generation, addressing gaps in current datasets and metrics.
Details
Motivation: Current datasets and metrics lack the ability to capture subtle expressions, body movements, and dynamic backgrounds in avatar generation.
Method: The authors propose WB-DH, an open-source, multi-modal benchmark with detailed annotations and a versatile evaluation framework.
Result: WB-DH provides a comprehensive toolset and dataset for evaluating whole-body avatar generation, publicly accessible.
Conclusion: WB-DH bridges the gap in evaluating animatable avatars, offering a robust benchmark for future research.
Abstract: Creating realistic, fully animatable whole-body avatars from a single portrait is challenging due to limitations in capturing subtle expressions, body movements, and dynamic backgrounds. Current evaluation datasets and metrics fall short in addressing these complexities. To bridge this gap, we introduce the Whole-Body Benchmark Dataset (WB-DH), an open-source, multi-modal benchmark designed for evaluating whole-body animatable avatar generation. Key features include: (1) detailed multi-modal annotations for fine-grained guidance, (2) a versatile evaluation framework, and (3) public access to the dataset and tools at https://github.com/deepreasonings/WholeBodyBenchmark.
[146] A Robust Epipolar-Domain Regularization Algorithm for Light Field Depth Estimation
Noor Islam S. Mohammad
Main category: cs.CV
TL;DR: A lightweight depth estimation method for light field imaging combines disparity information with a random walk algorithm, offering efficiency and robustness without heavy reliance on deep learning.
Details
Motivation: Addressing the high computational costs and noise sensitivity of existing CNN-based depth estimation methods in light field imaging.
Method: Integrates light field disparity with a directed random walk refinement algorithm, avoiding extensive training or large datasets.
Result: Maintains low computational complexity and competitive accuracy, though slightly less effective in uncontrolled conditions.
Conclusion: Proposes a robust, efficient alternative to deep learning for depth estimation, with potential for further integration of probabilistic graph models.
Abstract: Robust depth estimation in light field imaging remains a critical challenge for pattern recognition applications such as augmented reality, biomedical imaging, and scene reconstruction. While existing approaches often rely heavily on deep convolutional neural networks, they tend to incur high computational costs and struggle in noisy real-world environments. This paper proposes a novel lightweight depth estimation pipeline that integrates light field-based disparity information with a directed random walk refinement algorithm. Unlike traditional CNN-based methods, our approach enhances depth map consistency without requiring extensive training or large-scale datasets. The proposed method was evaluated on the 4D Light Field Benchmark dataset and a diverse set of real-world images. Experimental results indicate that while performance slightly declines under uncontrolled conditions, the algorithm consistently maintains low computational complexity and competitive accuracy compared to state-of-the-art deep learning models. These findings highlight the potential of our method as a robust and efficient alternative for depth estimation and segmentation in light field imaging. The work provides insights into practical algorithm design for light field-based pattern recognition and opens new directions for integrating probabilistic graph models with depth sensing frameworks.
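A hedged sketch of a guided random-walk refinement: depth values diffuse to their 4-neighbors with transition weights derived from guide-image affinities, so walks are discouraged from crossing strong edges. The paper's directed-walk design differs; beta, the iteration count, and the wrap-around borders from np.roll are simplifications:

```python
import numpy as np

def random_walk_refine(depth, guide, iters=50, beta=10.0):
    d = depth.astype(np.float64).copy()
    for _ in range(iters):
        acc = np.zeros_like(d)
        wsum = np.zeros_like(d)
        for ax, sh in ((0, 1), (0, -1), (1, 1), (1, -1)):
            nd = np.roll(d, sh, axis=ax)            # neighbor depth
            ng = np.roll(guide, sh, axis=ax)        # neighbor intensity
            w = np.exp(-beta * (guide - ng) ** 2)   # affinity = transition weight
            acc += w * nd
            wsum += w
        d = acc / np.maximum(wsum, 1e-8)            # expected next-step depth
    return d
```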
[147] Masked Clustering Prediction for Unsupervised Point Cloud Pre-training
Bin Ren, Xiaoshui Huang, Mengyuan Liu, Hong Liu, Fabio Poiesi, Nicu Sebe, Guofeng Mei
Main category: cs.CV
TL;DR: MaskClu is an unsupervised pre-training method for vision transformers (ViTs) on 3D point clouds, combining masked point modeling with clustering-based learning and global contrastive learning to enhance semantic feature extraction.
Details
Motivation: Standard ViTs struggle to learn dense and informative semantic features from 3D point clouds, prompting the need for a more effective pre-training approach.
Method: MaskClu integrates masked point modeling with clustering-based learning, reconstructing cluster assignments and centers, and employs global contrastive learning to enhance instance-level features.
Result: MaskClu achieves competitive results in 3D tasks like part segmentation, semantic segmentation, object detection, and classification.
Conclusion: MaskClu effectively learns richer and more semantically meaningful representations from 3D point clouds, demonstrating its potential for advancing 3D understanding tasks.
Abstract: Vision transformers (ViTs) have recently been widely applied to 3D point cloud understanding, with masked autoencoding as the predominant pre-training paradigm. However, the challenge of learning dense and informative semantic features from point clouds via standard ViTs remains underexplored. We propose MaskClu, a novel unsupervised pre-training method for ViTs on 3D point clouds that integrates masked point modeling with clustering-based learning. MaskClu is designed to reconstruct both cluster assignments and cluster centers from masked point clouds, thus encouraging the model to capture dense semantic information. Additionally, we introduce a global contrastive learning mechanism that enhances instance-level feature learning by contrasting different masked views of the same point cloud. By jointly optimizing these complementary objectives, i.e., dense semantic reconstruction and instance-level contrastive learning, MaskClu enables ViTs to learn richer and more semantically meaningful representations from 3D point clouds. We validate the effectiveness of our method via multiple 3D tasks, including part segmentation, semantic segmentation, object detection, and classification, where MaskClu sets new competitive results. The code and models will be released at: https://github.com/Amazingren/maskclu.
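The instance-level objective reads as standard InfoNCE between global features of two differently masked views of the same cloud; a minimal sketch with an illustrative temperature:

```python
import torch
import torch.nn.functional as F

def global_contrastive(z1, z2, tau=0.1):
    # z1, z2: (B, d) global features of two masked views; matching rows
    # are positives, all other clouds in the batch are negatives.
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / tau                    # (B, B) similarity matrix
    labels = torch.arange(z1.size(0), device=z1.device)
    return F.cross_entropy(logits, labels)
```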
[148] Automatic and standardized surgical reporting for central nervous system tumors
David Bouget, Mathilde Gajda Faanes, Asgeir Store Jakola, Frederik Barkhof, Hilko Ardon, Lorenzo Bello, Mitchel S. Berger, Shawn L. Hervey-Jumper, Julia Furtner, Albert J. S. Idema, Barbara Kiesel, Georg Widhalm, Rishi Nandoe Tewarie, Emmanuel Mandonnet, Pierre A. Robe, Michiel Wagemakers, Timothy R. Smith, Philip C. De Witt Hamer, Ole Solheim, Ingerid Reinertsen
Main category: cs.CV
TL;DR: The study introduces a pipeline for automated postoperative CNS tumor analysis using Attention U-Net and DenseNet models, achieving high accuracy in segmentation and classification, integrated into Raidionics software.
Details
Motivation: Most automated tumor analysis focuses on preoperative data, leaving a gap in postoperative imaging evaluation. This study aims to standardize and automate postsurgical reporting for CNS tumors.
Method: The pipeline uses Attention U-Net for segmentation (tumor core, residual tumor, resection cavity) and DenseNet for MR sequence and tumor type classification, trained on multicentric datasets with 5-fold cross-validation.
Result: Segmentation models achieved Dice scores of 87%, 66%, 70%, and 77% for different tumor components. Classification models reached 99.5% and 80% accuracy for MR sequence and tumor type, respectively.
Conclusion: The pipeline enhances postoperative evaluation and clinical decision-making, integrated into Raidionics for open-source CNS tumor analysis.
Abstract: Magnetic resonance (MR) imaging is essential for evaluating central nervous system (CNS) tumors, guiding surgical planning, treatment decisions, and assessing postoperative outcomes and complication risks. While recent work has advanced automated tumor segmentation and report generation, most efforts have focused on preoperative data, with limited attention to postoperative imaging analysis. This study introduces a comprehensive pipeline for standardized postsurgical reporting in CNS tumors. Using the Attention U-Net architecture, segmentation models were trained for the preoperative (non-enhancing) tumor core, postoperative contrast-enhancing residual tumor, and resection cavity. Additionally, MR sequence classification and tumor type identification for contrast-enhancing lesions were explored using the DenseNet architecture. The models were integrated into a reporting pipeline, following the RANO 2.0 guidelines. Training was conducted on multicentric datasets comprising 2000 to 7000 patients, using a 5-fold cross-validation. Evaluation included patient-, voxel-, and object-wise metrics, with benchmarking against the latest BraTS challenge results. The segmentation models achieved average voxel-wise Dice scores of 87%, 66%, 70%, and 77% for the tumor core, non-enhancing tumor core, contrast-enhancing residual tumor, and resection cavity, respectively. Classification models reached 99.5% balanced accuracy in MR sequence classification and 80% in tumor type classification. The pipeline presented in this study enables robust, automated segmentation, MR sequence classification, and standardized report generation aligned with RANO 2.0 guidelines, enhancing postoperative evaluation and clinical decision-making. The proposed models and methods were integrated into Raidionics, an open-source software platform for CNS tumor analysis, now including a dedicated module for postsurgical analysis.
[149] A Pseudo Global Fusion Paradigm-Based Cross-View Network for LiDAR-Based Place Recognition
Jintao Cheng, Jiehao Luo, Xieyuanli Chen, Jin Wu, Rui Fan, Xiaoyu Tang, Wei Zhang
Main category: cs.CV
TL;DR: A novel cross-view network for LiDAR-based Place Recognition (LPR) uses a pseudo-global information guidance mechanism and a Manifold Adaptation Metric to improve performance in complex environments.
Details
Motivation: Existing LPR methods rely on Euclidean distance, ignoring intrinsic feature structures and intra-class variances, leading to suboptimal performance in complex scenarios.
Method: Proposes a cross-view network with multi-modal fusion and a Manifold Adaptation Metric using Mahalanobis distance for feature learning.
Result: The method outperforms traditional Euclidean-based approaches, especially in complex and temporal-varying environments.
Conclusion: The framework effectively captures nonlinear data distributions and inter-class dependencies, enhancing LPR performance.
Abstract: LiDAR-based Place Recognition (LPR) remains a critical task in Embodied Artificial Intelligence (AI) and Autonomous Driving, primarily addressing localization challenges in GPS-denied environments and supporting loop closure detection. Existing approaches reduce place recognition to a Euclidean distance-based metric learning task, neglecting the feature space’s intrinsic structures and intra-class variances. Such Euclidean-centric formulation inherently limits the model’s capacity to capture nonlinear data distributions, leading to suboptimal performance in complex environments and temporal-varying scenarios. To address these challenges, we propose a novel cross-view network based on an innovative fusion paradigm. Our framework introduces a pseudo-global information guidance mechanism that coordinates multi-modal branches to perform feature learning within a unified semantic space. Concurrently, we propose a Manifold Adaptation and Pairwise Variance-Locality Learning Metric that constructs a Symmetric Positive Definite (SPD) matrix to compute Mahalanobis distance, superseding traditional Euclidean distance metrics. This geometric formulation enables the model to accurately characterize intrinsic data distributions and capture complex inter-class dependencies within the feature space. Experimental results demonstrate that the proposed algorithm achieves competitive performance, particularly excelling in complex environmental conditions.
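The metric itself is straightforward to make valid: parametrizing the SPD matrix as M = LLᵀ + εI with L lower-triangular guarantees positive definiteness, so d(x, y) = √((x−y)ᵀM(x−y)) is a proper learned distance. A minimal sketch of just that SPD guarantee (the paper's Manifold Adaptation construction of M is more involved):

```python
import torch
import torch.nn as nn

class SPDMahalanobis(nn.Module):
    def __init__(self, dim, eps=1e-4):
        super().__init__()
        self.L = nn.Parameter(torch.eye(dim))   # learned Cholesky-like factor
        self.eps = eps

    def forward(self, x, y):                    # x, y: (B, dim)
        L = torch.tril(self.L)
        M = L @ L.t() + self.eps * torch.eye(L.size(0), device=L.device)
        diff = x - y
        return torch.sqrt((diff @ M * diff).sum(-1).clamp_min(1e-12))

dist = SPDMahalanobis(128)(torch.randn(4, 128), torch.randn(4, 128))
```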
[150] Shape Completion and Real-Time Visualization in Robotic Ultrasound Spine Acquisitions
Miruna-Alexandra Gafencu, Reem Shaban, Yordanka Velikova, Mohammad Farid Azampour, Nassir Navab
Main category: cs.CV
TL;DR: A novel robotic ultrasound system with real-time shape completion enhances spinal visualization by overcoming shadowing artifacts and traditional CT-to-US limitations.
Details
Motivation: Ultrasound imaging for spinal procedures is hindered by shadowing artifacts and traditional CT-to-US registration challenges.
Method: Combines robotic ultrasound with deep learning-based shape completion to autonomously acquire US sweeps, extract vertebral surfaces, and reconstruct anatomy.
Result: Validated through quantitative experiments on shape completion accuracy and qualitative results on volunteer scans, showing improved consistency and reproducibility.
Conclusion: The system offers real-time, interactive visualization and autonomous repeatability, enhancing spinal procedure accuracy and understanding.
Abstract: Ultrasound (US) imaging is increasingly used in spinal procedures due to its real-time, radiation-free capabilities; however, its effectiveness is hindered by shadowing artifacts that obscure deeper tissue structures. Traditional approaches, such as CT-to-US registration, incorporate anatomical information from preoperative CT scans to guide interventions, but they are limited by complex registration requirements, differences in spine curvature, and the need for recent CT imaging. Recent shape completion methods can offer an alternative by reconstructing spinal structures in US data, while being pretrained on a large set of publicly available CT scans. However, these approaches are typically offline and have limited reproducibility. In this work, we introduce a novel integrated system that combines robotic ultrasound with real-time shape completion to enhance spinal visualization. Our robotic platform autonomously acquires US sweeps of the lumbar spine, extracts vertebral surfaces from ultrasound, and reconstructs the complete anatomy using a deep learning-based shape completion network. This framework provides interactive, real-time visualization with the capability to autonomously repeat scans and can enable navigation to target locations. This can contribute to better consistency, reproducibility, and understanding of the underlying anatomy. We validate our approach through quantitative experiments assessing shape completion accuracy and evaluations of multiple spine acquisition protocols on a phantom setup. Additionally, we present qualitative results of the visualization on a volunteer scan.
[151] Accelerated Volumetric Compression without Hierarchies: A Fourier Feature Based Implicit Neural Representation Approach
Leona Žůrková, Petr Strakoš, Michal Kravčenko, Tomáš Brzobohatý, Lubomír Říha
Main category: cs.CV
TL;DR: A neural compression method for volumetric data using Fourier features and selective voxel sampling reduces training time by 63.7% with minimal quality loss.
Details
Motivation: Volumetric data compression is essential in fields like medical imaging and entertainment, but traditional methods often involve redundant computations and hierarchical metadata.
Method: Combines Fourier-feature encoding with dynamic voxel selection via morphological dilation to prioritize active regions, eliminating redundant computation.
Result: Training time reduced by 63.7%, minor quality loss (PSNR drop 0.59 dB, SSIM drop 0.008), and a compression rate of 14.
Conclusion: The method offers a scalable, structure-free solution for efficient volumetric compression, eliminating traditional data-loading overhead.
Abstract: Volumetric data compression is critical in fields like medical imaging, scientific simulation, and entertainment. We introduce a structure-free neural compression method combining Fourier-feature encoding with selective voxel sampling, yielding compact volumetric representations and faster convergence. Our dynamic voxel selection uses morphological dilation to prioritize active regions, reducing redundant computation without any hierarchical metadata. In the experiment, sparse training reduced training time by 63.7% (from 30 to 11 minutes) with only minor quality loss: PSNR dropped 0.59 dB (from 32.60 to 32.01) and SSIM by 0.008 (from 0.948 to 0.940). The resulting neural representation, stored solely as network weights, achieves a compression rate of 14 and eliminates traditional data-loading overhead. This connects coordinate-based neural representation with efficient volumetric compression, offering a scalable, structure-free solution for practical applications.
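A minimal sketch of the two ingredients named in the abstract: random Fourier-feature encoding of voxel coordinates and dilation-based selection of active voxels. The threshold, feature size, and projection scale are illustrative assumptions, not the paper's settings:

```python
import numpy as np
from scipy.ndimage import binary_dilation

def fourier_features(coords: np.ndarray, B: np.ndarray) -> np.ndarray:
    """Map normalized 3D coordinates to random Fourier features."""
    proj = 2.0 * np.pi * coords @ B.T              # (N, num_feats)
    return np.concatenate([np.sin(proj), np.cos(proj)], axis=-1)

# Select 'active' voxels above a threshold, then dilate for a safety margin.
volume = np.random.rand(32, 32, 32)
active = binary_dilation(volume > 0.9, iterations=2)
coords = np.argwhere(active) / 31.0                # coordinates in [0, 1]

B = np.random.normal(scale=10.0, size=(64, 3))     # random projection matrix
feats = fourier_features(coords, B)
print(coords.shape, feats.shape)                   # (N, 3) (N, 128)
```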
[152] MADPromptS: Unlocking Zero-Shot Morphing Attack Detection with Multiple Prompt Aggregation
Eduarda Caldeira, Fadi Boutros, Naser Damer
Main category: cs.CV
TL;DR: The paper explores a zero-shot approach for Face Morphing Attack Detection (MAD) using CLIP without fine-tuning, improving performance through prompt aggregation.
Details
Motivation: Addressing the challenge of MAD in face recognition security by leveraging CLIP's zero-shot capabilities without task-specific training.
Method: Uses CLIP with multiple textual prompts per class, aggregating their embeddings to align with MAD task requirements.
Result: Prompt aggregation significantly enhances zero-shot MAD detection performance.
Conclusion: Demonstrates the potential of foundation models like CLIP for generalizable MAD solutions through efficient prompt engineering.
Abstract: Face Morphing Attack Detection (MAD) is a critical challenge in face recognition security, where attackers can fool systems by interpolating the identity information of two or more individuals into a single face image, resulting in samples that can be verified as belonging to multiple identities by face recognition systems. While multimodal foundation models (FMs) like CLIP offer strong zero-shot capabilities by jointly modeling images and text, most prior works on FMs for biometric recognition have relied on fine-tuning for specific downstream tasks, neglecting their potential for direct, generalizable deployment. This work explores a pure zero-shot approach to MAD by leveraging CLIP without any additional training or fine-tuning, focusing instead on the design and aggregation of multiple textual prompts per class. By aggregating the embeddings of diverse prompts, we better align the model’s internal representations with the MAD task, capturing richer and more varied cues indicative of bona-fide or attack samples. Our results show that prompt aggregation substantially improves zero-shot detection performance, demonstrating the effectiveness of exploiting foundation models’ built-in multimodal knowledge through efficient prompt engineering.
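A sketch of prompt aggregation for zero-shot classification with an off-the-shelf CLIP checkpoint; the prompt wording and model choice are illustrative assumptions, not the authors' prompt set:

```python
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Several prompts per class; wording here is invented for illustration.
prompts = {
    "bona_fide": ["a genuine photo of a real person's face",
                  "an authentic, unaltered face image"],
    "morph":     ["a morphed face blending two identities",
                  "a manipulated composite face image"],
}

with torch.no_grad():
    class_embs = []
    for texts in prompts.values():
        inputs = processor(text=texts, return_tensors="pt", padding=True)
        emb = model.get_text_features(**inputs)
        emb = emb / emb.norm(dim=-1, keepdim=True)
        class_embs.append(emb.mean(dim=0))   # aggregate the prompt embeddings
    class_embs = torch.stack(class_embs)

# For a probe face: image_emb = normalized model.get_image_features(...);
# scores = image_emb @ class_embs.T, and argmax gives the zero-shot decision.
```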
[153] UniSTFormer: Unified Spatio-Temporal Lightweight Transformer for Efficient Skeleton-Based Action Recognition
Wenhan Wu, Zhishuai Guo, Chen Chen, Aidong Lu
Main category: cs.CV
TL;DR: A lightweight transformer framework for skeleton-based action recognition integrates spatio-temporal modeling in a single module, reducing redundancy and computational costs while maintaining accuracy.
Details
Motivation: Existing transformer-based methods for skeleton-based action recognition are complex, computationally expensive, and lack scalability.
Method: Proposes a unified spatio-temporal lightweight transformer with a single attention module and a simplified multi-scale pooling fusion module.
Result: Reduces parameters by 58% and computational cost by 60% while maintaining competitive recognition performance.
Conclusion: The framework achieves a superior balance between efficiency and accuracy, addressing limitations of existing methods.
Abstract: Skeleton-based action recognition (SAR) has achieved impressive progress with transformer architectures. However, existing methods often rely on complex module compositions and heavy designs, leading to increased parameter counts, high computational costs, and limited scalability. In this paper, we propose a unified spatio-temporal lightweight transformer framework that integrates spatial and temporal modeling within a single attention module, eliminating the need for separate temporal modeling blocks. This approach reduces redundant computations while preserving temporal awareness within the spatial modeling process. Furthermore, we introduce a simplified multi-scale pooling fusion module that combines local and global pooling pathways to enhance the model’s ability to capture fine-grained local movements and overarching global motion patterns. Extensive experiments on benchmark datasets demonstrate that our lightweight model achieves a superior balance between accuracy and efficiency, reducing parameter complexity by over 58% and lowering computational cost by over 60% compared to state-of-the-art transformer-based baselines, while maintaining competitive recognition performance.
[154] Lay2Story: Extending Diffusion Transformers for Layout-Togglable Story Generation
Ao Ma, Jiasong Feng, Ke Cao, Jing Wang, Yun Wang, Quanwei Zhang, Zhanjie Zhang
Main category: cs.CV
TL;DR: The paper introduces Layout-Togglable Storytelling, leveraging layout conditions for fine-grained control in storytelling tasks, and presents Lay2Story-1M dataset and Lay2Story framework, outperforming SOTA methods.
Details
Motivation: Existing methods struggle with subject consistency and precise control due to lack of fine-grained guidance and high-quality data.
Method: Proposes Layout-Togglable Storytelling with layout conditions, introduces the Lay2Story-1M dataset, and develops the Lay2Story framework based on DiTs.
Result: Outperforms SOTA in consistency, semantic correlation, and aesthetic quality.
Conclusion: Layout conditions and Lay2Story framework significantly improve storytelling tasks, validated by experiments.
Abstract: Storytelling tasks involving generating consistent subjects have gained significant attention recently. However, existing methods, whether training-free or training-based, continue to face challenges in maintaining subject consistency due to the lack of fine-grained guidance and inter-frame interaction. Additionally, the scarcity of high-quality data in this field makes it difficult to precisely control storytelling tasks, including the subject’s position, appearance, clothing, expression, and posture, thereby hindering further advancements. In this paper, we demonstrate that layout conditions, such as the subject’s position and detailed attributes, effectively facilitate fine-grained interactions between frames. This not only strengthens the consistency of the generated frame sequence but also allows for precise control over the subject’s position, appearance, and other key details. Building on this, we introduce an advanced storytelling task: Layout-Togglable Storytelling, which enables precise subject control by incorporating layout conditions. To address the lack of high-quality datasets with layout annotations for this task, we develop Lay2Story-1M, which contains over 1 million 720p and higher-resolution images, processed from approximately 11,300 hours of cartoon videos. Building on Lay2Story-1M, we create Lay2Story-Bench, a benchmark with 3,000 prompts designed to evaluate the performance of different methods on this task. Furthermore, we propose Lay2Story, a robust framework based on the Diffusion Transformers (DiTs) architecture for Layout-Togglable Storytelling tasks. Through both qualitative and quantitative experiments, we find that our method outperforms the previous state-of-the-art (SOTA) techniques, achieving the best results in terms of consistency, semantic correlation, and aesthetic quality.
[155] Text-conditioned State Space Model For Domain-generalized Change Detection Visual Question Answering
Elman Ghazaei, Erchan Aptoula
Main category: cs.CV
TL;DR: The paper introduces a new dataset, BrightVQA, and a model, TCSSM, to address domain shift in Change Detection Visual Question Answering (CDVQA), outperforming existing methods.
Details
Motivation: Traditional change detection methods require expert knowledge, and existing CDVQA methods assume similar training-testing distributions, which is unrealistic due to domain shifts.
Method: Proposes a Text-Conditioned State Space Model (TCSSM) that integrates bi-temporal imagery and geo-disaster-related text to extract domain-invariant features.
Result: TCSSM demonstrates superior performance in experiments compared to state-of-the-art models.
Conclusion: The work advances CDVQA by addressing domain shift, with the dataset and model made publicly available.
Abstract: The Earth’s surface is constantly changing, and detecting these changes provides valuable insights that benefit various aspects of human society. While traditional change detection methods have been employed to detect changes from bi-temporal images, these approaches typically require expert knowledge for accurate interpretation. To enable broader and more flexible access to change information by non-expert users, the task of Change Detection Visual Question Answering (CDVQA) has been introduced. However, existing CDVQA methods have been developed under the assumption that training and testing datasets share similar distributions. This assumption does not hold in real-world applications, where domain shifts often occur. In this paper, the CDVQA task is revisited with a focus on addressing domain shift. To this end, a new multi-modal and multi-domain dataset, BrightVQA, is introduced to facilitate domain generalization research in CDVQA. Furthermore, a novel state space model, termed Text-Conditioned State Space Model (TCSSM), is proposed. The TCSSM framework is designed to leverage both bi-temporal imagery and geo-disaster-related textual information in a unified manner to extract domain-invariant features across domains. Input-dependent parameters in TCSSM are dynamically predicted from both the bi-temporal images and the geo-disaster-related descriptions, thereby facilitating the alignment between bi-temporal visual data and the associated textual descriptions. Extensive experiments are conducted to evaluate the proposed method against state-of-the-art models, and superior performance is consistently demonstrated. The code and dataset will be made publicly available upon acceptance at https://github.com/Elman295/TCSSM.
[156] TaoCache: Structure-Maintained Video Generation Acceleration
Zhentao Fan, Zongzuo Wang, Weiwei Zhang
Main category: cs.CV
TL;DR: TaoCache is a training-free caching strategy for video diffusion models that improves visual quality by focusing on late denoising stages and calibrating noise deltas.
Details
Motivation: Existing cache-based methods cause structural discrepancies and hinder instruction following and character consistency by skipping early or mid denoising steps.
Method: TaoCache adopts a fixed-point perspective to predict noise output, calibrating cosine similarities and norm ratios of consecutive noise deltas for aggressive skipping while preserving structure.
Result: TaoCache achieves higher visual quality (LPIPS, SSIM, PSNR) than prior methods under the same speedups, as tested on Latte-1, OpenSora-Plan v110, and Wan2.1.
Conclusion: TaoCache is an effective, plug-and-play solution for accelerating video diffusion models without compromising visual quality, and it integrates well with other acceleration techniques.
Abstract: Existing cache-based acceleration methods for video diffusion models primarily skip early or mid denoising steps, which often leads to structural discrepancies relative to full-timestep generation and can hinder instruction following and character consistency. We present TaoCache, a training-free, plug-and-play caching strategy that, instead of residual-based caching, adopts a fixed-point perspective to predict the model’s noise output and is specifically effective in late denoising stages. By calibrating cosine similarities and norm ratios of consecutive noise deltas, TaoCache preserves high-resolution structure while enabling aggressive skipping. The approach is orthogonal to complementary accelerations such as Pyramid Attention Broadcast (PAB) and TeaCache, and it integrates seamlessly into DiT-based frameworks. Across Latte-1, OpenSora-Plan v110, and Wan2.1, TaoCache attains substantially higher visual quality (LPIPS, SSIM, PSNR) than prior caching methods under the same speedups.
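One plausible reading of the skip criterion: reuse a cached prediction only when consecutive noise deltas agree in direction (cosine similarity) and magnitude (norm ratio). A toy sketch; the thresholds are invented for illustration and are not the paper's calibration:

```python
import torch

def should_reuse_cached_noise(prev_delta: torch.Tensor,
                              curr_delta: torch.Tensor,
                              cos_thresh: float = 0.99,
                              ratio_band: tuple = (0.9, 1.1)) -> bool:
    """Heuristic skip test: if consecutive noise deltas point in nearly the
    same direction and have similar magnitude, reuse the cached prediction."""
    cos = torch.nn.functional.cosine_similarity(
        prev_delta.flatten(), curr_delta.flatten(), dim=0)
    ratio = curr_delta.norm() / (prev_delta.norm() + 1e-8)
    return cos.item() > cos_thresh and ratio_band[0] < ratio.item() < ratio_band[1]
```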
[157] ColorGPT: Leveraging Large Language Models for Multimodal Color Recommendation
Ding Xia, Naoto Inoue, Qianru Qiu, Kotaro Kikuchi
Main category: cs.CV
TL;DR: The paper explores using pretrained LLMs for color recommendation in design, introducing ColorGPT, a pipeline that outperforms traditional methods in accuracy and diversity.
Details
Motivation: Traditional color recommendation methods struggle with complexity and data limitations, prompting the exploration of LLMs for superior design suggestions.
Method: Developed ColorGPT, a pipeline using LLMs, tested color representations, and applied prompt engineering for color palette completion and full palette generation.
Result: ColorGPT outperformed existing methods in accuracy and color distribution for palette completion and improved diversity and similarity in full palette generation.
Conclusion: Pretrained LLMs, like ColorGPT, are effective for color recommendation tasks, offering better performance than traditional methods.
Abstract: Colors play a crucial role in the design of vector graphic documents by enhancing visual appeal, facilitating communication, improving usability, and ensuring accessibility. In this context, color recommendation involves suggesting appropriate colors to complete or refine a design when one or more colors are missing or require alteration. Traditional methods often struggled with these challenges due to the complex nature of color design and the limited data availability. In this study, we explored the use of pretrained Large Language Models (LLMs) and their commonsense reasoning capabilities for color recommendation, raising the question: Can pretrained LLMs serve as superior designers for color recommendation tasks? To investigate this, we developed a robust, rigorously validated pipeline, ColorGPT, that was built by systematically testing multiple color representations and applying effective prompt engineering techniques. Our approach primarily targeted color palette completion by recommending colors based on a set of given colors and accompanying context. Moreover, our method can be extended to full palette generation, producing an entire color palette corresponding to a provided textual description. Experimental results demonstrated that our LLM-based pipeline outperformed existing methods in terms of color suggestion accuracy and the distribution of colors in the color palette completion task. For the full palette generation task, our approach also yielded improvements in color diversity and similarity compared to current techniques.
[158] KFFocus: Highlighting Keyframes for Enhanced Video Understanding
Ming Nie, Chunwei Wang, Hang Xu, Li Zhang
Main category: cs.CV
TL;DR: KFFocus improves video LLMs by efficiently compressing video tokens and emphasizing keyframes, outperforming existing methods in computational efficiency and accuracy.
Details
Motivation: Current video LLMs use uniform sampling and frame condensation, risking omission of keyframes with essential details due to uneven temporal information distribution.
Method: KFFocus replaces uniform sampling with a refined approach to identify keyframes based on temporal redundancy and assigns varying condensation ratios. It includes a spatiotemporal modeling module for nuanced understanding.
Result: KFFocus outperforms existing methods on video understanding benchmarks, especially in long video scenarios, with improved computational efficiency and accuracy.
Conclusion: KFFocus effectively addresses the limitations of current video LLMs by optimizing token compression and emphasizing keyframes, enhancing both efficiency and performance.
Abstract: Recently, with the emergence of large language models, multimodal LLMs have demonstrated exceptional capabilities in image and video modalities. Despite advancements in video comprehension, the substantial computational demands of long video sequences lead current video LLMs (Vid-LLMs) to employ compression strategies at both the inter-frame level (e.g., uniform sampling of video frames) and intra-frame level (e.g., condensing all visual tokens of each frame into a limited number). However, this approach often neglects the uneven temporal distribution of critical information across frames, risking the omission of keyframes that contain essential temporal and semantic details. To tackle these challenges, we propose KFFocus, a method designed to efficiently compress video tokens and emphasize the informative context present within video frames. We substitute uniform sampling with a refined approach inspired by classic video compression principles to identify and capture keyframes based on their temporal redundancy. By assigning varying condensation ratios to frames based on their contextual relevance, KFFocus efficiently reduces token redundancy while preserving informative content details. Additionally, we introduce a spatiotemporal modeling module that encodes both the temporal relationships between video frames and the spatial structure within each frame, thus providing Vid-LLMs with a nuanced understanding of spatial-temporal dynamics. Extensive experiments on widely recognized video understanding benchmarks, especially long video scenarios, demonstrate that KFFocus significantly outperforms existing methods, achieving substantial computational efficiency and enhanced accuracy.
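A toy sketch in the spirit described above: score frames by temporal redundancy and allocate a token budget proportionally. The frame-difference scoring and the budget value are our assumptions, not the paper's exact procedure:

```python
import numpy as np

def keyframe_scores(frames: np.ndarray) -> np.ndarray:
    """Score each frame by how much it differs from its predecessor;
    low temporal redundancy suggests a more informative keyframe."""
    diffs = np.abs(np.diff(frames.astype(np.float32), axis=0)).mean(axis=(1, 2, 3))
    scores = np.concatenate([[diffs.max()], diffs])  # keep the first frame salient
    return scores / (scores.max() + 1e-8)

frames = np.random.randint(0, 256, size=(16, 224, 224, 3), dtype=np.uint8)
scores = keyframe_scores(frames)
budget = 512  # total visual-token budget across frames (illustrative)
tokens_per_frame = np.maximum(1, scores / scores.sum() * budget).astype(int)
print(tokens_per_frame)
```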
[159] Spatial-Temporal Multi-Scale Quantization for Flexible Motion Generation
Zan Wang, Jingze Zhang, Yixin Chen, Baoxiong Jia, Wei Liang, Siyuan Huang
Main category: cs.CV
TL;DR: MSQ introduces a multi-scale quantization method for human motion generation, addressing limitations in current representations by enabling flexible composition and multi-scale modeling.
Details
Motivation: Current motion representations lack multi-scale perspective and compositional flexibility, limiting complex pattern modeling and generalization.
Method: MSQ compresses motion sequences into multi-scale discrete tokens using distinct encoders for spatial granularities and temporal interpolation, followed by quantization.
Result: The method outperforms baselines, enabling seamless token composition without specialized design or re-training.
Conclusion: MSQ provides an effective solution for motion editing, control, and conditional generation, demonstrating superior performance.
Abstract: Despite significant advancements in human motion generation, current motion representations, typically formulated as discrete frame sequences, still face two critical limitations: (i) they fail to capture motion from a multi-scale perspective, limiting the capability in complex patterns modeling; (ii) they lack compositional flexibility, which is crucial for model’s generalization in diverse generation tasks. To address these challenges, we introduce MSQ, a novel quantization method that compresses the motion sequence into multi-scale discrete tokens across spatial and temporal dimensions. MSQ employs distinct encoders to capture body parts at varying spatial granularities and temporally interpolates the encoded features into multiple scales before quantizing them into discrete tokens. Building on this representation, we establish a generative mask modeling model to effectively support motion editing, motion control, and conditional motion generation. Through quantitative and qualitative analysis, we show that our quantization method enables the seamless composition of motion tokens without requiring specialized design or re-training. Furthermore, extensive evaluations demonstrate that our approach outperforms existing baseline methods on various benchmarks.
[160] UniConvNet: Expanding Effective Receptive Field while Maintaining Asymptotically Gaussian Distribution for ConvNets of Any Scale
Yuhao Wang, Wei Xi
Main category: cs.CV
TL;DR: The paper proposes UniConvNet, a universal ConvNet model that expands the effective receptive field (ERF) efficiently by combining smaller kernels while maintaining asymptotically Gaussian distribution (AGD). It outperforms state-of-the-art CNNs and ViTs on various tasks.
Details
Motivation: Large ERF ConvNets face high computational costs and disrupted AGD. The paper aims to address this by optimizing ERF expansion with smaller kernels.
Method: Introduces a Three-layer Receptive Field Aggregator and a Layer Operator to expand ERF while preserving AGD. Proposes UniConvNet as a universal model for any scale.
Result: UniConvNet achieves 84.2% top-1 accuracy on ImageNet with 30M parameters and 5.1G FLOPs, and scales well to larger models (88.4% accuracy).
Conclusion: UniConvNet is an efficient and scalable alternative to large-kernel ConvNets, maintaining AGD and outperforming existing models.
Abstract: Convolutional neural networks (ConvNets) with large effective receptive field (ERF), still in their early stages, have demonstrated promising effectiveness while constrained by high parameter and FLOPs costs and disrupted asymptotically Gaussian distribution (AGD) of ERF. This paper proposes an alternative paradigm: rather than merely employing extremely large ERF, it is more effective and efficient to expand the ERF while maintaining AGD of ERF by proper combination of smaller kernels, such as 7×7, 9×9, and 11×11. This paper introduces a Three-layer Receptive Field Aggregator and designs a Layer Operator as the fundamental operator from the perspective of receptive field. The ERF can be expanded to the level of existing large-kernel ConvNets through the stack of proposed modules while maintaining AGD of ERF. Using these designs, we propose a universal model for ConvNet of any scale, termed UniConvNet. Extensive experiments on ImageNet-1K, COCO2017, and ADE20K demonstrate that UniConvNet outperforms state-of-the-art CNNs and ViTs across various vision recognition tasks for both lightweight and large-scale models with comparable throughput. Surprisingly, UniConvNet-T achieves 84.2% ImageNet top-1 accuracy with 30M parameters and 5.1G FLOPs. UniConvNet-XL also shows competitive scalability to big data and large models, acquiring 88.4% top-1 accuracy on ImageNet. Code and models are publicly available at https://github.com/ai-paperwithcode/UniConvNet.
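A minimal sketch of the paradigm: aggregate parallel depthwise convolutions with 7×7, 9×9, and 11×11 kernels so that stacking such blocks grows the ERF while each branch stays cheap. This is an illustration, not the paper's actual Layer Operator:

```python
import torch
import torch.nn as nn

class ReceptiveFieldAggregator(nn.Module):
    """Residual sum of parallel depthwise convs with 7x7, 9x9, 11x11 kernels;
    stacking such blocks expands the effective receptive field cheaply."""
    def __init__(self, channels: int):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv2d(channels, channels, k, padding=k // 2, groups=channels)
            for k in (7, 9, 11)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + sum(branch(x) for branch in self.branches)

block = ReceptiveFieldAggregator(64)
print(block(torch.randn(1, 64, 56, 56)).shape)  # torch.Size([1, 64, 56, 56])
```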
[161] Towards Perfection: Building Inter-component Mutual Correction for Retinex-based Low-light Image Enhancement
Luyang Cao, Han Xu, Jian Zhang, Lei Qi, Jiayi Ma, Yinghuan Shi, Yang Gao
Main category: cs.CV
TL;DR: The paper introduces Inter-component residuals (ICR) in Retinex-based low-light image enhancement, proposes the IRetinex model to reduce ICR, and demonstrates superior performance over existing methods.
Details
Motivation: Previous Retinex-based methods underestimate residuals (ICR) after decomposition, which degrade image quality. Addressing ICR can improve decomposition and enhancement accuracy.
Method: Proposes IRetinex with two stages: decomposition (reduces ICR via feature similarity reduction) and enhancement (mitigates ICR impact using feature similarity).
Result: Extensive experiments on benchmark datasets show IRetinex outperforms state-of-the-art methods in quality and metrics.
Conclusion: Reducing ICR via IRetinex significantly improves low-light image enhancement, offering better decomposition and synthesis outcomes.
Abstract: In low-light image enhancement, Retinex-based deep learning methods have garnered significant attention due to their exceptional interpretability. These methods decompose images into mutually independent illumination and reflectance components, allowing each component to be enhanced separately. In fact, achieving perfect decomposition of illumination and reflectance components proves to be quite challenging, with some residuals still existing after decomposition. In this paper, we formally name these residuals inter-component residuals (ICR), which have been largely underestimated by previous methods. In our investigation, ICR not only affects the accuracy of the decomposition but also causes enhanced components to deviate from the ideal outcome, ultimately reducing the final synthesized image quality. To address this issue, we propose a novel Inter-correction Retinex model (IRetinex) to alleviate ICR during the decomposition and enhancement stages. In the decomposition stage, we leverage an inter-component residual reduction module to reduce the feature similarity between illumination and reflectance components. In the enhancement stage, we utilize the feature similarity between the two components to detect and mitigate the impact of ICR within each enhancement unit. Extensive experiments on three low-light benchmark datasets demonstrated that by reducing ICR, our method outperforms state-of-the-art approaches both qualitatively and quantitatively.
[162] Uncertainty-aware Cross-training for Semi-supervised Medical Image Segmentation
Kaiwen Huang, Tao Zhou, Huazhu Fu, Yizhe Zhang, Yi Zhou, Xiao-Jun Wu
Main category: cs.CV
TL;DR: UC-Seg is a semi-supervised medical image segmentation framework that addresses cognitive biases and pseudo-label generation challenges by using cross-subnet consistency and uncertainty-aware methods, outperforming existing techniques.
Details
Motivation: To reduce reliance on expert annotations and address limitations of existing mean-teacher methods, such as cognitive biases and unreliable pseudo-labels.
Method: UC-Seg employs two subnets with Cross-subnet Consistency Preservation (CCP) and Uncertainty-aware Pseudo-label Generation (UPG) to enhance feature representation and generate high-confidence pseudo-labels.
Result: UC-Seg achieves superior segmentation accuracy and generalization across various medical image modalities (MRI, CT, ultrasound, etc.).
Conclusion: The proposed framework effectively mitigates biases and improves segmentation performance, demonstrating its potential for semi-supervised medical image analysis.
Abstract: Semi-supervised learning has gained considerable popularity in medical image segmentation tasks due to its capability to reduce reliance on expert-examined annotations. Several mean-teacher (MT) based semi-supervised methods utilize consistency regularization to effectively leverage valuable information from unlabeled data. However, these methods often heavily rely on the student model and overlook the potential impact of cognitive biases within the model. Furthermore, some methods employ co-training using pseudo-labels derived from different inputs, yet generating high-confidence pseudo-labels from perturbed inputs during training remains a significant challenge. In this paper, we propose an Uncertainty-aware Cross-training framework for semi-supervised medical image Segmentation (UC-Seg). Our UC-Seg framework incorporates two distinct subnets to effectively explore and leverage the correlation between them, thereby mitigating cognitive biases within the model. Specifically, we present a Cross-subnet Consistency Preservation (CCP) strategy to enhance feature representation capability and ensure feature consistency across the two subnets. This strategy enables each subnet to correct its own biases and learn shared semantics from both labeled and unlabeled data. Additionally, we propose an Uncertainty-aware Pseudo-label Generation (UPG) component that leverages segmentation results and corresponding uncertainty maps from both subnets to generate high-confidence pseudo-labels. We extensively evaluate the proposed UC-Seg on various medical image segmentation tasks involving different modality images, such as MRI, CT, ultrasound, colonoscopy, and so on. The results demonstrate that our method achieves superior segmentation accuracy and generalization performance compared to other state-of-the-art semi-supervised methods. Our code will be released at https://github.com/taozh2017/UCSeg.
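A sketch of uncertainty-aware pseudo-labeling in the spirit of UPG: average the two subnets' predictions and keep only low-entropy pixels. The entropy threshold and the averaging rule are our assumptions, not the paper's exact formulation:

```python
import torch

def confident_pseudo_labels(logits_a: torch.Tensor,
                            logits_b: torch.Tensor,
                            max_entropy: float = 0.3):
    """Average two subnets' predictions and keep only low-entropy pixels.
    Returns pseudo-labels plus a boolean confidence mask."""
    probs = 0.5 * (logits_a.softmax(dim=1) + logits_b.softmax(dim=1))
    entropy = -(probs * probs.clamp_min(1e-8).log()).sum(dim=1)
    mask = entropy < max_entropy
    return probs.argmax(dim=1), mask

logits_a = torch.randn(2, 3, 64, 64)  # (batch, classes, H, W)
logits_b = torch.randn(2, 3, 64, 64)
labels, mask = confident_pseudo_labels(logits_a, logits_b)
print(labels.shape, mask.float().mean().item())
```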
[163] When Deepfakes Look Real: Detecting AI-Generated Faces with Unlabeled Data due to Annotation Challenges
Zhiqiang Yang, Renshuai Tao, Xiaolong Zheng, Guodong Yang, Chunjie Zhang
Main category: cs.CV
TL;DR: DPGNet addresses deepfake detection challenges by leveraging unlabeled data and bridging domain gaps, outperforming SoTA by 6.3%.
Details
Motivation: Human annotators struggle to label deepfakes due to their realism, making labeled data unreliable and scarce.
Method: DPGNet uses text-guided cross-domain alignment and curriculum-driven pseudo label generation, with cross-domain knowledge distillation.
Result: DPGNet outperforms state-of-the-art methods by 6.3% on 11 datasets.
Conclusion: DPGNet effectively leverages unlabeled data to tackle deepfake detection challenges caused by realistic AI-generated content.
Abstract: Existing deepfake detection methods heavily depend on labeled training data. However, as AI-generated content becomes increasingly realistic, even human annotators struggle to distinguish between deepfakes and authentic images. This makes the labeling process both time-consuming and less reliable. Specifically, there is a growing demand for approaches that can effectively utilize large-scale unlabeled data from online social networks. Unlike typical unsupervised learning tasks, where categories are distinct, AI-generated faces closely mimic real image distributions and share strong similarities, causing a performance drop in conventional strategies. In this paper, we introduce the Dual-Path Guidance Network (DPGNet) to tackle two key challenges: (1) bridging the domain gap between faces from different generation models, and (2) utilizing unlabeled image samples. The method features two core modules: text-guided cross-domain alignment, which uses learnable prompts to unify visual and textual embeddings into a domain-invariant feature space, and curriculum-driven pseudo label generation, which dynamically exploits more informative unlabeled samples. To prevent catastrophic forgetting, we also facilitate bridging between domains via cross-domain knowledge distillation. Extensive experiments on 11 popular datasets show that DPGNet outperforms SoTA approaches by 6.3%, highlighting its effectiveness in leveraging unlabeled data to address the annotation challenges posed by the increasing realism of deepfakes.
[164] Spatial Traces: Enhancing VLA Models with Spatial-Temporal Understanding
Maxim A. Patratskiy, Alexey K. Kovalev, Aleksandr I. Panov
Main category: cs.CV
TL;DR: The paper introduces a novel Vision-Language-Action model integrating spatial and temporal understanding via visual prompting, improving task success rates with minimal training data.
Details
Motivation: To enhance agent movement prediction by combining spatial and temporal understanding, addressing limitations of prior independent approaches.
Method: Projects visual traces of key points onto depth maps to simultaneously capture spatial and temporal information.
Result: Experiments in SimplerEnv show a 4% improvement over SpatialVLA and 19% over TraceVLA in task success rates.
Conclusion: The method effectively integrates spatial and temporal data, requiring minimal training, making it practical for real-world applications.
Abstract: Vision-Language-Action models have demonstrated remarkable capabilities in predicting agent movements within virtual environments and real-world scenarios based on visual observations and textual instructions. Although recent research has focused on enhancing spatial and temporal understanding independently, this paper presents a novel approach that integrates both aspects through visual prompting. We introduce a method that projects visual traces of key points from observations onto depth maps, enabling models to capture both spatial and temporal information simultaneously. The experiments in SimplerEnv show that the mean number of tasks successfully solved increased by 4% compared to SpatialVLA and by 19% compared to TraceVLA. Furthermore, we show that this enhancement can be achieved with minimal training data, making it particularly valuable for real-world applications where data collection is challenging. The project page is available at https://ampiromax.github.io/ST-VLA.
[165] Per-Query Visual Concept Learning
Ori Malca, Dvir Samuel, Gal Chechik
Main category: cs.CV
TL;DR: The paper introduces a method to enhance visual concept learning by adding a prompt- and noise-specific personalization step using self- and cross-attention losses, improving semantic similarity.
Details
Motivation: To improve the personalization of pretrained models for applications like product placement and entertainment by addressing the limitations of existing methods.
Method: Augments existing methods with a personalization step using PDM features and two loss terms (self- and cross-attention) to capture concept identity.
Result: Significant improvements over six personalization methods and various base models, including UNet- and DiT-based ones.
Conclusion: The proposed method effectively enhances personalized semantic similarity, outperforming previous per-query personalization approaches.
Abstract: Visual concept learning, also known as Text-to-image personalization, is the process of teaching new concepts to a pretrained model. This has numerous applications from product placement to entertainment and personalized design. Here we show that many existing methods can be substantially augmented by adding a personalization step that (1) is specific to the prompt and noise seed, and (2) uses two loss terms based on self- and cross-attention, capturing the identity of the personalized concept. Specifically, we leverage PDM features (previously designed to capture identity) and show how they can be used to improve personalized semantic similarity. We evaluate the benefit that our method gains on top of six different personalization methods, and several base text-to-image models (both UNet- and DiT-based). We find significant improvements even over previous per-query personalization methods.
[166] ALFred: An Active Learning Framework for Real-world Semi-supervised Anomaly Detection with Adaptive Thresholds
Shanle Yao, Ghazal Alinezhad Noghre, Armin Danesh Pazho, Hamed Tabkhi
Main category: cs.CV
TL;DR: An active learning framework for Video Anomaly Detection (VAD) improves adaptability in dynamic environments by leveraging human-in-the-loop feedback and adaptive thresholds.
Details
Motivation: VAD struggles in real-world settings due to dynamic human actions and environmental variations, requiring better evaluation metrics and adaptability.
Method: Introduces an active learning framework with human-in-the-loop feedback to refine pseudo-labels and define adaptive thresholds for different environments.
Result: Achieves an EBI of 68.91 in simulated real-world scenarios, demonstrating practical effectiveness.
Conclusion: The framework enhances VAD’s applicability in dynamic settings by improving adaptability and accuracy.
Abstract: Video Anomaly Detection (VAD) can play a key role in spotting unusual activities in video footage. VAD is difficult to use in real-world settings due to the dynamic nature of human actions, environmental variations, and domain shifts. Traditional evaluation metrics often prove inadequate for such scenarios, as they rely on static assumptions and fall short of identifying a threshold that distinguishes normal from anomalous behavior in dynamic settings. To address this, we introduce an active learning framework tailored for VAD, designed for adapting to the ever-changing real-world conditions. Our approach leverages active learning to continuously select the most informative data points for labeling, thereby enhancing model adaptability. A critical innovation is the incorporation of a human-in-the-loop mechanism, which enables the identification of actual normal and anomalous instances from pseudo-labeling results generated by AI. This collected data allows the framework to define an adaptive threshold tailored to different environments, ensuring that the system remains effective as the definition of ’normal’ shifts across various settings. Implemented within a lab-based framework that simulates real-world conditions, our approach allows rigorous testing and refinement of VAD algorithms with a new metric. Experimental results show that our method achieves an EBI (Error Balance Index) of 68.91 for Q3 in real-world simulated scenarios, demonstrating its practical effectiveness and significantly enhancing the applicability of VAD in dynamic environments.
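One plausible instantiation of an adaptive threshold: given human-verified anomaly scores from the current environment, pick the cutoff that maximizes balanced accuracy. The criterion is our assumption for illustration, not the paper's EBI computation:

```python
import numpy as np

def adaptive_threshold(scores: np.ndarray, labels: np.ndarray) -> float:
    """Pick the anomaly-score threshold that maximizes balanced accuracy
    on human-verified samples from the current environment."""
    best_t, best_bacc = scores.min(), -1.0
    for t in np.unique(scores):
        pred = scores >= t
        tpr = (pred & (labels == 1)).sum() / max((labels == 1).sum(), 1)
        tnr = (~pred & (labels == 0)).sum() / max((labels == 0).sum(), 1)
        bacc = 0.5 * (tpr + tnr)
        if bacc > best_bacc:
            best_t, best_bacc = t, bacc
    return best_t

scores = np.random.rand(200)
labels = (scores + 0.3 * np.random.rand(200) > 0.8).astype(int)  # toy labels
print(adaptive_threshold(scores, labels))
```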
[167] VLM-3D:End-to-End Vision-Language Models for Open-World 3D Perception
Fuhao Chang, Shuxin Li, Yabei Li, Lei He
Main category: cs.CV
TL;DR: VLM-3D is an end-to-end framework using Visual Language Models (VLMs) for 3D geometric perception in autonomous driving, improving accuracy by 12.8% with a joint semantic-geometric loss design.
Details
Motivation: Addressing the challenge of open-set perception in autonomous driving, especially for unseen object categories, by leveraging VLMs' semantic reasoning capabilities.
Method: Proposes VLM-3D, integrating Low-Rank Adaptation (LoRA) for efficient VLM adaptation and a joint semantic-geometric loss (token-level semantic loss early, 3D IoU loss later) for stable convergence and refined accuracy.
Result: Achieves a 12.8% improvement in perception accuracy on the nuScenes dataset.
Conclusion: VLM-3D effectively combines VLMs with 3D perception, advancing open-set perception in autonomous driving.
Abstract: Open-set perception in complex traffic environments poses a critical challenge for autonomous driving systems, particularly in identifying previously unseen object categories, which is vital for ensuring safety. Visual Language Models (VLMs), with their rich world knowledge and strong semantic reasoning capabilities, offer new possibilities for addressing this task. However, existing approaches typically leverage VLMs to extract visual features and couple them with traditional object detectors, resulting in multi-stage error propagation that hinders perception accuracy. To overcome this limitation, we propose VLM-3D, the first end-to-end framework that enables VLMs to perform 3D geometric perception in autonomous driving scenarios. VLM-3D incorporates Low-Rank Adaptation (LoRA) to efficiently adapt VLMs to driving tasks with minimal computational overhead, and introduces a joint semantic-geometric loss design: token-level semantic loss is applied during early training to ensure stable convergence, while 3D IoU loss is introduced in later stages to refine the accuracy of 3D bounding box predictions. Evaluations on the nuScenes dataset demonstrate that the proposed joint semantic-geometric loss in VLM-3D leads to a 12.8% improvement in perception accuracy, fully validating the effectiveness and advancement of our method.
[168] Scaling Learned Image Compression Models up to 1 Billion
Yuqi Li, Haotian Zhang, Li Li, Dong Liu, Feng Wu
Main category: cs.CV
TL;DR: This paper explores scaling up learned image compression models, revealing performance trends via scaling laws, and shows that larger models achieve state-of-the-art results.
Details
Motivation: Current learned image compression models are limited in scale, and the impact of scaling on performance is unexplored.
Method: The study scales the HPCM model from 68.5M to 1B parameters, fitting power-law relations between test loss and scaling variables (model size, training compute).
Result: The scaled-up HPCM-1B model achieves state-of-the-art rate-distortion performance, revealing a scaling trend.
Conclusion: This work encourages future research on large-scale compression models and the link between compression and intelligence.
Abstract: Recent advances in large language models (LLMs) highlight a strong connection between intelligence and compression. Learned image compression, a fundamental task in modern data compression, has made significant progress in recent years. However, current models remain limited in scale, restricting their representation capacity, and how scaling model size influences compression performance remains unexplored. In this work, we present a pioneering study on scaling up learned image compression models and revealing the performance trends through scaling laws. Using the recent state-of-the-art HPCM model as a baseline, we scale model parameters from 68.5 million to 1 billion and fit power-law relations between test loss and key scaling variables, including model size and optimal training compute. The results reveal a scaling trend, enabling extrapolation to larger scale models. Experimental results demonstrate that the scaled-up HPCM-1B model achieves state-of-the-art rate-distortion performance. We hope this work inspires future exploration of large-scale compression models and deeper investigations into the connection between compression and intelligence.
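The power-law fitting step amounts to a linear regression in log-log space. The numbers below are made up for illustration and are not the paper's measurements:

```python
import numpy as np

# Hypothetical (model_size, test_loss) pairs; fit loss ~ a * N^(-b)
# by linear regression in log-log space.
sizes = np.array([68.5e6, 150e6, 300e6, 600e6, 1e9])
losses = np.array([1.20, 1.11, 1.04, 0.98, 0.93])  # illustrative values only

slope, intercept = np.polyfit(np.log(sizes), np.log(losses), deg=1)
a, b = np.exp(intercept), -slope
print(f"loss ~ {a:.3f} * N^(-{b:.4f})")

# Extrapolate to a hypothetical 3B-parameter model under the fitted law.
print(f"predicted loss @ 3B params: {a * (3e9) ** (-b):.3f}")
```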
[169] Addressing Bias in VLMs for Glaucoma Detection Without Protected Attribute Supervision
Ahsan Habib Akash, Greg Murray, Annahita Amireskandari, Joel Palko, Carol Laxson, Binod Bhattarai, Prashnna Gyawali
Main category: cs.CV
TL;DR: The paper introduces an attribute-agnostic debiasing method for Vision-Language Models (VLMs) in glaucoma screening, using unsupervised clustering and weighted contrastive learning to reduce demographic biases.
Details
Motivation: VLMs exhibit demographic biases even without explicit protected attributes during training, which is problematic for critical applications like glaucoma screening that disproportionately affect underserved populations.
Method: The method involves (i) inferring proxy subgroups via unsupervised clustering, (ii) computing gradient-similarity weights between multimodal and image-pair contrastive losses, and (iii) applying these weights in a joint objective to upweight underperforming clusters.
Result: The method is evaluated on the Harvard FairVLMed glaucoma subset, showing improved equitable performance across inferred demographic subgroups using metrics like EOD, ES AUC, and Groupwise AUC.
Conclusion: The proposed label-free approach effectively reduces subgroup disparities in glaucoma screening by adaptively targeting underperforming clusters.
Abstract: Vision-Language Models (VLMs) have achieved remarkable success on multimodal tasks such as image-text retrieval and zero-shot classification, yet they can exhibit demographic biases even when explicit protected attributes are absent during training. In this work, we focus on automated glaucoma screening from retinal fundus images, a critical application given that glaucoma is a leading cause of irreversible blindness and disproportionately affects underserved populations. Building on a reweighting-based contrastive learning framework, we introduce an attribute-agnostic debiasing method that (i) infers proxy subgroups via unsupervised clustering of image-image embeddings, (ii) computes gradient-similarity weights between the CLIP-style multimodal loss and a SimCLR-style image-pair contrastive loss, and (iii) applies these weights in a joint, top-$k$ weighted objective to upweight underperforming clusters. This label-free approach adaptively targets the hardest examples, thereby reducing subgroup disparities. We evaluate our method on the Harvard FairVLMed glaucoma subset, reporting Equalized Odds Distance (EOD), Equalized Subgroup AUC (ES AUC), and Groupwise AUC to demonstrate equitable performance across inferred demographic subgroups.
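Step (ii), gradient-similarity weighting, can be sketched as the cosine similarity between the gradients of the two losses with respect to shared parameters. This is a simplified toy version of the reweighting signal, not the authors' full objective:

```python
import torch

def gradient_similarity(loss_a: torch.Tensor,
                        loss_b: torch.Tensor,
                        params) -> torch.Tensor:
    """Cosine similarity between the gradients of two losses w.r.t. shared
    parameters; usable as a reweighting signal between objectives."""
    params = [p for p in params if p.requires_grad]
    grads_a = torch.autograd.grad(loss_a, params, retain_graph=True)
    grads_b = torch.autograd.grad(loss_b, params, retain_graph=True)
    flat_a = torch.cat([g.flatten() for g in grads_a])
    flat_b = torch.cat([g.flatten() for g in grads_b])
    return torch.nn.functional.cosine_similarity(flat_a, flat_b, dim=0)

model = torch.nn.Linear(8, 2)
x = torch.randn(4, 8)
loss_a = model(x).pow(2).mean()   # stand-in for the multimodal loss
loss_b = model(x).abs().mean()    # stand-in for the image-pair loss
print(gradient_similarity(loss_a, loss_b, model.parameters()))
```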
[170] Deep Learning Models for Robust Facial Liveness Detection
Oleksandr Kuznetsov, Emanuele Frontoni, Luca Romeo, Riccardo Rosati, Andrea Maranesi, Alessandro Muscatello
Main category: cs.CV
TL;DR: The paper proposes a deep learning-based solution (AttackNet V2.2) to improve liveness detection in facial recognition systems, achieving 99.9% accuracy against advanced spoofing attacks like deepfakes.
Details
Motivation: Existing liveness detection methods fail against sophisticated spoofing attacks, necessitating a more robust solution.
Method: Novel deep learning models integrating texture analysis and reflective properties to distinguish real traits from replicas.
Result: AttackNet V2.2 achieved 99.9% accuracy across diverse datasets, outperforming existing systems.
Conclusion: The study advances biometric security, offering reliable authentication and insights into evolving spoofing tactics.
Abstract: In the rapidly evolving landscape of digital security, biometric authentication systems, particularly facial recognition, have emerged as integral components of various security protocols. However, the reliability of these systems is compromised by sophisticated spoofing attacks, where imposters gain unauthorized access by falsifying biometric traits. Current literature reveals a concerning gap: existing liveness detection methodologies - designed to counteract these breaches - fall short against advanced spoofing tactics employing deepfakes and other artificial intelligence-driven manipulations. This study introduces a robust solution through novel deep learning models addressing the deficiencies in contemporary anti-spoofing techniques. By innovatively integrating texture analysis and reflective properties associated with genuine human traits, our models distinguish authentic presence from replicas with remarkable precision. Extensive evaluations were conducted across five diverse datasets, encompassing a wide range of attack vectors and environmental conditions. Results demonstrate substantial advancement over existing systems, with our best model (AttackNet V2.2) achieving 99.9% average accuracy when trained on combined data. Moreover, our research unveils critical insights into the behavioral patterns of impostor attacks, contributing to a more nuanced understanding of their evolving nature. The implications are profound: our models do not merely fortify the authentication processes but also instill confidence in biometric systems across various sectors reliant on secure access.
[171] Turbo-VAED: Fast and Stable Transfer of Video-VAEs to Mobile Devices
Ya Zou, Jingfeng Yao, Siyuan Yu, Shuai Zhang, Wenyu Liu, Xinggang Wang
Main category: cs.CV
TL;DR: Proposes Turbo-VAED, a mobile-optimized VAE decoder, reducing parameters and latency while maintaining quality for real-time 720p video decoding.
Details
Motivation: Address computational bottlenecks of video VAEs on mobile devices, caused by large parameter sizes and mismatched kernels.
Method: Analyzes redundancy in VAEs, integrates 3D depthwise separable convolutions, proposes decoupled 3D pixel shuffle, and distills the decoder for fast adaptation.
Result: Achieves 84.5x speedup, 17.5% parameter count, 96.9% quality retention, and 2.9x FPS improvement over mobile-optimized VAEs.
Conclusion: Turbo-VAED enables real-time mobile video decoding with low training cost and broad applicability.
Abstract: There is a growing demand for deploying large generative AI models on mobile devices. For recent popular video generative models, however, the Variational AutoEncoder (VAE) represents one of the major computational bottlenecks. Both large parameter sizes and mismatched kernels cause out-of-memory errors or extremely slow inference on mobile devices. To address this, we propose a low-cost solution that efficiently transfers widely used video VAEs to mobile devices. (1) We analyze redundancy in existing VAE architectures and get empirical design insights. By integrating 3D depthwise separable convolutions into our model, we significantly reduce the number of parameters. (2) We observe that the upsampling techniques in mainstream video VAEs are poorly suited to mobile hardware and form the main bottleneck. In response, we propose a decoupled 3D pixel shuffle scheme that slashes end-to-end delay. Building upon these, we develop a universal mobile-oriented VAE decoder, Turbo-VAED. (3) We propose an efficient VAE decoder training method. Since only the decoder is used during deployment, we distill it to Turbo-VAED instead of retraining the full VAE, enabling fast mobile adaptation with minimal performance loss. To our knowledge, our method enables real-time 720p video VAE decoding on mobile devices for the first time. This approach is widely applicable to most video VAEs. When integrated into four representative models, with training cost as low as $95, it accelerates original VAEs by up to 84.5x at 720p resolution on GPUs, uses as low as 17.5% of original parameter count, and retains 96.9% of the original reconstruction quality. Compared to mobile-optimized VAEs, Turbo-VAED achieves a 2.9x speedup in FPS and better reconstruction quality on the iPhone 16 Pro. The code and models will soon be available at https://github.com/hustvl/Turbo-VAED.
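A sketch of what a decoupled 3D pixel shuffle could look like: upsampling time and space in two cheap reshape steps instead of one fused operation. The exact factorization used in Turbo-VAED may differ:

```python
import torch

def pixel_shuffle_3d_decoupled(x: torch.Tensor, rt: int, rs: int) -> torch.Tensor:
    """Upsample a video tensor (B, C, T, H, W) by factor rt in time and rs in
    space via two reshape steps instead of a transposed 3D convolution."""
    b, c, t, h, w = x.shape
    # Temporal step: fold a factor of rt out of channels into the T axis.
    x = x.view(b, c // rt, rt, t, h, w).permute(0, 1, 3, 2, 4, 5)
    x = x.reshape(b, c // rt, t * rt, h, w)
    # Spatial step: a standard 2D pixel shuffle applied frame-wise.
    b, c, t2, h, w = x.shape
    x = x.view(b, c // (rs * rs), rs, rs, t2, h, w)
    x = x.permute(0, 1, 4, 5, 2, 6, 3).reshape(b, c // (rs * rs), t2, h * rs, w * rs)
    return x

x = torch.randn(1, 64, 4, 16, 16)
print(pixel_shuffle_3d_decoupled(x, rt=2, rs=2).shape)  # (1, 8, 8, 32, 32)
```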
[172] HumanOLAT: A Large-Scale Dataset for Full-Body Human Relighting and Novel-View Synthesis
Timo Teufel, Pulkit Gera, Xilong Zhou, Umar Iqbal, Pramod Rao, Jan Kautz, Vladislav Golyanik, Christian Theobalt
Main category: cs.CV
TL;DR: The paper introduces HumanOLAT, the first large-scale public dataset for full-body human relighting and novel-view rendering, addressing the lack of high-quality datasets.
Details
Motivation: The lack of publicly available, high-quality datasets for full-body human captures limits progress in relighting and novel-view rendering.
Method: The authors create the HumanOLAT dataset, featuring multi-view OLAT captures with HDR RGB frames under diverse illuminations.
Result: Evaluations show the dataset’s value and highlight challenges in modeling human-centric appearance and lighting.
Conclusion: HumanOLAT is expected to advance research by enabling rigorous benchmarking in relighting and rendering techniques.
Abstract: Simultaneous relighting and novel-view rendering of digital human representations is an important yet challenging task with numerous applications. Progress in this area has been significantly limited due to the lack of publicly available, high-quality datasets, especially for full-body human captures. To address this critical gap, we introduce the HumanOLAT dataset, the first publicly accessible large-scale dataset of multi-view One-Light-at-a-Time (OLAT) captures of full-body humans. The dataset includes HDR RGB frames under various illuminations, such as white light, environment maps, color gradients and fine-grained OLAT illuminations. Our evaluations of state-of-the-art relighting and novel-view synthesis methods underscore both the dataset’s value and the significant challenges still present in modeling complex human-centric appearance and lighting interactions. We believe HumanOLAT will significantly facilitate future research, enabling rigorous benchmarking and advancements in both general and human-specific relighting and rendering techniques.
[173] Efficient Annotation of Medieval Charters
Anguelos Nicolaou, Daniel Luger, Franziska Decker, Nicolas Renet, Vincent Christlein, Georg Vogeler
Main category: cs.CV
TL;DR: The paper proposes an efficient annotation method for medieval charter segmentation, reducing it to object detection, saving experts’ time and outperforming pixel-level segmentation in some cases. It also explores class ontology design and uses calibration cards for physical length annotation.
Details
Motivation: To improve the efficiency and accuracy of annotating medieval charters, reducing the burden on paleographers while maintaining high-quality results.
Method: The approach reduces charter segmentation to object detection, uses class ontology for efficient annotation, and employs calibration cards for physical length annotation with regression neural networks.
Result: The method is more efficient than pixel-level segmentation and can outperform it in certain scenarios. It also successfully predicts physical lengths from image patches.
Conclusion: The proposed annotation approach is effective and efficient, offering practical benefits for paleographers and improving the quality of charter segmentation.
Abstract: Diplomatics, the analysis of medieval charters, is a major field of research in which paleography is applied. Annotating data, if performed by laymen, needs validation and correction by experts. In this paper, we propose an effective and efficient annotation approach for charter segmentation, essentially reducing it to object detection. This approach allows for a much more efficient use of the paleographer’s time and produces results that can compete and even outperform pixel-level segmentation in some use cases. Further experiments shed light on how to design a class ontology in order to make the best use of annotators’ time and effort. Exploiting the presence of calibration cards in the image, we further annotate the data with the physical length in pixels and train regression neural networks to predict it from image patches.
[174] Box2Poly: Memory-Efficient Polygon Prediction of Arbitrarily Shaped and Rotated Text
Xuyang Chen, Dong Wang, Konrad Schindler, Mingwei Sun, Yongliang Wang, Nicolo Savioli, Liqiu Meng
Main category: cs.CV
TL;DR: A new Transformer-based method improves text detection by refining polygon predictions iteratively, enhancing memory efficiency and speed while maintaining accuracy.
Details
Motivation: Existing methods for polygon-based text detection suffer from high memory usage and poor vertex relationship modeling, leading to suboptimal results for irregular text layouts.
Method: The approach uses a cascade decoding pipeline inspired by Sparse R-CNN, refining polygon predictions iteratively with attention to scale and location. It employs a single feature vector for regression, improving efficiency.
Result: The method reduces memory usage by over 50% and speeds up inference by over 40% compared to DPText-DETR, with only minor performance drops.
Conclusion: The proposed method offers a more efficient and effective solution for text detection, particularly for irregular layouts, balancing performance and resource usage.
Abstract: Recently, Transformer-based text detection techniques have sought to predict polygons by encoding the coordinates of individual boundary vertices using distinct query features. However, this approach incurs a significant memory overhead and struggles to effectively capture the intricate relationships between vertices belonging to the same instance. Consequently, irregular text layouts often lead to the prediction of outlined vertices, diminishing the quality of results. To address these challenges, we present an innovative approach rooted in Sparse R-CNN: a cascade decoding pipeline for polygon prediction. Our method ensures precision by iteratively refining polygon predictions, considering both the scale and location of preceding results. Leveraging this stabilized regression pipeline, even employing just a single feature vector to guide polygon instance regression yields promising detection results. Simultaneously, leveraging instance-level feature proposals substantially enhances memory efficiency (>50% less vs. the state-of-the-art method DPText-DETR) and reduces inference time (>40% less vs. DPText-DETR) with only a minor performance drop on benchmarks.
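The cascade idea is easy to see in miniature: one instance-level feature vector drives a stack of small regression heads, each predicting bounded per-vertex offsets that refine the polygon left by the previous stage. A toy sketch under those assumptions (all modules and shapes are illustrative, not the paper's architecture):

```python
import torch
import torch.nn as nn

# Cascade polygon refinement driven by a single instance feature vector.
n_vertices, d = 16, 256
feat = torch.randn(1, d)                    # one feature vector per text instance
polygon = torch.rand(1, n_vertices, 2)      # initial box-like polygon in [0, 1]^2

stages = nn.ModuleList(nn.Linear(d, n_vertices * 2) for _ in range(3))
for head in stages:
    offsets = head(feat).view(1, n_vertices, 2)
    polygon = polygon + 0.1 * torch.tanh(offsets)   # bounded update per stage
```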
[175] SSPFusion: A Semantic Structure-Preserving Approach for Infrared and Visible Image Fusion
Qiao Yang, Yu Zhang, Yutong Chen, Jian Zhang, Shunli Zhang
Main category: cs.CV
TL;DR: SSPFusion is a semantic structure-preserving approach for multi-modality image fusion, addressing inconsistency by using structural features and self-supervised signals.
Details
Motivation: Existing methods suffer from structural inconsistency due to improper use of semantic-level features.
Method: Uses a structural feature extractor (SFE) and a multi-scale structure-preserving fusion (SPF) module guided by Sobel operator-generated signals.
Result: Outperforms nine state-of-the-art methods in infrared-visible and medical image fusion tasks.
Conclusion: SSPFusion ensures semantic structure consistency and demonstrates strong generalization.
Abstract: Most existing learning-based multi-modality image fusion (MMIF) methods suffer from significant structure inconsistency due to their inappropriate usage of structural features at the semantic level. To alleviate these issues, we propose a semantic structure-preserving fusion approach for MMIF, namely SSPFusion. At first, we design a structural feature extractor (SFE) to extract the prominent structural features from multiple input images. Concurrently, we introduce a transformation function with Sobel operator to generate self-supervised structural signals in these extracted features. Subsequently, we design a multi-scale structure-preserving fusion (SPF) module, guided by the generated structural signals, to merge the structural features of input images. This process ensures the preservation of semantic structure consistency between the resultant fusion image and the input images. Through the synergy of these two robust modules of SFE and SPF, our method can generate high-quality fusion images and demonstrate good generalization ability. Experimental results, on both infrared-visible image fusion and medical image fusion tasks, demonstrate that our method outperforms nine state-of-the-art methods in terms of both qualitative and quantitative evaluations. The code is publicly available at https://github.com/QiaoYang-CV/SSPFUSION.
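The self-supervised structural signal the abstract mentions can be approximated with a plain Sobel gradient magnitude. A minimal PyTorch sketch with toy shapes, making no claim to match the paper's exact transformation function:

```python
import torch
import torch.nn.functional as F

def sobel_structure(x: torch.Tensor) -> torch.Tensor:
    """Gradient-magnitude map of a (B, 1, H, W) batch; a simple stand-in
    for the Sobel-based structural signal described in the abstract."""
    kx = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]]).view(1, 1, 3, 3)
    ky = kx.transpose(2, 3)                 # Sobel-y is the transpose of Sobel-x
    gx = F.conv2d(x, kx, padding=1)
    gy = F.conv2d(x, ky, padding=1)
    return torch.sqrt(gx ** 2 + gy ** 2 + 1e-8)

infrared = torch.rand(2, 1, 128, 128)       # placeholder infrared batch
signal = sobel_structure(infrared)          # supervisory structural signal
```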
[176] Un-EVIMO: Unsupervised Event-Based Independent Motion Segmentation
Ziyun Wang, Jinyuan Guo, Kostas Daniilidis
Main category: cs.CV
TL;DR: The paper introduces an unsupervised event-based framework for detecting independently moving objects (IMOs) using geometric constraints, eliminating the need for labeled data.
Details
Motivation: Event cameras' low latency and HDR properties make them suitable for IMO detection, but existing methods rely on labeled data, unlike biological vision systems.
Method: The proposed framework generates IMO pseudo-labels using geometric constraints, enabling unsupervised handling of arbitrary objects.
Result: The method performs competitively with supervised approaches on the EVIMO dataset, both quantitatively and qualitatively.
Conclusion: The unsupervised approach is scalable and effective for IMO detection, especially where labeled data is scarce.
Abstract: Event cameras are a novel type of biologically inspired vision sensor known for their high temporal resolution, high dynamic range, and low power consumption. Because of these properties, they are well-suited for processing fast motions that require rapid reactions. Although event cameras have recently shown competitive performance in unsupervised optical flow estimation, performance in detecting independently moving objects (IMOs) lags behind, even though event-based methods are well suited to this task given their low latency and HDR properties. Previous approaches to event-based IMO segmentation have been heavily dependent on labeled data. However, biological vision systems have developed the ability to avoid moving objects through daily tasks without being given explicit labels. In this work, we propose the first event framework that generates IMO pseudo-labels using geometric constraints. Due to its unsupervised nature, our method can handle an arbitrary number of objects that need not be predetermined, and is easily scalable to datasets where expensive IMO labels are not readily available. We evaluate our approach on the EVIMO dataset and show that it performs competitively with supervised methods, both quantitatively and qualitatively.
[177] From Lab to Field: Real-World Evaluation of an AI-Driven Smart Video Solution to Enhance Community Safety
Shanle Yao, Babak Rahimi Ardabili, Armin Danesh Pazho, Ghazal Alinezhad Noghre, Christopher Neff, Lauren Bourque, Hamed Tabkhi
Main category: cs.CV
TL;DR: The paper evaluates an AI-enabled Smart Video Solution (SVS) for enhancing public safety by integrating with existing camera networks, using pose-based data for anomaly detection, and providing real-time alerts. It demonstrates robust performance in a real-world deployment.
Details
Motivation: To enhance public safety by leveraging AI and existing infrastructure, prioritizing privacy and ethical standards while providing actionable insights.
Method: The SVS integrates AI-driven visual processing, statistical analysis, and cloud-based infrastructure. It uses innovative techniques like Occupancy Indicator and Heatmaps for anomaly detection and real-time alerts.
Result: The system managed 16 cameras with 16.5 FPS over 21 hours, achieving an average end-to-end latency of 26.76 seconds for anomaly detection to alert.
Conclusion: The SVS effectively converts complex computer vision outputs into actionable insights, proving its robustness and utility in real-world safety applications.
Abstract: This article adopts and evaluates an AI-enabled Smart Video Solution (SVS) designed to enhance safety in the real world. The system integrates with existing infrastructure camera networks, leveraging recent advancements in AI for easy adoption. Prioritizing privacy and ethical standards, pose-based data is used for downstream AI tasks such as anomaly detection. A cloud-based infrastructure and a mobile app are deployed, enabling real-time alerts within communities. The SVS employs innovative data representation and visualization techniques, such as the Occupancy Indicator, Statistical Anomaly Detection, Bird’s Eye View, and Heatmaps, to understand pedestrian behaviors and enhance public safety. Evaluation of the SVS demonstrates its capacity to convert complex computer vision outputs into actionable insights for stakeholders, community partners, law enforcement, urban planners, and social scientists. This article presents a comprehensive real-world deployment and evaluation of the SVS, implemented in a community college environment across 16 cameras. The system integrates AI-driven visual processing, supported by statistical analysis, database management, cloud communication, and user notifications. Additionally, the article evaluates the end-to-end latency from the moment an AI algorithm detects anomalous behavior in real-time at the camera level to the time stakeholders receive a notification. The results demonstrate the system’s robustness, effectively managing 16 CCTV cameras with a consistent throughput of 16.5 frames per second (FPS) over a 21-hour period and an average end-to-end latency of 26.76 seconds between anomaly detection and alert issuance.
[178] PointDreamer: Zero-shot 3D Textured Mesh Reconstruction from Colored Point Cloud
Qiao Yu, Xianzhi Li, Yuan Tang, Xu Han, Jinfeng Xu, Long Hu, Min Chen
Main category: cs.CV
TL;DR: PointDreamer is a novel framework for high-quality textured mesh reconstruction from 3D colored point clouds, leveraging 2D diffusion models and a unique project-inpaint-unproject pipeline to avoid blurry textures and artifacts.
Details
Motivation: Existing methods for colored-PC-to-mesh reconstruction often produce blurry textures or require hard-to-acquire 3D training data. PointDreamer addresses these limitations by utilizing 2D diffusion priors for superior texture quality.
Method: PointDreamer employs a project-inpaint-unproject pipeline: projecting point clouds into 2D images, inpainting with diffusion models, and unprojecting to 3D mesh. It also introduces a Non-Border-First (NBF) strategy to mitigate unprojection artifacts.
Result: PointDreamer achieves state-of-the-art performance (30% improvement in LPIPS score) and robustness to noisy, sparse, or incomplete input data, as demonstrated on synthetic and real-scanned datasets.
Conclusion: PointDreamer successfully adapts 2D diffusion models to 3D point cloud data, offering high-quality texture reconstruction without requiring 3D training data, and introduces solutions for common artifacts.
Abstract: Faithfully reconstructing textured meshes is crucial for many applications. Compared to text or image modalities, leveraging 3D colored point clouds as input (colored-PC-to-mesh) offers inherent advantages in comprehensively and precisely replicating the target object’s 360° characteristics. While most existing colored-PC-to-mesh methods suffer from blurry textures or require hard-to-acquire 3D training data, we propose PointDreamer, a novel framework that harnesses 2D diffusion prior for superior texture quality. Crucially, unlike prior 2D-diffusion-for-3D works driven by text or image inputs, PointDreamer successfully adapts 2D diffusion models to 3D point cloud data by a novel project-inpaint-unproject pipeline. Specifically, it first projects the point cloud into sparse 2D images and then performs diffusion-based inpainting. After that, diverging from most existing 3D reconstruction or generation approaches that predict texture in 3D/UV space thus often yielding blurry texture, PointDreamer achieves high-quality texture by directly unprojecting the inpainted 2D images to the 3D mesh. Furthermore, we identify for the first time a typical kind of unprojection artifact appearing in occlusion borders, which is common in other multiview-image-to-3D pipelines but less-explored. To address this, we propose a novel solution named the Non-Border-First (NBF) unprojection strategy. Extensive qualitative and quantitative experiments on various synthetic and real-scanned datasets demonstrate that PointDreamer, though zero-shot, exhibits SoTA performance (30% improvement on LPIPS score from 0.118 to 0.068), and is robust to noisy, sparse, or even incomplete input data. Code at: https://github.com/YuQiao0303/PointDreamer.
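The "project" step of the project-inpaint-unproject pipeline is simple enough to sketch: splat the colored points through a pinhole camera into a sparse image plus a hit mask, which a 2D diffusion inpainter would then complete. Intrinsics and data below are hypothetical, and depth ordering is omitted for brevity:

```python
import numpy as np

H, W, f = 256, 256, 200.0                        # toy image size and focal length
pts = np.random.randn(5000, 3) + np.array([0.0, 0.0, 4.0])   # placeholder cloud
rgb = np.random.rand(5000, 3)

front = pts[:, 2] > 0.5                          # keep points safely in front of the camera
pts, rgb = pts[front], rgb[front]

u = (f * pts[:, 0] / pts[:, 2] + W / 2).astype(int)
v = (f * pts[:, 1] / pts[:, 2] + H / 2).astype(int)
ok = (u >= 0) & (u < W) & (v >= 0) & (v < H)

img = np.zeros((H, W, 3))
mask = np.zeros((H, W), dtype=bool)              # True where a point landed
img[v[ok], u[ok]] = rgb[ok]
mask[v[ok], u[ok]] = True
# img/mask would feed a diffusion inpainter; the inpainted views are then
# unprojected onto the reconstructed mesh (in Non-Border-First order).
```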
[179] OE3DIS: Open-Ended 3D Point Cloud Instance Segmentation
Phuc D. A. Nguyen, Minh Luu, Anh Tran, Cuong Pham, Khoi Nguyen
Main category: cs.CV
TL;DR: The paper introduces Open-Ended 3D Instance Segmentation (OE-3DIS), removing the need for predefined class names during testing. It proposes baselines and a new Open-Ended score, outperforming existing methods.
Details
Motivation: Current Open-Vocab 3D Instance Segmentation (OV-3DIS) methods rely on predefined class names during testing, limiting agent autonomy.
Method: Proposes OE-3DIS, leveraging 2D Multimodal Large Language Models and introducing a novel Open-Ended score for evaluation.
Result: Outperforms baselines and Open3DIS on ScanNet200 and ScanNet++ datasets, even without ground-truth class names.
Conclusion: OE-3DIS advances autonomy in 3D instance segmentation by eliminating predefined class dependencies.
Abstract: Open-Vocab 3D Instance Segmentation methods (OV-3DIS) have recently demonstrated their ability to generalize to unseen objects. However, these methods still depend on predefined class names during testing, restricting the autonomy of agents. To mitigate this constraint, we propose a novel problem termed Open-Ended 3D Instance Segmentation (OE-3DIS), which eliminates the necessity for predefined class names during testing. Moreover, we contribute a comprehensive set of strong baselines, derived from OV-3DIS approaches and leveraging 2D Multimodal Large Language Models. To assess the performance of our OE-3DIS system, we introduce a novel Open-Ended score, evaluating both the semantic and geometric quality of predicted masks and their associated class names, alongside the standard AP score. Our approach demonstrates significant performance improvements over the baselines on the ScanNet200 and ScanNet++ datasets. Remarkably, our method surpasses the performance of Open3DIS, the current state-of-the-art method in OV-3DIS, even in the absence of ground-truth object class names.
[180] SynFER: Towards Boosting Facial Expression Recognition with Synthetic Data
Xilin He, Cheng Luo, Xiaole Xian, Bing Li, Muhammad Haris Khan, Zongyuan Ge, Weicheng Xie, Siyang Song, Linlin Shen, Bernard Ghanem, Xiangyu Yue
Main category: cs.CV
TL;DR: SynFER introduces a synthetic framework for generating facial expression data using textual descriptions and facial action units, addressing the limitations of small-scale datasets for deep learning models.
Details
Motivation: The subjectivity and labor-intensive nature of facial expression annotations limit dataset scale, hindering deep learning model performance.
Method: SynFER uses textual descriptions and facial action units for synthesis, with semantic guidance and pseudo-labeling to ensure quality.
Result: Achieves 67.23% accuracy on AffectNet with synthetic data, improving to 69.84% with scaled-up data.
Conclusion: SynFER effectively addresses dataset limitations and enhances facial expression analysis model performance.
Abstract: Facial expression datasets remain limited in scale due to the subjectivity of annotations and the labor-intensive nature of data collection. This limitation poses a significant challenge for developing modern deep learning-based facial expression analysis models, particularly foundation models, that rely on large-scale data for optimal performance. To tackle the overarching and complex challenge, instead of introducing a new large-scale dataset, we introduce SynFER (Synthesis of Facial Expressions with Refined Control), a novel synthetic framework for synthesizing facial expression image data based on high-level textual descriptions as well as more fine-grained and precise control through facial action units. To ensure the quality and reliability of the synthetic data, we propose a semantic guidance technique to steer the generation process and a pseudo-label generator to help rectify the facial expression labels for the synthetic images. To demonstrate the generation fidelity and the effectiveness of the synthetic data from SynFER, we conduct extensive experiments on representation learning using both synthetic data and real-world data. Results validate the efficacy of our approach and the synthetic data. Notably, our approach achieves a 67.23% classification accuracy on AffectNet when training solely with synthetic data equivalent to the AffectNet training set size, which increases to 69.84% when scaling up to five times the original size. Code is available here.
[181] REDUCIO! Generating 1K Video within 16 Seconds using Extremely Compressed Motion Latents
Rui Tian, Qi Dai, Jianmin Bao, Kai Qiu, Yifan Yang, Chong Luo, Zuxuan Wu, Yu-Gang Jiang
Main category: cs.CV
TL;DR: The paper introduces Reducio-VAE, a method to compress videos into a highly efficient latent space, reducing costs for training and inference in video generation models.
Details
Motivation: High training and inference costs limit large-scale applications of commercial video generation models, despite their high-fidelity results.
Method: An image-conditioned VAE compresses videos into a minimal latent space, enabling a 64x reduction in latents. A two-stage generation paradigm (text-to-image followed by text-image-to-video) is used with Reducio-DiT for efficient high-resolution video generation.
Result: The model achieves strong performance, with training completed in 3.2K A100 GPU hours and generation of a 16-frame 1024×1024 video clip in 15.5 seconds on a single A100 GPU.
Conclusion: Reducio-VAE significantly improves efficiency in video generation, making it practical for large-scale applications.
Abstract: Commercial video generation models have exhibited realistic, high-fidelity results but are still restricted to limited access. One crucial obstacle for large-scale applications is the expensive training and inference cost. In this paper, we argue that videos contain significantly more redundant information than images, allowing them to be encoded with very few motion latents. Towards this goal, we design an image-conditioned VAE that projects videos into an extremely compressed latent space and decodes them based on content images. This magic Reducio charm enables 64x reduction of latents compared to a common 2D VAE, without sacrificing the quality. Building upon Reducio-VAE, we can train diffusion models for high-resolution video generation efficiently. Specifically, we adopt a two-stage generation paradigm, first generating a condition image via text-to-image generation, followed by text-image-to-video generation with the proposed Reducio-DiT. Extensive experiments show that our model achieves strong performance in evaluation. More importantly, our method significantly boosts the training and inference efficiency of video LDMs. Reducio-DiT is trained in just 3.2K A100 GPU hours in total and can generate a 16-frame 1024×1024 video clip within 15.5 seconds on a single A100 GPU. Code released at https://github.com/microsoft/Reducio-VAE .
[182] Fancy123: One Image to High-Quality 3D Mesh Generation via Plug-and-Play Deformation
Qiao Yu, Xianzhi Li, Yuan Tang, Xu Han, Long Hu, Yixue Hao, Min Chen
Main category: cs.CV
TL;DR: Fancy123 improves 3D mesh generation from a single image by addressing inconsistencies in multiview images and enhancing fidelity and clarity.
Details
Motivation: Existing methods for generating 3D meshes from a single image suffer from local inconsistencies in multiview images and lack fidelity or clarity in the final meshes.
Method: Fancy123 introduces two enhancement modules (appearance and fidelity) and an unprojection operation to realign pixels, match the input image, and ensure clarity.
Result: Fancy123 achieves state-of-the-art performance with significant improvements in qualitative and quantitative experiments.
Conclusion: Fancy123’s plug-and-play modules enhance existing single-image-to-3D methods, offering better consistency, fidelity, and clarity.
Abstract: Generating 3D meshes from a single image is an important but ill-posed task. Existing methods mainly adopt 2D multiview diffusion models to generate intermediate multiview images, and use the Large Reconstruction Model (LRM) to create the final meshes. However, the multiview images exhibit local inconsistencies, and the meshes often lack fidelity to the input image or look blurry. We propose Fancy123, featuring two enhancement modules and an unprojection operation to address the above three issues, respectively. The appearance enhancement module deforms the 2D multiview images to realign misaligned pixels for better multiview consistency. The fidelity enhancement module deforms the 3D mesh to match the input image. The unprojection of the input image and deformed multiview images onto LRM’s generated mesh ensures high clarity, discarding LRM’s predicted blurry-looking mesh colors. Extensive qualitative and quantitative experiments verify Fancy123’s SoTA performance with significant improvement. Also, the two enhancement modules are plug-and-play and work at inference time, allowing seamless integration into various existing single-image-to-3D methods. Code at: https://github.com/YuQiao0303/Fancy123
[183] PAD-F: Prior-Aware Debiasing Framework for Long-Tailed X-ray Prohibited Item Detection
Haoyu Wang, Renshuai Tao, Wei Wang, Yunchao Wei
Main category: cs.CV
TL;DR: PAD-F improves long-tailed object detection in X-ray security imagery using material and co-occurrence priors, achieving significant performance gains.
Details
Motivation: Addressing the challenge of long-tailed distribution in prohibited item detection in X-ray images, where conventional methods fail due to unique imaging principles.
Method: Introduces PAD-F with Explicit Material-Aware Augmentation (EMAA) for data-level augmentation and Implicit Co-occurrence Aggregator (ICA) for feature-level enhancement.
Result: Achieves up to +17.2% AP50 improvement for tail classes on HiXray and PIDray datasets, outperforming state-of-the-art methods.
Conclusion: PAD-F provides an effective solution for long-tailed detection in X-ray security, enhancing detector performance.
Abstract: Detecting prohibited items in X-ray security imagery is a challenging yet crucial task. With the rapid advancement of deep learning, object detection algorithms have been widely applied in this area. However, the distribution of object classes in real-world prohibited item detection scenarios often exhibits a distinct long-tailed distribution. Due to the unique principles of X-ray imaging, conventional methods for long-tailed object detection are often ineffective in this domain. To tackle these challenges, we introduce the Prior-Aware Debiasing Framework (PAD-F), a novel approach that employs a two-pronged strategy leveraging both material and co-occurrence priors. At the data level, our Explicit Material-Aware Augmentation (EMAA) component generates numerous challenging training samples for tail classes. It achieves this through a placement strategy guided by material-specific absorption rates and a gradient-based Poisson blending technique. At the feature level, the Implicit Co-occurrence Aggregator (ICA) acts as a plug-in module that enhances features for ambiguous objects by implicitly learning and aggregating statistical co-occurrence relationships within the image. Extensive experiments on the HiXray and PIDray datasets demonstrate that PAD-F significantly boosts the performance of multiple popular detectors. It achieves an absolute improvement of up to +17.2% in AP50 for tail classes and comprehensively outperforms existing state-of-the-art methods. Our work provides an effective and versatile solution to the critical problem of long-tailed detection in X-ray security.
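The augmentation step pairs a placement policy with Poisson blending so pasted tail-class items look plausible in the scan. A rough sketch of the blending half, using OpenCV's seamlessClone as a stand-in for the paper's gradient-based Poisson blending; all images here are synthetic placeholders:

```python
import cv2
import numpy as np

scene = np.full((256, 256, 3), 200, dtype=np.uint8)   # placeholder X-ray scan
item = np.zeros((64, 64, 3), dtype=np.uint8)          # placeholder item crop
cv2.circle(item, (32, 32), 24, (90, 60, 160), -1)

mask = np.zeros((64, 64), dtype=np.uint8)             # region of the item to paste
cv2.circle(mask, (32, 32), 26, 255, -1)

center = (128, 128)   # in EMAA this placement would come from the material-aware policy
augmented = cv2.seamlessClone(item, scene, mask, center, cv2.NORMAL_CLONE)
```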
[184] WSI-LLaVA: A Multimodal Large Language Model for Whole Slide Image
Yuci Liang, Xinheng Lyu, Wenting Chen, Meidan Ding, Jipeng Zhang, Xiangjian He, Song Wu, Xiaohan Xing, Sen Yang, Xiyue Wang, Linlin Shen
Main category: cs.CV
TL;DR: The paper introduces WSI-Bench, a benchmark for evaluating MLLMs in pathology, and WSI-LLaVA, a framework for gigapixel WSI understanding, showing superior performance in morphological analysis.
Details
Motivation: Address limitations of current MLLMs in analyzing whole slide images (WSIs) and capturing crucial morphological features for diagnosis.
Method: Develop WSI-Bench (180k VQA pairs from 9,850 WSIs) and WSI-LLaVA, a three-stage training framework (WSI-text alignment, feature space alignment, task-specific tuning).
Result: WSI-LLaVA outperforms existing models, with significant improvements in morphological analysis and diagnostic accuracy.
Conclusion: The work establishes a strong link between morphological understanding and diagnostic accuracy, advancing computational pathology.
Abstract: Recent advancements in computational pathology have produced patch-level Multi-modal Large Language Models (MLLMs), but these models are limited by their inability to analyze whole slide images (WSIs) comprehensively and their tendency to bypass crucial morphological features that pathologists rely on for diagnosis. To address these challenges, we first introduce WSI-Bench, a large-scale morphology-aware benchmark containing 180k VQA pairs from 9,850 WSIs across 30 cancer types, designed to evaluate MLLMs’ understanding of morphological characteristics crucial for accurate diagnosis. Building upon this benchmark, we present WSI-LLaVA, a novel framework for gigapixel WSI understanding that employs a three-stage training approach: WSI-text alignment, feature space alignment, and task-specific instruction tuning. To better assess model performance in pathological contexts, we develop two specialized WSI metrics: WSI-Precision and WSI-Relevance. Experimental results demonstrate that WSI-LLaVA outperforms existing models across all capability dimensions, with a significant improvement in morphological analysis, establishing a clear correlation between morphological understanding and diagnostic accuracy.
[185] Deblur4DGS: 4D Gaussian Splatting from Blurry Monocular Video
Renlong Wu, Zhilu Zhang, Mingyang Chen, Zifei Yan, Wangmeng Zuo
Main category: cs.CV
TL;DR: Deblur4DGS reconstructs high-quality 4D models from blurry monocular videos using 3D Gaussian Splatting, outperforming existing methods in tasks like deblurring and frame interpolation.
Details
Motivation: Existing 4D reconstruction methods struggle with motion blur in videos, leading to inaccurate dynamic representations. Deblur4DGS aims to address this limitation.
Method: The approach transforms dynamic representation estimation into exposure time estimation, introduces regularization terms, and uses blur-aware variable canonical Gaussians.
Result: Deblur4DGS achieves superior performance in synthetic and real-world data for tasks like deblurring and video stabilization.
Conclusion: Deblur4DGS advances 4D reconstruction by effectively handling motion blur and improving video quality across multiple applications.
Abstract: Recent 4D reconstruction methods have yielded impressive results but rely on sharp videos as supervision. However, motion blur often occurs in videos due to camera shake and object movement, while existing methods render blurry results when using such videos for reconstructing 4D models. Although a few approaches attempted to address the problem, they struggled to produce high-quality results, due to the inaccuracy in estimating continuous dynamic representations within the exposure time. Encouraged by recent works in 3D motion trajectory modeling using 3D Gaussian Splatting (3DGS), we take 3DGS as the scene representation manner, and propose Deblur4DGS to reconstruct a high-quality 4D model from blurry monocular video. Specifically, we transform the estimation of continuous dynamic representations within the exposure time into an exposure time estimation problem. Moreover, we introduce an exposure regularization term as well as multi-frame and multi-resolution consistency regularization terms to avoid trivial solutions. Furthermore, to better represent objects with large motion, we suggest blur-aware variable canonical Gaussians. Beyond novel-view synthesis, Deblur4DGS can be applied to improve blurry video from multiple perspectives, including deblurring, frame interpolation, and video stabilization. Extensive experiments on both synthetic and real-world data across the above four tasks show that Deblur4DGS outperforms state-of-the-art 4D reconstruction methods. The codes are available at https://github.com/ZcsrenlongZ/Deblur4DGS.
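The blur formation model behind this line of work is worth making concrete: a blurry frame is approximately the average of sharp renderings at timestamps spanning the exposure window, so making the exposure a learnable quantity lets gradients from a photometric loss refine it. A toy, self-contained sketch with a placeholder renderer in place of 3DGS:

```python
import torch

def render(t: torch.Tensor) -> torch.Tensor:
    """Placeholder differentiable renderer: a pattern that shifts with time."""
    xs = torch.linspace(0, 6.28, 64)
    return torch.sin(xs + 10 * t).repeat(64, 1)

def blurry_frame(t_mid, exposure, n: int = 9):
    # average sharp renders at n timestamps inside the exposure window
    ts = t_mid + exposure * torch.linspace(-0.5, 0.5, n)
    return torch.stack([render(t) for t in ts]).mean(dim=0)

exposure = torch.tensor(0.05, requires_grad=True)     # learnable exposure estimate
loss = (blurry_frame(torch.tensor(0.3), exposure) - torch.rand(64, 64)).pow(2).mean()
loss.backward()                                       # gradients reach the exposure
```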
[186] From Slow Bidirectional to Fast Autoregressive Video Diffusion Models
Tianwei Yin, Qiang Zhang, Richard Zhang, William T. Freeman, Fredo Durand, Eli Shechtman, Xun Huang
Main category: cs.CV
TL;DR: The paper introduces an autoregressive transformer for video generation, reducing latency by distilling a 50-step diffusion model into a 4-step generator, achieving high-quality results at 9.4 FPS.
Details
Motivation: Address the inefficiency of bidirectional attention in current video diffusion models for interactive applications.
Method: Adapt a pretrained bidirectional diffusion transformer to an autoregressive one, use DMD for distillation, and introduce student initialization and asymmetric distillation.
Result: Achieves 84.27 on VBench-Long, 9.4 FPS generation, and enables zero-shot video-to-video translation.
Conclusion: The approach enables efficient, high-quality video generation and interactive applications.
Abstract: Current video diffusion models achieve impressive generation quality but struggle in interactive applications due to bidirectional attention dependencies. The generation of a single frame requires the model to process the entire sequence, including the future. We address this limitation by adapting a pretrained bidirectional diffusion transformer to an autoregressive transformer that generates frames on-the-fly. To further reduce latency, we extend distribution matching distillation (DMD) to videos, distilling a 50-step diffusion model into a 4-step generator. To enable stable and high-quality distillation, we introduce a student initialization scheme based on the teacher’s ODE trajectories, as well as an asymmetric distillation strategy that supervises a causal student model with a bidirectional teacher. This approach effectively mitigates error accumulation in autoregressive generation, allowing long-duration video synthesis despite training on short clips. Our model achieves a total score of 84.27 on the VBench-Long benchmark, surpassing all previous video generation models. It enables fast streaming generation of high-quality videos at 9.4 FPS on a single GPU thanks to KV caching. Our approach also enables streaming video-to-video translation, image-to-video, and dynamic prompting in a zero-shot manner.
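The 9.4 FPS streaming figure rests on causal attention plus KV caching: keys and values for past frame tokens are computed once and appended to a cache, so each new frame only pays for its own tokens. A toy single-head sketch of that mechanic (dimensions and the one-token-per-frame simplification are ours):

```python
import torch

d, steps = 32, 5
wq, wk, wv = (torch.randn(d, d) for _ in range(3))
cache_k, cache_v = [], []

frame_feat = torch.randn(1, d)          # toy: one token per frame
for t in range(steps):
    q = frame_feat @ wq
    cache_k.append(frame_feat @ wk)     # computed once per frame, then reused
    cache_v.append(frame_feat @ wv)
    K, V = torch.cat(cache_k), torch.cat(cache_v)
    attn = torch.softmax(q @ K.t() / d ** 0.5, dim=-1)
    frame_feat = attn @ V               # causal: only past and current frames visible
```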
[187] Memory Storyboard: Leveraging Temporal Segmentation for Streaming Self-Supervised Learning from Egocentric Videos
Yanlai Yang, Mengye Ren
Main category: cs.CV
TL;DR: The paper proposes ‘Memory Storyboard’ for self-supervised learning from real-world egocentric video streams, improving representation learning over static images or artificial data.
Details
Motivation: To explore realistic learning substrates by focusing on long-form real-world egocentric video streams, moving beyond static images or artificial data.
Method: Introduces ‘Memory Storyboard’ with a two-tier memory hierarchy (short-term and long-term) for temporal segmentation and contrastive learning on storyboard frames.
Result: Outperforms state-of-the-art unsupervised continual learning methods on datasets like SAYCam and KrishnaCam.
Conclusion: The approach yields semantically meaningful representations, demonstrating the effectiveness of temporal segmentation in self-supervised learning.
Abstract: Self-supervised learning holds the promise of learning good representations from real-world continuous uncurated data streams. However, most existing works in visual self-supervised learning focus on static images or artificial data streams. Towards exploring a more realistic learning substrate, we investigate streaming self-supervised learning from long-form real-world egocentric video streams. Inspired by the event segmentation mechanism in human perception and memory, we propose “Memory Storyboard” that groups recent past frames into temporal segments for more effective summarization of the past visual streams for memory replay. To accommodate efficient temporal segmentation, we propose a two-tier memory hierarchy: the recent past is stored in a short-term memory, and the storyboard temporal segments are then transferred to a long-term memory. Experiments on real-world egocentric video datasets including SAYCam and KrishnaCam show that contrastive learning objectives on top of storyboard frames result in semantically meaningful representations that outperform those produced by state-of-the-art unsupervised continual learning methods.
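The two-tier memory is essentially a buffer-and-segment policy: frames accumulate in a bounded short-term store, and on flush they are grouped into contiguous temporal segments that move to long-term memory. A toy sketch where a scalar-difference test stands in for the real feature-space boundary detector:

```python
from collections import deque

short_term = deque(maxlen=4)      # bounded recent-past buffer (toy size)
long_term = []                    # list of "storyboard" segments

def boundary(a, b) -> bool:
    return abs(a - b) > 5.0       # placeholder for a feature-change test

def ingest(frame):
    short_term.append(frame)
    if len(short_term) == short_term.maxlen:
        segment, prev = [], None
        for f in list(short_term):
            if prev is not None and boundary(prev, f):
                long_term.append(segment)
                segment = []
            segment.append(f)
            prev = f
        long_term.append(segment)
        short_term.clear()

for x in [0.0, 0.5, 1.0, 9.0, 9.2, 20.0, 20.1, 20.3]:
    ingest(x)
print(long_term)   # [[0.0, 0.5, 1.0], [9.0], [9.2], [20.0, 20.1, 20.3]]
```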
[188] Zero-shot Emotion Annotation in Facial Images Using Large Multimodal Models: Benchmarking and Prospects for Multi-Class, Multi-Frame Approaches
He Zhang, Xinyi Fu
Main category: cs.CV
TL;DR: The study explores using GPT-4o-mini for zero-shot emotion annotation in videos, achieving 50% precision for seven-class and 64% for ternary classification. Multi-frame integration slightly improves accuracy.
Details
Motivation: To assess the feasibility of large multimodal models (LMMs) for cost-effective, automated emotion annotation in everyday scenarios.
Method: Experiments on the DailyLife subset of FERV39k using GPT-4o-mini for zero-shot labeling of video key frames under seven-class and ternary emotion taxonomies. Multi-frame integration was also tested.
Result: 50% average precision for seven-class, 64% for ternary classification. Multi-frame integration slightly boosted accuracy.
Conclusion: Zero-shot LMMs show promise for emotion annotation, reducing costs and expanding their use in multimodal tasks.
Abstract: This study investigates the feasibility and performance of using large multimodal models (LMMs) to automatically annotate human emotions in everyday scenarios. We conducted experiments on the DailyLife subset of the publicly available FERV39k dataset, employing the GPT-4o-mini model for rapid, zero-shot labeling of key frames extracted from video segments. Under a seven-class emotion taxonomy (“Angry,” “Disgust,” “Fear,” “Happy,” “Neutral,” “Sad,” “Surprise”), the LMM achieved an average precision of approximately 50%. In contrast, when limited to ternary emotion classification (negative/neutral/positive), the average precision increased to approximately 64%. Additionally, we explored a strategy that integrates multiple frames within 1-2 second video clips to enhance labeling performance and reduce costs. The results indicate that this approach can slightly improve annotation accuracy. Overall, our preliminary findings highlight the potential application of zero-shot LMMs in human facial emotion annotation tasks, offering new avenues for reducing labeling costs and broadening the applicability of LMMs in complex multimodal environments.
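The labeling setup is straightforward to reproduce with any multimodal chat API. A minimal sketch using the OpenAI Python client with gpt-4o-mini on a single key frame; the prompt wording and file path are our own illustration, not the authors' exact protocol:

```python
import base64
from openai import OpenAI

client = OpenAI()   # assumes OPENAI_API_KEY is set in the environment

LABELS = ["Angry", "Disgust", "Fear", "Happy", "Neutral", "Sad", "Surprise"]

with open("key_frame.jpg", "rb") as f:           # hypothetical extracted key frame
    b64 = base64.b64encode(f.read()).decode()

resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": f"Label the facial emotion with exactly one of: {', '.join(LABELS)}."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
        ],
    }],
)
print(resp.choices[0].message.content)           # e.g., "Neutral"
```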
[189] Achieving More with Less: Additive Prompt Tuning for Rehearsal-Free Class-Incremental Learning
Haoran Chen, Ping Wang, Zihan Zhou, Xu Zhang, Zuxuan Wu, Yu-Gang Jiang
Main category: cs.CV
TL;DR: A novel prompt-based method for class-incremental learning (CIL) reduces computational overhead by modifying CLS token attention instead of concatenating prompts, improving efficiency and performance.
Details
Motivation: Address the computational inefficiency in existing prompt-based CIL methods due to prompt pool querying and increased input sequence lengths.
Method: Trains a single set of shared prompts across tasks and modifies CLS token attention computation by adding prompts to it, avoiding concatenation.
Result: Significantly reduces computational complexity, eliminates the need for prompt length optimization, and performs well on CIL and general recognition benchmarks.
Conclusion: The method offers a lightweight, efficient, and powerful solution for rehearsal-free CIL and general parameter-efficient fine-tuning.
Abstract: Class-incremental learning (CIL) enables models to learn new classes progressively while preserving knowledge of previously learned ones. Recent advances in this field have shifted towards parameter-efficient fine-tuning techniques, with many approaches building upon the framework that maintains a pool of learnable prompts. Although effective, these methods introduce substantial computational overhead, primarily due to prompt pool querying and increased input sequence lengths from prompt concatenation. In this work, we present a novel prompt-based approach that addresses this limitation. Our method trains a single set of shared prompts across all tasks and, rather than concatenating prompts to the input, directly modifies the CLS token’s attention computation by adding the prompts to it. This simple and lightweight design not only significantly reduces computational complexity, both in terms of inference costs and the number of trainable parameters, but also eliminates the need to optimize prompt lengths for different downstream tasks, offering a more efficient yet powerful solution for rehearsal-free class-incremental learning. Extensive experiments across a diverse range of CIL benchmarks demonstrate the effectiveness of our approach, highlighting its potential to establish a new prompt-based CIL paradigm. Furthermore, experiments on general recognition benchmarks beyond the CIL setting also show strong performance, positioning our method as a promising candidate for a general parameter-efficient fine-tuning approach.
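The contrast with classic prompt tuning is compact enough to show directly: instead of concatenating prompt tokens (which lengthens the sequence for every task), the shared prompt is added to the CLS token before attention, leaving sequence length and therefore cost unchanged. A toy sketch with illustrative shapes, not the paper's exact architecture:

```python
import torch
import torch.nn as nn

d, n_tokens = 64, 197
attn = nn.MultiheadAttention(d, num_heads=4, batch_first=True)
tokens = torch.randn(1, n_tokens, d)                 # CLS + patch tokens
prompt = nn.Parameter(torch.zeros(d))                # single shared prompt

# Classic prompt tuning: the sequence grows, and attention cost with it.
concat_seq = torch.cat([prompt.expand(1, 1, d), tokens], dim=1)

# Additive scheme: sequence length unchanged; prompt folded into CLS.
cls_query = tokens[:, :1] + prompt                   # prompt added to the CLS token
out, _ = attn(cls_query, tokens, tokens)             # CLS attends over all tokens
```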
[190] OSMa-Bench: Evaluating Open Semantic Mapping Under Varying Lighting Conditions
Maxim Popov, Regina Kurkova, Mikhail Iumanov, Jaafar Mahmoud, Sergey Kolyubin
Main category: cs.CV
TL;DR: OSMa-Bench is a dynamically configurable, LLM/LVLM-powered pipeline for evaluating Open Semantic Mapping (OSM) solutions, focusing on robustness under varying indoor lighting conditions.
Details
Motivation: To address the challenge of evaluating semantic mapping algorithms in diverse lighting conditions, crucial for indoor robotic perception.
Method: Introduces a novel dataset with simulated RGB-D sequences and ground truth 3D reconstructions, and evaluates state-of-the-art models (ConceptGraphs, BBQ, OpenScene) using semantic fidelity and Scene Graph analysis.
Result: Provides insights into model robustness, highlighting performance variations under different lighting conditions.
Conclusion: The study lays groundwork for future research on resilient and adaptable robotic systems, with a publicly available project page.
Abstract: Open Semantic Mapping (OSM) is a key technology in robotic perception, combining semantic segmentation and SLAM techniques. This paper introduces a dynamically configurable and highly automated LLM/LVLM-powered pipeline for evaluating OSM solutions called OSMa-Bench (Open Semantic Mapping Benchmark). The study focuses on evaluating state-of-the-art semantic mapping algorithms under varying indoor lighting conditions, a critical challenge in indoor environments. We introduce a novel dataset with simulated RGB-D sequences and ground truth 3D reconstructions, facilitating the rigorous analysis of mapping performance across different lighting conditions. Through experiments on leading models such as ConceptGraphs, BBQ and OpenScene, we evaluate the semantic fidelity of object recognition and segmentation. Additionally, we introduce a Scene Graph evaluation method to analyze the ability of models to interpret semantic structure. The results provide insights into the robustness of these models, forming future research directions for developing resilient and adaptable robotic systems. Project page is available at https://be2rlab.github.io/OSMa-Bench/.
[191] Triad: Empowering LMM-based Anomaly Detection with Vision Expert-guided Visual Tokenizer and Manufacturing Process
Yuanze Li, Shihao Yuan, Haolin Wang, Qizhang Li, Ming Liu, Chen Xu, Guangming Shi, Wangmeng Zuo
Main category: cs.CV
TL;DR: The paper introduces Triad, a novel LMM-based method for industrial anomaly detection (IAD), addressing gaps in generalization by incorporating defect cognition and manufacturing process understanding.
Details
Motivation: General-purpose LMMs lack defect cognition and understanding of defect causes in IAD, limiting their effectiveness.
Method: Modifies the AnyRes structure of LLaVA and introduces a manufacturing-driven IAD paradigm with InstructIAD and CoT-M.
Result: Triad outperforms current LMMs and achieves higher accuracy with manufacturing process integration.
Conclusion: Triad effectively bridges the gap in IAD by combining expert-guided defect focus and manufacturing insights.
Abstract: Although recent methods have tried to introduce large multimodal models (LMMs) into industrial anomaly detection (IAD), their generalization in the IAD field falls far short of their general-purpose performance. We summarize the main reasons for this gap into two aspects. On one hand, general-purpose LMMs lack cognition of defects in the visual modality, thereby failing to sufficiently focus on defect areas. Therefore, we propose to modify the AnyRes structure of the LLaVA model, providing the potential anomalous areas identified by existing IAD models to the LMMs. On the other hand, existing methods mainly focus on identifying defects by learning defect patterns or comparing with normal samples, yet they fall short of understanding the causes of these defects. Considering that the generation of defects is closely related to the manufacturing process, we propose a manufacturing-driven IAD paradigm. An instruction-tuning dataset for IAD (InstructIAD) and a data organization approach for Chain-of-Thought with manufacturing (CoT-M) are designed to leverage the manufacturing process for IAD. Based on the above two modifications, we present Triad, a novel LMM-based method incorporating an expert-guided region-of-interest tokenizer and manufacturing process for industrial anomaly detection. Extensive experiments show that our Triad not only demonstrates competitive performance against current LMMs but also achieves further improved accuracy when equipped with manufacturing processes. Source code, training data, and pre-trained models will be publicly available at https://github.com/tzjtatata/Triad.
[192] SketchSplat: 3D Edge Reconstruction via Differentiable Multi-view Sketch Splatting
Haiyang Ying, Matthias Zwicker
Main category: cs.CV
TL;DR: The paper introduces SketchSplat, a method for reconstructing accurate, complete, and compact 3D edges from multi-view images using differentiable sketch splatting and adaptive topological operations.
Details
Motivation: Existing methods for 3D edge reconstruction from multi-view images suffer from noise-induced gaps and misalignment with input images due to reliance on 3D point sets.
Method: Proposes SketchSplat, representing 3D edges as parametric sketches (lines/curves) and optimizing them via differentiable rasterization of Gaussian points onto 2D edge images. Includes adaptive topological operations for compactness.
Result: Achieves state-of-the-art accuracy, completeness, and compactness on a CAD benchmark dataset.
Conclusion: SketchSplat effectively bridges 2D and 3D edge reconstruction, ensuring alignment and compactness through differentiable optimization and adaptive operations.
Abstract: Edges are one of the most basic parametric primitives to describe structural information in 3D. In this paper, we study parametric 3D edge reconstruction from calibrated multi-view images. Previous methods usually reconstruct a 3D edge point set from multi-view 2D edge images, and then fit 3D edges to the point set. However, noise in the point set may cause gaps among fitted edges, and the recovered edges may not align with input multi-view images since the edge fitting depends only on the reconstructed 3D point set. To mitigate these problems, we propose SketchSplat, a method to reconstruct accurate, complete, and compact 3D edges via differentiable multi-view sketch splatting. We represent 3D edges as sketches, which are parametric lines and curves defined by attributes including control points, scales, and opacity. During reconstruction, we iteratively sample Gaussian points from a set of sketches and rasterize the Gaussians onto 2D edge images. Then the gradient of the image loss can be back-propagated to optimize the sketch attributes. Our method bridges 2D edge images and 3D edges in a differentiable manner, which ensures that 3D edges align well with 2D images and leads to accurate and complete results. We also propose a series of adaptive topological operations to reduce redundant edges and apply them along with the sketch optimization, yielding a more compact reconstruction. Finally, we contribute an accurate 2D edge detector that improves the performance of both ours and existing methods. Experiments show that our method achieves state-of-the-art accuracy, completeness, and compactness on a benchmark CAD dataset.
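The differentiable core is sampling points along a parametric curve so that an image-space loss can push gradients back into the curve's attributes. A toy cubic Bezier sketch (rasterization and opacity omitted; the loss is a stand-in for the 2D edge-image loss):

```python
import torch

ctrl = torch.rand(4, 2, requires_grad=True)      # control points of one curve
t = torch.linspace(0, 1, 32).unsqueeze(1)        # sample positions along the curve

pts = ((1 - t) ** 3 * ctrl[0] + 3 * (1 - t) ** 2 * t * ctrl[1]
       + 3 * (1 - t) * t ** 2 * ctrl[2] + t ** 3 * ctrl[3])   # (32, 2) samples

loss = ((pts - 0.5) ** 2).mean()                 # stand-in for the rasterized loss
loss.backward()                                  # gradients reach the control points
```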
[193] Mitigating Hallucination in Large Vision-Language Models via Adaptive Attention Calibration
Mehrdad Fazli, Bowen Wei, Ahmet Sari, Ziwei Zhu
Main category: cs.CV
TL;DR: CAAC framework reduces hallucination in LVLMs by balancing attention biases and reinforcing visual grounding.
Details
Motivation: Address hallucination in LVLMs where models describe non-existent objects or attributes, especially in open-ended and long-form generation.
Method: Two-step approach: Visual-Token Calibration (VTC) balances attention across visual tokens, and Adaptive Attention Re-Scaling (AAR) reinforces visual grounding based on model confidence.
Result: Outperforms baselines on CHAIR, AMBER, and POPE benchmarks, reducing hallucination, especially in long-form generations.
Conclusion: CAAC effectively mitigates hallucination in LVLMs by addressing spatial perception and modality biases.
Abstract: Large vision-language models (LVLMs) achieve impressive performance on multimodal tasks but often suffer from hallucination, confidently describing objects or attributes not present in the image. Current training-free interventions struggle to maintain accuracy in open-ended and long-form generation scenarios. We introduce the Confidence-Aware Attention Calibration (CAAC) framework to address this challenge by targeting two key biases: spatial perception bias, which distributes attention disproportionately across image tokens, and modality bias, which shifts focus from visual to textual inputs over time. CAAC employs a two-step approach: Visual-Token Calibration (VTC) to balance attention across visual tokens, and Adaptive Attention Re-Scaling (AAR) to reinforce visual grounding guided by the model’s confidence. This confidence-driven adjustment ensures consistent visual alignment during generation. Experiments on CHAIR, AMBER, and POPE benchmarks demonstrate that CAAC outperforms baselines, particularly in long-form generations, effectively reducing hallucination.
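The re-scaling step can be pictured as a confidence-gated boost of the attention mass on image tokens, followed by renormalization. A toy sketch; the gain schedule is our own illustration, not the paper's exact formula:

```python
import torch

attn = torch.softmax(torch.randn(1, 24), dim=-1)    # one query over 24 tokens
is_visual = torch.zeros(24, dtype=torch.bool)
is_visual[:16] = True                                # first 16 tokens are image tokens

confidence = 0.4                                     # e.g., max next-token probability
gain = 1.0 + (1.0 - confidence)                      # lower confidence -> stronger boost

rescaled = attn.clone()
rescaled[:, is_visual] *= gain
rescaled = rescaled / rescaled.sum(dim=-1, keepdim=True)   # valid distribution again
```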
[194] Masked Autoencoder Self Pre-Training for Defect Detection in Microelectronics
Nikolai Röhrich, Alwin Hoffmann, Richard Nordsieck, Emilio Zarbali, Alireza Javanmardi
Main category: cs.CV
TL;DR: The paper proposes a self pre-training framework for vision transformers (ViT) in microelectronics defect detection, addressing data scarcity and domain dissimilarity issues. It outperforms CNNs and other ViT approaches.
Details
Motivation: Transformers outperform CNNs in many vision tasks but lag in microelectronics defect detection due to data scarcity and domain dissimilarity. Transfer learning from natural images is ineffective here.
Method: A resource-efficient ViT pre-training framework using masked autoencoders (MAE) is introduced, pre-trained directly on the target dataset (under 10,000 SAM images).
Result: The approach outperforms supervised ViT, ViT pre-trained on natural images, and state-of-the-art CNNs. It also focuses on defect-relevant features, improving interpretability.
Conclusion: Self pre-training yields defect-specific feature representations, making transformers more interpretable and generalizable for data-sparse microelectronics defect detection.
Abstract: While transformers have surpassed convolutional neural networks (CNNs) in various computer vision tasks, microelectronics defect detection still largely relies on CNNs. We hypothesize that this gap is due to the fact that a) transformers have an increased need for data and b) (labelled) image generation procedures for microelectronics are costly, and data is therefore sparse. Whereas in other domains, pre-training on large natural image datasets can mitigate this problem, in microelectronics transfer learning is hindered due to the dissimilarity of domain data and natural images. We address this challenge through self pre-training, where models are pre-trained directly on the target dataset, rather than another dataset. We propose a resource-efficient vision transformer (ViT) pre-training framework for defect detection in microelectronics based on masked autoencoders (MAE). We perform pre-training and defect detection using a dataset of less than 10,000 scanning acoustic microscopy (SAM) images. Our experimental results show that our approach leads to substantial performance gains compared to a) supervised ViT, b) ViT pre-trained on natural image datasets, and c) state-of-the-art CNN-based defect detection models used in microelectronics. Additionally, interpretability analysis reveals that our self pre-trained models attend to defect-relevant features such as cracks in the solder material, while baseline models often attend to spurious patterns. This shows that our approach yields defect-specific feature representations, resulting in more interpretable and generalizable transformer models for this data-sparse domain.
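The masking step at the heart of MAE-style self pre-training takes only a few lines: keep a small random subset of patch tokens, encode those, and train a light decoder to reconstruct the rest. A sketch with illustrative shapes and ratio, not the paper's exact configuration:

```python
import torch

B, N, D, mask_ratio = 8, 196, 768, 0.75
tokens = torch.randn(B, N, D)                    # embedded SAM-image patch tokens

keep = int(N * (1 - mask_ratio))
noise = torch.rand(B, N)                         # random score per token
ids = noise.argsort(dim=1)[:, :keep]             # indices of tokens to keep
visible = torch.gather(tokens, 1, ids.unsqueeze(-1).expand(-1, -1, D))
# `visible` goes through the ViT encoder; a light decoder reconstructs
# the masked patches, with pre-training done on the target dataset itself.
```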
[195] CulturalFrames: Assessing Cultural Expectation Alignment in Text-to-Image Models and Evaluation Metrics
Shravan Nayak, Mehar Bhatia, Xiaofeng Zhang, Verena Rieser, Lisa Anne Hendricks, Sjoerd van Steenkiste, Yash Goyal, Karolina Stańczak, Aishwarya Agrawal
Main category: cs.CV
TL;DR: The study quantifies cultural misalignment in text-to-image (T2I) models, revealing a 44% failure rate in meeting cultural expectations, with explicit failures at 68% and implicit at 49%. It introduces CulturalFrames, a benchmark for evaluating cultural representation, and highlights poor correlation of existing metrics with human judgments.
Details
Motivation: Address concerns about T2I models' ability to represent diverse cultural contexts accurately, as failures can stereotype communities and reduce usability.
Method: Introduces CulturalFrames, a benchmark with 983 prompts, 3637 images from 4 T2I models, and 10k human annotations across 10 countries and 5 socio-cultural domains. Quantifies alignment with explicit and implicit cultural expectations.
Result: Cultural expectations are missed 44% of the time (explicit: 68%, implicit: 49%). Existing T2I metrics poorly correlate with human judgments of cultural alignment.
Conclusion: The study exposes gaps in T2I models’ cultural representation, provides a testbed (CulturalFrames), and suggests actionable steps for improving global usability.
Abstract: The increasing ubiquity of text-to-image (T2I) models as tools for visual content generation raises concerns about their ability to accurately represent diverse cultural contexts – where missed cues can stereotype communities and undermine usability. In this work, we present the first study to systematically quantify the alignment of T2I models and evaluation metrics with respect to both explicit (stated) as well as implicit (unstated, implied by the prompt’s cultural context) cultural expectations. To this end, we introduce CulturalFrames, a novel benchmark designed for rigorous human evaluation of cultural representation in visual generations. Spanning 10 countries and 5 socio-cultural domains, CulturalFrames comprises 983 prompts, 3637 corresponding images generated by 4 state-of-the-art T2I models, and over 10k detailed human annotations. We find that across models and countries, cultural expectations are missed an average of 44% of the time. Among these failures, explicit expectations are missed at a surprisingly high average rate of 68%, while implicit expectation failures are also significant, averaging 49%. Furthermore, we show that existing T2I evaluation metrics correlate poorly with human judgments of cultural alignment, irrespective of their internal reasoning. Collectively, our findings expose critical gaps, provide a concrete testbed, and outline actionable directions for developing culturally informed T2I models and metrics that improve global usability.
[196] Minimal Sensing for Orienting a Solar Panel
Jeremy Klotz, Shree K. Nayar
Main category: cs.CV
TL;DR: A method using four photodetectors to optimize solar panel orientation for maximum irradiance, overcoming local maxima by blurring the irradiance function, validated via simulations and real-world experiments.
Details
Motivation: To maximize solar energy harvesting by addressing the challenge of multiple local maxima in irradiance functions, which complicates gradient-based optimization.
Method: Uses four photodetectors to iteratively adjust panel tilt, blurring the irradiance function to eliminate local maxima, enabling gradient ascent. Validated with simulations and real-world experiments.
Result: Improved energy harvesting in diverse environments (direct sunlight, cloudy skies, urban settings, indoor lighting) compared to standard methods.
Conclusion: The approach effectively maximizes solar panel irradiance by transforming the problem into a unimodal one, demonstrating robustness in varied conditions.
Abstract: A solar panel harvests the most energy when pointing in the direction that maximizes the total illumination (irradiance) falling on it. Given an arbitrary panel orientation and an arbitrary environmental illumination, we address the problem of finding the direction of maximum total irradiance. We develop a minimal sensing approach where measurements from just four photodetectors are used to iteratively vary the tilt of the panel to maximize the irradiance. Many environments produce irradiance functions with multiple local maxima. As a result, simply measuring the gradient of the irradiance function and applying gradient ascent will not work. We show that a larger, optimized tilt between the detectors and the panel is equivalent to blurring the irradiance function. This has the effect of eliminating local maxima and turning the irradiance function into a unimodal one, whose maximum can be found using gradient ascent. We show that there is a close relationship between our approach and scale space theory. We collected a large dataset of high-dynamic range lighting environments in Manhattan, called UrbanSky. We use this dataset to conduct simulations to verify the robustness of our approach. Next, we simulate the energy harvested using our approach under dynamic illumination. Finally, we built a portable solar panel with four compact detectors and an actuator to conduct experiments in various real-world settings: direct sunlight, cloudy sky, urban settings with occlusions and shadows, and complex indoor lighting. In all cases, we show improvements in harvested energy compared to standard approaches for orienting a solar panel.
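Once the detector tilt has blurred the irradiance into a unimodal function, the control loop is plain gradient ascent from finite differences: opposing detector pairs estimate the two tilt-axis derivatives, and the panel steps along them. A toy sketch with a synthetic irradiance function and made-up step sizes:

```python
import numpy as np

def irradiance(theta, phi):
    # placeholder for the blurred, unimodal irradiance over panel tilt
    return np.exp(-((theta - 0.4) ** 2 + (phi + 0.2) ** 2))

theta, phi, eps, lr = 0.0, 0.0, 0.05, 0.5
for _ in range(100):
    # two opposing detector pairs give finite-difference gradient estimates
    g_theta = (irradiance(theta + eps, phi) - irradiance(theta - eps, phi)) / (2 * eps)
    g_phi = (irradiance(theta, phi + eps) - irradiance(theta, phi - eps)) / (2 * eps)
    theta += lr * g_theta
    phi += lr * g_phi

print(round(theta, 2), round(phi, 2))   # converges near the maximum at (0.4, -0.2)
```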
[197] Context-based Motion Retrieval using Open Vocabulary Methods for Autonomous Driving
Stefan Englmeier, Max A. Büttner, Katharina Winter, Fabian B. Flohr
Main category: cs.CV
TL;DR: A novel framework for retrieving rare human behavior scenarios in autonomous driving datasets using multimodal embeddings and text queries, outperforming state-of-the-art models.
Details
Motivation: To address the challenge of identifying rare and complex human behaviors in large-scale datasets for robust evaluation of autonomous driving systems.
Method: Combines SMPL-based motion sequences and video frames into a shared multimodal embedding space aligned with natural language for scalable retrieval via text queries.
Result: Achieves up to 27.5% higher accuracy in motion-context retrieval compared to state-of-the-art models on the WayMoCo dataset.
Conclusion: The proposed framework effectively retrieves human behavior scenarios, enhancing evaluation and generalization for autonomous driving systems.
Abstract: Autonomous driving systems must operate reliably in safety-critical scenarios, particularly those involving unusual or complex behavior by Vulnerable Road Users (VRUs). Identifying these edge cases in driving datasets is essential for robust evaluation and generalization, but retrieving such rare human behavior scenarios within the long tail of large-scale datasets is challenging. To support targeted evaluation of autonomous driving systems in diverse, human-centered scenarios, we propose a novel context-aware motion retrieval framework. Our method combines Skinned Multi-Person Linear (SMPL)-based motion sequences and corresponding video frames before encoding them into a shared multimodal embedding space aligned with natural language. Our approach enables the scalable retrieval of human behavior and their context through text queries. This work also introduces our dataset WayMoCo, an extension of the Waymo Open Dataset. It contains automatically labeled motion and scene context descriptions derived from generated pseudo-ground-truth SMPL sequences and corresponding image data. Our approach outperforms state-of-the-art models by up to 27.5% accuracy in motion-context retrieval, when evaluated on the WayMoCo dataset.
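As a rough sketch of the retrieval mechanics (placeholder random encoders stand in for the paper’s fused motion-video and text towers), retrieval reduces to cosine similarity in the shared embedding space:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
D = 256  # shared embedding dimension

# Placeholder projections: the real framework fuses SMPL motion sequences
# with video frames before mapping into a language-aligned space.
motion_video_proj = torch.nn.Linear(512, D)
text_proj = torch.nn.Linear(384, D)

clip_feats = torch.randn(100, 512)   # 100 database clips (fused features)
query_feat = torch.randn(1, 384)     # one natural-language query feature

clip_emb = F.normalize(motion_video_proj(clip_feats), dim=-1)
query_emb = F.normalize(text_proj(query_feat), dim=-1)

# Retrieval = cosine similarity in the shared space, highest first.
scores = query_emb @ clip_emb.T      # (1, 100)
topk = scores.topk(k=5, dim=-1)
print("top-5 clip indices:", topk.indices.tolist())
```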
[198] SPIE: Semantic and Structural Post-Training of Image Editing Diffusion Models with AI feedback
Elior Benarous, Yilun Du, Heng Yang
Main category: cs.CV
TL;DR: SPIE is a reinforcement learning-based method for improving instruction-based image editing diffusion models, enhancing alignment with user prompts and input image consistency without heavy human annotation.
Details
Motivation: Addressing challenges in aligning image edits with user instructions and maintaining input image consistency, SPIE aims to simplify the process while improving realism and control.Method: An online reinforcement learning framework aligns the diffusion model with human preferences, using visual prompts for detailed control and requiring minimal training data (5 reference images).
Result: SPIE achieves precise, structurally coherent edits in complex scenes with high fidelity, requiring only 10 training steps. It also enhances robotics simulations’ visual realism.
Conclusion: SPIE simplifies highly specific image edits, demonstrating versatility in complex scenes and robotics applications, with minimal training requirements.
Abstract: This paper presents SPIE: a novel approach for semantic and structural post-training of instruction-based image editing diffusion models, addressing key challenges in alignment with user prompts and consistency with input images. We introduce an online reinforcement learning framework that aligns the diffusion model with human preferences without relying on extensive human annotations or curating a large dataset. Our method significantly improves the alignment with instructions and realism in two ways. First, SPIE captures fine nuances in the desired edit by leveraging a visual prompt, enabling detailed control over visual edits without lengthy textual prompts. Second, it achieves precise and structurally coherent modifications in complex scenes while maintaining high fidelity in instruction-irrelevant areas. This approach simplifies users’ efforts to achieve highly specific edits, requiring only 5 reference images depicting a certain concept for training. Experimental results demonstrate that SPIE can perform intricate edits in complex scenes, after just 10 training steps. Finally, we showcase the versatility of our method by applying it to robotics, where targeted image edits enhance the visual realism of simulated environments, which improves their utility as a proxy for real-world settings.
[199] LM-MCVT: A Lightweight Multi-modal Multi-view Convolutional-Vision Transformer Approach for 3D Object Recognition
Songsong Xiong, Hamidreza Kasaei
Main category: cs.CV
TL;DR: The paper introduces LM-MCVT, a lightweight multi-modal multi-view network, for improved 3D object recognition in robotics, achieving 95.6% accuracy on ModelNet40 and robust performance on real-world data.
Details
Motivation: Robots struggle with 3D object recognition in complex human-centered environments due to diverse object shapes and variability.Method: Proposes LM-MCVT with GEEF for multi-view fusion, combining convolutional encoders and transformers for feature extraction.
Result: Achieves 95.6% accuracy on ModelNet40 and superior performance on OmniObject3D via 5-fold cross-validation.
Conclusion: LM-MCVT is robust and effective for 3D object recognition in both synthetic and real-world settings.
Abstract: In human-centered environments such as restaurants, homes, and warehouses, robots often face challenges in accurately recognizing 3D objects. These challenges stem from the complexity and variability of these environments, including diverse object shapes. In this paper, we propose a novel Lightweight Multi-modal Multi-view Convolutional-Vision Transformer network (LM-MCVT) to enhance 3D object recognition in robotic applications. Our approach leverages the Globally Entropy-based Embeddings Fusion (GEEF) method to integrate multi-views efficiently. The LM-MCVT architecture incorporates pre- and mid-level convolutional encoders and local and global transformers to enhance feature extraction and recognition accuracy. We evaluate our method on the synthetic ModelNet40 dataset and achieve a recognition accuracy of 95.6% using a four-view setup, surpassing existing state-of-the-art methods. To further validate its effectiveness, we conduct 5-fold cross-validation on the real-world OmniObject3D dataset using the same configuration. Results consistently show superior performance, demonstrating the method’s robustness in 3D object recognition across synthetic and real-world 3D data.
[200] SoftHGNN: Soft Hypergraph Neural Networks for General Visual Recognition
Mengqi Lei, Yihong Wu, Siqi Li, Xinhu Zheng, Juan Wang, Yue Gao, Shaoyi Du
Main category: cs.CV
TL;DR: SoftHGNN introduces soft hyperedges with continuous weights to capture high-order visual semantics efficiently, outperforming traditional methods.
Details
Motivation: Existing hypergraph neural networks use static, hard hyperedge assignments, leading to redundancy and overlooking visual semantics continuity.Method: SoftHGNN uses learnable hyperedge prototypes for dynamic, differentiable associations and includes sparse hyperedge selection for efficiency.
Result: SoftHGNN achieves significant performance improvements across three tasks on five datasets.
Conclusion: SoftHGNN effectively captures high-order associations in visual scenes, offering a versatile and efficient solution.
Abstract: Visual recognition relies on understanding both the semantics of image tokens and the complex interactions among them. Mainstream self-attention methods, while effective at modeling global pair-wise relations, fail to capture high-order associations inherent in real-world scenes and often suffer from redundant computation. Hypergraphs extend conventional graphs by modeling high-order interactions and offer a promising framework for addressing these limitations. However, existing hypergraph neural networks typically rely on static and hard hyperedge assignments, leading to excessive and redundant hyperedges with hard binary vertex memberships that overlook the continuity of visual semantics. To overcome these issues, we present Soft Hypergraph Neural Networks (SoftHGNNs), which extend the methodology of hypergraph computation to make it truly efficient and versatile in visual recognition tasks. Our framework introduces the concept of soft hyperedges, where each vertex is associated with hyperedges via continuous participation weights rather than hard binary assignments. This dynamic and differentiable association is achieved by using learnable hyperedge prototypes. Through similarity measurements between token features and the prototypes, the model generates semantically rich soft hyperedges. SoftHGNN then aggregates messages over soft hyperedges to capture high-order semantics. To further enhance efficiency when scaling up the number of soft hyperedges, we incorporate a sparse hyperedge selection mechanism that activates only the top-k important hyperedges, along with a load-balancing regularizer to ensure balanced hyperedge utilization. Experimental results across three tasks on five datasets demonstrate that SoftHGNN efficiently captures high-order associations in visual scenes, achieving significant performance improvements.
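A minimal sketch of the soft-hyperedge computation described above (dimensions, normalization, and the top-k rule are illustrative assumptions): tokens join every hyperedge with continuous weights, messages are pooled per hyperedge and scattered back.

```python
import torch

B, N, D, E = 2, 196, 64, 16   # batch, tokens, feature dim, soft hyperedges

tokens = torch.randn(B, N, D)
prototypes = torch.nn.Parameter(torch.randn(E, D))  # learnable hyperedge prototypes

# Continuous participation weights: every token joins every hyperedge softly.
logits = tokens @ prototypes.T                       # (B, N, E)
part = logits.softmax(dim=-1)

# Message passing: pool tokens into hyperedges, then scatter back to tokens.
edge_msg = torch.einsum('bne,bnd->bed', part, tokens)         # vertex -> hyperedge
edge_msg = edge_msg / part.sum(dim=1).unsqueeze(-1).clamp(min=1e-6)
out = torch.einsum('bne,bed->bnd', part, edge_msg)            # hyperedge -> vertex

# Sparse selection: keep only the top-k most active hyperedges per sample.
k = 8
edge_score = part.sum(dim=1)                                  # (B, E)
top_edges = edge_score.topk(k, dim=-1).indices
print(out.shape, top_edges.shape)  # torch.Size([2, 196, 64]) torch.Size([2, 8])
```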
[201] ViStoryBench: Comprehensive Benchmark Suite for Story Visualization
Cailin Zhuang, Ailin Huang, Wei Cheng, Jingwei Wu, Yaoqi Hu, Jiaqi Liao, Hongyuan Wang, Xinyao Liao, Weiwei Cai, Hengyuan Xu, Xuanyang Zhang, Xianfang Zeng, Zhewei Huang, Gang Yu, Chi Zhang
Main category: cs.CV
TL;DR: ViStoryBench is a new benchmark for evaluating story visualization models, addressing gaps in existing benchmarks by covering diverse narratives, styles, and character settings with rich annotations and automated metrics.
Details
Motivation: Existing benchmarks for story visualization are limited in scope, lacking real-world complexity and hindering a nuanced understanding of model capabilities.Method: ViStoryBench uses multi-shot scripts from curated stories, assisted by large language models and human verification, with carefully curated character references. Automated metrics assess consistency, style, adherence, quality, and artifacts.
Result: The benchmark evaluates a range of models, validated by human studies, providing a high-fidelity, multi-dimensional evaluation suite.
Conclusion: ViStoryBench facilitates systematic analysis and advances in visual storytelling by addressing current limitations.
Abstract: Story visualization aims to generate coherent image sequences that faithfully depict a narrative and align with character references. Despite progress in generative models, existing benchmarks are narrow in scope, often limited to short prompts, no character reference, or single-image cases, and fall short of real-world storytelling complexity. This hinders a nuanced understanding of model capabilities and limitations. We present ViStoryBench, a comprehensive benchmark designed to evaluate story visualization models across diverse narrative structures, visual styles, and character settings. The benchmark features richly annotated multi-shot scripts derived from curated stories spanning literature, film, and folklore. Large language models assist in story summarization and script generation, with all outputs verified by humans to ensure coherence and fidelity. Character references are carefully curated to maintain intra-story consistency across varying artistic styles. To enable thorough evaluation, ViStoryBench introduces a set of automated metrics that assess character consistency, style similarity, prompt adherence, aesthetic quality, and generation artifacts such as copy-paste behavior. These metrics are validated through human studies, and used to benchmark a broad range of open-source and commercial models. ViStoryBench offers a high-fidelity, multi-dimensional evaluation suite that facilitates systematic analysis and fosters future progress in visual storytelling.
[202] Context as Memory: Scene-Consistent Interactive Long Video Generation with Memory Retrieval
Jiwen Yu, Jianhong Bai, Yiran Qin, Quande Liu, Xintao Wang, Pengfei Wan, Di Zhang, Xihui Liu
Main category: cs.CV
TL;DR: Context-as-Memory enhances long video generation by using historical context as memory, with efficient frame storage and retrieval, outperforming SOTAs.
Details
Motivation: Existing methods lack scene-consistent memory in long video generation due to limited historical context use.Method: Proposes Context-as-Memory with frame-based storage and concatenation for conditioning, plus a Memory Retrieval module for efficient context selection.
Result: Achieves superior memory capabilities and generalizes to open-domain scenarios.
Conclusion: Context-as-Memory is effective for interactive long video generation, reducing computational overhead while maintaining performance.
Abstract: Recent advances in interactive video generation have shown promising results, yet existing approaches struggle with scene-consistent memory capabilities in long video generation due to limited use of historical context. In this work, we propose Context-as-Memory, which utilizes historical context as memory for video generation. It includes two simple yet effective designs: (1) storing context in frame format without additional post-processing; (2) conditioning by concatenating context and frames to be predicted along the frame dimension at the input, requiring no external control modules. Furthermore, considering the enormous computational overhead of incorporating all historical context, we propose the Memory Retrieval module to select truly relevant context frames by determining FOV (Field of View) overlap between camera poses, which significantly reduces the number of candidate frames without substantial information loss. Experiments demonstrate that Context-as-Memory achieves superior memory capabilities in interactive long video generation compared to SOTAs, even generalizing effectively to open-domain scenarios not seen during training. Our project page is available at https://context-as-memory.github.io/.
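A toy sketch of pose-based context retrieval (the overlap proxy below is our simplification; the paper determines actual FOV overlap between camera poses):

```python
import numpy as np

# Each memory frame stores a camera pose: a position (3,) and a unit
# forward direction (3,). As a crude FOV-overlap proxy (an assumption),
# views overlap when cameras are close and look in similar directions.
def overlap_score(pos_a, dir_a, pos_b, dir_b, max_dist=5.0):
    angular = max(0.0, float(np.dot(dir_a, dir_b)))              # direction agreement
    proximity = max(0.0, 1.0 - np.linalg.norm(pos_a - pos_b) / max_dist)
    return angular * proximity

rng = np.random.default_rng(0)
memory = [(rng.normal(size=3), rng.normal(size=3)) for _ in range(50)]
memory = [(p, d / np.linalg.norm(d)) for p, d in memory]

cur_pos, cur_dir = np.zeros(3), np.array([1.0, 0.0, 0.0])
scores = [overlap_score(cur_pos, cur_dir, p, d) for p, d in memory]

# Condition generation only on the few most-overlapping context frames.
top_ids = np.argsort(scores)[::-1][:8]
print("retrieved context frames:", top_ids.tolist())
```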
[203] Multiple Stochastic Prompt Tuning for Few-shot Adaptation under Extreme Domain Shift
Debarshi Brahma, Soma Biswas
Main category: cs.CV
TL;DR: The paper proposes MIST, a framework for adapting CLIP to datasets with extreme distribution shifts using few labeled examples, handling all classes simultaneously.
Details
Motivation: Existing few-shot learning methods for CLIP don't handle extreme domain shifts well, and cross-domain few-shot learning methods are limited to episodic settings.Method: MIST introduces multiple learnable prompts per class, modeled as Gaussian distributions, to capture diverse visual modes and enhance generalization.
Result: Experiments show MIST outperforms state-of-the-art methods in adapting CLIP to domain-shifted datasets.
Conclusion: MIST effectively addresses the challenge of extreme distribution shifts in few-shot learning for CLIP, offering practical applicability.
Abstract: Foundation Vision-Language Models (VLMs) like CLIP exhibit strong generalization capabilities due to large-scale pretraining on diverse image-text pairs. However, their performance often degrades when applied to target datasets with significant distribution shifts in both visual appearance and class semantics. Recent few-shot learning approaches adapt CLIP to downstream tasks using limited labeled data via adapter or prompt tuning, but are not specifically designed to handle such extreme domain shifts. Conversely, some works addressing cross-domain few-shot learning consider such domain-shifted scenarios but operate in an episodic setting with only a few classes per episode, limiting their applicability to real-world deployment, where all classes must be handled simultaneously. To address this gap, we propose a novel framework, MIST (Multiple Stochastic Prompt Tuning), for efficiently adapting CLIP to datasets with extreme distribution shifts using only a few labeled examples, in scenarios involving all classes at once. Specifically, we introduce multiple learnable prompts per class to effectively capture diverse modes in visual representations arising from distribution shifts. To further enhance generalization, these prompts are modeled as learnable Gaussian distributions, enabling efficient exploration of the prompt parameter space and reducing overfitting caused by limited supervision. Extensive experiments and comparisons with state-of-the-art methods demonstrate the effectiveness of the proposed framework.
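A minimal sketch of prompts modeled as learnable Gaussian distributions, sampled with the reparameterization trick so training stays differentiable (all shapes are illustrative assumptions):

```python
import torch

C, M, L, D = 10, 4, 8, 512  # classes, prompts per class, prompt length, dim

# Each of the M prompts per class is a learnable Gaussian; multiple prompts
# can capture distinct visual modes induced by the distribution shift.
mu = torch.nn.Parameter(torch.randn(C, M, L, D) * 0.02)
log_sigma = torch.nn.Parameter(torch.full((C, M, L, D), -3.0))

def sample_prompts():
    # Reparameterization: mu + sigma * eps keeps gradients flowing to
    # the distribution parameters while regularizing the prompt space.
    eps = torch.randn_like(mu)
    return mu + log_sigma.exp() * eps   # (C, M, L, D)

prompts = sample_prompts()
print(prompts.shape)  # torch.Size([10, 4, 8, 512])
```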
[204] HypeVPR: Exploring Hyperbolic Space for Perspective to Equirectangular Visual Place Recognition
Suhan Woo, Seongwon Lee, Jinwoo Jang, Euntai Kim
Main category: cs.CV
TL;DR: HypeVPR introduces a hierarchical embedding framework in hyperbolic space for Visual Place Recognition (VPR), leveraging panoramic views’ hierarchical structures to improve accuracy and efficiency.
Details
Motivation: Address the challenges of Perspective-to-Equirectangular (P2E) VPR by exploiting hierarchical structures in panoramic views.Method: Uses hyperbolic space for hierarchical feature representation and a coarse-to-fine search strategy for efficient retrieval.
Result: Outperforms existing methods, accelerates retrieval, and reduces storage needs.
Conclusion: HypeVPR is effective for P2E VPR, offering improved performance and efficiency.
Abstract: When applying Visual Place Recognition (VPR) to real-world mobile robots and similar applications, perspective-to-equirectangular (P2E) formulation naturally emerges as a suitable approach to accommodate diverse query images captured from various viewpoints. In this paper, we introduce HypeVPR, a novel hierarchical embedding framework in hyperbolic space, designed to address the unique challenges of P2E VPR. The key idea behind HypeVPR is that visual environments captured by panoramic views exhibit inherent hierarchical structures. To leverage this property, we employ hyperbolic space to represent hierarchical feature relationships and preserve distance properties within the feature space. To achieve this, we propose a hierarchical feature aggregation mechanism that organizes local-to-global feature representations within hyperbolic space. Additionally, HypeVPR adopts an efficient coarse-to-fine search strategy to enable flexible control over accuracy-efficiency trade-offs and ensure robust matching even between descriptors from different image types. This approach allows HypeVPR to outperform existing methods while significantly accelerating retrieval and reducing database storage requirements. The code and models will be released at https://github.com/suhan-woo/HypeVPR.git.
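For reference, the standard geodesic distance in the Poincare ball model of hyperbolic space, on which such hierarchical embeddings typically rely (a generic formula, not code from the paper):

```python
import torch

def poincare_distance(u, v, eps=1e-5):
    # Geodesic distance in the Poincare ball: d(u, v) =
    # arcosh(1 + 2*||u-v||^2 / ((1-||u||^2) * (1-||v||^2))).
    # Hierarchies embed with low distortion: points near the boundary
    # behave like fine-grained leaves, points near the origin like roots.
    uu = u.pow(2).sum(-1)
    vv = v.pow(2).sum(-1)
    uv = (u - v).pow(2).sum(-1)
    x = 1 + 2 * uv / ((1 - uu).clamp(min=eps) * (1 - vv).clamp(min=eps))
    return torch.acosh(x.clamp(min=1 + eps))

coarse = torch.tensor([0.05, 0.0])   # global descriptor near the origin
fine = torch.tensor([0.85, 0.3])     # local descriptor near the boundary
print(poincare_distance(coarse, fine))
```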
[205] Investigating the Relationship between the Weighted Figure of Merit and Rosin’s Measure
Bimal Kumar Ray
Main category: cs.CV
TL;DR: The paper investigates the relationship between the weighted figure of merit and Rosin’s measure for polygonal approximation, finding them theoretically and empirically independent.
Details
Motivation: To determine if the weighted figure of merit can substitute Rosin's measure for comparing suboptimal polygonal approximation schemes.Method: Theoretical analysis, experimental investigation using a public dataset, and statistical analysis (Pearson’s correlation and non-linear measures).
Result: The two measures are theoretically independent and empirically uncorrelated.
Conclusion: The weighted figure of merit cannot replace Rosin’s measure for comparing suboptimal schemes.
Abstract: Many studies have been conducted to solve the problem of approximating a digital boundary by piecewise straight-line segments for the further processing required in computer vision applications. The authors of these studies compared their schemes to determine the best one. The initial measure used to assess the goodness of fit of a polygonal approximation was the figure of merit. Later, it was noted that this measure was not an appropriate metric for a valid reason, which is why Rosin, through mathematical analysis, introduced a measure called merit. However, this measure involves an optimal scheme of polygonal approximation, so it is time-consuming to compute when assessing the goodness of fit of an approximation. This led many researchers to use a weighted figure of merit as a substitute for Rosin’s measure to compare suboptimal schemes. This communication investigates whether the two measures (the weighted figure of merit and Rosin’s measure) are related, so that one could be used instead of the other; toward this end, theoretical analysis, experimental investigation, and statistical analysis are carried out. The mathematical formulas for the weighted figure of merit and Rosin’s measure are analyzed, and through proof of theorems, it is found that the two measures are theoretically independent of each other. The graphical analysis of experiments carried out using a public dataset supports the results of the theoretical analysis. The statistical analysis via Pearson’s correlation coefficient and a non-linear correlation measure also revealed that the two measures are uncorrelated. This analysis leads one to conclude that if a suboptimal scheme is found to be better (or worse) than another suboptimal scheme according to Rosin’s measure, the same conclusion cannot be drawn using the weighted figure of merit, so the weighted figure of merit cannot be used instead of Rosin’s measure.
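To illustrate the statistical tools named above on synthetic data (fabricated for illustration only, not the paper’s measurements): two independent score lists yield near-zero linear and rank correlations.

```python
import numpy as np

# Hypothetical paired scores of the two measures on the same approximations;
# drawn independently, so no relationship exists by construction.
rng = np.random.default_rng(1)
weighted_fom = rng.uniform(0.1, 1.0, size=200)
rosin_merit = rng.uniform(0.1, 1.0, size=200)

# Pearson's r detects linear association.
r = np.corrcoef(weighted_fom, rosin_merit)[0, 1]
print(f"Pearson r = {r:+.3f}")    # near zero

# A simple non-linear check: correlation of ranks (Spearman's rho).
def ranks(x):
    return np.argsort(np.argsort(x))

rho = np.corrcoef(ranks(weighted_fom), ranks(rosin_merit))[0, 1]
print(f"Spearman rho = {rho:+.3f}")  # also near zero
```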
[206] MUG: Pseudo Labeling Augmented Audio-Visual Mamba Network for Audio-Visual Video Parsing
Langyu Wang, Bingke Zhu, Yingying Chen, Yiyuan Zhang, Ming Tang, Jinqiao Wang
Main category: cs.CV
TL;DR: The paper proposes MUG, a method combining pseudo-labeling and an audio-visual Mamba network to improve weakly-supervised AVVP by enhancing segment-level and event-level predictions.
Details
Motivation: Existing methods struggle with improving both segment-level and event-level predictions due to weakly-supervised limitations and model deficiencies.Method: Uses pseudo-labeling for data augmentation and an audio-visual Mamba network for feature processing and noise exclusion.
Result: MUG achieves state-of-the-art results on the LLP dataset, with gains of 2.1% and 1.2% in visual and audio segment-level metrics.
Conclusion: MUG effectively enhances AVVP performance by addressing noise interference and improving segment uniqueness.
Abstract: Weakly-supervised audio-visual video parsing (AVVP) aims to predict all modality-specific events and locate their temporal boundaries. Despite significant progress, due to the limitations of weak supervision and the deficiencies of the model architecture, existing methods are lacking in simultaneously improving both the segment-level prediction and the event-level prediction. In this work, we propose an audio-visual Mamba network with pseudo labeling aUGmentation (MUG) for emphasising the uniqueness of each segment and excluding the noise interference from the alternate modalities. Specifically, we annotate some of the pseudo-labels based on previous work. Using unimodal pseudo-labels, we perform cross-modal random combinations to generate new data, which can enhance the model’s ability to parse various segment-level event combinations. For feature processing and interaction, we employ an audio-visual Mamba network. The AV-Mamba enhances the ability to perceive different segments and excludes additional modal noise while sharing similar modal information. Our extensive experiments demonstrate that MUG improves state-of-the-art results on the LLP dataset in all metrics (e.g., gains of 2.1% and 1.2% in terms of visual segment-level and audio segment-level metrics). Our code is available at https://github.com/WangLY136/MUG.
[207] See the Forest and the Trees: A Synergistic Reasoning Framework for Knowledge-Based Visual Question Answering
Junjie Wang, Yunhan Tang, Yijie Wang, Zhihao Yuan, Huan Wang, Yangfan He, Bin Li
Main category: cs.CV
TL;DR: Synergos-VQA introduces a synergistic reasoning framework for KBVQA, combining holistic, structural, and causal evidence to outperform existing MLLMs.
Details
Motivation: Current MLLMs rely on uni-dimensional evidence, limiting robust understanding. Synergos-VQA aims to address this by integrating multi-faceted evidence.Method: The framework generates and fuses three evidence streams: holistic (scene perception), structural (key objects), and causal (counterfactual grounding).
Result: Synergos-VQA achieves state-of-the-art performance on benchmarks like OK-VQA and A-OKVQA, and enhances open-source MLLMs.
Conclusion: Superior methodological design, as demonstrated by Synergos-VQA, can outperform reliance on model scale alone.
Abstract: Multimodal Large Language Models (MLLMs) have pushed the frontiers of Knowledge-Based Visual Question Answering (KBVQA), yet their reasoning is fundamentally bottlenecked by a reliance on uni-dimensional evidence. This “seeing only the trees, but not the forest” approach prevents robust, multi-faceted understanding. Inspired by the principle of seeing both the forest and trees, we propose Synergos-VQA, a novel synergistic reasoning framework. At its core, Synergos-VQA concurrently generates and fuses three complementary evidence streams at inference time: (1) Holistic Evidence to perceive the entire scene (the “forest”), (2) Structural Evidence from a prototype-driven module to identify key objects (the “trees”), and (3) Causal Evidence from a counterfactual probe to ensure the reasoning is robustly grounded. By synergistically fusing this multi-faceted evidence, our framework achieves a more comprehensive and reliable reasoning process. Extensive experiments show that Synergos-VQA decisively establishes a new state-of-the-art on three challenging benchmarks, including OK-VQA and A-OKVQA. Furthermore, our approach demonstrates strong plug-and-play capabilities, significantly boosting various open-source MLLMs and proving that superior methodological design can outperform sheer model scale.
[208] GPSMamba: A Global Phase and Spectral Prompt-guided Mamba for Infrared Image Super-Resolution
Yongsong Huang, Tomo Miyazaki, Xiaofeng Liu, Shinichiro Omachi
Main category: cs.CV
TL;DR: GPSMamba introduces a framework combining adaptive semantic-frequency prompts and non-causal supervision to enhance infrared image super-resolution, outperforming existing methods.
Details
Motivation: Infrared Image Super-Resolution (IRSR) struggles with low contrast and sparse textures, requiring robust long-range modeling. Existing methods like Mamba fragment 2D context due to 1D causal scanning.Method: Proposes GPSMamba with an Adaptive Semantic-Frequency State Space Module (ASF-SSM) and Thermal-Spectral Attention with Phase Consistency Loss for non-causal supervision.
Result: GPSMamba achieves state-of-the-art performance in infrared image restoration.
Conclusion: The framework effectively mitigates causal modeling limitations, offering a powerful paradigm for IRSR.
Abstract: Infrared Image Super-Resolution (IRSR) is challenged by the low contrast and sparse textures of infrared data, requiring robust long-range modeling to maintain global coherence. While State-Space Models like Mamba offer proficiency in modeling long-range dependencies for this task, their inherent 1D causal scanning mechanism fragments the global context of 2D images, hindering fine-detail restoration. To address this, we propose Global Phase and Spectral Prompt-guided Mamba (GPSMamba), a framework that synergizes architectural guidance with non-causal supervision. First, our Adaptive Semantic-Frequency State Space Module (ASF-SSM) injects a fused semantic-frequency prompt directly into the Mamba block, integrating non-local context to guide reconstruction. Then, a novel Thermal-Spectral Attention and Phase Consistency Loss provides explicit, non-causal supervision to enforce global structural and spectral fidelity. By combining these two innovations, our work presents a systematic strategy to mitigate the limitations of causal modeling. Extensive experiments demonstrate that GPSMamba achieves state-of-the-art performance, validating our approach as a powerful new paradigm for infrared image restoration. Code is available at https://github.com/yongsongH/GPSMamba.
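One plausible form of a phase consistency loss (our sketch; the paper’s exact formulation may differ): Fourier phase is a global, non-causal signal, since every coefficient depends on the whole image, so supervising it complements Mamba’s 1D causal scan.

```python
import torch

def phase_consistency_loss(pred, target, eps=1e-8):
    # Compare the phase of the 2D Fourier spectra of prediction and target.
    Fp = torch.fft.rfft2(pred)
    Ft = torch.fft.rfft2(target)
    unit_p = Fp / (Fp.abs() + eps)   # e^{i * phase(pred)}
    unit_t = Ft / (Ft.abs() + eps)   # e^{i * phase(target)}
    # Re(e^{i(a-b)}) = cos(a-b); 1 - cos is a smooth, wrap-aware distance.
    return (1 - (unit_p * unit_t.conj()).real).mean()

sr = torch.rand(1, 1, 64, 64, requires_grad=True)  # toy super-resolved image
hr = torch.rand(1, 1, 64, 64)                      # toy ground truth
loss = phase_consistency_loss(sr, hr)
loss.backward()                                    # differentiable end to end
print(f"{loss.item():.4f}")
```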
[209] RemoteReasoner: Towards Unifying Geospatial Reasoning Workflow
Liang Yao, Fan Liu, Hongbo Lu, Chuanyi Zhang, Rui Min, Shengxiang Xu, Shimin Di, Pai Peng
Main category: cs.CV
TL;DR: The paper proposes RemoteReasoner, a unified geospatial reasoning workflow using a multi-modal large language model (MLLM) and reinforcement learning to handle complex queries autonomously without task-specific fine-tuning.
Details
Motivation: Existing remote sensing methods lack autonomous reasoning and unified generalization, relying on supervised fine-tuning and task-specific heads.Method: RemoteReasoner integrates an MLLM for interpreting user instructions and localizing targets, with task transformation strategies for multi-granularity tasks, trained via reinforcement learning.
Result: RemoteReasoner achieves SOTA performance across multi-granularity tasks and demonstrates robust generalization on unseen tasks.
Conclusion: The framework advances geospatial reasoning by enabling autonomous, unified, and generalizable task handling without additional fine-tuning.
Abstract: Remote sensing imagery presents vast, inherently unstructured spatial data, necessitating sophisticated reasoning to interpret complex user intents and contextual relationships beyond simple recognition tasks. In this paper, we aim to construct an Earth observation workflow to handle complex queries by reasoning about spatial context and user intent. As a reasoning workflow, it should autonomously explore and construct its own inference paths, rather than being confined to predefined ground-truth sequences. Ideally, its architecture ought to be unified yet generalized, possessing capabilities to perform diverse reasoning tasks through one model without requiring additional fine-tuning. Existing remote sensing approaches rely on supervised fine-tuning paradigms and task-specific heads, limiting both autonomous reasoning and unified generalization. To this end, we propose RemoteReasoner, a unified workflow for geospatial reasoning. The design of RemoteReasoner integrates a multi-modal large language model (MLLM) for interpreting user instructions and localizing targets, together with task transformation strategies that enable multi-granularity tasks, including object-, region-, and pixel-level. In contrast to existing methods, our framework is trained with reinforcement learning (RL) to endow the MLLM sufficient reasoning autonomy. At the inference stage, our transformation strategies enable diverse task output formats without requiring task-specific decoders or further fine-tuning. Experiments demonstrated that RemoteReasoner achieves state-of-the-art (SOTA) performance across multi-granularity reasoning tasks. Furthermore, it retains the MLLM’s inherent generalization capability, demonstrating robust performance on unseen tasks and out-of-distribution categories.
[210] DriveIndia: An Object Detection Dataset for Diverse Indian Traffic Scenes
Rishav Kumar, D. Santhosh Reddy, P. Rajalakshmi
Main category: cs.CV
TL;DR: DriveIndia is a large-scale dataset for object detection in Indian traffic, featuring 66,986 images across 24 categories, with diverse conditions. Baseline results show 78.7% mAP50 using YOLO models.
Details
Motivation: To address the complexity and unpredictability of Indian traffic environments for autonomous driving research.Method: Dataset creation with 66,986 high-resolution images annotated in YOLO format, covering diverse conditions like weather, illumination, and traffic patterns. Baseline evaluation using YOLO models.
Result: Top-performing YOLO variant achieved 78.7% mAP50.
Conclusion: DriveIndia serves as a benchmark for robust object detection in uncertain road conditions and will be publicly available.
Abstract: We introduce DriveIndia, a large-scale object detection dataset purpose-built to capture the complexity and unpredictability of Indian traffic environments. The dataset contains 66,986 high-resolution images annotated in YOLO format across 24 traffic-relevant object categories, encompassing diverse conditions such as varied weather (fog, rain), illumination changes, heterogeneous road infrastructure, and dense, mixed traffic patterns; it was collected over 120+ hours and covers 3,400+ kilometers across urban, rural, and highway routes. DriveIndia offers a comprehensive benchmark for real-world autonomous driving challenges. We provide baseline results using state-of-the-art YOLO family models, with the top-performing variant achieving a mAP50 of 78.7%. Designed to support research in robust, generalizable object detection under uncertain road conditions, DriveIndia will be publicly available via the TiHAN-IIT Hyderabad dataset repository https://tihan.iith.ac.in/TiAND.html (Terrestrial Datasets -> Camera Dataset).
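For readers unfamiliar with the annotation format: YOLO labels store one normalized box per line as "class x_center y_center width height". A minimal parser (the example label line is hypothetical; category ids are dataset-specific):

```python
# YOLO-format labels: one object per line, coordinates normalized to [0, 1]
# relative to the image size. This converts a line to pixel corner coords.
def parse_yolo_line(line, img_w, img_h):
    cls, xc, yc, w, h = line.split()
    xc, yc, w, h = (float(v) for v in (xc, yc, w, h))
    x1 = (xc - w / 2) * img_w
    y1 = (yc - h / 2) * img_h
    return int(cls), (x1, y1, x1 + w * img_w, y1 + h * img_h)

# Hypothetical label line for a 1920x1080 frame.
print(parse_yolo_line("3 0.51 0.62 0.10 0.20", 1920, 1080))
```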
[211] On the Reliability of Vision-Language Models Under Adversarial Frequency-Domain Perturbations
Jordan Vice, Naveed Akhtar, Yansong Gao, Richard Hartley, Ajmal Mian
Main category: cs.CV
TL;DR: VLMs are vulnerable to subtle frequency-domain perturbations, affecting tasks like DeepFake detection and captioning, revealing their fragility and unreliability.
Details
Motivation: To expose vulnerabilities in VLMs when exposed to structured frequency-domain perturbations, undermining their reliability in critical tasks.Method: Design targeted frequency-domain image transformations to perturb VLM outputs, testing across five state-of-the-art VLMs and ten datasets.
Result: VLMs are sensitive to frequency-based cues, with outputs not aligning with semantic content, exposing fragility in captioning and authenticity detection.
Conclusion: VLMs’ reliability is challenged under realistic conditions, highlighting the need for more robust multimodal perception systems.
Abstract: Vision-Language Models (VLMs) are increasingly used as perceptual modules for visual content reasoning, including through captioning and DeepFake detection. In this work, we expose a critical vulnerability of VLMs to subtle, structured perturbations in the frequency domain. Specifically, we highlight how these feature transformations undermine authenticity/DeepFake detection and automated image captioning tasks. We design targeted image transformations operating in the frequency domain that systematically adjust VLM outputs on frequency-perturbed real and synthetic images. We demonstrate that the perturbation injection method generalizes across five state-of-the-art VLMs, including Qwen2/2.5 and BLIP models of different parameter scales. Experimenting across ten real and generated image datasets reveals that VLM judgments are sensitive to frequency-based cues and may not wholly align with semantic content. Crucially, we show that visually-imperceptible spatial frequency transformations expose the fragility of VLMs deployed for automated image captioning and authenticity detection tasks. Our findings under realistic, black-box constraints challenge the reliability of VLMs, underscoring the need for robust multimodal perception systems.
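A generic example of a visually subtle frequency-domain transformation in the spirit of those studied here (a simple band-gain filter, not the paper’s targeted method):

```python
import torch

def frequency_perturb(img, radius=0.25, gain=1.5):
    # Scale frequencies outside a low-frequency disc, leaving the coarse
    # appearance nearly untouched while shifting high-frequency statistics.
    spec = torch.fft.fftshift(torch.fft.fft2(img), dim=(-2, -1))
    H, W = img.shape[-2:]
    yy, xx = torch.meshgrid(
        torch.linspace(-1, 1, H), torch.linspace(-1, 1, W), indexing="ij")
    band = ((yy ** 2 + xx ** 2).sqrt() > radius).to(spec.dtype)
    spec = spec * (1 + (gain - 1) * band)
    out = torch.fft.ifft2(torch.fft.ifftshift(spec, dim=(-2, -1))).real
    return out.clamp(0, 1)

img = torch.rand(3, 224, 224)
print((frequency_perturb(img) - img).abs().mean())  # small spatial change
```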
[212] Half-Physics: Enabling Kinematic 3D Human Model with Physical Interactions
Li Siyao, Yao Feng, Omid Taheri, Chen Change Loy, Michael J. Black
Main category: cs.CV
TL;DR: A novel ‘half-physics’ approach enhances SMPL-X 3D human models by enabling dynamic physical interactions, eliminating issues like interpenetration and unrealistic object dynamics without requiring learning.
Details
Motivation: Current kinematic-based 3D human models (e.g., SMPL-X) lack physical interaction capabilities, leading to problems like interpenetration and unrealistic dynamics.Method: Proposes a ‘half-physics’ mechanism to convert kinematic motion into physics simulation, maintaining kinematic control while ensuring plausible interactions.
Result: The method avoids penetration and unrealistic dynamics, operates in real time, and generalizes to any body shape or motion without training.
Conclusion: The ‘half-physics’ approach effectively bridges the gap between kinematic models and physical realism, offering a practical solution for dynamic interactions.
Abstract: While current general-purpose 3D human models (e.g., SMPL-X) efficiently represent accurate human shape and pose, they lack the ability to physically interact with the environment due to their kinematic nature. As a result, kinematic-based interaction models often suffer from issues such as interpenetration and unrealistic object dynamics. To address this limitation, we introduce a novel approach that embeds SMPL-X into a tangible entity capable of dynamic physical interactions with its surroundings. Specifically, we propose a “half-physics” mechanism that transforms 3D kinematic motion into a physics simulation. Our approach maintains kinematic control over inherent SMPL-X poses while ensuring physically plausible interactions with scenes and objects, effectively eliminating penetration and unrealistic object dynamics. Unlike reinforcement learning-based methods, which demand extensive and complex training, our half-physics method is learning-free and generalizes to any body shape and motion; meanwhile, it operates in real time. Moreover, it preserves the fidelity of the original kinematic motion while seamlessly integrating physical interactions.
[213] How Does Bilateral Ear Symmetry Affect Deep Ear Features?
Kagan Ozturk, Deeksha Arun, Kevin W. Bowyer, Patrick Flynn
Main category: cs.CV
TL;DR: The paper explores the impact of bilateral ear symmetry on CNN-based ear recognition, showing that treating left and right ears separately improves performance.
Details
Motivation: To investigate how bilateral ear symmetry affects CNN-based ear recognition, as this aspect has been overlooked in prior research.Method: Develop an ear side classifier to categorize images as left or right, then evaluate the impact of side information during training and testing across five datasets.
Result: Treating left and right ears separately enhances performance, and ablation studies offer insights for optimizing CNN-based systems.
Conclusion: Incorporating ear side information improves recognition accuracy, providing practical guidance for large-scale ear recognition systems.
Abstract: Ear recognition has gained attention as a reliable biometric technique due to the distinctive characteristics of human ears. With the increasing availability of large-scale datasets, convolutional neural networks (CNNs) have been widely adopted to learn features directly from raw ear images, outperforming traditional hand-crafted methods. However, the effect of bilateral ear symmetry on the features learned by CNNs has received little attention in recent studies. In this paper, we investigate how bilateral ear symmetry influences the effectiveness of CNN-based ear recognition. To this end, we first develop an ear side classifier to automatically categorize ear images as either left or right. We then explore the impact of incorporating this side information during both training and testing. Cross-dataset evaluations are conducted on five datasets. Our results suggest that treating left and right ears separately during training and testing can lead to notable performance improvements. Furthermore, our ablation studies on alignment strategies, input sizes, and various hyperparameter settings provide practical insights into training CNN-based ear recognition systems on large-scale datasets to achieve higher verification rates.
[214] GMF-Drive: Gated Mamba Fusion with Spatial-Aware BEV Representation for End-to-End Autonomous Driving
Jian Wang, Chaokang Jiang, Haitao Xu
Main category: cs.CV
TL;DR: GMF-Drive introduces a gated Mamba fusion framework for autonomous driving, replacing transformers with efficient state-space models to improve performance and efficiency.
Details
Motivation: Current diffusion-based models for autonomous driving rely on transformer-based fusion, which has quadratic complexity and lacks spatial priors, limiting performance.Method: GMF-Drive uses a geometrically-augmented pillar format for LiDAR and a hierarchical gated Mamba fusion (GM-Fusion) architecture with state-space models (SSMs) to replace transformers.
Result: GMF-Drive achieves state-of-the-art performance on NAVSIM, outperforming DiffusionDrive, with linear complexity and better spatial awareness.
Conclusion: Task-specific SSMs can surpass transformers in autonomous driving, offering improved performance and efficiency.
Abstract: Diffusion-based models are redefining the state-of-the-art in end-to-end autonomous driving, yet their performance is increasingly hampered by a reliance on transformer-based fusion. These architectures face fundamental limitations: quadratic computational complexity restricts the use of high-resolution features, and a lack of spatial priors prevents them from effectively modeling the inherent structure of Bird’s Eye View (BEV) representations. This paper introduces GMF-Drive (Gated Mamba Fusion for Driving), an end-to-end framework that overcomes these challenges through two principled innovations. First, we supersede the information-limited histogram-based LiDAR representation with a geometrically-augmented pillar format encoding shape descriptors and statistical features, preserving critical 3D geometric details. Second, we propose a novel hierarchical gated mamba fusion (GM-Fusion) architecture that substitutes an expensive transformer with a highly efficient, spatially-aware state-space model (SSM). Our core BEV-SSM leverages directional sequencing and adaptive fusion mechanisms to capture long-range dependencies with linear complexity, while explicitly respecting the unique spatial properties of the driving scene. Extensive experiments on the challenging NAVSIM benchmark demonstrate that GMF-Drive achieves a new state-of-the-art performance, significantly outperforming DiffusionDrive. Comprehensive ablation studies validate the efficacy of each component, demonstrating that task-specific SSMs can surpass a general-purpose transformer in both performance and efficiency for autonomous driving.
[215] Adversarial Video Promotion Against Text-to-Video Retrieval
Qiwei Tian, Chenhao Lin, Zhengyu Zhao, Qian Li, Shuai Liu, Chao Shen
Main category: cs.CV
TL;DR: The paper introduces ViPro, the first adversarial attack for promoting videos in text-to-video retrieval (T2VR), and proposes Modal Refinement (MoRe) to enhance transferability. It demonstrates superior performance over baselines and highlights vulnerabilities in T2VR systems.
Details
Motivation: Existing T2VR attacks focus on suppressing video ranks, but promoting videos is unexplored and potentially more impactful for financial or misinformation gains.Method: Proposes ViPro for adversarial video promotion and MoRe for finer-grained cross-modal interaction to improve black-box transferability. Evaluated on multiple models, datasets, and scenarios.
Result: ViPro outperforms baselines by over 30%/10%/4% in white/grey/black-box settings. Comprehensive experiments validate effectiveness, imperceptibility, and defense resilience.
Conclusion: The work exposes a critical vulnerability in T2VR, provides bounds for attacks, and suggests counterplay insights. Code will be open-sourced.
Abstract: Thanks to the development of cross-modal models, text-to-video retrieval (T2VR) is advancing rapidly, but its robustness remains largely unexamined. Existing attacks against T2VR are designed to push videos away from queries, i.e., suppressing the ranks of videos, while the attacks that pull videos towards selected queries, i.e., promoting the ranks of videos, remain largely unexplored. These attacks can be more impactful as attackers may gain more views/clicks for financial benefits and widespread (mis)information. To this end, we pioneer the first attack against T2VR to promote videos adversarially, dubbed the Video Promotion attack (ViPro). We further propose Modal Refinement (MoRe) to capture the finer-grained, intricate interaction between visual and textual modalities to enhance black-box transferability. Comprehensive experiments cover 2 existing baselines, 3 leading T2VR models, 3 prevailing datasets with over 10k videos, evaluated under 3 scenarios. All experiments are conducted in a multi-target setting to reflect realistic scenarios where attackers seek to promote the video regarding multiple queries simultaneously. We also evaluated our attacks for defences and imperceptibility. Overall, ViPro surpasses other baselines by over 30%/10%/4% on average in white/grey/black-box settings. Our work highlights an overlooked vulnerability, provides a qualitative analysis of the upper/lower bounds of our attacks, and offers insights into potential counterplays. Code will be publicly available at https://github.com/michaeltian108/ViPro.
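A stripped-down sketch of a multi-target promotion attack in the spirit of ViPro (toy encoders and standard PGD; not the authors’ exact objective): the perturbation is optimized to pull one video toward several queries at once under an imperceptibility budget.

```python
import torch
import torch.nn.functional as F

# Toy towers standing in for a T2VR model's video and text encoders.
video_enc = torch.nn.Linear(1024, 256)
for p in video_enc.parameters():
    p.requires_grad_(False)                       # attack the input, not the model
text_emb = F.normalize(torch.randn(3, 256), dim=-1)  # 3 target queries

video = torch.rand(1, 1024)                       # flattened toy "video"
delta = torch.zeros_like(video, requires_grad=True)
eps, alpha = 8 / 255, 1 / 255                     # budget and step size

for _ in range(50):  # PGD: maximize similarity to all targets simultaneously
    v = F.normalize(video_enc(video + delta), dim=-1)
    loss = -(v @ text_emb.T).mean()
    loss.backward()
    with torch.no_grad():
        delta -= alpha * delta.grad.sign()
        delta.clamp_(-eps, eps)                   # keep perturbation small
        delta.grad.zero_()

final = F.normalize(video_enc(video + delta), dim=-1)
print("mean similarity to targets:", (final @ text_emb.T).mean().item())
```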
[216] HiMat: DiT-based Ultra-High Resolution SVBRDF Generation
Zixiong Wang, Jian Yang, Yiwei Hu, Milos Hasan, Beibei Wang
Main category: cs.CV
TL;DR: HiMat is a diffusion-based framework for generating 4K SVBRDFs efficiently, using a CrossStitch module to maintain consistency across maps without altering the DiT backbone.
Details
Motivation: The need for detailed SVBRDFs in 3D content creation and the challenge of retargeting text-to-image models for multi-map generation.Method: HiMat uses a CrossStitch module to capture inter-map dependencies with localized operations, preserving the DiT backbone’s capabilities.
Result: HiMat successfully generates 4K SVBRDFs with structural coherence and high-frequency details, validated by text prompts.
Conclusion: HiMat is effective for 4K SVBRDF generation and shows potential for tasks like intrinsic decomposition.
Abstract: Creating highly detailed SVBRDFs is essential for 3D content creation. The rise of high-resolution text-to-image generative models, based on diffusion transformers (DiT), suggests an opportunity to finetune them for this task. However, retargeting the models to produce multiple aligned SVBRDF maps instead of just RGB images, while achieving high efficiency and ensuring consistency across different maps, remains a challenge. In this paper, we introduce HiMat: a memory- and computation-efficient diffusion-based framework capable of generating native 4K-resolution SVBRDFs. A key challenge we address is maintaining consistency across different maps in a lightweight manner, without relying on training new VAEs or significantly altering the DiT backbone (which would damage its prior capabilities). To tackle this, we introduce the CrossStitch module, a lightweight convolutional module that captures inter-map dependencies through localized operations. Its weights are initialized such that the DiT backbone operation is unchanged before finetuning starts. HiMat enables generation with strong structural coherence and high-frequency details. Results with a large set of text prompts demonstrate the effectiveness of our approach for 4K SVBRDF generation. Further experiments suggest generalization to tasks such as intrinsic decomposition.
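Our reading of the zero-initialization idea, in a minimal sketch (module name and shapes are assumptions): the inter-map mixer starts as an identity residual, so the pretrained DiT behaves exactly as before when finetuning begins.

```python
import torch
import torch.nn as nn

class CrossStitchLite(nn.Module):
    """Sketch of a lightweight inter-map mixer: a conv over concatenated
    SVBRDF map features whose output starts at zero, so the pretrained
    backbone is untouched at initialization."""
    def __init__(self, n_maps=4, ch=16):
        super().__init__()
        self.mix = nn.Conv2d(n_maps * ch, n_maps * ch, kernel_size=3, padding=1)
        nn.init.zeros_(self.mix.weight)   # zero output => identity behavior
        nn.init.zeros_(self.mix.bias)

    def forward(self, maps):              # maps: (B, n_maps * ch, H, W)
        return maps + self.mix(maps)      # residual preserves the prior

x = torch.randn(2, 64, 32, 32)
block = CrossStitchLite()
assert torch.allclose(block(x), x)        # unchanged before training starts
print("identity at init confirmed")
```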
[217] Understanding Dynamic Scenes in Ego Centric 4D Point Clouds
Junsheng Huang, Shengyu Hao, Bocheng Hu, Gaoang Wang
Main category: cs.CV
TL;DR: EgoDynamic4D is a new QA benchmark for dynamic 4D scene understanding, featuring RGB-D video, camera poses, and 4D annotations, with 927K QA pairs and 12 tasks. A proposed spatio-temporal framework outperforms baselines.
Details
Motivation: Existing egocentric datasets lack unified 4D annotations and task-driven evaluation for fine-grained spatio-temporal reasoning.Method: Introduces EgoDynamic4D with RGB-D video, camera poses, and 4D bounding boxes. Proposes an end-to-end framework using instance-aware encoding and adaptive down-sampling.
Result: The method outperforms baselines, validating multimodal temporal modeling for dynamic scene understanding.
Conclusion: EgoDynamic4D and the proposed framework advance egocentric dynamic scene understanding, enabling verifiable spatio-temporal reasoning.
Abstract: Understanding dynamic 4D scenes from an egocentric perspective, that is, modeling changes in 3D spatial structure over time, is crucial for human-machine interaction, autonomous navigation, and embodied intelligence. While existing egocentric datasets contain dynamic scenes, they lack unified 4D annotations and task-driven evaluation protocols for fine-grained spatio-temporal reasoning, especially on the motion of objects and humans, together with their interactions. To address this gap, we introduce EgoDynamic4D, a novel QA benchmark on highly dynamic scenes, comprising RGB-D video, camera poses, globally unique instance masks, and 4D bounding boxes. We construct 927K QA pairs accompanied by explicit Chain-of-Thought (CoT), enabling verifiable, step-by-step spatio-temporal reasoning. We design 12 dynamic QA tasks covering agent motion, human-object interaction, trajectory prediction, relation understanding, and temporal-causal reasoning, with fine-grained, multidimensional metrics. To tackle these tasks, we propose an end-to-end spatio-temporal reasoning framework that unifies dynamic and static scene information, using instance-aware feature encoding, time and camera encoding, and spatially adaptive down-sampling to compress large 4D scenes into token sequences manageable by LLMs. Experiments on EgoDynamic4D show that our method consistently outperforms baselines, validating the effectiveness of multimodal temporal modeling for egocentric dynamic scene understanding.
[218] Omni-Effects: Unified and Spatially-Controllable Visual Effects Generation
Fangyuan Mao, Aiming Hao, Jintao Chen, Dongxia Liu, Xiaokun Feng, Jiashu Zhu, Meiqi Wu, Chubin Chen, Jiahong Wu, Xiangxiang Chu
Main category: cs.CV
TL;DR: Omni-Effects is a unified framework for generating spatially controllable composite visual effects (VFX) using LoRA-MoE and SAP innovations, overcoming limitations of current single-effect methods.
Details
Motivation: Current VFX generation methods are limited to single effects due to per-effect LoRA training, hindering applications requiring multiple spatially controlled effects.Method: Proposes Omni-Effects with LoRA-MoE for diverse effect integration and SAP for spatial control, plus an IIF module to isolate control signals.
Result: Achieves precise spatial control and diverse effect generation, validated by extensive experiments.
Conclusion: Omni-Effects enables users to specify effect categories and locations, advancing VFX production.
Abstract: Visual effects (VFX) are essential visual enhancements fundamental to modern cinematic production. Although video generation models offer cost-efficient solutions for VFX production, current methods are constrained by per-effect LoRA training, which limits generation to single effects. This fundamental limitation impedes applications that require spatially controllable composite effects, i.e., the concurrent generation of multiple effects at designated locations. However, integrating diverse effects into a unified framework faces major challenges: interference from effect variations and spatial uncontrollability during multi-VFX joint training. To tackle these challenges, we propose Omni-Effects, the first unified framework capable of generating prompt-guided effects and spatially controllable composite effects. The core of our framework comprises two key innovations: (1) LoRA-based Mixture of Experts (LoRA-MoE), which employs a group of expert LoRAs, integrating diverse effects within a unified model while effectively mitigating cross-task interference. (2) Spatial-Aware Prompt (SAP) incorporates spatial mask information into the text token, enabling precise spatial control. Furthermore, we introduce an Independent-Information Flow (IIF) module integrated within the SAP, isolating the control signals corresponding to individual effects to prevent any unwanted blending. To facilitate this research, we construct a comprehensive VFX dataset, Omni-VFX, via a novel data collection pipeline combining image editing and First-Last Frame-to-Video (FLF2V) synthesis, and introduce a dedicated VFX evaluation framework for validating model performance. Extensive experiments demonstrate that Omni-Effects achieves precise spatial control and diverse effect generation, enabling users to specify both the category and location of desired effects.
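An illustrative sketch of a LoRA-based mixture of experts (routing scheme and shapes are our assumptions, not the paper’s exact design): a frozen base projection plus several low-rank expert deltas blended by a per-token router.

```python
import torch
import torch.nn as nn

class LoRAMoE(nn.Module):
    """Frozen base layer + n_experts low-rank (LoRA) deltas, softly blended
    by a learned router; expert up-projections start at zero so training
    begins from the pretrained behavior."""
    def __init__(self, dim=512, rank=8, n_experts=4):
        super().__init__()
        self.base = nn.Linear(dim, dim)
        self.base.requires_grad_(False)              # pretrained weights frozen
        self.down = nn.Parameter(torch.randn(n_experts, dim, rank) * 0.01)
        self.up = nn.Parameter(torch.zeros(n_experts, rank, dim))
        self.router = nn.Linear(dim, n_experts)

    def forward(self, x):                            # x: (B, T, dim)
        gates = self.router(x).softmax(-1)           # (B, T, E) per-token routing
        delta = torch.einsum('btd,edr,erk->btek', x, self.down, self.up)
        return self.base(x) + (gates.unsqueeze(-1) * delta).sum(dim=2)

y = LoRAMoE()(torch.randn(2, 10, 512))
print(y.shape)  # torch.Size([2, 10, 512])
```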
[219] Decoupled Functional Evaluation of Autonomous Driving Models via Feature Map Quality Scoring
Ludan Zhang, Sihan Wang, Yuqi Dai, Shuofei Qiao, Qinyue Luo, Lei He
Main category: cs.CV
TL;DR: The paper proposes an independent evaluation method (FMCS) for feature maps in autonomous driving models, improving interpretability and performance.
Details
Motivation: End-to-end models lack explicit supervision for intermediate modules, limiting interpretability and evaluation.Method: Introduces FMCS and DG-DWSS for feature map evaluation, and CLIP-FMQE-Net for real-time quality analysis.
Result: Integration improves 3D object detection by 3.89% in NDS on NuScenes dataset.
Conclusion: The method effectively enhances feature representation and model performance.
Abstract: End-to-end models are emerging as the mainstream in autonomous driving perception and planning. However, the lack of explicit supervision signals for intermediate functional modules leads to opaque operational mechanisms and limited interpretability, making it challenging for traditional methods to independently evaluate and train these modules. As a pioneering effort on this issue, this study builds upon the feature map-truth representation similarity-based evaluation framework and proposes an independent evaluation method based on the Feature Map Convergence Score (FMCS). A Dual-Granularity Dynamic Weighted Scoring System (DG-DWSS) is constructed, formulating a unified quantitative metric, the Feature Map Quality Score, to enable comprehensive evaluation of the quality of feature maps generated by functional modules. A CLIP-based Feature Map Quality Evaluation Network (CLIP-FMQE-Net) is further developed, combining feature-truth encoders and quality score prediction heads to enable real-time quality analysis of feature maps generated by functional modules. Experimental results on the NuScenes dataset demonstrate that integrating our evaluation module into training improves 3D object detection performance, achieving a 3.89 percent gain in NDS. These results verify the effectiveness of our method in enhancing feature representation quality and overall model performance.
[220] Stand-In: A Lightweight and Plug-and-Play Identity Control for Video Generation
Bowen Xue, Qixin Yan, Wenjing Wang, Hao Liu, Chen Li
Main category: cs.CV
TL;DR: Stand-In is a lightweight, plug-and-play framework for identity-preserving video generation, requiring minimal training data and parameters while outperforming existing methods.
Details
Motivation: Existing methods for identity-preserving video generation are resource-intensive and lack compatibility with other AIGC tools, necessitating a more efficient solution.Method: Introduces a conditional image branch into a pre-trained video model, using restricted self-attentions and conditional position mapping, trained with only 2000 pairs.
Result: Achieves superior video quality and identity preservation with ~1% additional parameters, outperforming full-parameter methods.
Conclusion: Stand-In is versatile, integrating seamlessly into tasks like subject-driven video generation, pose-referenced generation, stylization, and face swapping.
Abstract: Generating high-fidelity human videos that match user-specified identities is important yet challenging in the field of generative AI. Existing methods often rely on an excessive number of training parameters and lack compatibility with other AIGC tools. In this paper, we propose Stand-In, a lightweight and plug-and-play framework for identity preservation in video generation. Specifically, we introduce a conditional image branch into the pre-trained video generation model. Identity control is achieved through restricted self-attentions with conditional position mapping, and can be learned quickly with only 2000 pairs. Despite incorporating and training just $\sim$1% additional parameters, our framework achieves excellent results in video quality and identity preservation, outperforming other full-parameter training methods. Moreover, our framework can be seamlessly integrated for other tasks, such as subject-driven video generation, pose-referenced video generation, stylization, and face swapping.
[221] Cut2Next: Generating Next Shot via In-Context Tuning
Jingwen He, Hongbo Liu, Jiajun Li, Ziqi Huang, Yu Qiao, Wanli Ouyang, Ziwei Liu
Main category: cs.CV
TL;DR: Cut2Next introduces NSG for generating high-quality, cinematically coherent shots using a Diffusion Transformer and hierarchical prompting, outperforming in visual consistency and narrative flow.
Details
Motivation: Current methods lack narrative sophistication and cinematic integrity, focusing only on visual consistency.Method: Uses a Diffusion Transformer (DiT) with Hierarchical Multi-Prompting, Relational Prompts, and architectural innovations like CACI and HAM.
Result: Cut2Next excels in visual consistency, text fidelity, and user preference for narrative and cinematic quality.
Conclusion: Cut2Next successfully bridges the gap in generating narratively expressive and cinematically coherent shots.
Abstract: Effective multi-shot generation demands purposeful, film-like transitions and strict cinematic continuity. Current methods, however, often prioritize basic visual consistency, neglecting crucial editing patterns (e.g., shot/reverse shot, cutaways) that drive narrative flow for compelling storytelling. This yields outputs that may be visually coherent but lack narrative sophistication and true cinematic integrity. To bridge this, we introduce Next Shot Generation (NSG): synthesizing a subsequent, high-quality shot that critically conforms to professional editing patterns while upholding rigorous cinematic continuity. Our framework, Cut2Next, leverages a Diffusion Transformer (DiT). It employs in-context tuning guided by a novel Hierarchical Multi-Prompting strategy. This strategy uses Relational Prompts to define overall context and inter-shot editing styles. Individual Prompts then specify per-shot content and cinematographic attributes. Together, these guide Cut2Next to generate cinematically appropriate next shots. Architectural innovations, Context-Aware Condition Injection (CACI) and Hierarchical Attention Mask (HAM), further integrate these diverse signals without introducing new parameters. We construct RawCuts (large-scale) and CuratedCuts (refined) datasets, both with hierarchical prompts, and introduce CutBench for evaluation. Experiments show Cut2Next excels in visual consistency and text fidelity. Crucially, user studies reveal a strong preference for Cut2Next, particularly for its adherence to intended editing patterns and overall cinematic continuity, validating its ability to generate high-quality, narratively expressive, and cinematically coherent subsequent shots.
[222] Mem4D: Decoupling Static and Dynamic Memory for Dynamic Scene Reconstruction
Xudong Cai, Shuo Wang, Peng Wang, Yongcai Wang, Zhaoxin Fan, Wanting Li, Tianbao Zhang, Jianrong Tao, Yeying Jin, Deying Li
Main category: cs.CV
TL;DR: Mem4D proposes a dual-memory framework to resolve the conflict between static and dynamic scene reconstruction in monocular videos, achieving high fidelity and efficiency.
Details
Motivation: Addressing the Memory Demand Dilemma in dynamic scene reconstruction, where existing methods compromise between static stability and dynamic detail retention.Method: Introduces Mem4D with Transient Dynamics Memory (TDM) for dynamic motion and Persistent Structure Memory (PSM) for static geometry, alternating queries for balanced reconstruction.
Result: Achieves state-of-the-art performance on benchmarks, maintaining global consistency for static elements and high fidelity for dynamic ones.
Conclusion: Mem4D effectively decouples static and dynamic modeling, offering a robust solution for monocular video reconstruction.
Abstract: Reconstructing dense geometry for dynamic scenes from a monocular video is a critical yet challenging task. Recent memory-based methods enable efficient online reconstruction, but they fundamentally suffer from a Memory Demand Dilemma: The memory representation faces an inherent conflict between the long-term stability required for static structures and the rapid, high-fidelity detail retention needed for dynamic motion. This conflict forces existing methods into a compromise, leading to either geometric drift in static structures or blurred, inaccurate reconstructions of dynamic objects. To address this dilemma, we propose Mem4D, a novel framework that decouples the modeling of static geometry and dynamic motion. Guided by this insight, we design a dual-memory architecture: 1) The Transient Dynamics Memory (TDM) focuses on capturing high-frequency motion details from recent frames, enabling accurate and fine-grained modeling of dynamic content; 2) The Persistent Structure Memory (PSM) compresses and preserves long-term spatial information, ensuring global consistency and drift-free reconstruction for static elements. By alternating queries to these specialized memories, Mem4D simultaneously maintains static geometry with global consistency and reconstructs dynamic elements with high fidelity. Experiments on challenging benchmarks demonstrate that our method achieves state-of-the-art or competitive performance while maintaining high efficiency. Codes will be publicly available.
[223] Follow-Your-Shape: Shape-Aware Image Editing via Trajectory-Guided Region Control
Zeqian Long, Mingzhe Zheng, Kunyu Feng, Xinhua Zhang, Hongyu Liu, Harry Yang, Linfeng Zhang, Qifeng Chen, Yue Ma
Main category: cs.CV
TL;DR: Follow-Your-Shape is a training-free, mask-free framework for precise object shape editing, preserving non-target content using Trajectory Divergence Maps and Scheduled KV Injection.
Details
Motivation: Existing flow-based image editing models struggle with large-scale shape transformations, often degrading background quality.Method: Uses Trajectory Divergence Maps (TDM) to locate editable regions and Scheduled KV Injection for stable editing.
Result: Superior editability and visual fidelity, especially in large-scale shape replacement tasks.
Conclusion: The method effectively addresses challenges in shape-aware editing while maintaining non-target content integrity.
Abstract: While recent flow-based image editing models demonstrate general-purpose capabilities across diverse tasks, they often struggle to specialize in challenging scenarios – particularly those involving large-scale shape transformations. When performing such structural edits, these methods either fail to achieve the intended shape change or inadvertently alter non-target regions, resulting in degraded background quality. We propose Follow-Your-Shape, a training-free and mask-free framework that supports precise and controllable editing of object shapes while strictly preserving non-target content. Motivated by the divergence between inversion and editing trajectories, we compute a Trajectory Divergence Map (TDM) by comparing token-wise velocity differences between the inversion and denoising paths. The TDM enables precise localization of editable regions and guides a Scheduled KV Injection mechanism that ensures stable and faithful editing. To facilitate a rigorous evaluation, we introduce ReShapeBench, a new benchmark comprising 120 new images and enriched prompt pairs specifically curated for shape-aware editing. Experiments demonstrate that our method achieves superior editability and visual fidelity, particularly in tasks requiring large-scale shape replacement.
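To make the TDM idea concrete, below is a minimal numpy sketch of a trajectory-divergence computation in the spirit of the abstract: token-wise velocity differences between an inversion path and a denoising path, averaged over steps. The array shapes, step averaging, and min-max normalization are illustrative assumptions, not the paper's exact formulation.

```python
# Hedged sketch of a Trajectory Divergence Map (TDM): token-wise velocity
# differences between inversion and denoising trajectories. Shapes and
# normalization are assumptions for illustration.
import numpy as np

def trajectory_divergence_map(v_inv: np.ndarray, v_edit: np.ndarray) -> np.ndarray:
    """v_inv, v_edit: (steps, tokens, dim) velocity fields.
    Returns a (tokens,) divergence map in [0, 1]."""
    diff = np.linalg.norm(v_inv - v_edit, axis=-1)       # (steps, tokens)
    tdm = diff.mean(axis=0)                              # average over steps
    return (tdm - tdm.min()) / (tdm.max() - tdm.min() + 1e-8)

rng = np.random.default_rng(0)
v_a = rng.normal(size=(10, 64, 128))
v_b = v_a.copy()
v_b[:, 20:30] += 0.5          # tokens 20-29 diverge, as if an object changed
tdm = trajectory_divergence_map(v_a, v_b)
editable = tdm > 0.5          # candidate editable region for KV injection
print(editable.nonzero()[0])  # expected: indices near 20..29
```

Thresholding such a map would then gate where edited keys/values are injected, keeping the rest of the image on its original trajectory.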
[224] 3D Human Mesh Estimation from Single View RGBD
Ozhan Suat, Bedirhan Uguz, Batuhan Karagoz, Muhammed Can Keles, Emre Akbas
Main category: cs.CV
TL;DR: A method named M$^3$ (Masked Mesh Modeling) is proposed for accurate 3D human mesh estimation from a single RGBD view, leveraging MoCap datasets to overcome data scarcity. It outperforms existing methods on benchmark datasets.
Details
Motivation: RGBD cameras are underutilized despite their affordability and potential for 3D human mesh estimation. Existing datasets are limited, so the paper aims to address data scarcity by leveraging MoCap datasets.Method: The method simulates partial single-view meshes from MoCap data, trains a masked autoencoder to complete them, and matches sensor depth values to a template mesh during inference.
Result: M$^3$ achieves 16.8 mm and 22.0 mm PVE on SURREAL and CAPE datasets, outperforming existing methods. It also shows competitive results on the BEHAVE dataset.
Conclusion: The proposed method effectively utilizes depth data and MoCap datasets to improve 3D human mesh estimation, demonstrating superior performance over existing approaches.
Abstract: Despite significant progress in 3D human mesh estimation from RGB images, RGBD cameras, which offer additional depth data, remain underutilized. In this paper, we present a method for accurate 3D human mesh estimation from a single RGBD view, leveraging the affordability and widespread adoption of RGBD cameras for real-world applications. A fully supervised approach for this problem requires a dataset with RGBD image and 3D mesh label pairs. However, collecting such a dataset is costly and challenging; hence, existing datasets are small and limited in pose and shape diversity. To overcome this data scarcity, we leverage existing Motion Capture (MoCap) datasets. We first obtain complete 3D meshes from the body models found in MoCap datasets, and create partial, single-view versions of them by projection to a virtual camera. This simulates the depth data provided by an RGBD camera from a single viewpoint. Then, we train a masked autoencoder to complete the partial, single-view mesh. During inference, our method, which we name M$^3$ for "Masked Mesh Modeling", matches the depth values coming from the sensor to vertices of a template human mesh, which creates a partial, single-view mesh. We effectively recover parts of the 3D human body mesh model that are not visible, resulting in a full-body mesh. M$^3$ achieves 16.8 mm and 22.0 mm per-vertex error (PVE) on the SURREAL and CAPE datasets, respectively, outperforming existing methods that use full-body point clouds as input. We obtain a competitive 70.9 mm PVE on the BEHAVE dataset, outperforming a recently published RGB-based method by 18.4 mm, highlighting the usefulness of depth data. Code will be released.
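The inference-time matching step lends itself to a short sketch: snap sensor depth points onto the nearest vertices of a template human mesh to form the partial, single-view mesh that the masked autoencoder then completes. The template and point cloud below are synthetic stand-ins (SMPL-sized vertex count assumed), not the paper's actual data.

```python
# Hedged sketch of depth-to-template matching: nearest-neighbor assignment
# of sensor points to template vertices. Synthetic data stands in for a
# real template mesh and depth frame.
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(1)
template_vertices = rng.uniform(-1, 1, size=(6890, 3))   # e.g. SMPL-sized
depth_points = template_vertices[:2000] + rng.normal(scale=0.01, size=(2000, 3))

tree = cKDTree(template_vertices)
dist, idx = tree.query(depth_points, k=1)                # nearest vertex ids

visible = np.zeros(len(template_vertices), dtype=bool)
visible[idx] = True                                      # observed vertices
partial_mesh = np.where(visible[:, None], template_vertices, np.nan)
# A masked autoencoder (as in M^3) would then inpaint the NaN/masked
# vertices to recover the full-body mesh.
print(f"visible vertices: {visible.sum()} / {len(visible)}")
```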
cs.AI
[225] Topos Theory for Generative AI and LLMs
Sridhar Mahadevan
Main category: cs.AI
TL;DR: The paper introduces novel generative AI architectures (GAIAs) using topos theory, leveraging category theory to design compositional structures for LLMs, and validates their theoretical properties.
Details
Motivation: To explore new LLM architectures by applying topos theory and universal constructions in category theory, moving beyond traditional linear or mixture-of-experts designs.Method: Uses topos theory and category theory to construct LLM architectures with pullback, pushout, (co)equalizers, exponential objects, and subobject classifiers. Validates the category of LLMs as (co)complete and a topos.
Result: Theoretical validation shows the category of LLMs is (co)complete and forms a topos, enabling novel compositional structures.
Conclusion: The work provides a foundation for designing advanced LLM architectures using category theory, with potential for practical implementation via functorial backpropagation.
Abstract: We propose the design of novel categorical generative AI architectures (GAIAs) using topos theory, a type of category that is "set-like": a topos has all (co)limits, is Cartesian closed, and has a subobject classifier. Previous theoretical results on the Transformer model have shown that it is a universal sequence-to-sequence function approximator, and dense in the space of all continuous functions with compact support on the Euclidean space of embeddings of tokens. Building on this theoretical result, we explore novel architectures for LLMs that exploit the property that the category of LLMs, viewed as functions, forms a topos. Previous studies of large language models (LLMs) have focused on daisy-chained linear architectures or mixture-of-experts. In this paper, we use universal constructions in category theory to construct novel LLM architectures based on new types of compositional structures. In particular, these new compositional structures are derived from universal properties of LLM categories, and include pullback, pushout, (co)equalizers, exponential objects, and subobject classifiers. We theoretically validate these new compositional structures by showing that the category of LLMs is (co)complete, meaning that all diagrams have solutions in the form of (co)limits. Building on this completeness result, we then show that the category of LLMs forms a topos, a "set-like" category, which requires showing the existence of exponential objects as well as subobject classifiers. We use a functorial characterization of backpropagation to define a potential implementation of an LLM topos architecture.
[226] Topos Causal Models
Sridhar Mahadevan
Main category: cs.AI
TL;DR: Topos causal models (TCMs) leverage topos category properties for causal inference, enabling solutions for complex causal diagrams, interventions via subobject classifiers, and reasoning about causal equivalences.
Details
Motivation: To address limitations in existing causal models by utilizing the mathematical properties of topos categories for more flexible and powerful causal inference.Method: Introduce TCMs, proving their (co)completeness, and demonstrate applications like causal intervention (via subobject classifiers), solving causal diagrams (via limits/colimits), and reasoning about equivalences (via exponential objects).
Result: TCMs provide a framework to solve arbitrary causal diagrams, model interventions categorically, and reason about causal equivalences, with an internal logic for formal reasoning.
Conclusion: TCMs offer a robust, category-theoretic approach to causal inference, unifying diverse applications and enabling new theoretical and practical advancements.
Abstract: We propose topos causal models (TCMs), a novel class of causal models that exploit the key properties of a topos category: they are (co)complete, meaning all (co)limits exist, they admit a subobject classifier, and allow exponential objects. The main goal of this paper is to show that these properties are central to many applications in causal inference. For example, subobject classifiers allow a categorical formulation of causal intervention, which creates sub-models. Limits and colimits allow causal diagrams of arbitrary complexity to be "solved", using a novel interpretation of causal approximation. Exponential objects enable reasoning about equivalence classes of operations on causal models, such as covered edge reversal and causal homotopy. Analogous to structural causal models (SCMs), TCMs are defined by a collection of functions, each defining a "local autonomous" causal mechanism, that assemble to induce a unique global function from exogenous to endogenous variables. Since the category of TCMs is (co)complete, which we prove in this paper, every causal diagram has a "solution" in the form of a (co)limit: this implies that any arbitrary causal model can be "approximated" by some global function with respect to the morphisms going into or out of the diagram. Natural transformations are crucial in measuring the quality of approximation. In addition, we show that causal interventions are modeled by subobject classifiers: any sub-model is defined by a monic arrow into its parent model. Exponential objects permit reasoning about entire classes of causal equivalences and interventions. Finally, as TCMs form a topos, they admit an internal logic defined as a Mitchell-Benabou language with an associated Kripke-Joyal semantics. We show how to reason about causal models in TCMs using this internal logic.
[227] An Efficient Application of Goal Programming to Tackle Multiobjective Problems with Recurring Fitness Landscapes
Rodrigo Lankaites Pinheiro, Dario Landa-Silva, Wasakorn Laesanklang, Ademir Aparecido Constantino
Main category: cs.AI
TL;DR: The paper proposes a method to solve highly constrained many-objective problems by leveraging similarities in fitness landscapes across problem instances, combining multiobjective algorithms with goal programming for efficiency.
Details
Motivation: Addressing the challenge of obtaining good approximation sets for highly constrained many-objective problems, especially when problem instances share similar fitness landscapes.Method: Uses computationally expensive multiobjective algorithms to solve one problem instance, then applies Goal Programming with efficient single-objective algorithms for other instances.
Result: Demonstrates effectiveness on benchmark instances of the multiobjective vehicle routing problem with time windows, achieving good results quickly.
Conclusion: The methodology effectively combines multiobjective algorithms and goal programming to find compromise solutions efficiently in scenarios with similar fitness landscapes.
Abstract: Many real-world applications require decision-makers to assess the quality of solutions while considering multiple conflicting objectives. Obtaining good approximation sets for highly constrained many-objective problems is often a difficult task even for modern multiobjective algorithms. In some cases, multiple instances of the problem scenario present similarities in their fitness landscapes. That is, there are recurring features in the fitness landscapes when searching for solutions to different problem instances. We propose a methodology to exploit this characteristic by solving one instance of a given problem scenario using computationally expensive multiobjective algorithms to obtain a good approximation set and then using Goal Programming with efficient single-objective algorithms to solve other instances of the same problem scenario. We use three goal-based objective functions and show that on benchmark instances of the multiobjective vehicle routing problem with time windows, the methodology is able to produce good results in short computation time. The methodology combines the effectiveness of state-of-the-art multiobjective algorithms with the efficiency of goal programming to find good compromise solutions in problem scenarios where instances have similar fitness landscapes.
[228] LLM-BI: Towards Fully Automated Bayesian Inference with Large Language Models
Yongchao Huang
Main category: cs.AI
TL;DR: The paper explores using LLMs to automate Bayesian inference tasks, demonstrating their ability to specify priors and model structures from natural language descriptions.
Details
Motivation: To overcome the barrier of requiring specialized expertise for specifying Bayesian priors and likelihoods.Method: Introduces LLM-BI, a pipeline for automating Bayesian workflows, and tests it with two experiments on Bayesian linear regression.
Result: LLMs successfully elicited priors and specified full model structures from high-level descriptions.
Conclusion: LLMs show promise for automating Bayesian modeling, potentially enabling automated probabilistic programming pipelines.
Abstract: A significant barrier to the widespread adoption of Bayesian inference is the specification of prior distributions and likelihoods, which often requires specialized statistical expertise. This paper investigates the feasibility of using a Large Language Model (LLM) to automate this process. We introduce LLM-BI (Large Language Model-driven Bayesian Inference), a conceptual pipeline for automating Bayesian workflows. As a proof-of-concept, we present two experiments focused on Bayesian linear regression. In Experiment I, we demonstrate that an LLM can successfully elicit prior distributions from natural language. In Experiment II, we show that an LLM can specify the entire model structure, including both priors and the likelihood, from a single high-level problem description. Our results validate the potential of LLMs to automate key steps in Bayesian modeling, enabling the possibility of an automated inference pipeline for probabilistic programming.
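A minimal sketch of the LLM-BI pipeline shape, under strong assumptions: an LLM is (hypothetically) asked to return a prior specification as JSON, and that specification then drives a conjugate Bayesian linear regression. The `llm_response` string is a stand-in for a real API call; the conjugate update itself is standard.

```python
# Hedged sketch: LLM-elicited prior (stubbed as a JSON string) feeding a
# conjugate Bayesian linear regression with known noise variance.
import json
import numpy as np

llm_response = '{"prior_mean": [0.0, 0.0], "prior_var": 10.0, "noise_var": 1.0}'
spec = json.loads(llm_response)              # stand-in for an LLM call

rng = np.random.default_rng(2)
X = np.column_stack([np.ones(50), rng.uniform(-3, 3, 50)])  # intercept + slope
y = X @ np.array([1.0, 2.5]) + rng.normal(scale=1.0, size=50)

m0 = np.array(spec["prior_mean"])
S0 = spec["prior_var"] * np.eye(2)           # prior covariance
noise = spec["noise_var"]

# Standard Gaussian conjugate update for linear regression.
S_post = np.linalg.inv(np.linalg.inv(S0) + X.T @ X / noise)
m_post = S_post @ (np.linalg.inv(S0) @ m0 + X.T @ y / noise)
print("posterior mean:", m_post)             # should approach [1.0, 2.5]
```

In the paper's Experiment II the LLM would also choose the likelihood; here only the prior is externalized, which keeps the sketch fully analytic.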
[229] First Ask Then Answer: A Framework Design for AI Dialogue Based on Supplementary Questioning with Large Language Models
Chuanruo Fu, Yuncheng Du
Main category: cs.AI
TL;DR: FATA improves LLM response quality by proactively asking users for supplementary info before answering, outperforming existing methods.
Details
Motivation: Addressing LLMs' struggle with incomplete user queries by enhancing completeness and user participation.Method: Proposes FATA: LLMs generate clarifying questions upfront, integrate user answers, and use sophisticated prompting for better responses.
Result: FATA outperforms the baseline prompt by approximately 40% and shows 8% better stability than the expert prompt.
Conclusion: FATA effectively scaffolds user expression, improving query completeness and response quality.
Abstract: Large Language Models (LLMs) often struggle to deliver accurate and actionable answers when user-provided information is incomplete or ill-specified. We propose a new interaction paradigm, First Ask Then Answer (FATA), in which, through prompt words, LLMs are guided to proactively generate multidimensional supplementary questions for users prior to response generation. Subsequently, by integrating user-provided supplementary information with the original query through sophisticated prompting techniques, we achieve substantially improved response quality and relevance. In contrast to existing clarification approaches – such as the CLAM framework oriented to ambiguity and the self-interrogation Self-Ask method – FATA emphasizes completeness (beyond mere disambiguation) and user participation (inviting human input instead of relying solely on model-internal reasoning). It also adopts a single-turn strategy: all clarifying questions are produced at once, thereby reducing dialogue length and improving efficiency. Conceptually, FATA uses the reasoning power of LLMs to scaffold user expression, enabling non-expert users to formulate more comprehensive and contextually relevant queries. To evaluate FATA, we constructed a multi-domain benchmark and compared it with two controls: a baseline prompt (B-Prompt) and a context-enhanced expert prompt (C-Prompt). Experimental results show that FATA outperforms B-Prompt by approximately 40% in aggregate metrics and exhibits a coefficient of variation 8% lower than C-Prompt, indicating superior stability.
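The single-turn clarify-then-answer flow is easy to see in code. Below is a minimal control-flow sketch: one round of model-generated supplementary questions, then a final answer that folds the user's replies back into the prompt. The `call_llm` function is a hypothetical stub standing in for any chat-completion API.

```python
# Hedged sketch of the FATA interaction paradigm; call_llm is a placeholder.
def call_llm(prompt: str) -> str:
    return f"<model output for: {prompt[:40]}...>"   # stub, not a real API

def fata_answer(user_query: str, ask_user) -> str:
    # Stage 1: produce ALL clarifying questions in a single turn.
    ask_prompt = (
        "Before answering, list the supplementary questions you would need "
        f"answered to respond completely to:\n{user_query}"
    )
    questions = call_llm(ask_prompt)
    answers = ask_user(questions)                    # human fills the gaps
    # Stage 2: answer using the original query plus the supplements.
    final_prompt = (
        f"Question: {user_query}\n"
        f"Clarifications requested: {questions}\n"
        f"User-provided details: {answers}\n"
        "Now give a complete, actionable answer."
    )
    return call_llm(final_prompt)

print(fata_answer("My app is slow, what should I do?", lambda q: "Python, 10k users"))
```

The single-turn choice (all questions at once) is what keeps dialogue length bounded, in contrast to iterative clarification loops.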
[230] What Breaks Knowledge Graph based RAG? Empirical Insights into Reasoning under Incomplete Knowledge
Dongzhuoran Zhou, Yuqicheng Zhu, Xiaxia Wang, Hongkuan Zhou, Yuan He, Jiaoyan Chen, Evgeny Kharlamov, Steffen Staab
Main category: cs.AI
TL;DR: The paper critiques current KG-RAG evaluation practices, introduces a new benchmark method, and reveals limitations in reasoning under missing knowledge.
Details
Motivation: Current KG-RAG benchmarks and evaluation metrics are flawed, masking true model performance and reasoning capabilities.Method: Proposes a general method for benchmark construction and evaluation protocol to assess KG-RAG under knowledge incompleteness.
Result: Current KG-RAG methods show limited reasoning with missing knowledge, rely on memorization, and vary in generalization.
Conclusion: A systematic evaluation approach is needed to better assess KG-RAG methods, highlighting their current shortcomings.
Abstract: Knowledge Graph-based Retrieval-Augmented Generation (KG-RAG) is an increasingly explored approach for combining the reasoning capabilities of large language models with the structured evidence of knowledge graphs. However, current evaluation practices fall short: existing benchmarks often include questions that can be directly answered using existing triples in the KG, making it unclear whether models perform reasoning or simply retrieve answers directly. Moreover, inconsistent evaluation metrics and lenient answer matching criteria further obscure meaningful comparisons. In this work, we introduce a general method for constructing benchmarks, together with an evaluation protocol, to systematically assess KG-RAG methods under knowledge incompleteness. Our empirical results show that current KG-RAG methods have limited reasoning ability under missing knowledge, often rely on internal memorization, and exhibit varying degrees of generalization depending on their design.
[231] UrzaGPT: LoRA-Tuned Large Language Models for Card Selection in Collectible Card Games
Timo Bertram
Main category: cs.AI
TL;DR: UrzaGPT, a domain-adapted LLM, improves drafting decisions in Magic: The Gathering, achieving 66.2% accuracy with fine-tuning, showing potential for LLMs in CCG AI.
Details
Motivation: Current AI models underperform in CCGs like Magic: The Gathering due to their complexity. UrzaGPT aims to bridge this gap using LLMs.Method: Fine-tunes an open-weight LLM with Low-Rank Adaptation (LoRA) on annotated draft logs to adapt to game expansions and improve drafting decisions.
Result: UrzaGPT achieves 66.2% accuracy, outperforming zero-shot LLMs (43%) and showing promise for smaller models.
Conclusion: LLMs like UrzaGPT can enable performant, general, and update-friendly drafting AIs, though they don’t yet match domain-specific models.
Abstract: Collectible card games (CCGs) are a difficult genre for AI due to their partial observability, long-term decision-making, and evolving card sets. Due to this, current AI models perform vastly worse than human players at CCG tasks such as deckbuilding and gameplay. In this work, we introduce UrzaGPT, a domain-adapted large language model that recommends real-time drafting decisions in Magic: The Gathering. Starting from an open-weight LLM, we use Low-Rank Adaptation fine-tuning on a dataset of annotated draft logs. With this, we leverage the language modeling capabilities of LLMs and can quickly adapt to different expansions of the game. We benchmark UrzaGPT in comparison to zero-shot LLMs and the state-of-the-art domain-specific model. Untuned, small LLMs like Llama-3-8B are completely unable to draft, but the larger GPT-4o achieves a zero-shot performance of 43%. Using UrzaGPT to fine-tune smaller models, we achieve an accuracy of 66.2% using only 10,000 steps. Despite this not reaching the capability of domain-specific models, we show that solely using LLMs to draft is possible and conclude that using LLMs can enable performant, general, and update-friendly drafting AIs in the future.
[232] Bilevel MCTS for Amortized O(1) Node Selection in Classical Planning
Masataro Asai
Main category: cs.AI
TL;DR: The paper proposes a bilevel MCTS modification and Tree Collapsing to improve the efficiency of node selection in classical planning, achieving amortized O(1) runtime.
Details
Motivation: MCTS in classical planning suffers from high runtime for node selection due to large search depths, unlike game tree search where this cost is negligible.Method: Introduces a bilevel MCTS modification with a best-first search from selected leaf nodes and Tree Collapsing to reduce action selection steps.
Result: Achieves amortized O(1) runtime for node selection, matching traditional queue-based OPEN lists, and enhances performance.
Conclusion: The proposed methods significantly improve MCTS efficiency in classical planning by addressing the node selection bottleneck.
Abstract: We study an efficient implementation of Multi-Armed Bandit (MAB)-based Monte-Carlo Tree Search (MCTS) for classical planning. One weakness of MCTS is that it spends a significant time deciding which node to expand next. While selecting a node from an OPEN list with $N$ nodes has $O(1)$ runtime complexity with traditional array-based priority-queues for dense integer keys, the tree-based OPEN list used by MCTS requires $O(\log N)$, which roughly corresponds to the search depth $d$. In classical planning, $d$ is arbitrarily large (e.g., $2^k-1$ in $k$-disk Tower-of-Hanoi) and the runtime for node selection is significant, unlike in game tree search, where the cost is negligible compared to the node evaluation (rollouts) because $d$ is inherently limited by the game (e.g., $d\leq 361$ in Go). To improve this bottleneck, we propose a bilevel modification to MCTS that runs a best-first search from each selected leaf node with an expansion budget proportional to $d$, which achieves amortized $O(1)$ runtime for node selection, equivalent to the traditional queue-based OPEN list. In addition, we introduce Tree Collapsing, an enhancement that reduces action selection steps and further improves the performance.
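The data-structure point in the abstract (array-based priority queues give O(1) selection for dense integer keys, tree-based OPEN lists give O(log N)) can be illustrated with a bucket OPEN list. This sketch shows the cost model the paper targets, not the bilevel MCTS algorithm itself; the amortized O(1) claim assumes the usual dense-integer-key setting.

```python
# Hedged sketch: array-of-buckets OPEN list with amortized O(1) pop_min for
# dense integer keys, the baseline the paper's bilevel MCTS aims to match.
class BucketOpenList:
    def __init__(self, max_key: int):
        self.buckets = [[] for _ in range(max_key + 1)]
        self.cursor = 0                     # lowest possibly non-empty key

    def push(self, key: int, node) -> None:
        self.buckets[key].append(node)
        self.cursor = min(self.cursor, key)

    def pop_min(self):
        # The cursor scan is amortized O(1) over a whole search in the
        # typical case where keys are dense, bounded integers.
        while not self.buckets[self.cursor]:
            self.cursor += 1
        return self.buckets[self.cursor].pop()

open_list = BucketOpenList(max_key=100)
for f, name in [(7, "a"), (3, "b"), (7, "c"), (3, "d")]:
    open_list.push(f, name)
print([open_list.pop_min() for _ in range(4)])   # key-3 nodes first, then key-7
```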
[233] Solver-Aided Expansion of Loops to Avoid Generate-and-Test
Niklas Dewally, Özgür Akgün
Main category: cs.AI
TL;DR: A method to optimize constraint model compilation by avoiding full enumeration of loop combinations, using a solver to compute only necessary constraints, improving efficiency.
Details
Motivation: Standard approaches to unrolling loops in constraint modelling languages are inefficient for problems where most combinations are irrelevant.Method: Uses a solver to compute only the required combinations of induction variables, avoiding full enumeration.
Result: Produces identical models to conventional flattening but with significantly faster compilation.
Conclusion: The method improves efficiency in translating high-level models into solver-ready form, especially for large domains with selective preconditions.
Abstract: Constraint modelling languages like MiniZinc and Essence rely on unrolling loops (in the form of quantified expressions and comprehensions) during compilation. Standard approaches generate all combinations of induction variables and use partial evaluation to discard those that simplify to identity elements of associative-commutative operators (e.g. true for conjunction, 0 for summation). This can be inefficient for problems where most combinations are ultimately irrelevant. We present a method that avoids full enumeration by using a solver to compute only the combinations required to generate the final set of constraints. The resulting model is identical to that produced by conventional flattening, but compilation can be significantly faster. This improves the efficiency of translating high-level user models into solver-ready form, particularly when induction variables range over large domains with selective preconditions.
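A toy illustration of the compilation issue, written in Python rather than Essence or MiniZinc: generate-and-test unrolling enumerates every tuple of induction variables and discards most of them, while a solver-style generator produces only the tuples satisfying the precondition. The emitted constraints are identical either way; the divisor iteration below is a hand-rolled stand-in for letting a solver enumerate solutions.

```python
# Hedged sketch: generate-and-test vs. solver-aided enumeration of the
# induction-variable tuples that survive a selective precondition.
from itertools import product

D = range(100)
precondition = lambda i, j: i * j == 60 and i < j    # selective guard

# Generate-and-test: 100*100 = 10,000 candidate tuples, almost all discarded.
gt = [(i, j) for i, j in product(D, D) if precondition(i, j)]

# "Solver-aided": enumerate only tuples consistent with i*j == 60 and i < j.
# i < j and i*j == 60 imply i < sqrt(60) < 8, so i ranges over 1..7 here.
sa = [(i, 60 // i) for i in range(1, 8) if 60 % i == 0]

assert gt == sa                                      # identical constraints
print(f"generate-and-test scanned {100*100} tuples; solver-aided produced {len(sa)}")
```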
[234] OverFill: Two-Stage Models for Efficient Language Model Decoding
Woojeong Kim, Junxiong Wang, Jing Nathan Yan, Mohamed Abdelfattah, Alexander M. Rush
Main category: cs.AI
TL;DR: OverFill decouples prefill and decode stages in LLMs to optimize efficiency, improving generation quality with minimal latency overhead.
Details
Motivation: High inference costs in LLMs due to uniform handling of prefill (compute-bound) and decode (memory-bound) stages.Method: Proposes OverFill: uses full model for prefill, switches to dense pruned model for decode, leveraging parallel processing.
Result: Outperforms pruned models by 79.2-83.2%, matches performance of same-sized models with less training data.
Conclusion: OverFill optimizes accuracy-efficiency tradeoffs in LLM inference, offering practical deployment benefits.
Abstract: Large language models (LLMs) excel across diverse tasks but face significant deployment challenges due to high inference costs. LLM inference comprises prefill (compute-bound) and decode (memory-bound) stages, with decode dominating latency particularly for long sequences. Current decoder-only models handle both stages uniformly, despite their distinct computational profiles. We propose OverFill, which decouples these stages to optimize accuracy-efficiency tradeoffs. OverFill begins with a full model for prefill, processing system and user inputs in parallel. It then switches to a dense pruned model, while generating tokens sequentially. Leveraging more compute during prefill, OverFill improves generation quality with minimal latency overhead. Our 3B-to-1B OverFill configuration outperforms 1B pruned models by 83.2%, while the 8B-to-3B configuration improves over 3B pruned models by 79.2% on average across standard benchmarks. OverFill matches the performance of same-sized models trained from scratch, while using significantly less training data. Our code is available at https://github.com/friendshipkim/overfill.
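The two-stage control flow is the heart of OverFill and fits in a few lines. The sketch below uses stub classes only; a real implementation needs the pruned model's layers to consume the full model's KV-cache layout, which is exactly the coupling the paper engineers.

```python
# Hedged control-flow sketch of OverFill: full model for the parallel,
# compute-bound prefill; pruned model for the sequential, memory-bound
# decode. Both "models" here are stubs.
class FullModel:
    def prefill(self, prompt_tokens):
        # One parallel pass over the whole prompt; returns a KV cache.
        return {"kv": f"cache({len(prompt_tokens)} tokens)"}

class PrunedModel:
    def decode_step(self, kv_cache, last_token):
        # Cheap per-token step reusing the rich prefill cache.
        return last_token + 1          # stand-in for sampling a token

def overfill_generate(prompt_tokens, n_new):
    full, pruned = FullModel(), PrunedModel()
    kv = full.prefill(prompt_tokens)           # heavy, but parallel
    out, tok = [], prompt_tokens[-1]
    for _ in range(n_new):                     # light, sequential
        tok = pruned.decode_step(kv, tok)
        out.append(tok)
    return out

print(overfill_generate([5, 9, 12], n_new=4))  # -> [13, 14, 15, 16]
```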
[235] A Fast GRASP Metaheuristic for the Trigger Arc TSP with MIP-Based Construction and Multi-Neighborhood Local Search
Joan Salvà Soler, Grégoire de Lambertye
Main category: cs.AI
TL;DR: A GRASP-based metaheuristic for the Trigger Arc Traveling Salesman Problem (TA-TSP) achieves near-optimal solutions efficiently, outperforming Gurobi in some cases.
Details
Motivation: The TA-TSP models dynamic arc costs in scenarios like warehouse operations, requiring efficient solutions for real-time routing.Method: Combines construction heuristics (using MIP) with multi-neighborhood local search (2-Opt, Swap, Relocate).
Result: Achieved 0.77% and 0.40% optimality gaps in MESS 2024, and 11.3% better solutions than Gurobi on synthetic data.
Conclusion: The method is effective for real-time routing with state-dependent costs, as shown by its MESS 2024 performance.
Abstract: The Trigger Arc Traveling Salesman Problem (TA-TSP) extends the classical TSP by introducing dynamic arc costs that change when specific "trigger" arcs are traversed, modeling scenarios such as warehouse operations with compactable storage systems. This paper introduces a GRASP-based metaheuristic that combines multiple construction heuristics with a multi-neighborhood local search. The construction phase uses mixed-integer programming (MIP) techniques to transform the TA-TSP into a sequence of tailored TSP instances, while the improvement phase applies 2-Opt, Swap, and Relocate operators. Computational experiments on MESS 2024 competition instances achieved average optimality gaps of 0.77% and 0.40% relative to the best-known solutions within a 60-second limit. On smaller, synthetically generated datasets, the method produced solutions 11.3% better than the Gurobi solver under the same time constraints. The algorithm finished in the top three at MESS 2024, demonstrating its suitability for real-time routing applications with state-dependent travel costs.
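For readers unfamiliar with GRASP, the skeleton is short: greedy randomized construction with a restricted candidate list (RCL), then local search, repeated while keeping the best solution. The sketch below runs on a plain symmetric TSP, omitting the trigger-arc cost dynamics and the MIP-based construction, and uses only the 2-Opt neighborhood of the three operators named above.

```python
# Hedged GRASP sketch on a plain TSP: RCL construction + 2-Opt local search.
import random

def tour_len(tour, d):
    return sum(d[tour[i]][tour[(i + 1) % len(tour)]] for i in range(len(tour)))

def construct(d, alpha, rng):
    tour, unvisited = [0], set(range(1, len(d)))
    while unvisited:
        cands = sorted(unvisited, key=lambda j: d[tour[-1]][j])
        rcl = cands[:max(1, int(alpha * len(cands)))]   # restricted list
        tour.append(rng.choice(rcl))
        unvisited.remove(tour[-1])
    return tour

def two_opt(tour, d):
    improved = True
    while improved:
        improved = False
        for i in range(1, len(tour) - 1):
            for j in range(i + 1, len(tour)):
                new = tour[:i] + tour[i:j + 1][::-1] + tour[j + 1:]
                if tour_len(new, d) < tour_len(tour, d):
                    tour, improved = new, True
    return tour

rng = random.Random(3)
pts = [(rng.random(), rng.random()) for _ in range(12)]
d = [[((ax - bx) ** 2 + (ay - by) ** 2) ** 0.5 for bx, by in pts] for ax, ay in pts]
best = None
for _ in range(20):                                     # GRASP iterations
    t = two_opt(construct(d, alpha=0.3, rng=rng), d)
    if best is None or tour_len(t, d) < tour_len(best, d):
        best = t
print(round(tour_len(best, d), 3))
```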
[236] SEAgent: Self-Evolving Computer Use Agent with Autonomous Learning from Experience
Zeyi Sun, Ziyu Liu, Yuhang Zang, Yuhang Cao, Xiaoyi Dong, Tong Wu, Dahua Lin, Jiaqi Wang
Main category: cs.AI
TL;DR: SEAgent is a self-evolving framework for computer-use agents (CUAs) that autonomously learns and adapts to novel software environments through experiential learning, outperforming existing models.
Details
Motivation: Existing large vision-language models (LVLMs) struggle with novel and specialized software due to reliance on human-labeled data, prompting the need for autonomous learning solutions.Method: SEAgent employs experiential learning, a World State Model for trajectory assessment, and a Curriculum Generator for task progression. It updates policies via adversarial imitation and Group Relative Policy Optimization (GRPO).
Result: SEAgent improves success rates by 23.2% (from 11.3% to 34.5%) over UI-TARS across five novel software environments.
Conclusion: SEAgent demonstrates the potential of autonomous evolution for CUAs, achieving superior performance without human annotations.
Abstract: Repurposing large vision-language models (LVLMs) as computer use agents (CUAs) has led to substantial breakthroughs, primarily driven by human-labeled data. However, these models often struggle with novel and specialized software, particularly in scenarios lacking human annotations. To address this challenge, we propose SEAgent, an agentic self-evolving framework enabling CUAs to autonomously evolve through interactions with unfamiliar software. Specifically, SEAgent empowers computer-use agents to autonomously master novel software environments via experiential learning, where agents explore new software, learn through iterative trial-and-error, and progressively tackle auto-generated tasks organized from simple to complex. To achieve this goal, we design a World State Model for step-wise trajectory assessment, along with a Curriculum Generator that generates increasingly diverse and challenging tasks. The agent’s policy is updated through experiential learning, comprised of adversarial imitation of failure actions and Group Relative Policy Optimization (GRPO) on successful ones. Furthermore, we introduce a specialist-to-generalist training strategy that integrates individual experiential insights from specialist agents, facilitating the development of a stronger generalist CUA capable of continuous autonomous evolution. This unified agent ultimately achieves performance surpassing ensembles of individual specialist agents on their specialized software. We validate the effectiveness of SEAgent across five novel software environments within OS-World. Our approach achieves a significant improvement of 23.2% in success rate, from 11.3% to 34.5%, over a competitive open-source CUA, i.e., UI-TARS.
[237] Beyond Ordinal Preferences: Why Alignment Needs Cardinal Human Feedback
Parker Whitfill, Stewy Slocum
Main category: cs.AI
TL;DR: Ordinal preference data is insufficient for optimal LLM alignment; cardinal feedback is needed. A new dataset of 25,000 cardinal judgments improves model performance.
Details
Motivation: Existing alignment methods rely on ordinal preferences, which lack the granularity to resolve tradeoffs and identify the most preferred model.Method: Collect cardinal feedback via willingness-to-pay elicitations and incorporate it into preference fine-tuning.
Result: Models using cardinal feedback outperform ordinal-only methods on benchmarks like Arena-Hard.
Conclusion: Cardinal feedback is essential for optimal LLM alignment, as ordinal data alone cannot systematically recover the best model.
Abstract: Alignment techniques for LLMs rely on optimizing preference-based objectives – where these preferences are typically elicited as ordinal, binary choices between responses. Recent work has focused on improving label quality or mitigating particular biases, but we identify a more fundamental limitation: these methods collect the wrong kind of data. We prove an impossibility result: no algorithm relying solely on ordinal comparisons can systematically recover the most preferred model. Intuitively, ordinal data lacks the information needed to resolve tradeoffs – e.g., fixing a factual error on one prompt versus improving style on another. We show that selecting the optimal model requires recovering preferences over \emph{models} (rather than just responses), which can only be identified given cardinal feedback about response quality. To address this, we collect and publicly release a dataset of 25,000 cardinal judgments using willingness-to-pay elicitations, a well-established tool from experimental economics. Empirically, we find that incorporating cardinal feedback into preference fine-tuning allows models to prioritize high-impact improvements and outperform ordinal-only methods on downstream benchmarks, such as Arena-Hard.
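A tiny worked example makes the impossibility argument tangible: two candidate models can be indistinguishable under ordinal (win/loss) feedback while cardinal feedback, such as willingness-to-pay for one response over another, cleanly separates them. All numbers below are invented for illustration.

```python
# Hedged illustration: ordinal comparisons tie; cardinal (WTP) values don't.
# Prompt 1: model A fixes a factual error (large value to users).
# Prompt 2: model B has slightly nicer style (small value to users).
ordinal_wins = {"A": 1, "B": 1}           # one win each -> tie under ordinals
wtp = {                                   # cardinal: dollars users would pay
    ("A", "prompt1"): 5.00,               # factual fix is worth a lot
    ("B", "prompt2"): 0.10,               # style tweak is worth little
}
total_value = {
    "A": wtp[("A", "prompt1")],
    "B": wtp[("B", "prompt2")],
}
print(max(ordinal_wins, key=ordinal_wins.get))   # tie, broken arbitrarily
print(max(total_value, key=total_value.get))     # "A": cardinal resolves it
```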
[238] POMO+: Leveraging starting nodes in POMO for solving Capacitated Vehicle Routing Problem
Szymon Jakubicz, Karol Kuźniak, Jan Wawszczak, Paweł Gora
Main category: cs.AI
TL;DR: POMO+ improves POMO by leveraging initial nodes for faster convergence and better results in combinatorial problems like VRP.
Details
Motivation: Existing RL methods like POMO show promise but have room for improvement, especially in solving combinatorial problems such as VRP.Method: Enhanced POMO (POMO+) by using initial nodes more effectively for informed solution-finding.
Result: POMO+ converges faster and achieves better results, validated on CVRPLIB for instances with up to 100 customers.
Conclusion: POMO+ advances RL-based combinatorial problem-solving, with potential for further field advancements.
Abstract: In recent years, reinforcement learning (RL) methods have emerged as a promising approach for solving combinatorial problems. Among RL-based models, POMO has demonstrated strong performance on a variety of tasks, including variants of the Vehicle Routing Problem (VRP). However, there is room for improvement for these tasks. In this work, we improved POMO, creating a method (POMO+) that leverages the initial nodes to find a solution in a more informed way. We ran experiments on our new model and observed that our solution converges faster and achieves better results. We validated our models on the CVRPLIB dataset and noticed improvements in problem instances with up to 100 customers. We hope that our research in this project can lead to further advancements in the field.
[239] Large Language Models as Oracles for Ontology Alignment
Sviatoslav Lushnei, Dmytro Shumskyi, Severyn Shykula, Ernesto Jimenez-Ruiz, Artur d’Avila Garcez
Main category: cs.AI
TL;DR: The paper explores using Large Language Models (LLMs) to validate uncertain ontology alignments, reducing reliance on expensive human experts.
Details
Motivation: Human involvement in ontology alignment is costly, especially for large ontologies, prompting the need for alternatives like LLMs.Method: The study evaluates LLMs using ontology-driven prompts on OAEI tasks and compares results to simulated Oracles.
Result: LLMs show potential as a viable alternative for validating uncertain alignments, though performance varies.
Conclusion: LLMs can supplement human experts in ontology alignment, particularly for uncertain cases, but further refinement is needed.
Abstract: Ontology alignment plays a crucial role in integrating diverse data sources across domains. A plethora of systems tackle the ontology alignment problem, yet challenges persist in producing high-quality correspondences among a set of input ontologies. Keeping a human in the loop during the alignment process is essential in applications requiring very accurate mappings. User involvement is, however, expensive when dealing with large ontologies. In this paper, we explore the feasibility of using Large Language Models (LLM) as an alternative to the domain expert. The use of the LLM focuses only on the validation of the subset of correspondences where an ontology alignment system is very uncertain. We have conducted an extensive evaluation over several matching tasks of the Ontology Alignment Evaluation Initiative (OAEI), analysing the performance of several state-of-the-art LLMs using different ontology-driven prompt templates. The LLM results are also compared against simulated Oracles with variable error rates.
[240] GVGAI-LLM: Evaluating Large Language Model Agents with Infinite Games
Yuchen Li, Cong Lin, Muhammad Umair Nasir, Philip Bontrager, Jialin Liu, Julian Togelius
Main category: cs.AI
TL;DR: GVGAI-LLM is a video game benchmark for evaluating LLMs’ reasoning and problem-solving, revealing their limitations in spatial reasoning and planning.
Details
Motivation: To assess LLMs' capabilities in diverse, arcade-style tasks beyond traditional benchmarks, highlighting their shortcomings in spatial and logical reasoning.Method: Uses a game description language for rapid game creation, ASCII representations for efficiency, and interpretable metrics like step ratio and efficiency.
Result: LLMs show persistent spatial and logical errors, with partial improvements from structured prompting and spatial grounding.
Conclusion: GVGAI-LLM offers a reproducible testbed for advancing LLM research, focusing on agentic behavior and contextual reasoning.
Abstract: We introduce GVGAI-LLM, a video game benchmark for evaluating the reasoning and problem-solving capabilities of large language models (LLMs). Built on the General Video Game AI framework, it features a diverse collection of arcade-style games designed to test a model’s ability to handle tasks that differ from most existing LLM benchmarks. The benchmark leverages a game description language that enables rapid creation of new games and levels, helping to prevent overfitting over time. Each game scene is represented by a compact set of ASCII characters, allowing for efficient processing by language models. GVGAI-LLM defines interpretable metrics, including the meaningful step ratio, step efficiency, and overall score, to assess model behavior. Through zero-shot evaluations across a broad set of games and levels with diverse challenges and skill depth, we reveal persistent limitations of LLMs in spatial reasoning and basic planning. Current models consistently exhibit spatial and logical errors, motivating structured prompting and spatial grounding techniques. While these interventions lead to partial improvements, the benchmark remains very far from solved. GVGAI-LLM provides a reproducible testbed for advancing research on language model capabilities, with a particular emphasis on agentic behavior and contextual reasoning.
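Two ingredients of the benchmark are easy to sketch: the compact ASCII scene handed to the model, and a "meaningful step ratio" style metric. The exact metric definition in GVGAI-LLM may differ; the version below simply counts the fraction of moves that changed the agent's position.

```python
# Hedged sketch: ASCII scene prompt plus a plausible meaningful-step metric.
scene = [
    "wwwwww",
    "w.A..w",   # A = agent, g = goal, w = wall, . = floor
    "w.w.gw",
    "wwwwww",
]
prompt = "You are A. Reach g. Moves: up/down/left/right.\n" + "\n".join(scene)

def meaningful_step_ratio(positions):
    """positions: agent (row, col) after each step, including the start."""
    moved = sum(1 for a, b in zip(positions, positions[1:]) if a != b)
    return moved / max(1, len(positions) - 1)

# e.g. the model walked into a wall twice out of five attempted moves:
trace = [(1, 2), (1, 3), (1, 3), (2, 3), (2, 3), (2, 4)]
print(meaningful_step_ratio(trace))   # 3/5 = 0.6
```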
[241] SynLLM: A Comparative Analysis of Large Language Models for Medical Tabular Synthetic Data Generation via Prompt Engineering
Arshia Ilaty, Hossein Shirazi, Hajar Homayouni
Main category: cs.AI
TL;DR: SynLLM is a framework for generating high-quality synthetic medical data using LLMs with structured prompts, evaluated for statistical fidelity, clinical consistency, and privacy.
Details
Motivation: Restricted access to real medical data due to privacy concerns hinders healthcare research; synthetic data is a viable alternative but lacks realism and privacy safeguards.Method: SynLLM uses 20 open-source LLMs (e.g., LLaMA, Mistral) with four prompt types (example-driven to rule-based) to generate synthetic data without fine-tuning, and evaluates it across multiple dimensions.
Result: Rule-based prompts achieve the best balance between data quality and privacy. SynLLM demonstrates LLMs can produce clinically plausible and privacy-aware synthetic data.
Conclusion: SynLLM enables safer and more effective synthetic data sharing in healthcare research by leveraging well-designed prompts and robust evaluation.
Abstract: Access to real-world medical data is often restricted due to privacy regulations, posing a significant barrier to the advancement of healthcare research. Synthetic data offers a promising alternative; however, generating realistic, clinically valid, and privacy-conscious records remains a major challenge. Recent advancements in Large Language Models (LLMs) offer new opportunities for structured data generation; however, existing approaches frequently lack systematic prompting strategies and comprehensive, multi-dimensional evaluation frameworks. In this paper, we present SynLLM, a modular framework for generating high-quality synthetic medical tabular data using 20 state-of-the-art open-source LLMs, including LLaMA, Mistral, and GPT variants, guided by structured prompts. We propose four distinct prompt types, ranging from example-driven to rule-based constraints, that encode schema, metadata, and domain knowledge to control generation without model fine-tuning. Our framework features a comprehensive evaluation pipeline that rigorously assesses generated data across statistical fidelity, clinical consistency, and privacy preservation. We evaluate SynLLM across three public medical datasets, including Diabetes, Cirrhosis, and Stroke, using 20 open-source LLMs. Our results show that prompt engineering significantly impacts data quality and privacy risk, with rule-based prompts achieving the best privacy-quality balance. SynLLM establishes that, when guided by well-designed prompts and evaluated with robust, multi-metric criteria, LLMs can generate synthetic medical data that is both clinically plausible and privacy-aware, paving the way for safer and more effective data sharing in healthcare research.
[242] UGM2N: An Unsupervised and Generalizable Mesh Movement Network via M-Uniform Loss
Zhichao Wang, Xinhai Chen, Qinglin Wang, Xiang Gao, Qingyang Zhang, Menghan Jia, Xiang Zhang, Jie Liu
Main category: cs.AI
TL;DR: UGM2N is an unsupervised, generalizable mesh movement network for PDEs, improving accuracy and efficiency without pre-adapted meshes.
Details
Motivation: Traditional mesh movement methods are computationally complex and inflexible, while supervised learning lacks zero-shot generalization.Method: UGM2N uses unsupervised geometric feature learning and a physics-constrained M-Uniform loss for mesh equidistribution.
Result: UGM2N outperforms existing methods, generalizing across PDEs and mesh geometries, ensuring error reduction and scalability.
Conclusion: UGM2N offers a robust, equation-agnostic solution for efficient mesh adaptation in PDE simulations.
Abstract: Partial differential equations (PDEs) form the mathematical foundation for modeling physical systems in science and engineering, where numerical solutions demand rigorous accuracy-efficiency tradeoffs. Mesh movement techniques address this challenge by dynamically relocating mesh nodes to rapidly-varying regions, enhancing both simulation accuracy and computational efficiency. However, traditional approaches suffer from high computational complexity and geometric inflexibility, limiting their applicability, and existing supervised learning-based approaches face challenges in zero-shot generalization across diverse PDEs and mesh topologies. In this paper, we present an Unsupervised and Generalizable Mesh Movement Network (UGM2N). We first introduce unsupervised mesh adaptation through localized geometric feature learning, eliminating the dependency on pre-adapted meshes. We then develop a physics-constrained loss function, M-Uniform loss, that enforces mesh equidistribution at the nodal level. Experimental results demonstrate that the proposed network exhibits equation-agnostic generalization and geometric independence in efficient mesh adaptation. It demonstrates consistent superiority over existing methods, including robust performance across diverse PDEs and mesh geometries, scalability to multi-scale resolutions and guaranteed error reduction without mesh tangling.
[243] AgriGPT: a Large Language Model Ecosystem for Agriculture
Bo Yang, Yu Zhang, Lanfei Feng, Yunkui Chen, Jianyu Zhang, Xiao Xu, Nueraili Aierken, Yurui Li, Yuxuan Chen, Guijun Yang, Yong He, Runhe Huang, Shijian Li
Main category: cs.AI
TL;DR: AgriGPT is a specialized LLM ecosystem for agriculture, addressing domain-specific challenges with curated datasets, retrieval-augmented generation, and a comprehensive benchmark.
Details
Motivation: The rapid progress of LLMs lacks domain-specific applications in agriculture, necessitating tailored models, datasets, and evaluation frameworks.Method: AgriGPT uses a multi-agent data engine to create Agri-342K (QA dataset) and Tri-RAG (three-channel retrieval framework) for factual grounding.
Result: AgriGPT outperforms general-purpose LLMs in domain adaptation and reasoning, validated by AgriBench-13K.
Conclusion: AgriGPT offers a modular, extensible framework for specialized LLMs, promoting open research and empowering underserved agricultural communities.
Abstract: Despite the rapid progress of Large Language Models (LLMs), their application in agriculture remains limited due to the lack of domain-specific models, curated datasets, and robust evaluation frameworks. To address these challenges, we propose AgriGPT, a domain-specialized LLM ecosystem for agricultural usage. At its core, we design a multi-agent scalable data engine that systematically compiles credible data sources into Agri-342K, a high-quality, standardized question-answer (QA) dataset. Trained on this dataset, AgriGPT supports a broad range of agricultural stakeholders, from practitioners to policy-makers. To enhance factual grounding, we employ Tri-RAG, a three-channel Retrieval-Augmented Generation framework combining dense retrieval, sparse retrieval, and multi-hop knowledge graph reasoning, thereby improving the LLM’s reasoning reliability. For comprehensive evaluation, we introduce AgriBench-13K, a benchmark suite comprising 13 tasks with varying types and complexities. Experiments demonstrate that AgriGPT significantly outperforms general-purpose LLMs on both domain adaptation and reasoning. Beyond the model itself, AgriGPT represents a modular and extensible LLM ecosystem for agriculture, comprising structured data construction, retrieval-enhanced generation, and domain-specific evaluation. This work provides a generalizable framework for developing scientific and industry-specialized LLMs. All models, datasets, and code will be released to empower agricultural communities, especially in underserved regions, and to promote open, impactful research.
[244] Diminution: On Reducing the Size of Grounding ASP Programs
HuanYu Yang, Fengming Zhu, YangFan Wu, Jianmin Ji
Main category: cs.AI
TL;DR: The paper introduces ‘diminution’ to reduce the grounding bottleneck in Answer Set Programming (ASP) by selecting subsets of the Herbrand universe, improving performance and grounding file size.
Details
Motivation: The grounding bottleneck in ASP, caused by large Herbrand universes, necessitates a formal and generalizable strategy to improve grounding performance.Method: The paper defines diminution, analyzes its properties, and uses an encoding for ASP solvers to evaluate subsets. It integrates with existing grounders via domain predicates.
Result: Experiments show up to 70% faster grounding and 85% smaller grounding files across five benchmarks.
Conclusion: Diminution is a robust, general-purpose solution to the ASP grounding bottleneck.
Abstract: Answer Set Programming (ASP) is often hindered by the grounding bottleneck: large Herbrand universes generate ground programs so large that solving becomes difficult. Many methods employ ad-hoc heuristics to improve grounding performance, motivating the need for a more formal and generalizable strategy. We introduce the notion of diminution, defined as a selected subset of the Herbrand universe used to generate a reduced ground program before solving. We give a formal definition of diminution, analyze its key properties, and study the complexity of identifying it. We use a specific encoding that enables off-the-shelf ASP solver to evaluate candidate subsets. Our approach integrates seamlessly with existing grounders via domain predicates. In extensive experiments on five benchmarks, applying diminutions selected by our strategy yields significant performance improvements, reducing grounding time by up to 70% on average and decreasing the size of grounding files by up to 85%. These results demonstrate that leveraging diminutions constitutes a robust and general-purpose approach for alleviating the grounding bottleneck in ASP.
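The size effect of a diminution can be shown with a toy grounding in Python: instantiate the rule path(X, Z) <- edge(X, Y), edge(Y, Z) once over the full Herbrand universe and once over a selected subset, then compare. Real systems would express the subset via domain predicates and hand it to the grounder, as the abstract notes; the naive triple loop below is purely illustrative.

```python
# Hedged sketch: grounding over the full universe vs. over a diminution.
from itertools import product

universe = list(range(50))
edges = {(i, i + 1) for i in range(9)}        # facts touch only constants 0..9

def ground(domain):
    rules = []
    for x, y, z in product(domain, repeat=3):  # naive instantiation
        if (x, y) in edges and (y, z) in edges:
            rules.append(("path", x, z))
    return rules

full = ground(universe)                        # 50^3 candidate instances
diminution = sorted({c for e in edges for c in e})  # constants occurring in facts
small = ground(diminution)                     # 10^3 candidates, same rules

assert full == small                           # identical ground program
print(f"candidates: {50**3} vs {len(diminution)**3}; ground rules: {len(full)}")
```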
[245] P-CAFE: Personalized Cost-Aware Incremental Feature Selection For Electronic Health Records
Naama Kashani, Mira Cohen, Uri Shaham
Main category: cs.AI
TL;DR: A novel personalized, online, and cost-aware feature selection framework for EHR data addresses sparsity and heterogeneity challenges, optimizing diagnostic confidence and resource use.
Details
Motivation: Extracting insights from complex EHR data is difficult due to sparsity, heterogeneity, and feature costs, requiring tailored solutions.Method: Proposes a personalized, online feature selection framework for EHR data, incorporating budgetary constraints and feature variability costs.
Result: The framework manages sparse, multimodal data effectively, supporting scalable performance in healthcare.
Conclusion: The method aids physicians in decision-making by prioritizing informative features within budgets, enhancing diagnostics and resource efficiency.
Abstract: Electronic Health Records (EHR) have revolutionized healthcare by digitizing patient data, improving accessibility, and streamlining clinical workflows. However, extracting meaningful insights from these complex and multimodal datasets remains a significant challenge for researchers. Traditional feature selection methods often struggle with the inherent sparsity and heterogeneity of EHR data, especially when accounting for patient-specific variations and feature costs in clinical applications. To address these challenges, we propose a novel personalized, online and cost-aware feature selection framework tailored specifically for EHR datasets. The features are acquired in an online fashion for individual patients, incorporating budgetary constraints and feature variability costs. The framework is designed to effectively manage sparse and multimodal data, ensuring robust and scalable performance in diverse healthcare contexts. A primary application of our proposed method is to support physicians’ decision making in patient screening scenarios. By guiding physicians toward incremental acquisition of the most informative features within budget constraints, our approach aims to increase diagnostic confidence while optimizing resource utilization.
[246] Prompt-and-Check: Using Large Language Models to Evaluate Communication Protocol Compliance in Simulation-Based Training
Vishakha Lall, Yisi Liu
Main category: cs.AI
TL;DR: The paper introduces Prompt-and-Check, a lightweight method using open-source LLMs to evaluate procedural communication compliance in simulation-based training, demonstrating its effectiveness in maritime case studies.
Details
Motivation: Accurate evaluation of compliance in safety-critical domains is crucial for operational competence, but traditional methods may be resource-intensive.Method: The approach uses prompt-based inference with LLMs (e.g., LLama 2 7B, Mistral 7B) to assess checklist fulfillment from transcribed verbal exchanges, tested in a maritime simulation.
Result: The method achieves effective context-aware reasoning without task-specific training, validated by expert-annotated ground truth.
Conclusion: Prompt-based methods like Prompt-and-Check can enhance debriefing, feedback, and automated assessment in training environments.
Abstract: Accurate evaluation of procedural communication compliance is essential in simulation-based training, particularly in safety-critical domains where adherence to compliance checklists reflects operational competence. This paper explores a lightweight, deployable approach using prompt-based inference with open-source large language models (LLMs) that can run efficiently on consumer-grade GPUs. We present Prompt-and-Check, a method that uses context-rich prompts to evaluate whether each checklist item in a protocol has been fulfilled, solely based on transcribed verbal exchanges. We perform a case study in the maritime domain with participants performing an identical simulation task, and experiment with models such as Llama 2 7B, Llama 3 8B and Mistral 7B, running locally on an RTX 4070 GPU. For each checklist item, a prompt incorporating relevant transcript excerpts is fed into the model, which outputs a compliance judgment. We assess model outputs against expert-annotated ground truth using classification accuracy and agreement scores. Our findings demonstrate that prompting enables effective context-aware reasoning without task-specific training. This study highlights the practical utility of LLMs in augmenting debriefing, performance feedback, and automated assessment in training environments.
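As a rough illustration of the Prompt-and-Check loop, the sketch below feeds each checklist item plus a transcript excerpt into a prompt and parses a YES/NO judgment. The prompt wording and the `generate` callable (any locally hosted LLM inference function) are assumptions, not the paper's exact setup.

```python
# Illustrative sketch of per-item compliance checking from a transcript.
# The prompt wording and the `generate` callable are assumptions.

CHECKLIST = [
    "Crew confirmed the departure checklist verbally.",
    "Engine room acknowledged the 'stand by engines' order.",
]

PROMPT = (
    "You are auditing a maritime training session.\n"
    "Transcript excerpt:\n{excerpt}\n\n"
    "Checklist item: {item}\n"
    "Was this item fulfilled? Answer YES or NO, then give a one-line reason."
)

def check_compliance(excerpt, generate):
    """Return (item, fulfilled?) pairs by querying the LLM once per item."""
    results = []
    for item in CHECKLIST:
        reply = generate(PROMPT.format(excerpt=excerpt, item=item))
        results.append((item, reply.strip().upper().startswith("YES")))
    return results

# `generate` could wrap llama.cpp, vLLM, or a transformers text-generation
# pipeline; a stub shows the expected interface.
stub = lambda prompt: "YES - the engine room acknowledged the order."
print(check_compliance("Bridge: stand by engines. ER: standing by.", stub))
```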
[247] Hybrid Node-Destroyer Model with Large Neighborhood Search for Solving the Capacitated Vehicle Routing Problem
Bachtiar Herdianto, Romain Billot, Flavien Lucas, Marc Sevaux, Daniele Vigo
Main category: cs.AI
TL;DR: A hybrid optimization solver combining machine learning (GNNs) and metaheuristics improves performance for the Capacitated Vehicle Routing Problem (CVRP), scaling to large instances.
Details
Motivation: To enhance metaheuristic algorithms' performance in solving CVRP by integrating machine learning for strategic node selection.Method: Uses a Node-Destroyer Model with GNNs to guide Large Neighborhood Search (LNS) in metaheuristics, reducing search space complexity.
Result: Improves solution quality for CVRP benchmarks and scales to 30,000 nodes without retraining.
Conclusion: The hybrid approach effectively boosts metaheuristic performance and scalability for CVRP.
Abstract: In this research, we propose an iterative learning hybrid optimization solver developed to strengthen the performance of metaheuristic algorithms in solving the Capacitated Vehicle Routing Problem (CVRP). The iterative hybrid mechanism integrates the proposed Node-Destroyer Model, a machine-learning model that uses Graph Neural Networks (GNNs) to identify and select the customer nodes that guide the Large Neighborhood Search (LNS) operator within metaheuristic optimization frameworks. This model leverages the structural properties of the problem and its solutions, both representable as graphs, to guide strategic node-removal decisions. The proposed approach reduces operational complexity and scales down the search space involved in the optimization process. The hybrid approach is applied specifically to the CVRP and does not require retraining across problem instances of different sizes. The proposed hybrid mechanism is able to improve the performance of baseline metaheuristic algorithms. Our approach not only enhances the solution quality for standard CVRP benchmarks but also proves scalable on very large-scale instances with up to 30,000 customer nodes. Experimental evaluations on benchmark datasets show that the proposed hybrid mechanism is capable of improving different baseline algorithms, achieving better-quality solutions under similar settings.
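A hedged sketch of the overall mechanism: an LNS destroy step whose node selection is delegated to a learned scorer (random numbers below standing in for the paper's GNN), followed by a naive greedy repair. Routes and scores are toy data.

```python
# Hedged sketch: an LNS destroy step whose node selection comes from a
# learned scorer (here random numbers standing in for the paper's GNN),
# followed by a naive greedy repair. Routes and scores are toy data.
import random

def lns_step(routes, node_scores, k, repair):
    """Remove the k highest-scoring customers, then reinsert them."""
    customers = [c for route in routes for c in route]
    to_remove = set(sorted(customers, key=node_scores.get)[-k:])
    partial = [[c for c in route if c not in to_remove] for route in routes]
    return repair(partial, to_remove)

def greedy_repair(partial, removed):
    for c in removed:                    # naive: extend the shortest route
        min(partial, key=len).append(c)
    return partial

routes = [[1, 2, 3], [4, 5], [6, 7, 8]]
scores = {c: random.random() for c in range(1, 9)}  # stand-in for GNN output
print(lns_step(routes, scores, k=2, repair=greedy_repair))
```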
[248] Aryabhata: An exam-focused language model for JEE Math
Ritvik Rastogi, Sachin Dharashivkar, Sandeep Varma
Main category: cs.AI
TL;DR: Aryabhata 1.0 is a 7B parameter math reasoning model optimized for the JEE exam, combining open-weight models, SFT, and RLVR for superior performance and pedagogical utility.
Details
Motivation: Current LLMs are unsuitable for education; Aryabhata aims to bridge this gap for Indian exams like JEE.Method: Merges open-weight models, uses SFT with curriculum learning, and applies RLVR with novel exploration strategies.
Result: Outperforms existing models on JEE and other benchmarks, offering step-by-step reasoning.
Conclusion: Aryabhata is released as an open-source foundation model to improve educational outcomes.
Abstract: We present Aryabhata 1.0, a compact 7B parameter math reasoning model optimized for the Indian academic exam, the Joint Entrance Examination (JEE). Despite rapid progress in large language models (LLMs), current models often remain unsuitable for educational use. Aryabhata 1.0 is built by merging strong open-weight reasoning models, followed by supervised fine-tuning (SFT) with curriculum learning on verified chain-of-thought (CoT) traces curated through best-of-$n$ rejection sampling. To further boost performance, we apply reinforcement learning with verifiable rewards (RLVR) using an A2C objective with group-relative advantage estimation along with novel exploration strategies such as Adaptive Group Resizing and Temperature Scaling. Evaluated on both in-distribution (JEE Main 2025) and out-of-distribution (MATH, GSM8K) benchmarks, Aryabhata outperforms existing models in accuracy and efficiency, while offering pedagogically useful step-by-step reasoning. We release Aryabhata as a foundation model to advance exam-centric, open-source small language models. This marks our first open release for community feedback (https://huggingface.co/PhysicsWallahAI/Aryabhata-1.0); PW is actively training future models to further improve learning outcomes for students.
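The abstract names group-relative advantage estimation; a common formulation, which GRPO-style methods also use, standardizes each sampled answer's reward within its group. The sketch below shows that estimator under the assumption that Aryabhata's variant is similar; Adaptive Group Resizing and Temperature Scaling are not modeled.

```python
# Sketch of group-relative advantage estimation: each of n sampled answers
# to the same problem receives its reward standardized within the group.
# This mirrors GRPO-style estimators; Aryabhata's exact variant is not
# specified in the summary, so treat this as an assumption.
import statistics

def group_relative_advantages(rewards):
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards) or 1.0  # guard against zero variance
    return [(r - mu) / sigma for r in rewards]

# Verifiable rewards for 4 sampled solutions to one problem (1 = correct).
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
# -> [1.0, -1.0, -1.0, 1.0]
```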
[249] STELAR-VISION: Self-Topology-Aware Efficient Learning for Aligned Reasoning in Vision
Chen Li, Han Zhang, Zhantao Yang, Fangyi Chen, Zihan Wang, Anudeepsekhar Bolimera, Marios Savvides
Main category: cs.AI
TL;DR: STELAR-Vision improves VLMs by introducing topology-aware reasoning, achieving higher accuracy and efficiency compared to traditional CoT methods.
Details
Motivation: VLMs struggle with complex multimodal tasks and verbose outputs due to reliance on CoT reasoning, prompting the need for alternative topologies.Method: STELAR-Vision uses TopoAug for synthetic data with diverse topologies, supervised fine-tuning, reinforcement learning, and Frugal Learning to reduce output length.
Result: STELAR-Vision outperforms base and larger models (e.g., Qwen2VL-72B-Instruct) by up to 9.7% and 7.3%, respectively, and excels in OOD benchmarks.
Conclusion: The framework demonstrates strong generalization and efficiency, with released datasets and upcoming code availability.
Abstract: Vision-language models (VLMs) have made significant strides in reasoning, yet they often struggle with complex multimodal tasks and tend to generate overly verbose outputs. A key limitation is their reliance on chain-of-thought (CoT) reasoning, despite many tasks benefiting from alternative topologies like trees or graphs. To address this, we introduce STELAR-Vision, a training framework for topology-aware reasoning. At its core is TopoAug, a synthetic data pipeline that enriches training with diverse topological structures. Using supervised fine-tuning and reinforcement learning, we post-train Qwen2VL models with both accuracy and efficiency in mind. Additionally, we propose Frugal Learning, which reduces output length with minimal accuracy loss. On MATH-V and VLM-S2H, STELAR-Vision improves accuracy by 9.7% over its base model and surpasses the larger Qwen2VL-72B-Instruct by 7.3%. On five out-of-distribution benchmarks, it outperforms Phi-4-Multimodal-Instruct by up to 28.4% and LLaMA-3.2-11B-Vision-Instruct by up to 13.2%, demonstrating strong generalization. Compared to Chain-Only training, our approach achieves 4.3% higher overall accuracy on in-distribution datasets and consistently outperforms across all OOD benchmarks. We have released datasets, and code will be available.
[250] Simulating Generative Social Agents via Theory-Informed Workflow Design
Yuwei Yan, Jinghua Piao, Xiaochong Lan, Chenyang Shao, Pan Hui, Yong Li
Main category: cs.AI
TL;DR: A theory-informed framework for LLM-based social agents is proposed, integrating motivation, action planning, and learning to improve generalization and realism in social simulations.
Details
Motivation: Addressing the lack of a unified framework for LLM-based social agents, which limits generalization and realism across contexts.Method: A framework grounded in Social Cognition Theory with three modules: motivation, action planning, and learning.
Result: Agents achieve up to 75% lower deviation from real-world data; removing individual modules increases errors by 1.5-3.2x.
Conclusion: The framework enhances flexibility and realism in social agent behaviors, validated by experiments.
Abstract: Recent advances in large language models have demonstrated strong reasoning and role-playing capabilities, opening new opportunities for agent-based social simulations. However, most existing agents’ implementations are scenario-tailored, without a unified framework to guide the design. This lack of a general social agent limits their ability to generalize across different social contexts and to produce consistent, realistic behaviors. To address this challenge, we propose a theory-informed framework that provides a systematic design process for LLM-based social agents. Our framework is grounded in principles from Social Cognition Theory and introduces three key modules: motivation, action planning, and learning. These modules jointly enable agents to reason about their goals, plan coherent actions, and adapt their behavior over time, leading to more flexible and contextually appropriate responses. Comprehensive experiments demonstrate that our theory-driven agents reproduce realistic human behavior patterns under complex conditions, achieving up to 75% lower deviation from real-world behavioral data across multiple fidelity metrics compared to classical generative baselines. Ablation studies further show that removing motivation, planning, or learning modules increases errors by 1.5 to 3.2 times, confirming their distinct and essential contributions to generating realistic and coherent social behaviors.
[251] Designing Memory-Augmented AR Agents for Spatiotemporal Reasoning in Personalized Task Assistance
Dongwook Choi, Taeyoon Kwon, Dongil Yang, Hyojun Kim, Jinyoung Yeo
Main category: cs.AI
TL;DR: The paper proposes a memory-augmented AR agent framework to enhance personalized task assistance by leveraging historical user interactions and spatiotemporal contexts.
Details
Motivation: Current AR agents lack the ability to handle complex multi-step scenarios due to their inability to retain and reason over long-term user experiences and preferences.Method: The framework includes four modules: Perception (multimodal sensor processing), Memory (persistent spatiotemporal storage), Spatiotemporal Reasoning (context synthesis), and Actuator (AR communication).
Result: A conceptual framework and implementation roadmap are presented, along with potential applications, to demonstrate practical utility.
Conclusion: The work aims to inspire future research for more intelligent AR systems that integrate user history with adaptive, context-aware assistance.
Abstract: Augmented Reality (AR) systems are increasingly integrating foundation models, such as Multimodal Large Language Models (MLLMs), to provide more context-aware and adaptive user experiences. This integration has led to the development of AR agents to support intelligent, goal-directed interactions in real-world environments. While current AR agents effectively support immediate tasks, they struggle with complex multi-step scenarios that require understanding and leveraging user’s long-term experiences and preferences. This limitation stems from their inability to capture, retain, and reason over historical user interactions in spatiotemporal contexts. To address these challenges, we propose a conceptual framework for memory-augmented AR agents that can provide personalized task assistance by learning from and adapting to user-specific experiences over time. Our framework consists of four interconnected modules: (1) Perception Module for multimodal sensor processing, (2) Memory Module for persistent spatiotemporal experience storage, (3) Spatiotemporal Reasoning Module for synthesizing past and present contexts, and (4) Actuator Module for effective AR communication. We further present an implementation roadmap, a future evaluation strategy, a potential target application and use cases to demonstrate the practical applicability of our framework across diverse domains. We aim for this work to motivate future research toward developing more intelligent AR systems that can effectively bridge user’s interaction history with adaptive, context-aware task assistance.
[252] A Dual-Axis Taxonomy of Knowledge Editing for LLMs: From Mechanisms to Functions
Amir Mohammad Salehoof, Ali Ramezani, Yadollah Yaghoobzadeh, Majid Nili Ahmadabadi
Main category: cs.AI
TL;DR: The paper introduces a function-based taxonomy for knowledge editing in large language models (LLMs), complementing existing mechanism-focused surveys. It analyzes how editing effectiveness varies with knowledge types and reviews methods, evaluations, and challenges.
Details
Motivation: To address the gap in existing surveys that focus on editing mechanisms but overlook the function of the knowledge being edited, aiming for a more holistic understanding.Method: Introduces a function-based taxonomy to categorize knowledge types (factual, temporal, conceptual, commonsense, social) and reviews editing methods along these and mechanism axes.
Result: Provides a comprehensive mapping of the knowledge editing landscape, highlighting strengths, limitations, and effectiveness based on knowledge type.
Conclusion: The survey formalizes the problem, reviews evaluation tasks and datasets, and identifies open challenges and future directions for knowledge editing in LLMs.
Abstract: Large language models (LLMs) acquire vast knowledge from large text corpora, but this information can become outdated or inaccurate. Since retraining is computationally expensive, knowledge editing offers an efficient alternative – modifying internal knowledge without full retraining. These methods aim to update facts precisely while preserving the model’s overall capabilities. While existing surveys focus on the mechanism of editing (e.g., parameter changes vs. external memory), they often overlook the function of the knowledge being edited. This survey introduces a novel, complementary function-based taxonomy to provide a more holistic view. We examine how different mechanisms apply to various knowledge types – factual, temporal, conceptual, commonsense, and social – highlighting how editing effectiveness depends on the nature of the target knowledge. By organizing our review along these two axes, we map the current landscape, outline the strengths and limitations of existing methods, define the problem formally, survey evaluation tasks and datasets, and conclude with open challenges and future directions.
[253] GRainsaCK: a Comprehensive Software Library for Benchmarking Explanations of Link Prediction Tasks on Knowledge Graphs
Roberto Barile, Claudia d’Amato, Nicola Fanizzi
Main category: cs.AI
TL;DR: GRainsaCK is a reusable software resource for benchmarking explanations in link prediction tasks, addressing the lack of standard evaluation protocols.
Details
Motivation: The incompleteness of Knowledge Graphs and the lack of comprehensibility in embedding-based link prediction methods necessitate explainable solutions. However, evaluating explanations is challenging due to missing standards.Method: GRainsaCK streamlines benchmarking tasks, from model training to explanation evaluation, using a modular and extensible design.
Result: The tool provides a reusable, documented, and tutorial-supported resource for standardized explanation evaluation.
Conclusion: GRainsaCK fills a critical gap in evaluating explanations for link prediction, promoting modularity and reuse.
Abstract: Since Knowledge Graphs are often incomplete, link prediction methods are adopted for predicting missing facts. Scalable embedding-based solutions are mostly adopted for this purpose; however, they lack comprehensibility, which may be crucial in several domains. Explanation methods tackle this issue by identifying supporting knowledge explaining the predicted facts. Regrettably, quantitatively evaluating and comparing the resulting explanations is challenging as there is no standard evaluation protocol and overall benchmarking resource. We fill this important gap by proposing GRainsaCK, a reusable software resource that fully streamlines all the tasks involved in benchmarking explanations, i.e., from model training to evaluation of explanations along the same evaluation protocol. Moreover, GRainsaCK furthers modularity and extensibility by implementing the main components as functions that can be easily replaced. Finally, fostering its reuse, we provide extensive documentation including a tutorial.
[254] Efficient Agent: Optimizing Planning Capability for Multimodal Retrieval Augmented Generation
Yuechen Wang, Yuming Qiao, Dan Meng, Jun Yang, Haonan Lu, Zhenyu Yang, Xudong Zhang
Main category: cs.AI
TL;DR: E-Agent improves multimodal retrieval-augmented generation (mRAG) with dynamic planning and execution, outperforming existing methods by 13% in accuracy and reducing redundant searches by 37%.
Details
Motivation: Addressing rigid retrieval strategies and under-utilization of visual information in mRAG systems for real-world applications like news analysis.Method: Proposes E-Agent, featuring a mRAG planner for dynamic tool orchestration and a task executor for optimized workflows, along with the RemPlan benchmark for evaluation.
Result: E-Agent achieves a 13% accuracy gain over state-of-the-art methods and reduces redundant searches by 37%.
Conclusion: E-Agent’s dynamic planning and execution framework significantly enhances mRAG performance, validated by the novel RemPlan benchmark.
Abstract: Multimodal Retrieval-Augmented Generation (mRAG) has emerged as a promising solution to address the temporal limitations of Multimodal Large Language Models (MLLMs) in real-world scenarios like news analysis and trending topics. However, existing approaches often suffer from rigid retrieval strategies and under-utilization of visual information. To bridge this gap, we propose E-Agent, an agent framework featuring two key innovations: a mRAG planner trained to dynamically orchestrate multimodal tools based on contextual reasoning, and a task executor employing tool-aware execution sequencing to implement optimized mRAG workflows. E-Agent adopts a one-time mRAG planning strategy that enables efficient information retrieval while minimizing redundant tool invocations. To rigorously assess the planning capabilities of mRAG systems, we introduce the Real-World mRAG Planning (RemPlan) benchmark. This novel benchmark contains both retrieval-dependent and retrieval-independent question types, systematically annotated with essential retrieval tools required for each instance. The benchmark’s explicit mRAG planning annotations and diverse question design enhance its practical relevance by simulating real-world scenarios requiring dynamic mRAG decisions. Experiments across RemPlan and three established benchmarks demonstrate E-Agent’s superiority: 13% accuracy gain over state-of-the-art mRAG methods while reducing redundant searches by 37%.
[255] Silicon Minds versus Human Hearts: The Wisdom of Crowds Beats the Wisdom of AI in Emotion Recognition
Mustafa Akben, Vinayaka Gude, Haya Ajjan
Main category: cs.AI
TL;DR: MLLMs outperform humans in individual emotion recognition but are surpassed by collective human intelligence. Human-AI collaboration yields the highest accuracy.
Details
Motivation: To explore AI's emotional intelligence, specifically MLLMs, and compare their emotion recognition abilities with humans.Method: Evaluated MLLMs and humans using the RMET and MRMET tests, comparing individual and aggregated performances.
Result: MLLMs outperform humans individually, but human groups surpass aggregated MLLM predictions. Collaboration between humans and MLLMs achieves the highest accuracy.
Conclusion: Human collective intelligence and human-AI collaboration are key for advancing emotionally intelligent AI systems.
Abstract: The ability to discern subtle emotional cues is fundamental to human social intelligence. As artificial intelligence (AI) becomes increasingly common, AI’s ability to recognize and respond to human emotions is crucial for effective human-AI interactions. In particular, whether such systems can match or surpass human experts remains to be seen. However, the emotional intelligence of AI, particularly multimodal large language models (MLLMs), remains largely unexplored. This study evaluates the emotion recognition abilities of MLLMs using the Reading the Mind in the Eyes Test (RMET) and its multiracial counterpart (MRMET), and compares their performance against human participants. Results show that, on average, MLLMs outperform humans in accurately identifying emotions across both tests. This trend persists even when comparing performance across low, medium, and expert-level performing groups. Yet when we aggregate independent human decisions to simulate collective intelligence, human groups significantly surpass the performance of aggregated MLLM predictions, highlighting the wisdom of the crowd. Moreover, a collaborative approach (augmented intelligence) that combines human and MLLM predictions achieves greater accuracy than either humans or MLLMs alone. These results suggest that while MLLMs exhibit strong emotion recognition at the individual level, the collective intelligence of humans and the synergistic potential of human-AI collaboration offer the most promising path toward effective emotional AI. We discuss the implications of these findings for the development of emotionally intelligent AI systems and future research directions.
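The aggregation contrast at the core of the study can be illustrated with a simple majority vote over individual answers; the labels below are invented toy data, not RMET items.

```python
# Toy illustration of the aggregation contrast: individual answers vs a
# majority vote ("wisdom of the crowd") vs a mixed human+model crowd.
from collections import Counter

def majority_vote(answers):
    return Counter(answers).most_common(1)[0][0]

truth = "contemplative"
humans = ["contemplative", "skeptical", "contemplative", "irritated",
          "contemplative"]
models = ["skeptical", "skeptical", "contemplative"]

print("human crowd :", majority_vote(humans) == truth)           # True
print("model crowd :", majority_vote(models) == truth)           # False
print("augmented   :", majority_vote(humans + models) == truth)  # True
```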
[256] Reducing Cognitive Load in Multi-Agent Reinforcement Learning for Mathematical Problem Solving: Decoupling Reasoning and Code Generation
Dayu Wang, Jiaye Yang, Weikang Li, Jiahui Liang, Yang Li
Main category: cs.AI
TL;DR: A dual-agent framework separates reasoning and coding tasks to reduce cognitive load, improving accuracy over single-agent systems.
Details
Motivation: Single-agent systems for mathematical reasoning impose cognitive load by interleaving reasoning and coding, reducing accuracy.Method: Proposes a dual-agent framework: a Reasoning Agent for problem decomposition and a Code Agent for code tasks. Training combines imitation and reinforcement learning.
Result: The dual-agent system reduces cognitive interference, leading to more stable and accurate reasoning-coding coordination.
Conclusion: Decoupling reasoning and coding roles enhances performance in tool-integrated mathematical reasoning systems.
Abstract: Current tool-integrated mathematical reasoning systems often adopt a single-agent paradigm, where one large language model handles problem reasoning, code generation, and code execution in an integrated workflow. While this design eases coordination, we hypothesize that it imposes cognitive load interference, as the agent must interleave long-horizon reasoning with precise program synthesis. We validate this hypothesis through a controlled comparison between a reasoning-only agent and a reasoning-plus-code agent, finding that the latter produces significantly fewer correct reasoning paths despite having tool-calling capabilities. To address this, we propose a dual-agent hybrid framework: a Reasoning Agent performs stepwise problem decomposition, and a Code Agent handles code generation and execution. Training combines imitation learning and reinforcement learning: the Code Agent receives strong rewards for matching intermediate ground-truth programs and weaker rewards for valid execution, while the Reasoning Agent is optimized chiefly via final-answer accuracy using advantage estimation to credit intermediate steps. This decoupled role design reduces cognitive interference and promotes stable reasoning-coding coordination.
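The asymmetric reward scheme for the Code Agent can be sketched as follows; the exact-match proxy and the reward weights are assumptions, since the summary only states that program matches earn strong rewards and valid execution earns weaker ones.

```python
# Sketch of the asymmetric reward described for the Code Agent: a strong
# reward for matching the ground-truth intermediate program, a weaker one
# for merely executing without error. Weights and the exact-match proxy
# are assumptions.

def code_agent_reward(generated, reference, executed_ok,
                      w_match=1.0, w_exec=0.2):
    if generated.strip() == reference.strip():  # exact-match proxy
        return w_match
    return w_exec if executed_ok else 0.0

print(code_agent_reward("x = 2 + 2", "x = 2 + 2", True))  # 1.0
print(code_agent_reward("x = 5", "x = 2 + 2", True))      # 0.2
```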
[257] Compass-Thinker-7B Technical Report
Anxiang Zeng, Haibo Zhang, Kaixiang Mo, Long Zhang, Shuman Liu, Yanhui Huang, Yawen Liu, Yuepeng Sheng, Yuwei Huang
Main category: cs.AI
TL;DR: The paper introduces Compass-Thinker-7B, a model exploring Reinforcement Learning (RL) for reasoning in LLMs with reduced computational costs, using a curated dataset of 30k math problems. It outperforms same-sized RL models, achieving 40% accuracy on AIME2024.
Details
Motivation: To address the high computational costs and risks of RL on hyperscale models by developing a smaller, efficient model (Compass-Thinker-7B) that explores RL's potential for reasoning.Method: Train Compass-Thinker-7B from an open-source model using a specialized RL pipeline and a dataset of 30k verifiable math problems, with staged difficulty distributions for efficiency.
Result: Compass-Thinker-7B demonstrates exceptional reasoning, outperforming same-sized RL models, notably achieving 40% accuracy in AIME2024.
Conclusion: The model successfully explores RL’s potential with reduced resources, offering insights for scaling RL to larger models.
Abstract: Recent R1-Zero-like research further demonstrates that extended reasoning gives large language models (LLMs) unprecedented reasoning capabilities, with Reinforcement Learning as the core technology for eliciting such complex reasoning. However, conducting RL experiments directly on hyperscale models involves high computational costs and resource demands, posing significant risks. We propose the Compass-Thinker-7B model, which aims to explore the potential of Reinforcement Learning with less computational resources and costs, and provides insights for further research into RL recipes for larger models. Compass-Thinker-7B is trained from an open-source model through a specially designed Reinforcement Learning Pipeline. We curate a dataset of 30k verifiable mathematics problems for the Reinforcement Learning Pipeline. By configuring data and training settings with different difficulty distributions for different stages, the potential of the model is gradually released and the training efficiency is improved. Extensive evaluations show that Compass-Thinker-7B possesses exceptional reasoning potential, and achieves superior performance on mathematics compared to same-sized RL models. Notably, in the challenging AIME2024 evaluation, Compass-Thinker-7B achieves 40% accuracy.
[258] Safe Semantics, Unsafe Interpretations: Tackling Implicit Reasoning Safety in Large Vision-Language Models
Wei Cai, Jian Zhao, Yuchu Jiang, Tianle Zhang, Xuelong Li
Main category: cs.AI
TL;DR: The paper introduces Implicit Reasoning Safety as a vulnerability in Large Vision-Language Models (LVLMs) and presents a dataset (SSUI) to address it, showing that simple In-Context Learning can mitigate these risks.
Details
Motivation: To highlight and address the safety challenges in LVLMs caused by flawed or hidden reasoning in multimodal inputs.Method: Developed the Safe Semantics, Unsafe Interpretations (SSUI) dataset and demonstrated mitigation using In-Context Learning.
Result: Simple In-Context Learning with SSUI significantly reduces implicit multimodal threats.
Conclusion: Improving cross-modal implicit reasoning is urgently needed to enhance LVLM safety.
Abstract: Large Vision-Language Models face growing safety challenges with multimodal inputs. This paper introduces the concept of Implicit Reasoning Safety, a vulnerability in LVLMs. Benign combined inputs trigger unsafe LVLM outputs due to flawed or hidden reasoning. To showcase this, we developed Safe Semantics, Unsafe Interpretations, the first dataset for this critical issue. Our demonstrations show that even simple In-Context Learning with SSUI significantly mitigates these implicit multimodal threats, underscoring the urgent need to improve cross-modal implicit reasoning.
[259] Prospect Theory Fails for LLMs: Revealing Instability of Decision-Making under Epistemic Uncertainty
Rui Wang, Qihan Lin, Jiayu Liu, Qing Zong, Tianshi Zheng, Weiqi Wang, Yangqiu Song
Main category: cs.AI
TL;DR: The paper explores whether Prospect Theory (PT) applies to Large Language Models (LLMs) and how epistemic markers affect their decision-making. It introduces a three-stage experiment and a new evaluation framework, finding PT unreliable for LLMs under linguistic uncertainty.
Details
Motivation: To investigate if PT, a model of human decision-making under uncertainty, applies to LLMs and how epistemic markers influence their behavior.Method: A three-stage experiment using economic questionnaires, a new evaluation framework for PT in LLMs, and incorporation of epistemic markers with empirical probability values.
Result: PT is not consistently reliable for modeling LLM decision-making, especially under diverse linguistic uncertainty.
Conclusion: The study highlights limitations of PT for LLMs and the impact of epistemic markers, providing a framework for future research.
Abstract: Prospect Theory (PT) models human decision-making under uncertainty, while epistemic markers (e.g., maybe) serve to express uncertainty in language. However, it remains largely unexplored whether Prospect Theory applies to contemporary Large Language Models and whether epistemic markers, which express human uncertainty, affect their decision-making behaviour. To address these research gaps, we design a three-stage experiment based on economic questionnaires. We propose a more general and precise evaluation framework to model LLMs’ decision-making behaviour under PT, introducing uncertainty through the empirical probability values associated with commonly used epistemic markers in comparable contexts. We then incorporate epistemic markers into the evaluation framework based on their corresponding probability values to examine their influence on LLM decision-making behaviours. Our findings suggest that modelling LLMs’ decision-making with PT is not consistently reliable, particularly when uncertainty is expressed in diverse linguistic forms. Our code is released in https://github.com/HKUST-KnowComp/MarPT.
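For readers unfamiliar with PT, the standard Tversky-Kahneman (1992) value and probability-weighting functions that such an evaluation builds on look like this; the paper's own framework generalizes how these are fit to LLMs, so treat this only as background.

```python
# Standard Prospect Theory value and probability-weighting functions
# (Tversky & Kahneman 1992 parameterization), shown as background; the
# paper's evaluation framework generalizes how these are fit to LLMs.

def pt_value(x, alpha=0.88, beta=0.88, lam=2.25):
    """Concave for gains, convex and steeper (loss-averse) for losses."""
    return x ** alpha if x >= 0 else -lam * ((-x) ** beta)

def pt_weight(p, gamma=0.61):
    """Inverse-S weighting: small probabilities are overweighted."""
    return p ** gamma / (p ** gamma + (1 - p) ** gamma) ** (1 / gamma)

# A fair coin flip over +100 / -100 has negative PT utility, so a
# loss-averse agent declines it.
u = pt_weight(0.5) * pt_value(100) + pt_weight(0.5) * pt_value(-100)
print(round(u, 2))  # about -30
```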
[260] Intrinsic Memory Agents: Heterogeneous Multi-Agent LLM Systems through Structured Contextual Memory
Sizhe Yuen, Francisco Gomez Medina, Ting Su, Yali Du, Adam J. Sobey
Main category: cs.AI
TL;DR: The paper introduces Intrinsic Memory Agents, a framework to improve memory consistency and task performance in multi-agent LLM systems by using structured, evolving agent-specific memories.
Details
Motivation: Addressing memory limitations in multi-agent LLM systems that hinder collaborative problem-solving due to context window constraints.Method: Uses role-aligned memory templates to maintain specialized perspectives and task relevance, tested on PDDL dataset and a data pipeline design task.
Result: Shows a 38.6% performance improvement over existing methods and better results in scalability, reliability, usability, cost-effectiveness, and documentation.
Conclusion: Structured, intrinsic memory approaches enhance multi-agent LLM systems’ capabilities for structured planning tasks.
Abstract: Multi-agent systems built on Large Language Models (LLMs) show exceptional promise for complex collaborative problem-solving, yet they face fundamental challenges stemming from context window limitations that impair memory consistency, role adherence, and procedural integrity. This paper introduces Intrinsic Memory Agents, a novel framework that addresses these limitations through structured agent-specific memories that evolve intrinsically with agent outputs. Specifically, our method maintains role-aligned memory templates that preserve specialized perspectives while focusing on task-relevant information. We benchmark our approach on the PDDL dataset, comparing its performance to existing state-of-the-art multi-agentic memory approaches and showing an improvement of 38.6% with the highest token efficiency. In an additional evaluation on a complex data pipeline design task, we demonstrate that our approach produces higher-quality designs when compared on 5 metrics: scalability, reliability, usability, cost-effectiveness and documentation, with additional qualitative evidence of the improvements. Our findings suggest that addressing memory limitations through structured, intrinsic approaches can improve the capabilities of multi-agent LLM systems on structured planning tasks.
[261] Activation Steering for Bias Mitigation: An Interpretable Approach to Safer LLMs
Shivam Dubey
Main category: cs.AI
TL;DR: The paper introduces an end-to-end system using mechanistic interpretability to identify and mitigate biases in LLMs by training probes and using steering vectors.
Details
Motivation: Addressing the risk of LLMs perpetuating harmful biases, the paper critiques traditional opaque methods and proposes a direct, interpretable approach.Method: The method involves training probes to detect bias in model activations and computing steering vectors to adjust outputs in real-time.
Result: Probes identified bias with near-perfect accuracy, and steering vectors successfully reduced biased outputs.
Conclusion: The system offers a robust, interpretable solution for safer and more accountable LLMs.
Abstract: As large language models (LLMs) become more integrated into societal systems, the risk of them perpetuating and amplifying harmful biases becomes a critical safety concern. Traditional methods for mitigating bias often rely on data filtering or post-hoc output moderation, which treat the model as an opaque black box. In this work, we introduce a complete, end-to-end system that uses techniques from mechanistic interpretability to both identify and actively mitigate bias directly within a model’s internal workings. Our method involves two primary stages. First, we train linear “probes” on the internal activations of a model to detect the latent representations of various biases (e.g., gender, race, age). Our experiments on gpt2-large demonstrate that these probes can identify biased content with near-perfect accuracy, revealing that bias representations become most salient in the model’s later layers. Second, we leverage these findings to compute “steering vectors” by contrasting the model’s activation patterns for biased and neutral statements. By adding these vectors during inference, we can actively steer the model’s generative process away from producing harmful, stereotypical, or biased content in real-time. We demonstrate the efficacy of this activation steering technique, showing that it successfully alters biased completions toward more neutral alternatives. We present our work as a robust and reproducible system that offers a more direct and interpretable approach to building safer and more accountable LLMs.
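A minimal sketch of the steering half of this pipeline, contrastive activation steering on a small HuggingFace GPT-2 model: the steering vector is the difference of mean activations between neutral and biased prompts, added back at one layer via a forward hook. The prompt sets, layer choice, and scaling factor are illustrative, not the paper's.

```python
# Minimal sketch of contrastive activation steering on GPT-2 (small) via
# HuggingFace transformers. Prompts, layer, and scale are illustrative.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()
LAYER = 8  # a later layer, where the paper reports bias is most salient

@torch.no_grad()
def mean_activation(texts):
    acts = []
    for t in texts:
        ids = tok(t, return_tensors="pt").input_ids
        hs = model(ids, output_hidden_states=True).hidden_states[LAYER]
        acts.append(hs[0, -1])            # last-token activation
    return torch.stack(acts).mean(0)

biased = ["Women are bad at math.", "Old people can't learn new things."]
neutral = ["People vary in math ability.", "Anyone can learn new things."]
steer = mean_activation(neutral) - mean_activation(biased)

def hook(module, inputs, output):
    # Push the residual stream toward the neutral direction; 4.0 is ad hoc.
    return (output[0] + 4.0 * steer,) + output[1:]

handle = model.transformer.h[LAYER - 1].register_forward_hook(hook)
ids = tok("Women are", return_tensors="pt").input_ids
out = model.generate(ids, max_new_tokens=12, do_sample=False)
handle.remove()
print(tok.decode(out[0]))
```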
[262] A First Look at Predictability and Explainability of Pre-request Passenger Waiting Time in Ridesharing Systems
Jie Wang, Guang Wang
Main category: cs.AI
TL;DR: The paper introduces FiXGBoost, a feature interaction-based XGBoost model, to predict pre-request passenger waiting time in ridesharing systems, addressing a gap in existing research.
Details
Motivation: Pre-request waiting time prediction is crucial for trip planning and improving user experience, yet understudied compared to post-request prediction.Method: The study uses a data-driven approach to analyze demand-supply dynamics, followed by feature engineering and the development of FiXGBoost.
Result: FiXGBoost shows good performance and high explainability on a large-scale ridesharing dataset with 30+ million trip records.
Conclusion: The work advances pre-request waiting time prediction, offering practical benefits for ridesharing platforms and users.
Abstract: Passenger waiting time prediction plays a critical role in enhancing both ridesharing user experience and platform efficiency. While most existing research focuses on post-request waiting time prediction, where the matched driver information is known, pre-request waiting time prediction (i.e., before submitting a ride request and without matching a driver) is also important, as it enables passengers to plan their trips more effectively and enhances the experience of both passengers and drivers. However, it has not been fully studied by existing works. In this paper, we take the first step toward understanding the predictability and explainability of pre-request passenger waiting time in ridesharing systems. Particularly, we conduct an in-depth data-driven study to investigate the impact of demand and supply dynamics on passenger waiting time. Based on this analysis and feature engineering, we propose FiXGBoost, a novel feature interaction-based XGBoost model designed to predict waiting time without knowing the assigned driver information. We further perform an importance analysis to quantify the contribution of each factor. Experiments on a large-scale real-world ridesharing dataset including over 30 million trip records show that our FiXGBoost can achieve a good performance for pre-request passenger waiting time prediction with high explainability.
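The feature-interaction idea can be sketched by explicitly crossing demand and supply signals before fitting a stock XGBoost regressor; the columns, crosses, and synthetic target below are assumptions standing in for the paper's engineered features.

```python
# Sketch of the feature-interaction idea: explicitly cross demand and
# supply signals before fitting a stock XGBoost regressor. Columns,
# crosses, and the synthetic target are assumptions, not the paper's set.
import numpy as np
import pandas as pd
import xgboost as xgb

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "open_requests": rng.integers(0, 200, 1000),   # demand proxy
    "idle_drivers": rng.integers(1, 100, 1000),    # supply proxy
    "hour_of_day": rng.integers(0, 24, 1000),
})
# Interaction features: demand/supply pressure, demand conditioned on hour.
df["pressure"] = df.open_requests / df.idle_drivers
df["demand_x_hour"] = df.open_requests * df.hour_of_day
y = 2.0 * df.pressure + 0.1 * df.hour_of_day + rng.normal(0, 1, 1000)

model = xgb.XGBRegressor(n_estimators=200, max_depth=4, learning_rate=0.1)
model.fit(df, y)
print(dict(zip(df.columns, model.feature_importances_.round(3))))
```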
[263] CVCM Track Circuits Pre-emptive Failure Diagnostics for Predictive Maintenance Using Deep Neural Networks
Debdeep Mukherjee, Eduardo Di Santi, Clément Lefebvre, Nenad Mijatovic, Victor Martin, Thierry Josse, Jonathan Brown, Kenza Saiah
Main category: cs.AI
TL;DR: A deep learning-based predictive maintenance framework for CVCM track circuits detects subtle anomalies early, outperforming conventional methods with 99.31% accuracy and ISO-17359 compliance.
Details
Motivation: Early detection of subtle anomalies in CVCM track circuits is crucial to prevent cascading failures, reduce downtime, and minimize revenue loss.Method: The proposed framework uses deep neural networks to classify anomalies before they escalate, validated on 10 CVCM failure cases. It includes conformal prediction for uncertainty estimates.
Result: Achieves 99.31% accuracy, detects anomalies within 1% of onset, and provides 99% confidence with consistent class coverage.
Conclusion: The scalable and adaptable framework enhances railway operational reliability and can be extended to other track circuits and systems.
Abstract: Track circuits are critical for railway operations, acting as the main signalling sub-system to locate trains. Continuous Variable Current Modulation (CVCM) is one such technology. Like any field-deployed, safety-critical asset, it can fail, triggering cascading disruptions. Many failures originate as subtle anomalies that evolve over time, often not visually apparent in monitored signals. Conventional approaches, which rely on clear signal changes, struggle to detect them early. Early identification of failure types is essential to improve maintenance planning, minimising downtime and revenue loss. Leveraging deep neural networks, we propose a predictive maintenance framework that classifies anomalies well before they escalate into failures. Validated on 10 CVCM failure cases across different installations, the method is ISO-17359 compliant and outperforms conventional techniques, achieving 99.31% overall accuracy with detection within 1% of anomaly onset. Through conformal prediction, we provide uncertainty estimates, reaching 99% confidence with consistent coverage across classes. Given CVCM's global deployment, the approach is scalable and adaptable to other track circuits and railway systems, enhancing operational reliability.
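The conformal-prediction layer that supplies the uncertainty estimates can be sketched with standard split conformal classification; the softmax outputs below are synthetic stand-ins for the network's.

```python
# Sketch of split conformal prediction for a k-class classifier, the kind
# of wrapper that yields the ~99%-confidence sets reported; the softmax
# outputs below are synthetic stand-ins for the network's.
import numpy as np

def conformal_threshold(cal_probs, cal_labels, alpha=0.01):
    """Finite-sample quantile of nonconformity = 1 - prob(true class)."""
    n = len(cal_labels)
    scores = 1.0 - cal_probs[np.arange(n), cal_labels]
    level = np.ceil((n + 1) * (1 - alpha)) / n
    return np.quantile(scores, level)

def prediction_set(probs, q):
    return np.where(1.0 - probs <= q)[0]  # classes with low nonconformity

rng = np.random.default_rng(1)
cal_probs = rng.dirichlet(np.ones(4) * 2, size=500)
cal_labels = cal_probs.argmax(axis=1)     # toy calibration labels
q = conformal_threshold(cal_probs, cal_labels)
print("threshold:", round(float(q), 3))
print("prediction set:", prediction_set(cal_probs[0], q))
```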
[264] SMA: Who Said That? Auditing Membership Leakage in Semi-Black-box RAG Controlling
Shixuan Sun, Siyuan Liang, Ruoyu Chen, Jianjie Huang, Jingzhi Li, Xiaochun Cao
Main category: cs.AI
TL;DR: The paper introduces Source-aware Membership Audit (SMA) to attribute generated content to its sources in retrieval-augmented systems, addressing privacy leakage concerns.
Details
Motivation: Existing methods fail to reliably attribute outputs in retrieval-augmented systems, undermining accountability for privacy leaks.Method: SMA uses zero-order optimization for attribution estimation and cross-modal techniques for image-to-text attribution.
Result: SMA enables fine-grained source attribution and membership inference in multimodal systems.
Conclusion: SMA shifts focus to content sourcing, offering a new perspective for auditing data provenance in generative systems.
Abstract: Retrieval-Augmented Generation (RAG) and its Multimodal Retrieval-Augmented Generation (MRAG) significantly improve the knowledge coverage and contextual understanding of Large Language Models (LLMs) by introducing external knowledge sources. However, retrieval and multimodal fusion obscure content provenance, rendering existing membership inference methods unable to reliably attribute generated outputs to pre-training, external retrieval, or user input, thus undermining accountability for privacy leakage. To address these challenges, we propose the first Source-aware Membership Audit (SMA) that enables fine-grained source attribution of generated content in a semi-black-box setting with retrieval control capabilities. To address the environmental constraints of semi-black-box auditing, we further design an attribution estimation mechanism based on zero-order optimization, which robustly approximates the true influence of input tokens on the output through large-scale perturbation sampling and ridge regression modeling. In addition, SMA introduces a cross-modal attribution technique that projects image inputs into textual descriptions via MLLMs, enabling token-level attribution in the text modality, which for the first time facilitates membership inference on image retrieval traces in MRAG systems. This work shifts the focus of membership inference from ‘whether the data has been memorized’ to ‘where the content is sourced from’, offering a novel perspective for auditing data provenance in complex generative systems.
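The zero-order attribution mechanism, as described, can be approximated by masking input tokens at random, scoring the output under each mask, and regressing scores on masks with ridge regression; `score_fn` (the semi-black-box model query) and the toy oracle are assumptions.

```python
# Sketch of the described zero-order attribution: randomly mask input
# tokens, score the generated output under each mask, and fit a ridge
# regression from masks to scores so coefficients approximate per-token
# influence. `score_fn` (the semi-black-box model query) is an assumption.
import numpy as np
from sklearn.linear_model import Ridge

def attribute_tokens(tokens, score_fn, n_samples=500, keep_prob=0.7, seed=0):
    rng = np.random.default_rng(seed)
    masks = rng.random((n_samples, len(tokens))) < keep_prob
    scores = np.array([
        score_fn([t for t, keep in zip(tokens, m) if keep]) for m in masks
    ])
    reg = Ridge(alpha=1.0).fit(masks.astype(float), scores)
    return dict(zip(tokens, reg.coef_.round(3)))

# Toy oracle: the output "leaks" only when 'password' survives the mask.
leaky = lambda toks: 1.0 if "password" in toks else 0.0
print(attribute_tokens(["the", "admin", "password", "is"], leaky))
```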
[265] OpenCUA: Open Foundations for Computer-Use Agents
Xinyuan Wang, Bowen Wang, Dunjie Lu, Junlin Yang, Tianbao Xie, Junli Wang, Jiaqi Deng, Xiaole Guo, Yiheng Xu, Chen Henry Wu, Zhennan Shen, Zhuokai Li, Ryan Li, Xiaochuan Li, Junda Chen, Boyuan Zheng, Peihang Li, Fangyu Lei, Ruisheng Cao, Yeqiao Fu, Dongchan Shin, Martin Shin, Jiarui Hu, Yuyan Wang, Jixuan Chen, Yuxiao Ye, Danyang Zhang, Dikang Du, Hao Hu, Huarong Chen, Zaida Zhou, Yipu Wang, Heng Wang, Diyi Yang, Victor Zhong, Flood Sung, Y. Charles, Zhilin Yang, Tao Yu
Main category: cs.AI
TL;DR: OpenCUA is an open-source framework for vision-language models (CUAs) that includes annotation tools, a large-scale dataset (AgentNet), and a scalable pipeline, achieving SOTA performance.
Details
Motivation: To address the lack of open frameworks for studying CUAs' capabilities, limitations, and risks as they mediate digital interactions.Method: Proposes OpenCUA with (1) annotation infrastructure, (2) AgentNet dataset, and (3) a scalable pipeline for state-action pairs with reflective reasoning.
Result: OpenCUA-32B achieves 34.8% success rate on OSWorld-Verified, surpassing GPT-4o.
Conclusion: OpenCUA provides open foundations for CUA research, demonstrating strong generalization and scalability.
Abstract: Vision-language models have demonstrated impressive capabilities as computer-use agents (CUAs) capable of automating diverse computer tasks. As their commercial potential grows, critical details of the most capable CUA systems remain closed. As these agents will increasingly mediate digital interactions and execute consequential decisions on our behalf, the research community needs access to open CUA frameworks to study their capabilities, limitations, and risks. To bridge this gap, we propose OpenCUA, a comprehensive open-source framework for scaling CUA data and foundation models. Our framework consists of: (1) an annotation infrastructure that seamlessly captures human computer-use demonstrations; (2) AgentNet, the first large-scale computer-use task dataset spanning 3 operating systems and 200+ applications and websites; (3) a scalable pipeline that transforms demonstrations into state-action pairs with reflective long Chain-of-Thought reasoning that sustain robust performance gains as data scales. Our end-to-end agent models demonstrate strong performance across CUA benchmarks. In particular, OpenCUA-32B achieves an average success rate of 34.8% on OSWorld-Verified, establishing a new state-of-the-art (SOTA) among open-source models and surpassing OpenAI CUA (GPT-4o). Further analysis confirms that our approach generalizes well across domains and benefits significantly from increased test-time computation. We release our annotation tool, datasets, code, and models to build open foundations for further CUA research.
[266] BrowseMaster: Towards Scalable Web Browsing via Tool-Augmented Programmatic Agent Pair
Xianghe Pang, Shuo Tang, Rui Ye, Yuwen Du, Yaxin Du, Siheng Chen
Main category: cs.AI
TL;DR: BrowseMaster is a scalable framework using a planner-executor agent pair to balance search breadth and reasoning depth, outperforming existing methods in complex information-seeking tasks.
Details
Motivation: Current LLM-based agents struggle with balancing search breadth and reasoning depth, limiting their effectiveness in information-seeking tasks.Method: BrowseMaster employs a programmatically augmented planner-executor pair: the planner adapts search strategies, while the executor retrieves concise evidence.
Result: BrowseMaster achieves scores of 30.0 on BrowseComp-en and 46.5 on BrowseComp-zh, outperforming baselines.
Conclusion: BrowseMaster effectively addresses the trade-off between search breadth and reasoning depth, excelling in complex information-seeking tasks.
Abstract: Effective information seeking in the vast and ever-growing digital landscape requires balancing expansive search with strategic reasoning. Current large language model (LLM)-based agents struggle to achieve this balance due to limitations in search breadth and reasoning depth, where slow, serial querying restricts coverage of relevant sources and noisy raw inputs disrupt the continuity of multi-step reasoning. To address these challenges, we propose BrowseMaster, a scalable framework built around a programmatically augmented planner-executor agent pair. The planner formulates and adapts search strategies based on task constraints, while the executor conducts efficient, targeted retrieval to supply the planner with concise, relevant evidence. This division of labor preserves coherent, long-horizon reasoning while sustaining broad and systematic exploration, overcoming the trade-off that limits existing agents. Extensive experiments on challenging English and Chinese benchmarks show that BrowseMaster consistently outperforms open-source and proprietary baselines, achieving scores of 30.0 on BrowseComp-en and 46.5 on BrowseComp-zh, which demonstrates its strong capability in complex, reasoning-heavy information-seeking tasks at scale.
[267] EndoAgent: A Memory-Guided Reflective Agent for Intelligent Endoscopic Vision-to-Decision Reasoning
Yi Tang, Kaini Wang, Yang Chen, Guangquan Zhou
Main category: cs.AI
TL;DR: EndoAgent is a memory-guided AI agent for endoscopic analysis, integrating iterative reasoning and adaptive tool selection, outperforming existing models.
Details
Motivation: Existing AI methods lack unified coordination for complex clinical workflows in endoscopy, and AI agents' potential in this domain is underexplored.Method: EndoAgent uses a dual-memory design for logical coherence (short-term action tracking) and enhanced reasoning (long-term experiential learning), integrating expert-designed tools.
Result: EndoAgent outperforms general and medical multimodal models, demonstrating strong flexibility and reasoning capabilities.
Conclusion: EndoAgent advances endoscopic AI by enabling sophisticated decision-making and adaptability in clinical workflows.
Abstract: Developing general artificial intelligence (AI) systems to support endoscopic image diagnosis is an emerging research priority. Existing methods based on large-scale pretraining often lack unified coordination across tasks and struggle to handle the multi-step processes required in complex clinical workflows. While AI agents have shown promise in flexible instruction parsing and tool integration across domains, their potential in endoscopy remains underexplored. To address this gap, we propose EndoAgent, the first memory-guided agent for vision-to-decision endoscopic analysis that integrates iterative reasoning with adaptive tool selection and collaboration. Built on a dual-memory design, it enables sophisticated decision-making by ensuring logical coherence through short-term action tracking and progressively enhancing reasoning acuity through long-term experiential learning. To support diverse clinical tasks, EndoAgent integrates a suite of expert-designed tools within a unified reasoning loop. We further introduce EndoAgentBench, a benchmark of 5,709 visual question-answer pairs that assess visual understanding and language generation capabilities in realistic scenarios. Extensive experiments show that EndoAgent consistently outperforms both general and medical multimodal models, exhibiting its strong flexibility and reasoning capabilities.
[268] System 2 Reasoning for Human–AI Alignment: Generality and Adaptivity via ARC-AGI
Sejin Kim, Sundong Kim
Main category: cs.AI
TL;DR: The paper identifies gaps in transformer-based models for System 2 reasoning, proposing three research axes to improve compositional generalization and adaptivity, and suggests adapting ARC-AGI’s evaluation suite for progress tracking.
Details
Motivation: Transformer-based models lack generality and adaptivity in System 2 reasoning, hindering human-AI alignment, as evidenced by weaknesses in ARC-AGI tasks.Method: Proposes three research axes: symbolic representation for generality, interactive feedback for adaptivity, and test-time task augmentation for balance.
Result: Demonstrates how ARC-AGI’s evaluation suite can be adapted to track progress in symbolic generality, feedback-driven adaptivity, and task-level robustness.
Conclusion: Closing gaps in transformer models requires overhauling reasoning pipelines and evaluations, with ARC-AGI serving as a guide for future robust human-AI alignment.
Abstract: Despite their broad applicability, transformer-based models still fall short in System 2 reasoning, lacking the generality and adaptivity needed for human–AI alignment. We examine weaknesses on ARC-AGI tasks, revealing gaps in compositional generalization and novel-rule adaptation, and argue that closing these gaps requires overhauling the reasoning pipeline and its evaluation. We propose three research axes: (1) Symbolic representation pipeline for compositional generality, (2) Interactive feedback-driven reasoning loop for adaptivity, and (3) Test-time task augmentation balancing both qualities. Finally, we demonstrate how ARC-AGI’s evaluation suite can be adapted to track progress in symbolic generality, feedback-driven adaptivity, and task-level robustness, thereby guiding future work on robust human–AI alignment.
[269] UnrealZoo: Enriching Photo-realistic Virtual Worlds for Embodied AI
Fangwei Zhong, Kui Wu, Churan Wang, Hao Chen, Hai Ci, Zhoujun Li, Yizhou Wang
Main category: cs.AI
TL;DR: UnrealZoo is a collection of 100+ photo-realistic 3D virtual worlds for embodied AI research, featuring diverse entities and optimized tools, demonstrating benefits for RL agents but highlighting challenges in open-world scenarios.
Details
Motivation: To provide a realistic and diverse virtual environment for advancing embodied AI research, addressing the need for scalable and efficient tools for training and benchmarking.Method: Developed UnrealZoo with Unreal Engine, extended UnrealCV with optimized APIs and tools for data collection, augmentation, distributed training, and benchmarking.
Result: Environmental diversity improves RL agent generalization, but challenges remain in navigation, adaptation, and latency management in open-world scenarios.
Conclusion: UnrealZoo is a valuable resource for testing and advancing embodied AI systems, though further work is needed to tackle open-world challenges.
Abstract: We introduce UnrealZoo, a collection of over 100 photo-realistic 3D virtual worlds built on Unreal Engine, designed to reflect the complexity and variability of open-world environments. We also provide a rich variety of playable entities, including humans, animals, robots, and vehicles for embodied AI research. We extend UnrealCV with optimized APIs and tools for data collection, environment augmentation, distributed training, and benchmarking. These improvements significantly increase the efficiency of rendering and communication, enabling advanced applications such as multi-agent interactions. Our experimental evaluation across visual navigation and tracking tasks reveals two key insights: 1) environmental diversity provides substantial benefits for developing generalizable reinforcement learning (RL) agents, and 2) current embodied agents face persistent challenges in open-world scenarios, including navigation in unstructured terrain, adaptation to unseen morphologies, and managing latency in closed-loop control systems when interacting with highly dynamic objects. UnrealZoo thus serves as both a comprehensive testing ground and a pathway toward developing more capable embodied AI systems for real-world deployment.
[270] Effort-aware Fairness: Incorporating a Philosophy-informed, Human-centered Notion of Effort into Algorithmic Fairness Metrics
Tin Nguyen, Jiannan Xu, Zora Che, Phuong-Anh Nguyen-Le, Rushil Dandamudi, Donald Braman, Furong Huang, Hal Daumé III, Zubin Jelveh
Main category: cs.AI
TL;DR: The paper introduces Effort-aware Fairness (EaF), a philosophy-informed approach to fairness in AI, emphasizing the importance of effort in fairness evaluations. It includes theoretical formulation, human experiments, and practical pipelines for computing fairness in real-world contexts.
Details
Motivation: Existing AI fairness metrics overlook the role of effort in fairness, despite its philosophical and human importance. The paper aims to address this gap by proposing a new approach.Method: The paper combines theoretical formulation of Effort-aware Fairness (EaF) with empirical work, including a human subjects experiment and pipelines for computing fairness in criminal justice and personal finance contexts.
Result: Human experiments show people prioritize temporal trajectories of predictive features over aggregate values. Practical pipelines demonstrate the feasibility of EaF in real-world applications.
Conclusion: The proposed EaF framework can help AI auditors identify and correct unfair decisions, particularly for individuals who have made significant efforts but face systemic disadvantages.
Abstract: Although popular AI fairness metrics, e.g., demographic parity, have uncovered bias in AI-assisted decision-making outcomes, they do not consider how much effort one has spent to get to where one is today in the input feature space. However, the notion of effort is important in how Philosophy and humans understand fairness. We propose a philosophy-informed approach to conceptualize and evaluate Effort-aware Fairness (EaF), grounded in the concept of Force, which represents the temporal trajectory of predictive features coupled with inertia. Besides theoretical formulation, our empirical contributions include: (1) a pre-registered human subjects experiment, which shows that for both stages of the (individual) fairness evaluation process, people consider the temporal trajectory of a predictive feature more than its aggregate value; (2) pipelines to compute Effort-aware Individual/Group Fairness in the criminal justice and personal finance contexts. Our work may enable AI model auditors to uncover and potentially correct unfair decisions against individuals who have spent significant efforts to improve but are still stuck with systemic disadvantages outside their control.
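One way to see the trajectory-over-aggregate intuition is to score a feature by its slope over time rather than its latest value; this toy sketch is far simpler than the paper's Force-with-inertia formulation.

```python
# Toy version of the trajectory-over-aggregate intuition: score a feature
# by its slope over time rather than its latest value. The paper's
# Force-with-inertia formulation is richer than this.
import numpy as np

def trajectory_effort(values):
    """Least-squares slope over equally spaced time steps."""
    t = np.arange(len(values))
    slope, _intercept = np.polyfit(t, values, deg=1)
    return slope

steady = [700, 700, 700, 700]      # high credit score, no recent effort
climbing = [560, 600, 640, 680]    # lower score, strong upward effort
print(trajectory_effort(steady), trajectory_effort(climbing))  # 0.0 40.0
```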
[271] When Imitation Learning Outperforms Reinforcement Learning in Surgical Action Planning
Maxence Boels, Harry Robertshaw, Thomas C Booth, Prokar Dasgupta, Alejandro Granados, Sebastien Ourselin
Main category: cs.AI
TL;DR: The paper compares imitation learning (IL) and reinforcement learning (RL) for surgical action planning, finding IL outperforms RL despite assumptions of RL’s superiority.
Details
Motivation: To evaluate whether RL can outperform IL in surgical action planning, given RL's potential for discovering superior strategies.Method: The study introduces a Dual-task Autoregressive Imitation Learning (DARIL) baseline and tests three RL variants: world model-based RL, direct video RL, and inverse RL enhancement.
Result: DARIL achieved higher performance (34.6% mAP) than all RL variants, with RL methods underperforming (e.g., world model RL dropped to 3.1% mAP).
Conclusion: IL is systematically favored over RL in surgical action planning, challenging assumptions about RL’s superiority in sequential decision-making.
Abstract: Surgical action planning requires predicting future instrument-verb-target triplets for real-time assistance. While teleoperated robotic surgery provides natural expert demonstrations for imitation learning (IL), reinforcement learning (RL) could potentially discover superior strategies through exploration. We present the first comprehensive comparison of IL versus RL for surgical action planning on CholecT50. Our Dual-task Autoregressive Imitation Learning (DARIL) baseline achieves 34.6% action triplet recognition mAP and 33.6% next frame prediction mAP, with smooth planning degradation to 29.2% at 10-second horizons. We evaluated three RL variants: world model-based RL, direct video RL, and inverse RL enhancement. Surprisingly, all RL approaches underperformed DARIL: world model RL dropped to 3.1% mAP at 10 s, while direct video RL achieved only 15.9%. Our analysis reveals that distribution matching on expert-annotated test sets systematically favors IL over potentially valid RL policies that differ from training demonstrations. This challenges assumptions about RL superiority in sequential decision making and provides crucial insights for surgical AI development.
[272] Probabilistic Active Goal Recognition
Chenyuan Zhang, Cristian Rojas Cardenas, Hamid Rezatofighi, Mor Vered, Buser Say
Main category: cs.AI
TL;DR: The paper introduces Active Goal Recognition (AGR) for multi-agent systems, combining probabilistic belief updates with MCTS for efficient goal inference without domain-specific knowledge. It outperforms passive methods and matches domain-specific baselines.
Details
Motivation: To improve interaction in multi-agent environments by actively reducing uncertainty about other agents' goals, moving beyond passive reasoning.Method: Uses a probabilistic framework with joint belief updates and Monte Carlo Tree Search (MCTS) for domain-independent goal inference.
Result: The joint belief update outperforms passive goal recognition, and MCTS matches domain-specific greedy baselines.
Conclusion: The proposed framework is practical and robust, advancing interactive and adaptive multi-agent systems.
Abstract: In multi-agent environments, effective interaction hinges on understanding the beliefs and intentions of other agents. While prior work on goal recognition has largely treated the observer as a passive reasoner, Active Goal Recognition (AGR) focuses on strategically gathering information to reduce uncertainty. We adopt a probabilistic framework for Active Goal Recognition and propose an integrated solution that combines a joint belief update mechanism with a Monte Carlo Tree Search (MCTS) algorithm, allowing the observer to plan efficiently and infer the actor’s hidden goal without requiring domain-specific knowledge. Through comprehensive empirical evaluation in a grid-based domain, we show that our joint belief update significantly outperforms passive goal recognition, and that our domain-independent MCTS performs comparably to our strong domain-specific greedy baseline. These results establish our solution as a practical and robust framework for goal inference, advancing the field toward more interactive and adaptive multi-agent systems.
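To make the joint belief update concrete, below is a minimal sketch of the Bayesian posterior over candidate goals that such an observer maintains; the goal set, prior, and likelihood values are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def update_belief(belief: np.ndarray, likelihoods: np.ndarray) -> np.ndarray:
    """Posterior over goals given P(observation | goal) for each candidate goal."""
    posterior = belief * likelihoods
    total = posterior.sum()
    if total == 0.0:
        return belief  # observation uninformative under every goal; keep prior
    return posterior / total

belief = np.ones(3) / 3              # uniform prior over 3 candidate goals
obs_lik = np.array([0.7, 0.2, 0.1])  # P(observed actor move | goal g), assumed
belief = update_belief(belief, obs_lik)
print(belief)  # probability mass shifts toward the goal that best explains the move
```

In the full method, an MCTS planner would use beliefs like this to choose observation-gathering actions expected to reduce goal uncertainty.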
[273] Cognitive Kernel-Pro: A Framework for Deep Research Agents and Agent Foundation Models Training
Tianqing Fang, Zhisong Zhang, Xiaoyang Wang, Rui Wang, Can Qin, Yuxuan Wan, Jun-Yu Ma, Ce Zhang, Jiaqi Chen, Xiyun Li, Hongming Zhang, Haitao Mi, Dong Yu
Main category: cs.AI
TL;DR: Cognitive Kernel-Pro is an open-source, free AI agent framework designed to democratize advanced AI agent development, achieving state-of-the-art results.
Details
Motivation: Current AI agent systems are often closed-source or rely on paid APIs, limiting accessibility and reproducibility.Method: The framework includes high-quality training data curation, agent test-time reflection, and voting strategies across web, file, code, and reasoning domains.
Result: The 8B-parameter open-source model outperforms leading systems like WebDancer and WebSailor on GAIA.
Conclusion: Cognitive Kernel-Pro sets a new standard for accessible, high-capability AI agents.
Abstract: General AI Agents are increasingly recognized as foundational frameworks for the next generation of artificial intelligence, enabling complex reasoning, web interaction, coding, and autonomous research capabilities. However, current agent systems are either closed-source or heavily reliant on a variety of paid APIs and proprietary tools, limiting accessibility and reproducibility for the research community. In this work, we present Cognitive Kernel-Pro, a fully open-source and (to the maximum extent) free multi-module agent framework designed to democratize the development and evaluation of advanced AI agents. Within Cognitive Kernel-Pro, we systematically investigate the curation of high-quality training data for Agent Foundation Models, focusing on the construction of queries, trajectories, and verifiable answers across four key domains: web, file, code, and general reasoning. Furthermore, we explore novel strategies for agent test-time reflection and voting to enhance agent robustness and performance. We evaluate Cognitive Kernel-Pro on GAIA, achieving state-of-the-art results among open-source and free agents. Notably, our 8B-parameter open-source model surpasses previous leading systems such as WebDancer and WebSailor, establishing a new performance standard for accessible, high-capability AI agents. Code is available at https://github.com/Tencent/CognitiveKernel-Pro
[274] Edge-Based Multimodal Sensor Data Fusion with Vision Language Models (VLMs) for Real-time Autonomous Vehicle Accident Avoidance
Fengze Yang, Bo Yu, Yang Zhou, Xuewen Luo, Zhengzhong Tu, Chenxi Liu
Main category: cs.AI
TL;DR: REACT, a V2X-integrated trajectory planner for autonomous driving, uses a lightweight Vision-Language Model for real-time, safety-oriented trajectory optimization, reducing collisions by 77% and achieving low latency.
Details
Motivation: Current AD systems and V2X approaches fail in multimodal fusion, real-time performance, or hazard detection, necessitating a solution like REACT.Method: REACT combines infrastructure alerts and onboard data via a fine-tuned VLM, uses Residual Trajectory Fusion for efficiency, and adapts for edge deployment.
Result: REACT reduces collisions by 77%, achieves 48.2% VPQ, and 0.57s latency on Jetson AGX Orin, outperforming benchmarks.
Conclusion: Lightweight VLMs enable real-time cooperative planning, proving language-guided reasoning enhances traffic safety and responsiveness.
Abstract: Autonomous driving (AD) systems relying solely on onboard sensors may fail to detect distant or obstacle hazards, potentially causing preventable collisions; however, existing transformer-based Vehicle-to-Everything (V2X) approaches, which mitigate AD sensing limitations, either lack effective multimodal fusion and reasoning or struggle to meet real-time performance requirements under complex, high-dimensional traffic conditions. This paper proposes the Real-time Edge-based Autonomous Co-pilot Trajectory planner (REACT), a V2X-integrated trajectory optimization framework for AD based on a fine-tuned lightweight Vision-Language Model (VLM). REACT integrates infrastructure-provided hazard alerts with onboard sensor data, capturing intricate surrounding traffic dynamics and vehicle intents through visual embeddings, interpreting precise numerical data from symbolic inputs, and employing contextual reasoning to generate optimized, safety-oriented trajectories. To ensure robust real-time deployment on edge devices, REACT innovatively employs Residual Trajectory Fusion (RTF) design and specialized edge-adaptation strategies to reduce model complexity and improve inference efficiency. Evaluated on the DeepAccident benchmark, REACT achieves state-of-the-art performance, a 77% collision rate reduction, a 48.2% Video Panoptic Quality (VPQ), and a 0.57-second inference latency on the Jetson AGX Orin. Ablation studies validate the contribution of each input, module, and edge adaptation strategy. These results highlight the effectiveness of lightweight VLMs in enabling real-time cooperative planning on edge platforms and underscore the potential of language-guided contextual reasoning for improving traffic safety and responsiveness.
[275] Trainable Dynamic Mask Sparse Attention
Jingze Shi, Yifan Wu, Bingheng Wu, Yiran Peng, Liangdong Wang, Guang Liu, Yuyu Luo
Main category: cs.AI
TL;DR: Dynamic Mask Attention (DMA) introduces a trainable sparse attention mechanism to address the quadratic complexity of self-attention, balancing efficiency and long-context modeling.
Details
Motivation: The need for efficient long-context modeling in large language models, overcoming limitations of existing sparse attention methods like static patterns or information loss.Method: DMA uses content-aware and position-aware sparsity: dynamically generating masks from value representations and skipping unnecessary calculations.
Result: DMA outperforms other attention mechanisms in perplexity, associative recall tasks, and needle-in-a-haystack evaluations.
Conclusion: DMA effectively balances computational efficiency and long-context modeling, proving superior in performance and scalability.
Abstract: In large language models, the demand for modeling long contexts is constantly increasing, but the quadratic complexity of the standard self-attention mechanism often becomes a bottleneck. Although existing sparse attention mechanisms have improved efficiency, they may still encounter issues such as static patterns or information loss. We introduce a trainable dynamic mask sparse attention mechanism, Dynamic Mask Attention (DMA), which effectively utilizes content-aware and position-aware sparsity. DMA achieves this through two key innovations: First, it dynamically generates content-aware sparse masks from value representations, enabling the model to identify and focus on critical information adaptively. Second, it implements position-aware sparse attention computation that effectively skips unnecessary calculation regions. This dual-sparsity design allows the model to significantly reduce the computational complexity of important information while retaining complete information, achieving an excellent balance between information fidelity and computational efficiency. We have verified the performance of DMA through comprehensive experiments. Comparative studies show that DMA outperforms multi-head attention, sliding window attention, multi-head latent attention, and native sparse attention in terms of perplexity under Chinchilla Scaling Law settings. Moreover, in challenging multi-query associative recall tasks, DMA also demonstrates superior performance and efficiency compared to these methods. Crucially, in the evaluation of a 1.7B parameter model, DMA significantly outperforms multi-head attention in both standard benchmark performance and the challenging needle-in-a-haystack task. These experimental results highlight its capability to balance model efficiency and long-context modeling ability effectively.
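As a rough illustration of content-aware sparsity, the single-head sketch below derives a keep-set from a value-norm summary and restricts attention to those positions; the scoring rule and top-k selection are assumptions for illustration, not DMA's actual mask generator.

```python
import torch
import torch.nn.functional as F

def content_aware_sparse_attention(q, k, v, keep: int):
    """q, k, v: (seq, dim). Attend only to the `keep` positions whose value
    vectors look most informative (here: largest L2 norm, an assumed proxy)."""
    scores = q @ k.T / q.shape[-1] ** 0.5      # (seq, seq) dense scores
    importance = v.norm(dim=-1)                # content-aware summary per position
    kept = importance.topk(keep).indices       # positions judged worth attending to
    mask = torch.full_like(scores, float("-inf"))
    mask[:, kept] = 0.0                        # everything else is skipped
    attn = F.softmax(scores + mask, dim=-1)
    return attn @ v

q = k = v = torch.randn(8, 16)
out = content_aware_sparse_attention(q, k, v, keep=4)
print(out.shape)  # torch.Size([8, 16])
```

A real implementation would also exploit the position-aware structure to avoid computing the masked-out score regions at all, which is where the efficiency gain comes from.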
[276] InfiAlign: A Scalable and Sample-Efficient Framework for Aligning LLMs to Enhance Reasoning Capabilities
Shuo Cai, Su Lu, Qi Zhou, Kejing Yang, Zhijie Sang, Congkai Xie, Hongxia Yang
Main category: cs.AI
TL;DR: InfiAlign is a scalable, sample-efficient post-training framework combining SFT and DPO to enhance LLM reasoning, reducing data needs by 88% while maintaining performance.
Details
Motivation: Current methods for improving LLM reasoning are resource-intensive and lack scalability due to heuristic or task-specific data curation.Method: InfiAlign integrates SFT and DPO with a data selection pipeline using multidimensional quality metrics to curate high-quality alignment data.
Result: The SFT model matches DeepSeek-R1-Distill-Qwen-7B performance with 12% of the data; DPO further improves mathematical reasoning by 3.89% on AIME benchmarks.
Conclusion: Combining principled data selection with full-stage post-training offers a scalable, data-efficient solution for aligning large reasoning models.
Abstract: Large language models (LLMs) have exhibited impressive reasoning abilities on a wide range of complex tasks. However, enhancing these capabilities through post-training remains resource intensive, particularly in terms of data and computational cost. Although recent efforts have sought to improve sample efficiency through selective data curation, existing methods often rely on heuristic or task-specific strategies that hinder scalability. In this work, we introduce InfiAlign, a scalable and sample-efficient post-training framework that integrates supervised fine-tuning (SFT) with Direct Preference Optimization (DPO) to align LLMs for enhanced reasoning. At the core of InfiAlign is a robust data selection pipeline that automatically curates high-quality alignment data from open-source reasoning datasets using multidimensional quality metrics. This pipeline enables significant performance gains while drastically reducing data requirements and remains extensible to new data sources. When applied to the Qwen2.5-Math-7B-Base model, our SFT model achieves performance on par with DeepSeek-R1-Distill-Qwen-7B, while using only approximately 12% of the training data, and demonstrates strong generalization across diverse reasoning tasks. Additional improvements are obtained through the application of DPO, with particularly notable gains in mathematical reasoning tasks. The model achieves an average improvement of 3.89% on AIME 24/25 benchmarks. Our results highlight the effectiveness of combining principled data selection with full-stage post-training, offering a practical solution for aligning large reasoning models in a scalable and data-efficient manner. The model checkpoints are available at https://huggingface.co/InfiX-ai/InfiAlign-Qwen-7B-SFT.
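The data-selection idea can be pictured as scoring each candidate sample along several quality dimensions and keeping those above a threshold. A minimal sketch follows; the metric names, weights, and threshold are illustrative assumptions, not the paper's pipeline.

```python
# Hypothetical multidimensional quality filter for alignment data.
def select_alignment_data(samples, weights, threshold):
    """Keep samples whose weighted quality score clears the threshold."""
    selected = []
    for s in samples:
        score = sum(weights[m] * s["metrics"][m] for m in weights)
        if score >= threshold:
            selected.append(s)
    return selected

samples = [
    {"text": "proof sketch ...", "metrics": {"difficulty": 0.9, "diversity": 0.7, "correctness": 1.0}},
    {"text": "trivial item ...", "metrics": {"difficulty": 0.1, "diversity": 0.2, "correctness": 1.0}},
]
weights = {"difficulty": 0.4, "diversity": 0.3, "correctness": 0.3}
print(len(select_alignment_data(samples, weights, threshold=0.6)))  # 1
```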
[277] IRL-VLA: Training a Vision-Language-Action Policy via Reward World Model
Anqing Jiang, Yu Gao, Yiru Wang, Zhigang Sun, Shuo Wang, Yuwen Heng, Hao Sun, Shichen Tang, Lijuan Zhu, Jinhao Chai, Jijun Wang, Zichong Gu, Hao Jiang, Li Sun
Main category: cs.AI
TL;DR: The paper introduces IRL-VLA, a three-stage framework combining imitation learning, inverse reinforcement learning, and reinforcement learning to improve Vision-Language-Action models for autonomous driving.
Details
Motivation: Existing VLA models face challenges in open-loop imitation learning and close-loop training due to domain gaps and inefficiencies.Method: A three-stage approach: pretraining VLA via imitation learning, building a lightweight reward world model via inverse reinforcement learning, and refining with PPO-based reinforcement learning.
Result: Achieves state-of-the-art performance in NAVSIM v2 and 1st runner up in CVPR2025 Autonomous Grand Challenge.
Conclusion: The framework advances VLA research for close-loop autonomous driving.
Abstract: Vision-Language-Action (VLA) models have demonstrated potential in autonomous driving. However, two critical challenges hinder their development: (1) existing VLA architectures are typically based on imitation learning in an open-loop setup, which tends to capture the recorded behaviors in the dataset, leading to suboptimal and constrained performance; (2) closed-loop training relies heavily on high-fidelity sensor simulation, where domain gaps and computational inefficiencies pose significant barriers. In this paper, we introduce IRL-VLA, a novel closed-loop reinforcement learning framework built on an Inverse Reinforcement Learning (IRL) reward world model with a self-built VLA approach. Our framework proceeds in a three-stage paradigm: in the first stage, we propose a VLA architecture and pretrain the VLA policy via imitation learning. In the second stage, we construct a lightweight reward world model via inverse reinforcement learning to enable efficient closed-loop reward computation. Finally, to further enhance planning performance, we design reward-world-model-guided reinforcement learning via PPO (Proximal Policy Optimization) to effectively balance safety incidents, comfortable driving, and traffic efficiency. Our approach achieves state-of-the-art performance on the NAVSIM v2 end-to-end driving benchmark and placed 1st runner-up in the CVPR 2025 Autonomous Grand Challenge. We hope that our framework will accelerate VLA research in closed-loop autonomous driving.
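A toy sketch of the reward-world-model idea: a small learned network scores a candidate trajectory on safety, comfort, and efficiency, and the weighted sum serves as the scalar reward consumed by PPO. The architecture, features, and weights below are illustrative assumptions.

```python
import torch
import torch.nn as nn

class RewardWorldModel(nn.Module):
    """Hypothetical reward head: trajectory embedding -> scalar reward."""
    def __init__(self, traj_dim=32):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(traj_dim, 64), nn.ReLU(),
                                 nn.Linear(64, 3))       # safety, comfort, efficiency
    def forward(self, traj_embedding, w=(1.0, 0.3, 0.3)):
        scores = self.net(traj_embedding)                # per-aspect scores
        return (scores * torch.tensor(w)).sum(dim=-1)    # scalar reward per trajectory

rwm = RewardWorldModel()
rewards = rwm(torch.randn(4, 32))                        # rewards for 4 candidate plans
print(rewards.shape)  # torch.Size([4])
```

Because the reward comes from a learned model rather than a sensor simulator, each closed-loop reward query is a cheap forward pass, which is the efficiency argument the abstract makes.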
[278] Large Language Models Do Not Simulate Human Psychology
Sarah Schröder, Thekla Morgenroth, Ulrike Kuhl, Valerie Vaquet, Benjamin Paaßen
Main category: cs.AI
TL;DR: LLMs like ChatGPT are not reliable simulators of human psychology, as shown by conceptual arguments and empirical evidence of response discrepancies.
Details
Motivation: To challenge the idea that LLMs can replace human participants in psychological studies by demonstrating their unreliability.Method: Present conceptual arguments and empirical evidence, including tests with fine-tuned models like CENTAUR, to show discrepancies in responses.
Result: LLMs exhibit notable response differences from humans, especially with slight wording changes, and vary widely among themselves.
Conclusion: LLMs should not replace human participants in psychology; they are useful but require validation against human responses for each application.
Abstract: Large Language Models (LLMs), such as ChatGPT, are increasingly used in research, ranging from simple writing assistance to complex data annotation tasks. Recently, some research has suggested that LLMs may even be able to simulate human psychology and can, hence, replace human participants in psychological studies. We caution against this approach. We provide conceptual arguments against the hypothesis that LLMs simulate human psychology. We then present empirical evidence illustrating our arguments by demonstrating that slight changes to wording that correspond to large changes in meaning lead to notable discrepancies between LLMs’ and human responses, even for the recent CENTAUR model that was specifically fine-tuned on psychological responses. Additionally, different LLMs show very different responses to novel items, further illustrating their lack of reliability. We conclude that LLMs do not simulate human psychology and recommend that psychological researchers should treat LLMs as useful but fundamentally unreliable tools that need to be validated against human responses for every new application.
[279] Designing a Feedback-Driven Decision Support System for Dynamic Student Intervention
Timothy Oluwapelumi Adeyemi, Nadiah Fahad AlOtaibi
Main category: cs.AI
TL;DR: A Feedback-Driven Decision Support System (DSS) with adaptive learning is proposed to improve student performance prediction by continuously refining models using new data.
Details
Motivation: Static machine learning models in education lack adaptability to new data, limiting their effectiveness for timely interventions.Method: The system uses a LightGBM-based regressor with incremental retraining, a Flask web interface, and SHAP for interpretability.
Result: Experimental results show a 10.7% RMSE reduction and improved predictions for students post-intervention.
Conclusion: The framework enhances educational analytics by enabling self-improving, transparent, and deployable AI systems.
Abstract: Accurate prediction of student performance is essential for enabling timely academic interventions. However, most machine learning models used in educational settings are static and lack the ability to adapt when new data such as post-intervention outcomes become available. To address this limitation, we propose a Feedback-Driven Decision Support System (DSS) with a closed-loop architecture that enables continuous model refinement. The system employs a LightGBM-based regressor with incremental retraining, allowing educators to input updated student performance data, which automatically triggers model updates. This adaptive mechanism enhances prediction accuracy by learning from real-world academic progress over time. The platform features a Flask-based web interface to support real-time interaction and integrates SHAP (SHapley Additive exPlanations) for model interpretability, ensuring transparency and trustworthiness in predictions. Experimental results demonstrate a 10.7% reduction in RMSE after retraining, with consistent upward adjustments in predicted scores for students who received interventions. By transforming static predictive models into self-improving systems, our approach advances educational analytics toward human-centered, data-driven, and responsive artificial intelligence. The framework is designed for seamless integration into Learning Management Systems (LMS) and institutional dashboards, facilitating practical deployment in real educational environments.
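LightGBM supports this kind of incremental refinement out of the box: the `init_model` argument of `lgb.train` continues boosting from an existing model when new outcomes arrive. A minimal sketch with synthetic data (feature values and shapes are placeholders, not the paper's dataset):

```python
import lightgbm as lgb
import numpy as np

rng = np.random.default_rng(0)
X, y = rng.normal(size=(200, 5)), rng.normal(size=200)

# initial fit on historical student records
train = lgb.Dataset(X, label=y)
model = lgb.train({"objective": "regression", "verbose": -1}, train, num_boost_round=50)

# educator enters post-intervention outcomes -> incremental refinement
X_new, y_new = rng.normal(size=(20, 5)), rng.normal(size=20)
update = lgb.Dataset(X_new, label=y_new)
model = lgb.train({"objective": "regression", "verbose": -1}, update,
                  num_boost_round=10, init_model=model)
print(model.num_trees())  # 60: the original 50 trees plus 10 new ones
```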
[280] Interpreting Fedspeak with Confidence: A LLM-Based Uncertainty-Aware Framework Guided by Monetary Policy Transmission Paths
Rui Yao, Qi Chai, Jinhai Yao, Siyuan Li, Junhao Chen, Qi Zhang, Hao Wang
Main category: cs.AI
TL;DR: The paper introduces an LLM-based framework to decode Fedspeak, enhancing policy stance classification with uncertainty-aware methods and domain-specific reasoning.
Details
Motivation: Fedspeak's nuanced language impacts financial markets and policy analysis, necessitating automated interpretation for better forecasting and trading.Method: Proposes an LLM-based framework with domain-specific reasoning and a dynamic uncertainty decoding module for improved accuracy and reliability.
Result: Achieves state-of-the-art performance in policy stance analysis, with perceptual uncertainty validated as a diagnostic signal.
Conclusion: The framework effectively deciphers Fedspeak, offering reliable insights for financial and policy applications.
Abstract: “Fedspeak”, the stylized and often nuanced language used by the U.S. Federal Reserve, encodes implicit policy signals and strategic stances. The Federal Open Market Committee strategically employs Fedspeak as a communication tool to shape market expectations and influence both domestic and global economic conditions. As such, automatically parsing and interpreting Fedspeak presents a high-impact challenge, with significant implications for financial forecasting, algorithmic trading, and data-driven policy analysis. In this paper, we propose an LLM-based, uncertainty-aware framework for deciphering Fedspeak and classifying its underlying monetary policy stance. Technically, to enrich the semantic and contextual representation of Fedspeak texts, we incorporate domain-specific reasoning grounded in the monetary policy transmission mechanism. We further introduce a dynamic uncertainty decoding module to assess the confidence of model predictions, thereby enhancing both classification accuracy and model reliability. Experimental results demonstrate that our framework achieves state-of-the-art performance on the policy stance analysis task. Moreover, statistical analysis reveals a significant positive correlation between perceptual uncertainty and model error rates, validating the effectiveness of perceptual uncertainty as a diagnostic signal.
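One simple way to obtain an uncertainty signal of this kind is to sample the classifier several times and use the entropy of the empirical label distribution; the sketch below illustrates that idea, which is only a stand-in for the paper's dynamic uncertainty decoding module.

```python
from collections import Counter
import math

def predictive_entropy(votes):
    """Entropy of the empirical distribution over sampled stance labels."""
    counts = Counter(votes)
    n = len(votes)
    return -sum((c / n) * math.log(c / n) for c in counts.values())

# hypothetical stance labels from five sampled generations on one passage
votes = ["hawkish", "hawkish", "neutral", "hawkish", "dovish"]
stance = Counter(votes).most_common(1)[0][0]
print(stance, round(predictive_entropy(votes), 3))  # flag high-entropy cases for review
```

The paper's reported positive correlation between perceptual uncertainty and error rate is exactly what makes a signal like this usable as a triage diagnostic.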
[281] Fitting Description Logic Ontologies to ABox and Query Examples
Maurice Funk, Marvin Grosser, Carsten Lutz
Main category: cs.AI
TL;DR: The paper studies ontology fitting for querying, analyzing complexity for various query languages in description logics.
Details
Motivation: To address the challenge of fitting an ontology to satisfy positive and negative examples in ontology-mediated querying.Method: Uses description logics ALC and ALCI, analyzing fitting problems for atomic, conjunctive, and union queries.
Result: Shows coNP complexity for AQs and full CQs, and 2ExpTime-completeness for CQs and UCQs.
Conclusion: The fitting problem’s complexity varies by query type, with consistent results across ALC and ALCI.
Abstract: We study a fitting problem inspired by ontology-mediated querying: given a collection of positive and negative examples of the form $(\mathcal{A},q)$ with $\mathcal{A}$ an ABox and $q$ a Boolean query, we seek an ontology $\mathcal{O}$ that satisfies $\mathcal{A} \cup \mathcal{O} \vDash q$ for all positive examples and $\mathcal{A} \cup \mathcal{O} \not\vDash q$ for all negative examples. We consider the description logics $\mathcal{ALC}$ and $\mathcal{ALCI}$ as ontology languages and a range of query languages that includes atomic queries (AQs), conjunctive queries (CQs), and unions thereof (UCQs). For all of the resulting fitting problems, we provide effective characterizations and determine the computational complexity of deciding whether a fitting ontology exists. This problem turns out to be $\mathrm{coNP}$ for AQs and full CQs and $2\mathrm{ExpTime}$-complete for CQs and UCQs. These results hold for both $\mathcal{ALC}$ and $\mathcal{ALCI}$.
cs.SD
[282] Audio-Visual Speech Enhancement: Architectural Design and Deployment Strategies
Anis Hamadouche, Haifeng Luo, Mathini Sellathurai, Tharm Ratnarajah
Main category: cs.SD
TL;DR: The paper presents an AI-based Audio-Visual Speech Enhancement (AVSE) system using CNNs and LSTMs, comparing cloud, edge, and standalone deployments. Edge-assisted architectures balance latency and quality best under 5G/Wi-Fi 6.
Details
Motivation: To enhance speech quality in real-world applications by leveraging multimodal audio-visual fusion and exploring optimal deployment architectures.Method: Uses CNNs for spectral features and LSTMs for temporal modeling, testing cloud, edge, and standalone deployments across various network conditions.
Result: Cloud offers highest quality, but edge-assisted architectures provide the best latency-quality balance, suitable for real-time use under 5G/Wi-Fi 6.
Conclusion: Edge-assisted AVSE is optimal for real-time applications, offering practical deployment guidelines for diverse use cases.
Abstract: This paper introduces a new AI-based Audio-Visual Speech Enhancement (AVSE) system and presents a comparative performance analysis of different deployment architectures. The proposed AVSE system employs convolutional neural networks (CNNs) for spectral feature extraction and long short-term memory (LSTM) networks for temporal modeling, enabling robust speech enhancement through multimodal fusion of audio and visual cues. Multiple deployment scenarios are investigated, including cloud-based, edge-assisted, and standalone device implementations. Their performance is evaluated in terms of speech quality improvement, latency, and computational overhead. Real-world experiments are conducted across various network conditions, including Ethernet, Wi-Fi, 4G, and 5G, to analyze the trade-offs between processing delay, communication latency, and perceptual speech quality. The results show that while cloud deployment achieves the highest enhancement quality, edge-assisted architectures offer the best balance between latency and intelligibility, meeting real-time requirements under 5G and Wi-Fi 6 conditions. These findings provide practical guidelines for selecting and optimizing AVSE deployment architectures in diverse applications, including assistive hearing devices, telepresence, and industrial communications.
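The CNN-plus-LSTM pattern described above can be sketched in a few lines: a convolutional front end extracts spectral features per frame, and an LSTM models their temporal evolution to predict an enhancement mask. The sketch below covers only the audio branch with illustrative layer sizes; the visual-cue fusion is omitted.

```python
import torch
import torch.nn as nn

class AVSENet(nn.Module):
    """Hypothetical audio branch: CNN spectral features -> LSTM -> mask."""
    def __init__(self, freq_bins=257, hidden=128):
        super().__init__()
        self.cnn = nn.Sequential(                    # spectral feature extractor
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 1, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.lstm = nn.LSTM(freq_bins, hidden, batch_first=True)
        self.head = nn.Linear(hidden, freq_bins)     # per-frame enhancement mask

    def forward(self, spec):                         # spec: (batch, frames, freq)
        feats = self.cnn(spec.unsqueeze(1)).squeeze(1)
        out, _ = self.lstm(feats)
        return torch.sigmoid(self.head(out))         # mask in [0, 1]

mask = AVSENet()(torch.randn(2, 50, 257))
print(mask.shape)  # torch.Size([2, 50, 257])
```

In the edge-assisted deployment the paper favors, a model of this shape would run close to the device, with the network hop reserved for heavier offline processing.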
[283] Fine-grained Video Dubbing Duration Alignment with Segment Supervised Preference Optimization
Chaoqun Cui, Liangbin Huang, Shijing Wang, Zhe Tong, Zhaolong Huang, Xiao Zeng, Xiaofeng Liu
Main category: cs.SD
TL;DR: The paper introduces SSPO, a method for aligning speech durations in video dubbing to solve synchronization issues caused by language differences.
Details
Motivation: Addressing audio-video synchronization problems in video dubbing due to mismatched speech durations across languages.Method: Proposes Segment Supervised Preference Optimization (SSPO), using segment-wise sampling and fine-grained loss for duration alignment.
Result: SSPO outperforms in duration alignment tasks, improving synchronization.
Conclusion: SSPO effectively mitigates duration mismatches in video dubbing, enhancing viewer experience.
Abstract: Video dubbing aims to translate original speech in visual media programs from the source language to the target language, relying on neural machine translation and text-to-speech technologies. Due to varying information densities across languages, target speech often mismatches the source speech duration, causing audio-video synchronization issues that significantly impact viewer experience. In this study, we approach duration alignment in LLM-based video dubbing machine translation as a preference optimization problem. We propose the Segment Supervised Preference Optimization (SSPO) method, which employs a segment-wise sampling strategy and fine-grained loss to mitigate duration mismatches between source and target lines. Experimental results demonstrate that SSPO achieves superior performance in duration alignment tasks.
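At its core, a segment-supervised preference objective can be written as a DPO-style logistic loss applied per line, preferring the translation whose estimated speech duration better matches the source. The sketch below makes that structure explicit; the log-probability inputs and beta are illustrative assumptions, not SSPO's exact formulation.

```python
import torch
import torch.nn.functional as F

def segment_preference_loss(logp_chosen, logp_rejected,
                            ref_chosen, ref_rejected, beta=0.1):
    """Per-segment DPO-style loss from policy and reference log-probs.
    'Chosen' = duration-matched translation, 'rejected' = mismatched one."""
    margin = beta * ((logp_chosen - ref_chosen) - (logp_rejected - ref_rejected))
    return -F.logsigmoid(margin).mean()

# hypothetical per-segment log-probs for two dubbed lines
logp_c, logp_r = torch.tensor([-12.0, -9.5]), torch.tensor([-11.0, -10.2])
ref_c, ref_r = torch.tensor([-12.5, -9.8]), torch.tensor([-10.8, -10.0])
print(float(segment_preference_loss(logp_c, logp_r, ref_c, ref_r)))
```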
[284] Multi-Target Backdoor Attacks Against Speaker Recognition
Alexandrine Fortier, Sonal Joshi, Thomas Thebaud, Jesus Villalba Lopez, Najim Dehak, Patrick Cardinal
Main category: cs.SD
TL;DR: A multi-target backdoor attack on speaker identification uses clicking sounds as triggers, targeting up to 50 speakers with 95.04% success. It also works for speaker verification, achieving 90% success for similar speaker pairs.
Details
Motivation: To develop a more realistic and scalable backdoor attack on speaker identification and verification systems, moving beyond single-target limitations.Method: Uses position-independent clicking sounds as triggers, varies signal-to-noise ratios for stealth, and selects targets via cosine similarity in verification tasks.
Result: Achieves up to 95.04% success in multi-target speaker identification and 90% in verification for highly similar speaker pairs.
Conclusion: The attack is highly effective and scalable, with performance influenced by target similarity and trigger stealth.
Abstract: In this work, we propose a multi-target backdoor attack against speaker identification using position-independent clicking sounds as triggers. Unlike previous single-target approaches, our method targets up to 50 speakers simultaneously, achieving success rates of up to 95.04%. To simulate more realistic attack conditions, we vary the signal-to-noise ratio between speech and trigger, demonstrating a trade-off between stealth and effectiveness. We further extend the attack to the speaker verification task by selecting the most similar training speaker - based on cosine similarity - as the target. The attack is most effective when target and enrolled speaker pairs are highly similar, reaching success rates of up to 90% in such cases.
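The stealth/effectiveness trade-off hinges on the signal-to-noise ratio at which the trigger is mixed into the speech. A minimal sketch of that mixing step with synthetic waveforms (both signals are placeholders):

```python
import numpy as np

def mix_at_snr(speech: np.ndarray, trigger: np.ndarray, snr_db: float) -> np.ndarray:
    """Scale the trigger so speech power / trigger power equals snr_db, then add."""
    p_speech = np.mean(speech ** 2)
    p_trigger = np.mean(trigger ** 2)
    scale = np.sqrt(p_speech / (p_trigger * 10 ** (snr_db / 10)))
    mixed = speech.copy()
    start = np.random.randint(0, len(speech) - len(trigger))  # position-independent
    mixed[start:start + len(trigger)] += scale * trigger
    return mixed

sr = 16000
speech = np.random.randn(sr)             # 1 s of placeholder "speech"
click = np.zeros(160); click[0] = 1.0    # 10 ms click trigger
poisoned = mix_at_snr(speech, click, snr_db=20.0)
print(poisoned.shape)
```

Raising `snr_db` makes the trigger quieter (stealthier) at the cost of attack success, which is the trade-off the experiments quantify.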
[285] SonicRadiation: A Hybrid Numerical Solution for Sound Radiation without Ghost Cells
Xutong Jin, Guoping Wang, Sheng Li
Main category: cs.SD
TL;DR: SonicRadiation is a hybrid method combining FDTD and TDBEM for accurate and efficient sound radiation simulation, overcoming ghost cell limitations in complex boundaries.
Details
Motivation: Addressing challenges in sound radiation simulation for complex object boundaries, where previous methods like ghost cell-based FDTD failed due to large errors.Method: A hybrid approach integrating FDTD and TDBEM with a boundary grid synchronization strategy, leveraging TDBEM’s near-field accuracy and FDTD’s far-field efficiency.
Result: Superior accuracy and efficiency in sound radiation simulation, especially in complex scenes, compared to prior methods.
Conclusion: SonicRadiation effectively handles complex and dynamic boundaries, offering a robust solution for sound radiation simulation.
Abstract: Interactive synthesis of physical sound effects is crucial in digital media production. Sound radiation simulation, a key component of physically based sound synthesis, has posed challenges in the context of complex object boundaries. Previous methods, such as ghost cell-based finite-difference time-domain (FDTD) wave solvers, have struggled to address these challenges, leading to large errors and failures on complex boundaries because of the limitation of ghost cells. We present SonicRadiation, a hybrid numerical solution capable of handling complex and dynamic object boundaries in sound radiation simulation without relying on ghost cells. We derive a consistent formulation to connect the physical quantities on grid cells in FDTD with the boundary elements in the time-domain boundary element method (TDBEM). Building on this, we propose a boundary grid synchronization strategy to seamlessly integrate TDBEM with FDTD while maintaining high numerical accuracy. Our method combines the accuracy of TDBEM in the near field with the efficiency of FDTD in the far field. Experimental results demonstrate the superiority of our method in sound radiation simulation over previous approaches in terms of accuracy and efficiency, particularly in complex scenes, further validating its effectiveness.
[286] Opening Musical Creativity? Embedded Ideologies in Generative-AI Music Systems
Liam Pram, Fabio Morreale
Main category: cs.SD
TL;DR: The paper critiques the rhetoric of democratization in generative-AI music systems, revealing a disconnect between marketing and actual inclusivity.
Details
Motivation: To investigate the ideologies driving generative-AI music systems, focusing on democratization claims.Method: Combines autoethnography and digital ethnography to analyze four systems (AIVA, Stable Audio, Suno, Udio) and their rhetoric.
Result: Identifies a shared ‘total ideology’ among producers and consumers: individualist, globalist, techno-liberal, and ethically evasive.
Conclusion: The ideology transforms music practice to fit generative outcomes, masking responsibility and inclusivity gaps.
Abstract: AI systems for music generation are increasingly common and easy to use, granting people without any musical background the ability to create music. Because of this, generative-AI has been marketed and celebrated as a means of democratizing music making. However, inclusivity often functions as marketable rhetoric rather than a genuine guiding principle in these industry settings. In this paper, we look at four generative-AI music making systems available to the public as of mid-2025 (AIVA, Stable Audio, Suno, and Udio) and track how they are rhetoricized by their developers, and received by users. Our aim is to investigate ideologies that are driving the early-stage development and adoption of generative-AI in music making, with a particular focus on democratization. A combination of autoethnography and digital ethnography is used to examine patterns and incongruities in rhetoric when positioned against product functionality. The results are then collated to develop a nuanced, contextual discussion. The shared ideology we map between producers and consumers is individualist, globalist, techno-liberal, and ethically evasive. It is a ‘total ideology’ which obfuscates individual responsibility, and through which the nature of music and musical practice is transfigured to suit generative outcomes.
[287] Sound Signal Synthesis with Auxiliary Classifier GAN, COVID-19 cough as an example
Yahya Sherif Solayman Mohamed Saleh, Ahmed Mohammed Dabbous, Lama Alkhaled, Hum Yan Chai, Muhammad Ehsan Rana, Hamam Mokayed
Main category: cs.SD
TL;DR: The paper explores using synthetic cough data to improve COVID-19 detection via ML models, achieving a 3% accuracy boost.
Details
Motivation: Addressing data scarcity in healthcare AI, especially during pandemics like COVID-19, to enhance diagnostic accuracy.Method: Uses an Auxiliary Classifier GAN (ACGAN) to generate synthetic cough data, augmenting a CNN classifier trained on the Coughvid dataset.
Result: Synthetic data improved CNN test accuracy from 72% to 75%.
Conclusion: Synthetic data can mitigate healthcare data scarcity, though challenges in training consistency remain.
Abstract: One of the fastest-growing domains in AI is healthcare. Given its importance, many researchers have sought to deploy ML models into the ever-demanding healthcare domain to aid doctors and increase accessibility. Delivering reliable models, however, demands a sizable amount of data, and the recent COVID-19 pandemic served as a reminder of the rampant and scary nature of healthcare that makes training models difficult. To alleviate such scarcity, many published works attempted to synthesize radiological cough data to train better COVID-19 detection models on the respective radiological data. To accommodate the time sensitivity expected during a pandemic, this work focuses on detecting COVID-19 through coughs using synthetic data to improve the accuracy of the classifier. The work begins by training a CNN on a balanced subset of the Coughvid dataset, establishing a baseline classification test accuracy of 72%. The paper demonstrates how an Auxiliary Classifier GAN (ACGAN) may be trained to conditionally generate novel synthetic Mel Spectrograms of both healthy and COVID-19 coughs. These coughs are used to augment the training dataset of the CNN classifier, allowing it to reach a new test accuracy of 75%. The work highlights the expected messiness and inconsistency in training and offers insights into detecting and handling such shortcomings.
[288] QAMRO: Quality-aware Adaptive Margin Ranking Optimization for Human-aligned Assessment of Audio Generation Systems
Chien-Chun Wang, Kuan-Tang Huang, Cheng-Yeh Yang, Hung-Shin Lee, Hsin-Min Wang, Berlin Chen
Main category: cs.SD
TL;DR: QAMRO is a new framework for evaluating audio generation systems by integrating regression objectives to better align with human perception, outperforming baselines.
Details
Motivation: Existing methods for evaluating audio generation systems overlook the relativity of perceptual judgments, limiting their accuracy.Method: QAMRO combines regression objectives from different perspectives using pre-trained models like CLAP and Audiobox-Aesthetics, trained on the AudioMOS Challenge 2025 dataset.
Result: QAMRO shows superior alignment with human evaluations across all dimensions, outperforming baseline models.
Conclusion: QAMRO effectively addresses the limitations of existing methods, improving accuracy in audio generation evaluation.
Abstract: Evaluating audio generation systems, including text-to-music (TTM), text-to-speech (TTS), and text-to-audio (TTA), remains challenging due to the subjective and multi-dimensional nature of human perception. Existing methods treat mean opinion score (MOS) prediction as a regression problem, but standard regression losses overlook the relativity of perceptual judgments. To address this limitation, we introduce QAMRO, a novel Quality-aware Adaptive Margin Ranking Optimization framework that seamlessly integrates regression objectives from different perspectives, aiming to highlight perceptual differences and prioritize accurate ratings. Our framework leverages pre-trained audio-text models such as CLAP and Audiobox-Aesthetics, and is trained exclusively on the official AudioMOS Challenge 2025 dataset. It demonstrates superior alignment with human evaluations across all dimensions, significantly outperforming robust baseline models.
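One plausible reading of a quality-aware adaptive margin is a pairwise ranking loss whose required margin grows with the gap between the items' human MOS labels. The sketch below implements that interpretation; the margin scaling rule is an assumption, not QAMRO's exact formulation.

```python
import torch

def adaptive_margin_ranking_loss(pred, mos, base_margin=0.1):
    """pred, mos: (batch,) predicted and ground-truth scores.
    For each pair with mos[i] > mos[j], require pred[i] - pred[j] to exceed a
    margin proportional to the true quality gap."""
    diff_true = mos.unsqueeze(1) - mos.unsqueeze(0)   # (B, B) true gaps
    diff_pred = pred.unsqueeze(1) - pred.unsqueeze(0)
    pair_mask = diff_true > 0                          # ordered pairs only
    margin = base_margin * diff_true                   # larger gap -> larger margin
    losses = torch.relu(margin - diff_pred)[pair_mask]
    return losses.mean() if losses.numel() else pred.sum() * 0.0

pred = torch.tensor([3.2, 4.1, 2.0], requires_grad=True)
mos = torch.tensor([3.0, 4.5, 2.5])
loss = adaptive_margin_ranking_loss(pred, mos)
loss.backward()
print(float(loss))
```

Unlike a plain regression loss, this objective is invariant to a constant offset in predictions and penalizes only misordered or under-separated pairs, which is the "relativity of perceptual judgments" the abstract points to.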
[289] DualSpeechLM: Towards Unified Speech Understanding and Generation via Dual Speech Token Modeling with Large Language Models
Yuanyuan Wang, Dongchao Yang, Yiwen Shao, Hangting Chen, Jiankun Zhao, Zhiyong Wu, Helen Meng, Xixin Wu
Main category: cs.SD
TL;DR: The paper proposes a dual-token framework (DualSpeechLM) and an Understanding-driven Speech Tokenizer (USTokenizer) to unify speech understanding and generation in LLMs, addressing modality gaps and task divergence.
Details
Motivation: To overcome challenges in unifying speech understanding and generation in LLMs, such as modality gaps and divergent task requirements.Method: Introduces USTokenizer for semantic tokenization and DualSpeechLM for dual-token modeling, with semantic supervision loss and Chain-of-Condition (CoC) strategy.
Result: Effective integration of understanding and generation tasks, with improved performance and training stability.
Conclusion: The approach successfully unifies speech tasks, demonstrating mutual enhancement in a single model.
Abstract: Extending the speech understanding and generation abilities of pre-trained Large Language Models (LLMs) by introducing various effective speech tokens has attracted great attention in the speech community. However, building a unified speech understanding and generation model still faces the following challenges: (1) Due to the huge modality gap between speech tokens and text tokens, extending text LLMs to unified speech LLMs relies on large-scale paired data for fine-tuning, and (2) Generation and understanding tasks prefer information at different levels, e.g., generation benefits from detailed acoustic features, while understanding favors high-level semantics. This divergence makes performance optimization in one unified model difficult. To solve these challenges, in this paper, we present two key insights in speech tokenization and speech language modeling. Specifically, we first propose an Understanding-driven Speech Tokenizer (USTokenizer), which extracts high-level semantic information essential for accomplishing understanding tasks using text LLMs. In this way, USToken enjoys better modality commonality with text, which reduces the difficulty of modality alignment in adapting text LLMs to speech LLMs. Secondly, we present DualSpeechLM, a dual-token modeling framework that concurrently models USToken as input and acoustic token as output within a unified, end-to-end framework, seamlessly integrating speech understanding and generation capabilities. Furthermore, we propose a novel semantic supervision loss and a Chain-of-Condition (CoC) strategy to stabilize model training and enhance speech generation performance. Experimental results demonstrate that our proposed approach effectively fosters a complementary relationship between understanding and generation tasks, highlighting the promising strategy of mutually enhancing both tasks in one unified model.
[290] Revealing the Role of Audio Channels in ASR Performance Degradation
Kuan-Tang Huang, Li-Wei Chen, Hung-Shin Lee, Berlin Chen, Hsin-Min Wang
Main category: cs.SD
TL;DR: The paper proposes a normalization technique to improve ASR performance by aligning feature representations with a clean reference channel, addressing degradation caused by channel variations.
Details
Motivation: ASR models degrade when input audio comes from different recording channels, often attributed to training-testing corpus mismatch. The study argues channel variations fundamentally harm ASR performance.Method: A normalization technique aligns internal feature representations in the ASR model with those from a clean reference channel.
Result: The approach significantly improves ASR performance on unseen channels and languages, demonstrating generalization across channel and language differences.
Conclusion: The proposed normalization effectively mitigates channel variation impact, enhancing ASR robustness across diverse inputs.
Abstract: Pre-trained automatic speech recognition (ASR) models have demonstrated strong performance on a variety of tasks. However, their performance can degrade substantially when the input audio comes from different recording channels. While previous studies have demonstrated this phenomenon, it is often attributed to the mismatch between training and testing corpora. This study argues that variations in speech characteristics caused by different recording channels can fundamentally harm ASR performance. To address this limitation, we propose a normalization technique designed to mitigate the impact of channel variation by aligning internal feature representations in the ASR model with those derived from a clean reference channel. This approach significantly improves ASR performance on previously unseen channels and languages, highlighting its ability to generalize across channel and language differences.
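A simple instance of aligning features to a clean reference channel is moment matching: shift and scale degraded-channel features so their per-dimension mean and variance match the reference. Matching first and second moments is an illustrative assumption about the alignment; the paper defines its own normalization.

```python
import numpy as np

def align_to_reference(feats: np.ndarray, ref_mean, ref_std, eps=1e-8):
    """feats: (frames, dim) features from the degraded channel.
    Standardize, then re-color with the clean reference channel's statistics."""
    mu, sigma = feats.mean(axis=0), feats.std(axis=0)
    return (feats - mu) / (sigma + eps) * ref_std + ref_mean

ref = np.random.randn(500, 80) * 0.5 + 1.0       # clean-channel features (placeholder)
degraded = np.random.randn(300, 80) * 2.0 - 3.0  # channel-shifted features (placeholder)
aligned = align_to_reference(degraded, ref.mean(axis=0), ref.std(axis=0))
print(aligned.mean().round(2), aligned.std().round(2))  # ~1.0, ~0.5
```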
[291] Neutone SDK: An Open Source Framework for Neural Audio Processing
Christopher Mitcheltree, Bogdan Teleaga, Andrew Fyfe, Naotake Masuda, Matthias Schäfer, Alfie Bradic, Nao Tokui
Main category: cs.SD
TL;DR: The Neutone SDK is an open-source framework simplifying the deployment of PyTorch-based neural audio models in DAWs, addressing challenges like real-time inference and plugin development.
Details
Motivation: Integrating neural audio models into DAWs is hindered by real-time constraints and plugin complexities.Method: The SDK provides a unified, model-agnostic interface for handling buffer sizes, sample rate conversion, delay compensation, and control parameters, enabling Python-based workflow.
Result: The framework supports diverse applications like audio effect emulation, timbre transfer, and sample generation, and is widely adopted.
Conclusion: The Neutone SDK bridges the gap between neural audio models and DAWs, facilitating seamless integration and broad usability.
Abstract: Neural audio processing has unlocked novel methods of sound transformation and synthesis, yet integrating deep learning models into digital audio workstations (DAWs) remains challenging due to real-time / neural network inference constraints and the complexities of plugin development. In this paper, we introduce the Neutone SDK: an open source framework that streamlines the deployment of PyTorch-based neural audio models for both real-time and offline applications. By encapsulating common challenges such as variable buffer sizes, sample rate conversion, delay compensation, and control parameter handling within a unified, model-agnostic interface, our framework enables seamless interoperability between neural models and host plugins while allowing users to work entirely in Python. We provide a technical overview of the interfaces needed to accomplish this, as well as the corresponding SDK implementations. We also demonstrate the SDK’s versatility across applications such as audio effect emulation, timbre transfer, and sample generation, as well as its adoption by researchers, educators, companies, and artists alike. The Neutone SDK is available at https://github.com/Neutone/neutone_sdk
[292] Dopamine Audiobook: A Training-free MLLM Agent for Emotional and Immersive Audiobook Generation
Yan Rong, Shan Yang, Chenxing Li, Dong Yu, Li Liu
Main category: cs.SD
TL;DR: Dopamine Audiobook proposes a unified, training-free multi-agent system for immersive audiobook generation, addressing challenges in audio alignment, emotional expressiveness, and automated evaluation.
Details
Motivation: Current audiobook generation lacks synergistic audio alignment, expressive emotions, and human-aligned evaluation.Method: Uses a multimodal large language model (MLLM) for speech and audio design, with flow-based context-aware generation, paralinguistic augmentation, prosody retrieval, and adaptive TTS selection. Introduces an MLLM-based evaluation framework with self-critique and psychological prompts.
Result: Achieves SOTA performance on multiple metrics, with better human preference alignment and task transferability.
Conclusion: Dopamine Audiobook effectively addresses key challenges in audiobook generation, offering a scalable and human-aligned solution.
Abstract: Audiobook generation aims to create rich, immersive listening experiences from multimodal inputs, but current approaches face three critical challenges: (1) the lack of synergistic generation of diverse audio types (e.g., speech, sound effects, and music) with precise temporal and semantic alignment; (2) the difficulty in conveying expressive, fine-grained emotions, which often results in machine-like vocal outputs; and (3) the absence of automated evaluation frameworks that align with human preferences for complex and diverse audio. To address these issues, we propose Dopamine Audiobook, a novel unified training-free multi-agent system, where a multimodal large language model (MLLM) serves two specialized roles (i.e., speech designer and audio designer) for emotional, human-like, and immersive audiobook generation and evaluation. Specifically, we first propose a flow-based, context-aware framework for diverse audio generation with word-level semantic and temporal alignment. To enhance expressiveness, we then design word-level paralinguistic augmentation, utterance-level prosody retrieval, and adaptive TTS model selection. Finally, for evaluation, we introduce a novel MLLM-based evaluation framework incorporating self-critique, perspective-taking, and psychological MagicEmo prompts to ensure human-aligned and self-aligned assessments. Experimental results demonstrate that our method achieves state-of-the-art (SOTA) performance on multiple metrics. Importantly, our evaluation framework shows better alignment with human preferences and transferability across audio tasks.
[293] Learning Marmoset Vocal Patterns with a Masked Autoencoder for Robust Call Segmentation, Classification, and Caller Identification
Bin Wu, Shinnosuke Takamichi, Sakriani Sakti, Satoshi Nakamura
Main category: cs.SD
TL;DR: MAE-pretrained Transformers outperform CNNs in modeling marmoset vocal communication by addressing overfitting and instability on small, noisy datasets.
Details
Motivation: Marmoset vocalizations are less structured and recorded in noisy conditions, making joint call segmentation, classification, and caller identification challenging.Method: Applied Transformers with self-attention for global dependencies and pretrained them using MAE (self-supervised method) on unannotated recordings.
Result: MAE-pretrained Transformers showed improved stability and generalization, outperforming CNNs.
Conclusion: Self-supervised architectures like MAE-pretrained Transformers effectively model low-resource non-human vocal communication.
Abstract: The marmoset, a highly vocal primate, is a key model for studying social-communicative behavior. Unlike human speech, marmoset vocalizations are less structured, highly variable, and recorded in noisy, low-resource conditions. Learning marmoset communication requires joint call segmentation, classification, and caller identification – challenging tasks in this domain. Previous CNN-based approaches handle local patterns but struggle with long-range temporal structure. We applied Transformers, using self-attention to capture global dependencies. However, Transformers show overfitting and instability on small, noisy annotated datasets. To address this, we pretrain Transformers with MAE – a self-supervised method that reconstructs masked segments – on hundreds of hours of unannotated marmoset recordings. The pretraining improved stability and generalization. Results show MAE-pretrained Transformers outperform CNNs, demonstrating that modern self-supervised architectures effectively model low-resource non-human vocal communication.
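The MAE pretraining recipe reduces to masking a fraction of spectrogram frames, encoding the corrupted sequence, and scoring reconstruction only on masked positions. A tiny runnable sketch with an illustrative two-layer encoder (the paper's model and mask ratio may differ):

```python
import torch
import torch.nn as nn

class TinyMAE(nn.Module):
    """Minimal MAE-style pretraining objective on spectrogram frames."""
    def __init__(self, dim=64):
        super().__init__()
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True),
            num_layers=2,
        )
        self.mask_token = nn.Parameter(torch.zeros(dim))

    def forward(self, x, mask_ratio=0.75):
        B, T, D = x.shape
        mask = torch.rand(B, T) < mask_ratio      # True = masked frame
        corrupted = x.clone()
        corrupted[mask] = self.mask_token         # replace masked frames
        recon = self.encoder(corrupted)
        return ((recon - x)[mask] ** 2).mean()    # loss on masked frames only

spec = torch.randn(8, 100, 64)                    # (batch, frames, mel bins)
loss = TinyMAE()(spec)
loss.backward()
print(float(loss))
```

After pretraining on unannotated recordings, the encoder would be fine-tuned with task heads for segmentation, call type, and caller identity.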
[294] Audio-Thinker: Guiding Audio Language Model When and How to Think via Reinforcement Learning
Shu Wu, Chenxing Li, Wenfu Wang, Hao Zhang, Hualei Wang, Meng Yu, Dong Yu
Main category: cs.SD
TL;DR: Audio-Thinker enhances LALMs’ reasoning via adaptive rewards and external evaluation, outperforming existing models.
Details
Motivation: Current LALMs lack explicit reasoning benefits for audio QA and fall short of human-level auditory-language reasoning.Method: Proposes Audio-Thinker, a reinforcement learning framework with adaptive think accuracy rewards and external reward models.
Result: Outperforms existing reasoning-oriented LALMs in benchmarks, showing superior reasoning and generalization.
Conclusion: Audio-Thinker effectively addresses LALMs’ reasoning limitations, improving adaptability and consistency.
Abstract: Recent advancements in large language models, multimodal large language models, and large audio language models (LALMs) have significantly improved their reasoning capabilities through reinforcement learning with rule-based rewards. However, the explicit reasoning process has yet to show significant benefits for audio question answering, and effectively leveraging deep reasoning remains an open challenge, with LALMs still falling short of human-level auditory-language reasoning. To address these limitations, we propose Audio-Thinker, a reinforcement learning framework designed to enhance the reasoning capabilities of LALMs, with a focus on improving adaptability, consistency, and effectiveness. Our approach introduces an adaptive think accuracy reward, enabling the model to adjust its reasoning strategies based on task complexity dynamically. Furthermore, we incorporate an external reward model to evaluate the overall consistency and quality of the reasoning process, complemented by think-based rewards that help the model distinguish between valid and flawed reasoning paths during training. Experimental results demonstrate that our Audio-Thinker model outperforms existing reasoning-oriented LALMs across various benchmark tasks, exhibiting superior reasoning and generalization capabilities.
cs.LG
[295] Benchmarking Large Language Models for Geolocating Colonial Virginia Land Grants
Ryan Mioduski
Main category: cs.LG
TL;DR: LLMs effectively convert historical land patent descriptions into accurate coordinates, outperforming traditional methods and offering cost-effective scalability.
Details
Motivation: To address the limitation of spatial analysis due to narrative land patent descriptions by leveraging LLMs for accurate georeferencing.Method: Evaluated six OpenAI models across three architectures using direct-to-coordinate and tool-augmented approaches, compared with GIS-analyst and other baselines.
Result: Top model achieved a mean error of 23 km, outperforming baselines by up to 70%. Ensembles further reduced errors to 19 km.
Conclusion: LLMs show strong potential for scalable, accurate, and cost-effective historical georeferencing.
Abstract: Virginia’s seventeenth- and eighteenth-century land patents survive primarily as narrative metes-and-bounds descriptions, limiting spatial analysis. This study systematically evaluates current-generation large language models (LLMs) in converting these prose abstracts into geographically accurate latitude/longitude coordinates within a focused evaluation context. A digitized corpus of 5,471 Virginia patent abstracts (1695-1732) is released, with 43 rigorously verified test cases serving as an initial, geographically focused benchmark. Six OpenAI models across three architectures (o-series, GPT-4-class, and GPT-3.5) were tested under two paradigms: direct-to-coordinate and tool-augmented chain-of-thought invoking external geocoding APIs. Results were compared with a GIS-analyst baseline, the Stanford NER geoparser, Mordecai-3, and a county-centroid heuristic. The top single-call model, o3-2025-04-16, achieved a mean error of 23 km (median 14 km), outperforming the median LLM (37.4 km) by 37.5%, the weakest LLM (50.3 km) by 53.5%, and external baselines by 67% (GIS analyst) and 70% (Stanford NER). A five-call ensemble further reduced errors to 19 km (median 12 km) at minimal additional cost (approx. USD 0.20 per grant), outperforming the median LLM by 48.6%. A patentee-name-redaction ablation increased error by about 9%, indicating reliance on textual landmark and adjacency descriptions rather than memorization. The cost-efficient gpt-4o-2024-08-06 model maintained a 28 km mean error at USD 1.09 per 1,000 grants, establishing a strong cost-accuracy benchmark; external geocoding tools offered no measurable benefit in this evaluation. These findings demonstrate the potential of LLMs for scalable, accurate, and cost-effective historical georeferencing.
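The five-call ensemble can be reproduced with a componentwise median over model-proposed coordinates, scored by great-circle (haversine) distance; the coordinates below are fabricated placeholders, not data from the paper.

```python
import math
from statistics import median

def haversine_km(a, b):
    """Great-circle distance in km between (lat, lon) pairs in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))

calls = [(37.42, -77.21), (37.55, -77.30), (37.38, -77.18),
         (37.47, -77.25), (37.44, -77.22)]                # five LLM answers (made up)
ensemble = (median(lat for lat, _ in calls), median(lon for _, lon in calls))
truth = (37.50, -77.20)                                    # verified ground truth (made up)
print(round(haversine_km(ensemble, truth), 1), "km")
```

The median is a natural aggregator here because a single wildly wrong call (a hallucinated county, say) barely moves it, unlike a mean.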
[296] Doctor Sun: A Bilingual Multimodal Large Language Model for Biomedical AI
Dong Xue, Ziyao Shao, Zhaoyang Duan, Fangzhou Liu, Bing Li, Zhongheng Zhang
Main category: cs.LG
TL;DR: Doctor Sun is a specialized multimodal model for medicine, addressing limitations in existing biomedical AI by integrating text and image data with a two-stage training approach.
Details
Motivation: Existing biomedical AI models lack understanding of complex medical concepts and struggle with text-image relationships, prompting the development of Doctor Sun.Method: Doctor Sun combines a pre-trained vision encoder with a medical LLM, using two-stage training (feature alignment and instruction tuning) on diverse medical datasets.
Result: The model is released alongside SunMed-VL, a bilingual medical multimodal dataset, to advance biomedical research.
Conclusion: Doctor Sun and SunMed-VL aim to improve biomedical multimodal AI by addressing current limitations and providing open resources.
Abstract: Large multimodal models (LMMs) have demonstrated significant potential in providing innovative solutions for various biomedical tasks, including pathology analysis, radiology report generation, and biomedical assistance. However, existing multimodal biomedical AI is typically built on general-purpose foundation LLMs, which hinders the understanding of intricate medical concepts when medical training data is limited. Moreover, recent LLaVA-style medical LMMs struggle to effectively capture the intricate relationship between texts and images. Therefore, we introduce Doctor Sun, a large multimodal generative model specialized in medicine, developed to encode, integrate, and interpret diverse biomedical data modalities such as text and images. In particular, Doctor Sun integrates a pre-trained vision encoder with a medical LLM and conducts two-stage training on various medical datasets, focusing on feature alignment and instruction tuning. Moreover, we release SunMed-VL, a wide-ranging bilingual medical multimodal dataset, along with all associated models, code, and resources, to freely support the advancement of biomedical multimodal research.
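A minimal sketch of the two-stage recipe the summary describes, assuming the common LLaVA-style design (module names and dimensions are hypothetical): a projector maps frozen vision-encoder features into the medical LLM's embedding space; stage 1 (feature alignment) trains only the projector, stage 2 (instruction tuning) unfreezes the LLM as well.

```python
import torch
import torch.nn as nn

class VisionProjector(nn.Module):
    """Maps vision-encoder patch features into the LLM embedding space."""
    def __init__(self, vision_dim=1024, llm_dim=4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim), nn.GELU(), nn.Linear(llm_dim, llm_dim)
        )

    def forward(self, patch_feats):        # (B, num_patches, vision_dim)
        return self.proj(patch_feats)      # (B, num_patches, llm_dim)

def set_stage(projector, llm, stage):
    """Stage 1: feature alignment (train projector only). Stage 2: instruction tuning."""
    for p in projector.parameters():
        p.requires_grad = True
    for p in llm.parameters():
        p.requires_grad = (stage == 2)
```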
[297] Towards Heterogeneity-Aware and Energy-Efficient Topology Optimization for Decentralized Federated Learning in Edge Environment
Yuze Liu, Tiehua Zhang, Zhishu Shen, Libing Wu, Shiping Chen, Jiong Jin
Main category: cs.LG
TL;DR: The paper proposes Hat-DFed, a decentralized federated learning framework for edge computing, addressing challenges like communication bottlenecks, resource heterogeneity, and data heterogeneity by optimizing topology and energy efficiency.
Details
Motivation: The motivation is to overcome the high costs and inefficiencies in decentralized federated learning (DFL) caused by dynamic topology changes, resource heterogeneity, and data heterogeneity in edge computing systems.Method: The method involves formulating topology construction as a dual optimization problem (proven NP-hard) and designing a two-phase algorithm to dynamically optimize communication topologies and mitigate data heterogeneity effects.
Result: Hat-DFed maximizes model performance while minimizing energy consumption, addressing both topology and data heterogeneity challenges.
Conclusion: The conclusion is that Hat-DFed effectively balances model performance and energy efficiency in decentralized federated learning for edge computing, providing a practical solution to existing challenges.
Abstract: Federated learning (FL) has emerged as a promising paradigm within edge computing (EC) systems, enabling numerous edge devices to collaboratively train artificial intelligence (AI) models while maintaining data privacy. To overcome the communication bottlenecks associated with centralized parameter servers, decentralized federated learning (DFL), which leverages peer-to-peer (P2P) communication, has been extensively explored in the research community. Although researchers have designed a variety of DFL approaches to ensure model convergence, the iterative learning process inevitably incurs considerable cost as model complexity and the number of participants grow. These costs are largely influenced by the dynamic changes of topology in each training round, particularly its sparsity and connectivity conditions. Furthermore, the inherent resource heterogeneity of edge environments affects the energy efficiency of the learning process, while data heterogeneity degrades model performance. These factors pose significant challenges to the design of an effective DFL framework for EC systems. To this end, we propose Hat-DFed, a heterogeneity-aware and cost-effective DFL framework. In Hat-DFed, topology construction is formulated as a dual optimization problem, which is then proven to be NP-hard, with the goal of maximizing model performance while minimizing cumulative energy consumption in complex edge environments. To solve this problem, we design a two-phase algorithm that dynamically constructs optimal communication topologies while unbiasedly estimating their impact on both model performance and energy cost. Additionally, the algorithm incorporates an importance-aware model aggregation mechanism to mitigate performance degradation caused by data heterogeneity.
[298] XFMNet: Decoding Cross-Site and Nonstationary Water Patterns via Stepwise Multimodal Fusion for Long-Term Water Quality Forecasting
Ziqi Wang, Hailiang Zhao, Cheng Bao, Wenzhuo Qian, Yuhao Yang, Xueqiang Sun, Shuiguang Deng
Main category: cs.LG
TL;DR: XFMNet is a multimodal fusion network for water quality prediction, integrating remote sensing data to handle temporal and spatial dynamics, outperforming existing methods.
Details
Motivation: Water quality prediction is challenging due to complex periodicity, nonstationarity, and abrupt fluctuations, especially in multi-site scenarios requiring simultaneous temporal and spatial modeling.Method: XFMNet uses adaptive downsampling, locally adaptive decomposition, and a cross-attention gated fusion module to integrate temporal, spatial, and ecological data.
Result: Experiments show XFMNet significantly outperforms state-of-the-art baselines in spatially distributed time series prediction.
Conclusion: XFMNet effectively addresses the challenges of water quality forecasting by combining temporal and spatial dynamics, demonstrating superior performance.
Abstract: Long-term time-series forecasting is critical for environmental monitoring, yet water quality prediction remains challenging due to complex periodicity, nonstationarity, and abrupt fluctuations induced by ecological factors. These challenges are further amplified in multi-site scenarios that require simultaneous modeling of temporal and spatial dynamics. To tackle this, we introduce XFMNet, a stepwise multimodal fusion network that integrates remote sensing precipitation imagery to provide spatial and environmental context in river networks. XFMNet first aligns temporal resolutions between water quality series and remote sensing inputs via adaptive downsampling, followed by locally adaptive decomposition to disentangle trend and cycle components. A cross-attention gated fusion module dynamically integrates temporal patterns with spatial and ecological cues, enhancing robustness to nonstationarity and site-specific anomalies. Through progressive and recursive fusion, XFMNet captures both long-term trends and short-term fluctuations. Extensive experiments on real-world datasets demonstrate substantial improvements over state-of-the-art baselines, highlighting the effectiveness of XFMNet for spatially distributed time series prediction.
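A hedged sketch of a cross-attention gated fusion step like the module named above (dimensions and the exact gating form are assumptions, not the paper's specification): water-quality tokens attend to remote-sensing tokens, and a learned gate controls how much of the fused signal is admitted.

```python
import torch
import torch.nn as nn

class CrossAttnGatedFusion(nn.Module):
    def __init__(self, d_model, n_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.gate = nn.Linear(2 * d_model, d_model)

    def forward(self, series_tokens, rs_tokens):
        """series_tokens: (B, T, d) water-quality series; rs_tokens: (B, S, d) remote-sensing context."""
        fused, _ = self.attn(series_tokens, rs_tokens, rs_tokens)  # cross-attention
        g = torch.sigmoid(self.gate(torch.cat([series_tokens, fused], dim=-1)))
        return series_tokens + g * fused    # gated residual fusion
```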
[299] MoSSDA: A Semi-Supervised Domain Adaptation Framework for Multivariate Time-Series Classification using Momentum Encoder
Seonyoung Kim, Dongil Kim
Main category: cs.LG
TL;DR: MoSSDA is a novel two-step SSDA framework for time-series classification, using a domain-invariant encoder and mixup-enhanced contrastive learning to handle domain shifts and achieve state-of-the-art performance.
Details
Motivation: Address performance degradation in deep learning models due to domain shifts, especially in time-series data, by leveraging limited labeled target domain data.Method: Two-step framework: domain-invariant encoder for feature learning and mixup-enhanced positive contrastive module with a momentum encoder. Two-stage gradient flow separation for rich representations.
Result: Achieved state-of-the-art performance on six datasets with various backbones and unlabeled ratios. Ablation studies confirmed module effectiveness.
Conclusion: MoSSDA effectively handles domain shifts in time-series data, improving robustness and discriminability without data augmentation.
Abstract: Deep learning has emerged as the most promising approach in various fields; however, when the distributions of training and test data are different (domain shift), the performance of deep learning models can degrade. Semi-supervised domain adaptation (SSDA) is a major approach for addressing this issue, assuming that a fully labeled training set (source domain) is available, but the test set (target domain) provides labels only for a small subset. In this study, we propose MoSSDA, a novel two-step, momentum-encoder-based SSDA framework for multivariate time-series classification. Time-series data are highly sensitive to noise and sequential dependencies, so domain shifts can result in critical performance degradation. To obtain a robust, domain-invariant, and class-discriminative representation, MoSSDA employs a domain-invariant encoder to learn features from both source and target domains. Subsequently, the learned features are fed to a mixup-enhanced positive contrastive module consisting of an online momentum encoder. The final classifier is trained with learned features that exhibit consistency and discriminability with limited labeled target-domain data, without data augmentation. We apply a two-stage process that separates the gradient flow between the encoders and the classifier to obtain rich and complex representations. Through extensive experiments on six diverse datasets, MoSSDA achieved state-of-the-art performance for three different backbones and various unlabeled ratios in the target-domain data. The ablation study confirms that each module, including two-stage learning, is effective in improving performance. Our code is available at https://github.com/seonyoungKimm/MoSSDA
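A minimal sketch of the momentum-encoder idea named in the summary (the momentum coefficient is an assumed typical value): the online encoder is trained by gradient descent, while the momentum encoder tracks it via an exponential moving average (EMA) and provides stable targets for the contrastive module.

```python
import copy
import torch

def make_momentum_encoder(online_encoder):
    """Frozen copy that will track the online encoder via EMA."""
    m = copy.deepcopy(online_encoder)
    for p in m.parameters():
        p.requires_grad = False
    return m

@torch.no_grad()
def momentum_update(online_encoder, momentum_encoder, m=0.999):
    """EMA update: momentum params drift slowly toward the online params."""
    for p_o, p_m in zip(online_encoder.parameters(), momentum_encoder.parameters()):
        p_m.data.mul_(m).add_(p_o.data, alpha=1.0 - m)
```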
[300] Multi-grained spatial-temporal feature complementarity for accurate online cellular traffic prediction
Ningning Fu, Shengheng Liu, Weiliang Xie, Yongming Huang
Main category: cs.LG
TL;DR: The paper introduces MGSTC, an online cellular traffic prediction method using multi-grained spatial-temporal features to address sporadic, bursty traffic and concept drift, outperforming existing baselines.
Details
Motivation: Current telecom data analysis relies on manual intervention and overlooks cellular traffic characteristics like sporadic bursts and concept drift, necessitating a more adaptive prediction method.Method: MGSTC segments historical data, uses coarse-grained temporal attention for trend reference, and fine-grained spatial attention for detailed correlations, with online learning for real-time concept drift handling.
Result: MGSTC consistently outperforms eleven state-of-the-art baselines on four real-world datasets.
Conclusion: MGSTC effectively addresses cellular traffic prediction challenges, offering high accuracy and adaptability in continuous forecasting scenarios.
Abstract: Knowledge discovered from telecom data can facilitate proactive understanding of network dynamics and user behaviors, which in turn empowers service providers to optimize cellular traffic scheduling and resource allocation. Nevertheless, the telecom industry still heavily relies on manual expert intervention. Existing studies have focused on exhaustively exploring spatial-temporal correlations. However, they often overlook the underlying characteristics of cellular traffic, which are shaped by the sporadic and bursty nature of telecom services. Additionally, concept drift creates substantial obstacles to maintaining satisfactory accuracy in continuous cellular forecasting tasks. To resolve these problems, we put forward an online cellular traffic prediction method grounded in Multi-Grained Spatial-Temporal feature Complementarity (MGSTC). The proposed method is devised to achieve high-precision predictions in practical continuous forecasting scenarios. Concretely, MGSTC segments historical data into chunks and employs coarse-grained temporal attention to offer a trend reference for the prediction horizon. Subsequently, fine-grained spatial attention is utilized to capture detailed correlations among network elements, which enables localized refinement of the established trend. The complementarity of these multi-grained spatial-temporal features facilitates the efficient transmission of valuable information. To accommodate continuous forecasting needs, we implement an online learning strategy that can detect concept drift in real time and promptly switch to the appropriate parameter update stage. Experiments carried out on four real-world datasets demonstrate that MGSTC consistently outperforms eleven state-of-the-art baselines.
[301] Understanding Transformers through the Lens of Pavlovian Conditioning
Mu Qiao
Main category: cs.LG
TL;DR: The paper reinterprets transformer attention mechanisms as Pavlovian conditioning, offering a theoretical framework with insights into capacity, error propagation, and biological plausibility.
Details
Motivation: To demystify the computational principles behind transformer success by linking attention mechanisms to classical conditioning.Method: Theoretical framework mapping attention’s queries, keys, and values to classical conditioning elements, analyzed via linear attention.
Result: Derived a capacity theorem, error propagation analysis, and insights into biologically plausible learning rules.
Conclusion: Transformer success may stem from implementing biologically optimized computational principles, not just architectural novelty.
Abstract: Transformer architectures have revolutionized artificial intelligence (AI) through their attention mechanisms, yet the computational principles underlying their success remain opaque. We present a novel theoretical framework that reinterprets the core computation of attention as Pavlovian conditioning. Our model finds a direct mathematical analogue in linear attention, which simplifies the analysis of the underlying associative process. We demonstrate that attention’s queries, keys, and values can be mapped to the three elements of classical conditioning: test stimuli that probe associations, conditional stimuli (CS) that serve as retrieval cues, and unconditional stimuli (US) that contain response information. Through this lens, we suggest that each attention operation constructs a transient associative memory via a Hebbian rule, where CS-US pairs form dynamic associations that test stimuli can later retrieve. Our framework yields several theoretical insights grounded in this linearized model: (1) a capacity theorem showing that attention heads can store $O(\sqrt{d_k})$ associations before interference degrades retrieval; (2) an error propagation analysis revealing fundamental architectural trade-offs of balancing model depth, width, and head redundancy to maintain reliability; and (3) an understanding of how biologically plausible learning rules could enhance transformer architectures. By establishing this deep connection, we suggest that the success of modern AI may stem not from architectural novelty alone, but from implementing computational principles that biology optimized over millions of years of evolution.
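The paper's central analogy can be made concrete in a few lines: linear (unnormalized) attention builds a transient Hebbian memory as a running sum of key-value outer products (CS-US associations), which each query (test stimulus) then reads out. A sketch of that correspondence, not the paper's code:

```python
import torch

def linear_attention(Q, K, V):
    """Q, K: (T, d_k); V: (T, d_v). Causal read-out of a Hebbian fast memory."""
    d_k, d_v = K.shape[1], V.shape[1]
    S = torch.zeros(d_k, d_v)               # associative memory (fast weights)
    out = []
    for t in range(Q.shape[0]):
        S = S + torch.outer(K[t], V[t])     # Hebbian store: bind CS (key) to US (value)
        out.append(Q[t] @ S)                # retrieval by the test stimulus (query)
    return torch.stack(out)
```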
[302] Probabilistic Emissivity Retrieval from Hyperspectral Data via Physics-Guided Variational Inference
Joshua R. Tempelman, Kevin Mitchell, Adam J. Wachtor, Eric B. Flynn
Main category: cs.LG
TL;DR: The paper introduces a physics-conditioned generative model for hyperspectral imaging (HSI) target identification, improving interpretability and avoiding bias from predefined training sets.
Details
Motivation: Current deep learning frameworks for HSI target identification lack interpretability and are limited by predefined training libraries.Method: A probabilistic latent-variable model learns HSI radiance distributions, uses atmospheric and background conditions, and employs physics-based loss criteria and Monte-Carlo sampling.
Result: The model produces emissivity distributions with uncertainty quantification and a distribution-based material matching scheme.
Conclusion: The approach enhances HSI target identification by incorporating contextual information, capturing material variation, and providing interpretable probability measures.
Abstract: Recent research has proven neural networks to be a powerful tool for performing hyperspectral imaging (HSI) target identification. However, many deep learning frameworks deliver a single material class prediction and operate on a per-pixel basis; such approaches are limited in their interpretability and restricted to predicting materials that are accessible in available training libraries. In this work, we present an inverse modeling approach in the form of a physics-conditioned generative model. A probabilistic latent-variable model learns the underlying distribution of HSI radiance measurements and produces the conditional distribution of the emissivity spectrum. Estimates of the HSI scene’s atmosphere and background are used as a physically relevant conditioning mechanism to contextualize a given radiance measurement during the encoding and decoding processes. Furthermore, we employ an in-the-loop augmentation scheme and physics-based loss criteria to avoid bias towards a predefined training material set and to encourage the model to learn physically consistent inverse mappings. Monte-Carlo sampling of the model’s conditioned posterior delivers the sought emissivity distribution and allows for interpretable uncertainty quantification. Finally, a distribution-based material matching scheme is presented to return a set of likely material matches for an inferred emissivity distribution. In sum, we present a strategy to incorporate contextual information about a given HSI scene, capture the possible variation of underlying material spectra, and provide interpretable probability measures for a candidate material given a remotely sensed radiance measurement.
[303] Channel-Wise MLPs Improve the Generalization of Recurrent Convolutional Networks
Nathan Breslow
Main category: cs.LG
TL;DR: DAMP, with MLP-based channel mixing, outperforms DARC in generalization, suggesting robust learning for neural program synthesis.
Details
Motivation: To study the impact of channel-wise mixing via MLPs on the generalization of recurrent convolutional networks.Method: Compare DARC (simple recurrent convolutional) and DAMP (DARC with gated MLP for channel mixing) using the Re-ARC benchmark.
Result: DAMP significantly outperforms DARC in in-distribution and out-of-distribution generalization.
Conclusion: Explicit channel mixing via MLPs enhances robustness and generalizability, making DAMP promising for neural program synthesis.
Abstract: We investigate the impact of channel-wise mixing via multi-layer perceptrons (MLPs) on the generalization capabilities of recurrent convolutional networks. Specifically, we compare two architectures: DARC (Depth Aware Recurrent Convolution), which employs a simple recurrent convolutional structure, and DAMP (Depth Aware Multi-layer Perceptron), which extends DARC with a gated MLP for channel mixing. Using the Re-ARC benchmark, we find that DAMP significantly outperforms DARC in both in-distribution and out-of-distribution generalization under exact-match grading criteria. These results suggest that explicit channel mixing through MLPs enables recurrent convolutional networks to learn more robust and generalizable computational patterns. Our findings have implications for neural program synthesis and highlight the potential of DAMP as a target architecture for hypernetwork approaches.
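A plausible reading of DAMP's channel-mixing block (the abstract does not specify the exact design, so the 1x1-convolution gated form below is an assumption): a per-position gated MLP mixes channels after each recurrent convolution step.

```python
import torch
import torch.nn as nn

class GatedChannelMLP(nn.Module):
    """Channel-wise gated MLP: 1x1 convolutions mix channels per spatial position."""
    def __init__(self, channels, hidden_mult=4):
        super().__init__()
        h = channels * hidden_mult
        self.up_gate = nn.Conv2d(channels, 2 * h, kernel_size=1)  # 1x1 = channel-wise MLP
        self.down = nn.Conv2d(h, channels, kernel_size=1)

    def forward(self, x):                       # x: (B, C, H, W)
        u, g = self.up_gate(x).chunk(2, dim=1)
        return self.down(u * torch.sigmoid(g))  # gated channel mixing
```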
[304] Comparative study of machine learning and statistical methods for automatic identification and quantification in γ-ray spectrometry
Dinh Triem Phan, Jérôme Bobin, Cheick Thiam, Christophe Bobin
Main category: cs.LG
TL;DR: The paper proposes an open-source benchmark for evaluating numerical methods in γ-ray spectrometry, comparing machine learning and statistical approaches across three scenarios. The statistical method outperforms machine learning in identification but is sensitive to modeling errors.
Details
Motivation: The lack of common benchmarks for evaluating numerical methods in γ-ray spectrometry makes comparison difficult. The paper aims to address this gap.Method: An open-source benchmark with simulated datasets, codes, and metrics is used to compare end-to-end machine learning and statistical unmixing approaches under three scenarios: known, deformed, and shifted spectral signatures.
Result: The statistical approach consistently outperforms machine learning in identification but is less robust to modeling errors. Machine learning is viable under uncertain conditions. For quantification, the statistical method is more accurate.
Conclusion: The statistical approach is best for known or well-modeled signatures, while machine learning is suitable for uncertain conditions. The benchmark facilitates future comparisons.
Abstract: During the last decade, a large number of different numerical methods have been proposed to tackle the automatic identification and quantification in γ-ray spectrometry. However, the lack of common benchmarks, including datasets, code and comparison metrics, makes their evaluation and comparison hard. In that context, we propose an open-source benchmark that comprises simulated datasets of various γ-spectrometry settings, codes of different analysis approaches and evaluation metrics. This allows us to compare the state-of-the-art end-to-end machine learning with a statistical unmixing approach using the full spectrum. Three scenarios have been investigated: (1) spectral signatures are assumed to be known; (2) spectral signatures are deformed due to physical phenomena such as Compton scattering and attenuation; and (3) spectral signatures are shifted (e.g., due to temperature variation). A large dataset of 200,000 simulated spectra containing nine radionuclides with an experimental natural background is used for each scenario with multiple radionuclides present in the spectrum. Regarding identification performance, the statistical approach consistently outperforms the machine learning approaches across all three scenarios for all comparison metrics. However, the performance of the statistical approach can be significantly impacted when spectral signatures are not modeled correctly. Consequently, the full-spectrum statistical approach is most effective with known or well-modeled spectral signatures, while end-to-end machine learning is a good alternative when measurement conditions are uncertain for radionuclide identification. Concerning the quantification task, the statistical approach provides accurate estimates of radionuclide counting, while the machine learning methods deliver less satisfactory results.
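As a simplified stand-in for the full-spectrum unmixing idea (the paper's statistical method is more involved, typically likelihood-based), per-radionuclide counts can be estimated by non-negative least squares against known spectral signatures; the data below are synthetic.

```python
import numpy as np
from scipy.optimize import nnls

n_channels, n_nuclides = 1024, 9
rng = np.random.default_rng(0)
A = rng.random((n_channels, n_nuclides))           # columns: spectral signatures
true_counts = rng.uniform(0, 100, n_nuclides)
spectrum = A @ true_counts + rng.normal(0, 0.5, n_channels)  # measured spectrum

counts_hat, _ = nnls(A, spectrum)                  # non-negative counting estimates
print(np.round(counts_hat, 1))
```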
[305] Weather-Driven Agricultural Decision-Making Using Digital Twins Under Imperfect Conditions
Tamim Ahmed, Monowar Hasan
Main category: cs.LG
TL;DR: Digital twin tech improves agricultural decision-making by detecting weather data inconsistencies via the Cerealia framework, tested on real-world data.
Details
Motivation: Enhance data-driven decision-making in agriculture by addressing inconsistencies in weather data.Method: Developed Cerealia, a modular framework using neural networks for anomaly detection, tested on NVIDIA Jetson Orin with real-world and public datasets.
Result: Cerealia successfully detects inconsistencies in weather data, aiding informed decision-making.
Conclusion: Digital twins and Cerealia offer practical solutions for improving agricultural data reliability and automation.
Abstract: By offering a dynamic, real-time virtual representation of physical systems, digital twin technology can enhance data-driven decision-making in digital agriculture. Our research shows how digital twins are useful for detecting inconsistencies in agricultural weather data measurements, which are key attributes for various agricultural decision-making and automation tasks. We develop a modular framework named Cerealia that allows end-users to check for data inconsistencies when perfect weather feeds are unavailable. Cerealia uses neural network models to check anomalies and aids end-users in informed decision-making. We develop a prototype of Cerealia using the NVIDIA Jetson Orin platform and test it with an operational weather network established in a commercial orchard as well as publicly available weather datasets.
[306] HSA-Net: Hierarchical and Structure-Aware Framework for Efficient and Scalable Molecular Language Modeling
Zihang Shao, Wentao Lei, Lei Wang, Wencai Ye, Li Liu
Main category: cs.LG
TL;DR: HSA-Net addresses the global-local trade-off in molecular representation learning by combining cross-attention and Graph-Mamba projectors hierarchically, outperforming SOTA methods.
Details
Motivation: GNNs suffer from over-smoothing in deep layers, and existing methods fail to balance global and local feature preservation.Method: Proposes HSA-Net with Hierarchical Adaptive Projector (HAP) for layer-specific feature projection and Source-Aware Fusion (SAF) for adaptive feature merging.
Result: HSA-Net outperforms current SOTA methods in experiments.
Conclusion: HSA-Net effectively resolves the global-local trade-off, enhancing molecular representation learning.
Abstract: Molecular representation learning, a cornerstone for downstream tasks like molecular captioning and molecular property prediction, heavily relies on Graph Neural Networks (GNNs). However, GNNs suffer from the over-smoothing problem, where node-level features collapse in deep layers. While existing feature projection methods with cross-attention have been introduced to mitigate this issue, they still perform poorly on deep-layer features. This motivated our exploration of using Mamba as an alternative projector for its ability to handle complex sequences. However, we observe that while Mamba excels at preserving global topological information from deep layers, it neglects fine-grained details in shallow layers. The capabilities of Mamba and cross-attention exhibit a global-local trade-off. To resolve this critical global-local trade-off, we propose Hierarchical and Structure-Aware Network (HSA-Net), a novel framework with two modules that enables a hierarchical feature projection and fusion. Firstly, a Hierarchical Adaptive Projector (HAP) module is introduced to process features from different graph layers. It learns to dynamically switch between a cross-attention projector for shallow layers and a structure-aware Graph-Mamba projector for deep layers, producing high-quality, multi-level features. Secondly, to adaptively merge these multi-level features, we design a Source-Aware Fusion (SAF) module, which flexibly selects fusion experts based on the characteristics of the aggregated features, ensuring a precise and effective final representation fusion. Extensive experiments demonstrate that our HSA-Net framework quantitatively and qualitatively outperforms current state-of-the-art (SOTA) methods.
[307] REINA: Regularized Entropy Information-Based Loss for Efficient Simultaneous Speech Translation
Nameer Hirschkind, Joseph Liu, Xiao Yu, Mahesh Kumar Nandwana
Main category: cs.LG
TL;DR: REINA is a novel loss function for Simultaneous Speech Translation (SimulST) that optimizes the tradeoff between translation quality and latency by waiting for more input only when beneficial. It achieves SOTA results and improves efficiency by up to 21%.
Details
Motivation: SimulST systems struggle with balancing translation quality and latency. The goal is to optimize this tradeoff by dynamically deciding when to wait for more input.Method: Introduces REINA, a loss function derived from information theory, to train an adaptive policy using a non-streaming translation model. Evaluated on French, Spanish, and German translations.
Result: Achieves SOTA streaming results with comparable model sizes and improves the latency/quality tradeoff by up to 21%.
Conclusion: REINA effectively optimizes the latency/quality tradeoff in SimulST, demonstrating significant improvements over prior methods.
Abstract: Simultaneous Speech Translation (SimulST) systems stream in audio while simultaneously emitting translated text or speech. Such systems face the significant challenge of balancing translation quality and latency. We introduce a strategy to optimize this tradeoff: wait for more input only if you gain information by doing so. Based on this strategy, we present Regularized Entropy INformation Adaptation (REINA), a novel loss to train an adaptive policy using an existing non-streaming translation model. We derive REINA from information theory principles and show that REINA helps push the reported Pareto frontier of the latency/quality tradeoff over prior works. Utilizing REINA, we train a SimulST model on French, Spanish and German, both from and into English. Training only on open-source or synthetically generated data, we achieve state-of-the-art (SOTA) streaming results for models of comparable size. We also introduce a metric for streaming efficiency, quantitatively showing REINA improves the latency/quality trade-off by as much as 21% compared to prior approaches, normalized against non-streaming baseline BLEU scores.
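A hedged illustration of the guiding strategy ("wait for more input only if you gain information"), not the actual REINA loss: at training time, where the full audio is available, the entropy drop of the translator's next-token distribution after extra audio measures the information gained by waiting, a signal that can supervise an adaptive wait/emit policy.

```python
import torch

def entropy(probs, eps=1e-9):
    """Shannon entropy of a probability distribution over the vocabulary."""
    return -(probs * (probs + eps).log()).sum(-1)

def information_gain_from_waiting(p_now, p_after_wait):
    """Both args: next-token distributions (softmax outputs, shape (V,))."""
    return entropy(p_now) - entropy(p_after_wait)

# Hypothetical example: the distribution sharpens after hearing more audio.
p_now = torch.softmax(torch.randn(100), dim=-1)
p_after = torch.softmax(torch.randn(100) * 3, dim=-1)
print(information_gain_from_waiting(p_now, p_after))  # > 0 suggests waiting helps
```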
[308] SHeRL-FL: When Representation Learning Meets Split Learning in Hierarchical Federated Learning
Dung T. Tran, Nguyen B. Ha, Van-Dinh Nguyen, Kok-Seng Wong
Main category: cs.LG
TL;DR: SHeRL-FL integrates split learning and hierarchical FL to reduce communication costs and coordination complexity, improving scalability and efficiency in federated learning.
Details
Motivation: Existing FL frameworks overlook computational heterogeneity and high communication costs, while hybrid methods like SplitFed introduce training complexity.Method: SHeRL-FL combines split learning and hierarchical model aggregation with representation learning at intermediate layers, enabling independent computation by clients and edge servers.
Result: Experiments show SHeRL-FL reduces data transmission by over 90% compared to centralized FL and HierFL, and by 50% compared to SplitFed, while improving performance.
Conclusion: SHeRL-FL effectively addresses communication and coordination challenges in FL, offering a scalable and efficient solution for collaborative model training.
Abstract: Federated learning (FL) is a promising approach for addressing scalability and latency issues in large-scale networks by enabling collaborative model training without requiring the sharing of raw data. However, existing FL frameworks often overlook the computational heterogeneity of edge clients and the growing training burden on resource-limited devices. FL also suffers from high communication costs and complex model aggregation, especially with large models. Previous works combine split learning (SL) and hierarchical FL (HierFL) to reduce device-side computation and improve scalability, but this introduces training complexity due to coordination across tiers. To address these issues, we propose SHeRL-FL, which integrates SL and hierarchical model aggregation and incorporates representation learning at intermediate layers. By allowing clients and edge servers to compute training objectives independently of the cloud, SHeRL-FL significantly reduces both coordination complexity and communication overhead. To evaluate the effectiveness and efficiency of SHeRL-FL, we performed experiments on image classification tasks using CIFAR-10, CIFAR-100, and HAM10000 with AlexNet, ResNet-18, and ResNet-50 in both IID and non-IID settings. In addition, we evaluate performance on image segmentation tasks using the ISIC-2018 dataset with a ResNet-50-based U-Net. Experimental results demonstrate that SHeRL-FL reduces data transmission by over 90% compared to centralized FL and HierFL, and by 50% compared to SplitFed (a hybrid of FL and SL), while further improving on hierarchical split learning methods.
[309] Combat Urban Congestion via Collaboration: Heterogeneous GNN-based MARL for Coordinated Platooning and Traffic Signal Control
Xianyue Peng, Shenyang Chen, Hang Gao, Hao Wang, H. Michael Zhang
Main category: cs.LG
TL;DR: The paper proposes a heterogeneous graph multi-agent reinforcement learning approach to jointly control traffic signals and vehicle platooning, addressing coordination challenges and improving traffic flow.
Details
Motivation: To tackle the challenges of jointly controlling traffic signals and vehicle platooning in real-time, including heterogeneity and coordination issues.Method: Uses distinct RL agents for platoon and signal control, integrates graph neural networks for coordination, and employs alternating optimization for training.
Result: SUMO simulations show improved travel time and fuel consumption, outperforming other adaptive signal control methods.
Conclusion: The proposed method effectively addresses coordination challenges and enhances traffic flow through joint control of signals and platooning.
Abstract: Over the years, reinforcement learning has emerged as a popular approach to develop signal control and vehicle platooning strategies either independently or in a hierarchical way. However, jointly controlling both in real-time to alleviate traffic congestion presents new challenges, such as the inherent physical and behavioral heterogeneity between signal control and platooning, as well as coordination between them. This paper proposes an innovative solution to tackle these challenges based on heterogeneous graph multi-agent reinforcement learning and traffic theories. Our approach involves: 1) designing platoon and signal control as distinct reinforcement learning agents with their own set of observations, actions, and reward functions to optimize traffic flow; 2) designing coordination by incorporating graph neural networks within multi-agent reinforcement learning to facilitate seamless information exchange among agents on a regional scale; 3) applying alternating optimization for training, allowing agents to update their own policies and adapt to other agents’ policies. We evaluate our approach through SUMO simulations, which show convergent results in terms of both travel time and fuel consumption, and superior performance compared to other adaptive signal control methods.
[310] Fuzzy-Pattern Tsetlin Machine
Artem Hnilov
Main category: cs.LG
TL;DR: The paper introduces the Fuzzy-Pattern Tsetlin Machine (FPTM), a variant of the Tsetlin Machine that uses fuzzy clause evaluation to improve efficiency, reduce resource usage, and enhance accuracy.
Details
Motivation: The strict 'all-or-nothing' clause evaluation in traditional Tsetlin Machines requires excessive clauses for competitive accuracy, motivating a more flexible approach.Method: FPTM employs fuzzy clause evaluation, allowing partial contributions from literals, reducing the need for numerous clauses and improving adaptability.
Result: FPTM achieves significant improvements in accuracy, resource efficiency, and speed, outperforming traditional TMs and other models on various datasets.
Conclusion: FPTM offers a scalable, efficient, and accurate alternative to traditional Tsetlin Machines, suitable for resource-constrained environments.
Abstract: The “all-or-nothing” clause evaluation strategy is a core mechanism in the Tsetlin Machine (TM) family of algorithms. In this approach, each clause - a logical pattern composed of binary literals mapped to input data - is disqualified from voting if even a single literal fails. Due to this strict requirement, standard TMs must employ thousands of clauses to achieve competitive accuracy. This paper introduces the Fuzzy-Pattern Tsetlin Machine (FPTM), a novel variant where clause evaluation is fuzzy rather than strict. If some literals in a clause fail, the remaining ones can still contribute to the overall vote with a proportionally reduced score. As a result, each clause effectively consists of sub-patterns that adapt individually to the input, enabling more flexible, efficient, and robust pattern matching. The proposed fuzzy mechanism significantly reduces the required number of clauses, memory footprint, and training time, while simultaneously improving accuracy. On the IMDb dataset, FPTM achieves 90.15% accuracy with only one clause per class, a 50x reduction in clauses and memory over the Coalesced Tsetlin Machine. FPTM trains up to 316x faster (45 seconds vs. 4 hours) and fits within 50 KB, enabling online learning on microcontrollers. Inference throughput reaches 34.5 million predictions/second (51.4 GB/s). On Fashion-MNIST, accuracy reaches 92.18% (2 clauses), 93.19% (20 clauses) and 94.68% (8000 clauses), a ~400x clause reduction compared to the Composite TM’s 93.00% (8000 clauses). On the Amazon Sales dataset with 20% noise, FPTM achieves 85.22% accuracy, significantly outperforming the Graph Tsetlin Machine (78.17%) and a Graph Convolutional Neural Network (66.23%).
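A minimal sketch of the fuzzy clause evaluation described above (the voting and feedback machinery of a full Tsetlin Machine is omitted): instead of disqualifying a clause when any literal fails, score it by the fraction of its included literals that match.

```python
import numpy as np

def fuzzy_clause_score(x, include_mask, negate_mask):
    """x: binary input vector; masks select which literals the clause uses."""
    literals = np.where(negate_mask, 1 - x, x)[include_mask.astype(bool)]
    if literals.size == 0:
        return 0.0
    matched = literals.sum()
    # strict TM: score is 1.0 only if matched == literals.size, else 0
    return matched / literals.size             # fuzzy: proportional contribution

x = np.array([1, 0, 1, 1])
include = np.array([1, 1, 0, 1])
negate = np.array([0, 1, 0, 0])                # second literal is NOT x[1]
print(fuzzy_clause_score(x, include, negate))  # all 3 used literals match -> 1.0
```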
[311] Scaled-Dot-Product Attention as One-Sided Entropic Optimal Transport
Elon Litman
Main category: cs.LG
TL;DR: The paper justifies the scaled-dot-product attention (SDPA) mechanism from first principles, linking it to Entropic Optimal Transport (EOT) and revealing its forward pass as an optimal inference step and backward pass as a policy gradient update.
Details
Motivation: To provide a principled mathematical foundation for SDPA, often heuristically motivated, by connecting it to EOT and reinforcement learning.Method: The authors show SDPA’s forward pass solves a degenerate EOT problem and prove its backward pass gradient matches an advantage-based policy gradient. They analyze the induced information geometry.
Result: SDPA’s forward pass is optimal inference, and its backward pass is a natural, manifold-aware learning update due to the EOT formulation.
Conclusion: The work unifies SDPA as a principled mechanism with forward inference and backward learning grounded in optimization and geometry.
Abstract: The scaled-dot-product attention (SDPA) mechanism is a core component of modern deep learning, but its mathematical form is often motivated by heuristics. This work provides a first-principles justification for SDPA. We first show that the attention forward pass is the exact solution to a degenerate, one-sided Entropic Optimal Transport (EOT) problem, which seeks a distribution that maximizes similarity while being maximally entropic. This optimization perspective has a direct consequence for the backward pass. We prove that the standard gradient computed via backpropagation is mathematically identical to an advantage-based policy gradient, a variance-reduced update rule from reinforcement learning. Crucially, we demonstrate that the EOT formulation of the forward pass induces a specific information geometry on the space of attention distributions. It is this geometry, characterized by the Fisher Information Matrix, that dictates the precise form of the learning gradient, revealing the advantage-based update as a natural consequence of the optimization problem being solved. This unified view reveals SDPA as a principled mechanism where the forward pass performs optimal inference and the backward pass implements a rational, manifold-aware learning update.
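The forward-pass claim can be stated compactly. A sketch of the degenerate one-sided EOT problem and its closed-form solution, in assumed standard SDPA notation (scores $s_j = q^\top k_j / \sqrt{d_k}$, simplex $\Delta^{T-1}$): entropy-regularized similarity maximization over the simplex recovers the softmax row.

```latex
\begin{aligned}
p^\star &= \arg\max_{p \in \Delta^{T-1}} \ \sum_{j} p_j s_j + H(p),
\qquad H(p) = -\sum_{j} p_j \log p_j,\\
p^\star_j &= \frac{\exp(s_j)}{\sum_{j'} \exp(s_{j'})}
\;=\; \operatorname{softmax}(s)_j .
\end{aligned}
```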
[312] Regret minimization in Linear Bandits with offline data via extended D-optimal exploration
Sushant Vijayan, Arun Suggala, Karthikeyan VS, Soumyabrata Pal
Main category: cs.LG
TL;DR: OOPE incorporates offline data into phased elimination via an extended D-optimal design, substantially reducing online regret in linear bandits, with new minimax lower bounds that depend on offline data quality.
Details
Motivation: Extensive offline data from the underlying bandit model is often available in applications such as recommendation systems and online advertising, and should be leveraged to reduce online regret.Method: Offline-Online Phased Elimination (OOPE) incorporates offline observations through an extended D-optimal design within each exploration phase; a Frank-Wolfe approximation to the design further refines the bound.
Result: OOPE achieves online regret $\tilde{O}(\sqrt{d_{\mathrm{eff}} T \log(|\mathcal{A}|T)} + d^2)$, where the effective dimension $d_{\mathrm{eff}} \leq d$ reflects offline data quality; matching minimax lower bounds are provided.
Conclusion: OOPE is optimal in regimes where offline data is either well-explored or poorly explored, and prudent use of offline data substantially reduces online regret.
Abstract: We consider the problem of online regret minimization in linear bandits with access to prior observations (offline data) from the underlying bandit model. There are numerous applications where extensive offline data is available, such as recommendation systems and online advertising; consequently, this problem has been studied intensively in recent literature. Our algorithm, Offline-Online Phased Elimination (OOPE), effectively incorporates the offline data to substantially reduce the online regret compared to prior work. To leverage offline information prudently, OOPE uses an extended D-optimal design within each exploration phase. OOPE achieves an online regret of $\tilde{O}(\sqrt{d_{\mathrm{eff}} T \log(|\mathcal{A}|T)} + d^2)$, where $d_{\mathrm{eff}} \leq d$ is the effective problem dimension, which measures the number of poorly explored directions in the offline data and depends on the eigen-spectrum $(\lambda_k)_{k \in [d]}$ of the Gram matrix of the offline data. The eigen-spectrum $(\lambda_k)_{k \in [d]}$ is a quantitative measure of the \emph{quality} of the offline data. If the offline data is poorly explored ($d_{\mathrm{eff}} \approx d$), we recover the established regret bounds for the purely online setting, while when the offline data is abundant ($T_{\mathrm{off}} \gg T$) and well-explored ($d_{\mathrm{eff}} = o(1)$), the online regret reduces substantially. Additionally, we provide the first known minimax regret lower bounds in this setting that depend explicitly on the quality of the offline data. These lower bounds establish the optimality of our algorithm in regimes where the offline data is either well-explored or poorly explored. Finally, by using a Frank-Wolfe approximation to the extended optimal design, we further improve the $O(d^{2})$ term to $O\big(\tfrac{d^{2}}{d_{\mathrm{eff}}} \min\{d_{\mathrm{eff}},1\}\big)$, which can be substantial in high dimensions with moderate offline data quality ($d_{\mathrm{eff}} = \Omega(1)$).
[313] DynaSwarm: Dynamically Graph Structure Selection for LLM-based Multi-agent System
Hui Yi Leong, Yuqing Wu
Main category: cs.LG
TL;DR: DynaSwarm is a dynamic multi-agent system (MAS) framework that uses reinforcement learning and adaptive graph selection to improve performance over static designs.
Details
Motivation: Current MAS frameworks use static collaboration graphs, limiting adaptability and performance.Method: DynaSwarm employs an actor-critic RL mechanism for stable graph optimization and a dynamic graph selector for adaptive routing. It also fine-tunes a demonstration retriever for in-context learning.
Result: DynaSwarm outperforms state-of-the-art single-agent and MAS baselines in tasks like question answering, math reasoning, and coding.
Conclusion: Sample-aware structural flexibility is crucial for effective LLM-based MAS designs.
Abstract: Current multi-agent systems (MAS) frameworks often rely on manually designed and static collaboration graph structures, limiting adaptability and performance. To address these limitations, we propose DynaSwarm, a dynamic framework that enhances LLM-based MAS through two key innovations: (1) an actor-critic reinforcement learning (A2C) mechanism to optimize graph structures with improved stability over prior RL methods, and (2) a dynamic graph selector that adaptively chooses the optimal graph structure for each input sample via parameter-efficient LLM fine-tuning. DynaSwarm eliminates the need for rigid, one-size-fits-all graph architectures, instead leveraging sample-specific idiosyncrasies to dynamically route queries through specialized agent networks. In addition, we fine-tune the demonstration retriever to fully exploit the power of in-context learning (ICL). Extensive experiments on question answering, mathematical reasoning, and coding tasks demonstrate that DynaSwarm consistently outperforms state-of-the-art single-agent and MAS baselines across multiple LLM backbones. Our findings highlight the importance of sample-aware structural flexibility in LLM MAS designs.
[314] Fast weight programming and linear transformers: from machine learning to neurobiology
Kazuki Irie, Samuel J. Gershman
Main category: cs.LG
TL;DR: The paper reviews Fast Weight Programmers (FWPs), a type of RNN with 2D matrix-form hidden states, highlighting their dynamic synaptic weights, computational traits, and links to transformers, state space models, and brain synaptic plasticity.
Details
Motivation: To explore and explain the technical foundations and computational properties of FWPs, emphasizing their role as dynamic memory systems and their connections to broader AI and neuroscience concepts.Method: Review of FWPs, including their architecture (2D-state RNNs), dynamic weight programming by a trained network, and comparisons to transformers and state space models.
Result: FWPs demonstrate unique computational capabilities as dynamic memory systems, with parallels to transformers and state space models, and potential insights into brain-like synaptic plasticity.
Conclusion: FWPs represent a significant advancement in neural networks, bridging artificial intelligence and neuroscience, with implications for future AI models and understanding natural intelligence.
Abstract: Recent advances in artificial neural networks for machine learning, and language modeling in particular, have established a family of recurrent neural network (RNN) architectures that, unlike conventional RNNs with vector-form hidden states, use two-dimensional (2D) matrix-form hidden states. Such 2D-state RNNs, known as Fast Weight Programmers (FWPs), can be interpreted as a neural network whose synaptic weights (called fast weights) dynamically change over time as a function of input observations, and serve as short-term memory storage; corresponding synaptic weight modifications are controlled or programmed by another network (the programmer) whose parameters are trained (e.g., by gradient descent). In this Primer, we review the technical foundations of FWPs, their computational characteristics, and their connections to transformers and state space models. We also discuss connections between FWPs and models of synaptic plasticity in the brain, suggesting a convergence of natural and artificial intelligence.
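A compact sketch of one FWP step as reviewed above (a purely additive Hebbian variant; real FWPs often use gated or delta-rule updates): a slow "programmer" network emits key/value/query vectors, and the fast weight matrix W, the 2D hidden state, is updated by an outer-product rule and read out by the query.

```python
import torch
import torch.nn as nn

class FWPCell(nn.Module):
    def __init__(self, d_in, d_k, d_v):
        super().__init__()
        self.to_kvq = nn.Linear(d_in, d_k + d_v + d_k)  # slow weights (trained)
        self.d_k, self.d_v = d_k, d_v

    def forward(self, x, W):
        """x: (d_in,); W: (d_k, d_v) fast weights. Returns (output, new W)."""
        k, v, q = self.to_kvq(x).split([self.d_k, self.d_v, self.d_k])
        W = W + torch.outer(k, v)      # program the fast weights (Hebbian update)
        return q @ W, W                # read out short-term memory with the query
```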
[315] Differentiable Cyclic Causal Discovery Under Unmeasured Confounders
Muralikrishnna G. Sethuraman, Faramarz Fekri
Main category: cs.LG
TL;DR: DCCD-CONF is a novel framework for learning nonlinear cyclic causal graphs with unmeasured confounders, outperforming existing methods in accuracy and scalability.
Details
Motivation: Real-world systems often violate assumptions of fully observed variables and acyclic graphs, especially in fields like biology. Existing methods are limited by linearity assumptions or scalability issues.Method: DCCD-CONF alternates between optimizing graph structure and estimating confounder distribution by maximizing log-likelihood of interventional data.
Result: Experiments on synthetic and real-world gene data show DCCD-CONF outperforms state-of-the-art methods in causal graph recovery and confounder identification.
Conclusion: The framework provides consistency guarantees and addresses key limitations in causal discovery, making it theoretically sound and practically effective.
Abstract: Understanding causal relationships between variables is fundamental across scientific disciplines. Most causal discovery algorithms rely on two key assumptions: (i) all variables are observed, and (ii) the underlying causal graph is acyclic. While these assumptions simplify theoretical analysis, they are often violated in real-world systems, such as biological networks. Existing methods that account for confounders either assume linearity or struggle with scalability. To address these limitations, we propose DCCD-CONF, a novel framework for differentiable learning of nonlinear cyclic causal graphs in the presence of unmeasured confounders using interventional data. Our approach alternates between optimizing the graph structure and estimating the confounder distribution by maximizing the log-likelihood of the data. Through experiments on synthetic data and real-world gene perturbation datasets, we show that DCCD-CONF outperforms state-of-the-art methods in both causal graph recovery and confounder identification. Additionally, we provide consistency guarantees for our framework, reinforcing its theoretical soundness.
[316] Enhanced Liver Tumor Detection in CT Images Using 3D U-Net and Bat Algorithm for Hyperparameter Optimization
Nastaran Ghorbani, Bitasadat Jamshidi, Mohsen Rostamy-Malkhalifeh
Main category: cs.LG
TL;DR: A novel method for liver tumor segmentation in CT images combines 3D U-Net with the Bat Algorithm for hyperparameter optimization, improving accuracy and robustness.
Details
Motivation: Early detection of liver cancer is critical, and automated segmentation can enhance diagnostic precision.Method: Integration of 3D U-Net architecture with the Bat Algorithm to optimize hyperparameters like learning rate and batch size.
Result: High F1-score at lower prediction thresholds, balancing precision and recall, suitable for clinical diagnostics.
Conclusion: The synergy of deep learning and metaheuristic optimization offers an effective solution for complex medical image segmentation.
Abstract: Liver cancer is one of the most prevalent and lethal forms of cancer, making early detection crucial for effective treatment. This paper introduces a novel approach for automated liver tumor segmentation in computed tomography (CT) images by integrating a 3D U-Net architecture with the Bat Algorithm for hyperparameter optimization. The method enhances segmentation accuracy and robustness by intelligently optimizing key parameters like the learning rate and batch size. Evaluated on a publicly available dataset, our model demonstrates a strong ability to balance precision and recall, with a high F1-score at lower prediction thresholds. This is particularly valuable for clinical diagnostics, where ensuring no potential tumors are missed is paramount. Our work contributes to the field of medical image analysis by demonstrating that the synergy between a robust deep learning architecture and a metaheuristic optimization algorithm can yield a highly effective solution for complex segmentation tasks.
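A heavily simplified sketch of Bat Algorithm hyperparameter search (the paper's exact variant, search ranges, and the loudness/pulse-rate schedules are omitted or assumed): bats move through a (log10 learning rate, batch size) space with frequency-tuned velocities pulled toward the best solution found so far. In practice, `objective` would train and validate the 3D U-Net and batch sizes would be rounded to integers.

```python
import numpy as np

rng = np.random.default_rng(42)

def objective(pos):
    """Stand-in for validation loss at these hyperparameters (hypothetical)."""
    log_lr, batch = pos
    return (log_lr + 3.0) ** 2 + 0.01 * (batch - 8) ** 2

lo, hi = np.array([-5.0, 2.0]), np.array([-1.0, 32.0])   # search bounds
n_bats, iters, f_min, f_max = 10, 50, 0.0, 2.0
pos = rng.uniform(lo, hi, (n_bats, 2))
vel = np.zeros_like(pos)
best = pos[np.argmin([objective(p) for p in pos])].copy()

for _ in range(iters):
    freq = f_min + (f_max - f_min) * rng.random((n_bats, 1))
    vel += (best - pos) * freq           # pull toward the global best (simplified sign convention)
    pos = np.clip(pos + vel, lo, hi)
    for i in range(n_bats):
        cand = np.clip(best + 0.05 * rng.normal(size=2), lo, hi)  # local random walk
        if objective(cand) < objective(pos[i]):
            pos[i] = cand
        if objective(pos[i]) < objective(best):
            best = pos[i].copy()

print("best (log10 lr, batch size):", best)
```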
[317] Discrete Diffusion-Based Model-Level Explanation of Heterogeneous GNNs with Node Features
Pallabee Das, Stefan Heindorf
Main category: cs.LG
TL;DR: DiGNNExplainer is a model-level explanation method for heterogeneous graphs, generating realistic node features via discrete denoising diffusion, outperforming existing methods in faithfulness and realism.
Details
Motivation: Existing HGNN explanation methods lack support for realistic node features and fail to provide faithful explanations, limiting their practical utility.Method: DiGNNExplainer uses discrete denoising diffusion to synthesize heterogeneous graphs with realistic node features, addressing the limitation of continuous-space approaches.
Result: DiGNNExplainer outperforms state-of-the-art methods, producing explanations that are both realistic and faithful to the model’s decisions.
Conclusion: DiGNNExplainer effectively addresses gaps in HGNN explanations, offering a practical solution for interpretable heterogeneous graph analysis.
Abstract: Many real-world datasets, such as citation networks, social networks, and molecular structures, are naturally represented as heterogeneous graphs, where nodes belong to different types and have additional features. For example, in a citation network, nodes representing “Paper” or “Author” may include attributes like keywords or affiliations. A critical machine learning task on these graphs is node classification, which is useful for applications such as fake news detection, corporate risk assessment, and molecular property prediction. Although Heterogeneous Graph Neural Networks (HGNNs) perform well in these contexts, their predictions remain opaque. Existing post-hoc explanation methods lack support for actual node features beyond one-hot encoding of node type and often fail to generate realistic, faithful explanations. To address these gaps, we propose DiGNNExplainer, a model-level explanation approach that synthesizes heterogeneous graphs with realistic node features via discrete denoising diffusion. In particular, we generate realistic discrete features (e.g., bag-of-words features) using diffusion models within a discrete space, whereas previous approaches are limited to continuous spaces. We evaluate our approach on multiple datasets and show that DiGNNExplainer produces explanations that are realistic and faithful to the model’s decision-making, outperforming state-of-the-art methods.
[318] Sparse Partial Optimal Transport via Quadratic Regularization
Khang Tran, Khoa Nguyen, Anh Nguyen, Thong Huynh, Son Pham, Sy-Hoang Nguyen-Dang, Manh Pham, Bang Vo, Mai Ngoc Tran, Dung Luong
Main category: cs.LG
TL;DR: The paper introduces Quadratic Regularized Partial Optimal Transport (QPOT), a novel formulation that sparsifies transport plans, addressing the limitations of dense plans from entropic regularization in POT.
Details
Motivation: Existing POT solvers use entropic regularization, leading to dense transport plans, which are unsuitable for applications requiring sparsity. QPOT aims to provide a sparser alternative.Method: The authors propose QPOT, a quadratic regularized formulation of POT, to induce sparsity in transport plans.
Result: Experiments on synthetic and real-world datasets (e.g., CIFAR-10, color transfer, domain adaptation) show QPOT achieves improved sparsity and performance.
Conclusion: QPOT offers a sparser and effective alternative to entropic POT, enhancing its applicability in sparsity-favoring scenarios.
Abstract: Partial Optimal Transport (POT) has recently emerged as a central tool in various Machine Learning (ML) applications. It lifts the stringent assumption of the conventional Optimal Transport (OT) that input measures are of equal masses, which is often not guaranteed in real-world datasets, and thus offers greater flexibility by permitting transport between unbalanced input measures. Nevertheless, existing major solvers for POT commonly rely on entropic regularization for acceleration and thus return dense transport plans, hindering the adoption of POT in various applications that favor sparsity. In this paper, as an alternative approach to the entropic POT formulation in the literature, we propose a novel formulation of POT with quadratic regularization, hence termed quadratic regularized POT (QPOT), which induces sparsity to the transport plan and consequently facilitates the adoption of POT in many applications with sparsity requirements. Extensive experiments on synthetic and CIFAR-10 datasets, as well as real-world applications such as color transfer and domain adaptation, consistently demonstrate the improved sparsity and favorable performance of our proposed QPOT formulation.
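A plausible form of the objective (the paper's exact constraints may differ; notation assumed): transport at most mass $s$ between measures $a$ and $b$ under cost $C$, with a quadratic penalty in place of the usual entropy, which drives many entries of the plan $P$ to exactly zero.

```latex
\min_{P \ge 0}\; \langle P, C\rangle + \frac{\lambda}{2}\,\|P\|_F^2
\quad \text{s.t.}\quad P\mathbf{1} \le a,\;\; P^\top\mathbf{1} \le b,\;\;
\mathbf{1}^\top P\mathbf{1} = s .
```

Unlike the entropic penalty $\sum_{ij} P_{ij}(\log P_{ij} - 1)$, whose gradient diverges at zero and so forces strictly positive plans, the quadratic term permits exact zeros.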
[319] Biased Local SGD for Efficient Deep Learning on Heterogeneous Systems
Jihyun Lim, Junhyuk Jo, Chanhyeok Ko, Young Min Go, Jimin Hwa, Sunwoo Lee
Main category: cs.LG
TL;DR: The paper proposes a system-aware local SGD method for efficient neural network training in heterogeneous computing environments, balancing workload and introducing controlled bias to outperform synchronous SGD.
Details
Motivation: Traditional methods like synchronous SGD with data parallelism are inefficient in heterogeneous systems, leading to underutilization of slower resources like CPUs.Method: A system-aware local SGD method allocates workloads proportionally to compute capacity and introduces controlled bias in data sampling and model aggregation.
Result: The method accelerates training in heterogeneous environments, achieving comparable or higher accuracy than synchronous SGD within the same time.
Conclusion: The strategy is adaptable to diverse heterogeneous environments, offering a scalable solution for efficient training.
Abstract: Most large-scale neural network training methods assume homogeneous parallel computing resources. For example, synchronous SGD with data parallelism, the most widely used parallel training strategy, incurs significant synchronization overhead when workers process their assigned data at different speeds. Consequently, in systems with heterogeneous compute resources, users often rely solely on the fastest components, such as GPUs, for training. In this work, we explore how to effectively use heterogeneous resources for neural network training. We propose a system-aware local stochastic gradient descent (local SGD) method that allocates workloads to each compute resource in proportion to its compute capacity. To make better use of slower resources such as CPUs, we intentionally introduce bias into data sampling and model aggregation. Our study shows that well-controlled bias can significantly accelerate local SGD in heterogeneous environments, achieving comparable or even higher accuracy than synchronous SGD with data-parallelism within the same time budget. This fundamental parallelization strategy can be readily extended to diverse heterogeneous environments, including cloud platforms and multi-node high-performance computing clusters.
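A small sketch of the capacity-proportional workload split described above (throughput numbers are hypothetical; the paper's bias-control mechanisms are not shown): each worker receives a share of the global batch proportional to its measured samples/second, so all workers finish a local SGD round at roughly the same time.

```python
def allocate_batches(global_batch, throughputs):
    """throughputs: worker -> samples/sec. Returns worker -> local batch size."""
    total = sum(throughputs.values())
    alloc = {w: round(global_batch * t / total) for w, t in throughputs.items()}
    # fix rounding so shares sum exactly to the global batch
    drift = global_batch - sum(alloc.values())
    fastest = max(throughputs, key=throughputs.get)
    alloc[fastest] += drift
    return alloc

print(allocate_batches(512, {"gpu0": 900.0, "gpu1": 850.0, "cpu": 120.0}))
```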
[320] M3-Net: A Cost-Effective Graph-Free MLP-Based Model for Traffic Prediction
Guangyin Jin, Sicong Lai, Xiaoshuai Hao, Mingtao Zhang, Jinlei Zhang
Main category: cs.LG
TL;DR: The paper introduces M3-Net, a graph-free MLP-based model for traffic prediction, addressing limitations of existing methods by using time series and spatio-temporal embeddings with a novel MLP-Mixer and MoE mechanism.
Details
Motivation: Existing deep learning methods for traffic prediction rely on complex structures or designs, hindering efficient deployment on large-scale datasets.Method: Proposes M3-Net, a cost-effective MLP-based model with time series and spatio-temporal embeddings, and a novel MLP-Mixer with MoE.
Result: Extensive experiments show M3-Net outperforms in prediction performance and lightweight deployment.
Conclusion: M3-Net offers a superior, efficient solution for traffic prediction without relying on complex structures.
Abstract: Achieving accurate traffic prediction is a fundamental and crucial task in the development of current intelligent transportation systems. Most of the mainstream methods that have made breakthroughs in traffic prediction rely on spatio-temporal graph neural networks, spatio-temporal attention mechanisms, etc. The main challenges of the existing deep learning approaches are that they either depend on a complete traffic network structure or require intricate model designs to capture complex spatio-temporal dependencies. These limitations pose significant challenges for the efficient deployment and operation of deep learning models on large-scale datasets. To address these challenges, we propose M3-Net, a cost-effective graph-free Multilayer Perceptron (MLP) based model for traffic prediction. Our proposed model not only employs time series and spatio-temporal embeddings for efficient feature processing but is also the first to introduce a novel MLP-Mixer architecture with a mixture of experts (MoE) mechanism. Extensive experiments conducted on multiple real datasets demonstrate the superiority of the proposed model in terms of prediction performance and lightweight deployment.
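A minimal sketch of one mixer block combining token (spatial) mixing with a soft mixture-of-experts channel MLP, assuming a (batch, nodes, features) traffic tensor; the layer sizes and softmax gating are our illustrative choices, not the published M3-Net configuration.

```python
# Sketch: MLP-Mixer block with a soft MoE channel MLP.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixerMoEBlock(nn.Module):
    def __init__(self, n_tokens, d_model, n_experts=4, hidden=64):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.token_mix = nn.Sequential(   # mixes information across tokens (nodes)
            nn.Linear(n_tokens, hidden), nn.GELU(), nn.Linear(hidden, n_tokens))
        self.norm2 = nn.LayerNorm(d_model)
        self.gate = nn.Linear(d_model, n_experts)     # soft MoE gating
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(d_model, hidden), nn.GELU(),
                           nn.Linear(hidden, d_model)) for _ in range(n_experts)])

    def forward(self, x):                 # x: (batch, n_tokens, d_model)
        y = self.norm1(x).transpose(1, 2)            # (batch, d_model, n_tokens)
        x = x + self.token_mix(y).transpose(1, 2)    # token-mixing residual
        z = self.norm2(x)
        probs = F.softmax(self.gate(z), dim=-1)                          # (B, T, E)
        expert_out = torch.stack([e(z) for e in self.experts], dim=-1)   # (B, T, D, E)
        return x + torch.einsum("btde,bte->btd", expert_out, probs)

block = MixerMoEBlock(n_tokens=207, d_model=32)   # e.g., 207 road sensors
print(block(torch.randn(8, 207, 32)).shape)       # torch.Size([8, 207, 32])
```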
[321] UQGNN: Uncertainty Quantification of Graph Neural Networks for Multivariate Spatiotemporal Prediction
Dahai Yu, Dingyi Zhuang, Lin Jiang, Rongchao Xu, Xinyue Ye, Yuheng Bu, Shenhao Wang, Guang Wang
Main category: cs.LG
TL;DR: UQGNN, a Graph Neural Network with Uncertainty Quantification, improves spatiotemporal prediction by capturing complex interactions and quantifying uncertainty, outperforming existing models.
Details
Motivation: Existing spatiotemporal prediction models are deterministic or focus on single phenomena, neglecting correlations among heterogeneous urban phenomena and uncertainty quantification.Method: UQGNN integrates an Interaction-aware Spatiotemporal Embedding Module (multivariate diffusion graph convolutional network and temporal convolutional network) and a multivariate probabilistic prediction module.
Result: UQGNN outperforms state-of-the-art baselines, achieving a 5% improvement in prediction accuracy and uncertainty quantification on the Shenzhen dataset.
Conclusion: UQGNN addresses the limitations of existing models by effectively capturing spatiotemporal interactions and quantifying uncertainty, demonstrating superior performance.
Abstract: Spatiotemporal prediction plays a critical role in numerous real-world applications such as urban planning, transportation optimization, disaster response, and pandemic control. In recent years, researchers have made significant progress by developing advanced deep learning models for spatiotemporal prediction. However, most existing models are deterministic, i.e., predicting only the expected mean values without quantifying uncertainty, leading to potentially unreliable and inaccurate outcomes. While recent studies have introduced probabilistic models to quantify uncertainty, they typically focus on a single phenomenon (e.g., taxi, bike, crime, or traffic crashes), thereby neglecting the inherent correlations among heterogeneous urban phenomena. To address the research gap, we propose a novel Graph Neural Network with Uncertainty Quantification, termed UQGNN for multivariate spatiotemporal prediction. UQGNN introduces two key innovations: (i) an Interaction-aware Spatiotemporal Embedding Module that integrates a multivariate diffusion graph convolutional network and an interaction-aware temporal convolutional network to effectively capture complex spatial and temporal interaction patterns, and (ii) a multivariate probabilistic prediction module designed to estimate both expected mean values and associated uncertainties. Extensive experiments on four real-world multivariate spatiotemporal datasets from Shenzhen, New York City, and Chicago demonstrate that UQGNN consistently outperforms state-of-the-art baselines in both prediction accuracy and uncertainty quantification. For example, on the Shenzhen dataset, UQGNN achieves a 5% improvement in both prediction accuracy and uncertainty quantification.
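A minimal sketch of the probabilistic prediction idea, assuming a Gaussian likelihood whose mean and log-variance are predicted per output; the encoder producing the embeddings `h` is abstracted away and all sizes are illustrative.

```python
# Sketch: mean + log-variance head trained with a Gaussian NLL.
import torch
import torch.nn as nn

class ProbHead(nn.Module):
    def __init__(self, d_in, d_out):
        super().__init__()
        self.mean = nn.Linear(d_in, d_out)
        self.log_var = nn.Linear(d_in, d_out)   # predict log-variance for stability

    def forward(self, h):
        return self.mean(h), self.log_var(h)

def gaussian_nll(y, mean, log_var):
    # Negative log-likelihood; the variance term penalizes over-confidence.
    return 0.5 * (log_var + (y - mean) ** 2 / log_var.exp()).mean()

head = ProbHead(64, 2)        # e.g., taxi and bike demand predicted jointly
h = torch.randn(32, 64)       # node embeddings from any spatiotemporal encoder
y = torch.randn(32, 2)
mean, log_var = head(h)
loss = gaussian_nll(y, mean, log_var)
loss.backward()
```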
[322] SHEFL: Resource-Aware Aggregation and Sparsification in Heterogeneous Ensemble Federated Learning
Keumseo Ryum, Jinu Gong, Joonhyuk Kang
Main category: cs.LG
TL;DR: SHEFL is a federated learning framework addressing computational heterogeneity by dynamically adjusting resource allocation and introducing a bias-aware aggregation scheme, improving fairness and performance.
Details
Motivation: Existing FL methods struggle with data/system heterogeneity and communication efficiency, while ensemble-based FL lacks diversity in model predictions.Method: SHEFL allocates global models based on client resources, introduces bias-aware aggregation, and dynamically adjusts resource ratios.
Result: Experiments show SHEFL effectively handles computational heterogeneity, enhancing fairness and overall performance.
Conclusion: SHEFL outperforms existing methods by better managing resource diversity and client biases in federated learning.
Abstract: Federated learning enables distributed training with private data of clients, but its convergence is hindered by data and system heterogeneity in realistic communication scenarios. Most existing system heterogeneous FL schemes utilize global pruning or ensemble distillation, yet they often overlook typical constraints required for communication efficiency. Meanwhile, deep ensembles can aggregate predictions from individually trained models to improve performance, but current ensemble-based FL methods fall short in fully capturing the diversity of model predictions. In this work, we propose SHEFL, a global ensemble-based federated learning framework suited for clients with diverse computational capacities. We allocate different numbers of global models to clients based on their available resources. We further introduce a novel aggregation scheme that accounts for bias between clients with different computational capabilities. To reduce the computational burden of training deep ensembles and mitigate data bias, we dynamically adjust the resource ratio across clients - aggressively reducing the influence of underpowered clients in constrained scenarios, while increasing their weight in the opposite case. Extensive experiments demonstrate that our method effectively addresses computational heterogeneity, significantly improving both fairness and overall performance compared to existing approaches.
[323] Distributed optimization: designed for federated learning
Wenyou Guo, Ting Qu, Chunrong Pan, George Q. Huang
Main category: cs.LG
TL;DR: The paper proposes a distributed optimization algorithm for Federated Learning using augmented Lagrangian techniques, adaptable to various communication topologies, with theoretical convergence guarantees and strong performance in heterogeneous settings.
Details
Motivation: To address the need for efficient and privacy-preserving distributed ML in cross-organizational data collaboration, focusing on diverse communication topologies and computational efficiency.Method: Develops distributed optimization algorithms based on augmented Lagrangian, incorporating proximal relaxation and quadratic approximation, with termination criteria and parameter updates.
Result: The framework generalizes classical optimization methods and shows strong performance in large-scale, heterogeneous FL settings.
Conclusion: The proposed algorithm is versatile, theoretically sound, and effective for practical FL applications.
Abstract: Federated Learning (FL), as a distributed collaborative Machine Learning (ML) framework under privacy-preserving constraints, has garnered increasing research attention in cross-organizational data collaboration scenarios. This paper proposes a class of distributed optimization algorithms based on the augmented Lagrangian technique, designed to accommodate diverse communication topologies in both centralized and decentralized FL settings. Furthermore, we develop multiple termination criteria and parameter update mechanisms to enhance computational efficiency, accompanied by rigorous theoretical guarantees of convergence. By generalizing the augmented Lagrangian relaxation through the incorporation of proximal relaxation and quadratic approximation, our framework systematically recovers a broad class of classical unconstrained optimization methods, including the proximal algorithm, classic gradient descent, and stochastic gradient descent, among others. Notably, the convergence properties of these methods can be naturally derived within the proposed theoretical framework. Numerical experiments demonstrate that the proposed algorithm exhibits strong performance in large-scale settings with significant statistical heterogeneity across clients.
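A minimal sketch of an augmented-Lagrangian (ADMM-style) consensus update of the kind this framework generalizes, on toy quadratic client losses with closed-form local steps; the penalty `rho` and the losses are illustrative assumptions.

```python
# Sketch: augmented-Lagrangian consensus for FL with quadratic client losses.
import numpy as np

rng = np.random.default_rng(0)
K, dim, rho = 5, 3, 1.0
targets = rng.standard_normal((K, dim))   # client k minimizes ||w - t_k||^2

w = np.zeros((K, dim))    # local models
z = np.zeros(dim)         # global consensus variable
u = np.zeros((K, dim))    # scaled dual variables (Lagrange multipliers)

for _ in range(50):
    # Local step: argmin_w ||w - t_k||^2 + (rho/2)||w - z + u_k||^2 (closed form)
    w = (2 * targets + rho * (z - u)) / (2 + rho)
    # Global step: average of local models plus duals
    z = (w + u).mean(axis=0)
    # Dual ascent on the consensus constraint w_k = z
    u = u + (w - z)

print("consensus gap:", np.abs(w - z).max())
print("z close to mean target:", np.allclose(z, targets.mean(axis=0), atol=1e-3))
```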
[324] Dynamic Rank Adjustment for Accurate and Efficient Neural Network Training
Hyuntak Shin, Aecheon Jung, Sunwoo Lee, Sungeun Hong
Main category: cs.LG
TL;DR: A dynamic-rank training framework is proposed to mitigate rank collapse in low-rank training by interleaving full-rank epochs, achieving comparable accuracy to full-rank training with similar computational cost to low-rank methods.
Details
Motivation: Fixed low-rank structures limit model learning and accelerate rank decline during training, necessitating a solution to restore weight matrix rank.Method: Interleave full-rank training epochs within low-rank training to dynamically adjust weight matrix rank and prevent collapse.
Result: The framework maintains computational efficiency of low-rank training while matching full-rank training accuracy across benchmarks.
Conclusion: Dynamic-rank training effectively balances computational cost and model performance, addressing limitations of fixed low-rank methods.
Abstract: Low-rank training methods reduce the number of trainable parameters by re-parameterizing the weights with matrix decompositions (e.g., singular value decomposition). However, enforcing a fixed low-rank structure caps the rank of the weight matrices and can hinder the model’s ability to learn complex patterns. Furthermore, the effective rank of the model’s weights tends to decline during training, and this drop is accelerated when the model is reparameterized into a low-rank structure. In this study, we argue that strategically interleaving full-rank training epochs within low-rank training epochs can effectively restore the rank of the model’s weights. Based on our findings, we propose a general dynamic-rank training framework that is readily applicable to a wide range of neural-network tasks. We first describe how to adjust the rank of the weight matrices to alleviate the inevitable rank collapse that arises during training, and then present extensive empirical results that validate our claims and demonstrate the efficacy of the proposed framework. Our empirical study shows that the proposed method achieves almost the same computational cost as SVD-based low-rank training while achieving a comparable accuracy to full-rank training across various benchmarks.
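A minimal sketch of the interleaving idea: train in an SVD-factorized low-rank form most epochs, but periodically take full-rank epochs so the effective rank can recover. The fixed schedule, rank `r`, and placeholder "gradient" updates are illustrative stand-ins, not the paper's scheduling policy.

```python
# Sketch: interleave full-rank epochs into SVD-based low-rank training.
import torch

def to_low_rank(W, r):
    """Factor W ~ A @ B with rank r via truncated SVD."""
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    A = U[:, :r] * S[:r]     # (out, r)
    B = Vh[:r, :]            # (r, in)
    return A, B

W = torch.randn(256, 256)
for epoch in range(10):
    if epoch % 4 == 3:
        # Full-rank epoch: update W directly so its effective rank can recover.
        W = W - 0.01 * torch.randn_like(W)     # placeholder for a real gradient step
    else:
        # Low-rank epochs: update the factors, then merge back into W.
        A, B = to_low_rank(W, r=32)
        A = A - 0.01 * torch.randn_like(A)     # placeholder factor updates
        B = B - 0.01 * torch.randn_like(B)
        W = A @ B
    print(epoch, "effective rank:", torch.linalg.matrix_rank(W).item())
```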
[325] Classifier Language Models: Unifying Sparse Finetuning and Adaptive Tokenization for Specialized Classification Tasks
Adit Krishnan, Chu Wang, Chris Kong
Main category: cs.LG
TL;DR: A token-driven sparse finetuning strategy for small language models improves specialized semantic classification tasks by focusing on task-specific tokens, outperforming other methods while reducing training costs.
Details
Motivation: Specialized semantic classification tasks in industry require domain expertise and high inference throughput, making large models impractical. Smaller, customized models are preferred.Method: Develops a token-driven sparse finetuning strategy, identifying and finetuning a sensitive subset of parameters using task-specific tokens, without adding extra parameters.
Result: Outperforms end-to-end finetuning, LoRA, layer selection, and prefix tuning on five tasks, with greater stability and half the training costs.
Conclusion: The proposed method is effective for adapting small language models to specialized tasks, balancing performance and efficiency.
Abstract: Semantic text classification requires the understanding of the contextual significance of specific tokens rather than surface-level patterns or keywords (as in rule-based or statistical text classification), making large language models (LLMs) well-suited for this task. However, semantic classification applications in industry, like customer intent detection or semantic role labeling, tend to be highly specialized. They require annotation by domain experts in contrast to general-purpose corpora for pretraining. Further, they typically require high inference throughputs which limits the model size from latency and cost perspectives. Thus, for a range of specialized classification tasks, the preferred solution is to develop customized classifiers by finetuning smaller language models (e.g., mini-encoders, small language models). In this work, we develop a token-driven sparse finetuning strategy to adapt small language models to specialized classification tasks. We identify and finetune a small sensitive subset of model parameters by leveraging task-specific token constructs in the finetuning dataset, while leaving most of the pretrained weights unchanged. Unlike adapter approaches such as low rank adaptation (LoRA), we do not introduce additional parameters to the model. Our approach identifies highly relevant semantic tokens (case study in the Appendix) and outperforms end-to-end finetuning, LoRA, layer selection, and prefix tuning on five diverse semantic classification tasks. We achieve greater stability and half the training costs vs. end-to-end finetuning.
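A minimal sketch of token-driven sparse finetuning under the assumption that parameter sensitivity is scored by gradient magnitude on batches emphasizing task-specific tokens; the paper's actual token-construct selection is richer than this.

```python
# Sketch: select a sparse, gradient-sensitive parameter subset and update only it.
import torch
import torch.nn as nn

def build_sparse_masks(model, loss_fn, batch, keep_frac=0.05):
    """Mark the top-|grad| fraction of each weight tensor as trainable."""
    model.zero_grad()
    loss_fn(model, batch).backward()
    masks = {}
    for name, p in model.named_parameters():
        if p.grad is None:
            continue
        k = max(1, int(keep_frac * p.numel()))
        thresh = p.grad.abs().flatten().topk(k).values[-1]
        masks[name] = (p.grad.abs() >= thresh).float()
    return masks

def masked_step(model, masks, lr=1e-4):
    """Update only the selected sparse subset; everything else stays frozen."""
    with torch.no_grad():
        for name, p in model.named_parameters():
            if name in masks and p.grad is not None:
                p -= lr * p.grad * masks[name]

# Toy demo: a linear classifier stands in for a small language model head,
# and the batch stands in for examples rich in task-specific tokens.
model = nn.Linear(16, 2)
batch = (torch.randn(8, 16), torch.randint(0, 2, (8,)))
loss_fn = lambda m, b: nn.functional.cross_entropy(m(b[0]), b[1])
masks = build_sparse_masks(model, loss_fn, batch)
model.zero_grad()
loss_fn(model, batch).backward()
masked_step(model, masks)
```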
[326] MiGrATe: Mixed-Policy GRPO for Adaptation at Test-Time
Peter Phan, Dhruv Agarwal, Kavitha Srinivas, Horst Samulowitz, Pavan Kapanipathi, Andrew McCallum
Main category: cs.LG
TL;DR: MiGrATe is an online test-time training method using GRPO for LLMs, balancing exploration and exploitation without external data, outperforming baselines in diverse tasks.
Details
Motivation: Addressing the limitations of in-context learning and hand-crafted training data in black-box optimization tasks with LLMs.Method: MiGrATe combines GRPO as a search algorithm with mixed-policy group construction (on-policy, greedy, and neighborhood sampling) for online TTT.
Result: Outperforms inference-only and TTT baselines in word search, molecule optimization, and ARC tasks.
Conclusion: MiGrATe shows promise for complex search tasks without external supervision, leveraging online TTT effectively.
Abstract: Large language models (LLMs) are increasingly being applied to black-box optimization tasks, from program synthesis to molecule design. Prior work typically leverages in-context learning to iteratively guide the model towards better solutions. Such methods, however, often struggle to balance exploration of new solution spaces with exploitation of high-reward ones. Recently, test-time training (TTT) with synthetic data has shown promise in improving solution quality. However, the need for hand-crafted training data tailored to each task limits feasibility and scalability across domains. To address this problem, we introduce MiGrATe, a method for online TTT that uses GRPO as a search algorithm to adapt LLMs at inference without requiring external training data. MiGrATe operates via a mixed-policy group construction procedure that combines on-policy sampling with two off-policy data selection techniques: greedy sampling, which selects top-performing past completions, and neighborhood sampling (NS), which generates completions structurally similar to high-reward ones. Together, these components bias the policy gradient towards exploitation of promising regions in solution space, while preserving exploration through on-policy sampling. We evaluate MiGrATe on three challenging domains (word search, molecule optimization, and hypothesis+program induction on the Abstraction and Reasoning Corpus, ARC) and find that it consistently outperforms both inference-only and TTT baselines, demonstrating the potential of online TTT as a solution for complex search tasks without external supervision.
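A minimal sketch of mixed-policy group construction with group-relative advantages, using toy stand-ins for the policy sampler, the neighborhood sampler `mutate`, and the reward; none of these stand-ins come from the paper.

```python
# Sketch: build a GRPO group from on-policy, greedy, and neighborhood samples.
import random

archive = []   # (completion, reward) pairs from earlier iterations

def reward(x):                      # toy black-box objective
    return -abs(len(x) - 10)

def sample_on_policy(n):            # stand-in for sampling from the LLM policy
    return ["x" * random.randint(1, 20) for _ in range(n)]

def mutate(x):                      # neighborhood sampling: a small structural edit
    return x + "x" if random.random() < 0.5 else (x[:-1] or "x")

def build_group(n_on=4, n_greedy=2, n_nbr=2):
    group = sample_on_policy(n_on)                     # preserves exploration
    top = sorted(archive, key=lambda cr: cr[1], reverse=True)
    group += [c for c, _ in top[:n_greedy]]            # greedy: best past completions
    group += [mutate(c) for c, _ in top[:n_nbr]]       # neighbors of high-reward ones
    return group

for _ in range(3):
    group = build_group()
    rewards = [reward(c) for c in group]
    archive += list(zip(group, rewards))
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5 or 1.0
    advantages = [(r - mean) / std for r in rewards]   # group-relative advantages
    print([round(a, 2) for a in advantages])           # would weight the policy gradient
```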
[327] $\text{M}^{2}$LLM: Multi-view Molecular Representation Learning with Large Language Models
Jiaxin Ju, Yizhen Zheng, Huan Yee Koh, Can Wang, Shirui Pan
Main category: cs.LG
TL;DR: The paper proposes M²LLM, a multi-view framework combining molecular structure, task, and rules views, leveraging LLMs for superior molecular property prediction.
Details
Motivation: Existing molecular representation methods like fingerprints and GNNs lack semantic and contextual knowledge, which LLMs can provide.Method: M²LLM integrates three perspectives (structure, task, rules) dynamically, using LLMs for embeddings and feature curation.
Result: M²LLM achieves state-of-the-art performance on benchmarks for classification and regression tasks.
Conclusion: LLMs enhance molecular representations through encoding and reasoning, proving effective in property prediction.
Abstract: Accurate molecular property prediction is a critical challenge with wide-ranging applications in chemistry, materials science, and drug discovery. Molecular representation methods, including fingerprints and graph neural networks (GNNs), achieve state-of-the-art results by effectively deriving features from molecular structures. However, these methods often overlook decades of accumulated semantic and contextual knowledge. Recent advancements in large language models (LLMs) demonstrate remarkable reasoning abilities and prior knowledge across scientific domains, leading us to hypothesize that LLMs can generate rich molecular representations when guided to reason in multiple perspectives. To address these gaps, we propose $\text{M}^{2}$LLM, a multi-view framework that integrates three perspectives: the molecular structure view, the molecular task view, and the molecular rules view. These views are fused dynamically to adapt to task requirements, and experiments demonstrate that $\text{M}^{2}$LLM achieves state-of-the-art performance on multiple benchmarks across classification and regression tasks. Moreover, we demonstrate that representation derived from LLM achieves exceptional performance by leveraging two core functionalities: the generation of molecular embeddings through their encoding capabilities and the curation of molecular features through advanced reasoning processes.
[328] Multi-level Collaborative Distillation Meets Global Workspace Model: A Unified Framework for OCIL
Shibin Su, Guoqiang Liang, De Cheng, Shizhou Zhang, Lingyan Ran, Yanning Zhang
Main category: cs.LG
TL;DR: A novel approach using a Global Workspace Model (GWM) and multi-level collaborative distillation improves stability and adaptability in Online Class-Incremental Learning (OCIL) under strict memory constraints.
Details
Motivation: Address the challenges of maintaining model stability and adaptability in OCIL, where current methods struggle under strict memory constraints.Method: Proposes a GWM to fuse student model parameters, capturing historical learning, and a multi-level distillation mechanism for peer-to-peer consistency and knowledge alignment.
Result: Significant performance improvement on three OCIL benchmarks across various memory budgets.
Conclusion: The method balances stability and plasticity, enhancing OCIL model performance under memory constraints.
Abstract: Online Class-Incremental Learning (OCIL) enables models to learn continuously from non-i.i.d. data streams where samples can be seen only once, making it more suitable for real-world scenarios compared to offline learning. However, OCIL faces two key challenges: maintaining model stability under strict memory constraints and ensuring adaptability to new tasks. Under stricter memory constraints, current replay-based methods are less effective. While ensemble methods improve adaptability (plasticity), they often struggle with stability. To overcome these challenges, we propose a novel approach that enhances ensemble learning through a Global Workspace Model (GWM), a shared, implicit memory that guides the learning of multiple student models. The GWM is formed by fusing the parameters of all students within each training batch, capturing the historical learning trajectory and serving as a dynamic anchor for knowledge consolidation. This fused model is then redistributed periodically to the students to stabilize learning and promote cross-task consistency. In addition, we introduce a multi-level collaborative distillation mechanism. This approach enforces peer-to-peer consistency among students and preserves historical knowledge by aligning each student with the GWM. As a result, student models remain adaptable to new tasks while maintaining previously learned knowledge, striking a better balance between stability and plasticity. Extensive experiments on three standard OCIL benchmarks show that our method delivers significant performance improvement for several OCIL models across various memory budgets.
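A minimal sketch of the fusion-and-redistribution loop: average the students' parameters into the global workspace model, then load it back into the students; uniform fusion weights and the redistribution point are illustrative assumptions.

```python
# Sketch: fuse student parameters into a GWM and redistribute it.
import copy
import torch
import torch.nn as nn

students = [nn.Linear(8, 4) for _ in range(3)]   # stand-ins for student models

def fuse(models):
    """Average state dicts parameter-wise to form the global workspace model."""
    gwm = copy.deepcopy(models[0])
    with torch.no_grad():
        for name, p in gwm.named_parameters():
            stacked = torch.stack([dict(m.named_parameters())[name] for m in models])
            p.copy_(stacked.mean(dim=0))
    return gwm

gwm = fuse(students)          # dynamic anchor capturing the learning trajectory
for s in students:            # periodic redistribution stabilizes learning
    s.load_state_dict(gwm.state_dict())
```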
[329] Expert-Guided Diffusion Planner for Auto-bidding
Yunshan Peng, Wenzheng Shu, Jiahao Sun, Yanxiang Zeng, Jinan Pang, Wentao Bai, Yunke Bai, Xialong Liu, Peng Jiang
Main category: cs.LG
TL;DR: The paper introduces a novel conditional diffusion modeling method for auto-bidding, combining expert trajectory guidance and skip-step sampling to improve efficiency and effectiveness, achieving significant gains in conversion and revenue.
Details
Motivation: Traditional generative bidding lacks personalized structural information and faces timeliness risks, prompting the need for a more robust method.Method: Proposes a conditional diffusion modeling approach with expert trajectory guidance and skip-step sampling to enhance decision sequence generation.
Result: Offline and online A/B tests showed an 11.29% increase in conversion and a 12.35% increase in revenue compared to the baseline.
Conclusion: The proposed method effectively addresses limitations of existing generative bidding, demonstrating superior performance in auto-bidding scenarios.
Abstract: Auto-bidding is extensively applied in advertising systems, serving a multitude of advertisers. Generative bidding is gradually gaining traction due to its robust planning capabilities and generalizability. In contrast to traditional reinforcement learning-based bidding, generative bidding does not rely on the Markov Decision Process (MDP), exhibiting superior planning capabilities in long-horizon scenarios. Conditional diffusion modeling approaches have demonstrated significant potential in the realm of auto-bidding. However, relying solely on return as the optimality condition is too weak to guarantee the generation of genuinely optimal decision sequences, as it lacks personalized structural information. Moreover, diffusion models’ t-step autoregressive generation mechanism inherently carries timeliness risks. To address these issues, we propose a novel conditional diffusion modeling method based on expert trajectory guidance combined with a skip-step sampling strategy to enhance generation efficiency. We have validated the effectiveness of this approach through extensive offline experiments and achieved statistically significant results in online A/B testing, with an 11.29% increase in conversion and a 12.35% increase in revenue compared with the baseline.
[330] Generative Modeling for Robust Deep Reinforcement Learning on the Traveling Salesman Problem
Michael Li, Eric Bae, Christopher Haberland, Natasha Jaques
Main category: cs.LG
TL;DR: COGS improves neural TSP solvers’ robustness by using generative sampling for training data, outperforming baselines in worst-case scenarios.
Details
Motivation: Neural TSP solvers struggle with generalization on realistic distributions, prompting the need for robust training methods.Method: COGS samples training data from a generative TSP model to enhance coverage and interpolation.
Result: COGS improves worst-case performance and robustness, validated on synthetic datasets and TSPLib50.
Conclusion: Generative sampling (COGS) effectively addresses distribution robustness in neural TSP solvers.
Abstract: The Traveling Salesman Problem (TSP) is a classic NP-hard combinatorial optimization task with numerous practical applications. Classic heuristic solvers can attain near-optimal performance for small problem instances, but become computationally intractable for larger problems. Real-world logistics problems such as dynamically re-routing last-mile deliveries demand a solver with fast inference time, which has led researchers to investigate specialized neural network solvers. However, neural networks struggle to generalize beyond the synthetic data they were trained on. In particular, we show that there exist TSP distributions that are realistic in practice, which also consistently lead to poor worst-case performance for existing neural approaches. To address this issue of distribution robustness, we present Combinatorial Optimization with Generative Sampling (COGS), where training data is sampled from a generative TSP model. We show that COGS provides better data coverage and interpolation in the space of TSP training distributions. We also present TSPLib50, a dataset of realistically distributed TSP samples, which tests real-world generalization ability without conflating this issue with instance size. We evaluate our method on various synthetic datasets as well as TSPLib50, and compare to state-of-the-art neural baselines. We demonstrate that COGS improves distribution robustness, with most performance gains coming from worst-case scenarios.
[331] Elucidating Rectified Flow with Deterministic Sampler: Polynomial Discretization Complexity for Multi and One-step Models
Ruofeng Yang, Zhaoyu Zhu, Bo Jiang, Cheng Chen, Shuai Li
Main category: cs.LG
TL;DR: The paper proves polynomial discretization complexity for rectified flow (RF)-based models in multi-step and one-step generation, outperforming diffusion models and other popular models like VP and VE-based models.
Details
Motivation: Existing theoretical works on RF-based models lack analysis of discretization complexity, often focusing on stochastic samplers or showing exponential dependence on parameters. This work aims to fill this gap.Method: Under bounded support assumptions, the paper analyzes RF-based models with deterministic samplers, introducing a Langevin process as a corrector for multi-step settings.
Result: Polynomial discretization complexity is achieved for both multi-step and one-step RF-based models, surpassing diffusion models and prior results.
Conclusion: This work provides the first theoretical understanding of RF-based models’ empirical success, marking progress in analyzing their performance.
Abstract: Recently, rectified flow (RF)-based models have achieved state-of-the-art performance in many areas for both the multi-step and one-step generation. However, only a few theoretical works analyze the discretization complexity of RF-based models. Existing works either focus on flow-based models with stochastic samplers or establish complexity results that exhibit exponential dependence on problem parameters. In this work, under the realistic bounded support assumption, we prove the first polynomial discretization complexity for multi-step and one-step RF-based models with a deterministic sampler simultaneously. For the multi-step setting, inspired by the predictor-corrector framework of diffusion models, we introduce a Langevin process as a corrector and show that RF-based models can achieve better polynomial discretization complexity than diffusion models. To achieve this result, we conduct a detailed analysis of the RF-based model and explain why it is better than previous popular models, such as variance preserving (VP) and variance exploding (VE)-based models. Based on the observation of multi-step RF-based models, we further provide the first polynomial discretization complexity result for one-step RF-based models, improving upon prior results for one-step diffusion-based models. These findings mark the first step toward theoretically understanding the impressive empirical performance of RF-based models in both multi-step and one-step generation.
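For context, this is the standard rectified-flow setup the analysis builds on: a linear interpolation between two endpoint distributions, a velocity-field ODE, and its Euler discretization, i.e., the deterministic sampler whose step count the complexity bounds control. This is the textbook formulation, not a paper-specific result.

```latex
% Standard rectified-flow formulation (conventions vary on which endpoint,
% x_0 or x_1, is data and which is noise). The discretization complexity
% results bound the number of Euler steps K of the deterministic sampler.
\begin{align}
  x_t &= (1 - t)\,x_0 + t\,x_1, \qquad t \in [0, 1], \\
  \frac{\mathrm{d}x_t}{\mathrm{d}t} &= v_\theta(x_t, t)
      \approx \mathbb{E}\!\left[\, x_1 - x_0 \mid x_t \,\right], \\
  x_{t_{k+1}} &= x_{t_k} + (t_{k+1} - t_k)\, v_\theta(x_{t_k}, t_k),
      \qquad k = 0, \dots, K - 1.
\end{align}
```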
[332] Interpretable Reward Model via Sparse Autoencoder
Shuyi Zhang, Wei Shi, Sihang Li, Jiayi Liao, Tao Liang, Hengxing Cai, Xiang Wang
Main category: cs.LG
TL;DR: SARM integrates a Sparse Autoencoder into reward models to improve interpretability and adaptability to preference shifts, outperforming traditional models.
Details
Motivation: Traditional reward models lack interpretability and flexibility, hindering effective alignment of LLMs with human values.Method: SARM uses a pretrained Sparse Autoencoder to map LLM activations into an interpretable feature space, enabling transparent reward scoring.
Result: SARM provides feature-level attribution, adapts to preference shifts, and achieves better alignment performance.
Conclusion: SARM offers a scalable and interpretable solution for aligning LLMs with human preferences.
Abstract: Large language models (LLMs) have been widely deployed across numerous fields. Reinforcement Learning from Human Feedback (RLHF) leverages reward models (RMs) as proxies for human preferences to align LLM behaviors with human values, making the accuracy, reliability, and interpretability of RMs critical for effective alignment. However, traditional RMs lack interpretability, offer limited insight into the reasoning behind reward assignments, and are inflexible toward user preference shifts. While recent multidimensional RMs aim for improved interpretability, they often fail to provide feature-level attribution and require costly annotations. To overcome these limitations, we introduce the Sparse Autoencoder-enhanced Reward Model (SARM), a novel architecture that integrates a pretrained Sparse Autoencoder (SAE) into a reward model. SARM maps the hidden activations of an LLM-based RM into an interpretable, sparse, and monosemantic feature space, from which a scalar head aggregates feature activations to produce transparent and conceptually meaningful reward scores. Empirical evaluations demonstrate that SARM facilitates direct feature-level attribution of reward assignments, allows dynamic adjustment to preference shifts, and achieves superior alignment performance compared to conventional reward models. Our code is available at https://github.com/schrieffer-z/sarm.
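A minimal sketch of the architecture's shape: encode the RM's hidden activation into a sparse feature vector and score it with a scalar head. The TopK sparsifier and all sizes are illustrative assumptions rather than the paper's exact design.

```python
# Sketch: sparse-autoencoder features feeding a scalar reward head.
import torch
import torch.nn as nn

class SparseAutoencoderRM(nn.Module):
    def __init__(self, d_model=768, d_feat=4096, k=32):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_feat)
        self.decoder = nn.Linear(d_feat, d_model)   # used for the SAE reconstruction loss
        self.score = nn.Linear(d_feat, 1)           # scalar head over sparse features
        self.k = k

    def forward(self, h):
        a = torch.relu(self.encoder(h))
        # Keep only the top-k activations so each reward is attributable
        # to a handful of (ideally monosemantic) features.
        topk = torch.topk(a, self.k, dim=-1)
        sparse = torch.zeros_like(a).scatter(-1, topk.indices, topk.values)
        return self.score(sparse).squeeze(-1), sparse

rm = SparseAutoencoderRM()
hidden = torch.randn(4, 768)       # e.g., last-token activations from the base LLM
reward, feats = rm(hidden)
print(reward.shape, (feats != 0).sum(-1))   # 4 rewards, 32 active features each
```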
[333] Differentiated Information Mining: A Semi-supervised Learning Framework for GNNs
Long Wang, Kai Liu
Main category: cs.LG
TL;DR: DiFac is a semi-supervised framework for GNNs that derives differentiated factors from a single source, enforces their consistency, and integrates auxiliary information to improve robustness and generalization.
Details
Motivation: Addressing the challenge of obtaining mutually independent decision factors in SSL for GNNs, which is crucial to avoid pseudo-label bias and training collapse.Method: DiFac extracts differentiated factors from a single source, removes conflicting samples, ranks pseudo-labels, and incorporates auxiliary textual knowledge via a multimodal language model.
Result: DiFac outperforms baselines in low-label regimes, enhancing robustness and generalization on benchmark datasets.
Conclusion: DiFac effectively mitigates pseudo-label bias and overconfidence, leveraging both intrinsic and auxiliary factors for improved SSL performance.
Abstract: In semi-supervised learning (SSL) for enhancing the performance of graph neural networks (GNNs) with unlabeled data, introducing mutually independent decision factors for cross-validation is regarded as an effective strategy to alleviate pseudo-label confirmation bias and training collapse. However, obtaining such factors is challenging in practice: additional and valid information sources are inherently scarce, and even when such sources are available, their independence from the original source cannot be guaranteed. To address this challenge, in this paper we propose a Differentiated Factor Consistency Semi-supervised Framework (DiFac), which derives differentiated factors from a single information source and enforces their consistency. During pre-training, the model learns to extract these factors; in training, it iteratively removes samples with conflicting factors and ranks pseudo-labels based on the shortest-stave principle, selecting the top candidate samples to reduce the overconfidence commonly observed in confidence-based or ensemble-based methods. Our framework can also incorporate additional information sources. In this work, we leverage a large multimodal language model to introduce latent textual knowledge as auxiliary decision factors, and we design an accountability scoring mechanism to mitigate additional erroneous judgments introduced by these auxiliary factors. Experiments on multiple benchmark datasets demonstrate that DiFac consistently improves robustness and generalization in low-label regimes, outperforming other baseline methods.
[334] TechOps: Technical Documentation Templates for the AI Act
Laura Lucaj, Alex Loosley, Hakan Jonsson, Urs Gasser, Patrick van der Smagt
Main category: cs.LG
TL;DR: The paper introduces open-source templates for AI documentation to ensure compliance with the EU AI Act, covering data, models, and applications throughout the AI lifecycle.
Details
Motivation: Existing documentation templates fail to fully meet the AI Act's requirements, necessitating a solution for transparency, traceability, and accountability.Method: The authors develop and refine templates for data, models, and applications, validated through user feedback and real-world scenarios.
Result: The templates enhance traceability, reproducibility, and compliance, demonstrated via practical examples like a skin tones dataset and a neural network for silhouette segmentation.
Conclusion: The templates effectively support regulatory compliance and responsible AI development, serving as a practical tool for oversight.
Abstract: Operationalizing the EU AI Act requires clear technical documentation to ensure AI systems are transparent, traceable, and accountable. Existing documentation templates for AI systems do not fully cover the entire AI lifecycle while meeting the technical documentation requirements of the AI Act. This paper addresses those shortcomings by introducing open-source templates and examples for documenting data, models, and applications to provide sufficient documentation for certifying compliance with the AI Act. These templates track the system status over the entire AI lifecycle, ensuring traceability, reproducibility, and compliance with the AI Act. They also promote discoverability and collaboration, reduce risks, and align with best practices in AI documentation and governance. The templates are evaluated and refined based on user feedback to enable insights into their usability and implementability. We then validate the approach on real-world scenarios, providing examples that further guide their implementation: the data template is followed to document a skin tones dataset created to support fairness evaluations of downstream computer vision models and human-centric applications; the model template is followed to document a neural network for segmenting human silhouettes in photos. The application template is tested on a system deployed for construction site safety using real-time video analytics and sensor data. Our results show that TechOps can serve as a practical tool to enable oversight for regulatory compliance and responsible AI development.
[335] TempOpt – Unsupervised Alarm Relation Learning for Telecommunication Networks
Sathiyanaryanan Sampath, Pratyush Uppuluri, Thirumaran Ekambaram
Main category: cs.LG
TL;DR: The paper introduces TempOpt, an unsupervised alarm relation learning technique for efficient root alarm identification in telecommunications networks.
Details
Motivation: Handling the enormous volume of interconnected alarms in telecommunications networks is challenging, requiring better methods to learn alarm relations for accurate root cause analysis.Method: Proposes TempOpt, a novel unsupervised technique for learning alarm relations, overcoming limitations of temporal dependency methods.
Result: Experiments on real-world datasets show TempOpt improves the quality of learned alarm relations compared to existing methods.
Conclusion: TempOpt is a practical and effective solution for alarm relation learning in network fault monitoring.
Abstract: In a telecommunications network, fault alarms generated by network nodes are monitored in a Network Operations Centre (NOC) to ensure network availability and continuous network operations. The monitoring process comprises tasks such as active alarm analysis, root alarm identification, and resolution of the underlying problem. Each network node can potentially generate alarms of different types, nodes can come from multiple vendors, and a network can have hundreds of nodes, resulting in an enormous volume of alarms at any time. Since network nodes are inter-connected, a single fault in the network would trigger multiple sequences of alarms across a variety of nodes. From a monitoring point of view, it is challenging for a NOC engineer to be aware of the relations between the various alarms when trying to identify, for example, a root alarm on which an action needs to be taken. To effectively identify root alarms, it is essential to learn the relations among alarms for accurate and faster resolution. In this work we propose a novel unsupervised alarm relation learning technique, Temporal Optimization (TempOpt), that is practical and overcomes the limitations of an existing class of alarm relation learning methods: temporal dependency methods. Experiments carried out on real-world network datasets demonstrate the improved quality of alarm relations learned by TempOpt as compared to temporal dependency methods.
[336] Wavelet Mixture of Experts for Time Series Forecasting
Zheng Zhou, Yu-Jie Xiong, Jia-Chen Zhang, Chun-Ming Xia, Xi-Jiong Xie
Main category: cs.LG
TL;DR: The paper introduces WaveTS-B and WaveTS-M, lightweight time series models combining wavelet transforms and MLPs to address limitations of Transformers and MLPs in capturing non-stationary and multi-channel dependencies. WaveTS-M, enhanced with a Mixture of Experts framework, achieves SOTA performance with fewer parameters.
Details
Motivation: Existing Transformer and MLP models for time series forecasting are limited by high parameter counts, inability to capture non-stationary features, and poor multi-channel dependency handling.Method: Proposes WaveTS-B (wavelet transforms + MLP) for periodic and non-stationary data, and WaveTS-M (with MoE framework) for multi-channel dependencies.
Result: Empirical tests on eight datasets show WaveTS models achieve SOTA performance with fewer parameters, especially WaveTS-M for multi-channel data.
Conclusion: WaveTS models effectively address key limitations in time series forecasting, offering lightweight, high-performance alternatives to existing approaches.
Abstract: The field of time series forecasting is rapidly advancing, with recent large-scale Transformers and lightweight Multilayer Perceptron (MLP) models showing strong predictive performance. However, conventional Transformer models are often hindered by their large number of parameters and their limited ability to capture non-stationary features in data through smoothing. Similarly, MLP models struggle to manage multi-channel dependencies effectively. To address these limitations, we propose a novel, lightweight time series prediction model, WaveTS-B. This model combines wavelet transforms with MLPs to capture both periodic and non-stationary characteristics of data in the wavelet domain. Building on this foundation, we propose a channel clustering strategy that incorporates a Mixture of Experts (MoE) framework, utilizing a gating mechanism and expert networks to handle multi-channel dependencies efficiently, yielding WaveTS-M, an advanced model tailored for multi-channel time series prediction. Empirical evaluation across eight real-world time series datasets demonstrates that our WaveTS series models achieve state-of-the-art (SOTA) performance with significantly fewer parameters. Notably, WaveTS-M shows substantial improvements on multi-channel datasets, highlighting its effectiveness.
[337] Flow Battery Manifold Design with Heterogeneous Inputs Through Generative Adversarial Neural Networks
Eric Seng, Hugh O’Connor, Adam Boyce, Josh J. Bailey, Anton van Beek
Main category: cs.LG
TL;DR: A framework for creating tailored datasets for generative models and improving interpretability in design exploration, validated by flow battery manifold design.
Details
Motivation: Address constraints of large datasets and lack of interpretability in generative machine learning for design.Method: Systematic framework for dataset construction and integration of generative models with Bayesian optimization.
Result: Effective capture of feasible designs, including novel configurations, and enhanced interpretability.
Conclusion: Broadens generative model applicability in system design by improving quality and reliability.
Abstract: Generative machine learning has emerged as a powerful tool for design representation and exploration. However, its application is often constrained by the need for large datasets of existing designs and the lack of interpretability about what features drive optimality. To address these challenges, we introduce a systematic framework for constructing training datasets tailored to generative models and demonstrate how these models can be leveraged for interpretable design. The novelty of this work is twofold: (i) we present a systematic framework for generating archetypes with internally homogeneous but mutually heterogeneous inputs that can be used to generate a training dataset, and (ii) we show how integrating generative models with Bayesian optimization can enhance the interpretability of the latent space of admissible designs. These findings are validated by using the framework to design a flow battery manifold, demonstrating that it effectively captures the space of feasible designs, including novel configurations while enabling efficient exploration. This work broadens the applicability of generative machine-learning models in system designs by enhancing quality and reliability.
[338] Oblivionis: A Lightweight Learning and Unlearning Framework for Federated Large Language Models
Fuyao Zhang, Xinyu Yan, Tiantong Wu, Wenjie Li, Tianxiang Chen, Yang Cao, Ran Yan, Longtao Huang, Wei Yang Bryan Lim, Qiang Yang
Main category: cs.LG
TL;DR: Oblivionis is a framework for federated LLM unlearning, addressing GDPR compliance by enabling selective data removal post-training without compromising model utility.
Details
Motivation: Federated LLM frameworks lack mechanisms for regulatory compliance (e.g., GDPR's right to be forgotten) and struggle with data quality and governance.Method: Introduces Oblivionis, a lightweight framework unifying FL and unlearning as a dual optimization objective, incorporating 6 FL and 5 unlearning algorithms.
Result: Outperforms local training, balancing forgetting efficacy and model utility, with cross-algorithm comparisons guiding future LLM development.
Conclusion: Oblivionis enhances trustworthiness and compliance in federated LLM training, offering a robust solution for selective data removal.
Abstract: Large Language Models (LLMs) increasingly leverage Federated Learning (FL) to utilize private, task-specific datasets for fine-tuning while preserving data privacy. However, while federated LLM frameworks effectively enable collaborative training without raw data sharing, they critically lack built-in mechanisms for regulatory compliance like GDPR’s right to be forgotten. Integrating private data heightens concerns over data quality and long-term governance, yet existing distributed training frameworks offer no principled way to selectively remove specific client contributions post-training. Due to distributed data silos, stringent privacy constraints, and the intricacies of interdependent model aggregation, federated LLM unlearning is significantly more complex than centralized LLM unlearning. To address this gap, we introduce Oblivionis, a lightweight learning and unlearning framework that enables clients to selectively remove specific private data during federated LLM training, enhancing trustworthiness and regulatory compliance. By unifying FL and unlearning as a dual optimization objective, we incorporate 6 FL and 5 unlearning algorithms for comprehensive evaluation and comparative analysis, establishing a robust pipeline for federated LLM unlearning. Extensive experiments demonstrate that Oblivionis outperforms local training, achieving a robust balance between forgetting efficacy and model utility, with cross-algorithm comparisons providing clear directions for future LLM development.
[339] Towards Scalable Lottery Ticket Networks using Genetic Algorithms
Julian Schönberger, Maximilian Zorn, Jonas Nüßlein, Thomas Gabor, Philipp Altmann
Main category: cs.LG
TL;DR: The paper explores using genetic algorithms to identify subnetworks in randomly initialized neural networks that perform well without training, achieving better accuracy and sparsity than current methods.
Details
Motivation: To rethink deep learning paradigms by avoiding overparameterization and expensive training, leveraging the Strong Lottery Ticket Hypothesis for efficient, scalable models.Method: Genetic algorithms are used to find high-performing subnetworks in untrained, overparameterized networks, tested on binary and multi-class tasks.
Result: The approach outperforms state-of-the-art methods in accuracy and sparsity without needing gradient information.
Conclusion: Genetic algorithms effectively identify strong lottery ticket subnetworks, but appropriate evaluation metrics are crucial for scaling to complex tasks.
Abstract: Building modern deep learning systems that are not just effective but also efficient requires rethinking established paradigms for model training and neural architecture design. Instead of adapting highly overparameterized networks and subsequently applying model compression techniques to reduce resource consumption, a new class of high-performing networks skips the need for expensive parameter updates, while requiring only a fraction of parameters, making them highly scalable. The Strong Lottery Ticket Hypothesis posits that within randomly initialized, sufficiently overparameterized neural networks, there exist subnetworks that can match the accuracy of the trained original model-without any training. This work explores the usage of genetic algorithms for identifying these strong lottery ticket subnetworks. We find that for instances of binary and multi-class classification tasks, our approach achieves better accuracies and sparsity levels than the current state-of-the-art without requiring any gradient information. In addition, we provide justification for the need for appropriate evaluation metrics when scaling to more complex network architectures and learning tasks.
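A minimal sketch of a genetic algorithm over binary weight masks with truncation selection, one-point crossover, and bit-flip mutation; the fitness function here is a toy magnitude-based proxy, whereas the paper evaluates the masked network's classification accuracy.

```python
# Sketch: GA search for a subnetwork mask over fixed random weights.
import numpy as np

rng = np.random.default_rng(0)
n_weights, pop_size, n_gens = 200, 30, 40
W = rng.standard_normal(n_weights)       # fixed random init, never trained

def fitness(mask):
    # Toy stand-in: reward masks that keep large-magnitude weights and are sparse.
    return np.abs(W[mask.astype(bool)]).sum() - 0.5 * mask.sum()

pop = rng.integers(0, 2, size=(pop_size, n_weights))
for _ in range(n_gens):
    scores = np.array([fitness(m) for m in pop])
    parents = pop[np.argsort(scores)[-pop_size // 2:]]    # truncation selection
    cuts = rng.integers(1, n_weights, size=pop_size)      # one-point crossover
    children = np.array([np.concatenate([parents[rng.integers(len(parents))][:c],
                                         parents[rng.integers(len(parents))][c:]])
                         for c in cuts])
    flip = rng.random(children.shape) < 0.01              # bit-flip mutation
    pop = np.where(flip, 1 - children, children)

best = pop[np.argmax([fitness(m) for m in pop])]
print("sparsity of best mask:", 1 - best.mean())
```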
[340] Hi-fi functional priors by learning activations
Marcin Sendera, Amin Sorkhei, Tomasz Kuśmierczyk
Main category: cs.LG
TL;DR: The paper explores using trainable activations in Bayesian Neural Networks (BNNs) to impose function-space priors, enhancing regularization and uncertainty quantification.
Details
Motivation: Function-space priors in BNNs offer intuitive belief embedding but are challenging to implement. The study aims to address this challenge.Method: The study employs optimization techniques with flexible activation models (e.g., Pade functions, piecewise linear functions) to accommodate complex priors.
Result: Empirical results show BNNs with a single wide hidden layer and trainable activations can effectively achieve desired function-space priors.
Conclusion: Trainable activations enable BNNs to match intricate target function distributions, overcoming implementation challenges of function-space priors.
Abstract: Function-space priors in Bayesian Neural Networks (BNNs) provide a more intuitive approach to embedding beliefs directly into the model’s output, thereby enhancing regularization, uncertainty quantification, and risk-aware decision-making. However, imposing function-space priors on BNNs is challenging. We address this task through optimization techniques that explore how trainable activations can accommodate higher-complexity priors and match intricate target function distributions. We investigate flexible activation models, including Pade functions and piecewise linear functions, and discuss the learning challenges related to identifiability, loss construction, and symmetries. Our empirical findings indicate that even BNNs with a single wide hidden layer when equipped with flexible trainable activation, can effectively achieve desired function-space priors.
[341] Position: Causal Machine Learning Requires Rigorous Synthetic Experiments for Broader Adoption
Audrey Poinsot, Panayiotis Panayiotou, Alessandro Leite, Nicolas Chesneau, Özgür Şimşek, Marc Schoenauer
Main category: cs.LG
TL;DR: Synthetic experiments are essential for evaluating causal machine learning methods, despite criticism, and proposed principles can improve their reliability and adoption.
Details
Motivation: Address the underutilization of causal machine learning due to unreliable evaluations and criticism of synthetic experiments.Method: Critically review current evaluation practices and propose principles for rigorous empirical analyses with synthetic data.
Result: Synthetic experiments are necessary for precise assessment, and adopting proposed principles can enhance trust and adoption.
Conclusion: Rigorous synthetic evaluations can drive broader and impactful use of causal machine learning.
Abstract: Causal machine learning has the potential to revolutionize decision-making by combining the predictive power of machine learning algorithms with the theory of causal inference. However, these methods remain underutilized by the broader machine learning community, in part because current empirical evaluations do not permit assessment of their reliability and robustness, undermining their practical utility. Specifically, one of the principal criticisms made by the community is the extensive use of synthetic experiments. We argue, on the contrary, that synthetic experiments are essential and necessary to precisely assess and understand the capabilities of causal machine learning methods. To substantiate our position, we critically review the current evaluation practices, spotlight their shortcomings, and propose a set of principles for conducting rigorous empirical analyses with synthetic data. Adopting the proposed principles will enable comprehensive evaluations that build trust in causal machine learning methods, driving their broader adoption and impactful real-world use.
[342] Stationarity Exploration for Multivariate Time Series Forecasting
Hao Liu, Chun Yang, Zhang xiaoxing, Rui Ma, Xiaobin Zhu
Main category: cs.LG
TL;DR: APRNet is a deep learning model for time series forecasting that decouples amplitude and phase in frequency domain data to better capture stationary information, outperforming existing methods.
Details
Motivation: Existing methods struggle to explore stationary information from complex frequency components in time series data.Method: APRNet models amplitude-phase inter-relationships, uses multivariate input representation, and introduces a KLC module for adaptive local function fitting.
Result: APRNet outperforms state-of-the-art methods in capturing time-varying patterns.
Conclusion: APRNet effectively decouples signal characteristics, enhancing stationary feature capture and forecasting accuracy.
Abstract: Deep learning-based time series forecasting has found widespread applications. Recently, converting time series data into the frequency domain for forecasting has become popular for accurately exploring periodic patterns. However, existing methods often cannot effectively explore stationary information from complex intertwined frequency components. In this paper, we propose a simple yet effective Amplitude-Phase Reconstruct Network (APRNet) that models the inter-relationships of amplitude and phase, which prevents the amplitude and phase from being constrained by different physical quantities, thereby decoupling the distinct characteristics of signals for capturing stationary information. Specifically, we represent the multivariate time series input across sequence and channel dimensions, highlighting the correlation between amplitude and phase at multiple interaction frequencies. We propose a novel Kolmogorov-Arnold-Network-based Local Correlation (KLC) module to adaptively fit local functions using univariate functions, enabling more flexible characterization of stationary features across different amplitudes and phases. This significantly enhances the model’s capability to capture time-varying patterns. Extensive experiments demonstrate the superiority of our APRNet over state-of-the-art methods (SOTAs).
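A minimal sketch of amplitude-phase decoupling in the frequency domain, assuming separate linear maps per component; the paper's KAN-based KLC module is replaced by plain linear layers for illustration, and the sizes are arbitrary.

```python
# Sketch: decouple amplitude and phase via rFFT, process each separately.
import torch
import torch.nn as nn

class AmpPhaseMLP(nn.Module):
    def __init__(self, seq_len, horizon):
        super().__init__()
        n_freq = seq_len // 2 + 1
        self.amp_mlp = nn.Linear(n_freq, n_freq)     # models amplitude on its own scale
        self.phase_mlp = nn.Linear(n_freq, n_freq)   # models phase on its own scale
        self.head = nn.Linear(seq_len, horizon)

    def forward(self, x):                            # x: (batch, channels, seq_len)
        spec = torch.fft.rfft(x, dim=-1)
        amp, phase = spec.abs(), spec.angle()        # decoupled representations
        amp, phase = self.amp_mlp(amp), self.phase_mlp(phase)
        recon = torch.fft.irfft(torch.polar(amp, phase), n=x.size(-1), dim=-1)
        return self.head(recon)

model = AmpPhaseMLP(seq_len=96, horizon=24)
print(model(torch.randn(8, 7, 96)).shape)            # torch.Size([8, 7, 24])
```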
[343] Exploring Cross-Stage Adversarial Transferability in Class-Incremental Continual Learning
Jungwoo Kim, Jong-Seok Lee
Main category: cs.LG
TL;DR: The paper explores the vulnerability of class-incremental continual learning models to stage-transferred adversarial attacks, revealing high susceptibility and inadequate defenses.
Details
Motivation: To investigate the overlooked security issue of adversarial attacks in continual learning, specifically stage-transferred attacks.Method: Analyze vulnerability by transferring adversarial examples from earlier to later stages, examining model similarity and robustness degradation.
Result: Continual learning models are highly vulnerable to stage-transferred attacks, and existing defenses are insufficient.
Conclusion: The study highlights a critical security gap in continual learning, urging improved defenses against stage-transferred attacks.
Abstract: Class-incremental continual learning addresses catastrophic forgetting by enabling classification models to preserve knowledge of previously learned classes while acquiring new ones. However, the vulnerability of the models against adversarial attacks during this process has not been investigated sufficiently. In this paper, we present the first exploration of vulnerability to stage-transferred attacks, i.e., an adversarial example generated using the model in an earlier stage is used to attack the model in a later stage. Our findings reveal that continual learning methods are highly susceptible to these attacks, raising a serious security issue. We explain this phenomenon through model similarity between stages and gradual robustness degradation. Additionally, we find that existing adversarial training-based defense methods are not sufficiently effective against stage-transferred attacks. Codes are available at https://github.com/mcml-official/CSAT.
[344] LNN-PINN: A Unified Physics-Only Training Framework with Liquid Residual Blocks
Ze Tao, Hanxuan Wang, Fujun Liu
Main category: cs.LG
TL;DR: LNN-PINN enhances PINNs with a liquid residual gating architecture, improving predictive accuracy without altering the original physics modeling or optimization pipeline.
Details
Motivation: PINNs often lack predictive accuracy in complex problems, prompting the need for architectural refinements.Method: Introduces a lightweight gating mechanism within hidden-layer mapping while keeping other components unchanged.
Result: Reduces RMSE and MAE across benchmarks, showing adaptability to varying conditions.
Conclusion: LNN-PINN provides an effective architectural enhancement for PINNs in complex applications.
Abstract: Physics-informed neural networks (PINNs) have attracted considerable attention for their ability to integrate partial differential equation priors into deep learning frameworks; however, they often exhibit limited predictive accuracy when applied to complex problems. To address this issue, we propose LNN-PINN, a physics-informed neural network framework that incorporates a liquid residual gating architecture while preserving the original physics modeling and optimization pipeline to improve predictive accuracy. The method introduces a lightweight gating mechanism solely within the hidden-layer mapping, keeping the sampling strategy, loss composition, and hyperparameter settings unchanged to ensure that improvements arise purely from architectural refinement. Across four benchmark problems, LNN-PINN consistently reduced RMSE and MAE under identical training conditions, with absolute error plots further confirming its accuracy gains. Moreover, the framework demonstrates strong adaptability and stability across varying dimensions, boundary conditions, and operator characteristics. In summary, LNN-PINN offers a concise and effective architectural enhancement for improving the predictive accuracy of physics-informed neural networks in complex scientific and engineering problems.
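Since the paper's change is purely architectural and confined to the hidden-layer mapping, a minimal sketch can convey the shape of the idea: a residual block whose update is modulated by a learned sigmoid gate. The block below is a plausible rendition under that assumption, not the exact LNN-PINN layer.

```python
import torch
import torch.nn as nn

class GatedResidualBlock(nn.Module):
    """Hidden-layer mapping with a lightweight multiplicative gate;
    sampling strategy, loss composition, and hyperparameters of the
    surrounding PINN pipeline stay untouched."""
    def __init__(self, width: int):
        super().__init__()
        self.lin = nn.Linear(width, width)
        self.gate = nn.Linear(width, width)

    def forward(self, h):
        g = torch.sigmoid(self.gate(h))         # gate in (0, 1)
        return h + g * torch.tanh(self.lin(h))  # gated residual update

# Drop-in replacement for a plain MLP trunk in a PINN.
trunk = nn.Sequential(nn.Linear(2, 64), GatedResidualBlock(64),
                      GatedResidualBlock(64), nn.Linear(64, 1))
u = trunk(torch.rand(128, 2))  # e.g., (x, t) collocation points
print(u.shape)                 # -> torch.Size([128, 1])
```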
[345] Generalising Traffic Forecasting to Regions without Traffic Observations
Xinyu Su, Majid Sarvi, Feng Liu, Egemen Tanin, Jianzhong Qi
Main category: cs.LG
TL;DR: GenCast improves traffic forecasting for regions without sensors by leveraging external knowledge, physics-informed neural networks, and spatial grouping.
Details
Motivation: Traffic forecasting is vital, but sensor absence in some regions limits existing models' generalisability.Method: GenCast integrates physics-informed neural networks, external signal learning, and spatial grouping to enhance forecasting.
Result: GenCast reduces forecasting errors across multiple real-world datasets.
Conclusion: GenCast effectively addresses the challenge of forecasting in sensor-less regions by combining external knowledge and physical principles.
Abstract: Traffic forecasting is essential for intelligent transportation systems. Accurate forecasting relies on continuous observations collected by traffic sensors. However, due to high deployment and maintenance costs, not all regions are equipped with such sensors. This paper aims to forecast for regions without traffic sensors, where the lack of historical traffic observations challenges the generalisability of existing models. We propose a model named GenCast, the core idea of which is to exploit external knowledge to compensate for the missing observations and to enhance generalisation. We integrate physics-informed neural networks into GenCast, enabling physical principles to regularise the learning process. We introduce an external signal learning module to explore correlations between traffic states and external signals such as weather conditions, further improving model generalisability. Additionally, we design a spatial grouping module to filter localised features that hinder model generalisability. Extensive experiments show that GenCast consistently reduces forecasting errors on multiple real-world datasets.
[346] GRAVITY: A Controversial Graph Representation Learning for Vertex Classification
Etienne Gael Tajeuna, Jean Marie Tshimula
Main category: cs.LG
TL;DR: GRAVITY is a graph-based framework for vertex classification, using dynamic interactions and a latent potential field to improve accuracy.
Details
Motivation: To enhance vertex classification by mimicking physical self-organization under attractive forces, addressing limitations of static neighborhood methods.Method: Models vertices as exerting influence via learned interactions (structural proximity, attribute similarity), creating a latent potential field for dynamic aggregation.
Result: Competitive embeddings, excelling in transductive and inductive vertex classification tasks.
Conclusion: GRAVITY’s adaptive, field-driven approach sharpens class boundaries and improves semantic coherence.
Abstract: In the quest for accurate vertex classification, we introduce GRAVITY (Graph-based Representation leArning via Vertices Interaction TopologY), a framework inspired by physical systems where objects self-organize under attractive forces. GRAVITY models each vertex as exerting influence through learned interactions shaped by structural proximity and attribute similarity. These interactions induce a latent potential field in which vertices move toward energy-efficient positions, coalescing around class-consistent attractors and distancing themselves from unrelated groups. Unlike traditional message-passing schemes with static neighborhoods, GRAVITY adaptively modulates the receptive field of each vertex based on a learned force function, enabling dynamic aggregation driven by context. This field-driven organization sharpens class boundaries and promotes semantic coherence within latent clusters. Experiments on real-world benchmarks show that GRAVITY yields competitive embeddings, excelling in both transductive and inductive vertex classification tasks.
[347] Fre-CW: Targeted Attack on Time Series Forecasting using Frequency Domain Loss
Naifu Feng, Lixing Chen, Junhua Tang, Hua Ding, Jianhua Li, Yang Bai
Main category: cs.LG
TL;DR: The paper introduces a frequency-domain-based adversarial attack method for time series forecasting, showing current models’ vulnerability to such attacks.
Details
Motivation: Despite progress in Transformer-based time series forecasting, adversarial robustness remains understudied, especially regarding frequency-domain features.Method: Adapts a classification attack method to prediction tasks, optimizing adversarial samples with time and frequency-domain losses.
Result: Demonstrates current models’ vulnerability to adversarial attacks; the proposed method performs well on major datasets.
Conclusion: Highlights the need for adversarial robustness in time series forecasting and introduces a novel frequency-domain attack approach.
Abstract: Transformer-based models have made significant progress in time series forecasting. However, a key limitation of deep learning models is their susceptibility to adversarial attacks, which has not been sufficiently studied in the context of time series prediction. In contrast to areas such as computer vision, where adversarial robustness has been extensively studied, frequency-domain features of time series data play an important role in the prediction task but have not been sufficiently explored in terms of adversarial attacks. This paper proposes a time series prediction attack algorithm based on frequency-domain loss. Specifically, we adapt an attack method originally designed for classification tasks to the prediction field and optimize the adversarial samples using both time-domain and frequency-domain losses. To the best of our knowledge, there is no prior research on using frequency information for time-series adversarial attacks. Our experimental results show that current time series prediction models are vulnerable to adversarial attacks, and our approach achieves excellent performance on major time series forecasting datasets.
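The core recipe, optimizing adversarial perturbations against a mix of time-domain and frequency-domain distances, can be sketched as a single loss; the weighting, norms, and use of magnitude spectra below are illustrative assumptions rather than the paper's exact formulation.

```python
import torch

def attack_objective(pred_adv, pred_clean, alpha=0.5):
    """Mix of time-domain and frequency-domain distances between the
    forecasts on perturbed and clean inputs; an attacker maximizes
    this w.r.t. the input perturbation."""
    time_loss = (pred_adv - pred_clean).abs().mean()
    # Compare magnitude spectra along the forecast horizon.
    freq_adv = torch.fft.rfft(pred_adv, dim=-1).abs()
    freq_clean = torch.fft.rfft(pred_clean, dim=-1).abs()
    freq_loss = (freq_adv - freq_clean).abs().mean()
    return alpha * time_loss + (1 - alpha) * freq_loss

pred_clean = torch.sin(torch.linspace(0, 6.28, 96)).unsqueeze(0)
pred_adv = pred_clean + 0.1 * torch.randn(1, 96)
print(attack_objective(pred_adv, pred_clean))
```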
[348] Integrating attention into explanation frameworks for language and vision transformers
Marte Eggen, Jacob Lysnæs-Larsen, Inga Strümke
Main category: cs.LG
TL;DR: The paper explores using attention weights in transformers to enhance explainability in AI, proposing two novel methods integrating attention into Shapley values and concept activation vectors for local and global explanations.
Details
Motivation: Attention weights in transformers offer interpretable signals, but their direct role in model outputs is unclear. The study aims to leverage these weights to improve explainability in AI frameworks.Method: Develops two methods: (1) integrating attention weights into Shapley value decomposition for local explanations, and (2) incorporating them into token-level directional derivatives via concept activation vectors for global explanations.
Result: Empirical evaluations show attention weights can meaningfully enhance transformer explainability in both NLP and computer vision tasks.
Conclusion: Attention weights enrich transformer explainability, demonstrating their value when integrated into XAI frameworks.
Abstract: The attention mechanism lies at the core of the transformer architecture, providing an interpretable model-internal signal that has motivated a growing interest in attention-based model explanations. Although attention weights do not directly determine model outputs, they reflect patterns of token influence that can inform and complement established explainability techniques. This work studies the potential of utilising the information encoded in attention weights to provide meaningful model explanations by integrating them into explainable AI (XAI) frameworks that target fundamentally different aspects of model behaviour. To this end, we develop two novel explanation methods applicable to both natural language processing and computer vision tasks. The first integrates attention weights into the Shapley value decomposition by redefining the characteristic function in terms of pairwise token interactions via attention weights, thus adapting this widely used game-theoretic solution concept to provide attention-driven attributions for local explanations. The second incorporates attention weights into token-level directional derivatives defined through concept activation vectors to measure concept sensitivity for global explanations. Our empirical evaluations on standard benchmarks and in a comparison study with widely used explanation methods show that attention weights can be meaningfully incorporated into the studied XAI frameworks, highlighting their value in enriching transformer explainability.
[349] Low-Regret and Low-Complexity Learning for Hierarchical Inference
Sameep Chattopadhyay, Vinay Sutar, Jaya Prakash Champati, Sharayu Moharir
Main category: cs.LG
TL;DR: The paper introduces Hierarchical Inference Learning (HIL) for edge intelligence systems, proposing two policies (HI-LCB and HI-LCB-lite) to optimize local and remote model collaboration, achieving superior performance and efficiency.
Details
Motivation: To address the challenge of dynamically estimating incorrect local inference likelihood in hierarchical edge intelligence systems, especially under changing data distributions and offloading costs.Method: Proposes HI-LCB and HI-LCB-lite policies using the Upper Confidence Bound (UCB) framework, modeling correct local inference probability as a function of confidence.
Result: Achieves order-optimal regret of O(log T), outperforming existing methods with O(T^{2/3}) regret, and HI-LCB-lite offers O(1) computational complexity.
Conclusion: The proposed policies significantly improve accuracy, latency, and bandwidth usage, making them practical for resource-limited edge devices.
Abstract: This work focuses on Hierarchical Inference (HI) in edge intelligence systems, where a compact Local-ML model on an end-device works in conjunction with a high-accuracy Remote-ML model on an edge-server. HI aims to reduce latency, improve accuracy, and lower bandwidth usage by first using the Local-ML model for inference and offloading to the Remote-ML only when the local inference is likely incorrect. A critical challenge in HI is estimating the likelihood of the local inference being incorrect, especially when data distributions and offloading costs change over time – a problem we term Hierarchical Inference Learning (HIL). We introduce a novel approach to HIL by modeling the probability of correct inference by the Local-ML as an increasing function of the model’s confidence measure, a structure motivated by empirical observations but previously unexploited. We propose two policies, HI-LCB and HI-LCB-lite, based on the Upper Confidence Bound (UCB) framework. We demonstrate that both policies achieve order-optimal regret of $O(\log T)$, a significant improvement over existing HIL policies with $O(T^{2/3})$ regret guarantees. Notably, HI-LCB-lite has an $O(1)$ per-sample computational complexity, making it well-suited for deployment on devices with severe resource limitations. Simulations using real-world datasets confirm that our policies outperform existing state-of-the-art HIL methods.
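The confidence-monotone structure can be exploited with a simple confidence-binned bandit rule; the sketch below is a toy rendition of the UCB-style offloading logic (bin counts, exploration bonus, and cost threshold are illustrative choices, not the exact HI-LCB policy).

```python
import numpy as np

class UCBOffloadPolicy:
    """Confidence-binned UCB rule for hierarchical inference: offload
    to the Remote-ML only when even an optimistic estimate of local
    accuracy does not beat the (cost-adjusted) remote alternative."""
    def __init__(self, n_bins=10, offload_cost=0.3):
        self.correct = np.zeros(n_bins)
        self.counts = np.zeros(n_bins)
        self.offload_cost = offload_cost  # normalized cost of offloading

    def _bin(self, conf):
        return min(int(conf * len(self.counts)), len(self.counts) - 1)

    def should_offload(self, conf, t):
        b = self._bin(conf)
        if self.counts[b] == 0:
            return False  # explore the local model first
        mean = self.correct[b] / self.counts[b]
        bonus = np.sqrt(2 * np.log(t + 1) / self.counts[b])
        return mean + bonus < 1.0 - self.offload_cost

    def update(self, conf, was_correct):
        b = self._bin(conf)
        self.counts[b] += 1
        self.correct[b] += float(was_correct)
```

Per decision, this rule needs only a bin lookup and two scalar updates, which is in the spirit of the paper's emphasis on low per-sample computation for resource-limited devices.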
[350] MechaFormer: Sequence Learning for Kinematic Mechanism Design Automation
Diana Bolanos, Mohammadmehdi Ataei, Pradeep Kumar Jayaraman
Main category: cs.LG
TL;DR: MechaFormer, a Transformer-based model, translates target curves into mechanism designs, outperforming baselines with high accuracy and diversity.
Details
Motivation: The challenge of designing mechanical mechanisms for specific paths is complex due to vast search spaces of topologies and parameters.Method: MechaFormer treats mechanism design as conditional sequence generation, translating curves into domain-specific language (DSL) strings to determine topology and parameters.
Result: Achieves state-of-the-art path-matching accuracy, generates diverse novel designs, and improves solution quality with sampling strategies.
Conclusion: MechaFormer’s outputs enhance traditional optimizers, forming a hybrid approach for superior, efficient solutions.
Abstract: Designing mechanical mechanisms to trace specific paths is a classic yet notoriously difficult engineering problem, characterized by a vast and complex search space of discrete topologies and continuous parameters. We introduce MechaFormer, a Transformer-based model that tackles this challenge by treating mechanism design as a conditional sequence generation task. Our model learns to translate a target curve into a domain-specific language (DSL) string, simultaneously determining the mechanism’s topology and geometric parameters in a single, unified process. MechaFormer significantly outperforms existing baselines, achieving state-of-the-art path-matching accuracy and generating a wide diversity of novel and valid designs. We demonstrate a suite of sampling strategies that can dramatically improve solution quality and offer designers valuable flexibility. Furthermore, we show that the high-quality outputs from MechaFormer serve as excellent starting points for traditional optimizers, creating a hybrid approach that finds superior solutions with remarkable efficiency.
[351] FetFIDS: A Feature Embedding Attention based Federated Network Intrusion Detection Algorithm
Shreya Ghosh, Abu Shafin Mohammad Mahdee Jameel, Aly El Gamal
Main category: cs.LG
TL;DR: FetFIDS improves intrusion detection by using feature embedding in a transformer-based deep learning system, optimized for federated learning environments.
Details
Motivation: To enhance intrusion detection performance and privacy in edge learning scenarios.Method: Feature embedding replaces positional embedding in a transformer-based deep learning system, tested in federated learning.
Result: FetFIDS outperforms state-of-the-art IDS in federated environments and shows high suitability for federated learning.
Conclusion: FetFIDS is effective for improving intrusion detection in federated learning settings, balancing performance and privacy.
Abstract: Intrusion Detection Systems (IDS) have an increasingly important role in preventing exploitation of network vulnerabilities by malicious actors. Recent deep-learning-based developments have resulted in significant improvements in IDS performance. In this paper, we present FetFIDS, where we explore the employment of feature embedding instead of positional embedding to improve the intrusion detection performance of a transformer-based deep learning system. Our model is developed with the aim of deployments in edge learning scenarios, where federated learning over multiple communication rounds can ensure both privacy and localized performance improvements. FetFIDS outperforms multiple state-of-the-art intrusion detection systems in a federated environment and demonstrates a high degree of suitability to federated learning. The code for this work can be found at https://github.com/ghosh64/fetfids.
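To make the core substitution concrete, here is a hypothetical sketch of a transformer encoder that treats each tabular feature as a token and adds a learned per-feature embedding where a positional embedding would normally go; the layer sizes and classification head are placeholders, not the released FetFIDS architecture.

```python
import torch
import torch.nn as nn

class FeatureEmbeddingEncoder(nn.Module):
    """Each of the F features becomes a token whose content is its
    scalar value; a learned per-feature embedding replaces the usual
    positional embedding."""
    def __init__(self, n_features: int, d_model: int = 64):
        super().__init__()
        self.value_proj = nn.Linear(1, d_model)
        self.feat_embed = nn.Embedding(n_features, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, 2)      # benign vs. intrusion

    def forward(self, x):                      # x: (batch, n_features)
        tok = self.value_proj(x.unsqueeze(-1))
        idx = torch.arange(x.size(1), device=x.device)
        tok = tok + self.feat_embed(idx)       # feature, not positional
        return self.head(self.encoder(tok).mean(dim=1))

model = FeatureEmbeddingEncoder(n_features=20)
print(model(torch.rand(8, 20)).shape)          # -> torch.Size([8, 2])
```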
[352] Causal Machine Learning for Patient-Level Intraoperative Opioid Dose Prediction from Electronic Health Records
Jonas Valbjørn Andersen, Anders Peder Højer Karlsen, Markus Harboe Olsen, Nikolaj Krebs Pedersen
Main category: cs.LG
TL;DR: OPIAID algorithm predicts personalized opioid dosages using machine learning on EHR data to optimize pain management and reduce adverse events.
Details
Motivation: To improve pain management and minimize opioid-related adverse events by providing personalized dosage recommendations.Method: Uses causal machine learning on observational EHR data, considering patient-specific traits and opiate influences.
Result: Enables personalized opioid dose recommendations tailored to individual patient needs.
Conclusion: OPIAID offers a promising approach for safer and more effective opioid pain management.
Abstract: This paper introduces the OPIAID algorithm, a novel approach for predicting and recommending personalized opioid dosages for individual patients. The algorithm optimizes pain management while minimizing opioid-related adverse events (ORADE) by employing machine learning models trained on observational electronic health records (EHR) data. It leverages a causal machine learning approach to understand the relationship between opioid dose, case-specific patient and intraoperative characteristics, and pain versus ORADE outcomes. The OPIAID algorithm considers patient-specific characteristics and the influence of different opiates, enabling personalized dose recommendations. This paper outlines the algorithm's methodology and architecture, discusses key assumptions, and describes approaches to evaluating its performance.
[353] Meta-learning optimizes predictions of missing links in real-world networks
Bisman Singh, Lucy Van Kleunen, Aaron Clauset
Main category: cs.LG
TL;DR: The paper evaluates link prediction methods for incomplete relational data, comparing model stacking, neural networks, and topological predictors. It finds no single best method, introduces a meta-learning algorithm for optimal performance, and highlights the impact of network characteristics on algorithm choice.
Details
Motivation: To determine the best link prediction methods for incomplete networks and understand how network characteristics influence algorithm performance.Method: Systematic comparison of four stacking algorithms, 42 topological link predictors, and two graph neural networks across 550 real-world networks, using AUC and Top-k accuracy measures. Introduces a meta-learning algorithm for optimal method selection.
Result: No single algorithm performs best across all networks. Model stacking with random forest is highly scalable and competitive with graph neural networks. Performance depends on network characteristics like degree distribution and triangle density. The meta-learning algorithm outperforms state-of-the-art methods.
Conclusion: Algorithm choice for link prediction should consider network characteristics. The introduced meta-learning algorithm optimizes performance by selecting the best method per network, offering scalability and superior accuracy.
Abstract: Relational data are ubiquitous in real-world data applications, e.g., in social network analysis or biological modeling, but networks are nearly always incompletely observed. The state-of-the-art for predicting missing links in the hard case of a network without node attributes uses model stacking or neural network techniques. It remains unknown which approach is best, and whether or how the best choice of algorithm depends on the input network's characteristics. We answer these questions systematically using a large, structurally diverse benchmark of 550 real-world networks under two standard accuracy measures (AUC and Top-k), comparing four stacking algorithms with 42 topological link predictors, two of which we introduce here, and two graph neural network algorithms. We show that no algorithm is best across all input networks, all algorithms perform well on most social networks, and few perform well on economic and biological networks. Overall, model stacking with a random forest is highly scalable, and it either surpasses graph neural networks on AUC or remains competitive with them on Top-k accuracy. However, algorithm performance depends strongly on network characteristics like the degree distribution, triangle density, and degree assortativity. We introduce a meta-learning algorithm that exploits this variability to optimize link predictions for individual networks by selecting the best algorithm to apply, which we show outperforms all state-of-the-art algorithms and scales to large networks.
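The stacking baseline that the paper finds so competitive is easy to picture: compute several topological predictors as features and let a random forest combine them. The sketch below uses just two of the 42 predictors and a deliberately tiny setup, so it illustrates the mechanics only.

```python
import networkx as nx
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def stack_features(G, pairs):
    """Two classic topological predictors as stacked features."""
    cn = [len(list(nx.common_neighbors(G, u, v))) for u, v in pairs]
    jc = [score for _, _, score in nx.jaccard_coefficient(G, pairs)]
    return np.column_stack([cn, jc])

# Hide some edges; label node pairs by whether a (hidden) edge exists.
G = nx.karate_club_graph()
hidden = list(G.edges())[:20]
G_obs = G.copy()
G_obs.remove_edges_from(hidden)
pairs = hidden + list(nx.non_edges(G))[:20]
y = np.array([1] * 20 + [0] * 20)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(stack_features(G_obs, pairs), y)
print(clf.predict_proba(stack_features(G_obs, pairs))[:3, 1])
```

The paper's meta-learner sits one level above this: given a network's characteristics (degree distribution, triangle density, assortativity), it selects which link-prediction algorithm to run on that network.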
[354] Scaling Up Active Testing to Large Language Models
Gabrielle Berrada, Jannik Kossen, Muhammed Razzak, Freddie Bickford Smith, Yarin Gal, Tom Rainforth
Main category: cs.LG
TL;DR: Active testing scales to evaluate large language models (LLMs) efficiently by using in-context learning for surrogate models, reducing computational costs and data needs.
Details
Motivation: Address the high computational costs of active testing for large models, making it feasible for LLMs.Method: Use in-context learning to cheaply construct surrogate models, avoid updates during active testing, and employ smaller surrogate models. Also, introduce a single-run error estimator.
Result: The approach evaluates LLM performance more effectively with less data than standard practices.
Conclusion: Active testing can be efficiently scaled for LLMs, reducing computational and data requirements.
Abstract: Active testing enables label-efficient evaluation of models through careful data acquisition. However, its significant computational costs have previously undermined its use for large models. We show how it can be successfully scaled up to the evaluation of large language models (LLMs). In particular, we show that the surrogate model used to guide data acquisition can be constructed cheaply using in-context learning, does not require updating within an active-testing loop, and can be smaller than the target model. We even find we can make good data-acquisition decisions without computing predictions with the target model and further introduce a single-run error estimator to assess how well active testing is working on the fly. We find that our approach is able to more effectively evaluate LLM performance with less data than current standard practices.
[355] Chi-Geometry: A Library for Benchmarking Chirality Prediction of GNNs
Rylie Weaver, Massimiliano Lupo Pasini
Main category: cs.LG
TL;DR: Chi-Geometry is a library for generating synthetic graph data to test GNNs’ chirality prediction, enabling interpretable benchmarking and guiding new GNN designs.
Details
Motivation: To provide a controlled, synthetic dataset for benchmarking GNNs' ability to predict chirality, minimizing confounding factors.Method: Generates synthetic graphs with user-specified traits, randomized node positions/species, and labeled chiral centers (R/S). Combines samples into datasets for node classification tasks.
Result: Demonstrated efficacy by benchmarking SOTA GNNs, leading to two new architectures: one with all-to-all connections (high accuracy, quadratic cost) and one with a virtual node (linear cost, competitive accuracy).
Conclusion: Chi-Geometry enables better benchmarking and inspires improved GNN designs for chirality prediction.
Abstract: We introduce Chi-Geometry - a library that generates graph data for testing and benchmarking GNNs’ ability to predict chirality. Chi-Geometry generates synthetic graph samples with (i) user-specified geometric and topological traits to isolate certain types of samples and (ii) randomized node positions and species to minimize extraneous correlations. Each generated graph contains exactly one chiral center labeled either R or S, while all other nodes are labeled N/A (non-chiral). The generated samples are then combined into a cohesive dataset that can be used to assess a GNN’s ability to predict chirality as a node classification task. Chi-Geometry allows more interpretable and less confounding benchmarking of GNNs for prediction of chirality in the graph samples which can guide the design of new GNN architectures with improved predictive performance. We illustrate Chi-Geometry’s efficacy by using it to generate synthetic datasets for benchmarking various state-of-the-art (SOTA) GNN architectures. The conclusions of these benchmarking results guided our design of two new GNN architectures. The first GNN architecture established all-to-all connections in the graph to accurately predict chirality across all challenging configurations where previously tested SOTA models failed, but at a computational cost (both for training and inference) that grows quadratically with the number of graph nodes. The second GNN architecture avoids all-to-all connections by introducing a virtual node in the original graph structure of the data, which restores the linear scaling of training and inference computational cost with respect to the number of nodes in the graph, while still ensuring competitive accuracy in detecting chirality with respect to SOTA GNN architectures.
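The virtual-node trick behind the second architecture is compact enough to show directly: one extra node connected bidirectionally to every real node gives a two-hop path between any pair at linear cost. A minimal sketch on a COO edge list (the format used by PyG-style GNNs) follows; it is generic, not Chi-Geometry's own code.

```python
import torch

def add_virtual_node(edge_index, num_nodes):
    """Append one virtual node wired to every real node: ~2N extra
    edges provide global connectivity instead of the ~N^2 edges of
    an all-to-all graph."""
    v = num_nodes                              # index of the new node
    nodes = torch.arange(num_nodes)
    to_v = torch.stack([nodes, torch.full((num_nodes,), v)])
    from_v = torch.stack([torch.full((num_nodes,), v), nodes])
    return torch.cat([edge_index, to_v, from_v], dim=1), num_nodes + 1

# Toy 4-node path graph in COO format.
edge_index = torch.tensor([[0, 1, 2], [1, 2, 3]])
ei, n = add_virtual_node(edge_index, num_nodes=4)
print(ei.shape, n)  # -> torch.Size([2, 11]) 5
```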
[356] Bridging Formal Language with Chain-of-Thought Reasoning to Geometry Problem Solving
Tianyun Yang, Yunwen Li, Ziniu Li, Zhihang Lin, Ruoyu Sun, Tian Ding
Main category: cs.LG
TL;DR: The paper introduces GF-Reasoner, a model combining Chain-of-Thought reasoning with formal language to improve Geometry Problem Solving, achieving 15% accuracy gains over peers.
Details
Motivation: Address limitations of large vision-language models in Geometry Problem Solving (GPS) due to unreliable diagram interpretation and lack of intermediate reasoning.Method: Integrates Chain-of-Thought (CoT) with formal language, interleaving natural language reasoning with solver-executable code. Uses supervised fine-tuning on synthetic data and solver-in-the-loop reinforcement learning.
Result: GF-Reasoner achieves up to 15% accuracy improvements on GPS benchmarks, surpassing 7B-scale peers and larger models like Qwen2.5-VL-72B.
Conclusion: The hybrid reasoning approach enhances accuracy and clarity in GPS, with insights on design choices for future research.
Abstract: Large vision language models exhibit notable limitations on Geometry Problem Solving (GPS) because of their unreliable diagram interpretation and pure natural-language reasoning. A recent line of work mitigates this by using symbolic solvers: the model directly generates a formal program that a geometry solver can execute. However, this direct program generation lacks intermediate reasoning, making the decision process opaque and prone to errors. In this work, we explore a new approach that integrates Chain-of-Thought (CoT) with formal language. The model interleaves natural language reasoning with incremental emission of solver-executable code, producing a hybrid reasoning trace in which critical derivations are expressed in formal language. To teach this behavior at scale, we combine (1) supervised fine-tuning on an 11K newly developed synthetic dataset with interleaved natural language reasoning and automatic formalization, and (2) solver-in-the-loop reinforcement learning that jointly optimizes both the CoT narrative and the resulting program through outcome-based rewards. Built on Qwen2.5-VL-7B, our new model, named GF-Reasoner, achieves up to 15% accuracy improvements on standard GPS benchmarks, surpassing both 7B-scale peers and the much larger model Qwen2.5-VL-72B. By exploiting high-order geometric knowledge and offloading symbolic computation to the solver, the generated reasoning traces are noticeably shorter and cleaner. Furthermore, we present a comprehensive analysis of method design choices (e.g., reasoning paradigms, data synthesis, training epochs, etc.), providing actionable insights for future research.
[357] Towards Universal Neural Inference
Shreyas Bhat Brahmavar, Yang Li, Junier Oliva
Main category: cs.LG
TL;DR: ASPIRE is a universal neural model for reasoning over heterogeneous structured data, using a permutation-invariant Transformer and semantic grounding to align and predict across disjoint datasets.
Details
Motivation: Real-world data is diverse and disjoint, making it hard to build general-purpose models that work across datasets.Method: ASPIRE combines a set-based Transformer with semantic grounding using natural language, metadata, and examples to learn cross-dataset dependencies.
Result: ASPIRE generalizes to new tasks without tuning, performs well on benchmarks, and supports cost-aware feature acquisition.
Conclusion: ASPIRE advances universal, semantics-aware inference for structured data.
Abstract: Real-world data often appears in diverse, disjoint forms – with varying schemas, inconsistent semantics, and no fixed feature ordering – making it challenging to build general-purpose models that can leverage information across datasets. We introduce ASPIRE, Arbitrary Set-based Permutation-Invariant Reasoning Engine, a Universal Neural Inference model for semantic reasoning and prediction over heterogeneous structured data. ASPIRE combines a permutation-invariant, set-based Transformer with a semantic grounding module that incorporates natural language descriptions, dataset metadata, and in-context examples to learn cross-dataset feature dependencies. This architecture allows ASPIRE to ingest arbitrary sets of feature–value pairs and support examples, align semantics across disjoint tables, and make predictions for any specified target. Once trained, ASPIRE generalizes to new inference tasks without additional tuning. In addition to delivering strong results across diverse benchmarks, ASPIRE naturally supports cost-aware active feature acquisition in an open-world setting, selecting informative features under test-time budget constraints for an arbitrary unseen dataset. These capabilities position ASPIRE as a step toward truly universal, semantics-aware inference over structured data.
[358] Deep Neural Network Calibration by Reducing Classifier Shift with Stochastic Masking
Jiani Ni, He Zhao, Yibo Yang, Dandan Guo
Main category: cs.LG
TL;DR: MaC-Cal is a mask-based classifier calibration method using stochastic sparsity to improve DNN confidence alignment with accuracy, addressing underconfidence issues.
Details
Motivation: DNNs often suffer from poor calibration, especially in safety-critical fields like autonomous driving and healthcare, where unreliable confidence estimates can have serious consequences. Existing methods overlook underconfidence errors.Method: MaC-Cal employs a two-stage training scheme with adaptive sparsity, dynamically adjusting mask retention rates based on confidence-accuracy deviation.
Result: MaC-Cal achieves superior calibration performance and robustness under data corruption.
Conclusion: MaC-Cal provides a practical and effective solution for reliable confidence estimation in DNNs.
Abstract: In recent years, deep neural networks (DNNs) have shown competitive results in many fields. Despite this success, they often suffer from poor calibration, especially in safety-critical scenarios such as autonomous driving and healthcare, where unreliable confidence estimates can lead to serious consequences. Recent studies have focused on improving calibration by modifying the classifier, yet such efforts remain limited. Moreover, most existing approaches overlook calibration errors caused by underconfidence, which can be equally detrimental. To address these challenges, we propose MaC-Cal, a novel mask-based classifier calibration method that leverages stochastic sparsity to enhance the alignment between confidence and accuracy. MaC-Cal adopts a two-stage training scheme with adaptive sparsity, dynamically adjusting mask retention rates based on the deviation between confidence and accuracy. Extensive experiments show that MaC-Cal achieves superior calibration performance and robustness under data corruption, offering a practical and effective solution for reliable confidence estimation in DNNs.
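The adaptive-sparsity idea, tying the mask retention rate to the confidence-accuracy gap, admits a very small sketch. The update rule and clipping range below are illustrative assumptions, not the published MaC-Cal schedule.

```python
import torch

def update_retention(retain, confidence, accuracy, lr=0.1,
                     lo=0.5, hi=1.0):
    """Overconfident (confidence > accuracy) -> lower retention, i.e.
    mask more weights; underconfident -> retain more."""
    gap = confidence - accuracy
    return float(min(max(retain - lr * gap, lo), hi))

def stochastic_mask(weight, retain):
    """Bernoulli mask on classifier weights, inverted-dropout style."""
    mask = (torch.rand_like(weight) < retain).float()
    return weight * mask / retain

retain = 0.9
retain = update_retention(retain, confidence=0.92, accuracy=0.80)
print(retain)  # retention drops because the model is overconfident
```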
[359] Tame Riemannian Stochastic Approximation
Johannes Aspman, Vyacheslav Kungurtsev, Reza Roohi Seraji
Main category: cs.LG
TL;DR: The paper explores stochastic approximation for tame nondifferentiable functions on Riemannian manifolds, showing theoretical guarantees for convergence and validating the use of SGD in such settings.
Details
Motivation: To extend the understanding of tame functions and stochastic sub-gradient descent to Riemannian manifolds, given their relevance in deep learning and optimization.Method: The study employs stochastic sub-gradient descent (SGD) adapted for Riemannian manifolds, analyzing its convergence properties and numerical performance.
Result: Theoretical guarantees for expected function decrease and asymptotic convergence are established, and numerical experiments support the effectiveness of the approach.
Conclusion: The paper confirms the suitability of SGD for tame optimization on Riemannian manifolds, providing both theoretical and empirical validation.
Abstract: We study the properties of stochastic approximation applied to a tame nondifferentiable function subject to constraints defined by a Riemannian manifold. The objective landscape of tame functions, arising in o-minimal topology extended to a geometric category when generalized to manifolds, exhibits some structure that enables theoretical guarantees of expected function decrease and asymptotic convergence for generic stochastic sub-gradient descent. Recent work has shown that this class of functions faithfully models the loss landscape of deep neural network training objectives, and that the autograd operation used in deep learning packages implements a variant of subgradient descent with the correct properties for convergence. Riemannian optimization uses geometric properties of a constraint set to perform a minimization procedure while enforcing that the optimization variable lies on a Riemannian manifold. This paper presents the first study of tame optimization on Riemannian manifolds, highlighting the rich geometric structure of the problem and confirming the appropriateness of the canonical "SGD" for such problems through the analysis and numerical study of a simple Retracted SGD algorithm.
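A retraction-based SGD step is easy to state concretely on the unit sphere, where the retraction is just renormalization. The example below minimizes a smooth quadratic, which sidesteps the tame nondifferentiable setting of the paper; it is meant only to show the project-step-retract pattern of the analyzed algorithm.

```python
import numpy as np

def retracted_sgd_step(x, euclid_grad, lr=0.05):
    """One Riemannian SGD step on the sphere S^{n-1}: project the
    Euclidean gradient onto the tangent space at x, take a step,
    then retract back to the manifold by normalizing."""
    tangent = euclid_grad - np.dot(euclid_grad, x) * x
    y = x - lr * tangent
    return y / np.linalg.norm(y)  # retraction

# Minimize f(x) = x^T A x on the sphere (smallest-eigenvector problem).
rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5)); A = A + A.T
x = rng.standard_normal(5); x /= np.linalg.norm(x)
for _ in range(300):
    x = retracted_sgd_step(x, 2 * A @ x)
print(x @ A @ x, np.linalg.eigvalsh(A)[0])  # close to the minimum
```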
[360] BELLA: Black box model Explanations by Local Linear Approximations
Nedeljko Radulovic, Albert Bifet, Fabian Suchanek
Main category: cs.LG
TL;DR: BELLA is a deterministic, model-agnostic method for explaining individual predictions of regression black-box models, offering accurate and general explanations without relying on synthetic data.
Details
Motivation: Current post-hoc explanation methods for regression models use synthetic data, introducing uncertainty and limited applicability, necessitating a more reliable and general approach.Method: BELLA provides explanations via a linear model trained in the feature space, maximizing the neighborhood size for accuracy, simplicity, and robustness.
Result: BELLA produces explanations that are accurate, simple, general, and robust, addressing the limitations of existing methods.
Conclusion: BELLA offers a deterministic and reliable solution for explaining regression black-box models, improving upon synthetic-data-dependent approaches.
Abstract: Understanding the decision-making process of black-box models has become not just a legal requirement, but also an additional way to assess their performance. However, state-of-the-art post-hoc explanation approaches for regression models rely on synthetic data generation, which introduces uncertainty and can hurt the reliability of the explanations. Furthermore, they tend to produce explanations that apply to only very few data points. In this paper, we present BELLA, a deterministic model-agnostic post-hoc approach for explaining the individual predictions of regression black-box models. BELLA provides explanations in the form of a linear model trained in the feature space. BELLA maximizes the size of the neighborhood to which the linear model applies, so that the explanations are accurate, simple, general, and robust.
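BELLA's neighborhood maximization can be caricatured in a few lines: grow a ball of real data points around the instance and keep the largest neighborhood on which a linear surrogate still tracks the black box. The tolerance, growth schedule, and fit criterion below are toy choices, not the published algorithm.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def local_linear_explanation(X, y_bb, x0, tol=0.1, min_pts=10):
    """Largest neighborhood of x0 (over real points, no synthetic
    sampling) on which a linear surrogate fits the black-box outputs
    within `tol` RMSE."""
    order = np.argsort(np.linalg.norm(X - x0, axis=1))
    best = None
    for k in range(min_pts, len(X) + 1):
        idx = order[:k]
        lin = LinearRegression().fit(X[idx], y_bb[idx])
        rmse = np.sqrt(np.mean((lin.predict(X[idx]) - y_bb[idx]) ** 2))
        if rmse <= tol:
            best = (lin, k)  # keep the largest valid neighborhood
    return best

rng = np.random.default_rng(1)
X = rng.uniform(-2, 2, size=(400, 3))
black_box = lambda Z: np.sin(Z[:, 0]) + 0.5 * Z[:, 1]  # stand-in model
lin, k = local_linear_explanation(X, black_box(X), X[0])
print(k, lin.coef_)  # neighborhood size and local feature weights
```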
[361] Neural Operator Variational Inference based on Regularized Stein Discrepancy for Deep Gaussian Processes
Jian Xu, Shian Du, Junmei Yang, Qianli Ma, Delu Zeng
Main category: cs.LG
TL;DR: NOVI introduces a neural operator for variational inference in Deep Gaussian Processes, improving expressiveness and computational efficiency while ensuring robust error control.
Details
Motivation: Existing approximations for Deep Gaussian Processes (DGPs) limit expressiveness or are computationally expensive, prompting the need for a more effective method.Method: NOVI uses a neural generator to sample and minimizes Regularized Stein Discrepancy in L2 space, employing Monte Carlo estimation and subsampling stochastic optimization.
Result: Achieves 93.56% accuracy on CIFAR10, outperforming state-of-the-art Gaussian process methods, with faster convergence and controlled prediction error.
Conclusion: NOVI enhances DGP performance, offering theoretical guarantees and practical benefits for Bayesian nonparametric models.
Abstract: Deep Gaussian Process (DGP) models offer a powerful nonparametric approach for Bayesian inference, but exact inference is typically intractable, motivating the use of various approximations. However, existing approaches, such as mean-field Gaussian assumptions, limit the expressiveness and efficacy of DGP models, while stochastic approximation can be computationally expensive. To tackle these challenges, we introduce Neural Operator Variational Inference (NOVI) for Deep Gaussian Processes. NOVI uses a neural generator to obtain a sampler and minimizes the Regularized Stein Discrepancy in L2 space between the generated distribution and true posterior. We solve the minimax problem using Monte Carlo estimation and subsampling stochastic optimization techniques. We demonstrate that the bias introduced by our method can be controlled by multiplying the Fisher divergence with a constant, which leads to robust error control and ensures the stability and precision of the algorithm. Our experiments on datasets ranging from hundreds to tens of thousands of samples demonstrate the effectiveness and the faster convergence rate of the proposed method. We achieve a classification accuracy of 93.56% on the CIFAR10 dataset, outperforming SOTA Gaussian process methods. Furthermore, our method guarantees theoretically controlled prediction error for DGP models and demonstrates remarkable performance on various datasets. We are optimistic that NOVI has the potential to enhance the performance of deep Bayesian nonparametric models and could have significant implications for various practical applications.
[362] Multidimensional Adaptive Coefficient for Inference Trajectory Optimization in Flow and Diffusion
Dohoon Lee, Jaehyun Park, Hyunwoo J. Kim, Kyogu Lee
Main category: cs.LG
TL;DR: The paper introduces MAC, a plug-in module for flow and diffusion models, enhancing them with multidimensional adaptability and simulation-based feedback.
Details
Motivation: Current flow and diffusion models lack dimensionality freedom and adaptability to inference trajectories, limiting their simulation capabilities.Method: Proposes MAC, which extends unidimensional coefficients to multidimensional ones and uses adversarial refinement for training.
Result: MAC improves generative quality efficiently across diverse frameworks and datasets.
Conclusion: MAC offers a new perspective on inference trajectory optimality, advocating for simulation-based optimization in future research.
Abstract: Flow and diffusion models have demonstrated strong performance and training stability across various tasks but lack two critical properties of simulation-based methods: freedom of dimensionality and adaptability to different inference trajectories. To address this limitation, we propose the Multidimensional Adaptive Coefficient (MAC), a plug-in module for flow and diffusion models that extends conventional unidimensional coefficients to multidimensional ones and enables inference trajectory-wise adaptation. MAC is trained via simulation-based feedback through adversarial refinement. Empirical results across diverse frameworks and datasets demonstrate that MAC enhances generative quality with high training efficiency. Consequently, our work offers a new perspective on inference trajectory optimality, encouraging future research to move beyond vector field design and to leverage training-efficient, simulation-based optimization.
[363] A DNN Biophysics Model with Topological and Electrostatic Features
Elyssa Sliheet, Md Abu Talha, Weihua Geng
Main category: cs.LG
TL;DR: A DNN-based biophysics model predicts protein properties using multi-scale topological and electrostatic features, showing optimal performance with combined features.
Details
Motivation: To develop a general tool for predicting biophysical properties of proteins by leveraging structural and force field data.Method: Uses ESPH for topological features and Cartesian treecode for electrostatic features, balancing resolution and computational cost.
Result: Tests on 4000+ protein structures confirm the model’s efficiency and fidelity, especially with combined features.
Conclusion: The model is promising for broad biomolecule property prediction using computational and experimental data.
Abstract: In this project, we provide a deep-learning neural network (DNN) based biophysics model to predict protein properties. The model uses multi-scale and uniform topological and electrostatic features generated with protein structural information and force field, which governs the molecular mechanics. The topological features are generated using the element-specified persistent homology (ESPH) while the electrostatic features are fast computed using a Cartesian treecode. These features are uniform in number for proteins of various sizes, so the broadly available protein structure database can be used in training the network. These features are also multi-scale, so the resolution and computational cost can be balanced by the users. The machine learning simulation on over 4000 protein structures shows the efficiency and fidelity of these features in representing the protein structure and force field for the prediction of their biophysical properties such as electrostatic solvation energy. Tests on topological or electrostatic features alone and the combination of both showed the optimal performance when both features are used. This model shows its potential as a general tool for assisting prediction of biophysical properties and functions of a broad range of biomolecules using data from both theoretical computing and experiments.
[364] Task Diversity Shortens the ICL Plateau
Jaeyeon Kim, Sehyun Kwon, Joo Young Choi, Jongho Park, Jaewoong Cho, Jason D. Lee, Ernest K. Ryu
Main category: cs.LG
TL;DR: Training on multiple diverse in-context learning (ICL) tasks simultaneously shortens loss plateaus, making learning easier, contrary to intuition.
Details
Motivation: To understand the capability of language models in ICL and the observed long loss plateaus followed by rapid learning.Method: Studied training models on multiple diverse ICL tasks simultaneously.
Result: Diverse ICL tasks shorten loss plateaus, making learning easier, suggesting easier optimization due to data diversity.
Conclusion: Large-scale language model success may stem from both data richness and the easier training induced by diverse natural language data.
Abstract: In-context learning (ICL) describes a language model’s ability to generate outputs based on a set of input demonstrations and a subsequent query. To understand this remarkable capability, researchers have studied simplified, stylized models. These studies have consistently observed long loss plateaus, during which models exhibit minimal improvement, followed by a sudden, rapid surge of learning. In this work, we reveal that training on multiple diverse ICL tasks simultaneously shortens the loss plateaus, making each task easier to learn. This finding is surprising as it contradicts the natural intuition that the combined complexity of multiple ICL tasks would lengthen the learning process, not shorten it. Our result suggests that the recent success in large-scale training of language models may be attributed not only to the richness of the data at scale but also to the easier optimization (training) induced by the diversity of natural language training data.
[365] Zero-Shot Generalization of Vision-Based RL Without Data Augmentation
Sumeet Batra, Gaurav S. Sukhatme
Main category: cs.LG
TL;DR: ALDA model combines latent disentanglement with associative memory for zero-shot generalization in RL, avoiding costly data augmentation.
Details
Motivation: Addressing the challenge of generalizing vision-based RL agents to novel environments without relying on large datasets or data augmentation.Method: Proposes ALDA, integrating latent disentanglement with associative memory in off-policy RL for zero-shot generalization.
Result: Achieves zero-shot generalization on difficult task variations without data augmentation.
Conclusion: Data augmentation is a weak form of disentanglement; ALDA offers a more efficient alternative for generalization.
Abstract: Generalizing vision-based reinforcement learning (RL) agents to novel environments remains a difficult and open challenge. Current trends are to collect large-scale datasets or use data augmentation techniques to prevent overfitting and improve downstream generalization. However, the computational and data collection costs increase exponentially with the number of task variations and can destabilize the already difficult task of training RL agents. In this work, we take inspiration from recent advances in computational neuroscience and propose a model, Associative Latent DisentAnglement (ALDA), that builds on standard off-policy RL towards zero-shot generalization. Specifically, we revisit the role of latent disentanglement in RL and show how combining it with a model of associative memory achieves zero-shot generalization on difficult task variations without relying on data augmentation. Finally, we formally show that data augmentation techniques are a form of weak disentanglement and discuss the implications of this insight.
[366] Multi-modal Policies with Physics-informed Representations in Complex Fluid Environments
Haodong Feng, Peiyan Hu, Yue Wang, Dixia Fan
Main category: cs.LG
TL;DR: The paper proposes a Physics-Informed Representation (PIR) algorithm to address sparse and inconsistent observations in fluid control tasks by integrating PDEs with neural networks.
Details
Motivation: Challenges in fluid control due to sparse or missing observations from sensor limitations and faults necessitate a robust method to unify multi-modal data.Method: PIR combines sparse observational data with PDE information to learn a unified representation of fluid systems, focusing on initial and boundary conditions.
Result: PIR outperforms baselines in consistency with ground truth features and enhances control tasks when combined with Reinforcement Learning.
Conclusion: PIR effectively leverages PDEs and sparse data for accurate fluid system representation and control, demonstrating practical success in complex environments.
Abstract: Control in fluid environments is an important research area with numerous applications across various domains, including underwater robotics, aerospace engineering, and biomedical systems. However, in practice, control methods often face challenges due to sparse or missing observations, stemming from sensor limitations and faults. These issues result in observations that are not only sparse but also inconsistent in their number and modalities (e.g., velocity and pressure sensors). In this work, we propose a Physics-Informed Representation (PIR) algorithm for multi-modal control policies that leverages the sparse and random observations in complex fluid environments. PIR integrates sparse observational data with the Partial Differential Equation (PDE) information to distill a unified representation of fluid systems. The main idea is that PDE solutions are determined by three elements: the equation, initial conditions, and boundary conditions. Given the equation, we only need to learn the representation of the initial and boundary conditions, which define a trajectory of a specific fluid system. Specifically, PIR leverages a PDE loss to fit the neural network and a data loss, calculated on observations with random quantities and multi-modalities, to propagate the information of the initial and boundary conditions into the representations. The representations are the learnable parameters or the output of the encoder. In experiments, PIR shows superior consistency with ground-truth features compared with baselines, even when there are missing modalities. Furthermore, PIR combined with Reinforcement Learning has been successfully applied in control tasks where the robot leverages the state learned by PIR to pass through a complex vortex street faster and more accurately, from a random starting location to a random target.
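The two-term objective at the heart of PIR, a PDE residual on collocation points plus a data loss on whatever sparse observations happen to be available, can be sketched with a stand-in 1-D advection equation; the network, equation, and sensor layout below are all toy assumptions.

```python
import torch
import torch.nn as nn

net = nn.Sequential(nn.Linear(2, 64), nn.Tanh(),
                    nn.Linear(64, 64), nn.Tanh(), nn.Linear(64, 1))

def pde_residual(xt, c=1.0):
    """Residual of u_t + c * u_x = 0 for inputs (x, t), via autograd."""
    xt = xt.requires_grad_(True)
    u = net(xt)
    du = torch.autograd.grad(u.sum(), xt, create_graph=True)[0]
    u_x, u_t = du[:, 0:1], du[:, 1:2]
    return u_t + c * u_x

obs_xt = torch.rand(8, 2)                         # sparse sensor locations
obs_u = torch.sin(obs_xt[:, :1] - obs_xt[:, 1:])  # toy sensor readings
colloc = torch.rand(256, 2)                       # collocation points

opt = torch.optim.Adam(net.parameters(), lr=1e-3)
for _ in range(100):
    opt.zero_grad()
    loss = (pde_residual(colloc) ** 2).mean() \
         + ((net(obs_xt) - obs_u) ** 2).mean()    # PDE loss + data loss
    loss.backward()
    opt.step()
print(float(loss))
```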
[367] Understanding Aggregations of Proper Learners in Multiclass Classification
Julian Asilis, Mikael Møller Høgsgaard, Grigoris Velegkas
Main category: cs.LG
TL;DR: The paper explores whether simple aggregations of proper learners can overcome the properness barrier in multiclass classification, showing positive results for finite Graph dimension classes but limitations for broader cases.
Details
Motivation: To investigate if aggregations of proper learners can address the properness barrier in multiclass classification, inspired by successes in binary classification.Method: Generalizing optimal binary learners to multiclass settings and analyzing their sample complexity, complemented by lower bounds for certain classes.
Result: For finite Graph dimension classes, optimal learners achieve improved sample complexity, but broader classes cannot be learned by finite aggregations of proper learners.
Conclusion: Simple aggregations of proper learners can overcome the properness barrier for finite Graph dimension classes but fail for more general cases.
Abstract: Multiclass learnability is known to exhibit a properness barrier: there are learnable classes which cannot be learned by any proper learner. Binary classification faces no such barrier for learnability, but a similar one for optimal learning, which can in general only be achieved by improper learners. Fortunately, recent advances in binary classification have demonstrated that this requirement can be satisfied using aggregations of proper learners, some of which are strikingly simple. This raises a natural question: to what extent can simple aggregations of proper learners overcome the properness barrier in multiclass classification? We give a positive answer to this question for classes which have finite Graph dimension, $d_G$. Namely, we demonstrate that the optimal binary learners of Hanneke, Larsen, and Aden-Ali et al. (appropriately generalized to the multiclass setting) achieve sample complexity $O\left(\frac{d_G + \ln(1 / \delta)}{\epsilon}\right)$. This forms a strict improvement upon the sample complexity of ERM. We complement this with a lower bound demonstrating that for certain classes of Graph dimension $d_G$, majorities of ERM learners require $\Omega \left( \frac{d_G + \ln(1 / \delta)}{\epsilon}\right)$ samples. Furthermore, we show that a single ERM requires $\Omega \left(\frac{d_G \ln(1 / \epsilon) + \ln(1 / \delta)}{\epsilon}\right)$ samples on such classes, exceeding the lower bound of Daniely et al. (2015) by a factor of $\ln(1 / \epsilon)$. For multiclass learning in full generality – i.e., for classes of finite DS dimension but possibly infinite Graph dimension – we give a strong refutation to these learning strategies, by exhibiting a learnable class which cannot be learned to constant error by any aggregation of a finite number of proper learners.
[368] Decoding-based Regression
Xingyou Song, Dara Bahri
Main category: cs.LG
TL;DR: Language models can perform numeric regression via decoded strings, matching standard regression heads in performance while offering flexibility for tasks like density estimation.
Details
Motivation: To explore the theoretical grounds and utility of causal sequence decoding models for numeric regression, given their training for next-token prediction.Method: Investigate decoder-based heads as numeric regression tools, comparing them to standard pointwise heads on regression tasks.
Result: Decoder-based heads perform as well as standard regression heads and can capture smooth numeric distributions, such as in density estimation.
Conclusion: Decoder-based models are effective for numeric regression, offering flexibility and performance comparable to traditional methods.
Abstract: Language models have recently been shown capable of performing regression wherein numeric predictions are represented as decoded strings. In this work, we provide theoretical grounds for this capability and furthermore investigate the utility of causal sequence decoding models as numeric regression heads given any feature representation. We find that, despite being trained in the usual way - for next-token prediction via cross-entropy loss - decoder-based heads are as performant as standard pointwise heads when benchmarked over standard regression tasks, while being flexible enough to capture smooth numeric distributions, such as in the task of density estimation.
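The string-decoding view of regression is easy to make concrete: encode a target in [0, 1) as a fixed-length digit-token sequence that a decoder head learns to emit, and invert the encoding at prediction time. The tokenization below is one simple choice among many, not the paper's exact scheme.

```python
import torch

DIGITS = "0123456789"

def encode_number(y: float, ndigits: int = 4) -> torch.Tensor:
    """Value in [0, 1) -> fixed-length digit-token sequence, the
    target a decoder head is trained on via cross-entropy."""
    digits = f"{y:.{ndigits}f}"[2:]            # 0.3172 -> "3172"
    return torch.tensor([DIGITS.index(ch) for ch in digits])

def decode_tokens(tokens) -> float:
    """Invert the encoding: digit tokens -> float prediction."""
    return float("0." + "".join(DIGITS[int(t)] for t in tokens))

toks = encode_number(0.3172)
print(toks.tolist(), decode_tokens(toks))      # [3, 1, 7, 2] 0.3172
# Sampling several token sequences from a trained decoder yields an
# empirical predictive distribution, which is what enables the
# density-estimation use mentioned above.
```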
[369] Chemist-aligned retrosynthesis by ensembling diverse inductive bias models
Krzysztof Maziarz, Guoqing Liu, Hubert Misztela, Austin Tripp, Junren Li, Aleksei Kornev, Piotr Gaiński, Holger Hoefling, Mike Fortunato, Rishi Gupta, Marwin Segler
Main category: cs.LG
TL;DR: RetroChimera, a new retrosynthesis model, outperforms existing models by combining two novel components with a learning-based ensembling strategy, showing robustness and alignment with chemists’ preferences.
Details
Motivation: AI-based synthesis planning struggles with rare reactions and incorrect predictions, limiting multi-step search algorithms and misaligning with chemists' expectations.Method: RetroChimera integrates two new components with complementary inductive biases using a learning-based ensembling framework.
Result: RetroChimera outperforms major models, shows robustness, learns from few examples, and aligns with chemists’ preferences. It also generalizes well in zero-shot transfer.
Conclusion: RetroChimera’s ensembling framework advances retrosynthesis modeling, promising further improvements in accuracy and utility.
Abstract: Chemical synthesis remains a critical bottleneck in the discovery and manufacture of functional small molecules. AI-based synthesis planning models could be a potential remedy to find effective syntheses, and have made progress in recent years. However, they still struggle with less frequent, yet critical reactions for synthetic strategy, as well as hallucinated, incorrect predictions. This hampers multi-step search algorithms that rely on models, and leads to misalignment with chemists’ expectations. Here we propose RetroChimera: a frontier retrosynthesis model, built upon two newly developed components with complementary inductive biases, which we fuse together using a new framework for integrating predictions from multiple sources via a learning-based ensembling strategy. Through experiments across several orders of magnitude in data scale and splitting strategy, we show RetroChimera outperforms all major models by a large margin, demonstrating robustness outside the training data, as well as for the first time the ability to learn from even a very small number of examples per reaction class. Moreover, industrial organic chemists prefer predictions from RetroChimera over the reactions it was trained on in terms of quality, revealing high levels of alignment. Finally, we demonstrate zero-shot transfer to an internal dataset from a major pharmaceutical company, showing robust generalization under distribution shift. With the new dimension that our ensembling framework unlocks, we anticipate further acceleration in the development of even more accurate models.
[370] Sleepless Nights, Sugary Days: Creating Synthetic Users with Health Conditions for Realistic Coaching Agent Interactions
Taedong Yun, Eric Yang, Mustafa Safdari, Jong Ha Lee, Vaishnavi Vinod Kumar, S. Sara Mahdavi, Jonathan Amar, Derek Peyton, Reut Aharony, Andreas Michaelides, Logan Schneider, Isaac Galatzer-Levy, Yugang Jia, John Canny, Arthur Gretton, Maja Matarić
Main category: cs.LG
TL;DR: An end-to-end framework generates synthetic users for evaluating health coaching agents, grounded in real-world health conditions like sleep and diabetes, ensuring realistic interactions.
Details
Motivation: To create realistic synthetic users for evaluating interactive agents that encourage positive behavior changes in health and lifestyle coaching.Method: Synthetic users are generated in two stages: structured data creation based on health factors, followed by full profile development. Interactions are simulated using generative models or language models.
Result: The framework’s validity is shown through agent understanding of user needs and expert evaluations, proving synthetic users with health attributes are more realistic than generic ones.
Conclusion: The framework enables efficient development of conversational agents through grounded, realistic simulated interactions.
Abstract: We present an end-to-end framework for generating synthetic users for evaluating interactive agents designed to encourage positive behavior changes, such as in health and lifestyle coaching. The synthetic users are grounded in health and lifestyle conditions, specifically sleep and diabetes management in this study, to ensure realistic interactions with the health coaching agent. Synthetic users are created in two stages: first, structured data are generated grounded in real-world health and lifestyle factors in addition to basic demographics and behavioral attributes; second, full profiles of the synthetic users are developed conditioned on the structured data. Interactions between synthetic users and the coaching agent are simulated using generative agent-based models such as Concordia, or directly by prompting a language model. Using two independently-developed agents for sleep and diabetes coaching as case studies, the validity of this framework is demonstrated by analyzing the coaching agent’s understanding of the synthetic users’ needs and challenges. Finally, through multiple blinded evaluations of user-coach interactions by human experts, we demonstrate that our synthetic users with health and behavioral attributes more accurately portray real human users with the same attributes, compared to generic synthetic users not grounded in such attributes. The proposed framework lays the foundation for efficient development of conversational agents through extensive, realistic, and grounded simulated interactions.
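As a rough illustration of the two-stage pipeline, the sketch below samples structured attributes and then builds a profile prompt; the attribute schema, prompt text, and `llm.generate` client are illustrative stand-ins, not the paper's:

```python
import random

# Stage one: sample structured attributes grounded in a health condition.
def sample_structured_user() -> dict:
    return {
        "age": random.randint(25, 70),
        "condition": random.choice(["insomnia", "type-2 diabetes"]),
        "avg_sleep_hours": round(random.uniform(4.0, 8.5), 1),
    }

# Stage two: condition a full profile on the structured data via an LLM.
def profile_prompt(user: dict) -> str:
    return (
        "Write a first-person profile of a coaching-app user with these "
        f"attributes: {user}. Include daily habits and one key challenge."
    )

user = sample_structured_user()
prompt = profile_prompt(user)
# profile = llm.generate(prompt)  # any LLM client can complete this step
```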
[371] FBFL: A Field-Based Coordination Approach for Data Heterogeneity in Federated Learning
Davide Domini, Gianluca Aguzzi, Lukas Esterle, Mirko Viroli
Main category: cs.LG
TL;DR: FBFL is a novel federated learning approach addressing non-IID data challenges and centralized bottlenecks through spatial-based leader election and self-organizing hierarchical architecture, outperforming FedAvg and other methods in non-IID scenarios.
Details
Motivation: Federated learning (FL) faces scalability and performance issues in real-world deployments with non-IID data and centralized architectures.Method: FBFL uses macroprogramming and field coordination for distributed spatial-based leader election and self-organizing hierarchical architecture.
Result: FBFL matches FedAvg in IID conditions and outperforms FedAvg, FedProx, and Scaffold in non-IID scenarios, while showing resilience to server failures.
Conclusion: FBFL effectively addresses FL limitations, offering improved performance and resilience in non-IID and dynamic environments.
Abstract: In recent years, federated learning (FL) has become a popular solution for training machine learning models in domains with high privacy concerns. However, FL scalability and performance face significant challenges in real-world deployments where data across devices are non-independently and identically distributed (non-IID). This heterogeneity in data distribution frequently arises from the spatial distribution of devices, leading to degraded model performance in the absence of proper handling. Additionally, FL’s typical reliance on centralized architectures introduces bottlenecks and single-point-of-failure risks, which are particularly problematic at scale or in dynamic environments. To close this gap, we propose Field-Based Federated Learning (FBFL), a novel approach leveraging macroprogramming and field coordination to address these limitations through: (i) distributed spatial-based leader election for personalization to mitigate non-IID data challenges; and (ii) construction of a self-organizing, hierarchical architecture using advanced macroprogramming patterns. Moreover, FBFL not only overcomes the aforementioned limitations but also enables the development of more specialized models tailored to the specific data distribution in each subregion. This paper formalizes FBFL and evaluates it extensively using the MNIST, FashionMNIST, and Extended MNIST datasets. We demonstrate that, when operating under IID data conditions, FBFL performs comparably to the widely used FedAvg algorithm. Furthermore, in challenging non-IID scenarios, FBFL not only outperforms FedAvg but also surpasses other state-of-the-art methods, namely FedProx and Scaffold, which were specifically designed to address non-IID data distributions. Additionally, we showcase the resilience of FBFL’s self-organizing hierarchical architecture against server failures.
[372] Forget the Data and Fine-Tuning! Just Fold the Network to Compress
Dong Wang, Haris Šikić, Lothar Thiele, Olga Saukh
Main category: cs.LG
TL;DR: Model folding is a data-free compression technique that merges similar neurons, reducing model size without fine-tuning or training data, while maintaining performance.
Details
Motivation: To compress large models efficiently without needing data or fine-tuning, addressing limitations of existing methods.Method: Uses k-means clustering to merge structurally similar neurons, preventing variance issues with novel data-free techniques.
Result: Achieves performance comparable to data-driven methods and outperforms other data-free techniques, especially at high sparsity.
Conclusion: Model folding is effective for compressing large models, ideal for resource-constrained environments.
Abstract: We introduce model folding, a novel data-free model compression technique that merges structurally similar neurons across layers, significantly reducing the model size without the need for fine-tuning or access to training data. Unlike existing methods, model folding preserves data statistics during compression by leveraging k-means clustering, and using novel data-free techniques to prevent variance collapse or explosion. Our theoretical framework and experiments across standard benchmarks, including ResNet18 and LLaMA-7B, demonstrate that model folding achieves comparable performance to data-driven compression techniques and outperforms recently proposed data-free methods, especially at high sparsity levels. This approach is particularly effective for compressing large-scale models, making it suitable for deployment in resource-constrained environments.
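A minimal sketch of the folding idea, assuming a plain pair of linear layers; the paper's variance-correction techniques are omitted, and `fold_linear_pair` is our own naming:

```python
import torch
from sklearn.cluster import KMeans

def fold_linear_pair(W1, b1, W2, k):
    """Merge similar neurons in the first layer by k-means on their weight
    rows, then sum the matching input columns of the next layer. This is a
    data-free sketch; the paper's safeguards against variance collapse or
    explosion are not shown."""
    km = KMeans(n_clusters=k, n_init=10).fit(W1.detach().cpu().numpy())
    labels = torch.as_tensor(km.labels_, dtype=torch.long)
    W1_new = torch.as_tensor(km.cluster_centers_, dtype=W1.dtype)
    b1_new = torch.zeros(k, dtype=b1.dtype)
    W2_new = torch.zeros(W2.shape[0], k, dtype=W2.dtype)
    for c in range(k):
        mask = labels == c
        b1_new[c] = b1[mask].mean()
        # Summing columns preserves layer 2's pre-activation when the
        # merged neurons fire (near-)identically.
        W2_new[:, c] = W2[:, mask].sum(dim=1)
    return W1_new, b1_new, W2_new

W1, b1, W2 = torch.randn(64, 32), torch.randn(64), torch.randn(10, 64)
W1f, b1f, W2f = fold_linear_pair(W1, b1, W2, k=16)  # 64 neurons folded to 16
```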
[373] Adaptive Computation Pruning for the Forgetting Transformer
Zhixuan Lin, Johan Obando-Ceron, Xu Owen He, Aaron Courville
Main category: cs.LG
TL;DR: The paper introduces Adaptive Computation Pruning (ACP) for the Forgetting Transformer (FoX), dynamically pruning negligible computations to improve efficiency without performance loss.
Details
Motivation: To reduce computational costs in FoX by leveraging its tendency for quick forgetting in attention heads.Method: ACP dynamically prunes decayed input-output dependencies using a safe threshold, applied to FoX during pretraining.
Result: ACP reduces FLOPs and memory accesses by ~70%, speeds up attention runtime by 50-70%, and increases training throughput by 10-40%.
Conclusion: ACP significantly boosts efficiency in FoX without degrading performance, with greater savings for longer contexts.
Abstract: The recently proposed Forgetting Transformer (FoX) incorporates a forget gate into softmax attention and has shown consistently better or on-par performance compared to the standard RoPE-based Transformer. Notably, many attention heads in FoX tend to forget quickly, causing their output at each timestep to rely primarily on local context. Based on this observation, we propose Adaptive Computation Pruning (ACP) for FoX, a method that dynamically prunes computations involving input-output dependencies that are strongly decayed by the forget gate. In particular, our method performs provably safe pruning via a dynamically set pruning threshold that guarantees the pruned attention weights are negligible. We apply ACP to language model pretraining with FoX and show it consistently reduces the number of FLOPs and memory accesses in softmax attention by around 70% across different model sizes and context lengths, resulting in a roughly 50% to 70% reduction in attention runtime (or a 2-3$\times$ speedup) and a roughly 10% to 40% increase in end-to-end training throughput. Furthermore, longer context lengths yield greater computational savings. All these speed improvements are achieved without any performance degradation. Our code is available at https://github.com/zhixuan-lin/forgetting-transformer.
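The pruning criterion can be sketched as follows, assuming per-timestep scalar forget gates and a threshold of our choosing; the actual method prunes whole blocks inside the attention kernel rather than materializing a dense mask:

```python
import torch

def acp_keep_mask(log_forget: torch.Tensor, log_eps: float = -13.8) -> torch.Tensor:
    """Keep attention entry (i, j) only if the forget-gate decay accumulated
    between j and i does not already guarantee a negligible weight.
    log_eps ~ log(1e-6) is an illustrative threshold, not the paper's
    dynamically set, provably safe one."""
    c = torch.cumsum(log_forget, dim=-1)             # prefix sums of log forget gates
    decay = c.unsqueeze(-1) - c.unsqueeze(-2)        # decay[i, j] = sum_{t=j+1..i} log f_t
    idx = torch.arange(log_forget.shape[-1])
    causal = idx.unsqueeze(-1) >= idx.unsqueeze(-2)  # j <= i
    return causal & (decay >= log_eps)               # keep only non-negligible pairs

mask = acp_keep_mask(torch.log(torch.rand(128)))     # gates sampled in (0, 1)
```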
[374] LLM-Lasso: A Robust Framework for Domain-Informed Feature Selection and Regularization
Erica Zhang, Ryunosuke Goto, Naomi Sagan, Jurik Mutter, Nick Phillips, Ash Alizadeh, Kangwook Lee, Jose Blanchet, Mert Pilanci, Robert Tibshirani
Main category: cs.LG
TL;DR: LLM-Lasso integrates LLMs with Lasso regression for feature selection, using domain-specific knowledge from natural language to guide penalties, outperforming traditional methods.
Details
Motivation: Traditional feature selection lacks contextual insights from domain knowledge. LLM-Lasso aims to combine data-driven modeling with LLM-extracted knowledge for better performance.Method: LLM-Lasso uses a RAG pipeline to extract domain knowledge, generates feature penalties via LLM, and adjusts Lasso weights accordingly. It includes validation to ensure robustness.
Result: Outperforms standard Lasso and baselines in biomedical case studies, without prior dataset access.
Conclusion: LLM-Lasso effectively merges LLM reasoning with feature selection, addressing robustness and outperforming traditional methods.
Abstract: We introduce LLM-Lasso, a novel framework that leverages large language models (LLMs) to guide feature selection in Lasso $\ell_1$ regression. Unlike traditional methods that rely solely on numerical data, LLM-Lasso incorporates domain-specific knowledge extracted from natural language, enhanced through a retrieval-augmented generation (RAG) pipeline, to seamlessly integrate data-driven modeling with contextual insights. Specifically, the LLM generates penalty factors for each feature, which are converted into weights for the Lasso penalty using a simple, tunable model. Features identified as more relevant by the LLM receive lower penalties, increasing their likelihood of being retained in the final model, while less relevant features are assigned higher penalties, reducing their influence. Importantly, LLM-Lasso has an internal validation step that determines how much to trust the contextual knowledge in our prediction pipeline. Hence it addresses key challenges in robustness, making it suitable for mitigating potential inaccuracies or hallucinations from the LLM. In various biomedical case studies, LLM-Lasso outperforms standard Lasso and existing feature selection baselines, all while ensuring the LLM operates without prior access to the datasets. To our knowledge, this is the first approach to effectively integrate conventional feature selection techniques directly with LLM-based domain-specific reasoning.
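The weighting mechanism admits a compact sketch via the standard column-rescaling equivalence; how the LLM's relevance judgments become the penalty factors (via RAG and the tunable mapping) is the part omitted here:

```python
import numpy as np
from sklearn.linear_model import Lasso

def weighted_lasso(X, y, penalty_factors, alpha=0.1):
    """Weighted Lasso via column rescaling: penalizing coefficient j by
    alpha * w_j is equivalent to scaling column j by 1 / w_j under a
    uniform penalty. Lower w_j (more relevant per the LLM) means the
    feature is more likely to be retained."""
    w = np.asarray(penalty_factors, dtype=float)
    model = Lasso(alpha=alpha).fit(X / w, y)
    return model.coef_ / w, model.intercept_   # coefficients on the original scale

X, y = np.random.randn(100, 5), np.random.randn(100)
beta, b0 = weighted_lasso(X, y, penalty_factors=[0.5, 1.0, 1.0, 2.0, 2.0])
```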
[375] PAR-AdvGAN: Improving Adversarial Attack Capability with Progressive Auto-Regression AdvGAN
Jiayu Zhang, Zhiyu Zhu, Xinyi Wang, Silin Liao, Zhibo Jin, Flora D. Salim, Huaming Chen
Main category: cs.LG
TL;DR: PAR-AdvGAN improves adversarial example generation using auto-regressive iteration and progressive networks, outperforming AdvGAN and other methods in speed and effectiveness.
Details
Motivation: Deep neural networks are vulnerable to adversarial examples, and existing GAN-based methods like AdvGAN lack iterative refinement for better attack capability.Method: PAR-AdvGAN integrates auto-regressive iteration and progressive generation networks to enhance adversarial example quality and speed.
Result: PAR-AdvGAN outperforms state-of-the-art methods, achieving 335.5 frames per second on Inception-v3, with superior attack capability.
Conclusion: PAR-AdvGAN is a fast and effective method for generating adversarial examples, advancing the field of adversarial attacks.
Abstract: Deep neural networks have demonstrated remarkable performance across various domains. However, they are vulnerable to adversarial examples, which can lead to erroneous predictions. Generative Adversarial Networks (GANs) can leverage their generator and discriminator models to quickly produce high-quality adversarial examples. Since both modules train in a competitive and simultaneous manner, GAN-based algorithms like AdvGAN can generate adversarial examples with better transferability compared to traditional methods. However, the generation of perturbations is usually limited to a single iteration, preventing these examples from fully exploiting the method’s potential. To tackle this issue, we introduce a novel approach named Progressive Auto-Regression AdvGAN (PAR-AdvGAN). It incorporates an auto-regressive iteration mechanism within a progressive generation network to craft adversarial examples with enhanced attack capability. We thoroughly evaluate our PAR-AdvGAN method with a large-scale experiment, demonstrating its superior performance over various state-of-the-art black-box adversarial attacks, as well as the original AdvGAN. Moreover, PAR-AdvGAN significantly accelerates adversarial example generation, achieving speeds of up to 335.5 frames per second on the Inception-v3 model and outperforming gradient-based transferable attack algorithms. Our code is available at: https://github.com/LMBTough/PAR
[376] ProtoECGNet: Case-Based Interpretable Deep Learning for Multi-Label ECG Classification with Contrastive Learning
Sahil Sethi, David Chen, Thomas Statchen, Michael C. Burkhart, Nipun Bhandari, Bashar Ramadan, Brett Beaulieu-Jones
Main category: cs.LG
TL;DR: ProtoECGNet is a prototype-based deep learning model for interpretable ECG classification, combining 1D and 2D CNNs with prototype loss for multi-label learning. It matches black-box model performance while providing transparent, case-based explanations.
Details
Motivation: Clinical adoption of deep learning ECG models is hindered by lack of transparent explanations. Post hoc methods like saliency maps may not reflect true decision processes.Method: ProtoECGNet uses a multi-branch architecture: 1D CNN for rhythm classification, 2D CNN for morphology, and another 2D CNN for diffuse abnormalities. It employs a prototype loss for multi-label learning.
Result: Competitive performance on 71 PTB-XL diagnostic labels, with prototypes rated as representative and clear by clinicians.
Conclusion: ProtoECGNet demonstrates scalable prototype learning for complex, multi-label ECG classification, advancing transparent and trustworthy clinical AI.
Abstract: Deep learning-based electrocardiogram (ECG) classification has shown impressive performance but clinical adoption has been slowed by the lack of transparent and faithful explanations. Post hoc methods such as saliency maps may fail to reflect a model’s true decision process. Prototype-based reasoning offers a more transparent alternative by grounding decisions in similarity to learned representations of real ECG segments, enabling faithful, case-based explanations. We introduce ProtoECGNet, a prototype-based deep learning model for interpretable, multi-label ECG classification. ProtoECGNet employs a structured, multi-branch architecture that reflects clinical interpretation workflows: it integrates a 1D CNN with global prototypes for rhythm classification, a 2D CNN with time-localized prototypes for morphology-based reasoning, and a 2D CNN with global prototypes for diffuse abnormalities. Each branch is trained with a prototype loss designed for multi-label learning, combining clustering, separation, diversity, and a novel contrastive loss that encourages appropriate separation between prototypes of unrelated classes while allowing clustering for frequently co-occurring diagnoses. We evaluate ProtoECGNet on all 71 diagnostic labels from the PTB-XL dataset, demonstrating competitive performance relative to state-of-the-art black-box models while providing structured, case-based explanations. To assess prototype quality, we conduct a structured clinician review of the final model’s projected prototypes, finding that they are rated as representative and clear. ProtoECGNet shows that prototype learning can be effectively scaled to complex, multi-label time-series classification, offering a practical path toward transparent and trustworthy deep learning models for clinical decision support.
[377] Cross-Modal Temporal Fusion for Financial Market Forecasting
Yunhua Pei, John Cartlidge, Anandadeep Mandal, Daniel Gold, Enrique Marcilio, Riccardo Mazzon
Main category: cs.LG
TL;DR: A transformer-based framework (CMTF) integrates structured and unstructured financial data for improved market forecasting, outperforming existing models.
Details
Motivation: Existing models struggle to effectively align diverse financial data sources, limiting their practical utility.Method: Introduces CMTF, a transformer-based model with a tensor interpretation module and auto-training pipeline for feature selection and hyperparameter tuning.
Result: CMTF outperforms classical and deep learning baselines in price direction classification on FTSE 100 stock data.
Conclusion: The framework is effective and scalable for real-world cross-modal financial forecasting.
Abstract: Accurate forecasting in financial markets requires integrating diverse data sources, from historical prices to macroeconomic indicators and financial news. However, existing models often fail to align these modalities effectively, limiting their practical use. In this paper, we introduce a transformer-based deep learning framework, Cross-Modal Temporal Fusion (CMTF), that fuses structured and unstructured financial data for improved market prediction. The model incorporates a tensor interpretation module for feature selection and an auto-training pipeline for efficient hyperparameter tuning. Experimental results using FTSE 100 stock data demonstrate that CMTF achieves superior performance in price direction classification compared to classical and deep learning baselines. These findings suggest that our framework is an effective and scalable solution for real-world cross-modal financial forecasting tasks.
[378] Democracy of AI Numerical Weather Models: An Example of Global Forecasting with FourCastNetv2 Made by a University Research Lab Using GPU
Iman Khadir, Shane Stevenson, Henry Li, Kyle Krick, Abram Burrows, David Hall, Stan Posey, Samuel S. P. Shen
Main category: cs.LG
TL;DR: The paper explores democratizing AI-driven weather forecasting for university research groups using GPUs and freely available AI models like FourCastNetv2, addressing challenges and providing guidance.
Details
Motivation: To make AI-driven weather forecasting accessible to resource-constrained university research groups by leveraging existing tools and hardware.Method: Utilizes FourCastNetv2 for predictions via API and trains the original FourCastNet model using NVIDIA hardware, focusing on data management, training efficiency, and validation.
Result: Demonstrates the capabilities and limitations of NVIDIA A100 GPUs for resource-limited groups, offering practical insights for AI weather forecasting.
Conclusion: The paper serves as a guide for universities to develop AI weather forecasting programs, aiding democratization in the digital economy.
Abstract: This paper demonstrates the feasibility of democratizing AI-driven global weather forecasting models among university research groups by leveraging Graphics Processing Units (GPUs) and freely available AI models, such as NVIDIA’s FourCastNetv2. FourCastNetv2 is NVIDIA’s advanced neural network for weather prediction, trained on a 73-channel subset of the European Centre for Medium-Range Weather Forecasts (ECMWF) Reanalysis v5 (ERA5) dataset at single levels and different pressure levels. Although the training specifications for FourCastNetv2 have not been released to the public, the training documentation of the model’s first generation, FourCastNet, is available to all users. That training used 64 A100 GPUs and took 16 hours to complete. Although NVIDIA’s models offer significant reductions in both time and cost compared to traditional Numerical Weather Prediction (NWP), reproducing published forecasting results presents ongoing challenges for resource-constrained university research groups with limited GPU availability. We demonstrate both (i) leveraging FourCastNetv2 to create predictions through the designated application programming interface (API) and (ii) utilizing NVIDIA hardware to train the original FourCastNet model. Further, this paper demonstrates the capabilities and limitations of NVIDIA A100 GPUs for resource-limited research groups in universities. We also explore data management, training efficiency, and model validation, highlighting the advantages and challenges of using limited high-performance computing resources. Consequently, this paper and its corresponding GitHub materials may serve as an initial guide for other university research groups and courses related to machine learning, climate science, and data science to develop research and education programs on AI weather forecasting, and hence help democratize AI NWP in the digital economy.
[379] AMFT: Aligning LLM Reasoners by Meta-Learning the Optimal Imitation-Exploration Balance
Lixuan He, Jie Feng, Yong Li
Main category: cs.LG
TL;DR: AMFT introduces a single-stage algorithm to balance SFT and RL using implicit rewards, achieving state-of-the-art performance and generalization.
Details
Motivation: Addressing the suboptimal trade-offs and catastrophic forgetting in the traditional two-stage SFT-RL pipeline for LLMs.Method: Proposes Adaptive Meta Fine-Tuning (AMFT), which dynamically balances SFT and RL via a meta-gradient adaptive weight controller.
Result: AMFT achieves superior performance on benchmarks and better generalization on OOD tasks.
Conclusion: AMFT offers a principled and effective paradigm for LLM alignment, with open-sourced implementation.
Abstract: Large Language Models (LLMs) are typically fine-tuned for reasoning tasks through a two-stage pipeline of Supervised Fine-Tuning (SFT) followed by Reinforcement Learning (RL), a process fraught with catastrophic forgetting and suboptimal trade-offs between imitation and exploration. Recent single-stage methods attempt to unify SFT and RL using heuristics, but lack a principled mechanism for dynamically balancing the two paradigms. In this paper, we reframe this challenge through the theoretical lens of \textbf{implicit rewards}, viewing SFT and RL not as distinct methods but as complementary reward signals. We introduce \textbf{Adaptive Meta Fine-Tuning (AMFT)}, a novel single-stage algorithm that learns the optimal balance between SFT’s implicit, path-level reward and RL’s explicit, outcome-based reward. The core of AMFT is a \textbf{meta-gradient adaptive weight controller} that treats the SFT-RL balance as a learnable parameter, dynamically optimizing it to maximize long-term task performance. This forward-looking approach, regularized by policy entropy for stability, autonomously discovers an effective training curriculum. We conduct a comprehensive evaluation on challenging benchmarks spanning mathematical reasoning, abstract visual reasoning (General Points), and vision-language navigation (V-IRL). AMFT consistently establishes a new state-of-the-art and demonstrates superior generalization on out-of-distribution (OOD) tasks. Ablation studies and training dynamics analysis confirm that the meta-learning controller is crucial for AMFT’s stability, sample efficiency, and performance, offering a more principled and effective paradigm for LLM alignment. Our codes are open-sourced via https://github.com/hlxtsyj/AMFT.
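As a toy sketch of the idea of a learnable balance (not AMFT's actual meta-gradient controller, which optimizes long-term task performance with entropy regularization), a single scalar can be trained against the mixed objective:

```python
import torch

# Toy learnable imitation/exploration balance; losses are random stand-ins.
log_lam = torch.zeros(1, requires_grad=True)
controller = torch.optim.SGD([log_lam], lr=0.05)

def mixed_loss(sft_loss, rl_loss):
    lam = torch.sigmoid(log_lam)           # balance constrained to (0, 1)
    return lam * sft_loss + (1.0 - lam) * rl_loss

for step in range(100):
    sft_loss, rl_loss = torch.rand(1), torch.rand(1)  # stand-in reward signals
    loss = mixed_loss(sft_loss, rl_loss)
    controller.zero_grad()
    loss.backward()
    controller.step()                      # the balance itself is optimized
```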
[380] Federated Learning: A Survey on Privacy-Preserving Collaborative Intelligence
Ratun Rahman
Main category: cs.LG
TL;DR: A survey on Federated Learning (FL) covering its architecture, challenges, trends, applications, and future directions.
Details
Motivation: To address privacy and regulatory concerns by enabling decentralized collaborative model training without centralizing sensitive data.Method: Discusses FL’s lifecycle, technical challenges (e.g., non-IID data, heterogeneity), and privacy mechanisms like differential privacy.
Result: Highlights FL’s applications, benchmark datasets, and evaluation metrics, along with emerging trends.
Conclusion: Identifies open research problems and future directions for scalable, efficient, and trustworthy FL systems.
Abstract: Federated Learning (FL) has emerged as a transformative paradigm in the field of distributed machine learning, enabling multiple clients such as mobile devices, edge nodes, or organizations to collaboratively train a shared global model without the need to centralize sensitive data. This decentralized approach addresses growing concerns around data privacy, security, and regulatory compliance, making it particularly attractive in domains such as healthcare, finance, and smart IoT systems. This survey provides a concise yet comprehensive overview of Federated Learning, beginning with its core architecture and communication protocol. We discuss the standard FL lifecycle, including local training, model aggregation, and global updates. A particular emphasis is placed on key technical challenges such as handling non-IID (non-independent and identically distributed) data, mitigating system and hardware heterogeneity, reducing communication overhead, and ensuring privacy through mechanisms like differential privacy and secure aggregation. Furthermore, we examine emerging trends in FL research, including personalized FL, cross-device versus cross-silo settings, and integration with other paradigms such as reinforcement learning and quantum computing. We also highlight real-world applications and summarize benchmark datasets and evaluation metrics commonly used in FL research. Finally, we outline open research problems and future directions to guide the development of scalable, efficient, and trustworthy FL systems.
[381] Klear-Reasoner: Advancing Reasoning Capability via Gradient-Preserving Clipping Policy Optimization
Zhenpeng Su, Leiyu Pan, Xue Bai, Dening Liu, Guanting Dong, Jiaming Huang, Wenping Hu, Fuzheng Zhang, Kun Gai, Guorui Zhou
Main category: cs.LG
TL;DR: Klear-Reasoner is a high-performance reasoning model with detailed training workflow insights, outperforming benchmarks in math and programming.
Details
Motivation: Addressing incomplete disclosure of training details in inference models and improving reasoning capabilities.Method: Uses long Chain-of-Thought supervised fine-tuning (long CoT SFT) and reinforcement learning (RL) with proposed Gradient-Preserving clipping Policy Optimization (GPPO).
Result: Achieves 90.5% on AIME 2024, 83.2% on AIME 2025, 66.0% on LiveCodeBench V5, and 58.1% on LiveCodeBench V6.
Conclusion: High-quality data and GPPO enhance reasoning, with Klear-Reasoner excelling in complex tasks.
Abstract: We present Klear-Reasoner, a model with long reasoning capabilities that demonstrates careful deliberation during problem solving, achieving outstanding performance across multiple benchmarks. Although the community has already produced many excellent reasoning models, reproducing high-performance models remains difficult due to incomplete disclosure of training details. This report provides an in-depth analysis of our reasoning model, covering the entire post-training workflow from data preparation and long Chain-of-Thought supervised fine-tuning (long CoT SFT) to reinforcement learning (RL), along with detailed ablation studies for each experimental component. For SFT data, our experiments show that a small number of high-quality data sources are more effective than a large number of diverse data sources, and that difficult samples can achieve better results without accuracy filtering. In addition, we investigate two key issues with current clipping mechanisms in RL: clipping suppresses critical exploration signals and ignores suboptimal trajectories. To address these challenges, we propose Gradient-Preserving clipping Policy Optimization (GPPO), which gently backpropagates gradients from clipped tokens. GPPO not only enhances the model’s exploration capacity but also improves its efficiency in learning from negative samples. Klear-Reasoner exhibits exceptional reasoning abilities in mathematics and programming, scoring 90.5% on AIME 2024, 83.2% on AIME 2025, 66.0% on LiveCodeBench V5 and 58.1% on LiveCodeBench V6.
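The abstract does not spell out GPPO's exact form, but one plausible instantiation of gradient-preserving clipping is a straight-through clip, sketched below under that assumption:

```python
import torch

def gppo_surrogate(ratio, advantage, eps=0.2):
    """One plausible reading of gradient-preserving clipping: the forward
    value matches PPO's clipped objective, but a straight-through trick
    lets gradients flow through the raw ratio, so clipped tokens still
    contribute a learning signal. The paper's exact formulation may differ."""
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps)
    soft_clipped = ratio + (clipped - ratio).detach()  # value: clipped; grad: identity
    return torch.min(ratio * advantage, soft_clipped * advantage)

ratio = torch.exp(torch.randn(8, requires_grad=True))
loss = -gppo_surrogate(ratio, advantage=torch.randn(8)).mean()
loss.backward()   # clipped tokens still receive gradient through `ratio`
```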
[382] Mjölnir: A Deep Learning Parametrization Framework for Global Lightning Flash Density
Minjong Cheon
Main category: cs.LG
TL;DR: Mjölnir is a deep learning framework for global lightning flash density prediction, trained on ERA5 and WWLLN data, achieving high accuracy in reproducing lightning patterns.
Details
Motivation: To leverage AI for accurate global lightning parameterization, addressing the need for data-driven approaches in Earth system models.Method: Uses InceptionNeXt with SENet and multi-task learning to predict lightning occurrence and magnitude from atmospheric data.
Result: Achieves a 0.96 Pearson correlation for annual mean lightning fields, accurately capturing global and regional patterns.
Conclusion: Mjölnir is effective for lightning prediction and holds promise for integration into AI-based Earth system models.
Abstract: Recent advances in AI-based weather forecasting models, such as FourCastNet, Pangu-Weather, and GraphCast, have demonstrated the remarkable ability of deep learning to emulate complex atmospheric dynamics. Building on this momentum, we propose Mjölnir, a novel deep learning-based framework for global lightning flash density parameterization. Trained on ERA5 atmospheric predictors and World Wide Lightning Location Network (WWLLN) observations at a daily temporal resolution and 1 degree spatial resolution, Mjölnir captures the nonlinear mapping between large-scale environmental conditions and lightning activity. The model architecture is based on the InceptionNeXt backbone with SENet, combined with a multi-task learning strategy to simultaneously predict lightning occurrence and magnitude. Extensive evaluations show that Mjölnir accurately reproduces the global distribution, seasonal variability, and regional characteristics of lightning activity, achieving a global Pearson correlation coefficient of 0.96 for annual mean fields. These results suggest that Mjölnir serves not only as an effective data-driven global lightning parameterization but also as a promising AI-based scheme for next-generation Earth system models (AI-ESMs).
[383] Hyperbolic Fuzzy C-Means with Adaptive Weight-based Filtering for Efficient Clustering
Swagato Das, Arghya Pratihar, Swagatam Das
Main category: cs.LG
TL;DR: HypeFCM is a new clustering algorithm combining fuzzy clustering with hyperbolic geometry to better handle non-Euclidean data, outperforming traditional methods.
Details
Motivation: Traditional clustering methods like FCM struggle with complex, high-dimensional, and non-Euclidean datasets due to linear separability assumptions.Method: HypeFCM integrates fuzzy clustering with hyperbolic geometry, using a weight-based filtering mechanism and Dirichlet distribution for initialization, refining centroids in the Poincaré Disc model.
Result: HypeFCM outperforms conventional fuzzy clustering methods in non-Euclidean settings, as shown in experiments on synthetic and real-world datasets.
Conclusion: HypeFCM is robust and effective for clustering in non-Euclidean spaces, addressing limitations of traditional methods.
Abstract: Clustering algorithms play a pivotal role in unsupervised learning by identifying and grouping similar objects based on shared characteristics. Although traditional clustering techniques, such as hard and fuzzy center-based clustering, have been widely used, they struggle with complex, high-dimensional, and non-Euclidean datasets. In particular, the fuzzy $C$-Means (FCM) algorithm, despite its efficiency and popularity, exhibits notable limitations in non-Euclidean spaces. Euclidean spaces assume linear separability and uniform distance scaling, limiting their effectiveness in capturing complex, hierarchical, or non-Euclidean structures in fuzzy clustering. To overcome these challenges, we introduce Filtration-based Hyperbolic Fuzzy C-Means (HypeFCM), a novel clustering algorithm tailored for better representation of data relationships in non-Euclidean spaces. HypeFCM integrates the principles of fuzzy clustering with hyperbolic geometry and employs a weight-based filtering mechanism to improve performance. The algorithm initializes weights using a Dirichlet distribution and iteratively refines cluster centroids and membership assignments based on a hyperbolic metric in the Poincaré Disc model. Extensive experimental evaluations on $6$ synthetic and $12$ real-world datasets demonstrate that HypeFCM significantly outperforms conventional fuzzy clustering methods in non-Euclidean settings, underscoring its robustness and effectiveness.
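For reference, the hyperbolic metric underlying the method is the Poincaré-disc geodesic distance, which replaces the Euclidean distance inside the fuzzy C-means updates:

```python
import numpy as np

def poincare_distance(u, v, eps=1e-9):
    """Geodesic distance in the Poincaré disc model; inputs must satisfy
    ||x|| < 1. Distances blow up near the boundary, which is what makes
    the disc suited to hierarchical structure."""
    sq = np.sum((u - v) ** 2)
    denom = (1.0 - np.sum(u ** 2)) * (1.0 - np.sum(v ** 2))
    return np.arccosh(1.0 + 2.0 * sq / max(denom, eps))

u, v = np.array([0.1, 0.2]), np.array([0.5, -0.3])
print(poincare_distance(u, v))
```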
[384] Few-Shot Adversarial Low-Rank Fine-Tuning of Vision-Language Models
Sajjad Ghiasvand, Haniyeh Ehsani Oskouie, Mahnoosh Alizadeh, Ramtin Pedarsani
Main category: cs.LG
TL;DR: AdvCLIP-LoRA enhances adversarial robustness of CLIP models fine-tuned with LoRA in few-shot settings, improving performance against attacks like FGSM and PGD without losing much clean accuracy.
Details
Motivation: VLMs like CLIP are vulnerable to adversarial attacks, and existing PEFT techniques like LoRA need robust adaptation methods for few-shot scenarios.Method: AdvCLIP-LoRA formulates adversarial fine-tuning as a minimax optimization problem with theoretical convergence guarantees.
Result: Empirical results show significant robustness improvements against adversarial attacks across eight datasets, with minimal impact on clean accuracy.
Conclusion: AdvCLIP-LoRA is a practical and theoretically grounded solution for robust VLM adaptation in resource-constrained settings.
Abstract: Vision-Language Models (VLMs) such as CLIP have shown remarkable performance in cross-modal tasks through large-scale contrastive pre-training. To adapt these large transformer-based models efficiently for downstream tasks, Parameter-Efficient Fine-Tuning (PEFT) techniques like LoRA have emerged as scalable alternatives to full fine-tuning, especially in few-shot scenarios. However, like traditional deep neural networks, VLMs are highly vulnerable to adversarial attacks, where imperceptible perturbations can significantly degrade model performance. Adversarial training remains the most effective strategy for improving model robustness in PEFT. In this work, we propose AdvCLIP-LoRA, the first algorithm designed to enhance the adversarial robustness of CLIP models fine-tuned with LoRA in few-shot settings. Our method formulates adversarial fine-tuning as a minimax optimization problem and provides theoretical guarantees for convergence under smoothness and nonconvex-strong-concavity assumptions. Empirical results across eight datasets using ViT-B/16 and ViT-B/32 models show that AdvCLIP-LoRA significantly improves robustness against common adversarial attacks (e.g., FGSM, PGD), without sacrificing much clean accuracy. These findings highlight AdvCLIP-LoRA as a practical and theoretically grounded approach for robust adaptation of VLMs in resource-constrained settings.
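The inner maximization of the minimax objective is a standard PGD attack; a minimal sketch follows, with step sizes chosen as common defaults rather than the paper's settings:

```python
import torch

def pgd_perturb(model, x, y, loss_fn, eps=8/255, alpha=2/255, steps=3):
    """Inner maximization: L-infinity PGD on the inputs. In AdvCLIP-LoRA
    the outer minimization then updates only the LoRA parameters on the
    perturbed batch x + delta."""
    delta = torch.zeros_like(x, requires_grad=True)
    for _ in range(steps):
        loss = loss_fn(model(x + delta), y)
        (grad,) = torch.autograd.grad(loss, delta)
        delta = (delta + alpha * grad.sign()).clamp(-eps, eps)
        delta = delta.detach().requires_grad_(True)
    return delta.detach()
```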
[385] Vulnerability-Aware Alignment: Mitigating Uneven Forgetting in Harmful Fine-Tuning
Liang Chen, Xueting Han, Li Shen, Jing Bai, Kam-Fai Wong
Main category: cs.LG
TL;DR: The paper introduces Vulnerability-Aware Alignment (VAA) to address harmful fine-tuning risks by identifying and balancing vulnerable data subsets using Group DRO.
Details
Motivation: Harmful fine-tuning (HFT) breaks safety alignment in LLMs, and existing methods fail to address data vulnerability patterns.Method: VAA estimates data vulnerability, partitions data, and uses Group DRO with adversarial sampling and perturbations for balanced learning.
Result: VAA reduces harmful scores while maintaining task performance, outperforming baselines in four fine-tuning tasks.
Conclusion: VAA effectively mitigates HFT risks by addressing data vulnerability, offering a robust solution for safety alignment.
Abstract: Harmful fine-tuning (HFT), performed directly on open-source LLMs or through Fine-tuning-as-a-Service, breaks safety alignment and poses significant threats. Existing methods aim to mitigate HFT risks by learning robust representations on alignment data or making harmful data unlearnable, but they treat each data sample equally, leaving data vulnerability patterns understudied. In this work, we reveal that certain subsets of alignment data are consistently more prone to forgetting during HFT across different fine-tuning tasks. Inspired by these findings, we propose Vulnerability-Aware Alignment (VAA), which estimates data vulnerability, partitions data into “vulnerable” and “invulnerable” groups, and encourages balanced learning using a group distributionally robust optimization (Group DRO) framework. Specifically, VAA learns an adversarial sampler that samples examples from the currently underperforming group and then applies group-dependent adversarial perturbations to the data during training, aiming to encourage a balanced learning process across groups. Experiments across four fine-tuning tasks demonstrate that VAA significantly reduces harmful scores while preserving downstream task performance, outperforming state-of-the-art baselines.
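The Group DRO ingredient can be sketched with the usual exponentiated-gradient update on group sampling weights; VAA's adversarial perturbation of the sampled examples is omitted here:

```python
import numpy as np

def group_dro_weights(group_losses, q, eta=0.1):
    """Exponentiated-gradient step at the heart of Group DRO: sampling
    mass shifts toward the currently worse-off ("vulnerable") group."""
    q = q * np.exp(eta * np.asarray(group_losses))
    return q / q.sum()

q = np.array([0.5, 0.5])              # vulnerable vs. invulnerable group
q = group_dro_weights([1.2, 0.4], q)  # mass moves to the vulnerable group
print(q)
```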
[386] Saturation Self-Organizing Map
Igor Urbanik, Paweł Gajewski
Main category: cs.LG
TL;DR: SatSOM extends SOMs with a saturation mechanism to combat catastrophic forgetting in continual learning by freezing trained neurons and redirecting learning to underutilized areas.
Details
Motivation: Address catastrophic forgetting in continual learning for Self-Organizing Maps (SOMs).Method: Introduces SatSOM with a saturation mechanism that reduces learning rate and neighborhood radius for well-trained neurons.
Result: Improved knowledge retention by freezing trained neurons and focusing learning on underutilized map areas.
Conclusion: SatSOM effectively mitigates catastrophic forgetting in SOMs for continual learning.
Abstract: Continual learning poses a fundamental challenge for neural systems, which often suffer from catastrophic forgetting when exposed to sequential tasks. Self-Organizing Maps (SOMs), despite their interpretability and efficiency, are not immune to this issue. In this paper, we introduce Saturation Self-Organizing Maps (SatSOM)-an extension of SOMs designed to improve knowledge retention in continual learning scenarios. SatSOM incorporates a novel saturation mechanism that gradually reduces the learning rate and neighborhood radius of neurons as they accumulate information. This effectively freezes well-trained neurons and redirects learning to underutilized areas of the map.
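A minimal sketch of a saturation-gated SOM update follows, with a saturation schedule of our own devising; only the learning-rate side of the mechanism is shown (the paper also shrinks the neighborhood radius with saturation):

```python
import numpy as np

def satsom_step(W, sat, grid, x, lr0=0.5, sigma0=2.0, gain=0.05):
    """One illustrative SatSOM update: each neuron's effective learning
    rate shrinks with its own saturation, and saturation accumulates as
    the neuron absorbs updates, eventually freezing it."""
    bmu = np.argmin(np.linalg.norm(W - x, axis=1))          # best-matching unit
    d2 = np.sum((grid - grid[bmu]) ** 2, axis=1)            # grid distances to BMU
    h = np.exp(-d2 / (2.0 * sigma0 ** 2))                   # neighborhood kernel
    step = (lr0 * (1.0 - sat))[:, None] * h[:, None]        # saturated neurons barely move
    W += step * (x - W)
    sat = np.minimum(1.0, sat + gain * h * (1.0 - sat))     # absorb -> saturate
    return W, sat

W = np.random.rand(25, 3)                                   # 5x5 map, 3-D inputs
sat = np.zeros(25)
grid = np.array([[i, j] for i in range(5) for j in range(5)], dtype=float)
W, sat = satsom_step(W, sat, grid, x=np.random.rand(3))
```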
[387] Equivariance Everywhere All At Once: A Recipe for Graph Foundation Models
Ben Finkelshtein, İsmail İlkan Ceylan, Michael Bronstein, Ron Levie
Main category: cs.LG
TL;DR: The paper proposes a recipe for designing graph foundation models for node-level tasks by investigating necessary symmetries, ensuring generalization across arbitrary graphs and features.
Details
Motivation: To address the lack of broader applicability in graph machine learning by developing models that generalize across diverse graphs and tasks.Method: Systematically investigates symmetries (node and label permutation-equivariance, feature permutation-invariance), characterizes linear transformations respecting these, and proves universality. Uses these layers for node property prediction.
Result: Validated on 29 real-world datasets, showing strong zero-shot performance and consistent improvement with more training graphs.
Conclusion: The proposed recipe successfully generalizes across graphs and features, offering a foundation for node-level tasks.
Abstract: Graph machine learning architectures are typically tailored to specific tasks on specific datasets, which hinders their broader applicability. This has led to a new quest in graph machine learning: how to build graph foundation models capable of generalizing across arbitrary graphs and features? In this work, we present a recipe for designing graph foundation models for node-level tasks from first principles. The key ingredient underpinning our study is a systematic investigation of the symmetries that a graph foundation model must respect. In a nutshell, we argue that label permutation-equivariance alongside feature permutation-invariance are necessary in addition to the common node permutation-equivariance on each local neighborhood of the graph. To this end, we first characterize the space of linear transformations that are equivariant to permutations of nodes and labels, and invariant to permutations of features. We then prove that the resulting network is a universal approximator on multisets that respect the aforementioned symmetries. Our recipe uses such layers on the multiset of features induced by the local neighborhood of the graph to obtain a class of graph foundation models for node property prediction. We validate our approach through extensive experiments on 29 real-world node classification datasets, demonstrating both strong zero-shot empirical performance and consistent improvement as the number of training graphs increases.
[388] Alternates, Assemble! Selecting Optimal Alternates for Citizens’ Assemblies
Angelos Assos, Carmel Baharav, Bailey Flanigan, Ariel Procaccia
Main category: cs.LG
TL;DR: An optimization framework for selecting alternates in citizens’ assemblies improves representation by minimizing misrepresentation using historical data and learning-theoretic methods.
Details
Motivation: Addressing the issue of participant dropout in citizens' assemblies, which leads to unbalanced representation, by improving the selection of alternates.Method: Introduces an algorithmic approach using learning-theoretic machinery to estimate dropout probabilities and select alternates to minimize expected misrepresentation.
Result: Theoretical guarantees on sample complexity and loss due to mis-estimation, with empirical results showing improved representation and fewer required alternates.
Conclusion: The proposed method significantly enhances representation in citizens’ assemblies compared to existing practices.
Abstract: Citizens’ assemblies are an increasingly influential form of deliberative democracy, where randomly selected people discuss policy questions. The legitimacy of these assemblies hinges on their representation of the broader population, but participant dropout often leads to an unbalanced composition. In practice, dropouts are replaced by preselected alternates, but existing methods do not address how to choose these alternates. To address this gap, we introduce an optimization framework for alternate selection. Our algorithmic approach, which leverages learning-theoretic machinery, estimates dropout probabilities using historical data and selects alternates to minimize expected misrepresentation. Our theoretical bounds provide guarantees on sample complexity (with implications for computational efficiency) and on loss due to dropout probability mis-estimation. Empirical evaluation using real-world data demonstrates that, compared to the status quo, our method significantly improves representation while requiring fewer alternates.
[389] Explaining Time Series Classifiers with PHAR: Rule Extraction and Fusion from Post-hoc Attributions
Maciej Mozolewski, Szymon Bobek, Grzegorz J. Nalepa
Main category: cs.LG
TL;DR: PHAR is a framework that converts numeric feature attributions from explainers like LIME and SHAP into human-readable rules for time series classification, improving interpretability and transparency.
Details
Motivation: Machine learning models for time series classification are hard to interpret due to raw data complexity and high dimensionality. PHAR addresses this by creating structured, readable rules.Method: PHAR transforms feature attributions into interpretable intervals, uses rule fusion to consolidate rules, and introduces visualization techniques for trade-offs.
Result: PHAR matches rule-based methods like Anchor in performance, scales better for long sequences, and improves explanation fidelity and consistency.
Conclusion: PHAR enhances interpretability, transparency, and practical use in time series classification, resolving conflicting explanations into coherent insights.
Abstract: Explaining machine learning (ML) models for time series (TS) classification remains challenging due to the difficulty of interpreting raw time series and the high dimensionality of the input space. We introduce PHAR (Post-hoc Attribution Rules), a unified framework that transforms numeric feature attributions from post-hoc, instance-wise explainers (e.g., LIME, SHAP) into structured, human-readable rules. These rules define interpretable intervals that indicate where and when key decision boundaries occur, enhancing model transparency. PHAR performs comparably to native rule-based methods, such as Anchor, while scaling more efficiently to long TS sequences and achieving broader instance coverage. A dedicated rule fusion step consolidates rule sets using strategies like weighted selection and lasso-based refinement, balancing key quality metrics: coverage, confidence, and simplicity. This fusion ensures each instance receives a concise and unambiguous rule, improving both explanation fidelity and consistency. We further introduce visualization techniques to illustrate specificity-generalization trade-offs in the derived rules. PHAR resolves conflicting and overlapping explanations, a common effect of the Rashomon phenomenon, into coherent, domain-adaptable insights. Comprehensive experiments on the UCR/UEA Time Series Classification Archive demonstrate that PHAR improves interpretability, decision transparency, and practical applicability for TS classification tasks.
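As a hypothetical distillation of the attribution-to-rule step (PHAR's actual intervals and fusion are richer), one can keep the highest-attribution window and describe it as an interval rule:

```python
import numpy as np

def attribution_to_rule(x, attr, win_frac=0.1):
    """Keep the contiguous window with the largest total attribution and
    express it as a readable interval rule: "between t_start and t_end the
    signal stays in [lo, hi]". A simplified stand-in for PHAR's extraction."""
    k = max(1, int(len(x) * win_frac))
    window_sums = np.convolve(attr, np.ones(k), mode="valid")
    s = int(np.argmax(window_sums))
    seg = x[s:s + k]
    return {"t_start": s, "t_end": s + k - 1,
            "lo": float(seg.min()), "hi": float(seg.max())}

x, attr = np.random.randn(200), np.abs(np.random.randn(200))  # series + |SHAP|-like scores
print(attribution_to_rule(x, attr))
```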
[390] Echo: Decoupling Inference and Training for Large-Scale RL Alignment on Heterogeneous Swarms
Jie Xiao, Changyuan Fan, Qingnan Ren, Alfred Long, Yuchen Zhang, Rymon Yu, Eric Yang, Lynn Ai, Shaoduo Gan
Main category: cs.LG
TL;DR: Echo decouples RL-based post-training for LLMs into inference and training phases, improving efficiency with lightweight synchronization protocols.
Details
Motivation: Current RL systems for LLMs co-locate trajectory sampling and policy optimization, violating SPMD assumptions and reducing efficiency.Method: Echo uses sequential pull and asynchronous push-pull modes to synchronize inference and training swarms, leveraging heterogeneous hardware.
Result: Echo matches co-located baselines in convergence and reward while utilizing decentralized, edge hardware.
Conclusion: Decentralized, heterogeneous resources can achieve datacentre-grade performance for large-scale RL in LLMs.
Abstract: Modern RL-based post-training for large language models (LLMs) co-locates trajectory sampling and policy optimisation on the same GPU cluster, forcing the system to switch between inference and training workloads. This serial context switching violates the single-program-multiple-data (SPMD) assumption underlying today’s distributed training systems. We present Echo, an RL system that cleanly decouples these two phases across heterogeneous “inference” and “training” swarms while preserving statistical efficiency. Echo introduces two lightweight synchronization protocols: a sequential pull mode that refreshes policy weights on each API call for minimal bias, and an asynchronous push-pull mode that streams version-tagged rollouts through a replay buffer to maximise hardware utilisation. Training four representative RL workloads with Qwen3-4B, Qwen2.5-7B, Qwen3-30B-A3B-Thinking-2507 and Qwen3-32B on a geographically distributed cluster, Echo matches a fully co-located Verl baseline in convergence speed and final reward while off-loading trajectory generation to commodity edge hardware. These promising results demonstrate that large-scale RL for LLMs could achieve datacentre-grade performance using decentralised, heterogeneous resources.
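The asynchronous push-pull mode can be sketched as a version-tagged buffer; the staleness bound and tagging scheme below are our assumptions, not Echo's exact protocol:

```python
import queue

class RolloutBuffer:
    """Inference workers push version-tagged rollouts; the trainer pulls
    and drops trajectories whose policy version is too stale. A sketch of
    the push-pull idea, not Echo's actual replay buffer."""
    def __init__(self, max_staleness=2):
        self.q = queue.Queue()
        self.max_staleness = max_staleness

    def push(self, rollout, policy_version):
        self.q.put((policy_version, rollout))

    def pull(self, current_version):
        while True:
            version, rollout = self.q.get()
            if current_version - version <= self.max_staleness:
                return rollout               # fresh enough to train on
            # otherwise the stale rollout is discarded and we keep pulling
```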
[391] Federated Multi-Objective Learning with Controlled Pareto Frontiers
Jiansheng Rao, Jiayi Li, Zhizhi Gong, Soummya Kar, Haoxuan Li
Main category: cs.LG
TL;DR: CR-FMOL introduces a federated multi-objective learning framework with a preference-cone constraint to ensure client-wise Pareto optimality, improving fairness over FedAvg.
Details
Motivation: Address the limitation of FedAvg in under-serving minority clients and the lack of client-wise Pareto optimality in existing methods like FMOL.Method: Uses a novel preference-cone constraint after local FMGDA/FSMGDA steps, solving a cone-constrained Pareto-MTL sub-problem on the server.
Result: Enhances client fairness on non-IID benchmarks, with comparable accuracy to FedAvg given sufficient training rounds.
Conclusion: CR-FMOL is a promising approach for fairer federated learning, though early-stage performance may lag slightly.
Abstract: Federated learning (FL) is a widely adopted paradigm for privacy-preserving model training, but FedAvg optimises for the majority while under-serving minority clients. Existing methods such as federated multi-objective learning (FMOL) attempt to import multi-objective optimisation (MOO) into FL, but merely deliver task-wise Pareto-stationary points, leaving client fairness to chance. In this paper, we introduce Conically-Regularised FMOL (CR-FMOL), the first federated MOO framework that enforces client-wise Pareto optimality through a novel preference-cone constraint. After local federated multi-gradient descent averaging (FMGDA) / federated stochastic multi-gradient descent averaging (FSMGDA) steps, each client transmits its aggregated task-loss vector as an implicit preference; the server then solves a cone-constrained Pareto-MTL sub-problem centred at the uniform vector, producing a descent direction that is Pareto-stationary for every client within its cone. Experiments on non-IID benchmarks show that CR-FMOL enhances client fairness, and although the early-stage performance is slightly inferior to FedAvg, it is expected to achieve comparable accuracy given sufficient training rounds.
[392] Early Detection of Pancreatic Cancer Using Multimodal Learning on Electronic Health Record
Mosbah Aouad, Anirudh Choudhary, Awais Farooq, Steven Nevers, Lusine Demirkhanyan, Bhrandon Harris, Suguna Pappu, Christopher Gondi, Ravishankar Iyer
Main category: cs.LG
TL;DR: A multimodal approach using EHR data improves early PDAC detection up to one year before diagnosis, outperforming state-of-the-art methods by 6.5%-15.5% AUC.
Details
Motivation: Early detection of PDAC is challenging due to lack of symptoms and biomarkers.Method: Combines neural controlled differential equations, pretrained language models, recurrent networks, and cross-attention for multimodal EHR data integration.
Result: Achieves 6.5%-15.5% AUC improvement; identifies known and new biomarkers.
Conclusion: The approach enhances early PDAC detection and biomarker discovery.
Abstract: Pancreatic ductal adenocarcinoma (PDAC) is one of the deadliest cancers, and early detection remains a major clinical challenge due to the absence of specific symptoms and reliable biomarkers. In this work, we propose a new multimodal approach that integrates longitudinal diagnosis code histories and routinely collected laboratory measurements from electronic health records to detect PDAC up to one year prior to clinical diagnosis. Our method combines neural controlled differential equations to model irregular lab time series, pretrained language models and recurrent networks to learn diagnosis code trajectory representations, and cross-attention mechanisms to capture interactions between the two modalities. We develop and evaluate our approach on a real-world dataset of nearly 4,700 patients and achieve significant improvements in AUC ranging from 6.5% to 15.5% over state-of-the-art methods. Furthermore, our model identifies diagnosis codes and laboratory panels associated with elevated PDAC risk, including both established and new biomarkers. Our code is available at https://github.com/MosbahAouad/EarlyPDAC-MML.
[393] Technical Report: Full-Stack Fine-Tuning for the Q Programming Language
Brendan R. Hogan, Will Brown, Adel Boyarsky, Anderson Schneider, Yuriy Nevmyvaka
Main category: cs.LG
TL;DR: The paper presents an open-source approach to adapt large language models (LLMs) to the Q programming language, addressing the challenge of under-represented tasks. It introduces a Leetcode-style evaluation dataset, benchmarks models, and trains reasoning and non-reasoning models, achieving significant accuracy improvements over frontier models like Claude Opus-4 and GPT-4.1.
Details
Motivation: Specialized applications, especially in niche programming languages like Q (used in quantitative finance), are challenging for general-purpose LLMs due to their under-representation on the Internet. This work aims to bridge this gap.Method: The approach includes creating a Q evaluation dataset, benchmarking major models, and training a suite of models (1.5B to 32B parameters) using pretraining, supervised fine-tuning, and reinforcement learning.
Result: The best model achieves 59% pass@1 accuracy on the Q benchmark, outperforming Claude Opus-4 by 29.5% and GPT-4.1. All models, including the smallest (1.5B), surpass GPT-4.1.
Conclusion: The work provides a scalable methodology for adapting LLMs to niche domains, with released models, code, and data. The techniques are broadly applicable to other under-represented tasks.
Abstract: Even though large language models are becoming increasingly capable, it is still unreasonable to expect them to excel at tasks that are under-represented on the Internet. Leveraging LLMs for specialized applications, particularly in niche programming languages and private domains, remains challenging and largely unsolved. In this work, we address this gap by presenting a comprehensive, open-source approach for adapting LLMs to the Q programming language, a popular tool in quantitative finance that is much less present on the Internet compared to Python, C, Java, and other “mainstream” languages and is therefore not a strong suit of general-purpose AI models. We introduce a new Leetcode-style evaluation dataset for Q, benchmark major frontier models on the dataset, and then perform pretraining, supervised fine-tuning, and reinforcement learning to train a suite of reasoning and non-reasoning models based on the Qwen-2.5 series, spanning five parameter sizes (1.5B, 3B, 7B, 14B, 32B). Our best model achieves a pass@1 accuracy of 59 percent on our Q benchmark, surpassing the best-performing frontier model, Claude Opus-4, by 29.5 percent. Additionally, all models, even our 1.5B model, outperform GPT-4.1 on this task. In addition to releasing models, code, and data, we provide a detailed blueprint for dataset construction, model pretraining, supervised fine-tuning, and reinforcement learning. Our methodology is broadly applicable, and we discuss how these techniques can be extended to other tasks, including those where evaluation may rely on soft or subjective signals.
[394] Semantic Caching for Low-Cost LLM Serving: From Offline Learning to Online Adaptation
Xutong Liu, Baran Atalar, Xiangxiang Dai, Jinhang Zuo, Siwei Wang, John C. S. Lui, Wei Chen, Carlee Joe-Wong
Main category: cs.LG
TL;DR: The paper introduces a learning-based framework for semantic cache eviction in LLMs, addressing scalability and uncertainty challenges.
Details
Motivation: High inference costs of LLMs and inefficiencies of traditional caching methods motivate the need for a principled, adaptive solution.
Method: A learning-based framework for semantic cache eviction, with offline optimization and online learning algorithms. A toy cache sketch follows the abstract below.
Result: The proposed algorithms match or outperform baselines on a synthetic dataset and come with provable efficiency guarantees.
Conclusion: The framework provides a scalable, adaptive solution for semantic caching in LLMs under uncertainty.
Abstract: Large Language Models (LLMs) are revolutionizing how users interact with information systems, yet their high inference cost poses serious scalability and sustainability challenges. Caching inference responses, allowing them to be retrieved without another forward pass through the LLM, has emerged as one possible solution. Traditional exact-match caching, however, overlooks the semantic similarity between queries, leading to unnecessary recomputation. Semantic caching addresses this by retrieving responses based on semantic similarity, but introduces a fundamentally different cache eviction problem: one must account for mismatch costs between incoming queries and cached responses. Moreover, key system parameters, such as query arrival probabilities and serving costs, are often unknown and must be learned over time. Existing semantic caching methods are largely ad-hoc, lacking theoretical foundations and unable to adapt to real-world uncertainty. In this paper, we present a principled, learning-based framework for semantic cache eviction under unknown query and cost distributions. We formulate both offline optimization and online learning variants of the problem, and develop provably efficient algorithms with state-of-the-art guarantees. We also evaluate our framework on a synthetic dataset, showing that our proposed algorithms achieve performance matching or surpassing the baselines.
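A minimal, illustrative sketch of the semantic-caching idea described above: responses are served on embedding similarity rather than exact match, and eviction weighs how valuable an entry is. The embedding function, threshold, and eviction score are toy stand-ins, not the paper's learned algorithms.
import numpy as np

def embed(text: str, dim: int = 64) -> np.ndarray:
    rng = np.random.default_rng(abs(hash(text)) % (2**32))  # toy embedding
    v = rng.standard_normal(dim)
    return v / np.linalg.norm(v)

class SemanticCache:
    def __init__(self, capacity: int = 3, threshold: float = 0.85):
        self.capacity, self.threshold = capacity, threshold
        self.entries = []  # (embedding, response, hit_count)

    def get(self, query: str):
        q = embed(query)
        for i, (e, resp, hits) in enumerate(self.entries):
            if float(q @ e) >= self.threshold:     # cosine sim (unit vectors)
                self.entries[i] = (e, resp, hits + 1)
                return resp                        # cache hit, no LLM call
        return None

    def put(self, query: str, response: str):
        if len(self.entries) >= self.capacity:
            # Evict the least-valuable entry; a learned policy would estimate
            # arrival probabilities and mismatch costs instead of raw hits.
            self.entries.pop(min(range(len(self.entries)),
                                 key=lambda i: self.entries[i][2]))
        self.entries.append((embed(query), response, 0))

cache = SemanticCache()
cache.put("capital of France?", "Paris")
print(cache.get("capital of France?"))  # "Paris" (same text -> same embedding)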
cs.MA
[395] CARES: Collaborative Agentic Reasoning for Error Detection in Surgery
Chang Han Low, Zhu Zhuo, Ziyue Wang, Jialang Xu, Haofeng Liu, Nazir Sirajudeen, Matthew Boal, Philip J. Edwards, Danail Stoyanov, Nader Francis, Jiehui Zhong, Di Gu, Evangelos B. Mazomenos, Yueming Jin
Main category: cs.MA
TL;DR: MERP dataset and CARES framework address surgical error detection in robotic prostatectomy with zero-shot, clinically-informed reasoning, outperforming existing methods.
Details
Motivation: Current surgical error detection methods lack sufficient training data and struggle with methodological constraints, especially in robotic-assisted surgery.
Method: CARES uses adaptive, error-specific Chain-of-Thought prompts and risk-aware routing to assign tasks to expertise-matched reasoning pathways, involving temporal, spatial, and procedural agents.
Result: CARES achieves 54.3 mF1 on RARP and 52.0 mF1 on MERP datasets, outperforming zero-shot approaches by up to 14% and competing with trained models.
Conclusion: The MERP dataset and CARES framework provide a robust, zero-shot solution for surgical error detection, with public availability of resources.
Abstract: Robotic-assisted surgery (RAS) introduces complex challenges that current surgical error detection methods struggle to address effectively due to limited training data and methodological constraints. Therefore, we construct MERP (Multi-class Error in Robotic Prostatectomy), a comprehensive dataset for error detection in robotic prostatectomy with frame-level annotations featuring six clinically aligned error categories. In addition, we propose CARES (Collaborative Agentic Reasoning for Error Detection in Surgery), a novel zero-shot clinically-informed and risk-stratified agentic reasoning architecture for multi-class surgical error detection. CARES implements adaptive generation of medically informed, error-specific Chain-of-Thought (CoT) prompts across multiple expertise levels. The framework employs risk-aware routing to assign each error task to an expertise-matched reasoning pathway based on complexity and clinical impact. Subsequently, each pathway decomposes surgical error analysis across three specialized agents performing temporal, spatial, and procedural analysis. Each agent analyzes the case using dynamically selected prompts tailored to the assigned expertise level and error type, generating detailed and transparent reasoning traces. By incorporating clinically informed reasoning from established surgical assessment guidelines, CARES enables zero-shot surgical error detection without prior training. Evaluation demonstrates superior performance with 54.3 mF1 on RARP and 52.0 mF1 on MERP datasets, outperforming existing zero-shot approaches by up to 14% while remaining competitive with trained models. Ablation studies demonstrate the effectiveness of our method. The dataset and code will be publicly available.
[396] Fault Tolerant Multi-Agent Learning with Adversarial Budget Constraints
David Mguni, Yaqi Sun, Haojun Chen, Amir Darabi, Larry Olanrewaju Orimoloye, Yaodong Yang
Main category: cs.MA
TL;DR: MARTA is a framework for training MARL agents to be resilient to agent failures by using adversarial training and a malfunction budget.
Details
Motivation: Ensuring MARL policies remain effective despite inevitable agent failures in real-world deployments.
Method: Proposes MARTA, an adversarial Markov game where an adversary disables agents in high-risk states, and agents learn to mitigate such faults under a malfunction budget. A toy budget-constrained adversary sketch follows the abstract below.
Result: Theoretical convergence to a Markov perfect equilibrium and state-of-the-art fault-tolerant performance in benchmark environments.
Conclusion: MARTA effectively trains MARL agents to handle severe faults, ensuring robust multi-agent coordination.
Abstract: In multi-agent systems, the safe and reliable execution of tasks often depends on agents correctly coordinating their actions. However, in real-world deployments, failures of computational components are inevitable, presenting a critical challenge: ensuring that multi-agent reinforcement learning (MARL) policies remain effective even when some agents malfunction. We propose the Multi-Agent Robust Training Algorithm (MARTA), a plug-and-play framework for training MARL agents to be resilient to potentially severe faults. MARTA operates in cooperative multi-agent settings where agents may lose the ability to execute their intended actions. It learns to identify failure scenarios that are especially detrimental to system performance and equips agents with strategies to mitigate their impact. At the heart of MARTA is a novel adversarial Markov game in which an adversary – modelled via Markov switching controls – learns to disable agents in high-risk state regions, while the remaining agents are trained to jointly best-respond to such targeted malfunctions. To ensure practicality, MARTA enforces a malfunction budget, constraining the adversary to a fixed number of failures and learning robust policies accordingly. We provide theoretical guarantees that MARTA converges to a Markov perfect equilibrium, ensuring agents optimally counteract worst-case faults. Empirically, we show that MARTA achieves state-of-the-art fault-tolerant performance across benchmark environments, including Multi-Agent Particle World and Level-Based Foraging.
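A toy illustration of the budget-constrained adversary idea: during a rollout the adversary may disable at most a fixed number of agents, switching on in states it deems high-risk. The environment, risk score, and policies below are stand-ins, not MARTA itself.
import numpy as np

rng = np.random.default_rng(0)
n_agents, horizon, budget = 4, 20, 2

def risk(state: np.ndarray) -> float:
    return float(state.mean())          # stand-in for a learned risk estimate

state = rng.random(n_agents)
disabled, used = np.zeros(n_agents, dtype=bool), 0
for t in range(horizon):
    if used < budget and risk(state) > 0.6:          # adversary switches on
        target = int(np.argmax(state * ~disabled))   # most critical agent
        disabled[target], used = True, used + 1
    actions = rng.random(n_agents) * ~disabled       # disabled agents become no-ops
    state = np.clip(state + 0.1 * (actions - 0.5), 0, 1)  # toy dynamics
print(f"agents disabled: {disabled.sum()} (budget {budget})")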
[397] Converging to Stability in Two-Sided Bandits: The Case of Unknown Preferences on Both Sides of a Matching Market
Gaurab Pokharel, Sanmay Das
Main category: cs.MA
TL;DR: The paper develops algorithms for repeated two-sided matching with uncertain preferences, addressing scenarios where arm preferences are unknown or uncertain.
Details
Motivation: To solve the challenge of stable matching when preferences on both sides are uncertain and no explicit communication exists between agents.
Method: Proposes algorithms where agents start with optimistic beliefs about arm preferences and update them over time, combining these with beliefs about the value of matching. A toy optimism-based sketch follows the abstract below.
Result: The algorithms provably converge to stable matchings in settings where arm preferences are unknown or uncertain.
Conclusion: The approach successfully extends stable matching solutions to more challenging scenarios with uncertain preferences on both sides.
Abstract: We study the problem of repeated two-sided matching with uncertain preferences (two-sided bandits), and no explicit communication between agents. Recent work has developed algorithms that converge to stable matchings when one side (the proposers or agents) must learn their preferences, but the preferences of the other side (the proposees or arms) are common knowledge, and the matching mechanism uses simultaneous proposals at each round. We develop new algorithms that provably converge to stable matchings for two more challenging settings: one where the arm preferences are no longer common knowledge, and a second, more general one where the arms are also uncertain about their preferences. In our algorithms, agents start with optimistic beliefs about arms’ preferences and update these preferences over time. The key insight is in how to combine these beliefs about arm preferences with beliefs about the value of matching with an arm conditional on one’s proposal being accepted when choosing whom to propose to.
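The optimism-under-uncertainty mechanism can be illustrated with a toy simulation in which each proposer keeps UCB-style value estimates and proposes to its most optimistic arm; the acceptance rule and reward model below are simplified assumptions, not the paper's algorithm.
import numpy as np

rng = np.random.default_rng(1)
n_agents, n_arms, rounds = 3, 3, 500
true_value = rng.random((n_agents, n_arms))          # unknown to agents
counts = np.ones((n_agents, n_arms))
means = np.zeros((n_agents, n_arms))

for t in range(1, rounds + 1):
    ucb = means + np.sqrt(2 * np.log(t) / counts)    # optimistic belief
    proposals = ucb.argmax(axis=1)                   # propose optimistically
    for arm in range(n_arms):
        proposers = np.where(proposals == arm)[0]
        if len(proposers) == 0:
            continue
        winner = proposers[rng.integers(len(proposers))]  # toy arm preference
        r = true_value[winner, arm] + 0.1 * rng.standard_normal()
        counts[winner, arm] += 1
        means[winner, arm] += (r - means[winner, arm]) / counts[winner, arm]
print("final proposals:", (means + np.sqrt(2 * np.log(rounds) / counts)).argmax(axis=1))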
[398] Making Teams and Influencing Agents: Efficiently Coordinating Decision Trees for Interpretable Multi-Agent Reinforcement Learning
Rex Chen, Stephanie Milani, Zhicheng Zhang, Norman Sadeh, Fei Fang
Main category: cs.MA
TL;DR: HYDRAVIPER is an interpretable MARL algorithm using decision trees to balance performance and computational efficiency.
Details
Motivation: Poor interpretability of MARL policies limits real-world applicability; surrogates must be performant and efficient.
Method: HYDRAVIPER uses decision trees, coordinates agent training, and adaptively allocates interaction budgets.
Result: Matches state-of-the-art performance with lower runtime and maintains Pareto efficiency.
Conclusion: HYDRAVIPER effectively balances interpretability, performance, and computational efficiency in MARL.
Abstract: Poor interpretability hinders the practical applicability of multi-agent reinforcement learning (MARL) policies. Deploying interpretable surrogates of uninterpretable policies enhances the safety and verifiability of MARL for real-world applications. However, if these surrogates are to interact directly with the environment within human supervisory frameworks, they must be both performant and computationally efficient. Prior work on interpretable MARL has either sacrificed performance for computational efficiency or computational efficiency for performance. To address this issue, we propose HYDRAVIPER, a decision tree-based interpretable MARL algorithm. HYDRAVIPER coordinates training between agents based on expected team performance, and adaptively allocates budgets for environment interaction to improve computational efficiency. Experiments on standard benchmark environments for multi-agent coordination and traffic signal control show that HYDRAVIPER matches the performance of state-of-the-art methods using a fraction of the runtime, and that it maintains a Pareto frontier of performance for different interaction budgets.
[399] Frequency Point Game Environment for UAVs via Expert Knowledge and Large Language Model
Jingpu Yang, Hang Zhang, Fengxian Ji, Yufeng Wang, Mingjie Wang, Yizhe Luo, Wenrui Ding
Main category: cs.MA
TL;DR: UAV-FPG is a game-theoretic model for UAV communication, integrating expert knowledge and large language models to improve anti-jamming strategies and dynamic path planning.
Details
Motivation: Addressing challenges in spectrum competition modeling, expert knowledge integration, and opponent behavior prediction in UAV communication.
Method: Proposes UAV-FPG, a game-theoretic model simulating interference and anti-interference strategies, using expert knowledge and large language models for optimization.
Result: Effective integration of expert knowledge and large language models improves path planning, outperforming fixed-path strategies.
Conclusion: UAV-FPG advances anti-jamming strategies and intelligent decision-making in UAV communication systems.
Abstract: Unmanned Aerial Vehicles (UAVs) have made significant advancements in communication stability and security through techniques such as frequency hopping, signal spreading, and adaptive interference suppression. However, challenges remain in modeling spectrum competition, integrating expert knowledge, and predicting opponent behavior. To address these issues, we propose UAV-FPG (Unmanned Aerial Vehicle - Frequency Point Game), a game-theoretic environment model that simulates the dynamic interaction between interference and anti-interference strategies of opponent and ally UAVs in communication frequency bands. The model incorporates a prior expert knowledge base to optimize frequency selection and employs large language models for path planning, simulating a “strong adversary”. Experimental results highlight the effectiveness of integrating the expert knowledge base and the large language model, with the latter significantly improving path planning in dynamic scenarios through iterative interactions, outperforming fixed-path strategies. UAV-FPG provides a robust platform for advancing anti-jamming strategies and intelligent decision-making in UAV communication systems.
cs.MM
[400] Fact-Checking at Scale: Multimodal AI for Authenticity and Context Verification in Online Media
Van-Hoang Phan, Tung-Duong Le-Duc, Long-Khanh Pham, Anh-Thu Le, Quynh-Huong Dinh-Nguyen, Dang-Quan Vo, Hoang-Quoc Nguyen-Son, Anh-Duy Tran, Dang Vu, Minh-Son Dao
Main category: cs.MM
TL;DR: A system for verifying multimedia content’s authenticity and contextual accuracy, integrating visual forensics, textual analysis, and multimodal reasoning, with proven effectiveness in real-world scenarios.
Details
Motivation: The rapid spread of misinformation and disinformation on social media, especially during crises, necessitates robust multimedia verification tools.
Method: A unified verification pipeline combining visual forensics, textual analysis, and multimodal reasoning, with a hybrid approach for detecting out-of-context media.
Result: The system demonstrated effectiveness in diverse real-world scenarios, as evaluated on the ACM Multimedia 2025 Grand Challenge benchmark.
Conclusion: The system advances multimedia verification, providing practical tools for journalists, fact-checkers, and researchers to combat misinformation.
Abstract: The proliferation of multimedia content on social media platforms has dramatically transformed how information is consumed and disseminated. While this shift enables real-time coverage of global events, it also facilitates the rapid spread of misinformation and disinformation, especially during crises such as wars, natural disasters, or elections. The rise of synthetic media and the reuse of authentic content in misleading contexts have intensified the need for robust multimedia verification tools. In this paper, we present a comprehensive system developed for the ACM Multimedia 2025 Grand Challenge on Multimedia Verification. Our system assesses the authenticity and contextual accuracy of multimedia content in multilingual settings and generates both expert-oriented verification reports and accessible summaries for the general public. We introduce a unified verification pipeline that integrates visual forensics, textual analysis, and multimodal reasoning, and propose a hybrid approach to detect out-of-context (OOC) media through semantic similarity, temporal alignment, and geolocation cues. Extensive evaluations on the Grand Challenge benchmark demonstrate the system’s effectiveness across diverse real-world scenarios. Our contributions advance the state of the art in multimedia verification and offer practical tools for journalists, fact-checkers, and researchers confronting information integrity challenges in the digital age.
[401] DASC: Depth-of-Field Aware Scene Complexity Metric for 3D Visualization on Light Field Display
Kamran Akbar, Robert Bregovic, Federica Battisti
Main category: cs.MM
TL;DR: The paper proposes a DoF Aware Scene Complexity (DASC) metric to improve 3D content quality in light field displays by predicting preferred blur levels based on scene complexity.
Details
Motivation: Light field displays suffer from aliasing artifacts outside the Depth of Field (DoF), and current blurring solutions can remove scene details. The study aims to address this by evaluating observer preferences for blur levels.
Method: The research introduces the DASC metric, evaluates observer preferences for blur levels, and proposes a predictive model based on these preferences.
Result: Subjective studies reveal preferred blur levels for different scenes, leading to a model that predicts optimal blurring based on the DASC metric.
Conclusion: The DASC metric and predictive model enhance 3D content quality in light field displays by balancing detail preservation and artifact reduction.
Abstract: Light field display is one of the technologies providing 3D immersive visualization. However, a light field display generates only a limited number of light rays, which results in finite angular and spatial resolutions. Therefore, 3D content can be shown with high quality only within a narrow depth range notated as Depth of Field (DoF) around the display screen. Outside this range, due to the appearance of aliasing artifacts, the quality degrades proportionally to the distance from the screen. One solution to mitigate the artifacts is depth of field rendering, which blurs the content in the distorted regions but can result in the removal of scene details. This research focuses on proposing a DoF Aware Scene Complexity (DASC) metric that characterizes 3D content based on geometrical and positional factors considering the light field display’s DoF. In this research, we also evaluate the observers’ preference across different levels of blurriness caused by DoF rendering, ranging from sharp, aliased scenes to overly smoothed alias-free scenes. We have conducted this study over multiple scenes that we created to account for different types of content. Based on the outcome of subjective studies, we propose a model that takes the value of the DASC metric as input and predicts the preferred level of blurring for the given scene as output.
[402] Gotta Hear Them All: Towards Sound Source Aware Audio Generation
Wei Guo, Heng Wang, Jianbo Ma, Weidong Cai
Main category: cs.MM
TL;DR: The paper proposes SS2A, a sound source-aware audio generator, to improve audio synthesis by focusing on local sound sources, achieving state-of-the-art performance in image-to-audio tasks.
Details
Motivation: Existing audio synthesis methods lack immersiveness and expressiveness due to overlooking local sound sources. The paper aims to address this by explicitly modeling sound sources.
Method: SS2A uses visual detection and cross-modality translation to perceive sound sources, learns a Cross-Modal Sound Source (CMSS) Manifold for semantic disambiguation, and mixes CMSS semantics into an audio representation for generation. A novel dataset (VGGS3) and a Sound Source Matching Score are introduced.
Result: SS2A achieves state-of-the-art performance in image-to-audio tasks and demonstrates intuitive synthesis control by compositing vision, text, and audio. It also performs well in video-to-audio tasks with temporal aggregation.
Conclusion: Explicit sound source modeling significantly improves audio synthesis, as shown by SS2A’s performance and versatility in multimodal tasks.
Abstract: Audio synthesis has broad applications in multimedia. Recent advancements have made it possible to generate relevant audios from inputs describing an audio scene, such as images or texts. However, the immersiveness and expressiveness of the generation are limited. One possible problem is that existing methods solely rely on the global scene and overlook details of local sounding objects (i.e., sound sources). To address this issue, we propose a Sound Source-Aware Audio (SS2A) generator. SS2A is able to locally perceive multimodal sound sources from a scene with visual detection and cross-modality translation. It then contrastively learns a Cross-Modal Sound Source (CMSS) Manifold to semantically disambiguate each source. Finally, we attentively mix their CMSS semantics into a rich audio representation, from which a pretrained audio generator outputs the sound. To model the CMSS manifold, we curate a novel single-sound-source visual-audio dataset VGGS3 from VGGSound. We also design a Sound Source Matching Score to clearly measure localized audio relevance. With the effectiveness of explicit sound source modeling, SS2A achieves state-of-the-art performance in extensive image-to-audio tasks. We also qualitatively demonstrate SS2A’s ability to achieve intuitive synthesis control by compositing vision, text, and audio conditions. Furthermore, we show that our sound source modeling can achieve competitive video-to-audio performance with a straightforward temporal aggregation mechanism.
[403] LayLens: Improving Deepfake Understanding through Simplified Explanations
Abhijeet Narang, Parul Gupta, Liuyijia Su, Abhinav Dhall
Main category: cs.MM
TL;DR: LayLens is a tool simplifying deepfake understanding for all users via explainable detection, natural language simplification, and visual reconstruction, improving clarity and confidence.
Details
Motivation: To bridge the gap between technical deepfake explanations and human understanding, making forensics accessible to everyone.
Method: A three-stage pipeline: explainable detection, natural language simplification, and visual reconstruction, presented via an interface.
Result: User study showed improved clarity, reduced cognitive load, and increased confidence in identifying deepfakes.
Conclusion: LayLens advances transparent, user-centric deepfake forensics.
Abstract: This demonstration paper presents LayLens, a tool aimed at making deepfake understanding easier for users of all educational backgrounds. While prior works often rely on outputs containing technical jargon, LayLens bridges the gap between model reasoning and human understanding through a three-stage pipeline: (1) explainable deepfake detection using a state-of-the-art forgery localization model, (2) natural language simplification of technical explanations using a vision-language model, and (3) visual reconstruction of a plausible original image via guided image editing. The interface presents both technical and layperson-friendly explanations in addition to a side-by-side comparison of the uploaded and reconstructed images. A user study with 15 participants shows that simplified explanations significantly improve clarity and reduce cognitive load, with most users expressing increased confidence in identifying deepfakes. LayLens offers a step toward transparent, trustworthy, and user-centric deepfake forensics.
eess.AS
[404] Exploring Disentangled Neural Speech Codecs from Self-Supervised Representations
Ryo Aihara, Yoshiki Masuyama, Gordon Wichern, François G. Germain, Jonathan Le Roux
Main category: eess.AS
TL;DR: A neural audio codec (NAC) with structured disentanglement is proposed, matching conventional NACs in reconstruction and voice conversion (VC) performance.
Details
Motivation: Speech conveys both linguistic and paralinguistic features; entangled encoding limits flexibility, especially for tasks like VC.
Method: Develops a discrete NAC using $k$-means quantization with self-supervised features for disentanglement. A toy quantization sketch follows the abstract below.
Result: Achieves comparable reconstruction to conventional NACs and matches VC effectiveness.
Conclusion: The disentangled NAC offers flexibility without sacrificing performance.
Abstract: Neural audio codecs (NACs), which use neural networks to generate compact audio representations, have garnered interest for their applicability to many downstream tasks – especially quantized codecs due to their compatibility with large language models. However, unlike text, speech conveys not only linguistic content but also rich paralinguistic features. Encoding these elements in an entangled fashion may be suboptimal, as it limits flexibility. For instance, voice conversion (VC) aims to convert speaker characteristics while preserving the original linguistic content, which requires a disentangled representation. Inspired by VC methods utilizing $k$-means quantization with self-supervised features to disentangle phonetic information, we develop a discrete NAC capable of structured disentanglement. Experimental evaluations show that our approach achieves reconstruction performance on par with conventional NACs that do not explicitly perform disentanglement, while also matching the effectiveness of conventional VC techniques.
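A minimal sketch of the k-means quantization step that underlies this style of disentanglement: continuous SSL frame features are replaced by discrete cluster IDs, which tends to retain phonetic content while discarding much speaker detail. The random features and cluster count below are placeholders for, e.g., HuBERT hidden states.
import numpy as np
from sklearn.cluster import KMeans

frames = np.random.default_rng(0).standard_normal((1000, 256))  # (T, feat_dim)
kmeans = KMeans(n_clusters=100, n_init=10, random_state=0).fit(frames)

units = kmeans.predict(frames)   # discrete "pseudo-phoneme" tokens, shape (T,)
print(units[:20])                # token stream fed to a codec / VC decoder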
[405] Joint decoding method for controllable contextual speech recognition based on Speech LLM
Yangui Fang, Jing Peng, Yu Xi, Xu Li, Haoyu Li, Chengwei Zhang, Guohui Zhong, Kai Yu
Main category: eess.AS
TL;DR: The paper proposes a joint decoding method to control contextual information in speech recognition, addressing limitations of direct prompt-based injection.
Details
Motivation: Current methods rely on model attention for contextual biasing, lacking explicit control over information injection.
Method: A joint decoding method is introduced to explicitly control contextual information injection. A toy biasing sketch follows the abstract below.
Result: The method achieves superior recognition performance and enables sensitive word suppression. It also imparts long contextual capabilities to Speech LLMs without pre-training.
Conclusion: The proposed joint decoding method effectively controls contextual information, enhancing recognition and extending capabilities of Speech LLMs.
Abstract: Contextual speech recognition refers to the ability to identify preferences for specific content based on contextual information. Recently, leveraging the contextual understanding capabilities of Speech LLMs to achieve contextual biasing by injecting contextual information through prompts has emerged as a research hotspot. However, the direct information injection method via prompts relies on the internal attention mechanism of the model, making it impossible to explicitly control the extent of information injection. To address this limitation, we propose a joint decoding method to control the contextual information. This approach enables explicit control over the injected contextual information and achieves superior recognition performance. Additionally, our method can also be used for sensitive word suppression recognition. Furthermore, experimental results show that even a Speech LLM not pre-trained on long contextual data can acquire long contextual capabilities through our method.
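Since the paper's exact decoding rule is not reproduced here, the following toy sketch shows the general shape of controllable biasing at decode time: a tunable weight on a context bonus added to model log-probabilities, with a negative weight yielding suppression. Vocabulary and scores are invented.
import numpy as np

vocab = ["the", "patient", "aspirin", "placebo"]
log_probs = np.log(np.array([0.4, 0.3, 0.2, 0.1]))   # toy model distribution
context_bonus = np.array([0.0, 0.0, 1.0, 0.0])       # "aspirin" is in context

def next_token(lmbda: float) -> str:
    score = log_probs + lmbda * context_bonus        # joint decoding score
    return vocab[int(np.argmax(score))]

print(next_token(0.0))    # "the"      - no biasing
print(next_token(1.0))    # "aspirin"  - context boosted
print(next_token(-1.0))   # "the"      - "aspirin" suppressed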
[406] MultiAiTutor: Child-Friendly Educational Multilingual Speech Generation Tutor with LLMs
Xiaoxue Gao, Huayun Zhang, Nancy F. Chen
Main category: eess.AS
TL;DR: MultiAiTutor is a multilingual AI tutor for child-friendly speech generation in low-resource languages, outperforming baselines in educational contexts.
Details
Motivation: Addressing the challenge of high-quality, child-friendly speech generation for low-resource languages in education.
Method: Leverages LLM architecture for multilingual speech generation, focusing on Singaporean-accent Mandarin, Malay, and Tamil.
Result: Superior performance in objective metrics and subjective evaluations compared to baseline methods.
Conclusion: MultiAiTutor effectively enhances language learning for children in diverse cultural contexts.
Abstract: Generative speech models have demonstrated significant potential in personalizing teacher-student interactions, offering valuable real-world applications for language learning in children’s education. However, achieving high-quality, child-friendly speech generation remains challenging, particularly for low-resource languages across diverse cultural contexts. In this paper, we propose MultiAiTutor, an educational multilingual generative AI tutor with child-friendly designs that leverages an LLM architecture for speech generation tailored to educational purposes. MultiAiTutor integrates age-appropriate multilingual speech generation, facilitating young children’s language learning through culturally relevant image-description tasks in three low-resource languages: Singaporean-accent Mandarin, Malay, and Tamil. Experimental results from both objective metrics and subjective evaluations demonstrate the superior performance of the proposed MultiAiTutor compared to baseline methods.
[407] Transient Noise Removal via Diffusion-based Speech Inpainting
Mordehay Moradi, Sharon Gannot
Main category: eess.AS
TL;DR: PGDI is a diffusion-based speech inpainting framework that accurately reconstructs missing or corrupted speech segments up to one second, preserving speaker identity and environmental factors.
Details
Motivation: To address limitations of previous methods in handling speaker variability and long gap lengths in speech inpainting.
Method: Uses classifier guidance, specifically phoneme-level guidance, to improve reconstruction fidelity in a speaker-independent manner.
Result: PGDI delivers superior inpainting performance, handling challenging acoustic conditions and long masked segments.
Conclusion: PGDI is robust and effective for real-world applications, even without transcript access during inference.
Abstract: In this paper, we present PGDI, a diffusion-based speech inpainting framework for restoring missing or severely corrupted speech segments. Unlike previous methods that struggle with speaker variability or long gap lengths, PGDI can accurately reconstruct gaps of up to one second in length while preserving speaker identity, prosody, and environmental factors such as reverberation. Central to this approach is classifier guidance, specifically phoneme-level guidance, which substantially improves reconstruction fidelity. PGDI operates in a speaker-independent manner and maintains robustness even when long segments are completely masked by strong transient noise, making it well-suited for real-world applications, such as fireworks, door slams, hammer strikes, and construction noise. Through extensive experiments across diverse speakers and gap lengths, we demonstrate PGDI’s superior inpainting performance and its ability to handle challenging acoustic conditions. We consider both scenarios, with and without access to the transcript during inference, showing that while the availability of text further enhances performance, the model remains effective even in its absence. For audio samples, visit: https://mordehaym.github.io/PGDI/
[408] EGGCodec: A Robust Neural Encodec Framework for EGG Reconstruction and F0 Extraction
Rui Feng, Yuang Chen, Yu Hu, Jun Du, Jiahong Yuan
Main category: eess.AS
TL;DR: EGGCodec is a neural Encodec framework for EGG signal reconstruction and F0 extraction, outperforming existing methods with improved accuracy and efficiency.
Details
Motivation: To enhance the accuracy and generalization of EGG signal reconstruction and F0 extraction by leveraging multi-scale frequency-domain and time-domain loss functions.
Method: Proposes a multi-scale frequency-domain loss function and time-domain correlation loss, removes the GAN discriminator, and uses reconstructed EGG signals for F0 extraction. A toy loss sketch follows the abstract below.
Result: Reduces MAE from 14.14 Hz to 13.69 Hz and improves VDE by 38.2%, outperforming state-of-the-art methods.
Conclusion: EGGCodec is a robust and efficient framework for EGG signal reconstruction and F0 extraction, validated by ablation studies.
Abstract: This letter introduces EGGCodec, a robust neural Encodec framework engineered for electroglottography (EGG) signal reconstruction and F0 extraction. We propose a multi-scale frequency-domain loss function to capture the nuanced relationship between original and reconstructed EGG signals, complemented by a time-domain correlation loss to improve generalization and accuracy. Unlike conventional Encodec models that extract F0 directly from features, EGGCodec leverages reconstructed EGG signals, which more closely correspond to F0. By removing the conventional GAN discriminator, we streamline EGGCodec’s training process without compromising efficiency, incurring only negligible performance degradation. Trained on a widely used EGG-inclusive dataset, extensive evaluations demonstrate that EGGCodec outperforms state-of-the-art F0 extraction schemes, reducing mean absolute error (MAE) from 14.14 Hz to 13.69 Hz, and improving voicing decision error (VDE) by 38.2%. Moreover, extensive ablation experiments validate the contribution of each component of EGGCodec.
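A simplified NumPy sketch of the loss design described above: L1 distances between magnitude spectra at several FFT sizes plus a time-domain correlation term. FFT sizes, windows, and weights are illustrative, not EGGCodec's actual configuration.
import numpy as np

def stft_mag(x: np.ndarray, n_fft: int, hop: int) -> np.ndarray:
    frames = [x[i:i + n_fft] for i in range(0, len(x) - n_fft, hop)]
    return np.abs(np.fft.rfft(np.stack(frames) * np.hanning(n_fft), axis=1))

def multiscale_loss(ref: np.ndarray, rec: np.ndarray) -> float:
    spec = sum(np.mean(np.abs(stft_mag(ref, n, n // 4) - stft_mag(rec, n, n // 4)))
               for n in (256, 512, 1024))            # multiple analysis scales
    corr = 1.0 - np.corrcoef(ref, rec)[0, 1]         # time-domain correlation term
    return spec + corr

t = np.linspace(0, 1, 16000)
ref = np.sin(2 * np.pi * 120 * t)                    # toy "EGG" waveform
rec = ref + 0.05 * np.random.default_rng(0).standard_normal(len(t))
print(round(multiscale_loss(ref, rec), 4))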
[409] LPGNet: A Lightweight Network with Parallel Attention and Gated Fusion for Multimodal Emotion Recognition
Zhining He, Yang Xiao
Main category: eess.AS
TL;DR: LPGNet is a lightweight, efficient model for emotion recognition in conversations (ERC), addressing high computational costs and speaker dependency issues of Transformer-based models. It uses parallel attention and gated fusion for multimodal inputs, achieving high accuracy without speaker embeddings.
Details
Motivation: Transformer-based models for ERC are computationally expensive and heavily rely on speaker information, limiting their real-world generalization. LPGNet aims to solve these issues.
Method: LPGNet employs a Lightweight Parallel Interaction Attention (LPIA) module for efficient multimodal relationship modeling and a dual-gated fusion method for dynamic feature combination. It eliminates speaker embeddings.
Result: LPGNet achieves over 87% accuracy and F1-score on the IEMOCAP dataset, outperforming baselines with fewer parameters and better generalization.
Conclusion: LPGNet is a lightweight, efficient, and speaker-independent solution for ERC, demonstrating superior performance and generalization.
Abstract: Emotion recognition in conversations (ERC) aims to predict the emotional state of each utterance by using multiple input types, such as text and audio. While Transformer-based models have shown strong performance in this task, they often face two major issues: high computational cost and heavy dependence on speaker information. These problems reduce their ability to generalize in real-world conversations. To solve these challenges, we propose LPGNet, a Lightweight network with Parallel attention and Gated fusion for multimodal ERC. The main part of LPGNet is the Lightweight Parallel Interaction Attention (LPIA) module. This module replaces traditional stacked Transformer layers with parallel dot-product attention, which can model both within-modality and between-modality relationships more efficiently. To improve emotional feature learning, LPGNet also uses a dual-gated fusion method. This method filters and combines features from different input types in a flexible and dynamic way. In addition, LPGNet removes speaker embeddings completely, which allows the model to work independently of speaker identity. Experiments on the IEMOCAP dataset show that LPGNet reaches over 87% accuracy and F1-score in 4-class emotion classification. It outperforms strong baseline models while using fewer parameters and showing better generalization across speakers.
[410] DeCRED: Decoder-Centric Regularization for Encoder-Decoder Based Speech Recognition
Alexander Polok, Santosh Kesiraju, Karel Beneš, Bolaji Yusuf, Lukáš Burget, Jan Černocký
Main category: eess.AS
TL;DR: The paper introduces DeCRED, a decoder-centric regularization method for encoder-decoder ASR models, improving robustness and generalization by adding auxiliary classifiers to the decoder. It reduces internal LM perplexity and achieves WER improvements in various test sets.
Details
Motivation: To enhance the robustness and generalization of encoder-decoder ASR models by regularizing the internal language model of the decoder.
Method: Proposes DeCRED, which adds auxiliary classifiers to the decoder for next token prediction via intermediate logits. A toy sketch follows the abstract below.
Result: DeCRED reduces internal LM perplexity by 36.6% and improves WER in multiple test sets, outperforming baselines and other methods like InterCTC.
Conclusion: DeCRED is effective, achieving competitive WERs with less training data and fewer parameters compared to models like OWSM v3.1 and Whisper-medium.
Abstract: This paper presents a simple yet effective regularization for the internal language model induced by the decoder in encoder-decoder ASR models, thereby improving robustness and generalization in both in- and out-of-domain settings. The proposed method, Decoder-Centric Regularization in Encoder-Decoder (DeCRED), adds auxiliary classifiers to the decoder, enabling next token prediction via intermediate logits. Empirically, DeCRED reduces the mean internal LM BPE perplexity by 36.6% (relative) across 11 test sets. Furthermore, this translates into actual WER improvements over the baseline in 5 of 7 in-domain and 3 of 4 out-of-domain test sets, reducing macro WER from 6.4% to 6.3% and 18.2% to 16.2%, respectively. On TEDLIUM3, DeCRED achieves 7.0% WER, surpassing the baseline and encoder-centric InterCTC regularization by 0.6% and 0.5%, respectively. Finally, we compare DeCRED with OWSM v3.1 and Whisper-medium, showing competitive WERs despite training on much less data with fewer parameters.
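A minimal PyTorch sketch of the core idea: the shared output projection is also applied to an intermediate decoder layer, and the auxiliary cross-entropy is added to the final one. The layer index and loss weighting below are assumptions for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab, dim, n_layers, aux_layer = 1000, 256, 6, 3
layers = nn.ModuleList(
    [nn.TransformerDecoderLayer(dim, nhead=4, batch_first=True)
     for _ in range(n_layers)])
out_proj = nn.Linear(dim, vocab)                     # shared output head

def decred_loss(tgt_emb, memory, labels, aux_weight=0.5):
    x, losses = tgt_emb, []
    for i, layer in enumerate(layers):
        x = layer(x, memory)
        if i + 1 in (aux_layer, n_layers):           # intermediate + final logits
            logits = out_proj(x)
            losses.append(F.cross_entropy(logits.transpose(1, 2), labels))
    return losses[-1] + aux_weight * losses[0]       # final CE + auxiliary CE

tgt = torch.randn(2, 10, dim)                        # toy decoder inputs
mem = torch.randn(2, 30, dim)                        # toy encoder outputs
labels = torch.randint(0, vocab, (2, 10))
print(decred_loss(tgt, mem, labels).item())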
[411] Listen through the Sound: Generative Speech Restoration Leveraging Acoustic Context Representation
Soo-Whan Chung, Min-Seok Choi
Main category: eess.AS
TL;DR: A novel speech restoration method using context-aware conditioning with diffusion-based UNIVERSE++ and refined CLAP embeddings (ACX) outperforms content-based approaches.
Details
Motivation: To improve speech restoration by leveraging contextual information (environmental attributes) rather than just linguistic/speaker attributes.
Method: Uses diffusion-based UNIVERSE++ with CLAP embeddings and proposes ACX to refine these embeddings for better distortion handling.
Result: Context-aware conditioning enhances restoration performance and stability across diverse distortions, reducing variability.
Conclusion: ACX-based context-aware conditioning is more effective than content-based methods for speech restoration.
Abstract: This paper introduces a novel approach to speech restoration by integrating a context-related conditioning strategy. Specifically, we employ the diffusion-based generative restoration model, UNIVERSE++, as a backbone to evaluate the effectiveness of contextual representations. We incorporate acoustic context embeddings extracted from the CLAP model, which capture the environmental attributes of input audio. Additionally, we propose an Acoustic Context (ACX) representation that refines CLAP embeddings to better handle various distortion factors and their intensity in speech signals. Unlike content-based approaches that rely on linguistic and speaker attributes, ACX provides contextual information that enables the restoration model to distinguish and mitigate distortions better. Experimental results indicate that context-aware conditioning improves both restoration performance and its stability across diverse distortion conditions, reducing variability compared to content-based methods.
[412] Selection of Layers from Self-supervised Learning Models for Predicting Mean-Opinion-Score of Speech
Xinyu Liang, Fredrik Cumlin, Victor Ungureanu, Chandan K. A. Reddy, Christian Schuldt, Saikat Chatterjee
Main category: eess.AS
TL;DR: Early-layer features in SSL models outperform last-layer features for speech quality assessment, improving performance and reducing complexity.
Details
Motivation: Prior SQA models focus on last-layer features, leaving intermediate layers underexplored despite their potential.
Method: Systematically evaluate layer-wise features of SSL models (Wav2Vec2, HuBERT, WavLM) using a lightweight regression network for MOS prediction. A toy layer-probing sketch follows the abstract below.
Result: Early-layer features consistently outperform or match last-layer features, surpassing conventional and state-of-the-art MOS prediction models.
Conclusion: Early-layer selection in SSL models enhances performance and reduces system complexity for SQA tasks.
Abstract: Self-supervised learning (SSL) models like Wav2Vec2, HuBERT, and WavLM have been widely used in speech processing. These transformer-based models consist of multiple layers, each capturing different levels of representation. While prior studies explored their layer-wise representations for efficiency and performance, speech quality assessment (SQA) models predominantly rely on last-layer features, leaving intermediate layers underexamined. In this work, we systematically evaluate different layers of multiple SSL models for predicting mean-opinion-score (MOS). Features from each layer are fed into a lightweight regression network to assess effectiveness. Our experiments consistently show that early-layer features outperform or match those from the last layer, leading to significant improvements over conventional approaches and state-of-the-art MOS prediction models. These findings highlight the advantages of early-layer selection, offering enhanced performance and reduced system complexity.
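The layer-wise probing recipe is easy to reproduce in sketch form: fit one lightweight regressor per layer's pooled features and compare validation error. The random features and toy MOS target below merely stand in for real SSL activations and listener scores.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_utts, n_layers, dim = 200, 12, 64
feats = rng.standard_normal((n_layers, n_utts, dim))    # per-layer features
mos = 3.0 + feats[2, :, :5].sum(axis=1) * 0.1           # toy target: layer 2 is informative

for layer in range(n_layers):
    score = cross_val_score(Ridge(alpha=1.0), feats[layer], mos,
                            scoring="neg_mean_squared_error", cv=5).mean()
    print(f"layer {layer:2d}  MSE={-score:.4f}")        # lowest MSE picks the layer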
[413] TurboBias: Universal ASR Context-Biasing powered by GPU-accelerated Phrase-Boosting Tree
Andrei Andrusenko, Vladimir Bataev, Lilit Grigoryan, Vitaly Lavrukhin, Boris Ginsburg
Main category: eess.AS
TL;DR: A universal ASR context-biasing framework is proposed, supporting CTC, Transducers, and Attention Encoder-Decoder models, with GPU-accelerated word boosting for efficient decoding.
Details
Motivation: Existing context-biasing approaches require additional training, slow decoding, or limit ASR system types, prompting a need for a more versatile solution.
Method: The framework uses a GPU-accelerated word boosting tree for shallow fusion with greedy and beam search decoding, handling up to 20K key phrases efficiently. A toy trie sketch follows the abstract below.
Result: The method outperforms open-source context-biasing approaches in accuracy and decoding speed, even with large key phrase sets.
Conclusion: The proposed framework is effective, efficient, and open-sourced in the NeMo toolkit, addressing limitations of existing methods.
Abstract: Recognizing specific key phrases is an essential task for contextualized Automatic Speech Recognition (ASR). However, most existing context-biasing approaches have limitations: they require additional model training, significantly slow down the decoding process, or constrain the choice of the ASR system type. This paper proposes a universal ASR context-biasing framework that supports all major types: CTC, Transducers, and Attention Encoder-Decoder models. The framework is based on a GPU-accelerated word boosting tree, which enables it to be used in shallow fusion mode for greedy and beam search decoding without noticeable speed degradation, even with a vast number of key phrases (up to 20K items). The obtained results show the high efficiency of the proposed method, surpassing the considered open-source context-biasing approaches in accuracy and decoding speed. Our context-biasing framework is open-sourced as part of the NeMo toolkit.
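A pure-Python toy of the phrase-boosting idea (the paper's tree is GPU-accelerated): a trie over key phrases is walked per hypothesis, and tokens that extend a phrase receive a score boost. The boost value and reset behavior are simplified assumptions.
BOOST = 2.0

def build_trie(phrases):
    root = {}
    for phrase in phrases:
        node = root
        for tok in phrase.split():
            node = node.setdefault(tok, {})
        node["<end>"] = True
    return root

def boosted_score(base_score: float, state: dict, token: str, root: dict):
    """Return (new score, next trie state) for one hypothesis extension."""
    if token in state:                      # phrase continues (or starts)
        return base_score + BOOST, state[token]
    if token in root:                       # phrase could start fresh here
        return base_score + BOOST, root[token]
    return base_score, root                 # no match: reset to the root

trie = build_trie(["neural network", "speech recognition"])
score, st = boosted_score(-1.2, trie, "speech", trie)
score, st = boosted_score(score, st, "recognition", trie)
print(score, "<end>" in st)                 # boosted twice; full phrase matched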
[414] XEmoRAG: Cross-Lingual Emotion Transfer with Controllable Intensity Using Retrieval-Augmented Generation
Tianlun Zuo, Jingbin Hu, Yuke Li, Xinfa Zhu, Hai Li, Ying Yan, Junhui Liu, Danming Xie, Lei Xie
Main category: eess.AS
TL;DR: XEmoRAG enables zero-shot emotion transfer from Chinese to Thai speech without parallel data, using LLM-based embeddings and flow-matching alignment for natural prosody.
Details
Motivation: Addressing challenges in cross-lingual emotion transfer due to lack of parallel data, foreign accents, and language-specific prosody.
Method: Extracts language-agnostic emotional embeddings from Chinese speech, retrieves matched Thai utterances, and uses flow-matching alignment for prosody.
Result: Synthesizes expressive Thai speech from Chinese references without explicit labels, preserving speaker and emotion.
Conclusion: XEmoRAG offers flexible, low-resource emotion transfer across languages, demonstrated effectively in Chinese-to-Thai synthesis.
Abstract: Zero-shot emotion transfer in cross-lingual speech synthesis refers to generating speech in a target language, where the emotion is expressed based on reference speech from a different source language. However, this task remains challenging due to the scarcity of parallel multilingual emotional corpora, the presence of foreign accent artifacts, and the difficulty of separating emotion from language-specific prosodic features. In this paper, we propose XEmoRAG, a novel framework to enable zero-shot emotion transfer from Chinese to Thai using a large language model (LLM)-based model, without relying on parallel emotional data. XEmoRAG extracts language-agnostic emotional embeddings from Chinese speech and retrieves emotionally matched Thai utterances from a curated emotional database, enabling controllable emotion transfer without explicit emotion labels. Additionally, a flow-matching alignment module minimizes pitch and duration mismatches, ensuring natural prosody. It also blends Chinese timbre into the Thai synthesis, enhancing rhythmic accuracy and emotional expression, while preserving speaker characteristics and emotional consistency. Experimental results show that XEmoRAG synthesizes expressive and natural Thai speech using only Chinese reference audio, without requiring explicit emotion labels. These results highlight XEmoRAG’s capability to achieve flexible and low-resource emotional transfer across languages. Our demo is available at https://tlzuo-lesley.github.io/Demo-page/ .
eess.IV
[415] Variational volume reconstruction with the Deep Ritz Method
Conor Rowan, Sumedh Soman, John A. Evans
Main category: eess.IV
TL;DR: A novel variational volume reconstruction method using the Deep Ritz method addresses challenges in sparse, noisy slice data, avoiding segmentation and reducing computational costs.
Details
Motivation: The method is motivated by biomedical imaging needs, particularly MRI-based slice-to-volume reconstruction, tackling issues like noisy data, limited slices, and high computational costs.
Method: Combines a regression loss for direct noisy data handling with a modified Cahn-Hilliard energy for regularization, using neural networks and Monte Carlo integration with ADAM optimization. A toy Deep Ritz sketch follows the abstract below.
Result: Produces high-quality reconstructed volumes quickly, even with sparse and noisy data.
Conclusion: The approach effectively overcomes key challenges in variational volume reconstruction, offering a fast and reliable solution.
Abstract: We present a novel approach to variational volume reconstruction from sparse, noisy slice data using the Deep Ritz method. Motivated by biomedical imaging applications such as MRI-based slice-to-volume reconstruction (SVR), our approach addresses three key challenges: (i) the reliance on image segmentation to extract boundaries from noisy grayscale slice images, (ii) the need to reconstruct volumes from a limited number of slice planes, and (iii) the computational expense of traditional mesh-based methods. We formulate a variational objective that combines a regression loss designed to avoid image segmentation by operating on noisy slice data directly with a modified Cahn-Hilliard energy incorporating anisotropic diffusion to regularize the reconstructed geometry. We discretize the phase field with a neural network, approximate the objective at each optimization step with Monte Carlo integration, and use ADAM to find the minimum of the approximated variational objective. While the stochastic integration may not yield the true solution to the variational problem, we demonstrate that our method reliably produces high-quality reconstructed volumes in a matter of seconds, even when the slice data is sparse and noisy.
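A compact PyTorch sketch of the Deep Ritz pattern used here: a neural field is trained by Monte Carlo estimates of a variational objective combining a regression term on noisy slice samples with a regularizer. A plain gradient penalty stands in for the paper's anisotropic Cahn-Hilliard energy.
import torch

phi = torch.nn.Sequential(torch.nn.Linear(3, 64), torch.nn.Tanh(),
                          torch.nn.Linear(64, 64), torch.nn.Tanh(),
                          torch.nn.Linear(64, 1))
opt = torch.optim.Adam(phi.parameters(), lr=1e-3)

# Toy slice data: noisy indicator of a sphere, sampled on the z=0 plane.
pts = torch.rand(512, 3) * 2 - 1
pts[:, 2] = 0.0
target = ((pts.norm(dim=1) < 0.5).float() + 0.1 * torch.randn(512)).unsqueeze(1)

for step in range(200):
    opt.zero_grad()
    data_loss = ((phi(pts) - target) ** 2).mean()          # fit noisy slices directly
    x = (torch.rand(512, 3) * 2 - 1).requires_grad_(True)  # MC volume samples
    grad = torch.autograd.grad(phi(x).sum(), x, create_graph=True)[0]
    reg = (grad ** 2).sum(dim=1).mean()                    # smoothness energy (stand-in)
    (data_loss + 1e-3 * reg).backward()
    opt.step()
print(float(phi(torch.zeros(1, 3))))   # field value at the origin (inside the sphere)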
[416] Preprocessing Algorithm Leveraging Geometric Modeling for Scale Correction in Hyperspectral Images for Improved Unmixing Performance
Praveen Sumanasekara, Athulya Ratnayake, Buddhi Wijenayake, Keshawa Ratnayake, Roshan Godaliyadda, Parakrama Ekanayake, Vijitha Herath
Main category: eess.IV
TL;DR: A novel preprocessing algorithm corrects scale-induced spectral variability in hyperspectral unmixing, improving accuracy and convergence by isolating multiplicative effects.
Details
Motivation: Large-scale spectral variability due to factors like topography and illumination degrades unmixing performance, necessitating a preprocessing solution.
Method: The algorithm isolates and compensates for multiplicative scale effects, providing a cleaner input for unmixing methods.
Result: Extensive validation shows error reductions up to 50%, improving performance across state-of-the-art unmixing algorithms.
Conclusion: The algorithm is a complementary step for accurate unmixing, with potential as a key component in hyperspectral pipelines.
Abstract: Spectral variability significantly impacts the accuracy and convergence of hyperspectral unmixing algorithms. While many methods address complex spectral variability, large-scale variations in spectral signature scale caused by factors such as topography, illumination, and shadowing remain a major challenge. These variations often degrade unmixing performance and complicate model fitting. In this paper, we propose a novel preprocessing algorithm that corrects scale-induced spectral variability prior to unmixing. By isolating and compensating for these large-scale multiplicative effects, the algorithm provides a cleaner input, enabling unmixing methods to focus more effectively on modeling nonlinear spectral variability and abundance estimation. We present a rigorous mathematical framework to describe scale variability and extensive experimental validation of the proposed algorithm. Furthermore, the algorithm’s impact is evaluated across a broad spectrum of state-of-the-art unmixing algorithms on two synthetic and two real hyperspectral datasets. The proposed preprocessing step consistently improves the performance of these algorithms, including those specifically designed to handle spectral variability, with error reductions close to 50% in many cases. This demonstrates that scale correction acts as a complementary step, facilitating more accurate unmixing by existing methods. The algorithm’s generality and significant impact highlight its potential as a key component in practical hyperspectral unmixing pipelines. The implementation code will be made publicly available upon publication.
[417] Frequency-Assisted Adaptive Sharpening Scheme Considering Bitrate and Quality Tradeoff
Yingxue Pang, Shijie Zhao, Haiqiang Wang, Gen Zhan, Junlin Li, Li Zhang
Main category: eess.IV
TL;DR: The paper proposes FreqSP, a model to predict the optimal sharpening level for videos, balancing quality and bitrate.
Details
Motivation: Sharpening improves video quality but can increase bitrate and cause over-sharpening. A method to balance these trade-offs is needed.
Method: FreqSP uses CNN features and high-frequency components to estimate the optimal sharpening level from uncompressed videos.
Result: Experiments show FreqSP effectively predicts the optimal sharpening level, improving quality while controlling bitrate.
Conclusion: FreqSP successfully addresses the trade-off between video sharpening quality and bitrate costs.
Abstract: Sharpening is a widely adopted technique to improve video quality, which can effectively emphasize textures and alleviate blurring. However, increasing the sharpening level comes with a higher video bitrate, resulting in degraded Quality of Service (QoS). Furthermore, the video quality does not necessarily improve with increasing sharpening levels, leading to issues such as over-sharpening. Clearly, it is essential to figure out how to boost video quality with a proper sharpening level while also controlling bandwidth costs effectively. This paper thus proposes a novel Frequency-assisted Sharpening level Prediction model (FreqSP). We first label each video with the sharpening level correlating to the optimal bitrate and quality tradeoff as ground truth. Then taking uncompressed source videos as inputs, the proposed FreqSP leverages intricate CNN features and high-frequency components to estimate the optimal sharpening level. Extensive experiments demonstrate the effectiveness of our method.
[418] A new dataset and comparison for multi-camera frame synthesis
Conall Daly, Anil Kokaram
Main category: eess.IV
TL;DR: A new multi-camera dataset is introduced to fairly compare frame interpolation and view synthesis methods, revealing performance differences between classical, deep learning, and 3D Gaussian Splatting approaches on real and synthetic data.
Details
Motivation: Existing datasets for frame interpolation and view synthesis are biased, making direct comparison challenging. A fair evaluation requires a neutral dataset.
Method: A custom-built dense linear camera array is used to create a novel dataset. Classical and deep learning frame interpolators are compared with 3D Gaussian Splatting for view in-betweening.
Result: Deep learning methods don’t significantly outperform classical methods on real data, while 3D Gaussian Splatting underperforms. On synthetic data, 3D Gaussian Splatting excels.
Conclusion: The dataset enables fair comparison, showing context-dependent performance of methods, with 3D Gaussian Splatting better for synthetic scenes and frame interpolators for real data.
Abstract: Many methods exist for frame synthesis in image sequences but can be broadly categorised into frame interpolation and view synthesis techniques. Fundamentally, both frame interpolation and view synthesis tackle the same task, interpolating a frame given surrounding frames in time or space. However, most frame interpolation datasets focus on temporal aspects with single cameras moving through time and space, while view synthesis datasets are typically biased toward stereoscopic depth estimation use cases. This makes direct comparison between view synthesis and frame interpolation methods challenging. In this paper, we develop a novel multi-camera dataset using a custom-built dense linear camera array to enable fair comparison between these approaches. We evaluate classical and deep learning frame interpolators against a view synthesis method (3D Gaussian Splatting) for the task of view in-betweening. Our results reveal that deep learning methods do not significantly outperform classical methods on real image data, with 3D Gaussian Splatting actually underperforming frame interpolators by as much as 3.5 dB PSNR. However, in synthetic scenes, the situation reverses – 3D Gaussian Splatting outperforms frame interpolation algorithms by almost 5 dB PSNR at a 95% confidence level.
[419] Efficient motion-based metrics for video frame interpolation
Conall Daly, Darren Ramsook, Anil Kokaram
Main category: eess.IV
TL;DR: The paper explores simple motion field processing methods for assessing the perceptual quality of video frame interpolation (VFI) and proposes a divergence-based motion metric. This metric correlates reasonably with perceptual scores and is computationally efficient.
Details
Motivation: Assessing the perceptual quality of interpolated video frames is an ongoing challenge, and existing metrics like PSNR or SSIM may not align well with human perception.
Method: The authors investigate motion field processing techniques and propose a divergence-based motion metric, evaluated using the BVI-VFI dataset. A toy divergence sketch follows the abstract below.
Result: The proposed metric shows reasonable correlation with perceptual scores (PLCC=0.51) and is computationally faster (x2.7 speedup) than FloLPIPS. It also favors perceptually pleasing frames over those with high PSNR/SSIM scores.
Conclusion: The divergence-based motion metric is a promising, efficient tool for evaluating VFI algorithms, aligning better with human perception than traditional metrics.
Abstract: Video frame interpolation (VFI) offers a way to generate intermediate frames between consecutive frames of a video sequence. Although the development of advanced frame interpolation algorithms has received increased attention in recent years, assessing the perceptual quality of interpolated content remains an ongoing area of research. In this paper, we investigate simple ways to process motion fields, with the purpose of using them as video quality metrics for evaluating frame interpolation algorithms. We evaluate these quality metrics using the BVI-VFI dataset, which contains perceptual scores measured for interpolated sequences. From our investigation we propose a motion metric based on measuring the divergence of motion fields. This metric correlates reasonably with these perceptual scores (PLCC=0.51) and is more computationally efficient (x2.7 speedup) compared to FloLPIPS (a well-known motion-based metric). We then use our newly proposed metrics to evaluate a range of state-of-the-art frame interpolation algorithms and find that our metrics tend to favour more perceptually pleasing interpolated frames that may not score highly in terms of PSNR or SSIM.
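The divergence metric is simple enough to sketch directly: given a dense motion field, compute the divergence with finite differences and summarize its magnitude. The pooling below (mean absolute divergence) is an illustrative choice, not the exact metric.
import numpy as np

def flow_divergence_score(flow: np.ndarray) -> float:
    u, v = flow[..., 0], flow[..., 1]   # flow has shape (H, W, 2)
    du_dx = np.gradient(u, axis=1)      # horizontal derivative of x-motion
    dv_dy = np.gradient(v, axis=0)      # vertical derivative of y-motion
    return float(np.mean(np.abs(du_dx + dv_dy)))

smooth = np.dstack([np.ones((64, 64)), np.zeros((64, 64))])   # pure translation
noisy = smooth + 0.5 * np.random.default_rng(0).standard_normal(smooth.shape)
print(flow_divergence_score(smooth), flow_divergence_score(noisy))  # ~0 vs. large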
[420] A Data-driven Loss Weighting Scheme across Heterogeneous Tasks for Image Denoising
Xiangyu Rui, Xiangyong Cao, Xile Zhao, Deyu Meng, Michael K. Ng
Main category: eess.IV
TL;DR: A data-driven loss weighting (DLW) scheme is proposed to enhance variational denoising models by dynamically balancing data fidelity and regularization terms using a neural network-trained weight function.
Details
Motivation: The challenge of assigning optimal weights in variational denoising models for complex noise patterns (e.g., impulse or mixed noise) and balancing fidelity with regularization terms motivates this work.
Method: DLW employs a neural network to predict weights for noisy images, trained via a bilevel optimization framework: the lower level solves denoising models, and the upper level minimizes the restored-to-clean image distance.
Result: DLW significantly improves denoising performance for complex noise patterns and demonstrates transferability to heterogeneous tasks beyond training data.
Conclusion: DLW effectively balances noise removal and regularization, generalizes well to unseen noise patterns, and offers practical implementation for denoising models.
Abstract: In a variational denoising model, the weight in the data fidelity term plays the role of enhancing the noise-removal capability. It is profoundly correlated with noise information, while also balancing the data fidelity and regularization terms. However, assigning the weight becomes substantially harder when the noise pattern goes beyond independent and identically distributed Gaussian noise, e.g., impulse noise, stripe noise, or a mixture of several patterns. Furthermore, how to leverage the weight to balance the data fidelity and regularization terms is even less evident. In this work, we propose a data-driven loss weighting (DLW) scheme to address these issues. Specifically, DLW trains a parameterized weight function (i.e., a neural network) that maps the noisy image to the weight. The training is achieved by a bilevel optimization framework, where the lower-level problem solves several denoising models with the same weight predicted by the weight function and the upper-level problem minimizes the distance between the restored image and the clean image. In this way, information from both the noise and the regularization can be efficiently extracted to determine the weight function. DLW also facilitates the easy implementation of a trained weight function on denoising models. Numerical results verify the remarkable performance of DLW in improving the ability of various variational denoising models to handle different complex noise. This implies that DLW can transfer noise knowledge at the model level to heterogeneous tasks beyond the training ones. The generalization theory underlying DLW is also studied, validating its intrinsic transferability.
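The bilevel structure can be made concrete with a short sketch: a small network predicts a per-pixel fidelity weight, an unrolled inner solver stands in for the lower-level denoising problem, and the outer loss compares the restoration to the clean image. The unrolled gradient-descent solver and the TV-like regularizer below are assumptions for illustration; the paper's actual lower-level models may differ.

```python
# A minimal sketch of the DLW idea under the stated assumptions.
import torch
import torch.nn as nn

weight_net = nn.Sequential(  # maps a noisy image to a positive weight map
    nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 1, 3, padding=1), nn.Softplus(),
)

def tv_grad(x: torch.Tensor) -> torch.Tensor:
    # Gradient of a smooth total-variation-like regularizer (illustrative).
    dx = x - torch.roll(x, 1, dims=-1)
    dy = x - torch.roll(x, 1, dims=-2)
    return (dx - torch.roll(dx, -1, dims=-1)) + (dy - torch.roll(dy, -1, dims=-2))

def inner_solve(noisy: torch.Tensor, w: torch.Tensor, steps: int = 10,
                lr: float = 0.2, lam: float = 0.1) -> torch.Tensor:
    # Lower level: minimize w * ||x - noisy||^2 + lam * TV(x) by unrolled
    # descent, keeping the graph so gradients flow back into weight_net.
    x = noisy.clone()
    for _ in range(steps):
        x = x - lr * (2 * w * (x - noisy) + lam * tv_grad(x))
    return x

opt = torch.optim.Adam(weight_net.parameters(), lr=1e-3)
clean = torch.rand(4, 1, 32, 32)
noisy = clean + 0.1 * torch.randn_like(clean)

opt.zero_grad()
w = weight_net(noisy)                          # predict fidelity weights
restored = inner_solve(noisy, w)               # lower-level solve
outer_loss = ((restored - clean) ** 2).mean()  # upper-level objective
outer_loss.backward()
opt.step()
print(f"outer loss: {outer_loss.item():.4f}")
```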
[421] Style transfer between Microscopy and Magnetic Resonance Imaging via Generative Adversarial Network in small sample size settings
Monika Pytlarz, Adrian Onicas, Alessandro Crimi
Main category: eess.IV
TL;DR: A method using cGAN to generate microscopic histological images from MRI scans of the human corpus callosum, aiming to avoid invasive biopsies.
Details
Motivation: To enable histopathological analysis without invasive biopsies by augmenting MRI with microscopic imaging.
Method: Conditional generative adversarial network (cGAN) trained on paired MRI and microscopy images.
Result: The framework reliably synthesizes histology images from MRI scans, even with high-resolution histologies and lower-resolution MRI.
Conclusion: The tool shows promise for non-invasive histopathological analysis and educational purposes.
Abstract: Cross-modal augmentation of Magnetic Resonance Imaging (MRI) and microscopic imaging based on the same tissue samples is promising because it can allow histopathological analysis in the absence of an invasive biopsy procedure. Here, we tested a method for generating microscopic histological images from MRI scans of the human corpus callosum using a conditional generative adversarial network (cGAN) architecture. To our knowledge, this is the first multimodal translation of brain MRI to a histological volumetric representation of the same sample. The technique was assessed by training paired image-translation models on sets of images from MRI scans and microscopy. Using a cGAN for this purpose is challenging because microscopy images are large and sample availability is typically low. The current work demonstrates that the framework reliably synthesizes histology images from MRI scans of the corpus callosum, emphasizing the network's ability to train on high-resolution histologies paired with relatively lower-resolution MRI scans. While the ultimate goal is to avoid biopsies, the proposed tool can also be used for educational purposes.
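Paired cGAN translation of this kind typically optimizes an adversarial term plus an L1 reconstruction term (pix2pix-style). The sketch below illustrates that objective with placeholder networks; the paper's actual architecture and loss weights are not specified in this summary.

```python
# A minimal sketch of a paired conditional-GAN objective (pix2pix-style).
# The generator and discriminator here are toy placeholders.
import torch
import torch.nn as nn

G = nn.Sequential(nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
                  nn.Conv2d(16, 3, 3, padding=1), nn.Tanh())       # MRI -> histology
D = nn.Sequential(nn.Conv2d(1 + 3, 16, 3, stride=2, padding=1), nn.LeakyReLU(0.2),
                  nn.Conv2d(16, 1, 3, padding=1))                  # patch critic

bce, l1, lam = nn.BCEWithLogitsLoss(), nn.L1Loss(), 100.0
mri = torch.rand(2, 1, 64, 64)     # stand-in paired batch
histo = torch.rand(2, 3, 64, 64)

fake = G(mri)
# The discriminator sees (condition, output) pairs, as in conditional GANs.
d_real = D(torch.cat([mri, histo], dim=1))
d_fake = D(torch.cat([mri, fake.detach()], dim=1))
d_loss = bce(d_real, torch.ones_like(d_real)) + bce(d_fake, torch.zeros_like(d_fake))

g_adv = D(torch.cat([mri, fake], dim=1))
g_loss = bce(g_adv, torch.ones_like(g_adv)) + lam * l1(fake, histo)
print(f"D loss: {d_loss.item():.3f}, G loss: {g_loss.item():.3f}")
```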
[422] SharpXR: Structure-Aware Denoising for Pediatric Chest X-Rays
Ilerioluwakiiye Abolade, Emmanuel Idoko, Solomon Odelola, Promise Omoigui, Adetola Adebanwo, Aondana Iorumbur, Udunna Anazodo, Alessandro Crimi, Raymond Confidence
Main category: eess.IV
TL;DR: SharpXR is a structure-aware dual-decoder U-Net for denoising low-dose pediatric X-rays, preserving diagnostic details and improving pneumonia classification accuracy.
Details
Motivation: Pediatric chest X-rays are crucial for early diagnosis in low-resource settings, but low-dose protocols introduce noise that obscures details. Conventional denoising methods degrade fine features.
Method: SharpXR uses a Laplacian-guided edge-preserving decoder and a learnable fusion module to balance noise suppression and detail retention. Training involves simulated Poisson-Gaussian noise on a pediatric X-ray dataset.
Result: SharpXR outperforms state-of-the-art methods, improving pneumonia classification accuracy from 88.8% to 92.5% while remaining computationally efficient.
Conclusion: SharpXR is effective for denoising pediatric X-rays, enhancing diagnostic accuracy in resource-constrained settings.
Abstract: Pediatric chest X-ray imaging is essential for early diagnosis, particularly in low-resource settings where advanced imaging modalities are often inaccessible. Low-dose protocols reduce radiation exposure in children but introduce substantial noise that can obscure critical anatomical details. Conventional denoising methods often degrade fine details, compromising diagnostic accuracy. In this paper, we present SharpXR, a structure-aware dual-decoder U-Net designed to denoise low-dose pediatric X-rays while preserving diagnostically relevant features. SharpXR combines a Laplacian-guided edge-preserving decoder with a learnable fusion module that adaptively balances noise suppression and structural detail retention. To address the scarcity of paired training data, we simulate realistic Poisson-Gaussian noise on the Pediatric Pneumonia Chest X-ray dataset. SharpXR outperforms state-of-the-art baselines across all evaluation metrics while maintaining computational efficiency suitable for resource-constrained settings. SharpXR-denoised images improved downstream pneumonia classification accuracy from 88.8% to 92.5%, underscoring its diagnostic value in low-resource pediatric care.
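The paired-data strategy rests on simulating Poisson-Gaussian noise on clean images. A common formulation combines signal-dependent shot noise with additive Gaussian read noise, as in the minimal sketch below; the parameter values are illustrative, not the paper's.

```python
# A minimal sketch of Poisson-Gaussian noise simulation for paired training data.
import numpy as np

def poisson_gaussian(clean: np.ndarray, photons: float = 100.0,
                     sigma: float = 0.02, rng=None) -> np.ndarray:
    """clean: float image in [0, 1]. photons: expected photon count at unit
    intensity (controls shot noise). sigma: read-noise std (Gaussian)."""
    rng = rng or np.random.default_rng()
    shot = rng.poisson(clean * photons) / photons    # signal-dependent component
    read = rng.normal(0.0, sigma, size=clean.shape)  # signal-independent component
    return np.clip(shot + read, 0.0, 1.0)

clean = np.random.rand(256, 256)   # stand-in for a pediatric chest X-ray
noisy = poisson_gaussian(clean)
print(f"noise std: {np.std(noisy - clean):.4f}")
```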
[423] VisionUnite: A Vision-Language Foundation Model for Ophthalmology Enhanced with Clinical Knowledge
Zihan Li, Diping Song, Zefeng Yang, Deming Wang, Fei Li, Xiulan Zhang, Paul E. Kinahan, Yu Qiao
Main category: eess.IV
TL;DR: VisionUnite is a vision-language foundation model for ophthalmology, trained on extensive datasets, outperforming existing models like GPT-4V and matching junior ophthalmologists in diagnostics.
Details
Motivation: Addressing the lack of diagnostic tools in underdeveloped regions with limited specialist access.
Method: Pretrained on 1.24M image-text pairs and refined with the MMFundus dataset (296K fundus images, 889K simulated dialogues).
Result: Outperforms GPT-4V and Gemini Pro, matches junior ophthalmologists in diagnostics, and excels in multi-disease diagnosis, explanations, and patient interaction.
Conclusion: VisionUnite advances ophthalmology diagnostics, education, and disease understanding, with potential for broad clinical and educational impact.
Abstract: The need for improved diagnostic methods in ophthalmology is acute, especially in underdeveloped regions with limited access to specialists and advanced equipment. Therefore, we introduce VisionUnite, a novel vision-language foundation model for ophthalmology enhanced with clinical knowledge. VisionUnite has been pretrained on an extensive dataset comprising 1.24 million image-text pairs, and further refined using our proposed MMFundus dataset, which includes 296,379 high-quality fundus image-text pairs and 889,137 simulated doctor-patient dialogue instances. Our experiments indicate that VisionUnite outperforms existing generative foundation models such as GPT-4V and Gemini Pro. It also demonstrates diagnostic capabilities comparable to junior ophthalmologists. VisionUnite performs well in various clinical scenarios including open-ended multi-disease diagnosis, clinical explanation, and patient interaction, making it a highly versatile tool for initial ophthalmic disease screening. VisionUnite can also serve as an educational aid for junior ophthalmologists, accelerating their acquisition of knowledge regarding both common and underrepresented ophthalmic conditions. VisionUnite represents a significant advancement in ophthalmology, with broad implications for diagnostics, medical education, and understanding of disease mechanisms. The source code is at https://github.com/HUANGLIZI/VisionUnite.
[424] Automated Muscle and Fat Segmentation in Computed Tomography for Comprehensive Body Composition Analysis
Yaqian Chen, Hanxue Gu, Yuwen Chen, Jichen Yang, Haoyu Dong, Joseph Y. Cao, Adrian Camarena, Christopher Mantyh, Roy Colglazier, Maciej A. Mazurowski
Main category: eess.IV
TL;DR: A publicly accessible, end-to-end model for CT body composition analysis is introduced, offering segmentation and feature calculation for skeletal muscle, SAT, and VAT, with high accuracy and broad clinical applicability.
Details
Motivation: The lack of consistent, publicly available tools for CT-based body composition analysis across various clinical applications motivated the development of this model.
Method: The model segments skeletal muscle, SAT, and VAT in axial CT images, calculates body composition metrics, and supports 2D/3D assessments. It was evaluated on internal and external datasets.
Result: High segmentation accuracy (Dice coefficients >89%) and low errors (MRAEs <10%) were achieved, outperforming benchmarks. Muscular fat segmentation was also possible (Dice 56.27%).
Conclusion: The model effectively addresses the gap in publicly available tools for CT body composition analysis, demonstrating high accuracy and utility across diverse clinical applications.
Abstract: Body composition assessment using CT images can potentially be used for a number of clinical applications, including the prognostication of cardiovascular outcomes, evaluation of metabolic health, monitoring of disease progression, assessment of nutritional status, prediction of treatment response in oncology, and risk stratification for surgical and critical care outcomes. While multiple groups have developed in-house segmentation tools for this analysis, there are very few publicly available tools that can be consistently used across different applications. To mitigate this gap, we present a publicly accessible, end-to-end segmentation and feature calculation model specifically for CT body composition analysis. Our model performs segmentation of skeletal muscle, subcutaneous adipose tissue (SAT), and visceral adipose tissue (VAT) across the chest, abdomen, and pelvis in axial CT images. It also provides various body composition metrics, including muscle density, visceral-to-subcutaneous fat (VAT/SAT) ratio, muscle area/volume, and skeletal muscle index (SMI), supporting both 2D and 3D assessments. To evaluate the model, the segmentation was applied to both internal and external datasets, with body composition metrics analyzed across different age, sex, and race groups. The model achieved high Dice coefficients on both internal and external datasets, exceeding 89% for skeletal muscle, SAT, and VAT segmentation. It outperforms the benchmark by 2.40% on skeletal muscle and 10.26% on SAT compared to the manual annotations provided with the publicly available dataset. Body composition metrics show mean relative absolute errors (MRAEs) under 10% for all measures. Furthermore, the model provides muscular fat segmentation with a Dice coefficient of 56.27%, which can be utilized for additional analyses as needed.
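The reported metrics follow directly from segmentation masks and voxel spacing. The sketch below illustrates how a VAT/SAT ratio and an SMI might be computed; the label values, spacing, and the L3-slice convention for SMI are assumptions for illustration, not the model's actual conventions.

```python
# A minimal sketch of body-composition metrics from a segmentation mask.
import numpy as np

MUSCLE, SAT, VAT = 1, 2, 3                     # hypothetical label values
seg = np.random.randint(0, 4, (40, 256, 256))  # stand-in (slices, H, W) mask
spacing_mm = (5.0, 0.8, 0.8)                   # (slice thickness, row, col) in mm

voxel_ml = np.prod(spacing_mm) / 1000.0        # mm^3 -> mL per voxel
vat_vol = np.sum(seg == VAT) * voxel_ml
sat_vol = np.sum(seg == SAT) * voxel_ml
print(f"VAT/SAT ratio: {vat_vol / sat_vol:.2f}")

# Skeletal muscle index: muscle area on the L3-level slice (cm^2) divided by
# height squared (m^2) -- a common 2D convention.
l3_slice = seg[20]                             # hypothetical L3 slice index
pixel_cm2 = (spacing_mm[1] * spacing_mm[2]) / 100.0
muscle_area_cm2 = np.sum(l3_slice == MUSCLE) * pixel_cm2
height_m = 1.70
print(f"SMI: {muscle_area_cm2 / height_m ** 2:.1f} cm^2/m^2")
```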
[425] PC-SRGAN: Physically Consistent Super-Resolution Generative Adversarial Network for General Transient Simulations
Md Rakibul Hasan, Pouria Behnoudfar, Dan MacKinlay, Thomas Poulet
Main category: eess.IV
TL;DR: PC-SRGAN improves image resolution with physical consistency, outperforming traditional SR methods in metrics like PSNR and SSIM, even with limited data.
Details
Motivation: Address the lack of physical meaningfulness in GAN-generated images for scientific applications.
Method: PC-SRGAN ensures physical consistency by incorporating numerically justified time integrators and advanced quality metrics.
Result: Achieves better PSNR and SSIM than conventional methods, requiring only 13% of training data for comparable performance to SRGAN.
Conclusion: PC-SRGAN enhances scientific machine learning by improving accuracy, efficiency, and interpretability, with potential as a surrogate model for time-dependent problems.
Abstract: Machine Learning, particularly Generative Adversarial Networks (GANs), has revolutionised Super-Resolution (SR). However, generated images often lack physical meaningfulness, which is essential for scientific applications. Our approach, PC-SRGAN, enhances image resolution while ensuring physical consistency for interpretable simulations. PC-SRGAN significantly improves both the Peak Signal-to-Noise Ratio and the Structural Similarity Index Measure compared to conventional SR methods, even with limited training data (e.g., only 13% of training data is required to achieve performance similar to SRGAN). Beyond SR, PC-SRGAN augments physically meaningful machine learning, incorporating numerically justified time integrators and advanced quality metrics. These advancements promise reliable and causal machine-learning models in scientific domains. A significant advantage of PC-SRGAN over conventional SR techniques is its physical consistency, which makes it a viable surrogate model for time-dependent problems. PC-SRGAN advances scientific machine learning by improving accuracy and efficiency, enhancing process understanding, and broadening applications to scientific research. We publicly release the complete source code of PC-SRGAN and all experiments at https://github.com/hasan-rakibul/PC-SRGAN.
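One way such physical consistency can enter training is as a penalty on the residual of a time integrator applied to the super-resolved fields. The abstract names numerically justified time integrators but not the specific PDEs, so the diffusion equation and implicit-midpoint residual in this heavily hedged sketch are illustrative stand-ins, not the paper's formulation.

```python
# A minimal sketch of a physics-consistency term added to an SR loss,
# under the stated assumptions (diffusion PDE, implicit-midpoint residual).
import torch

def laplacian(u: torch.Tensor) -> torch.Tensor:
    # 5-point stencil with periodic boundaries (illustrative).
    return (torch.roll(u, 1, -1) + torch.roll(u, -1, -1) +
            torch.roll(u, 1, -2) + torch.roll(u, -1, -2) - 4 * u)

def physics_residual(u_prev, u_next, dt=0.01, nu=0.1):
    # Implicit-midpoint residual for u_t = nu * Laplacian(u):
    # (u_next - u_prev)/dt - nu * Laplacian((u_prev + u_next)/2) should vanish.
    mid = 0.5 * (u_prev + u_next)
    return (u_next - u_prev) / dt - nu * laplacian(mid)

sr_prev = torch.rand(2, 1, 64, 64)   # stand-ins for consecutive SR outputs
sr_next = torch.rand(2, 1, 64, 64)
hr_next = torch.rand(2, 1, 64, 64)   # high-resolution ground truth

recon_loss = ((sr_next - hr_next) ** 2).mean()          # standard SR term
physics_loss = (physics_residual(sr_prev, sr_next) ** 2).mean()
total = recon_loss + 0.1 * physics_loss                 # weighted combination
print(f"physics residual loss: {physics_loss.item():.4f}")
```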
[426] FUTransUNet-GradCAM: A Hybrid Transformer-U-Net with Self-Attention and Explainable Visualizations for Foot Ulcer Segmentation
Akwasi Asare, Mary Sagoe, Justice Williams Asare
Main category: eess.IV
TL;DR: FUTransUNet, a hybrid model combining Vision Transformers and U-Net, improves diabetic foot ulcer segmentation by capturing global context and fine details, achieving high accuracy and clinical interpretability.
Details
Motivation: Automated DFU segmentation is crucial for diagnosis and treatment but is challenging due to ulcer heterogeneity and complex backgrounds. Traditional CNNs lack long-range spatial dependency modeling.
Method: The proposed FUTransUNet integrates Vision Transformers' global attention with U-Net's localization, using skip connections and an effective decoding pathway for fine resolution. Trained on the FUSeg dataset.
Result: Achieved high Dice Coefficient (0.8751) and IoU (0.7780) on validation, with low loss (0.009045). Grad-CAM visualizations ensured clinical transparency.
Conclusion: FUTransUNet offers a robust, accurate, and interpretable solution for DFU segmentation, enhancing wound assessment and patient care.
Abstract: Automated segmentation of diabetic foot ulcers (DFUs) plays a critical role in clinical diagnosis, therapeutic planning, and longitudinal wound monitoring. However, this task remains challenging due to the heterogeneous appearance, irregular morphology, and complex backgrounds associated with ulcer regions in clinical photographs. Traditional convolutional neural networks (CNNs), such as U-Net, provide strong localization capabilities but struggle to model long-range spatial dependencies due to their inherently limited receptive fields. To address this, we propose FUTransUNet, a hybrid architecture that integrates the global attention mechanism of Vision Transformers (ViTs) into the U-Net framework. This combination allows the model to extract global contextual features while maintaining fine-grained spatial resolution through skip connections and an effective decoding pathway. We trained and validated FUTransUNet on the public Foot Ulcer Segmentation Challenge (FUSeg) dataset. FUTransUNet achieved a training Dice Coefficient of 0.8679, an IoU of 0.7672, and a training loss of 0.0053. On the validation set, the model achieved a Dice Coefficient of 0.8751, an IoU of 0.7780, and a validation loss of 0.009045. To ensure clinical transparency, we employed Grad-CAM visualizations, which highlight the model's focus areas during prediction. These quantitative outcomes demonstrate that our hybrid approach successfully integrates global and local feature extraction paradigms, thereby offering a robust, accurate, explainable, interpretable, and clinically translatable solution for automated foot ulcer analysis, with implications for improving real-world wound assessment and patient care.
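For the interpretability component, Grad-CAM weights a convolutional layer's activations by its pooled gradients with respect to a target score. The sketch below shows the standard recipe on a toy segmenter; the network and the target (mean foreground logit) are assumptions, since the paper applies this to FUTransUNet's predictions.

```python
# A minimal Grad-CAM sketch of the kind used for segmentation transparency.
import torch
import torch.nn as nn
import torch.nn.functional as F

net = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
                    nn.Conv2d(8, 1, 3, padding=1))   # toy segmenter
acts, grads = {}, {}
target_layer = net[0]
target_layer.register_forward_hook(lambda m, i, o: acts.update(a=o))
target_layer.register_full_backward_hook(lambda m, gi, go: grads.update(g=go[0]))

x = torch.rand(1, 3, 64, 64)                          # stand-in wound photo
logits = net(x)
logits.mean().backward()                              # mean foreground logit as target

w = grads["g"].mean(dim=(2, 3), keepdim=True)         # pooled gradients per channel
cam = F.relu((w * acts["a"]).sum(dim=1, keepdim=True))
cam = F.interpolate(cam, size=x.shape[-2:], mode="bilinear", align_corners=False)
cam = cam / (cam.max() + 1e-8)                        # normalize to [0, 1]
print(f"CAM shape: {tuple(cam.shape)}")
```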