Today’s Research Highlights
AI-enhanced summaries of the latest research papers from arXiv.
Table of Contents
- cs.CL [Total: 87]
- cs.CV [Total: 115]
- cs.AI [Total: 56]
- cs.SD [Total: 17]
- cs.LG [Total: 112]
- cs.MA [Total: 1]
- cs.MM [Total: 0]
- eess.AS [Total: 11]
- eess.IV [Total: 8]
cs.CL
[1] ResearchPulse: Building Method-Experiment Chains through Multi-Document Scientific Inference
Qi Chen, Jingxuan Wei, Zhuoya Yao, Haiguang Wang, Gaowei Wu, Bihui Yu, Siyuan Li, Cheng Tan
Main category: cs.CL
TL;DR: ResearchPulse is an agent-based framework for multi-document scientific inference that extracts and aligns motivation, methods, and results across related papers to reconstruct research development chains.
Details
Motivation: Understanding scientific evolution requires cross-document reasoning over thematically related research, not just summarizing individual papers. Current approaches lack structured alignment of motivation, methodology, and experimental results across multiple documents.
Method: ResearchPulse uses three coordinated agents: Plan Agent for task decomposition, Mmap-Agent for constructing motivation-method mind maps, and Lchart-Agent for synthesizing experimental line charts. The framework handles challenges like temporal alignment of methods and standardization of heterogeneous experimental tables.
Result: Experiments show ResearchPulse outperforms strong baselines like GPT-4o in semantic alignment, structural consistency, and visual fidelity, despite using 7B-scale agents. A citation-aware benchmark (ResearchPulse-Bench) was created to support this task.
Conclusion: The work formalizes multi-document scientific inference as a new task and demonstrates that agent-based frameworks can effectively reconstruct research development chains by extracting and aligning scientific content across related papers.
Abstract: Understanding how scientific ideas evolve requires more than summarizing individual papers; it demands structured, cross-document reasoning over thematically related research. In this work, we formalize multi-document scientific inference, a new task that extracts and aligns motivation, methodology, and experimental results across related papers to reconstruct research development chains. This task introduces key challenges, including temporally aligning loosely structured methods and standardizing heterogeneous experimental tables. We present ResearchPulse, an agent-based framework that integrates instruction planning, scientific content extraction, and structured visualization. It consists of three coordinated agents: a Plan Agent for task decomposition, a Mmap-Agent that constructs motivation-method mind maps, and a Lchart-Agent that synthesizes experimental line charts. To support this task, we introduce ResearchPulse-Bench, a citation-aware benchmark of annotated paper clusters. Experiments show that our system, despite using 7B-scale agents, consistently outperforms strong baselines like GPT-4o in semantic alignment, structural consistency, and visual fidelity. The dataset is available at https://huggingface.co/datasets/ResearchPulse/ResearchPulse-Bench.
[2] Speech-Based Cognitive Screening: A Systematic Evaluation of LLM Adaptation Strategies
Fatemeh Taherinezhad, Mohamad Javad Momeni Nezhad, Sepehr Karimi, Sina Rashidi, Ali Zolnour, Maryam Dadkhah, Yasaman Haghbin, Hossein AzadMaleki, Maryam Zolnoori
Main category: cs.CL
TL;DR: Study compares LLM adaptation strategies for dementia detection using speech data, finding that demonstration selection, reasoning prompts, and fine-tuning methods significantly impact performance, with properly adapted open-weight models matching commercial systems.
Details
Motivation: Over half of US adults with dementia remain undiagnosed, and speech-based screening offers a scalable detection approach that needs effective AI model adaptation strategies.
Method: Evaluated nine text-only and three multimodal audio-text models on DementiaBank speech corpus using various adaptations: in-context learning with demonstration selection policies, reasoning-augmented prompting, parameter-efficient fine-tuning, and multimodal integration.
Result: Class-centroid demonstrations achieved highest in-context learning performance, reasoning improved smaller models, token-level fine-tuning produced best scores. Multimodal models performed well but didn’t surpass top text-only models.
Conclusion: Model adaptation strategies critically influence dementia detection performance, and properly adapted open-weight models can match or exceed commercial systems for scalable speech-based screening.
Abstract: Over half of US adults with Alzheimer disease and related dementias remain undiagnosed, and speech-based screening offers a scalable detection approach. We compared large language model adaptation strategies for dementia detection, evaluating nine text-only models and three multimodal audio-text models on recordings from the DementiaBank speech corpus. Adaptations included in-context learning with different demonstration selection policies, reasoning-augmented prompting, parameter-efficient fine-tuning, and multimodal integration. Results showed that class-centroid demonstrations achieved the highest in-context learning performance, reasoning improved smaller models, and token-level fine-tuning generally produced the best scores. Adding a classification head substantially improved underperforming models. Among multimodal models, fine-tuned audio-text systems performed well but did not surpass the top text-only models. These findings highlight that model adaptation strategies, including demonstration selection, reasoning design, and tuning method, critically influence speech-based dementia detection, and that properly adapted open-weight models can match or exceed commercial systems.
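To make the winning demonstration-selection policy concrete, here is a minimal sketch of class-centroid selection for in-context learning, assuming transcript embeddings are already computed; the function name and toy data are illustrative, not from the paper.

```python
import numpy as np

def centroid_demonstrations(embeddings, labels, k_per_class=2):
    """Pick the k examples closest to each class centroid.

    embeddings: (n, d) array of transcript embeddings
    labels:     (n,) array of class labels (e.g., 0 = control, 1 = dementia)
    Returns indices of the selected demonstrations.
    """
    selected = []
    for c in np.unique(labels):
        idx = np.where(labels == c)[0]
        centroid = embeddings[idx].mean(axis=0)
        dists = np.linalg.norm(embeddings[idx] - centroid, axis=1)
        selected.extend(idx[np.argsort(dists)[:k_per_class]])
    return selected

# Toy usage: 6 transcripts, 2 classes, 4-dim embeddings
rng = np.random.default_rng(0)
emb = rng.normal(size=(6, 4))
lab = np.array([0, 0, 0, 1, 1, 1])
demos = centroid_demonstrations(emb, lab, k_per_class=1)
print(demos)  # one most-central example per class
```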
[3] Enhancing Speech Large Language Models through Reinforced Behavior Alignment
Yansong Liu, Jiateng Li, Yuan Liu
Main category: cs.CL
TL;DR: RBA framework uses self-synthesis data and reinforcement learning to improve speech-based LLMs’ instruction-following, outperforming text LLMs and achieving SOTA results without human annotations.
Details
Motivation: Speech-based LLMs suffer performance gaps compared to text LLMs due to inter-modal discrepancies, especially with dynamic user speech inputs.
Method: Reinforced Behavior Alignment (RBA) uses self-synthesis methodology with a teacher LLM to generate alignment data, then aligns SpeechLMs using reinforcement learning instead of supervised fine-tuning.
Result: RBA effectively enhances instruction-following capabilities, outperforms conventional distillation baselines, and achieves state-of-the-art performance on spoken QA and speech-to-text translation benchmarks.
Conclusion: RBA framework successfully bridges performance gap between speech and text LLMs using self-generated data and reinforcement learning, demonstrating strong generalization across speech tasks.
Abstract: The recent advancements of Large Language Models (LLMs) have spurred considerable research interest in extending their linguistic capabilities beyond text to other modalities, leading to the emergence of speech-based LLMs (SpeechLMs) capable of processing user requests in either speech or textual format. However, owing to inter-modal discrepancies, these SpeechLMs still exhibit a significant performance gap compared to their text-based LLM counterparts in instruction-following, particularly when confronted with the dynamic and variable nature of user speech. To address this challenge, this paper introduces a framework termed Reinforced Behavior Alignment (RBA), designed to bolster the language generation proficiency of SpeechLMs. Instead of relying on supervised fine-tuning from human annotations, RBA employs a self-synthesis methodology to generate extensive, high-fidelity alignment data from a powerful teacher LLM. The SpeechLM's behavior is then aligned with that of the teacher using a reinforcement learning-based approach. Experimental results demonstrate that this method effectively enhances the instruction-following capabilities of SpeechLMs, outperforming conventional distillation baselines. Crucially, we demonstrate that RBA can be seamlessly extended to tasks including spoken question answering and speech-to-text translation, attaining state-of-the-art performance on open benchmarks with only self-generated data.
[4] Multilevel Analysis of Cryptocurrency News using RAG Approach with Fine-Tuned Mistral Large Language Model
Bohdan M. Pavlyshenko
Main category: cs.CL
TL;DR: Fine-tuned Mistral 7B LLM with RAG performs multilevel cryptocurrency news analysis, generating graph/text summaries with sentiment scores and JSON representations to reduce hallucinations and provide comprehensive insights.
Details
Motivation: To address cryptocurrency news analysis challenges and eliminate LLM hallucinations by creating complementary graph and text-based summaries through hierarchical analysis.
Method: Uses fine-tuned Mistral 7B model with 4-bit quantization via PEFT/LoRA approach and retrieval-augmented generation (RAG) to generate graph summaries, text summaries with sentiment scores, and JSON representations in a hierarchical stacking framework.
Result: The approach successfully conducts informative qualitative and quantitative analytics, providing important insights while essentially eliminating problems with large language model hallucinations through knowledge graph representation.
Conclusion: Fine-tuned Mistral 7B LLM with RAG is effective for multilevel cryptocurrency news analysis, offering complementary graph and text views that provide comprehensive reporting capabilities.
Abstract: In the paper, we consider multilevel multitask analysis of cryptocurrency news using a fine-tuned Mistral 7B large language model with retrieval-augmented generation (RAG). On the first level of analytics, the fine-tuned model generates graph and text summaries with sentiment scores as well as JSON representations of summaries. Higher levels perform hierarchical stacking that consolidates sets of graph-based and text-based summaries, as well as summaries of summaries, into comprehensive reports. The combination of graph and text summaries provides complementary views of cryptocurrency news. The model is fine-tuned with 4-bit quantization using the PEFT/LoRA approach. Representing cryptocurrency news as a knowledge graph can essentially eliminate problems with large language model hallucinations. The obtained results demonstrate that fine-tuned Mistral 7B LLMs can conduct informative qualitative and quantitative multilevel cryptocurrency news analytics, providing important insights.
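As a rough sketch of the fine-tuning setup named in the abstract (4-bit quantization with PEFT/LoRA), the snippet below uses the Hugging Face transformers and peft libraries; the rank, alpha, and target modules are illustrative assumptions rather than the paper's settings.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

model_id = "mistralai/Mistral-7B-v0.1"

# Load the base model in 4-bit NF4 quantization (QLoRA-style).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Attach small trainable LoRA adapters; the 4-bit base weights stay frozen.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # illustrative choice of projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically <1% of all parameters
```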
[5] Multimodal Proposal for an AI-Based Tool to Increase Cross-Assessment of Messages
Alejandro Álvarez Castro, Joaquín Ordieres-Meré
Main category: cs.CL
TL;DR: Novel multi-modal framework for earnings calls using hierarchical discourse trees with text, audio, and video emotional signals, achieving semantically rich embeddings for financial forecasting and other high-stakes communication domains.
Details
Motivation: Existing financial sentiment analysis systems fail to capture the layered discourse structure of earnings calls, which blend scripted managerial commentary with unscripted analyst dialogue, requiring a more sophisticated approach.
Method: Two-stage transformer architecture: first encodes multi-modal content (text, audio, video) and discourse metadata at node level using contrastive learning, second synthesizes global embedding for entire conference.
Result: Embeddings form stable, semantically meaningful representations that reflect affective tone, structural logic, and thematic alignment, demonstrating practical utility for financial forecasting and discourse evaluation.
Conclusion: The framework generalizes beyond financial reporting to other high-stakes unscripted communicative domains like tele-medicine, education, and political discourse, offering a robust and explainable multi-modal discourse representation approach.
Abstract: Earnings calls represent a uniquely rich and semi-structured source of financial communication, blending scripted managerial commentary with unscripted analyst dialogue. Although recent advances in financial sentiment analysis have integrated multi-modal signals, such as textual content and vocal tone, most systems rely on flat document-level or sentence-level models, failing to capture the layered discourse structure of these interactions. This paper introduces a novel multi-modal framework designed to generate semantically rich and structurally aware embeddings of earnings calls, by encoding them as hierarchical discourse trees. Each node, comprising either a monologue or a question-answer pair, is enriched with emotional signals derived from text, audio, and video, as well as structured metadata including coherence scores, topic labels, and answer coverage assessments. A two-stage transformer architecture is proposed: the first stage encodes multi-modal content and discourse metadata at the node level using contrastive learning, while the second synthesizes a global embedding for the entire conference. Experimental results reveal that the resulting embeddings form stable, semantically meaningful representations that reflect affective tone, structural logic, and thematic alignment, with practical utility for downstream tasks such as financial forecasting and discourse evaluation. Beyond financial reporting, the proposed system generalizes to other high-stakes unscripted communicative domains such as tele-medicine, education, and political discourse, offering a robust and explainable approach to multi-modal discourse representation.
[6] The ProLiFIC dataset: Leveraging LLMs to Unveil the Italian Lawmaking Process
Matilde Contestabile, Chiara Ferrara, Alberto Giovannetti, Giovanni Parrillo, Andrea Vandin
Main category: cs.CL
TL;DR: ProLiFIC is a comprehensive event log of Italian lawmaking process (1987-2022) created using LLMs from unstructured data, serving as a benchmark for legal process mining.
Details
Motivation: Process mining's application in legal systems is limited by dataset accessibility and quality issues, requiring better data resources for effective analysis.
Method: Created ProLiFIC event log from unstructured Normattiva portal data using large language models (LLMs) for structuring and processing.
Result: Developed a comprehensive benchmark dataset for Italian lawmaking process spanning 35 years, enabling preliminary process mining analyses.
Conclusion: ProLiFIC serves as a valuable benchmark for legal process mining and demonstrates successful integration of LLMs with process mining techniques.
Abstract: Process Mining (PM), initially developed for industrial and business contexts, has recently been applied to social systems, including legal ones. However, PM’s efficacy in the legal domain is limited by the accessibility and quality of datasets. We introduce ProLiFIC (Procedural Lawmaking Flow in Italian Chambers), a comprehensive event log of the Italian lawmaking process from 1987 to 2022. Created from unstructured data from the Normattiva portal and structured using large language models (LLMs), ProLiFIC aligns with recent efforts in integrating PM with LLMs. We exemplify preliminary analyses and propose ProLiFIC as a benchmark for legal PM, fostering new developments.
[7] AgenTracer: Who Is Inducing Failure in the LLM Agentic Systems?
Guibin Zhang, Junhao Wang, Junjie Chen, Wangchunshu Zhou, Kun Wang, Shuicheng Yan
Main category: cs.CL
TL;DR: AgenTracer is an automated framework for diagnosing failures in multi-agent LLM systems, using counterfactual replay and fault injection to create training data, and achieves state-of-the-art performance in failure attribution.
Details
Motivation: Multi-agent LLM systems are complex and fragile, making error diagnosis difficult. Current LLMs perform poorly (<10% accuracy) at identifying which agent or step causes failures in long execution traces.
Method: Proposes AgenTracer framework with automated annotation of failed trajectories via counterfactual replay and programmed fault injection. Creates TracerTraj dataset and trains AgenTracer-8B model using multi-granular reinforcement learning.
Result: AgenTracer-8B outperforms giant proprietary LLMs (Gemini-2.5-Pro, Claude-4-Sonnet) by up to 18.18% on Who&When benchmark. Delivers 4.8-14.2% performance gains to multi-agent systems like MetaGPT and MaAS.
Conclusion: AgenTracer sets new standard in LLM agentic failure attribution and enables self-correcting, self-evolving agentic AI systems through actionable feedback.
Abstract: Large Language Model (LLM)-based agentic systems, often comprising multiple models, complex tool invocations, and orchestration protocols, substantially outperform monolithic agents. Yet this very sophistication amplifies their fragility, making them more prone to system failure. Pinpointing the specific agent or step responsible for an error within long execution traces defines the task of agentic system failure attribution. Current state-of-the-art reasoning LLMs, however, remain strikingly inadequate for this challenge, with accuracy generally below 10%. To address this gap, we propose AgenTracer, the first automated framework for annotating failed multi-agent trajectories via counterfactual replay and programmed fault injection, producing the curated dataset TracerTraj. Leveraging this resource, we develop AgenTracer-8B, a lightweight failure tracer trained with multi-granular reinforcement learning, capable of efficiently diagnosing errors in verbose multi-agent interactions. On the Who&When benchmark, AgenTracer-8B outperforms giant proprietary LLMs like Gemini-2.5-Pro and Claude-4-Sonnet by up to 18.18%, setting a new standard in LLM agentic failure attribution. More importantly, AgenTracer-8B delivers actionable feedback to off-the-shelf multi-agent systems like MetaGPT and MaAS with 4.8-14.2% performance gains, empowering self-correcting and self-evolving agentic AI.
[8] Reading Between the Signs: Predicting Future Suicidal Ideation from Adolescent Social Media Texts
Paul Blum, Enrico Liscio, Ruixuan Zhang, Caroline Figueroa, Pradeep K. Murukannaiah
Main category: cs.CL
TL;DR: Early-SIB: Transformer model predicts adolescent suicidal ideation/behavior from forum posts before explicit disclosure, achieving 0.73 balanced accuracy on Dutch youth forum data.
Details
Motivation: Suicide is a leading cause of adolescent death, but many cases go undetected due to lack of mental health contact. Social media provides real-time insights into young people's struggles.
Method: Transformer-based model that sequentially processes a user's posts and engagement content to predict future suicidal ideation/behavior without using self-disclosure as input.
Result: Achieved 0.73 balanced accuracy for predicting future SIB on a Dutch youth forum dataset.
Conclusion: Social media-based predictive tools can meaningfully supplement traditional suicide prevention methods by detecting risk before explicit disclosure.
Abstract: Suicide is a leading cause of death among adolescents (12-18), yet predicting it remains a significant challenge. Many cases go undetected due to a lack of contact with mental health services. Social media, however, offers a unique opportunity, as young people often share their thoughts and struggles online in real time. In this work, we propose a novel task and method to approach it: predicting suicidal ideation and behavior (SIB) from forum posts before an adolescent explicitly expresses suicidal ideation on an online forum. This predictive framing, where no self-disclosure is used as input at any stage, remains largely unexplored in the suicide prediction literature. To this end, we introduce Early-SIB, a transformer-based model that sequentially processes the posts a user writes and engages with to predict whether they will write a SIB post. Our model achieves a balanced accuracy of 0.73 for predicting future SIB on a Dutch youth forum, demonstrating that such tools can offer a meaningful addition to traditional methods.
[9] Real-Time Detection of Hallucinated Entities in Long-Form Generation
Oscar Obeso, Andy Arditi, Javier Ferrando, Joshua Freeman, Cameron Holmes, Neel Nanda
Main category: cs.CL
TL;DR: A scalable method for real-time detection of entity-level hallucinations in long-form LLM outputs using web-search-annotated data and simple linear classifiers that outperform existing approaches.
Details
Motivation: Hallucinations in large language models can cause serious harm in high-stakes applications like medical and legal domains, but existing detection methods are impractical for real-world use due to limitations with long-form content and high verification costs.
Method: Developed an annotation methodology using web search to label model responses with grounded entity-level hallucinations, then trained efficient linear probe classifiers on this data to detect fabricated entities in real-time.
Result: Classifiers consistently outperformed baselines across four model families (e.g., AUC 0.90 vs 0.71 for Llama-3.3-70B), worked well on both long-form and short-form responses, and generalized to mathematical reasoning tasks despite being trained only on entity-level labels.
Conclusion: The approach provides a promising scalable solution for real-world hallucination detection, with datasets publicly released to facilitate reuse across different models.
Abstract: Large language models are now routinely used in high-stakes applications where hallucinations can cause serious harm, such as medical consultations or legal advice. Existing hallucination detection methods, however, are impractical for real-world use, as they are either limited to short factual queries or require costly external verification. We present a cheap, scalable method for real-time identification of hallucinated tokens in long-form generations, and scale it effectively to 70B parameter models. Our approach targets entity-level hallucinations (e.g., fabricated names, dates, citations) rather than claim-level hallucinations, thereby naturally mapping to token-level labels and enabling streaming detection. We develop an annotation methodology that leverages web search to annotate model responses with grounded labels indicating which tokens correspond to fabricated entities. This dataset enables us to train effective hallucination classifiers with simple and efficient methods such as linear probes. Evaluating across four model families, our classifiers consistently outperform baselines on long-form responses, including more expensive methods such as semantic entropy (e.g., AUC 0.90 vs 0.71 for Llama-3.3-70B), and are also an improvement in short-form question-answering settings. Moreover, despite being trained only with entity-level labels, our probes effectively detect incorrect answers in mathematical reasoning tasks, indicating generalization beyond entities. While our annotation methodology is expensive, we find that annotated responses from one model can be used to train effective classifiers on other models; accordingly, we publicly release our datasets to facilitate reuse. Overall, our work suggests a promising new approach for scalable, real-world hallucination detection.
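Because the detector is a simple linear probe over hidden states, the training recipe fits in a few lines. Below is a self-contained sketch on synthetic activations; a real run would substitute per-token hidden states and the web-search-derived entity labels.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Synthetic stand-in for per-token hidden states (d-dim) with binary labels:
# 1 = token belongs to a fabricated entity, 0 = otherwise.
rng = np.random.default_rng(0)
d, n = 512, 2000
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = (X @ w_true + rng.normal(scale=2.0, size=n) > 0).astype(int)

probe = LogisticRegression(max_iter=1000)
probe.fit(X[:1500], y[:1500])
scores = probe.predict_proba(X[1500:])[:, 1]
print("AUC:", roc_auc_score(y[1500:], scores))
# At inference, the same probe scores each generated token's hidden state
# in a streaming fashion, flagging likely fabricated entities on the fly.
```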
[10] Topic Identification in LLM Input-Output Pairs through the Lens of Information Bottleneck
Igor Halperin
Main category: cs.CL
TL;DR: Developed UDIB, an entropy-regularized clustering method based on Deterministic Information Bottleneck, to improve detection of LLM confabulations by creating more informative topic representations.
Details
Motivation: Current Semantic Divergence Metrics for detecting LLM hallucinations rely on geometric clustering optimized for spatial proximity rather than information-theoretic analysis, creating a disconnect in topic identification.
Method: Transformed DIB method into practical algorithm by replacing intractable KL divergence with computationally efficient upper bound, creating UDIB - an entropy-regularized version of K-means that favors parsimonious, informative clusters.
Result: UDIB generates shared topic representations that are fundamentally structured to be maximally informative about prompt-response relationships, providing superior foundation for SDM framework.
Conclusion: UDIB offers a novel, more sensitive tool for detecting LLM confabulations by bridging the gap between geometric clustering and information-theoretic analysis through principled topic identification.
Abstract: Large Language Models (LLMs) are prone to critical failure modes, including intrinsic faithfulness hallucinations (also known as confabulations), where a response deviates semantically from the provided context. Frameworks designed to detect this, such as Semantic Divergence Metrics (SDM), rely on identifying latent topics shared between prompts and responses, typically by applying geometric clustering to their sentence embeddings. This creates a disconnect, as the topics are optimized for spatial proximity, not for the downstream information-theoretic analysis. In this paper, we bridge this gap by developing a principled topic identification method grounded in the Deterministic Information Bottleneck (DIB) for geometric clustering. Our key contribution is to transform the DIB method into a practical algorithm for high-dimensional data by substituting its intractable KL divergence term with a computationally efficient upper bound. The resulting method, which we dub UDIB, can be interpreted as an entropy-regularized and robustified version of K-means that inherently favors a parsimonious number of informative clusters. By applying UDIB to the joint clustering of LLM prompt and response embeddings, we generate a shared topic representation that is not merely spatially coherent but is fundamentally structured to be maximally informative about the prompt-response relationship. This provides a superior foundation for the SDM framework and offers a novel, more sensitive tool for detecting confabulations.
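The "entropy-regularized K-means" idea can be sketched generically: charge each assignment a code-length penalty -lambda*log(pi_k), so that sparsely used clusters become expensive and die off. The snippet below is an illustrative analogue of that behavior, not the paper's exact UDIB objective or its KL upper bound.

```python
import numpy as np

def entropy_regularized_kmeans(X, k=20, lam=1.0, iters=50, seed=0):
    """Generic entropy-regularized K-means sketch (not the paper's exact UDIB).

    Each point pays its squared distance plus -lam*log(pi_k), the "code cost"
    of its cluster; rarely used clusters become expensive and empty out,
    so the effective number of clusters shrinks toward a parsimonious set.
    """
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), k, replace=False)]
    pi = np.full(k, 1.0 / k)
    for _ in range(iters):
        # Assignment: squared distance plus entropy-style code-cost penalty.
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(-1)
        z = (d2 - lam * np.log(pi + 1e-12)).argmin(axis=1)
        # Update: keep only clusters that still own points.
        alive = np.unique(z)
        mu = np.stack([X[z == c].mean(axis=0) for c in alive])
        pi = np.array([(z == c).mean() for c in alive])
        z = np.searchsorted(alive, z)
    return mu, pi, z

X = np.random.default_rng(1).normal(size=(300, 8))
mu, pi, z = entropy_regularized_kmeans(X, k=20, lam=2.0)
print(len(pi), "clusters survive")
```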
[11] VoxRole: A Comprehensive Benchmark for Evaluating Speech-Based Role-Playing Agents
Weihao Wu, Liang Cao, Xinyu Wu, Zhiwei Lin, Rui Niu, Jingbei Li, Zhiyong Wu
Main category: cs.CL
TL;DR: VoxRole is the first comprehensive benchmark for evaluating speech-based role-playing conversational agents, addressing the lack of standardized evaluation and paralinguistic feature consideration in current RPCA research.
Details
Motivation: Current RPCA research focuses only on textual modality and overlooks critical paralinguistic features like intonation and prosody. There's also a lack of standardized evaluation benchmarks for speech-based role-playing, with existing datasets failing to quantify core competencies like long-term persona consistency.
Method: Created VoxRole benchmark with 13,335 multi-turn dialogues (65.6 hours of speech) from 1,228 characters across 261 movies. Used a novel two-stage automated pipeline: first aligning movie audio with scripts, then employing LLMs to build multi-dimensional character profiles.
Result: The benchmark enables multi-dimensional evaluation of contemporary spoken dialogue models, revealing their respective strengths and limitations in maintaining persona consistency.
Conclusion: VoxRole addresses critical gaps in speech-based RPCA evaluation by providing standardized benchmarks that incorporate paralinguistic features and enable comprehensive assessment of persona consistency in spoken dialogue systems.
Abstract: Recent significant advancements in Large Language Models (LLMs) have greatly propelled the development of Role-Playing Conversational Agents (RPCAs). These systems aim to create immersive user experiences through consistent persona adoption. However, current RPCA research faces dual limitations. First, existing work predominantly focuses on the textual modality, entirely overlooking critical paralinguistic features including intonation, prosody, and rhythm in speech, which are essential for conveying character emotions and shaping vivid identities. Second, the speech-based role-playing domain suffers from a long-standing lack of standardized evaluation benchmarks. Most current spoken dialogue datasets target only fundamental capability assessments, featuring thinly sketched or ill-defined character profiles. Consequently, they fail to effectively quantify model performance on core competencies like long-term persona consistency. To address this critical gap, we introduce VoxRole, the first comprehensive benchmark specifically designed for the evaluation of speech-based RPCAs. The benchmark comprises 13,335 multi-turn dialogues, totaling 65.6 hours of speech from 1,228 unique characters across 261 movies. To construct this resource, we propose a novel two-stage automated pipeline that first aligns movie audio with scripts and subsequently employs an LLM to systematically build multi-dimensional profiles for each character. Leveraging VoxRole, we conduct a multi-dimensional evaluation of contemporary spoken dialogue models, revealing crucial insights into their respective strengths and limitations in maintaining persona consistency.
[12] QuesGenie: Intelligent Multimodal Question Generation
Ahmed Mubarak, Amna Ahmed, Amira Nasser, Aya Mohamed, Fares El-Sadek, Mohammed Ahmed, Ahmed Salah, Youssef Sobhy
Main category: cs.CL
TL;DR: A multi-modal question generation system that automatically creates diverse question types from various content formats to address the lack of practice materials for educational resources.
Details
Motivation: Learners have abundant educational resources but lack tailored practice materials, creating a significant challenge in modern education.
Method: Four-component system: multi-modal input handling, question generation, reinforcement learning from human feedback (RLHF), and end-to-end interactive interface.
Result: Developed foundation for automated, scalable, and intelligent question generation that balances resource efficiency, robust functionality, and smooth user experience.
Conclusion: The system successfully addresses the gap in educational practice materials by providing automated question generation from diverse content formats.
Abstract: In today’s information-rich era, learners have access to abundant educational resources, but the lack of practice materials tailored to these resources presents a significant challenge. This project addresses that gap by developing a multi-modal question generation system that can automatically generate diverse question types from various content formats. The system features four major components: multi-modal input handling, question generation, reinforcement learning from human feedback (RLHF), and an end-to-end interactive interface. This project lays the foundation for automated, scalable, and intelligent question generation, carefully balancing resource efficiency, robust functionality and a smooth user experience.
[13] AR$^2$: Adversarial Reinforcement Learning for Abstract Reasoning in Large Language Models
Cheng-Kai Yeh, Hsing-Wang Lee, Chung-Hung Kuo, Hen-Hsen Huang
Main category: cs.CL
TL;DR: AR^2 framework uses adversarial reinforcement learning to train LLMs for better abstraction skills by having a teacher model create complex narrative problems from kernel problems while a student model learns to extract the underlying computational logic.
Details
Motivation: Existing LLM training approaches focus on superficial pattern recognition but overlook explicit training for abstraction, which is a foundational skill in computer science critical for problem-solving and generalization.
Method: AR^2 employs a teacher model to transform kernel problems into narrative-rich descriptions without changing their fundamental logic, while a student coding model is trained to solve these complex problems by extracting their underlying computational kernels using adversarial reinforcement learning.
Result: Experimental results show that AR^2 substantially improves the student model’s accuracy on previously unseen, challenging programming tasks.
Conclusion: Abstraction is a key skill for enhancing LLM generalization, and the AR^2 framework successfully enhances abstraction abilities in large language models for code generation.
Abstract: Abstraction–the ability to recognize and distill essential computational patterns from complex problem statements–is a foundational skill in computer science, critical both for human problem-solvers and coding-oriented large language models (LLMs). Despite recent advances in training LLMs for code generation using reinforcement learning (RL), most existing approaches focus primarily on superficial pattern recognition, overlooking explicit training for abstraction. In this study, we propose AR$^2$ (Adversarial Reinforcement Learning for Abstract Reasoning), a novel framework explicitly designed to enhance the abstraction abilities of LLMs. AR$^2$ employs a teacher model to transform kernel problems into narrative-rich, challenging descriptions without changing their fundamental logic. Simultaneously, a student coding model is trained to solve these complex narrative problems by extracting their underlying computational kernels. Experimental results demonstrate that AR$^2$ substantially improves the student model’s accuracy on previously unseen, challenging programming tasks, underscoring abstraction as a key skill for enhancing LLM generalization.
[14] PARCO: Phoneme-Augmented Robust Contextual ASR via Contrastive Entity Disambiguation
Jiajun He, Naoki Sawada, Koichi Miyazaki, Tomoki Toda
Main category: cs.CL
TL;DR: PARCO improves contextual ASR by addressing homophone recognition issues through phoneme-aware encoding, contrastive entity disambiguation, and hierarchical filtering, achieving significant performance gains on both Chinese and English datasets.
Details
Motivation: Current ASR systems struggle with domain-specific named entities and homophones, with contextual ASR methods having limited entity diversity and treating entities as independent tokens, leading to incomplete multi-token biasing.
Method: Proposes PARCO framework with four key components: phoneme-aware encoding for phonetic discrimination, contrastive entity disambiguation, entity-level supervision for complete entity retrieval, and hierarchical entity filtering to reduce false positives.
Result: Achieves CER of 4.22% on Chinese AISHELL-1 and WER of 11.14% on English DATA2 under 1,000 distractors, significantly outperforming baselines. Also shows robust gains on out-of-domain datasets THCHS-30 and LibriSpeech.
Conclusion: PARCO effectively addresses homophone recognition challenges in contextual ASR through integrated phonetic and entity-level processing, demonstrating superior performance across multiple languages and domains.
Abstract: Automatic speech recognition (ASR) systems struggle with domain-specific named entities, especially homophones. Contextual ASR improves recognition but often fails to capture fine-grained phoneme variations due to limited entity diversity. Moreover, prior methods treat entities as independent tokens, leading to incomplete multi-token biasing. To address these issues, we propose Phoneme-Augmented Robust Contextual ASR via COntrastive entity disambiguation (PARCO), which integrates phoneme-aware encoding, contrastive entity disambiguation, entity-level supervision, and hierarchical entity filtering. These components enhance phonetic discrimination, ensure complete entity retrieval, and reduce false positives under uncertainty. Experiments show that PARCO achieves CER of 4.22% on Chinese AISHELL-1 and WER of 11.14% on English DATA2 under 1,000 distractors, significantly outperforming baselines. PARCO also demonstrates robust gains on out-of-domain datasets like THCHS-30 and LibriSpeech.
[15] Improving Factuality in LLMs via Inference-Time Knowledge Graph Construction
Shanglin Wu, Lihui Liu, Jinho D. Choi, Kai Shu
Main category: cs.CL
TL;DR: A framework that dynamically constructs and expands knowledge graphs during inference to improve LLM factuality by integrating internal knowledge and external retrieval.
Details
Motivation: LLMs struggle with factual consistency due to parametric memory limitations. RAG methods treat knowledge as unstructured text, limiting compositional reasoning and factual inconsistency detection.
Method: Extracts seed KG from question via prompting, iteratively expands using LLM's latent knowledge, then selectively refines through external retrieval to enhance factual coverage and correct inaccuracies.
Result: Consistent improvements in factual accuracy, answer precision, and interpretability over baseline prompting and static KG-augmented methods across three factual QA benchmarks.
Conclusion: Inference-time KG construction is a promising direction for enhancing LLM factuality in a structured, interpretable, and scalable manner.
Abstract: Large Language Models (LLMs) often struggle with producing factually consistent answers due to limitations in their parametric memory. Retrieval-Augmented Generation (RAG) methods address this issue by incorporating external knowledge from trusted sources at inference time. However, such methods typically treat knowledge as unstructured text, which limits their ability to support compositional reasoning and identify factual inconsistencies. To overcome these limitations, we propose a novel framework that dynamically constructs and expands knowledge graphs (KGs) during inference, integrating both internal knowledge extracted from LLMs and information retrieved from external sources. Our method begins by extracting a seed KG from the question via prompting, followed by iterative expansion using the LLM's latent knowledge. The graph is then selectively refined through external retrieval, enhancing factual coverage and correcting inaccuracies. We evaluate our approach on three diverse factual QA benchmarks, demonstrating consistent improvements in factual accuracy, answer precision, and interpretability over baseline prompting and static KG-augmented methods. Our findings suggest that inference-time KG construction is a promising direction for enhancing LLM factuality in a structured, interpretable, and scalable manner.
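A hypothetical sketch of the inference-time loop described above, with stand-in helpers for LLM prompting and external retrieval (neither is a real API):

```python
def llm_extract_triples(text):
    # Stand-in: prompt an LLM to emit (subject, relation, object) triples.
    return [("Marie Curie", "won", "Nobel Prize in Physics")]

def retrieve_evidence(triple):
    # Stand-in: query an external source; True if evidence supports the triple.
    return True

def build_kg(question, rounds=3):
    kg = set(llm_extract_triples(question))       # 1) seed KG from the question
    for _ in range(rounds):                       # 2) expand with latent knowledge
        context = " ; ".join(" ".join(t) for t in kg)
        kg.update(llm_extract_triples(context))
    return {t for t in kg if retrieve_evidence(t)}  # 3) verify and prune externally

print(build_kg("Who won the Nobel Prize in Physics in 1903?"))
```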
[16] NoteBar: An AI-Assisted Note-Taking System for Personal Knowledge Management
Josh Wisoff, Yao Tang, Zhengyu Fang, Jordan Guzman, YuTang Wang, Alex Yu
Main category: cs.CL
TL;DR: NoteBar is an AI-assisted note-taking tool that uses persona information and efficient language models to automatically organize notes, accompanied by a novel persona-conditioned dataset for research evaluation.
Details
Motivation: Existing AI-assisted note-taking solutions often struggle with efficiency, and there's a need for better tools that can capture, organize, and reflect on information effectively in academic and professional settings.
Method: Leverages persona information and efficient language models to automatically organize notes into multiple categories. Introduces a persona-conditioned dataset of 3,173 notes with 8,494 annotated concepts across 16 MBTI personas.
Result: NoteBar can be deployed in a practical and cost-effective manner, enabling interactive use without heavy infrastructure. The tool and dataset provide scalable foundation for AI-assisted personal knowledge management.
Conclusion: NoteBar and its accompanying dataset offer a scalable and extensible foundation for advancing AI-assisted personal knowledge management, addressing efficiency challenges in existing solutions.
Abstract: Note-taking is a critical practice for capturing, organizing, and reflecting on information in both academic and professional settings. The recent success of large language models has accelerated the development of AI-assisted tools, yet existing solutions often struggle with efficiency. We present NoteBar, an AI-assisted note-taking tool that leverages persona information and efficient language models to automatically organize notes into multiple categories and better support user workflows. To support research and evaluation in this space, we further introduce a novel persona-conditioned dataset of 3,173 notes and 8,494 annotated concepts across 16 MBTI personas, offering both diversity and semantic richness for downstream tasks. Finally, we demonstrate that NoteBar can be deployed in a practical and cost-effective manner, enabling interactive use without reliance on heavy infrastructure. Together, NoteBar and its accompanying dataset provide a scalable and extensible foundation for advancing AI-assisted personal knowledge management.
[17] E-ARMOR: Edge case Assessment and Review of Multilingual Optical Character Recognition
Aryan Gupta, Anupam Purwar
Main category: cs.CL
TL;DR: Traditional OCR systems like Sprinklr-Edge-OCR outperform LVLMs in edge deployment due to superior efficiency, lower cost, and better F1 scores despite LVLMs having higher precision.
Details
Motivation: To address the challenge of OCR in multilingual, noisy real-world images and evaluate whether LVLMs can outperform traditional OCR systems in resource-constrained edge environments.
Method: Large-scale comparative evaluation of 5 LVLMs (InternVL, Qwen, GOT OCR, LLaMA, MiniCPM) and 2 traditional OCR systems on a proprietary multilingual dataset with 54 languages, covering accuracy, semantic consistency, language coverage, computational efficiency, and deployment costs.
Result: Qwen achieved the highest precision (0.54), but Sprinklr-Edge-OCR delivered the best overall F1 score (0.46), was 35× faster (0.17s per image), and cost less than 0.01× as much as the LVLMs (0.006 USD per 1,000 images).
Conclusion: Traditional OCR systems are more optimal for edge deployment than LVLMs due to significantly lower compute requirements, lower latency, and much higher affordability, even in the LLM era.
Abstract: Optical Character Recognition (OCR) in multilingual, noisy, and diverse real-world images remains a significant challenge. With the rise of Large Vision-Language Models (LVLMs), there is growing interest in their ability to generalize and reason beyond fixed OCR pipelines. In this work, we introduce Sprinklr-Edge-OCR, a novel OCR system built specifically for edge deployment in resource-constrained environments. We present a large-scale comparative evaluation of five state-of-the-art LVLMs (InternVL, Qwen, GOT OCR, LLaMA, MiniCPM) and two traditional OCR systems (Sprinklr-Edge-OCR, SuryaOCR) on a proprietary, doubly hand-annotated dataset of multilingual (54 languages) images. Our benchmark covers a broad range of metrics including accuracy, semantic consistency, language coverage, computational efficiency (latency, memory, GPU usage), and deployment cost. To better reflect real-world applicability, we also conducted edge case deployment analysis, evaluating model performance in CPU-only environments. Among the results, Qwen achieved the highest precision (0.54), while Sprinklr-Edge-OCR delivered the best overall F1 score (0.46) and outperformed the others in efficiency, processing images 35× faster (0.17 seconds per image on average) and at less than 0.01× the cost (0.006 USD per 1,000 images) compared to the LVLMs. Our findings demonstrate that, even in the era of LLMs, the most suitable OCR systems for edge deployment are the traditional ones, owing to their low compute requirements, low latency, and very high affordability.
[18] Breaking the Mirror: Activation-Based Mitigation of Self-Preference in LLM Evaluators
Dani Roytburg, Matthew Bozoukov, Matthew Nguyen, Jou Barzdukas, Simon Fu, Narmeen Oozeer
Main category: cs.CL
TL;DR: Lightweight steering vectors can reduce LLM self-preference bias by up to 97% without retraining, but show instability on legitimate self-preference cases.
Details
Motivation: Large language models suffer from self-preference bias where they favor their own outputs over other models, undermining fairness in evaluation pipelines for tasks like preference tuning and model routing.
Method: Used Contrastive Activation Addition (CAA) and optimization-based approach to construct steering vectors, tested on a curated dataset distinguishing justified vs unjustified self-preference.
Result: Steering vectors reduced unjustified self-preference bias by up to 97%, outperforming prompting and direct preference optimization baselines, but were unstable on legitimate self-preference and unbiased agreement.
Conclusion: Steering vectors show promise but have limits as safeguards for LLM-as-judges, motivating the need for more robust interventions as self-preference spans multiple or nonlinear directions.
Abstract: Large language models (LLMs) increasingly serve as automated evaluators, yet they suffer from “self-preference bias”: a tendency to favor their own outputs over those of other models. This bias undermines fairness and reliability in evaluation pipelines, particularly for tasks like preference tuning and model routing. We investigate whether lightweight steering vectors can mitigate this problem at inference time without retraining. We introduce a curated dataset that distinguishes self-preference bias into justified examples of self-preference and unjustified examples of self-preference, and we construct steering vectors using two methods: Contrastive Activation Addition (CAA) and an optimization-based approach. Our results show that steering vectors can reduce unjustified self-preference bias by up to 97%, substantially outperforming prompting and direct preference optimization baselines. Yet steering vectors are unstable on legitimate self-preference and unbiased agreement, implying self-preference spans multiple or nonlinear directions. This underscores both their promise and limits as safeguards for LLM-as-judges and motivates more robust interventions.
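Mechanically, CAA steering is simple: take the mean activation difference between contrastive example sets and add (or subtract) it at inference. A minimal numpy sketch, with the layer choice, scale, and sign as assumptions:

```python
import numpy as np

def contrastive_activation_addition(h_pos, h_neg, h, alpha=1.0):
    """CAA in one line: steer with the mean activation difference.

    h_pos: (n, d) activations on biased (self-preferring) judgments
    h_neg: (n, d) activations on unbiased judgments
    h:     (d,)   activation at the same layer during evaluation
    """
    v = h_pos.mean(axis=0) - h_neg.mean(axis=0)  # steering direction
    return h - alpha * v                          # subtract to *reduce* the bias

rng = np.random.default_rng(0)
h_pos = rng.normal(loc=0.5, size=(32, 64))
h_neg = rng.normal(loc=0.0, size=(32, 64))
steered = contrastive_activation_addition(h_pos, h_neg, rng.normal(size=64))
print(steered.shape)
```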
[19] Semantic Analysis of SNOMED CT Concept Co-occurrences in Clinical Documentation using MIMIC-IV
Ali Noori, Somya Mohanty, Prashanti Manda
Main category: cs.CL
TL;DR: Study analyzes SNOMED CT concept relationships in clinical notes using co-occurrence patterns and semantic embeddings, finding weak correlation but complementary value for improving documentation and uncovering clinical insights.
Details
Motivation: Clinical notes contain rich but unstructured data; standardized terminologies like SNOMED CT improve interoperability but understanding concept relationships through co-occurrence and semantic similarity remains underexplored.
Method: Leveraged MIMIC-IV database, used Normalized Pointwise Mutual Information (NPMI) and pretrained embeddings (ClinicalBERT, BioBERT) to analyze concept co-occurrence patterns and semantic similarity, examined temporal evolution and specialty differences.
Result: Weak correlation between co-occurrence and semantic similarity; embeddings captured clinically meaningful associations not reflected in frequency; embedding suggestions matched later-documented concepts; clustering revealed coherent clinical themes; co-occurrence patterns linked to outcomes like mortality and readmission.
Conclusion: Co-occurrence statistics and semantic embeddings provide complementary value for improving documentation completeness, uncovering latent clinical relationships, and informing decision support and phenotyping applications.
Abstract: Clinical notes contain rich clinical narratives but their unstructured format poses challenges for large-scale analysis. Standardized terminologies such as SNOMED CT improve interoperability, yet understanding how concepts relate through co-occurrence and semantic similarity remains underexplored. In this study, we leverage the MIMIC-IV database to investigate the relationship between SNOMED CT concept co-occurrence patterns and embedding-based semantic similarity. Using Normalized Pointwise Mutual Information (NPMI) and pretrained embeddings (e.g., ClinicalBERT, BioBERT), we examine whether frequently co-occurring concepts are also semantically close, whether embeddings can suggest missing concepts, and how these relationships evolve temporally and across specialties. Our analyses reveal that while co-occurrence and semantic similarity are weakly correlated, embeddings capture clinically meaningful associations not always reflected in documentation frequency. Embedding-based suggestions frequently matched concepts later documented, supporting their utility for augmenting clinical annotations. Clustering of concept embeddings yielded coherent clinical themes (symptoms, labs, diagnoses, cardiovascular conditions) that map to patient phenotypes and care patterns. Finally, co-occurrence patterns linked to outcomes such as mortality and readmission demonstrate the practical utility of this approach. Collectively, our findings highlight the complementary value of co-occurrence statistics and semantic embeddings in improving documentation completeness, uncovering latent clinical relationships, and informing decision support and phenotyping applications.
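For reference, NPMI rescales pointwise mutual information by -log p(x,y) so scores fall in [-1, 1]; below is a small worked example from co-occurrence counts (this is the standard definition, not the paper's code).

```python
import math

def npmi(n_xy, n_x, n_y, n_total):
    """Normalized PMI from co-occurrence counts; ranges from -1 to 1.

    n_xy: notes containing both concepts
    n_x, n_y: notes containing each concept
    n_total: total notes
    """
    p_xy = n_xy / n_total
    p_x, p_y = n_x / n_total, n_y / n_total
    pmi = math.log(p_xy / (p_x * p_y))
    return pmi / (-math.log(p_xy))

# Two SNOMED CT concepts co-occurring in 120 of 10,000 notes:
print(round(npmi(120, 400, 600, 10_000), 3))  # positive => co-occur above chance
```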
[20] MLSD: A Novel Few-Shot Learning Approach to Enhance Cross-Target and Cross-Domain Stance Detection
Parush Gera, Tempestt Neal
Main category: cs.CL
TL;DR: MLSD uses metric learning with triplet loss for cross-domain/target stance detection, creating a discriminative embedding space that improves performance across multiple models and datasets.
Details
Motivation: To address the challenge of stance detection across different domains and targets where labeled data is scarce, requiring effective domain adaptation techniques.
Method: Utilizes metric learning with triplet loss to capture semantic similarities and differences between stance targets, constructing a discriminative embedding space for better domain adaptation.
Result: Shows statistically significant improvement in stance detection performance across six widely used stance detection models in multiple cross-target and cross-domain scenarios.
Conclusion: MLSD effectively enhances cross-domain and cross-target stance detection by leveraging metric learning to create transferable semantic representations.
Abstract: We present the novel approach for stance detection across domains and targets, Metric Learning-Based Few-Shot Learning for Cross-Target and Cross-Domain Stance Detection (MLSD). MLSD utilizes metric learning with triplet loss to capture semantic similarities and differences between stance targets, enhancing domain adaptation. By constructing a discriminative embedding space, MLSD allows a cross-target or cross-domain stance detection model to acquire useful examples from new target domains. We evaluate MLSD in multiple cross-target and cross-domain scenarios across two datasets, showing statistically significant improvement in stance detection performance across six widely used stance detection models.
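The triplet loss at the core of metric learning here is standard: pull an anchor toward a same-target positive and push it at least a margin away from a different-target negative. A minimal sketch with toy embeddings:

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge triplet loss: pull same-target pairs together,
    push different targets at least `margin` apart."""
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(anchor - negative)
    return max(0.0, d_pos - d_neg + margin)

# Embeddings of texts about the same stance target (anchor/positive)
# versus a different target (negative):
a = np.array([0.9, 0.1]); p = np.array([0.8, 0.2]); n = np.array([0.1, 0.9])
print(triplet_loss(a, p, n))  # small: the space already separates the targets
```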
[21] NADI 2025: The First Multidialectal Arabic Speech Processing Shared Task
Bashar Talafha, Hawau Olamide Toyin, Peter Sullivan, AbdelRahim Elmadany, Abdurrahman Juma, Amirbek Djanibekov, Chiyu Zhang, Hamad Alshehhi, Hanan Aldarmaki, Mustafa Jarrar, Nizar Habash, Muhammad Abdul-Mageed
Main category: cs.CL
TL;DR: NADI 2025 shared task results on Arabic speech dialect processing with 44 teams, 100 submissions across 3 subtasks, showing best performance of 79.8% accuracy in dialect ID and ongoing challenges in speech recognition/diacritic restoration.
Details
Motivation: To advance Arabic dialect speech processing through community collaboration and benchmark performance across dialect identification, speech recognition, and diacritic restoration tasks.
Method: Organized shared task with three subtasks: spoken dialect identification, speech recognition, and diacritic restoration for spoken dialects. Collected submissions from 44 registered teams with 100 valid submissions from 8 unique teams.
Result: Best systems achieved: 79.8% accuracy on dialect identification, 35.68/12.20 WER/CER on speech recognition, and 55/13 WER/CER on diacritic restoration. Results show significant challenges remain in Arabic dialect speech processing.
Conclusion: Arabic dialect speech processing remains challenging, particularly in recognition and diacritic restoration. The shared task successfully engaged the community and identified areas for future research and improvement in NADI editions.
Abstract: We present the findings of the sixth Nuanced Arabic Dialect Identification (NADI 2025) Shared Task, which focused on Arabic speech dialect processing across three subtasks: spoken dialect identification (Subtask 1), speech recognition (Subtask 2), and diacritic restoration for spoken dialects (Subtask 3). A total of 44 teams registered, and during the testing phase, 100 valid submissions were received from eight unique teams. The distribution was as follows: 34 submissions for Subtask 1 (five teams), 47 submissions for Subtask 2 (six teams), and 19 submissions for Subtask 3 (two teams). The best-performing systems achieved 79.8% accuracy on Subtask 1, 35.68/12.20 WER/CER (overall average) on Subtask 2, and 55/13 WER/CER on Subtask 3. These results highlight the ongoing challenges of Arabic dialect speech processing, particularly in dialect identification, recognition, and diacritic restoration. We also summarize the methods adopted by participating teams and briefly outline directions for future editions of NADI.
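Since two of the three subtasks report WER/CER, here is a worked sketch of word error rate via edit distance; CER is the identical computation over characters. This is the standard definition, not the shared task's scoring code.

```python
def wer(reference, hypothesis):
    """Word error rate via Levenshtein distance; CER is the same over characters."""
    r, h = reference.split(), hypothesis.split()
    # dp[i][j] = edits to turn the first i reference words into the first j hypothesis words
    dp = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        dp[i][0] = i
    for j in range(len(h) + 1):
        dp[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = dp[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[-1][-1] / len(r)

print(wer("the cat sat on the mat", "the cat sat on mat"))  # 1 deletion / 6 words
```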
[22] SiLVERScore: Semantically-Aware Embeddings for Sign Language Generation Evaluation
Saki Imai, Mert İnan, Anthony Sicilia, Malihe Alikhani
Main category: cs.CL
TL;DR: SiLVERScore is a new embedding-based metric for evaluating sign language generation that directly assesses semantic similarity in joint embedding space, overcoming limitations of back-translation methods.
Details
Motivation: Current back-translation evaluation methods for sign language generation fail to capture multimodal aspects (facial expressions, spatial grammar, prosody) and introduce ambiguity in error attribution between generation and translation systems.
Method: Proposed SiLVERScore, a semantically-aware embedding-based evaluation metric that assesses sign language generation in a joint embedding space, enabling direct semantic comparison without back-translation.
Result: On PHOENIX-14T and CSL-Daily datasets, SiLVERScore achieves near-perfect discrimination (ROC AUC = 0.99, overlap < 7%), substantially outperforming traditional metrics.
Conclusion: SiLVERScore provides a robust, semantically-aware evaluation framework for sign language generation that addresses the limitations of existing back-translation methods and effectively captures semantic and prosodic variations.
Abstract: Evaluating sign language generation is often done through back-translation, where generated signs are first recognized back to text and then compared to a reference using text-based metrics. However, this two-step evaluation pipeline introduces ambiguity: it not only fails to capture the multimodal nature of sign language-such as facial expressions, spatial grammar, and prosody-but also makes it hard to pinpoint whether evaluation errors come from sign generation model or the translation system used to assess it. In this work, we propose SiLVERScore, a novel semantically-aware embedding-based evaluation metric that assesses sign language generation in a joint embedding space. Our contributions include: (1) identifying limitations of existing metrics, (2) introducing SiLVERScore for semantically-aware evaluation, (3) demonstrating its robustness to semantic and prosodic variations, and (4) exploring generalization challenges across datasets. On PHOENIX-14T and CSL-Daily datasets, SiLVERScore achieves near-perfect discrimination between correct and random pairs (ROC AUC = 0.99, overlap < 7%), substantially outperforming traditional metrics.
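The general recipe behind an embedding-based metric of this kind is similarity in a shared space; the sketch below uses cosine similarity, with random vectors standing in for the paper's actual sign-language and reference encoders.

```python
import numpy as np

def embedding_score(gen_emb, ref_emb):
    """Cosine similarity in a joint embedding space: the generic recipe behind
    embedding-based metrics like SiLVERScore (the exact encoders are the paper's)."""
    gen = gen_emb / np.linalg.norm(gen_emb)
    ref = ref_emb / np.linalg.norm(ref_emb)
    return float(gen @ ref)

rng = np.random.default_rng(0)
sign_video_emb = rng.normal(size=256)                             # generated signing, encoded
reference_emb = sign_video_emb + rng.normal(scale=0.3, size=256)  # close reference
print(embedding_score(sign_video_emb, reference_emb))             # near 1 for a good match
```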
[23] Measuring How (Not Just Whether) VLMs Build Common Ground
Saki Imai, Mert İnan, Anthony Sicilia, Malihe Alikhani
Main category: cs.CL
TL;DR: VLMs claim reasoning but current benchmarks don’t test interactive grounding. New 4-metric suite evaluates VLM performance in interactive referential games, showing models diverge from human patterns despite task success.
Details
Motivation: Current benchmarks evaluate VLMs in single-turn or QA settings, but grounding is an interactive process where people develop shared understanding through ongoing communication.
Method: Introduced a four-metric suite (grounding efficiency, content alignment, lexical adaptation, human-likeness) and deployed it on 150 self-play sessions of interactive referential games between three proprietary VLMs, comparing with human dyads.
Result: All three models diverged from human patterns on at least three metrics (GPT4o-mini was closest overall). Task success scores don’t indicate successful grounding, and high image-utterance alignment doesn’t necessarily predict task success.
Conclusion: The metric suite and findings provide a framework for future research on VLM grounding, highlighting the need for better evaluation of interactive reasoning capabilities.
Abstract: Large vision language models (VLMs) increasingly claim reasoning skills, yet current benchmarks evaluate them in single-turn or question answering settings. However, grounding is an interactive process in which people gradually develop shared understanding through ongoing communication. We introduce a four-metric suite (grounding efficiency, content alignment, lexical adaptation, and human-likeness) to systematically evaluate VLM performance in interactive grounding contexts. We deploy the suite on 150 self-play sessions of interactive referential games between three proprietary VLMs and compare them with human dyads. All three models diverge from human patterns on at least three metrics, while GPT4o-mini is the closest overall. We find that (i) task success scores do not indicate successful grounding and (ii) high image-utterance alignment does not necessarily predict task success. Our metric suite and findings offer a framework for future research on VLM grounding.
[24] Align-then-Slide: A complete evaluation framework for Ultra-Long Document-Level Machine Translation
Jiaxin Guo, Daimeng Wei, Yuanchang Luo, Xiaoyu Chen, Zhanglin Wu, Huan Yang, Hengchao Shang, Zongyao Li, Zhiqiang Rao, Jinlong Yang, Hao Yang
Main category: cs.CL
TL;DR: Align-then-Slide is a new evaluation framework for document-level machine translation that handles ultra-long documents by first aligning source-target sentences and then using multi-chunk sliding evaluation, achieving high correlation with human judgments and enabling effective model training.
Details
Motivation: Existing evaluation methods for machine translation assume sentence-by-sentence alignment, but large language models produce whole-document outputs that challenge these methods, requiring a new evaluation framework that can handle ultra-long documents and complex alignment issues.Method: The framework has two stages: 1) Align stage - automatically infer sentence-level source-target correspondences and rebuild target to match source sentence number, resolving omissions and many-to-one/one-to-many mappings; 2) n-Chunk Sliding Evaluate stage - calculate averaged metric scores under 1-, 2-, 3- and 4-chunk for multi-granularity assessment.
Result: Experiments on WMT benchmark show Pearson correlation of 0.929 with expert MQM rankings. On a newly curated real-world test set, the method aligns closely with human judgments. The framework also enables effective CPO training and can be used as a reward model for GRPO, yielding translations preferred over vanilla SFT baseline.
Conclusion: The results validate Align-then-Slide as an accurate, robust, and actionable evaluation tool for document-level machine translation systems that effectively handles the challenges posed by large language model outputs.
Abstract: Large language models (LLMs) have ushered in a new era for document-level machine translation (\textit{doc}-mt), yet their whole-document outputs challenge existing evaluation methods that assume sentence-by-sentence alignment. We introduce \textit{\textbf{Align-then-Slide}}, a complete evaluation framework for ultra-long doc-mt. In the Align stage, we automatically infer sentence-level source-target correspondences and rebuild the target to match the source sentence number, resolving omissions and many-to-one/one-to-many mappings. In the n-Chunk Sliding Evaluate stage, we calculate averaged metric scores under 1-, 2-, 3- and 4-chunk for multi-granularity assessment. Experiments on the WMT benchmark show a Pearson correlation of 0.929 between our method with expert MQM rankings. On a newly curated real-world test set, our method again aligns closely with human judgments. Furthermore, preference data produced by Align-then-Slide enables effective CPO training and its direct use as a reward model for GRPO, both yielding translations preferred over a vanilla SFT baseline. The results validate our framework as an accurate, robust, and actionable evaluation tool for doc-mt systems.
[25] NE-PADD: Leveraging Named Entity Knowledge for Robust Partial Audio Deepfake Detection via Attention Aggregation
Huhong Xian, Rui Liu, Berrak Sisman, Haizhou Li
Main category: cs.CL
TL;DR: NE-PADD is a novel method for partial audio deepfake detection that leverages named entity knowledge through parallel SpeechNER and PADD branches with attention fusion and transfer mechanisms.
Details
Motivation: Traditional audio deepfake detection operates at sentence level, while partial detection requires frame-level fake speech localization. Semantic information from audio, particularly named entities, remains underexplored for this task.Method: Proposes NE-PADD with two parallel branches: Speech Name Entity Recognition (SpeechNER) and PADD. Uses Attention Fusion (AF) to combine attention weights and Attention Transfer (AT) with auxiliary loss to guide PADD using named entity semantics.
Result: Experiments on PartialSpoof-NER dataset show the method outperforms existing baselines, demonstrating effectiveness of integrating named entity knowledge in partial audio deepfake detection.
Conclusion: The integration of named entity knowledge through attention mechanisms significantly improves partial audio deepfake detection performance, proving the value of semantic information for frame-level fake speech localization.
Abstract: Different from traditional sentence-level audio deepfake detection (ADD), partial audio deepfake detection (PADD) requires frame-level positioning of the location of fake speech. While some progress has been made in this area, leveraging semantic information from audio, especially named entities, remains an underexplored aspect. To this end, we propose NE-PADD, a novel method for Partial Audio Deepfake Detection (PADD) that leverages named entity knowledge through two parallel branches: Speech Name Entity Recognition (SpeechNER) and PADD. The approach incorporates two attention aggregation mechanisms: Attention Fusion (AF) for combining attention weights and Attention Transfer (AT) for guiding PADD with named entity semantics using an auxiliary loss. Built on the PartialSpoof-NER dataset, experiments show our method outperforms existing baselines, proving the effectiveness of integrating named entity knowledge in PADD. The code is available at https://github.com/AI-S2-Lab/NE-PADD.
[26] Drivel-ology: Challenging LLMs with Interpreting Nonsense with Depth
Yang Wang, Chenghao Xiao, Chia-Yi Hsiao, Zi Yan Chang, Chi-Li Chen, Tyler Loakman, Chenghua Lin
Main category: cs.CL
TL;DR: Drivelology is nonsense with depth - syntactically coherent but pragmatically paradoxical text that LLMs fail to understand despite surface fluency.
Details
Motivation: To investigate LLMs' limitations in understanding layered semantic meaning beyond surface coherence, particularly in paradoxical or emotionally loaded text.Method: Created a diverse benchmark of 1,200+ curated Drivelology examples across multiple languages, with expert annotation and adjudication. Evaluated LLMs on classification, generation, and reasoning tasks.
Result: LLMs consistently fail to grasp Drivelology, confusing it with shallow nonsense, producing incoherent justifications, and missing implied rhetorical functions.
Conclusion: Statistical fluency doesn’t imply cognitive comprehension; LLMs have a representational gap in pragmatic understanding that challenges current NLP assumptions.
Abstract: We introduce Drivelology, a unique linguistic phenomenon characterised as “nonsense with depth”, utterances that are syntactically coherent yet pragmatically paradoxical, emotionally loaded, or rhetorically subversive. While such expressions may resemble surface-level nonsense, they encode implicit meaning requiring contextual inference, moral reasoning, or emotional interpretation. We find that current large language models (LLMs), despite excelling at many natural language processing (NLP) tasks, consistently fail to grasp the layered semantics of Drivelological text. To investigate this, we construct a small but diverse benchmark dataset of over 1,200 meticulously curated examples, with select instances in English, Mandarin, Spanish, French, Japanese, and Korean. Annotation was especially challenging: each of the examples required careful expert review to verify that it truly reflected Drivelological characteristics. The process involved multiple rounds of discussion and adjudication to address disagreements, highlighting the subtle and subjective nature of the Drivelology. We evaluate a range of LLMs on classification, generation, and reasoning tasks. Our results reveal clear limitations of LLMs: models often confuse Drivelology with shallow nonsense, produce incoherent justifications, or miss the implied rhetorical function altogether. These findings highlight a deeper representational gap in LLMs' pragmatic understanding and challenge the assumption that statistical fluency implies cognitive comprehension. We release our dataset and code to facilitate further research in modelling linguistic depth beyond surface-level coherence.
[27] A Comprehensive Survey on Trustworthiness in Reasoning with Large Language Models
Yanbo Wang, Yongcan Yu, Jian Liang, Ran He
Main category: cs.CL
TL;DR: Survey paper analyzing how Chain-of-Thought reasoning affects LLM trustworthiness across five dimensions: truthfulness, safety, robustness, fairness, and privacy, finding that while CoT improves some aspects, it introduces new vulnerabilities.
Details
Motivation: To provide a comprehensive understanding of how CoT-based reasoning affects language model trustworthiness, as this area remains underdeveloped despite CoT's advancements in performance.Method: Survey and analysis of recent work on reasoning models and CoT techniques, organized chronologically across five trustworthiness dimensions with detailed methodology and limitation analyses.
Result: Reasoning techniques enhance trustworthiness through hallucination mitigation and robustness improvement, but cutting-edge reasoning models suffer from comparable or greater vulnerabilities in safety, robustness, and privacy.
Conclusion: This work serves as a valuable resource for the AI safety community to stay informed on reasoning trustworthiness progress, highlighting both benefits and vulnerabilities of CoT techniques.
Abstract: The development of Long-CoT reasoning has advanced LLM performance across various tasks, including language understanding, complex problem solving, and code generation. This paradigm enables models to generate intermediate reasoning steps, thereby improving both accuracy and interpretability. However, despite these advancements, a comprehensive understanding of how CoT-based reasoning affects the trustworthiness of language models remains underdeveloped. In this paper, we survey recent work on reasoning models and CoT techniques, focusing on five core dimensions of trustworthy reasoning: truthfulness, safety, robustness, fairness, and privacy. For each aspect, we provide a clear and structured overview of recent studies in chronological order, along with detailed analyses of their methodologies, findings, and limitations. Future research directions are also appended at the end for reference and discussion. Overall, while reasoning techniques hold promise for enhancing model trustworthiness through hallucination mitigation, harmful content detection, and robustness improvement, cutting-edge reasoning models themselves often suffer from comparable or even greater vulnerabilities in safety, robustness, and privacy. By synthesizing these insights, we hope this work serves as a valuable and timely resource for the AI safety community to stay informed on the latest progress in reasoning trustworthiness. A full list of related papers can be found at \href{https://github.com/ybwang119/Awesome-reasoning-safety}{https://github.com/ybwang119/Awesome-reasoning-safety}.
[28] False Sense of Security: Why Probing-based Malicious Input Detection Fails to Generalize
Cheng Wang, Zeming Wei, Qin Liu, Muhao Chen
Main category: cs.CL
TL;DR: Probing-based safety detection methods in LLMs fail because they learn superficial patterns like instructional formats and trigger words rather than semantic harmfulness, creating a false sense of security.
Details
Motivation: To systematically examine why probing-based approaches for detecting harmful instructions in LLMs perform poorly out-of-distribution and may create false security.Method: Conducted controlled experiments including: comparing n-gram methods with probing, using semantically cleaned datasets, and detailed pattern dependency analysis to identify what probes actually learn.
Result: Probes learn superficial patterns (instructional patterns and trigger words) rather than semantic harmfulness, explaining poor out-of-distribution performance and questioning current safety detection approaches.
Conclusion: Current probing-based safety approaches provide a false sense of security; both models and evaluation protocols need redesign for responsible safety research in LLMs.
Abstract: Large Language Models (LLMs) can comply with harmful instructions, raising serious safety concerns despite their impressive capabilities. Recent work has leveraged probing-based approaches to study the separability of malicious and benign inputs in LLMs’ internal representations, and researchers have proposed using such probing methods for safety detection. We systematically re-examine this paradigm. Motivated by poor out-of-distribution performance, we hypothesize that probes learn superficial patterns rather than semantic harmfulness. Through controlled experiments, we confirm this hypothesis and identify the specific patterns learned: instructional patterns and trigger words. Our investigation follows a systematic approach, progressing from demonstrating comparable performance of simple n-gram methods, to controlled experiments with semantically cleaned datasets, to detailed analysis of pattern dependencies. These results reveal a false sense of security around current probing-based approaches and highlight the need to redesign both models and evaluation protocols, for which we provide further discussions in the hope of suggesting responsible further research in this direction. We have open-sourced the project at https://github.com/WangCheng0116/Why-Probe-Fails.
[29] MobileRAG: Enhancing Mobile Agent with Retrieval-Augmented Generation
Gowen Loo, Chang Liu, Qinghong Yin, Xiang Chen, Jiawei Chen, Jingyuan Zhang, Yu Tian
Main category: cs.CL
TL;DR: MobileRAG is a mobile agent framework enhanced by Retrieval-Augmented Generation (RAG) that addresses limitations of current LLM-based mobile agents by improving accuracy, enabling external environment interaction, and adding memory capabilities.
Details
Motivation: Current LLM-based mobile agents suffer from three main issues: heavy reliance on LLM comprehension leading to errors, lack of interaction with external environments causing task termination, and absence of memory capabilities requiring task reconstruction for each instruction.Method: Proposes MobileRAG framework with three components: InterRAG, LocalRAG, and MemRAG. Uses Retrieval-Augmented Generation to quickly identify user queries and accomplish complex mobile tasks. Also introduces MobileRAG-Eval benchmark for comprehensive evaluation.
Result: Extensive experiments on MobileRAG-Eval show 10.3% improvement over state-of-the-art methods with fewer operational steps, demonstrating effective handling of real-world mobile tasks.
Conclusion: MobileRAG successfully addresses the limitations of current mobile agents by leveraging RAG technology, providing a more robust and efficient framework for complex mobile task automation.
Abstract: Smartphones have become indispensable in people’s daily lives, permeating nearly every aspect of modern society. With the continuous advancement of large language models (LLMs), numerous LLM-based mobile agents have emerged. These agents are capable of accurately parsing diverse user queries and automatically assisting users in completing complex or repetitive operations. However, current agents 1) heavily rely on the comprehension ability of LLMs, which can lead to errors caused by misoperations or omitted steps during tasks, 2) lack interaction with the external environment, often terminating tasks when an app cannot fulfill user queries, and 3) lack memory capabilities, requiring each instruction to reconstruct the interface and being unable to learn from and correct previous mistakes. To alleviate the above issues, we propose MobileRAG, a mobile agents framework enhanced by Retrieval-Augmented Generation (RAG), which includes InterRAG, LocalRAG, and MemRAG. It leverages RAG to more quickly and accurately identify user queries and accomplish complex and long-sequence mobile tasks. Additionally, to more comprehensively assess the performance of MobileRAG, we introduce MobileRAG-Eval, a more challenging benchmark characterized by numerous complex, real-world mobile tasks that require external knowledge assistance. Extensive experimental results on MobileRAG-Eval demonstrate that MobileRAG can easily handle real-world mobile tasks, achieving 10.3% improvement over state-of-the-art methods with fewer operational steps. Our code is publicly available at: https://github.com/liuxiaojieOutOfWorld/MobileRAG_arxiv
[30] MTQA:Matrix of Thought for Enhanced Reasoning in Complex Question Answering
Fengxiao Tang, Yufeng Li, Zongzong Wu, Ming Zhao
Main category: cs.CL
TL;DR: Matrix of Thought (MoT) framework enhances LLM reasoning for complex QA through multi-dimensional thinking and fact-correction mechanisms, achieving state-of-the-art performance with 14.4% of baseline reasoning time.
Details
Motivation: Address limitations of existing methods like Chain-of-Thought and Tree-of-Thought which suffer from redundancy and single-path reasoning, and improve RAG's effectiveness with multi-entity, multi-hop information.Method: Proposes Matrix of Thought (MoT) with column-cell communication for horizontal and vertical exploration, plus fact-correction mechanism using knowledge units from KG triples and raw text.
Result: Outperforms SOTA methods on four datasets in F1 and EM scores, with reasoning time reduced to only 14.4% of baseline methods.
Conclusion: MoT framework enables efficient and accurate complex QA through multi-dimensional reasoning and enhanced knowledge utilization, demonstrating both performance and efficiency improvements.
Abstract: Complex Question Answering (QA) is a fundamental and challenging task in NLP. While large language models (LLMs) exhibit impressive performance in QA, they suffer from significant performance degradation when facing complex and abstract QA tasks due to insufficient reasoning capabilities. Works such as Chain-of-Thought (CoT) and Tree-of-Thought (ToT) aim to enhance LLMs’ reasoning abilities, but they face issues such as in-layer redundancy in tree structures and single paths in chain structures. Although some studies utilize Retrieval-Augmented Generation (RAG) methods to assist LLMs in reasoning, the challenge of effectively utilizing large amounts of information involving multiple entities and hops remains critical. To address this, we propose the Matrix of Thought (MoT), a novel and efficient LLM thought structure. MoT explores the problem in both horizontal and vertical dimensions through the “column-cell communication” mechanism, enabling LLMs to actively engage in multi-strategy and deep-level thinking, reducing redundancy within the column cells and enhancing reasoning capabilities. Furthermore, we develop a fact-correction mechanism by constructing knowledge units from retrieved knowledge graph triples and raw text to enhance the initial knowledge for LLM reasoning and correct erroneous answers. This leads to the development of an efficient and accurate QA framework (MTQA). Experimental results show that our framework outperforms state-of-the-art methods on four widely-used datasets in terms of F1 and EM scores, with reasoning time only 14.4% of the baseline methods, demonstrating both its efficiency and accuracy. The code for this framework is available at https://github.com/lyfiter/mtqa.
[31] Decoding the Poetic Language of Emotion in Korean Modern Poetry: Insights from a Human-Labeled Dataset and AI Modeling
Iro Lim, Haein Ji, Byungjun Kim
Main category: cs.CL
TL;DR: KPoEM dataset enables emotion analysis in Korean poetry using fine-tuned language models, achieving significant performance improvements over general models.
Details
Motivation: Korean poetry remains underexplored in computational emotion analysis due to its figurative language and cultural specificity, creating a gap in literary computational analysis.Method: Created multi-label emotion dataset with 7,662 entries (line-level and work-level) annotated with 44 fine-grained emotion categories, then fine-tuned Korean language models through sequential training on general corpora and the KPoEM dataset.
Result: Fine-tuned model achieved 0.60 F1-micro score, significantly outperforming previous models (0.34 F1-micro) trained on general corpora, demonstrating enhanced ability to identify culturally specific emotional expressions.
Conclusion: The study bridges computational methods and literary analysis, enabling quantitative exploration of poetic emotions while preserving Korean cultural nuances through structured data.
Abstract: This study introduces KPoEM (Korean Poetry Emotion Mapping) , a novel dataset for computational emotion analysis in modern Korean poetry. Despite remarkable progress in text-based emotion classification using large language models, poetry-particularly Korean poetry-remains underexplored due to its figurative language and cultural specificity. We built a multi-label emotion dataset of 7,662 entries, including 7,007 line-level entries from 483 poems and 615 work-level entries, annotated with 44 fine-grained emotion categories from five influential Korean poets. A state-of-the-art Korean language model fine-tuned on this dataset significantly outperformed previous models, achieving 0.60 F1-micro compared to 0.34 from models trained on general corpora. The KPoEM model, trained through sequential fine-tuning-first on general corpora and then on the KPoEM dataset-demonstrates not only an enhanced ability to identify temporally and culturally specific emotional expressions, but also a strong capacity to preserve the core sentiments of modern Korean poetry. This study bridges computational methods and literary analysis, presenting new possibilities for the quantitative exploration of poetic emotions through structured data that faithfully retains the emotional and cultural nuances of Korean literature.
[32] SelfAug: Mitigating Catastrophic Forgetting in Retrieval-Augmented Generation via Distribution Self-Alignment
Yuqing Huang, Rongyang Zhang, Qimeng Wang, Chengqiang Lu, Yan Gao, Yi Wu, Yao Hu, Xuyang Zhi, Guiquan Liu, Xin Li, Hao Wang, Enhong Chen
Main category: cs.CL
TL;DR: SelfAug is a self-distribution alignment method that mitigates catastrophic forgetting in LLM fine-tuning by aligning input sequence logits to preserve the model’s original semantic distribution, achieving better balance between downstream performance and general capability retention.
Details
Motivation: Supervised fine-tuning in RAG scenarios often causes catastrophic forgetting where models lose previously acquired knowledge and general capabilities. Existing solutions have limitations in preserving the model's original distribution or require access to general instruction data.Method: Proposes SelfAug, a self-distribution alignment method that aligns input sequence logits to preserve the model’s semantic distribution during fine-tuning, preventing distribution shifts that cause catastrophic forgetting.
Result: Extensive experiments show SelfAug achieves superior balance between downstream learning and general capability retention. Empirical analysis reveals direct correlation between distribution shifts and catastrophic forgetting severity in RAG scenarios.
Conclusion: SelfAug provides a practical solution to mitigate catastrophic forgetting in RAG fine-tuning contexts, advancing understanding of distribution shifts and offering applicability across diverse fine-tuning scenarios.
Abstract: Recent advancements in large language models (LLMs) have revolutionized natural language processing through their remarkable capabilities in understanding and executing diverse tasks. While supervised fine-tuning, particularly in Retrieval-Augmented Generation (RAG) scenarios, effectively enhances task-specific performance, it often leads to catastrophic forgetting, where models lose their previously acquired knowledge and general capabilities. Existing solutions either require access to general instruction data or face limitations in preserving the model’s original distribution. To overcome these limitations, we propose SelfAug, a self-distribution alignment method that aligns input sequence logits to preserve the model’s semantic distribution, thereby mitigating catastrophic forgetting and improving downstream performance. Extensive experiments demonstrate that SelfAug achieves a superior balance between downstream learning and general capability retention. Our comprehensive empirical analysis reveals a direct correlation between distribution shifts and the severity of catastrophic forgetting in RAG scenarios, highlighting how the absence of RAG capabilities in general instruction tuning leads to significant distribution shifts during fine-tuning. Our findings not only advance the understanding of catastrophic forgetting in RAG contexts but also provide a practical solution applicable across diverse fine-tuning scenarios. Our code is publicly available at https://github.com/USTC-StarTeam/SelfAug.
[33] SPFT-SQL: Enhancing Large Language Model for Text-to-SQL Parsing by Self-Play Fine-Tuning
Yuhao Zhang, Shaoming Duan, Jinhang Su, Chuanyi Liu, Peiyi Han
Main category: cs.CL
TL;DR: SPFT-SQL improves self-play fine-tuning for Text-to-SQL by adding verification-based iterative fine-tuning and error-driven loss to handle opponent model’s incorrect outputs.
Details
Motivation: SPIN faces challenges in Text-to-SQL because it doesn't generate new information and correct SQL queries from opponent models reduce the main model's accuracy.Method: Proposes SPFT-SQL with two phases: 1) verification-based iterative fine-tuning to create high-quality data and build model base, 2) self-play with error-driven loss that incentivizes learning from opponent’s incorrect outputs.
Result: Outperforms state-of-the-art methods on six LLMs and five benchmarks through extensive experiments.
Conclusion: SPFT-SQL effectively addresses SPIN’s limitations in Text-to-SQL by incorporating verification and error-driven learning, achieving superior performance.
Abstract: Despite the significant advancements of self-play fine-tuning (SPIN), which can transform a weak large language model (LLM) into a strong one through competitive interactions between models of varying capabilities, it still faces challenges in the Text-to-SQL task. SPIN does not generate new information, and the large number of correct SQL queries produced by the opponent model during self-play reduces the main model’s ability to generate accurate SQL queries. To address this challenge, we propose a new self-play fine-tuning method tailored for the Text-to-SQL task, called SPFT-SQL. Prior to self-play, we introduce a verification-based iterative fine-tuning approach, which synthesizes high-quality fine-tuning data iteratively based on the database schema and validation feedback to enhance model performance, while building a model base with varying capabilities. During the self-play fine-tuning phase, we propose an error-driven loss method that incentivizes incorrect outputs from the opponent model, enabling the main model to distinguish between correct SQL and erroneous SQL generated by the opponent model, thereby improving its ability to generate correct SQL. Extensive experiments and in-depth analyses on six open-source LLMs and five widely used benchmarks demonstrate that our approach outperforms existing state-of-the-art (SOTA) methods.
[34] CANDY: Benchmarking LLMs’ Limitations and Assistive Potential in Chinese Misinformation Fact-Checking
Ruiling Guo, Xinwei Yang, Chen Huang, Tong Zhang, Yong Hu
Main category: cs.CL
TL;DR: CANDY benchmark evaluates LLMs for Chinese misinformation fact-checking, finding current models unreliable alone but potentially useful as assistive tools.
Details
Motivation: To systematically assess LLMs' capabilities and limitations in fact-checking Chinese misinformation, as their effectiveness remains uncertain despite growing use.Method: Created CANDY benchmark with ~20k annotated instances, tested LLMs with chain-of-thought reasoning and few-shot prompting, developed taxonomy to categorize flawed explanations.
Result: Current LLMs show limitations in generating accurate fact-checking conclusions, with factual fabrication being the most common failure mode. LLMs alone are unreliable for fact-checking.
Conclusion: While LLMs are not reliable standalone fact-checkers, they have considerable potential to augment human performance when used as assistive tools in fact-checking scenarios.
Abstract: The effectiveness of large language models (LLMs) to fact-check misinformation remains uncertain, despite their growing use. To this end, we present CANDY, a benchmark designed to systematically evaluate the capabilities and limitations of LLMs in fact-checking Chinese misinformation. Specifically, we curate a carefully annotated dataset of ~20k instances. Our analysis shows that current LLMs exhibit limitations in generating accurate fact-checking conclusions, even when enhanced with chain-of-thought reasoning and few-shot prompting. To understand these limitations, we develop a taxonomy to categorize flawed LLM-generated explanations for their conclusions and identify factual fabrication as the most common failure mode. Although LLMs alone are unreliable for fact-checking, our findings indicate their considerable potential to augment human performance when deployed as assistive tools in scenarios. Our dataset and code can be accessed at https://github.com/SCUNLP/CANDY
[35] Exploring NLP Benchmarks in an Extremely Low-Resource Setting
Ulin Nuha, Adam Jatowt
Main category: cs.CL
TL;DR: This paper creates synthetic NLP datasets for the endangered Ladin language by translating Italian data, improving machine translation performance and providing the first publicly available sentiment analysis and MCQA resources for Ladin.
Details
Motivation: LLMs perform poorly on low-resource languages like Ladin due to lack of labeled data. There's limited high-quality NLP datasets for indigenous languages, making it difficult to develop language technologies for these underrepresented languages.Method: Used parallel Ladin-Italian sentence pairs to create synthetic datasets for sentiment analysis and MCQA by translating monolingual Italian data. Applied rigorous filtering and back-translation procedures to ensure linguistic quality and reliability.
Result: Incorporating synthetic datasets into machine translation training led to substantial improvements over existing Italian-Ladin translation baselines. Created the first publicly available sentiment analysis and MCQA datasets for Ladin.
Conclusion: The approach successfully addresses the data scarcity problem for endangered languages by creating synthetic datasets, establishing foundational resources that can support broader NLP research and downstream applications for underrepresented languages like Ladin.
Abstract: The effectiveness of Large Language Models (LLMs) diminishes for extremely low-resource languages, such as indigenous languages, primarily due to the lack of labeled data. Despite growing interest, the availability of high-quality natural language processing (NLP) datasets for these languages remains limited, making it difficult to develop robust language technologies. This paper addresses such gap by focusing on Ladin, an endangered Romance language, specifically targeting the Val Badia variant. Leveraging a small set of parallel Ladin-Italian sentence pairs, we create synthetic datasets for sentiment analysis and multiple-choice question answering (MCQA) by translating monolingual Italian data. To ensure linguistic quality and reliability, we apply rigorous filtering and back-translation procedures in our method. We further demonstrate that incorporating these synthetic datasets into machine translation training leads to substantial improvements over existing Italian-Ladin translation baselines. Our contributions include the first publicly available sentiment analysis and MCQA datasets for Ladin, establishing foundational resources that can support broader NLP research and downstream applications for this underrepresented language.
[36] Expanding Foundational Language Capabilities in Open-Source LLMs through a Korean Case Study
Junghwan Lim, Gangwon Jo, Sungmin Lee, Jiyoung Park, Dongseok Kim, Jihwan Kim, Junhyeok Lee, Wai Ting Cheung, Dahye Choi, Kibong Choi, Jaeyeon Huh, Beomgyu Kim, Jangwoong Kim, Taehyun Kim, Haesol Lee, Jeesoo Lee, Dongpin Oh, Changseok Song, Daewon Suh
Main category: cs.CL
TL;DR: Llama-3-Motif is a 102B parameter language model built on Llama 3 architecture, specifically enhanced for Korean language while maintaining strong English performance using advanced training techniques.
Details
Motivation: To develop a language model that excels in Korean language capabilities while preserving strong English performance, addressing the need for high-quality Korean language models comparable to state-of-the-art models like GPT-4.Method: Built on Llama 3 architecture using advanced training techniques including LlamaPro and Masked Structure Growth for efficient scaling. Trained on MoAI platform across hyperscale GPU clusters with a carefully curated balanced dataset of Korean and English data.
Result: Shows decent performance on Korean-specific benchmarks, outperforms existing models, and achieves results comparable to GPT-4.
Conclusion: Llama-3-Motif successfully demonstrates that specialized language models can achieve state-of-the-art performance in specific languages while maintaining strong capabilities in other languages through advanced training techniques and balanced data curation.
Abstract: We introduce Llama-3-Motif, a language model consisting of 102 billion parameters, specifically designed to enhance Korean capabilities while retaining strong performance in English. Developed on the Llama 3 architecture, Llama-3-Motif employs advanced training techniques, including LlamaPro and Masked Structure Growth, to effectively scale the model without altering its core Transformer architecture. Using the MoAI platform for efficient training across hyperscale GPU clusters, we optimized Llama-3-Motif using a carefully curated dataset that maintains a balanced ratio of Korean and English data. Llama-3-Motif shows decent performance on Korean-specific benchmarks, outperforming existing models and achieving results comparable to GPT-4.
[37] RTQA : Recursive Thinking for Complex Temporal Knowledge Graph Question Answering with Large Language Models
Zhaoyan Gong, Juan Li, Zhiqiang Liu, Lei Liang, Huajun Chen, Wen Zhang
Main category: cs.CL
TL;DR: RTQA is a novel framework for temporal knowledge graph question answering that uses recursive decomposition and LLMs to handle complex temporal queries without training, achieving state-of-the-art performance.
Details
Motivation: Current TKGQA methods struggle with complex temporal queries, limited reasoning abilities, and error propagation in decomposition frameworks, requiring a more robust solution.Method: RTQA recursively decomposes questions into sub-problems, solves them bottom-up using LLMs and TKG knowledge, and employs multi-path answer aggregation for fault tolerance through three core components: Temporal Question Decomposer, Recursive Solver, and Answer Aggregator.
Result: Experiments on MultiTQ and TimelineKGQA benchmarks show significant Hits@1 improvements in ‘Multiple’ and ‘Complex’ categories, outperforming state-of-the-art methods.
Conclusion: RTQA effectively addresses limitations of existing TKGQA methods by enhancing reasoning capabilities without requiring training, demonstrating superior performance on complex temporal queries.
Abstract: Current temporal knowledge graph question answering (TKGQA) methods primarily focus on implicit temporal constraints, lacking the capability of handling more complex temporal queries, and struggle with limited reasoning abilities and error propagation in decomposition frameworks. We propose RTQA, a novel framework to address these challenges by enhancing reasoning over TKGs without requiring training. Following recursive thinking, RTQA recursively decomposes questions into sub-problems, solves them bottom-up using LLMs and TKG knowledge, and employs multi-path answer aggregation to improve fault tolerance. RTQA consists of three core components: the Temporal Question Decomposer, the Recursive Solver, and the Answer Aggregator. Experiments on MultiTQ and TimelineKGQA benchmarks demonstrate significant Hits@1 improvements in “Multiple” and “Complex” categories, outperforming state-of-the-art methods. Our code and data are available at https://github.com/zjukg/RTQA.
[38] On Robustness and Reliability of Benchmark-Based Evaluation of LLMs
Riccardo Lunardi, Vincenzo Della Mea, Stefano Mizzaro, Kevin Roitero
Main category: cs.CL
TL;DR: LLMs show significant performance drops when benchmark questions are paraphrased, challenging the reliability of standard evaluations and highlighting robustness issues with linguistic variability.
Details
Motivation: To assess whether standard benchmark evaluations reliably measure LLM capabilities in real-world scenarios where questions appear in diverse linguistic forms rather than fixed standardized formats.Method: Systematically generated various paraphrases of all questions across six common benchmarks and measured effectiveness variations across 34 state-of-the-art LLMs of different sizes and capabilities.
Result: While LLM rankings remained relatively stable across paraphrased inputs, absolute effectiveness scores significantly declined, indicating models struggle with linguistic variability.
Conclusion: Standard benchmark evaluations may not fully capture real-world robustness, necessitating the development of robustness-aware benchmarks that better reflect practical deployment scenarios with linguistic variations.
Abstract: Large Language Models (LLMs) effectiveness is usually evaluated by means of benchmarks such as MMLU, ARC-C, or HellaSwag, where questions are presented in their original wording, thus in a fixed, standardized format. However, real-world applications involve linguistic variability, requiring models to maintain their effectiveness across diverse rewordings of the same question or query. In this study, we systematically assess the robustness of LLMs to paraphrased benchmark questions and investigate whether benchmark-based evaluations provide a reliable measure of model capabilities. We systematically generate various paraphrases of all the questions across six different common benchmarks, and measure the resulting variations in effectiveness of 34 state-of-the-art LLMs, of different size and effectiveness. Our findings reveal that while LLM rankings remain relatively stable across paraphrased inputs, absolute effectiveness scores change, and decline significantly. This suggests that LLMs struggle with linguistic variability, raising concerns about their generalization abilities and evaluation methodologies. Furthermore, the observed performance drop challenges the reliability of benchmark-based evaluations, indicating that high benchmark scores may not fully capture a model’s robustness to real-world input variations. We discuss the implications of these findings for LLM evaluation methodologies, emphasizing the need for robustness-aware benchmarks that better reflect practical deployment scenarios.
[39] What if I ask in \textit{alia lingua}? Measuring Functional Similarity Across Languages
Debangan Mishra, Arihant Rastogi, Agyeya Negi, Shashwat Goel, Ponnurangam Kumaraguru
Main category: cs.CL
TL;DR: Models show increasing cross-lingual consistency as they grow larger and more capable, with self-consistency across languages exceeding inter-model agreement within the same language.
Details
Motivation: To understand how similar model outputs are across different languages and whether model size and capability affect cross-lingual consistency.Method: Used the κ_p similarity metric to analyze 20 languages and 47 subjects in GlobalMMLU, comparing model responses across languages and between different models.
Result: Model responses become more consistent across languages as model size and capability increase. Models show greater cross-lingual consistency within themselves than agreement with other models in the same language.
Conclusion: κ_p is a valuable tool for evaluating multilingual reliability and can guide development of more consistent multilingual systems, demonstrating that larger models achieve better cross-lingual consistency.
Abstract: How similar are model outputs across languages? In this work, we study this question using a recently proposed model similarity metric $\kappa_p$ applied to 20 languages and 47 subjects in GlobalMMLU. Our analysis reveals that a model’s responses become increasingly consistent across languages as its size and capability grow. Interestingly, models exhibit greater cross-lingual consistency within themselves than agreement with other models prompted in the same language. These results highlight not only the value of $\kappa_p$ as a practical tool for evaluating multilingual reliability, but also its potential to guide the development of more consistent multilingual systems.
[40] A RoBERTa-Based Functional Syntax Annotation Model for Chinese Texts
Han Xiaohui, Zhang Yunlong, Guo Yuxi
Main category: cs.CL
TL;DR: First automated Chinese functional syntax annotation system using RoBERTa, achieving 0.852 F1 score for identifying syntactic elements like Subject, Verb, and Complement.
Details
Motivation: Lack of automatic annotation system for Chinese texts based on Systemic Functional Grammar/Cardiff Grammar theory, which limits application and promotion of these linguistic theories.Method: Fine-tuned RoBERTa-Chinese wwm-ext model on 4,100 annotated sentences from People’s Daily 2014 corpus for named entity recognition of functional syntax elements.
Result: Achieved 0.852 F1 score on test set, significantly outperforming other models. Excellent performance on core elements (Subject, Main Verb, Complement) but room for improvement on imbalanced labels.
Conclusion: First integration of functional syntax with attention-based NLP models, providing new method for automated Chinese functional syntax analysis and foundation for future research.
Abstract: Systemic Functional Grammar and its branch, Cardiff Grammar, have been widely applied to discourse analysis, semantic function research, and other tasks across various languages and texts. However, an automatic annotation system based on this theory for Chinese texts has not yet been developed, which significantly constrains the application and promotion of relevant theories. To fill this gap, this research introduces a functional syntax annotation model for Chinese based on RoBERTa (Robustly Optimized BERT Pretraining Approach). The study randomly selected 4,100 sentences from the People’s Daily 2014 corpus and annotated them according to functional syntax theory to establish a dataset for training. The study then fine-tuned the RoBERTa-Chinese wwm-ext model based on the dataset to implement the named entity recognition task, achieving an F1 score of 0.852 on the test set that significantly outperforms other comparative models. The model demonstrated excellent performance in identifying core syntactic elements such as Subject (S), Main Verb (M), and Complement (C). Nevertheless, there remains room for improvement in recognizing entities with imbalanced label samples. As the first integration of functional syntax with attention-based NLP models, this research provides a new method for automated Chinese functional syntax analysis and lays a solid foundation for subsequent studies.
[41] Synthesizing Sheet Music Problems for Evaluation and Reinforcement Learning
Zhilin Wang, Zhe Yang, Yun Luo, Yafu Li, Haoran Zhang, Runzhe Zhan, Derek F. Wong, Jizhe Zhou, Yu Cheng
Main category: cs.CL
TL;DR: First framework for synthesizing verifiable sheet music problems based on music theory, creating SSMR-Bench evaluation benchmark and training data for RLVR that improves LLM/MLLM performance on music reasoning and composition.
Details
Motivation: Address the lack of evaluation benchmarks and training data for sheet music reasoning in LLMs/MLLMs, which is crucial for developing AI musicians.Method: Developed a data synthesis framework that generates verifiable sheet music questions in textual and visual modalities, used for creating SSMR-Bench evaluation benchmark and training data for reinforcement learning with verifiable rewards (RLVR).
Result: Qwen3-8B-Base and Qwen2.5-VL-Instruct showed improvements on SSMR-Bench through RLVR. Qwen3-8B-Base surpassed GPT-4 on MusicTheoryBench and achieved comparable reasoning to GPT-4 with enhanced strategies. Performance on math problems also improved, and enhanced reasoning facilitated music composition.
Conclusion: First to propose music theory-based sheet music problem synthesis, demonstrating effectiveness in advancing model reasoning for sheet music understanding and enabling new possibilities for AI-assisted music creation.
Abstract: Enhancing the ability of Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs) to interpret sheet music is a crucial step toward building AI musicians. However, current research lacks both evaluation benchmarks and training data for sheet music reasoning. To address this, we propose the idea of synthesizing sheet music problems grounded in music theory, which can serve both as evaluation benchmarks and as training data for reinforcement learning with verifiable rewards (RLVR). We introduce a data synthesis framework that generates verifiable sheet music questions in both textual and visual modalities, leading to the Synthetic Sheet Music Reasoning Benchmark (SSMR-Bench) and a complementary training set. Evaluation results on SSMR-Bench show the importance of models’ reasoning abilities in interpreting sheet music. At the same time, the poor performance of Gemini 2.5-Pro highlights the challenges that MLLMs still face in interpreting sheet music in a visual format. By leveraging synthetic data for RLVR, Qwen3-8B-Base and Qwen2.5-VL-Instruct achieve improvements on the SSMR-Bench. Besides, the trained Qwen3-8B-Base surpasses GPT-4 in overall performance on MusicTheoryBench and achieves reasoning performance comparable to GPT-4 with the strategies of Role play and Chain-of-Thought. Notably, its performance on math problems also improves relative to the original Qwen3-8B-Base. Furthermore, our results show that the enhanced reasoning ability can also facilitate music composition. In conclusion, we are the first to propose the idea of synthesizing sheet music problems based on music theory rules, and demonstrate its effectiveness not only in advancing model reasoning for sheet music understanding but also in unlocking new possibilities for AI-assisted music creation.
[42] Arabic Chatbot Technologies in Education: An Overview
Hicham Bourhil, Yacine El Younoussi
Main category: cs.CL
TL;DR: Survey of Arabic educational chatbots showing limited adoption of modern AI techniques compared to other languages, with identified research gaps and future directions.
Details
Motivation: The COVID-19 pandemic accelerated e-learning adoption, creating demand for AI-powered educational tools. While chatbots using modern LLMs like BERT and GPT have succeeded in languages like English, Arabic educational chatbots lag behind in adopting these advanced techniques.Method: Conducted a comprehensive survey of existing Arabic chatbots in education, analyzing their characteristics including adopted approaches, language variety, and performance metrics.
Result: Found that only a few educational Arabic chatbots utilize modern AI techniques despite the success of similar approaches in other languages. Identified specific research gaps in the field.
Conclusion: There is significant untapped potential for Arabic educational chatbots using modern LLM techniques. The study provides direction for future research to bridge this gap and enhance Arabic language educational technology.
Abstract: The recent advancements in Artificial Intelligence (AI) in general, and in Natural Language Processing (NLP) in particular, and some of its applications such as chatbots, have led to their implementation in different domains like education, healthcare, tourism, and customer service. Since the COVID-19 pandemic, there has been an increasing interest in these digital technologies to allow and enhance remote access. In education, e-learning systems have been massively adopted worldwide. The emergence of Large Language Models (LLM) such as BERT (Bidirectional Encoder Representations from Transformers) and GPT (Generative Pre-trained Transformers) made chatbots even more popular. In this study, we present a survey on existing Arabic chatbots in education and their different characteristics such as the adopted approaches, language variety, and metrics used to measure their performance. We were able to identified some research gaps when we discovered that, despite the success of chatbots in other languages such as English, only a few educational Arabic chatbots used modern techniques. Finally, we discuss future directions of research in this field.
[43] Improving Narrative Classification and Explanation via Fine Tuned Language Models
Rishit Tyagi, Rahul Bouri, Mohit Gupta
Main category: cs.CL
TL;DR: This study develops a system for detecting and explaining covert narratives in news articles using fine-tuned BERT for multi-label classification and GPT-4o with ReACT framework for evidence-based explanations, enhanced by auxiliary knowledge to improve accuracy.
Details
Motivation: Traditional NLP methods struggle with detecting subtle phrasing and hidden agendas in news content, making it difficult to analyze bias and sentiment through covert narratives and implicit messaging.Method: Fine-tune BERT model with recall-oriented approach for narrative detection, then use GPT-4o pipeline with ReACT framework (Reasoning + Acting) and semantic retrieval-based few-shot prompting for explanations, incorporating structured taxonomy table as auxiliary knowledge base.
Result: Integration of auxiliary knowledge in prompts improves both classification accuracy and justification reliability, demonstrating effective narrative detection and explanation capabilities.
Conclusion: The proposed approach successfully addresses multi-label narrative classification and evidence-based explanation generation, with applications in media analysis, education, and intelligence gathering.
Abstract: Understanding covert narratives and implicit messaging is essential for analyzing bias and sentiment. Traditional NLP methods struggle with detecting subtle phrasing and hidden agendas. This study tackles two key challenges: (1) multi-label classification of narratives and sub-narratives in news articles, and (2) generating concise, evidence-based explanations for dominant narratives. We fine-tune a BERT model with a recall-oriented approach for comprehensive narrative detection, refining predictions using a GPT-4o pipeline for consistency. For narrative explanation, we propose a ReACT (Reasoning + Acting) framework with semantic retrieval-based few-shot prompting, ensuring grounded and relevant justifications. To enhance factual accuracy and reduce hallucinations, we incorporate a structured taxonomy table as an auxiliary knowledge base. Our results show that integrating auxiliary knowledge in prompts improves classification accuracy and justification reliability, with applications in media analysis, education, and intelligence gathering.
[44] Towards Stable and Personalised Profiles for Lexical Alignment in Spoken Human-Agent Dialogue
Keara Schaaij, Roel Boumans, Tibor Bosse, Iris Hendrickx
Main category: cs.CL
TL;DR: Study investigates constructing stable personalized lexical profiles from minimal spoken data (10 minutes) for enabling lexical alignment in conversational agents, finding optimal profiles with 5 adjectives/conjunctions and 10 adverbs/nouns/pronouns/verbs each.
Details
Motivation: Lexical alignment contributes to successful human communication but remains underexplored in conversational agents despite recent LLM advancements. Need practical methods for enabling lexical alignment in human-agent dialogue.Method: Varied amounts of transcribed spoken data and number of items per POS category to construct lexical profiles. Evaluated profile performance over time using recall, coverage, and cosine similarity metrics.
Result: Smaller, more compact profiles created from 10 min of transcribed speech (5 adjectives/conjunctions, 10 adverbs/nouns/pronouns/verbs each) offered best balance of performance and data efficiency.
Conclusion: Provides practical insights for constructing stable personalized lexical profiles with minimal data requirements, serving as foundational step toward lexical alignment strategies in conversational agents.
Abstract: Lexical alignment, where speakers start to use similar words across conversation, is known to contribute to successful communication. However, its implementation in conversational agents remains underexplored, particularly considering the recent advancements in large language models (LLMs). As a first step towards enabling lexical alignment in human-agent dialogue, this study draws on strategies for personalising conversational agents and investigates the construction of stable, personalised lexical profiles as a basis for lexical alignment. Specifically, we varied the amounts of transcribed spoken data used for construction as well as the number of items included in the profiles per part-of-speech (POS) category and evaluated profile performance across time using recall, coverage, and cosine similarity metrics. It was shown that smaller and more compact profiles, created after 10 min of transcribed speech containing 5 items for adjectives, 5 items for conjunctions, and 10 items for adverbs, nouns, pronouns, and verbs each, offered the best balance in both performance and data efficiency. In conclusion, this study offers practical insights into constructing stable, personalised lexical profiles, taking into account minimal data requirements, serving as a foundational step toward lexical alignment strategies in conversational agents.
[45] MultiWikiQA: A Reading Comprehension Benchmark in 300+ Languages
Dan Saattrup Smart
Main category: cs.CL
TL;DR: MultiWikiQA is a new multilingual reading comprehension dataset covering 306 languages, using Wikipedia articles as context with LLM-generated questions and verbatim answers from the articles.
Details
Motivation: To create a comprehensive multilingual reading comprehension benchmark that spans a wide range of languages (306) to evaluate language models' performance across diverse linguistic contexts.Method: Used Wikipedia articles as context data, generated questions using a large language model (LLM), and ensured answers appear verbatim in the articles. Conducted crowdsourced human evaluation of question fluency across 30 languages.
Result: Human evaluation showed the generated questions are of good quality. Evaluation of 6 different language models (both decoder and encoder models of varying sizes) revealed the benchmark is sufficiently difficult with large performance discrepancies across languages.
Conclusion: MultiWikiQA provides a challenging multilingual reading comprehension benchmark that demonstrates significant performance variations across languages, with the dataset and survey evaluations made freely available for research use.
Abstract: We introduce a new reading comprehension dataset, dubbed MultiWikiQA, which covers 306 languages. The context data comes from Wikipedia articles, with questions generated by an LLM and the answers appearing verbatim in the Wikipedia articles. We conduct a crowdsourced human evaluation of the fluency of the generated questions across 30 of the languages, providing evidence that the questions are of good quality. We evaluate 6 different language models, both decoder and encoder models of varying sizes, showing that the benchmark is sufficiently difficult and that there is a large performance discrepancy amongst the languages. The dataset and survey evaluations are freely available.
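Because answers are constrained to appear verbatim in the source article, a simple string-matching filter can validate LLM-generated QA pairs. A sketch of that validation step follows; the field names are illustrative, not the dataset's actual schema.

```python
def find_verbatim_answer(context: str, answer: str) -> int:
    """Return the character offset of the answer in the article, or -1.
    Matching is case-sensitive here; a real pipeline might normalize whitespace."""
    return context.find(answer)

def filter_qa_pairs(article: str, qa_pairs: list[dict]) -> list[dict]:
    """Keep only generated pairs whose answer occurs verbatim in the article."""
    kept = []
    for qa in qa_pairs:
        start = find_verbatim_answer(article, qa["answer"])
        if start >= 0:
            kept.append({**qa, "answer_start": start})
    return kept

article = "Copenhagen is the capital of Denmark."
pairs = [{"question": "What is the capital of Denmark?", "answer": "Copenhagen"},
         {"question": "Bad pair?", "answer": "Stockholm"}]
print(filter_qa_pairs(article, pairs))  # only the first pair survives
```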
[46] Joint Modeling of Entities and Discourse Relations for Coherence Assessment
Wei Liu, Michael Strube
Main category: cs.CL
TL;DR: Joint modeling of entity features and discourse relations significantly improves coherence assessment performance compared to using either feature type alone.
Details
Motivation: Most existing coherence modeling work focuses exclusively on either entity features or discourse relations, with little attention to combining both approaches.
Method: Two methods for jointly modeling entities and discourse relations for coherence assessment were explored and tested.
Result: Experiments on three benchmark datasets showed that integrating both types of features significantly enhances coherence model performance.
Conclusion: Modeling both entity features and discourse relations simultaneously provides substantial benefits for coherence evaluation.
Abstract: In linguistics, coherence can be achieved by different means, such as by maintaining reference to the same set of entities across sentences and by establishing discourse relations between them. However, most existing work on coherence modeling focuses exclusively on either entity features or discourse relation features, with little attention given to combining the two. In this study, we explore two methods for jointly modeling entities and discourse relations for coherence assessment. Experiments on three benchmark datasets show that integrating both types of features significantly enhances the performance of coherence models, highlighting the benefits of modeling both simultaneously for coherence evaluation.
[47] MAGneT: Coordinated Multi-Agent Generation of Synthetic Multi-Turn Mental Health Counseling Sessions
Aishik Mandal, Tanmoy Chakraborty, Iryna Gurevych
Main category: cs.CL
TL;DR: MAGneT is a multi-agent framework that generates high-quality synthetic psychological counseling sessions by decomposing counselor response generation into specialized sub-tasks, outperforming existing methods in quality, diversity, and therapeutic alignment.
Details
Motivation: The growing demand for scalable psychological counseling requires fine-tuning LLMs with high-quality, privacy-compliant data, but such data remains scarce, creating a need for better synthetic data generation methods.
Method: A multi-agent framework that decomposes counselor response generation into coordinated sub-tasks handled by specialized LLM agents, each modeling a key psychological technique, with a unified evaluation framework integrating diverse automatic and expert metrics across nine aspects.
Result: MAGneT significantly outperforms existing methods, improving general counseling skills by 3.2% and CBT-specific skills by 4.3% on CTRS, with experts preferring MAGneT-generated sessions in 77.2% of cases. Fine-tuning on MAGneT data shows 6.3% and 7.3% improvements respectively.
Conclusion: MAGneT provides a superior approach for generating high-quality synthetic psychological counseling data that better captures real counseling structure and nuance, enabling better fine-tuning of open-source LLMs for scalable mental health applications.
Abstract: The growing demand for scalable psychological counseling highlights the need for fine-tuning open-source Large Language Models (LLMs) with high-quality, privacy-compliant data, yet such data remains scarce. Here we introduce MAGneT, a novel multi-agent framework for synthetic psychological counseling session generation that decomposes counselor response generation into coordinated sub-tasks handled by specialized LLM agents, each modeling a key psychological technique. Unlike prior single-agent approaches, MAGneT better captures the structure and nuance of real counseling. In addition, we address inconsistencies in prior evaluation protocols by proposing a unified evaluation framework integrating diverse automatic and expert metrics. Furthermore, we expand the expert evaluations from four aspects of counseling in previous works to nine aspects, enabling a more thorough and robust assessment of data quality. Empirical results show that MAGneT significantly outperforms existing methods in quality, diversity, and therapeutic alignment of the generated counseling sessions, improving general counseling skills by 3.2% and CBT-specific skills by 4.3% on average on cognitive therapy rating scale (CTRS). Crucially, experts prefer MAGneT-generated sessions in 77.2% of cases on average across all aspects. Moreover, fine-tuning an open-source model on MAGneT-generated sessions shows better performance, with improvements of 6.3% on general counseling skills and 7.3% on CBT-specific skills on average on CTRS over those fine-tuned with sessions generated by baseline methods. We also make our code and data public.
[48] Explicit and Implicit Data Augmentation for Social Event Detection
Congbo Ma, Yuxia Wang, Jia Wu, Jian Yang, Jing Du, Zitai Qiu, Qing Li, Hu Wang, Preslav Nakov
Main category: cs.CL
TL;DR: SED-Aug is a dual augmentation framework that combines explicit text-based and implicit feature-space augmentation to improve social event detection without requiring additional labeled data.
Details
Motivation: Social event detection relies on labeled data which is costly and labor-intensive to obtain, creating a need for methods that can enhance model performance without additional annotations.
Method: Combines explicit text augmentation using large language models with five generation strategies, and implicit feature-space augmentation with five novel perturbation techniques on structural fused embeddings that preserve semantic and relational properties.
Result: Outperforms best baseline by 17.67% on Twitter2012 and 15.57% on Twitter2018 datasets in terms of average F1 score.
Conclusion: SED-Aug effectively enhances data diversity and model robustness through dual augmentation, significantly improving social event detection performance without requiring additional labeled data.
Abstract: Social event detection involves identifying and categorizing important events from social media, which relies on labeled data, but annotation is costly and labor-intensive. To address this problem, we propose Augmentation framework for Social Event Detection (SED-Aug), a plug-and-play dual augmentation framework, which combines explicit text-based and implicit feature-space augmentation to enhance data diversity and model robustness. The explicit augmentation utilizes large language models to enhance textual information through five diverse generation strategies. For implicit augmentation, we design five novel perturbation techniques that operate in the feature space on structural fused embeddings. These perturbations are crafted to keep the semantic and relational properties of the embeddings and make them more diverse. Specifically, SED-Aug outperforms the best baseline model by approximately 17.67% on the Twitter2012 dataset and by about 15.57% on the Twitter2018 dataset in terms of the average F1 score. The code is available at GitHub: https://github.com/congboma/SED-Aug.
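The summary does not spell out the five perturbation techniques, so the sketch below shows two generic feature-space augmentations of the kind described (small Gaussian noise and mixup-style interpolation), applied to toy fused embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)

def gaussian_perturb(emb: np.ndarray, sigma: float = 0.05) -> np.ndarray:
    """Add small isotropic noise; magnitude kept low to preserve semantics."""
    return emb + rng.normal(0.0, sigma, size=emb.shape)

def mixup_perturb(emb: np.ndarray, other: np.ndarray, lam: float = 0.9) -> np.ndarray:
    """Interpolate toward another same-class embedding (mixup-style)."""
    return lam * emb + (1.0 - lam) * other

# Toy fused embeddings for two messages of the same event class.
e1, e2 = rng.normal(size=128), rng.normal(size=128)
augmented = [gaussian_perturb(e1), mixup_perturb(e1, e2)]
print(len(augmented), augmented[0].shape)
```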
[49] Inverse IFEval: Can LLMs Unlearn Stubborn Training Conventions to Follow Real Instructions?
Qinyan Zhang, Xinping Lei, Ruijie Miao, Yu Fu, Haojie Fan, Le Chang, Jiafan Hou, Dingling Zhang, Zhongfei Hou, Ziqiang Yang, Changxin Pu, Fei Hu, Jingkai Liu, Mengyun Liu, Yang Liu, Xiang Gao, Jiaheng Liu, Tong Yang, Zaiyuan Wang, Ge Zhang, Wenhao Huang
Main category: cs.CL
TL;DR: Inverse IFEval is a benchmark that evaluates LLMs’ counter-intuitive ability to override training biases and follow adversarial instructions, revealing cognitive inertia issues in current models.
Details
Motivation: LLMs often struggle with cognitive inertia - they fail to follow instructions that conflict with standardized patterns learned during supervised fine-tuning, limiting their adaptability in real-world scenarios.
Method: Proposed Inverse IFEval benchmark with 8 challenge types (Question Correction, Intentional Textual Flaws, etc.), created 1012 high-quality Chinese/English questions across 23 domains using human-in-the-loop pipeline, and evaluated using optimized LLM-as-a-Judge framework.
Result: Experiments on leading LLMs demonstrate the necessity of the benchmark, showing models’ limitations in counter-intuitive reasoning and adaptability to unconventional contexts.
Conclusion: Future alignment efforts should focus not only on fluency and factual correctness but also on adaptability, and Inverse IFEval can serve as both diagnostic tool and foundation for developing methods to mitigate cognitive inertia and enhance instruction-following reliability.
Abstract: Large Language Models (LLMs) achieve strong performance on diverse tasks but often exhibit cognitive inertia, struggling to follow instructions that conflict with the standardized patterns learned during supervised fine-tuning (SFT). To evaluate this limitation, we propose Inverse IFEval, a benchmark that measures models' Counter-intuitive Ability, i.e., their capacity to override training-induced biases and comply with adversarial instructions. Inverse IFEval introduces eight types of such challenges, including Question Correction, Intentional Textual Flaws, Code without Comments, and Counterfactual Answering. Using a human-in-the-loop pipeline, we construct a dataset of 1012 high-quality Chinese and English questions across 23 domains, evaluated under an optimized LLM-as-a-Judge framework. Experiments on existing leading LLMs demonstrate the necessity of our proposed Inverse IFEval benchmark. Our findings emphasize that future alignment efforts should not only pursue fluency and factual correctness but also account for adaptability under unconventional contexts. We hope that Inverse IFEval serves as both a diagnostic tool and a foundation for developing methods that mitigate cognitive inertia, reduce overfitting to narrow patterns, and ultimately enhance the instruction-following reliability of LLMs in diverse and unpredictable real-world scenarios.
[50] Facts Fade Fast: Evaluating Memorization of Outdated Medical Knowledge in Large Language Models
Juraj Vladika, Mahdi Dhaini, Florian Matthes
Main category: cs.CL
TL;DR: LLMs in healthcare risk providing outdated medical advice due to static training data. Two new QA datasets (MedRevQA and MedChangeQA) reveal consistent reliance on outdated knowledge across 8 prominent LLMs, with analysis of training data and strategies to address this issue.
Details
Motivation: LLMs show great potential for healthcare but their static training data poses risks when medical knowledge evolves, potentially leading to harmful advice and failed clinical reasoning.
Method: Created two novel QA datasets from systematic reviews: MedRevQA (16,501 QA pairs) and MedChangeQA (512 QA pairs where medical consensus changed). Evaluated 8 prominent LLMs on these datasets to assess outdated knowledge reliance.
Result: All 8 LLMs showed consistent reliance on outdated medical knowledge. Analysis revealed influence of obsolete pre-training data and training strategies contributing to this phenomenon.
Conclusion: The study highlights the critical need for developing more current and reliable medical AI systems, with proposed future directions for mitigating outdated knowledge issues in LLMs for healthcare applications.
Abstract: The growing capabilities of Large Language Models (LLMs) show significant potential to enhance healthcare by assisting medical researchers and physicians. However, their reliance on static training data is a major risk when medical recommendations evolve with new research and developments. When LLMs memorize outdated medical knowledge, they can provide harmful advice or fail at clinical reasoning tasks. To investigate this problem, we introduce two novel question-answering (QA) datasets derived from systematic reviews: MedRevQA (16,501 QA pairs covering general biomedical knowledge) and MedChangeQA (a subset of 512 QA pairs where medical consensus has changed over time). Our evaluation of eight prominent LLMs on the datasets reveals consistent reliance on outdated knowledge across all models. We additionally analyze the influence of obsolete pre-training data and training strategies to explain this phenomenon and propose future directions for mitigation, laying the groundwork for developing more current and reliable medical AI systems.
[51] Measuring Bias or Measuring the Task: Understanding the Brittle Nature of LLM Gender Biases
Bufan Gao, Elisa Kreiss
Main category: cs.CL
TL;DR: LLM gender bias evaluations are highly sensitive to prompt design, with minor changes significantly altering bias measurements and sometimes reversing outcomes, raising concerns about ecological validity of benchmarks.
Details
Motivation: As LLMs are increasingly used in socially impactful settings, there are growing concerns about gender bias and efforts to measure/mitigate it, but current evaluation methods may not reflect natural language distributions.
Method: Tested models under different prompt conditions that make testing context and gender-focused content salient, using four task formats with both token-probability and discrete-choice metrics.
Result: Even minor prompt changes can substantially alter bias outcomes, sometimes reversing direction entirely. Discrete-choice metrics tend to amplify bias relative to probabilistic measures.
Conclusion: Findings highlight brittleness of LLM gender bias evaluations and raise questions about whether controlled testing triggers 'testing mode' performance, challenging ecological validity of future benchmarks.
Abstract: As LLMs are increasingly applied in socially impactful settings, concerns about gender bias have prompted growing efforts both to measure and mitigate such bias. These efforts often rely on evaluation tasks that differ from natural language distributions, as they typically involve carefully constructed task prompts that overtly or covertly signal the presence of gender bias-related content. In this paper, we examine how signaling the evaluative purpose of a task impacts measured gender bias in LLMs. Concretely, we test models under prompt conditions that (1) make the testing context salient, and (2) make gender-focused content salient. We then assess prompt sensitivity across four task formats with both token-probability and discrete-choice metrics. We find that even minor prompt changes can substantially alter bias outcomes, sometimes reversing their direction entirely. Discrete-choice metrics further tend to amplify bias relative to probabilistic measures. These findings not only highlight the brittleness of LLM gender bias evaluations but also open a new puzzle for the NLP benchmarking and development community: to what extent can well-controlled testing designs trigger LLM "testing mode" performance, and what does this mean for the ecological validity of future benchmarks?
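The contrast between token-probability and discrete-choice readouts can be illustrated with a toy example. The 'he'/'she' completion pair and the log-probabilities below are hypothetical, but they show how a near-tie in probability becomes a full vote under a discrete metric, one way such metrics can amplify measured bias.

```python
import math

def prob_metric(logprobs: dict[str, float]) -> float:
    """Soft bias score: P('he') - P('she') after normalizing over the two options."""
    p_he, p_she = math.exp(logprobs["he"]), math.exp(logprobs["she"])
    z = p_he + p_she
    return p_he / z - p_she / z

def choice_metric(logprobs: dict[str, float]) -> int:
    """Discrete readout: +1 if the model would pick 'he', -1 for 'she'."""
    return 1 if logprobs["he"] > logprobs["she"] else -1

# Hypothetical model outputs for one prompt: a near-tie in probability
# becomes a full +1 vote under the discrete metric.
lp = {"he": -1.30, "she": -1.32}
print(prob_metric(lp), choice_metric(lp))  # ~0.01 vs 1
```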
[52] Can Language Models Handle a Non-Gregorian Calendar?
Mutsumi Sasaki, Go Kamoda, Ryosuke Takahashi, Kosuke Sato, Kentaro Inui, Keisuke Sakaguchi, Benjamin Heinzerling
Main category: cs.CL
TL;DR: Evaluation of language models’ ability to handle Japanese calendar tasks, revealing that even Japanese-centric models struggle with calendar arithmetic and consistency.
Details
Motivation: Most prior work on temporal reasoning in LMs has focused solely on the Gregorian calendar, ignoring culturally specific non-Gregorian systems like Japanese, Hijri, and Hebrew calendars that are actively used.
Method: Created datasets for four tasks requiring temporal knowledge and reasoning, then evaluated a range of English-centric and Japanese-centric language models on their Japanese calendar handling capabilities.
Result: Some models can perform basic calendar conversions, but even Japanese-centric models struggle with Japanese-calendar arithmetic and maintaining consistency across different calendar systems.
Conclusion: The findings highlight the need to develop language models with better culture-specific calendar understanding capabilities beyond just Gregorian calendar systems.
Abstract: Temporal reasoning and knowledge are essential capabilities for language models (LMs). While much prior work has analyzed and improved temporal reasoning in LMs, most studies have focused solely on the Gregorian calendar. However, many non-Gregorian systems, such as the Japanese, Hijri, and Hebrew calendars, are in active use and reflect culturally grounded conceptions of time. If and how well current LMs can accurately handle such non-Gregorian calendars has not been evaluated so far. Here, we present a systematic evaluation of how well open-source LMs handle one such non-Gregorian system: the Japanese calendar. For our evaluation, we create datasets for four tasks that require both temporal knowledge and temporal reasoning. Evaluating a range of English-centric and Japanese-centric LMs, we find that some models can perform calendar conversions, but even Japanese-centric models struggle with Japanese-calendar arithmetic and with maintaining consistency across calendars. Our results highlight the importance of developing LMs that are better equipped for culture-specific calendar understanding.
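For reference, the Gregorian-to-Japanese conversion the models are tested on can be grounded in a few lines. This sketch works at year granularity only; real era boundaries fall mid-year, so boundary years need month/day handling.

```python
# Modern Japanese era start years (year granularity; exact boundaries
# fall mid-year, e.g. Heisei began on 8 January 1989).
ERAS = [("Reiwa", 2019), ("Heisei", 1989), ("Showa", 1926),
        ("Taisho", 1912), ("Meiji", 1868)]

def to_japanese(year: int) -> str:
    for name, start in ERAS:
        if year >= start:
            n = year - start + 1
            return f"{name} {'Gannen' if n == 1 else n}"  # Gannen = first year
    raise ValueError("Year precedes the Meiji era")

print(to_japanese(2025))  # Reiwa 7
print(to_japanese(1989))  # Heisei Gannen
```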
[53] MyProfessors: Mining Turkish Student Reviews
Ibrahim Faruk Ceylan, Necmettin Bera Calik
Main category: cs.CL
TL;DR: Hocalarim is the largest Turkish student review dataset with 5000+ professor reviews, analyzing rating patterns, institutional impact, and student bias.
Details
Motivation: To create a comprehensive Turkish-language dataset for analyzing student feedback on professors and understand rating behaviors across different educational contexts.
Method: Collected over 5000 online student reviews with 1-5 star ratings across different educational aspects, then performed statistical analysis to examine institutional type impact and student bias correlation.
Result: The dataset provides insights into how institution type affects student ratings and reveals patterns in students’ tendencies to give positive or negative feedback.
Conclusion: Hocalarim serves as a valuable resource for Turkish NLP research and educational analysis, enabling deeper understanding of student evaluation patterns in Turkish higher education.
Abstract: We introduce Hocalarim (MyProfessors), the largest student review dataset available for the Turkish language. It consists of over 5000 professor reviews left online by students, with different aspects of education rated on a scale of 1 to 5 stars. We investigate the properties of the dataset and present its statistics. We examine the impact of students’ institution type on their ratings and the correlation of students’ bias to give positive or negative feedback.
[54] Mitigating Bias in Text Classification via Prompt-Based Text Transformation
Charmaine Barker, Dimitar Kazakov
Main category: cs.CL
TL;DR: Prompt-based text rewriting reduces demographic signals in language while preserving meaning, offering a practical approach to mitigate bias in text classification.
Details
Motivation: Language models can learn and rely on linguistic signals that correlate with protected characteristics, potentially leading to biased outcomes in automated decision-making systems.
Method: Using ChatGPT to rewrite text through simplification, neutralisation, localisation, and formalisation techniques to reduce demographic signals while maintaining semantic content.
Result: Significant drop in location classification accuracy across multiple models after transformation, while sentiment analysis and rating prediction tasks showed preserved meaning integrity.
Conclusion: Prompt-based rewriting provides a practical and generalizable method for reducing bias in text classification by minimizing reliance on group-specific linguistic cues.
Abstract: The presence of specific linguistic signals particular to a certain sub-group can become highly salient to language models during training. In automated decision-making settings, this may lead to biased outcomes when models rely on cues that correlate with protected characteristics. We investigate whether prompting ChatGPT to rewrite text using simplification, neutralisation, localisation, and formalisation can reduce demographic signals while preserving meaning. Experimental results show a statistically significant drop in location classification accuracy across multiple models after transformation, suggesting reduced reliance on group-specific language. At the same time, sentiment analysis and rating prediction tasks confirm that the core meaning of the reviews remains greatly intact. These results suggest that prompt-based rewriting offers a practical and generalisable approach for mitigating bias in text classification.
[55] Exploring Linguistic Features for Turkish Text Readability
Ahmet Yavuz Uluslu, Gerold Schneider
Main category: cs.CL
TL;DR: First comprehensive study on automatic readability assessment for Turkish texts combining neural networks with multi-level linguistic features
Details
Motivation: To develop an advanced readability tool for Turkish and evaluate traditional vs modern methods while identifying key linguistic determinants of readability.
Method: Combines state-of-the-art neural network models with linguistic features at lexical, morphological, syntactic and discourse levels.
Result: Developed an advanced readability assessment tool and identified key linguistic features that determine Turkish text readability
Conclusion: Presents the first comprehensive readability assessment system for Turkish, demonstrating the effectiveness of combining neural networks with multi-level linguistic analysis
Abstract: This paper presents the first comprehensive study on automatic readability assessment of Turkish texts. We combine state-of-the-art neural network models with linguistic features at lexical, morphological, syntactic and discourse levels to develop an advanced readability tool. We evaluate the effectiveness of traditional readability formulas compared to modern automated methods and identify key linguistic features that determine the readability of Turkish texts.
[56] R2C2-Coder: Enhancing and Benchmarking Real-world Repository-level Code Completion Abilities of Code Large Language Models
Ken Deng, Jiaheng Liu, He Zhu, Congnan Liu, Jingxin Li, Jiakai Wang, Peng Zhao, Chenchen Zhang, Yanan Wu, Xueqiao Yin, Yuanxing Zhang, Zizheng Zhan, Wenbo Su, Bangyu Xiang, Tiezheng Ge, Bo Zheng
Main category: cs.CL
TL;DR: R2C2-Coder enhances repository-level code completion by better utilizing project context through retrieval-based prompt construction and creates a more challenging benchmark with context perturbation.
Details
Motivation: Existing repository-level code completion methods fail to fully utilize extensive project context like relevant files and class hierarchies, and current benchmarks are limited in reflecting real-world completion abilities.
Method: Proposes R2C2-Coder with two components: R2C2-Enhance constructs completion prompts by building candidate retrieval pools and retrieving context for each cursor position; R2C2-Bench creates a challenging benchmark using context perturbation to simulate real-world scenarios.
Result: Extensive results on multiple benchmarks demonstrate the effectiveness of the proposed R2C2-Coder approach.
Conclusion: The R2C2-Coder framework successfully addresses limitations in existing repository-level code completion by better utilizing project context and providing a more realistic benchmark for evaluating model performance.
Abstract: Code completion models have made significant progress in recent years. Recently, repository-level code completion has drawn more attention in modern software development, and several baseline methods and benchmarks have been proposed. However, existing repository-level code completion methods often fall short of fully using the extensive context of a project repository, such as the intricacies of relevant files and class hierarchies. Besides, the existing benchmarks usually focus on limited code completion scenarios, which cannot adequately reflect the repository-level code completion abilities of existing methods. To address these limitations, we propose R2C2-Coder to enhance and benchmark the real-world repository-level code completion abilities of code Large Language Models, where R2C2-Coder includes a code prompt construction method R2C2-Enhance and a well-designed benchmark R2C2-Bench. Specifically, in R2C2-Enhance, we first construct the candidate retrieval pool and then assemble the completion prompt by retrieving from the retrieval pool for each completion cursor position. Second, based on R2C2-Enhance, we can construct a more challenging and diverse R2C2-Bench with training, validation and test splits, where a context perturbation strategy is proposed to simulate real-world repository-level code completion well. Extensive results on multiple benchmarks demonstrate the effectiveness of our R2C2-Coder.
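The retrieve-then-assemble pattern behind R2C2-Enhance can be sketched as follows; the Jaccard similarity and the prompt template are simplified stand-ins for the paper's actual retrieval pool construction and prompt format.

```python
def similarity(a: str, b: str) -> float:
    """Jaccard token overlap as a cheap stand-in for a learned retriever."""
    ta, tb = set(a.split()), set(b.split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def build_completion_prompt(cursor_context: str, repo_snippets: list[str],
                            k: int = 2) -> str:
    """Retrieve the k most similar snippets from the candidate pool and
    place them ahead of the local context at the completion cursor."""
    ranked = sorted(repo_snippets,
                    key=lambda s: similarity(cursor_context, s), reverse=True)
    retrieved = "\n".join(f"# Retrieved context:\n{s}" for s in ranked[:k])
    return f"{retrieved}\n# Complete the following code:\n{cursor_context}"

pool = ["def parse_config(path): ...", "class HttpClient: ...",
        "def load_config(): ..."]
print(build_completion_prompt("def save_config(path):", pool))
```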
[57] DynaSaur: Large Language Agents Beyond Predefined Actions
Dang Nguyen, Viet Dac Lai, Seunghyun Yoon, Ryan A. Rossi, Handong Zhao, Ruiyi Zhang, Puneet Mathur, Nedim Lipka, Yu Wang, Trung Bui, Franck Dernoncourt, Tianyi Zhou
Main category: cs.CL
TL;DR: LLM agent framework that dynamically creates and executes programs instead of using fixed predefined actions, improving flexibility and performance in open-ended environments.
Details
Motivation: Existing LLM agent systems with fixed action sets are limited in open-world scenarios, restricting planning capabilities and requiring impractical human effort to enumerate all possible actions.
Method: Proposes a framework where agents generate and execute programs in a general-purpose programming language, with actions accumulated and reused over time.
Result: Extensive experiments show significant improvements in flexibility and performance over fixed action set methods, enabling adaptation and recovery in scenarios with insufficient predefined actions.
Conclusion: Dynamic action creation through program generation outperforms traditional fixed action sets, making LLM agents more capable in complex, open-ended environments.
Abstract: Existing LLM agent systems typically select actions from a fixed and predefined set at every step. While this approach is effective in closed, narrowly scoped environments, it presents two major challenges for real-world, open-ended scenarios: (1) it significantly restricts the planning and acting capabilities of LLM agents, and (2) it requires substantial human effort to enumerate and implement all possible actions, which is impractical in complex environments with a vast number of potential actions. To address these limitations, we propose an LLM agent framework that can dynamically create and compose actions as needed. In this framework, the agent interacts with its environment by generating and executing programs written in a general-purpose programming language. Moreover, generated actions are accumulated over time for future reuse. Our extensive experiments across multiple benchmarks show that this framework significantly improves flexibility and outperforms prior methods that rely on a fixed action set. Notably, it enables LLM agents to adapt and recover in scenarios where predefined actions are insufficient or fail due to unforeseen edge cases. Our code can be found in https://github.com/adobe-research/dynasaur.
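A minimal sketch of the generate-execute-accumulate loop described above: model-emitted Python is executed and the resulting function is stored for reuse on later steps. A real agent would sandbox the execution; the word_count action is a hypothetical example of generated code.

```python
action_library = {}  # accumulated actions, reusable on later steps

def register_action(source: str, name: str):
    """Execute model-generated Python and store the defined function.
    exec() into a private namespace; trusted toy input only, no sandboxing."""
    ns = {}
    exec(source, ns)
    action_library[name] = ns[name]

# A (hypothetical) action the agent generated on a previous step.
register_action("def word_count(text):\n    return len(text.split())",
                "word_count")

# Later steps reuse accumulated actions instead of regenerating them.
print(action_library["word_count"]("dynamic action creation at work"))  # 5
```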
[58] ACING: Actor-Critic for Instruction Learning in Black-Box LLMs
Salma Kharrat, Fares Fourati, Marco Canini
Main category: cs.CL
TL;DR: ACING is an actor-critic RL framework that automatically optimizes instructions for black-box LLMs, achieving better performance than human-written prompts in 76% of tasks with up to 33-point improvements.
Details
Motivation: Manual instruction crafting for LLMs requires substantial human effort, and optimizing instructions is challenging for black-box LLMs where model parameters and gradients are inaccessible.
Method: Actor-critic reinforcement learning framework that formulates instruction optimization as a stateless, continuous-action problem, enabling exploration of infinite instruction spaces using only black-box feedback.
Result: Outperforms human-written prompts in 76% of instruction-induction tasks, with gains up to 33 points and 10-point median improvement over best automatic baseline across 33 tasks spanning instruction-induction, summarization, and chain-of-thought reasoning.
Conclusion: ACING provides an effective automated approach for instruction optimization in black-box LLMs, demonstrating robustness and efficiency through extensive ablations.
Abstract: The effectiveness of Large Language Models (LLMs) in solving tasks depends significantly on the quality of their instructions, which often require substantial human effort to craft. This underscores the need for automated instruction optimization. However, optimizing instructions is particularly challenging when working with black-box LLMs, where model parameters and gradients are inaccessible. We introduce ACING, an actor-critic reinforcement learning framework that formulates instruction optimization as a stateless, continuous-action problem, enabling exploration of infinite instruction spaces using only black-box feedback. ACING automatically discovers prompts that outperform human-written prompts in 76% of instruction-induction tasks, with gains of up to 33 points and a 10-point median improvement over the best automatic baseline in 33 tasks spanning instruction-induction, summarization, and chain-of-thought reasoning. Extensive ablations highlight its robustness and efficiency. An implementation of ACING is available at https://github.com/salmakh1/ACING.
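ACING itself trains an actor-critic over a continuous instruction space; the sketch below shows only the outer black-box loop it relies on, where candidate instructions are scored by task feedback alone and the best is kept. The scoring function and mutation set are toy stand-ins for querying an LLM on a validation set.

```python
import random

random.seed(0)

def score_instruction(instruction: str) -> float:
    """Stand-in for querying the black-box LLM on a validation set and
    returning a scalar task score; no gradients are available."""
    text = instruction.lower()
    return sum(kw in text for kw in ("step", "concise", "verify")) / 3

def mutate(instruction: str) -> str:
    """Propose a nearby candidate instruction."""
    edits = [" Think step by step.", " Be concise.", " Verify your answer."]
    return instruction + random.choice(edits)

best = "Answer the question."
best_score = score_instruction(best)
for _ in range(20):  # simple hill climbing over the instruction space
    candidate = mutate(best)
    score = score_instruction(candidate)
    if score > best_score:
        best, best_score = candidate, score
print(round(best_score, 2), best)
```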
[59] Small Changes, Large Consequences: Analyzing the Allocational Fairness of LLMs in Hiring Contexts
Preethi Seshadri, Hongyu Chen, Sameer Singh, Seraphina Goldfarb-Tarrant
Main category: cs.CL
TL;DR: LLM-based hiring systems show significant fairness issues in resume summarization and applicant ranking tasks, with notable biases across demographic groups particularly for race, and high sensitivity to both demographic and non-demographic perturbations.
Details
Motivation: To examine allocational fairness of LLM-based hiring systems in generative and retrieval settings, as these models are increasingly deployed in high-stakes applications like hiring but their potential for unfair decision-making remains understudied.
Method: Constructed synthetic resume dataset with controlled perturbations and curated job postings to investigate whether model behavior differs across demographic groups through two HR tasks: resume summarization and applicant ranking.
Result: Generated summaries show meaningful differences more frequently for race than gender perturbations. Models display non-uniform retrieval selection patterns across demographic groups and high ranking sensitivity to both gender and race perturbations. Retrieval models show comparable sensitivity to demographic and non-demographic changes.
Conclusion: LLM-based hiring systems, especially in retrieval stage, can exhibit notable biases leading to discriminatory outcomes, with fairness issues potentially stemming from broader model brittleness rather than just demographic-specific biases.
Abstract: Large language models (LLMs) are increasingly being deployed in high-stakes applications like hiring, yet their potential for unfair decision-making remains understudied in generative and retrieval settings. In this work, we examine the allocational fairness of LLM-based hiring systems through two tasks that reflect actual HR usage: resume summarization and applicant ranking. By constructing a synthetic resume dataset with controlled perturbations and curating job postings, we investigate whether model behavior differs across demographic groups. Our findings reveal that generated summaries exhibit meaningful differences more frequently for race than for gender perturbations. Models also display non-uniform retrieval selection patterns across demographic groups and exhibit high ranking sensitivity to both gender and race perturbations. Surprisingly, retrieval models can show comparable sensitivity to both demographic and non-demographic changes, suggesting that fairness issues may stem from broader model brittleness. Overall, our results indicate that LLM-based hiring systems, especially in the retrieval stage, can exhibit notable biases that lead to discriminatory outcomes in real-world contexts.
[60] Chain-of-Reasoning: Towards Unified Mathematical Reasoning in Large Language Models via a Multi-Paradigm Perspective
Yiyao Yu, Yuxiang Zhang, Dongdong Zhang, Xiao Liang, Hengyuan Zhang, Xingxing Zhang, Ziyi Yang, Mahmoud Khademi, Hany Awadalla, Junjie Wang, Yujiu Yang, Furu Wei
Main category: cs.CL
TL;DR: CoR-Math-7B introduces Chain-of-Reasoning framework combining Natural Language, Algorithmic, and Symbolic reasoning paradigms with Progressive Paradigm Training, achieving significant improvements over SOTA models including GPT-4o.
Details
Motivation: Current LLMs rely on single-paradigm reasoning which limits effectiveness across diverse mathematical tasks, requiring a unified approach that integrates multiple reasoning methods.
Method: Chain-of-Reasoning (CoR) framework integrates Natural Language Reasoning, Algorithmic Reasoning, and Symbolic Reasoning. Uses Progressive Paradigm Training (PPT) strategy to progressively master paradigms, resulting in CoR-Math-7B model.
Result: Achieves 41.0% absolute improvement over GPT-4o in theorem proving and 15.0% improvement over RL-based methods on MATH benchmark for arithmetic tasks. Enables zero-shot generalization across tasks.
Conclusion: The CoR framework significantly enhances mathematical comprehension by synergistically combining multiple reasoning paradigms, demonstrating superior performance and generalization capabilities compared to existing approaches.
Abstract: Large Language Models (LLMs) have made notable progress in mathematical reasoning, yet often rely on single-paradigm reasoning, limiting their effectiveness across diverse tasks. We introduce Chain-of-Reasoning (CoR), a novel unified framework integrating multiple reasoning paradigms, namely Natural Language Reasoning (NLR), Algorithmic Reasoning (AR), and Symbolic Reasoning (SR), to enable synergistic collaboration. CoR generates multiple potential answers via different reasoning paradigms and synthesizes them into a coherent final solution. We propose a Progressive Paradigm Training (PPT) strategy for models to progressively master these paradigms, leading to CoR-Math-7B. Experimental results demonstrate that CoR-Math-7B significantly outperforms current SOTA models, achieving up to a 41.0% absolute improvement over GPT-4o in theorem proving and a 15.0% improvement over RL-based methods on the MATH benchmark in arithmetic tasks. These results show the enhanced mathematical comprehension ability of our model, enabling zero-shot generalization across tasks.
[61] A Survey of Graph Retrieval-Augmented Generation for Customized Large Language Models
Qinggang Zhang, Shengyuan Chen, Yuanchen Bei, Zheng Yuan, Huachi Zhou, Zijin Hong, Hao Chen, Yilin Xiao, Chuang Zhou, Yi Chang, Xiao Huang
Main category: cs.CL
TL;DR: GraphRAG is a new paradigm that uses graph-structured knowledge representation to overcome limitations of traditional RAG systems, enabling better domain-specific LLM applications through structured retrieval and knowledge integration.
Details
Motivation: Traditional RAG systems based on flat text retrieval struggle with complex query understanding, knowledge integration across distributed sources, and efficiency bottlenecks in professional domains requiring deep expertise.
Method: GraphRAG introduces three key innovations: graph-structured knowledge representation to capture entity relationships, efficient graph-based retrieval techniques for context-preserving knowledge with multihop reasoning, and structure-aware knowledge integration algorithms for logical generation.
Result: GraphRAG revolutionizes domain-specific LLM applications by addressing traditional RAG limitations, providing a systematic framework for professional field customization with improved accuracy and coherence.
Conclusion: GraphRAG represents a promising research direction that enhances LLM capabilities in specialized domains through structured knowledge representation and retrieval, with collected resources available for community development.
Abstract: Large language models (LLMs) have demonstrated remarkable capabilities in a wide range of tasks, yet their application to specialized domains remains challenging due to the need for deep expertise. Retrieval-Augmented Generation (RAG) has emerged as a promising solution to customize LLMs for professional fields by seamlessly integrating external knowledge bases, enabling real-time access to domain-specific expertise during inference. Despite its potential, traditional RAG systems, based on flat text retrieval, face three critical challenges: (i) complex query understanding in professional contexts, (ii) difficulties in knowledge integration across distributed sources, and (iii) system efficiency bottlenecks at scale. This survey presents a systematic analysis of Graph-based Retrieval-Augmented Generation (GraphRAG), a new paradigm that revolutionizes domain-specific LLM applications. GraphRAG addresses traditional RAG limitations through three key innovations: (i) graph-structured knowledge representation that explicitly captures entity relationships and domain hierarchies, (ii) efficient graph-based retrieval techniques that enable context-preserving knowledge retrieval with multihop reasoning ability, and (iii) structure-aware knowledge integration algorithms that leverage retrieved knowledge for accurate and logically coherent generation by LLMs. In this survey, we systematically analyze the technical foundations of GraphRAG and examine current implementations across various professional domains, identifying key technical challenges and promising research directions. All the related resources of GraphRAG, including research papers, open-source data, and projects, are collected for the community in https://github.com/DEEP-PolyU/Awesome-GraphRAG.
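The multihop, context-preserving retrieval that distinguishes GraphRAG from flat text retrieval can be sketched with a toy knowledge graph and a k-hop traversal; the medical triples below are illustrative only.

```python
from collections import deque

# Toy knowledge graph: entity -> list of (relation, neighbor) edges.
GRAPH = {
    "aspirin": [("treats", "headache"), ("class_of", "NSAID")],
    "NSAID": [("risk", "gastric ulcer")],
    "headache": [("symptom_of", "migraine")],
}

def khop_context(seed: str, hops: int = 2) -> list[str]:
    """Collect triples within `hops` of the seed entity, preserving the
    relational structure that flat text retrieval loses."""
    seen, triples = {seed}, []
    queue = deque([(seed, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == hops:
            continue
        for rel, nbr in GRAPH.get(node, []):
            triples.append(f"{node} --{rel}--> {nbr}")
            if nbr not in seen:
                seen.add(nbr)
                queue.append((nbr, depth + 1))
    return triples

print(khop_context("aspirin"))  # includes the 2-hop 'gastric ulcer' fact
```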
[62] An Unsupervised Natural Language Processing Pipeline for Assessing Referral Appropriateness
Vittorio Torri, Annamaria Bottelli, Michele Ercolanoni, Olivia Leoni, Francesca Ieva
Main category: cs.CL
TL;DR: Unsupervised NLP pipeline using Italian medical Transformer embeddings to analyze diagnostic referral appropriateness from free-text data, achieving high performance (93-95% precision/recall) and informing regional healthcare policy.
Details
Motivation: Assessing diagnostic referral appropriateness is challenging when reasons are recorded as free text rather than structured codes, particularly in the Italian NHS where this gap needs addressing to improve healthcare efficiency.
Method: Fully unsupervised NLP pipeline leveraging Transformer-based embeddings pre-trained on Italian medical texts to cluster referral reasons and assess alignment with appropriateness guidelines, tested on two large regional datasets (venous echocolordoppler and colonoscopy referrals).
Result: High performance in identifying referral reasons (92-94% precision, 83-93% recall) and appropriateness assessment (94-95% precision, 92-94% recall), identifying inappropriate referral groups and contextual variations that informed new regional healthcare policy.
Conclusion: Robust, scalable unsupervised NLP pipeline for assessing referral appropriateness in real-world datasets, providing public health authorities with deployable AI tool to monitor practices and support evidence-based policy making.
Abstract: Objective: Assessing the appropriateness of diagnostic referrals is critical for improving healthcare efficiency and reducing unnecessary procedures. However, this task becomes challenging when referral reasons are recorded only as free text rather than structured codes, like in the Italian NHS. To address this gap, we propose a fully unsupervised Natural Language Processing (NLP) pipeline capable of extracting and evaluating referral reasons without relying on labelled datasets. Methods: Our pipeline leverages Transformer-based embeddings pre-trained on Italian medical texts to cluster referral reasons and assess their alignment with appropriateness guidelines. It operates in an unsupervised setting and is designed to generalize across different examination types. We analyzed two complete regional datasets from the Lombardy Region (Italy), covering all referrals between 2019 and 2021 for venous echocolordoppler of the lower limbs (ECD; n=496,971; development) and flexible endoscope colonoscopy (FEC; n=407,949; testing only). For both, a random sample of 1,000 referrals was manually annotated to measure performance. Results: The pipeline achieved high performance in identifying referral reasons (Prec=92.43% (ECD), 93.59% (FEC); Rec=83.28% (ECD), 92.70% (FEC)) and appropriateness (Prec=93.58% (ECD), 94.66% (FEC); Rec=91.52% (ECD), 93.96% (FEC)). At the regional level, the analysis identified relevant inappropriate referral groups and variation across contexts, findings that informed a new Lombardy Region resolution to reinforce guideline adherence. Conclusions: This study presents a robust, scalable, unsupervised NLP pipeline for assessing referral appropriateness in large, real-world datasets. It demonstrates how such data can be effectively leveraged, providing public health authorities with a deployable AI tool to monitor practices and support evidence-based policy.
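The embed-then-cluster core of such a pipeline might look as follows; the random-vector encoder is a stand-in for the Italian medical Transformer the paper uses, and the Italian referral strings are invented examples.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

def embed(texts: list[str]) -> np.ndarray:
    """Stand-in for a Transformer encoder pre-trained on Italian medical
    text; random vectors here just keep the sketch self-contained."""
    return rng.normal(size=(len(texts), 64))

referral_reasons = ["sospetta trombosi venosa", "controllo varici",
                    "screening", "dolore gamba sinistra"]
X = embed(referral_reasons)
labels = KMeans(n_clusters=2, random_state=0, n_init=10).fit_predict(X)

# Each cluster can then be reviewed once against the appropriateness
# guidelines instead of labelling every referral individually.
for reason, label in zip(referral_reasons, labels):
    print(label, reason)
```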
[63] HamRaz: A Culture-Based Persian Conversation Dataset for Person-Centered Therapy Using LLM Agents
Mohammad Amin Abbasi, Farnaz Sadat Mirnezami, Ali Neshati, Hassan Naderi
Main category: cs.CL
TL;DR: HamRaz is a Persian-language mental health dataset using Person-Centered Therapy, combining scripted dialogues with LLM role-playing to capture cultural nuances, with evaluation showing superior performance in empathy and realism.
Details
Motivation: To address the lack of culturally adapted mental health resources for Persian-speaking communities and bridge the gap between language, culture, and AI-assisted mental health support in underrepresented populations.
Method: Combines script-based dialogue with adaptive large language model role-playing to capture ambiguity and emotional nuance. Introduces HamRazEval, a dual-framework evaluation system using general metrics and specialized psychological relationship measures.
Result: Human evaluations show HamRaz outperforms existing baselines in empathy, coherence, and realism, demonstrating effectiveness in culturally adapted mental health support.
Conclusion: This resource contributes to Digital Humanities by providing a culturally sensitive Persian-language dataset that effectively bridges language, culture, and mental health support for underrepresented communities.
Abstract: We present HamRaz, a culturally adapted Persian-language dataset for AI-assisted mental health support, grounded in Person-Centered Therapy (PCT). To reflect real-world therapeutic challenges, we combine script-based dialogue with adaptive large language models (LLM) role-playing, capturing the ambiguity and emotional nuance of Persian-speaking clients. We introduce HamRazEval, a dual-framework for assessing conversational and therapeutic quality using General Metrics and specialized psychological relationship measures. Human evaluations show HamRaz outperforms existing baselines in empathy, coherence, and realism. This resource contributes to the Digital Humanities by bridging language, culture, and mental health in underrepresented communities.
[64] HalluEntity: Benchmarking and Understanding Entity-Level Hallucination Detection
Min-Hsuan Yeh, Max Kamachee, Seongheon Park, Yixuan Li
Main category: cs.CL
TL;DR: The paper introduces HalluEntity, a dataset for entity-level hallucination detection in LLMs, addressing the limitation of existing sentence/paragraph-level approaches. It evaluates 17 LLMs and finds token-level uncertainty methods over-predict while context-aware methods perform better but suboptimally.
Details
Motivation: Current hallucination detection methods operate at sentence/paragraph level, lacking granularity to pinpoint specific hallucinated entities, which is problematic for long-form outputs mixing accurate and fabricated information.
Method: Proposed HalluEntity dataset with entity-level hallucination annotations, then comprehensively evaluated 17 modern LLMs using uncertainty-based approaches including token probability and context-aware methods.
Result: Token-level uncertainty approaches tend to over-predict hallucinations, while context-aware methods show better but still suboptimal performance. Identified relationships between hallucination tendencies and linguistic properties.
Conclusion: Highlights the need for more granular entity-level hallucination detection and provides important directions for future research through the HalluEntity dataset and comprehensive evaluation.
Abstract: To mitigate the impact of the hallucination-prone nature of LLMs, many studies propose detecting hallucinated generation through uncertainty estimation. However, these approaches predominantly operate at the sentence or paragraph level, failing to pinpoint specific spans or entities responsible for hallucinated content. This lack of granularity is especially problematic for long-form outputs that mix accurate and fabricated information. To address this limitation, we explore entity-level hallucination detection. We propose a new dataset, HalluEntity, which annotates hallucination at the entity level. Based on the dataset, we comprehensively evaluate uncertainty-based hallucination detection approaches across 17 modern LLMs. Our experimental results show that uncertainty estimation approaches focusing on individual token probabilities tend to over-predict hallucinations, while context-aware methods show better but still suboptimal performance. Through an in-depth qualitative study, we identify relationships between hallucination tendencies and linguistic properties and highlight important directions for future research. HalluEntity: https://huggingface.co/datasets/samuelyeh/HalluEntity
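A token-probability baseline of the kind evaluated here scores each entity span by aggregating its token log-probabilities. The spans and log-probs below are hypothetical, and the fixed flagging threshold hints at why such methods can over-predict: any span of rarer tokens scores low, hallucinated or not.

```python
import math

def entity_confidence(token_logprobs: list[float]) -> float:
    """Geometric-mean token probability over an entity span; low values
    are taken as a signal that the entity may be hallucinated."""
    return math.exp(sum(token_logprobs) / len(token_logprobs))

# Hypothetical per-token log-probs for two generated entity spans.
spans = {"Marie Curie": [-0.1, -0.2], "Nobel Prize 1921": [-2.3, -1.9, -2.7]}
for entity, logprobs in spans.items():
    conf = entity_confidence(logprobs)
    print(f"{entity}: {conf:.2f}", "FLAG" if conf < 0.3 else "ok")
```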
[65] Autoformalization in the Wild: Assessing LLMs on Real-World Mathematical Definitions
Lan Zhang, Marco Valentino, Andre Freitas
Main category: cs.CL
TL;DR: LLMs struggle with autoformalizing real-world mathematical definitions but show improvement with refinement strategies and definition grounding.
Details
Motivation: Bridge the gap between informal mathematics and formal languages through autoformalization using LLMs, particularly for sophisticated real-world mathematical statements.
Method: Introduce two new datasets (Def_Wiki and Def_ArXiv), evaluate LLMs on formalizing definitions into Isabelle/HOL, and test refinement strategies including external feedback from proof assistants and formal definition grounding.
Result: Definitions present greater challenge than existing benchmarks; LLMs struggle with self-correction and library alignment, but refinement methods improve self-correction by 16% and reduce undefined errors by 43%.
Conclusion: Structured refinement and definition grounding strategies show promise for enhancing LLM-based autoformalization in real-world mathematical scenarios.
Abstract: Thanks to their linguistic capabilities, LLMs offer an opportunity to bridge the gap between informal mathematics and formal languages through autoformalization. However, it is still unclear how well LLMs generalize to sophisticated and naturally occurring mathematical statements. To address this gap, we investigate the task of autoformalizing real-world mathematical definitions: a critical component of mathematical discourse. Specifically, we introduce two novel resources for autoformalization, collecting definitions from Wikipedia (Def_Wiki) and arXiv papers (Def_ArXiv). We then systematically evaluate a range of LLMs, analyzing their ability to formalize definitions into Isabelle/HOL. Furthermore, we investigate strategies to enhance LLMs’ performance including refinement through external feedback from Proof Assistants, and formal definition grounding, where we augment LLMs’ formalizations through relevant contextual elements from formal mathematical libraries. Our findings reveal that definitions present a greater challenge compared to existing benchmarks, such as miniF2F. In particular, we found that LLMs still struggle with self-correction, and aligning with relevant mathematical libraries. At the same time, structured refinement methods and definition grounding strategies yield notable improvements of up to 16% on self-correction capabilities and 43% on the reduction of undefined errors, highlighting promising directions for enhancing LLM-based autoformalization in real-world scenarios.
[66] Improving Chain-of-Thought Reasoning via Quasi-Symbolic Abstractions
Leonardo Ranaldi, Marco Valentino, Andrè Freitas
Main category: cs.CL
TL;DR: QuaSAR improves Chain-of-Thought reasoning by using quasi-symbolic explanations that combine natural language with formal variables/predicates, enhancing robustness and accuracy without full formalization.
Details
Motivation: Traditional CoT reasoning suffers from content biases affecting robustness and faithfulness, while fully symbolic approaches require complete natural language to formal language translation, reducing efficiency and flexibility.
Method: QuaSAR guides LLMs to operate at higher abstraction levels using quasi-symbolic explanations that formalize only relevant variables and predicates, allowing coexistence of symbolic elements with natural language.
Result: Improves CoT-based methods by up to 8% accuracy, enhances robustness and consistency on adversarial variations in both natural language (MMLU-Redux) and symbolic reasoning tasks (GSM-Symbolic).
Conclusion: Quasi-symbolic abstractions provide an effective trade-off between pure natural language reasoning and full formalization, improving reasoning capabilities while maintaining efficiency and flexibility.
Abstract: Chain-of-Thought (CoT) represents a common strategy for reasoning in Large Language Models (LLMs) by decomposing complex tasks into intermediate inference steps. However, explanations generated via CoT are susceptible to content biases that negatively affect their robustness and faithfulness. To mitigate existing limitations, recent work has proposed using logical formalisms coupled with external symbolic solvers. However, fully symbolic approaches face the bottleneck of requiring a complete translation from natural language to formal languages, a process that affects efficiency and flexibility. To achieve a trade-off, this paper investigates methods to disentangle content from logical reasoning without a complete formalisation. In particular, we present QuaSAR (for Quasi-Symbolic Abstract Reasoning), a variation of CoT that guides LLMs to operate at a higher level of abstraction via quasi-symbolic explanations. Our framework leverages the capability of LLMs to formalise only relevant variables and predicates, enabling the coexistence of symbolic elements with natural language. We show the impact of QuaSAR for in-context learning and for constructing demonstrations to improve the reasoning capabilities of smaller models. Our experiments show that quasi-symbolic abstractions can improve CoT-based methods by up to 8% accuracy, enhancing robustness and consistency on challenging adversarial variations on both natural language (i.e. MMLU-Redux) and symbolic reasoning tasks (i.e., GSM-Symbolic).
[67] Rapid Word Learning Through Meta In-Context Learning
Wentao Wang, Guangyuan Jiang, Tal Linzen, Brenden M. Lake
Main category: cs.CL
TL;DR: Minnow is a meta-training method that teaches language models to learn new words from few examples using a placeholder token, achieving performance comparable to large pre-trained models with much less data.
Details
Motivation: Current language models have underexplored abilities for few-shot word learning, while humans can quickly learn and flexibly use new words from limited examples.
Method: Meta-training language models to generate new word usages given few in-context examples, using a special placeholder token to represent new words, repeated across many words to develop general word-learning ability.
Result: Minnow enables strong few-shot word learning comparable to large LLMs pre-trained on orders of magnitude more data, and improves discrimination, syntactic categorization, and generation abilities for new words.
Conclusion: Minnow demonstrates high data efficiency and potential to significantly improve language model performance in word learning tasks with minimal training examples.
Abstract: Humans can quickly learn a new word from a few illustrative examples, and then systematically and flexibly use it in novel contexts. Yet the abilities of current language models for few-shot word learning, and methods for improving these abilities, are underexplored. In this study, we introduce a novel method, Meta-training for IN-context learNing Of Words (Minnow). This method trains language models to generate new examples of a word’s usage given a few in-context examples, using a special placeholder token to represent the new word. This training is repeated on many new words to develop a general word-learning ability. We find that training models from scratch with Minnow on human-scale child-directed language enables strong few-shot word learning, comparable to a large language model (LLM) pre-trained on orders of magnitude more data. Furthermore, through discriminative and generative evaluations, we demonstrate that finetuning pre-trained LLMs with Minnow improves their ability to discriminate between new words, identify syntactic categories of new words, and generate reasonable new usages and definitions for new words, based on one or a few in-context examples. These findings highlight the data efficiency of Minnow and its potential to improve language model performance in word learning tasks.
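The placeholder-token construction at the heart of Minnow can be sketched as an episode builder: occurrences of the target word are replaced by a special token, a few usages serve as in-context support, and another is held out as the generation target. The token name and field layout are illustrative, not the paper's exact format.

```python
PLACEHOLDER = "<new_word>"  # special token standing in for the word being learned

def make_episode(word: str, usages: list[str], n_support: int = 2) -> dict:
    """Replace the target word with a placeholder and split usages into
    in-context support examples and a held-out generation target."""
    masked = [u.replace(word, PLACEHOLDER) for u in usages]
    return {"support": masked[:n_support], "target": masked[n_support]}

usages = ["the dax chased the ball", "a small dax slept here", "she fed the dax"]
print(make_episode("dax", usages))
# Training on many such episodes (one per word) is what builds the
# general word-learning ability described above.
```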
[68] FRIDA to the Rescue! Analyzing Synthetic Data Effectiveness in Object-Based Common Sense Reasoning for Disaster Response
Mollie Shichman, Claire Bonial, Austin Blodgett, Taylor Hudson, Francis Ferraro, Rachel Rudinger
Main category: cs.CL
TL;DR: FRIDA pipeline creates small LLMs for robot disaster relief with physical reasoning using synthetic data from expert prompts, showing that minimal physical state/function data outperforms full synthetic datasets.
Details
Motivation: Large LLMs have good physical reasoning for disaster relief robotics but are too big for deployment. Need smaller models with similar capabilities.
Method: Domain experts create few-shot prompts to generate synthetic data for fine-tuning small instruction-tuned models. Hand-curated datasets and ablation study to identify most effective data types.
Result: FRIDA models trained only on objects’ physical state and function data outperformed both full synthetic data models and base models in evaluations.
Conclusion: FRIDA pipeline can instill physical common sense in small LLMs with minimal data, making them suitable for robotic deployment in disaster scenarios.
Abstract: During Human Robot Interactions in disaster relief scenarios, Large Language Models (LLMs) have the potential for substantial physical reasoning to assist in mission objectives. However, these reasoning capabilities are often found only in larger models, which are not currently reasonable to deploy on robotic systems due to size constraints. To meet our problem space requirements, we introduce a dataset and pipeline to create Field Reasoning and Instruction Decoding Agent (FRIDA) models. In our pipeline, domain experts and linguists combine their knowledge to make high-quality, few-shot prompts used to generate synthetic data for fine-tuning. We hand-curate datasets for this few-shot prompting and for evaluation to improve LLM reasoning on both general and disaster-specific objects. We concurrently run an ablation study to understand which kinds of synthetic data most affect performance. We fine-tune several small instruction-tuned models and find that ablated FRIDA models only trained on objects’ physical state and function data outperformed both the FRIDA models trained on all synthetic data and the base models in our evaluation. We demonstrate that the FRIDA pipeline is capable of instilling physical common sense with minimal data.
[69] Explicit Learning and the LLM in Machine Translation
Malik Marmonier, Rachel Bawden, Benoît Sagot
Main category: cs.CL
TL;DR: LLMs can learn new languages from grammar book explanations (explicit learning), but this ability decreases with linguistic complexity. Fine-tuning helps but struggles with novel/complex features.
Details
Motivation: To investigate whether LLMs can learn new languages through explicit explanations found in grammar books, rather than just from large corpora, which could benefit low-resource languages.Method: Controlled translation experiments between English and constructed languages generated cryptographically from Latin/French. Used supervised fine-tuning with chains of thought.
Result: LLMs do possess measurable explicit learning ability, but it diminishes with linguistic complexity. Fine-tuning enhances performance but struggles with generalization to novel/complex features.
Conclusion: More diverse training sets and alternative fine-tuning strategies are needed to improve explicit learning for low-resource languages typically described in grammar books.
Abstract: This study explores an LLM’s ability to learn new languages using explanations found in a grammar book, a process we term “explicit learning.” To rigorously assess this ability, we design controlled translation experiments between English and constructed languages generated, through specific cryptographic means, from Latin or French. Contrary to previous studies, our results demonstrate that LLMs do possess a measurable capacity for explicit learning. This ability, however, diminishes as the complexity of the linguistic phenomena to be learned increases. Supervised fine-tuning on ad hoc chains of thought significantly enhances LLM performance but struggles to generalize to typologically novel or more complex linguistic features. These findings point to the need for more diverse training sets and alternative fine-tuning strategies to further improve explicit learning by LLMs, benefiting low-resource languages typically described in grammar books but lacking extensive corpora.
[70] FutureGen: A RAG-based Approach to Generate the Future Work of Scientific Article
Ibrahim Al Azher, Miftahul Jannat Mokarrama, Zhishuai Guo, Sagnik Ray Choudhury, Hamed Alhoori
Main category: cs.CL
TL;DR: This paper presents a RAG-based system using LLMs to generate future work suggestions from scientific articles, incorporating LLM feedback and evaluation mechanisms.
Details
Motivation: Future work sections are valuable resources for researchers but are often limited by individual paper perspectives. The study aims to generate more comprehensive future work suggestions by leveraging related research context.Method: Used Retrieval-Augmented Generation (RAG) with various LLMs, incorporated LLM feedback mechanism to enhance quality, and introduced LLM-as-a-judge framework for evaluation of novelty, hallucination, and feasibility.
Result: RAG-based approach using GPT-4o mini with LLM feedback mechanism outperformed other methods in both qualitative and quantitative evaluations. Human evaluation was also conducted to assess LLM performance.
Conclusion: The proposed RAG-based approach with LLM feedback effectively generates comprehensive future work suggestions, demonstrating superior performance compared to other methods through robust evaluation frameworks.
Abstract: The Future Work section of a scientific article outlines potential research directions by identifying gaps and limitations of a current study. This section serves as a valuable resource for early-career researchers seeking unexplored areas and experienced researchers looking for new projects or collaborations. In this study, we generate future work suggestions from a scientific article. To enrich the generation process with broader insights and reduce the chance of missing important research directions, we use context from related papers using RAG. We experimented with various Large Language Models (LLMs) integrated into Retrieval-Augmented Generation (RAG). We incorporate an LLM feedback mechanism to enhance the quality of the generated content and introduce an LLM-as-a-judge framework for robust evaluation, assessing key aspects such as novelty, hallucination, and feasibility. Our results demonstrate that the RAG-based approach using GPT-4o mini, combined with an LLM feedback mechanism, outperforms other methods based on both qualitative and quantitative evaluations. Moreover, we conduct a human evaluation to assess the LLM as an extractor, generator, and feedback provider.
[71] EQ-Knight: A Memory-Augmented LLM Agent for Strategic Affective Gaming in Debt Recovery
Yunbo Long, Yuhan Liu, Liming Xu, Alexandra Brintrup
Main category: cs.CL
TL;DR: EQ-Knight is an LLM agent that uses emotion memory and game theory to dynamically optimize emotional strategies in credit collection, reducing concession losses by 32% while maintaining recovery rates against dishonest debtors.
Details
Motivation: Current LLM chatbots in financial negotiations over-rely on passive empathy, making them vulnerable to exploitation by dishonest debtors who manipulate conciliatory tactics, leading to revenue leakage and moral hazard.Method: EQ-Knight integrates emotion memory and game-theoretic reasoning with a Hidden Markov Model (HMM) to track and predict debtor emotional states, analyzing real-time and historical emotional cues to strategically counter negative emotions while preserving relationships.
Result: Experiments show EQ-Knight achieves 32% reduction in concession losses without compromising recovery rates, particularly effective against adversarial debtors who use intimidation and guilt-tripping tactics.
Conclusion: EQ-Knight transforms LLMs from high-risk empathetic chatbots into strategic emotion-defenders that balance emotional intelligence with tactical rigor to enforce accountability and deter exploitation in credit collection.
Abstract: Large language model-based chatbots have enhanced engagement in financial negotiations, but their overreliance on passive empathy introduces critical risks in credit collection. While empathy-driven approaches preserve client satisfaction in benign cases, they fail catastrophically against dishonest debtors–individuals who exploit conciliatory tactics to manipulate terms or evade repayment. Blindly prioritizing “customer experience” in such scenarios leads to creditor vulnerabilities: revenue leakage, moral hazard, and systemic exploitation. To address this, we propose EQ-Knight, an LLM agent that dynamically optimizes emotional strategy to defend creditor interests. Unlike naive empathy-centric bots, EQ-Knight integrates emotion memory and game-theoretic reasoning, powered by a Hidden Markov Model (HMM) to track and predict debtor emotional states. By analyzing both real-time and historical emotional cues, EQ-Knight strategically counters negative emotions (e.g., aggression, feigned distress) while preserving productive debtor relationships. Experiments demonstrate EQ-Knight’s superiority over conventional LLM negotiators: it achieves a 32% reduction in concession losses without compromising recovery rates, particularly in adversarial cases where debtors weaponize negative emotions (e.g., intimidation, guilt-tripping) to coerce concessions. For credit agencies, EQ-Knight transforms LLMs from high-risk “people-pleasers” into strategic emotion-defenders–balancing emotional intelligence with tactical rigor to enforce accountability and deter exploitation.
[72] Learning Optimal Prompt Ensemble for Multi-source Visual Prompt Transfer
Jianhua Liu, Liwen Cao, Yanru Wu, Zijie Zhao, Yang Li
Main category: cs.CL
TL;DR: HGPrompt is a dynamic framework that learns optimal ensemble weights for combining multiple source prompts by maximizing transferability and minimizing gradient conflicts, achieving state-of-the-art performance on VTAB benchmark.
Details
Motivation: Prompt tuning is lightweight but naive aggregation of multiple source prompts overlooks their different contribution potential to target tasks, requiring a smarter ensemble approach.Method: Proposes HGPrompt with differentiable prompt transferability metric and gradient conflict regularization using Hessian and Fisher Information to match gradient variances.
Result: Extensive experiments on VTAB benchmark demonstrate state-of-the-art performance, validating effective multi-source prompt transfer.
Conclusion: HGPrompt successfully learns optimal ensemble weights through information-theoretic transferability maximization and gradient conflict minimization for superior prompt adaptation.
Abstract: Prompt tuning has emerged as a lightweight strategy for adapting foundation models to downstream tasks, particularly for resource-constrained systems. As pre-trained prompts become valuable assets, combining multiple source prompts offers a promising approach to enhance generalization for new tasks by leveraging complementary knowledge. However, naive aggregation often overlooks different source prompts have different contribution potential to the target task. To address this, we propose HGPrompt, a dynamic framework that learns optimal ensemble weights. These weights are optimized by jointly maximizing an information-theoretic metric for transferability and minimizing gradient conflicts via a novel regularization strategy. Specifically, we propose a differentiable prompt transferability metric to captures the discriminability of prompt-induced features on the target task. Meanwhile, HGPrompt match the gradient variances with respect to different source prompts based on Hessian and Fisher Information, ensuring stable and coherent knowledge transfer while suppressing gradient conflicts among them. Extensive experiments on the large-scale VTAB benchmark demonstrate the state-of-the-art performance of HGPrompt, validating its effectiveness in learning an optimal ensemble for effective multi-source prompt transfer.
[73] Context Reasoner: Incentivizing Reasoning Capability for Contextualized Privacy and Safety Compliance via Reinforcement Learning
Wenbin Hu, Haoran Li, Huihao Jing, Qi Hu, Ziqian Zeng, Sirui Han, Heli Xu, Tianshu Chu, Peizhao Hu, Yangqiu Song
Main category: cs.CL
TL;DR: This paper proposes a reinforcement learning approach to enhance LLM safety and privacy compliance while preserving contextual reasoning capabilities, achieving significant improvements in both legal compliance and general reasoning benchmarks.
Details
Motivation: Current LLM safety mitigation strategies fail to preserve contextual reasoning in risky scenarios, rely heavily on sensitive pattern matching, and overlook established safety/privacy standards, creating systemic legal compliance risks.Method: Formulates safety/privacy issues as contextualized compliance problems using Contextual Integrity theory, aligns with GDPR/EU AI Act/HIPAA standards, and employs reinforcement learning with rule-based rewards to incentivize contextual reasoning while enhancing compliance.
Result: Achieves +8.58% accuracy improvement in safety/privacy benchmarks, enhances general reasoning capabilities with +2.05% on MMLU and +8.98% on LegalBench for OpenThinker-7B model.
Conclusion: The proposed method successfully addresses the limitations of current approaches by significantly improving legal compliance while simultaneously enhancing general reasoning capabilities through contextualized compliance framework and RL training.
Abstract: While Large Language Models (LLMs) exhibit remarkable capabilities, they also introduce significant safety and privacy risks. Current mitigation strategies often fail to preserve contextual reasoning capabilities in risky scenarios. Instead, they rely heavily on sensitive pattern matching to protect LLMs, which limits the scope. Furthermore, they overlook established safety and privacy standards, leading to systemic risks for legal compliance. To address these gaps, we formulate safety and privacy issues into contextualized compliance problems following the Contextual Integrity (CI) theory. Under the CI framework, we align our model with three critical regulatory standards: GDPR, EU AI Act, and HIPAA. Specifically, we employ reinforcement learning (RL) with a rule-based reward to incentivize contextual reasoning capabilities while enhancing compliance with safety and privacy norms. Through extensive experiments, we demonstrate that our method not only significantly enhances legal compliance (achieving a +8.58% accuracy improvement in safety/privacy benchmarks) but also further improves general reasoning capability. For OpenThinker-7B, a strong reasoning model that significantly outperforms its base model Qwen2.5-7B-Instruct across diverse subjects, our method enhances its general reasoning capabilities, with +2.05% and +8.98% accuracy improvement on the MMLU and LegalBench benchmark, respectively.
[74] MiniCPM4: Ultra-Efficient LLMs on End Devices
MiniCPM Team, Chaojun Xiao, Yuxuan Li, Xu Han, Yuzhuo Bai, Jie Cai, Haotian Chen, Wentong Chen, Xin Cong, Ganqu Cui, Ning Ding, Shengda Fan, Yewei Fang, Zixuan Fu, Wenyu Guan, Yitong Guan, Junshao Guo, Yufeng Han, Bingxiang He, Yuxiang Huang, Baoxi Ji, Cunliang Kong, Qiuzuo Li, Siyuan Li, Wenhao Li, Xin Li, Yanghao Li, Yishan Li, Zhen Li, Dan Liu, Biyuan Lin, Yankai Lin, Xiang Long, Quanyu Lu, Yaxi Lu, Peiyan Luo, Hongya Lyu, Litu Ou, Yinxu Pan, Lushi Pu, Zekai Qu, Qundong Shi, Zijun Song, Jiayuan Su, Zhou Su, Ao Sun, Xianghui Sun, Peijun Tang, Fangzheng Wang, Feng Wang, Shuo Wang, Yudong Wang, Zheng Wang, Yesai Wu, Zhenyu Xiao, Jie Xie, Zihao Xie, Xiaoyue Xu, Yukun Yan, Jiarui Yuan, Jinqian Zhang, Kaihuo Zhang, Lei Zhang, Linyue Zhang, Xueren Zhang, Yudi Zhang, Hengyu Zhao, Weilin Zhao, Weilun Zhao, Yuanqian Zhao, Zhi Zheng, Chuyue Zhou, Ge Zhou, Jie Zhou, Wei Zhou, Yanghao Zhou, Zihan Zhou, Zixuan Zhou, Zhiyuan Liu, Guoyang Zeng, Chao Jia, Dahai Li, Maosong Sun
Main category: cs.CL
TL;DR: MiniCPM4 is an efficient LLM for end-side devices with innovations in architecture, data, training algorithms, and inference systems, available in 0.5B and 8B versions, outperforming similar-sized models.
Details
Motivation: To create highly efficient large language models specifically designed for end-side devices through systematic optimization across multiple dimensions.Method: Proposed InfLLM v2 sparse attention, UltraClean data filtering, UltraChat v2 dataset, ModelTunnel v2 pre-training search, improved post-training methods, and CPM.cu inference system with sparse attention and quantization.
Result: MiniCPM4 achieves satisfactory performance with only 8 trillion training tokens and shows significant speed improvements on long sequence tasks, outperforming similar-sized open-source models.
Conclusion: The systematic innovations across architecture, data, training, and inference enable efficient on-device LLMs that deliver strong performance while maintaining computational efficiency.
Abstract: This paper introduces MiniCPM4, a highly efficient large language model (LLM) designed explicitly for end-side devices. We achieve this efficiency through systematic innovation in four key dimensions: model architecture, training data, training algorithms, and inference systems. Specifically, in terms of model architecture, we propose InfLLM v2, a trainable sparse attention mechanism that accelerates both prefilling and decoding phases for long-context processing. Regarding training data, we propose UltraClean, an efficient and accurate pre-training data filtering and generation strategy, and UltraChat v2, a comprehensive supervised fine-tuning dataset. These datasets enable satisfactory model performance to be achieved using just 8 trillion training tokens. Regarding training algorithms, we propose ModelTunnel v2 for efficient pre-training strategy search, and improve existing post-training methods by introducing chunk-wise rollout for load-balanced reinforcement learning and data-efficient tenary LLM, BitCPM. Regarding inference systems, we propose CPM.cu that integrates sparse attention, model quantization, and speculative sampling to achieve efficient prefilling and decoding. To meet diverse on-device requirements, MiniCPM4 is available in two versions, with 0.5B and 8B parameters, respectively. Furthermore, we construct a hybrid reasoning model, MiniCPM4.1, which can be used in both deep reasoning mode and non-reasoning mode. Evaluation results demonstrate that MiniCPM4 and MiniCPM4.1 outperform similar-sized open-source models across benchmarks, with the 8B variants showing significant speed improvements on long sequence understanding and generation.
[75] MEDUSA: A Multimodal Deep Fusion Multi-Stage Training Framework for Speech Emotion Recognition in Naturalistic Conditions
Georgios Chatzichristodoulou, Despoina Kosmopoulou, Antonios Kritikos, Anastasia Poulopoulou, Efthymios Georgiou, Athanasios Katsamanis, Vassilis Katsouros, Alexandros Potamianos
Main category: cs.CL
TL;DR: MEDUSA is a multimodal framework that won 1st place in speech emotion recognition by addressing class imbalance and emotion ambiguity through a four-stage training pipeline with ensemble classifiers and meta-classifier optimization.
Details
Motivation: Speech Emotion Recognition (SER) is challenging due to the subjective nature of human emotions and uneven representation under naturalistic conditions, requiring solutions for class imbalance and emotion ambiguity.Method: Four-stage training pipeline: 1) trains ensemble classifiers using DeepSER (deep cross-modal transformer fusion from pretrained acoustic/linguistic representations), 2) employs Manifold MixUp for regularization, 3-4) optimizes trainable meta-classifier that combines ensemble predictions with human annotation scores as soft targets, balanced data sampling, and multitask learning.
Result: MEDUSA ranked 1st in Task 1: Categorical Emotion Recognition in the Interspeech 2025: Speech Emotion Recognition in Naturalistic Conditions Challenge.
Conclusion: The proposed multimodal framework effectively handles class imbalance and emotion ambiguity through its innovative training approach, demonstrating state-of-the-art performance in naturalistic speech emotion recognition.
Abstract: SER is a challenging task due to the subjective nature of human emotions and their uneven representation under naturalistic conditions. We propose MEDUSA, a multimodal framework with a four-stage training pipeline, which effectively handles class imbalance and emotion ambiguity. The first two stages train an ensemble of classifiers that utilize DeepSER, a novel extension of a deep cross-modal transformer fusion mechanism from pretrained self-supervised acoustic and linguistic representations. Manifold MixUp is employed for further regularization. The last two stages optimize a trainable meta-classifier that combines the ensemble predictions. Our training approach incorporates human annotation scores as soft targets, coupled with balanced data sampling and multitask learning. MEDUSA ranked 1st in Task 1: Categorical Emotion Recognition in the Interspeech 2025: Speech Emotion Recognition in Naturalistic Conditions Challenge.
[76] DAPFAM: A Domain-Aware Family-level Dataset to benchmark cross domain patent retrieval
Iliass Ayaou, Denis Cavallucci, Hicham Chibane
Main category: cs.CL
TL;DR: DAPFAM is a new patent retrieval benchmark with explicit domain partitions showing a 5x performance gap between in-domain and out-of-domain retrieval, with passage-level methods outperforming document-level but no method closing the cross-domain gap.
Details
Motivation: Existing patent retrieval benchmarks lack explicit domain partitions, making it difficult to assess how systems handle cross-technological boundary retrieval challenges.Method: Created DAPFAM benchmark with 1,247 query families and 45,336 target families using IPC3 overlap scheme for domain partitioning. Conducted 249 experiments with lexical (BM25) and dense (transformer) methods, document/passage retrieval, multiple representations, aggregation strategies, and RRF fusion.
Result: Found pronounced domain gap: out-domain performance remains ~5x lower than in-domain across all configurations. Passage-level retrieval outperforms document-level, dense methods provide modest gains over BM25, but none close the out-domain gap. RRF offers good effectiveness-efficiency trade-offs.
Conclusion: DAPFAM exposes persistent cross-domain retrieval challenges and provides a reproducible testbed for developing more robust patent IR systems, with the dataset publicly available.
Abstract: Patent prior-art retrieval becomes especially challenging when relevant disclosures cross technological boundaries. Existing benchmarks lack explicit domain partitions, making it difficult to assess how retrieval systems cope with such shifts. We introduce DAPFAM, a family-level benchmark with explicit IN-domain and OUT-domain partitions defined by a new IPC3 overlap scheme. The dataset contains 1,247 query families and 45,336 target families aggregated at the family level to reduce international redundancy, with citation based relevance judgments. We conduct 249 controlled experiments spanning lexical (BM25) and dense (transformer) backends, document and passage level retrieval, multiple query and document representations, aggregation strategies, and hybrid fusion via Reciprocal Rank Fusion (RRF). Results reveal a pronounced domain gap: OUT-domain performance remains roughly five times lower than IN-domain across all configurations. Passage-level retrieval consistently outperforms document-level, and dense methods provide modest gains over BM25, but none close the OUT-domain gap. Document-level RRF yields strong effectiveness efficiency trade-offs with minimal overhead. By exposing the persistent challenge of cross-domain retrieval, DAPFAM provides a reproducible, compute-aware testbed for developing more robust patent IR systems. The dataset is publicly available on huggingface at https://huggingface.co/datasets/datalyes/DAPFAM_patent.
[77] Learning an Efficient Multi-Turn Dialogue Evaluator from Multiple Judges
Yuqi Tang, Kehua Feng, Yunfeng Wang, Zhiwen Chen, Chengfei Lv, Gang Yu, Qiang Zhang, Keyan Ding
Main category: cs.CL
TL;DR: Proposes an efficient multi-turn dialogue evaluator that aggregates multiple LLM judges’ preference knowledge into a single model, reducing computational costs while maintaining evaluation quality.
Details
Motivation: Current LLM-as-a-judge evaluation methods suffer from biases and reliability issues, while multi-judge approaches are computationally expensive during inference.Method: Aggregates preference knowledge from multiple LLM judges into a single model to capture collective wisdom while reducing evaluation overhead.
Result: Outperforms existing baselines across seven dialogue evaluation benchmarks in both single rating and pairwise comparison scenarios.
Conclusion: The method provides efficient and robust dialogue quality assessment by preserving multi-judge advantages while drastically reducing computational costs.
Abstract: Evaluating the conversational abilities of large language models (LLMs) remains a challenging task. Current mainstream approaches primarily rely on the “LLM-as-a-judge” paradigm, where an LLM is prompted to serve as an evaluator to assess dialogue quality. However, such methods often suffer from various biases, which undermine the reliability and consistency of the evaluation results. To mitigate these biases, recent methods employ multiple LLMs as judges and aggregate their judgments to select the optimal assessment. Although effective, this multi-judge approach incurs significant computational overhead during inference. In this paper, we propose an efficient multi-turn dialogue evaluator that captures the collective wisdom of multiple LLM judges by aggregating their preference knowledge into a single model. Our approach preserves the advantages of diverse multi-judge feedback while drastically reducing the evaluation cost, enabling fast and flexible dialogue quality assessment. Extensive experiments on seven single rating and pairwise comparison dialogue evaluation benchmarks demonstrate that our method outperforms existing baselines across diverse scenarios, showcasing its efficiency and robustness.
[78] Two-Stage Quranic QA via Ensemble Retrieval and Instruction-Tuned Answer Extraction
Mohamed Basem, Islam Oshallah, Ali Hamdi, Khaled Shaban, Hozaifa Kassab
Main category: cs.CL
TL;DR: Novel two-stage framework for Quranic QA combining ensemble Arabic language models for retrieval and instruction-tuned LLMs with few-shot prompting for answer extraction, achieving state-of-the-art results.
Details
Motivation: Address unique challenges of Quranic QA due to linguistic complexity of Classical Arabic and semantic richness of religious texts in low-resource settings.Method: Two-stage framework: 1) Ensemble fine-tuned Arabic language models for superior passage retrieval, 2) Instruction-tuned large language models with few-shot prompting for answer extraction to overcome small dataset limitations.
Result: Achieved state-of-the-art on Quran QA 2023 Shared Task: MAP@10 of 0.3128 and MRR@10 of 0.5763 for retrieval, pAP@10 of 0.669 for extraction, substantially outperforming previous methods.
Conclusion: Combining model ensembling and instruction-tuned language models effectively addresses low-resource question answering challenges in specialized domains like religious texts.
Abstract: Quranic Question Answering presents unique challenges due to the linguistic complexity of Classical Arabic and the semantic richness of religious texts. In this paper, we propose a novel two-stage framework that addresses both passage retrieval and answer extraction. For passage retrieval, we ensemble fine-tuned Arabic language models to achieve superior ranking performance. For answer extraction, we employ instruction-tuned large language models with few-shot prompting to overcome the limitations of fine-tuning on small datasets. Our approach achieves state-of-the-art results on the Quran QA 2023 Shared Task, with a MAP@10 of 0.3128 and MRR@10 of 0.5763 for retrieval, and a pAP@10 of 0.669 for extraction, substantially outperforming previous methods. These results demonstrate that combining model ensembling and instruction-tuned language models effectively addresses the challenges of low-resource question answering in specialized domains.
[79] Transplant Then Regenerate: A New Paradigm for Text Data Augmentation
Guangzhan Wang, Hongyu Zhang, Beijun Shen, Xiaodong Gu
Main category: cs.CL
TL;DR: LMTransplant is a novel text augmentation method that uses LLMs to create diverse content-level variants by transplanting seed text into expanded contexts and regenerating new versions, outperforming traditional methods.
Details
Motivation: Traditional data augmentation methods focus on lexical rephrasing with same semantics, while LLM-based approaches struggle with controlling style and structure. There's a need for methods that can create more diverse and creative content-level variations while preserving core text attributes.Method: LMTransplant uses a transplant-then-regenerate paradigm: incorporating seed text into a context expanded by LLM, then asking the LLM to regenerate a variant based on the expanded context to leverage LLM knowledge for diverse content creation.
Result: LMTransplant demonstrates superior performance over existing text augmentation methods across various text-related tasks and shows exceptional scalability as augmented data size grows.
Conclusion: The proposed LMTransplant paradigm effectively addresses limitations of traditional augmentation methods by leveraging LLMs’ knowledge to generate diverse content-level variants while maintaining original text attributes, showing strong performance and scalability.
Abstract: Data augmentation is a critical technique in deep learning. Traditional methods like Back-translation typically focus on lexical-level rephrasing, which primarily produces variations with the same semantics. While large language models (LLMs) have enhanced text augmentation by their “knowledge emergence” capability, controlling the style and structure of these outputs remains challenging and requires meticulous prompt engineering. In this paper, we propose LMTransplant, a novel text augmentation paradigm leveraging LLMs. The core idea of LMTransplant is transplant-then-regenerate: incorporating seed text into a context expanded by LLM, and asking the LLM to regenerate a variant based on the expanded context. This strategy allows the model to create more diverse and creative content-level variants by fully leveraging the knowledge embedded in LLMs, while preserving the core attributes of the original text. We evaluate LMTransplant across various text-related tasks, demonstrating its superior performance over existing text augmentation methods. Moreover, LMTransplant demonstrates exceptional scalability as the size of augmented data grows.
[80] SLM-Bench: A Comprehensive Benchmark of Small Language Models on Environmental Impacts–Extended Version
Nghiem Thanh Pham, Tung Kieu, Duc-Manh Nguyen, Son Ha Xuan, Nghia Duong-Trung, Danh Le-Phuoc
Main category: cs.CL
TL;DR: SLM-Bench is the first comprehensive benchmark for Small Language Models that evaluates 15 SLMs across 9 NLP tasks using 23 datasets, measuring accuracy, computational efficiency, and sustainability metrics on 4 hardware configurations.
Details
Motivation: Small Language Models offer computational efficiency and accessibility, but there was no systematic evaluation framework to assess their performance and environmental impact comprehensively.Method: Developed SLM-Bench benchmark that evaluates 15 SLMs on 9 NLP tasks using 23 datasets across 14 domains. Uses 4 hardware configurations and quantifies 11 metrics across correctness, computation, and consumption dimensions with controlled hardware conditions for fair comparisons.
Result: The evaluation reveals diverse trade-offs among SLMs - some excel in accuracy while others achieve superior energy efficiency. The benchmark provides rigorous comparisons of model effectiveness across multiple dimensions.
Conclusion: SLM-Bench sets a new standard for SLM evaluation by bridging the gap between resource efficiency and real-world applicability, with an open-source benchmarking pipeline for reproducibility and further research.
Abstract: Small Language Models (SLMs) offer computational efficiency and accessibility, yet a systematic evaluation of their performance and environmental impact remains lacking. We introduce SLM-Bench, the first benchmark specifically designed to assess SLMs across multiple dimensions, including accuracy, computational efficiency, and sustainability metrics. SLM-Bench evaluates 15 SLMs on 9 NLP tasks using 23 datasets spanning 14 domains. The evaluation is conducted on 4 hardware configurations, providing a rigorous comparison of their effectiveness. Unlike prior benchmarks, SLM-Bench quantifies 11 metrics across correctness, computation, and consumption, enabling a holistic assessment of efficiency trade-offs. Our evaluation considers controlled hardware conditions, ensuring fair comparisons across models. We develop an open-source benchmarking pipeline with standardized evaluation protocols to facilitate reproducibility and further research. Our findings highlight the diverse trade-offs among SLMs, where some models excel in accuracy while others achieve superior energy efficiency. SLM-Bench sets a new standard for SLM evaluation, bridging the gap between resource efficiency and real-world applicability.
[81] Spotlight Attention: Towards Efficient LLM Generation via Non-linear Hashing-based KV Cache Retrieval
Wenhao Li, Yuxin Zhang, Gen Luo, Haiyuan Wan, Ziyang Gong, Fei Chao, Rongrong Ji
Main category: cs.CL
TL;DR: Spotlight Attention uses non-linear hashing to optimize KV cache selection in LLMs, achieving 5x shorter hash codes and 3x higher throughput than traditional methods.
Details
Motivation: Existing KV cache reduction methods use inefficient linear hashing due to orthogonal query-key distributions in narrow cones, requiring better optimization.Method: Non-linear hashing functions to optimize embedding distribution, with Bradley-Terry ranking-based loss training framework that runs on 16GB GPUs in 8 hours.
Result: 5x shorter hash codes, under 100μs retrieval for 512K tokens on A100 GPU, and 3x higher end-to-end throughput compared to vanilla decoding.
Conclusion: Spotlight Attention significantly improves KV cache efficiency through optimized non-linear hashing, enabling faster LLM inference with maintained performance.
Abstract: Reducing the key-value (KV) cache burden in Large Language Models (LLMs) significantly accelerates inference. Dynamically selecting critical KV caches during decoding helps maintain performance. Existing methods use random linear hashing to identify important tokens, but this approach is inefficient due to the orthogonal distribution of queries and keys within two narrow cones in LLMs. We introduce Spotlight Attention, a novel method that employs non-linear hashing functions to optimize the embedding distribution of queries and keys, enhancing coding efficiency and robustness. We also developed a lightweight, stable training framework using a Bradley-Terry ranking-based loss, enabling optimization of the non-linear hashing module on GPUs with 16GB memory in 8 hours. Experimental results show that Spotlight Attention drastically improves retrieval precision while shortening the length of the hash code at least 5$\times$ compared to traditional linear hashing. Finally, we exploit the computational advantages of bitwise operations by implementing specialized CUDA kernels, achieving hashing retrieval for 512K tokens in under 100$\mu$s on a single A100 GPU, with end-to-end throughput up to 3$\times$ higher than vanilla decoding.
[82] Forewarned is Forearmed: Pre-Synthesizing Jailbreak-like Instructions to Enhance LLM Safety Guardrail to Potential Attacks
Sheng Liu, Qiang Sheng, Danding Wang, Yang Li, Guang Yang, Juan Cao
Main category: cs.CL
TL;DR: IMAGINE is a framework that generates jailbreak-like instructions using embedding space analysis to address distributional gaps in LLM safety alignment, reducing attack success rates without compromising utility.
Details
Motivation: LLMs remain vulnerable to jailbreak attacks due to distributional mismatch between safety training data and real-world malicious instructions, forcing reactive patching cycles.Method: Leverages embedding space distribution analysis to synthesize jailbreak-like instructions through iterative optimization that dynamically evolves text generation distributions.
Result: Significant decreases in attack success rates on Qwen2.5, Llama3.1, and Llama3.2 models without compromising their utility.
Conclusion: IMAGINE effectively addresses the distributional gap in LLM safety alignment through synthetic data generation, providing proactive defense against jailbreak attacks.
Abstract: Despite advances in improving large language model (LLM) to refuse to answer malicious instructions, widely used LLMs remain vulnerable to jailbreak attacks where attackers generate instructions with distributions differing from safety alignment corpora. New attacks expose LLMs’ inability to recognize unseen malicious instructions, highlighting a critical distributional mismatch between training data and real-world attacks that forces developers into reactive patching cycles. To tackle this challenge, we propose IMAGINE, a synthesis framework that leverages embedding space distribution analysis to generate jailbreak-like instructions. This approach effectively fills the distributional gap between authentic jailbreak patterns and safety alignment corpora. IMAGINE follows an iterative optimization process that dynamically evolves text generation distributions across iterations, thereby augmenting the coverage of safety alignment data distributions through synthesized data examples. Based on the safety-aligned corpus enhanced through IMAGINE, our framework demonstrates significant decreases in attack success rate on Qwen2.5, Llama3.1, and Llama3.2 without compromising their utility.
[83] Can Compact Language Models Search Like Agents? Distillation-Guided Policy Optimization for Preserving Agentic RAG Capabilities
Rikuto Kotoge, Mai Nishimura, Jiaxin Ma
Main category: cs.CL
TL;DR: DGPO enables compact language models to achieve sophisticated agentic RAG behaviors through distillation-guided policy optimization, overcoming sparse rewards and poor reasoning in small models.
Details
Motivation: Compact language models (e.g., 0.5B parameters) struggle with agentic RAG behaviors due to poor reasoning ability, sparse rewards, and unstable training, making them unsuitable for resource-constrained environments.Method: Proposes Distillation-Guided Policy Optimization (DGPO) with cold-start initialization from teacher demonstrations and continuous teacher guidance during policy optimization. Introduces Agentic RAG Capabilities (ARC) metric for systematic evaluation.
Result: DGPO enables compact models to achieve sophisticated agentic search behaviors, even outperforming larger teacher models in some cases. Comprehensive experiments demonstrate effectiveness.
Conclusion: DGPO makes agentic RAG feasible in computing resource-constrained environments by successfully addressing the challenges of training compact language models for complex agentic behaviors.
Abstract: Reinforcement Learning has emerged as a post-training approach to elicit agentic RAG behaviors such as search and planning from language models. However, compact language models (e.g., 0.5B parameters) struggle due to poor reasoning ability, resulting in sparse rewards and unstable training. To overcome these difficulties, we propose Distillation-Guided Policy Optimization (DGPO), which addresses the challenges through cold-start initialization from teacher demonstrations and continuous teacher guidance during policy optimization. To systematically evaluate our approach, we introduce Agentic RAG Capabilities (ARC), a fine-grained metric analyzing reasoning, search coordination, and response synthesis. Comprehensive experiments demonstrate that DGPO enables compact models to achieve sophisticated agentic search behaviors, even outperforming the larger teacher model in some cases. DGPO makes agentic RAG feasible in computing resource-constrained environments.
[84] UI-Bench: A Benchmark for Evaluating Design Capabilities of AI Text-to-App Tools
Sam Jung, Agustin Garcinuno, Spencer Mateega
Main category: cs.CL
TL;DR: UI-Bench is the first large-scale benchmark for evaluating AI text-to-app tools through expert pairwise comparisons of 300 generated sites from 10 tools, providing calibrated rankings and establishing standards for AI-driven web design.
Details
Motivation: There are no public benchmarks that rigorously verify claims of AI text-to-app tools promising high-quality applications and websites in minutes, creating a need for standardized evaluation.Method: Created UI-Bench with 10 tools, 30 prompts, 300 generated sites, and 4,000+ expert judgments using pairwise comparison and a TrueSkill-derived model for calibrated confidence intervals.
Result: Established a reproducible standard for advancing AI-driven web design with complete prompt set, open-source evaluation framework, and public leaderboard available at uibench.ai/leaderboard.
Conclusion: UI-Bench provides the first comprehensive benchmark for evaluating visual excellence of AI text-to-app tools, enabling reproducible comparison and advancement of AI-driven web design technologies.
Abstract: AI text-to-app tools promise high quality applications and websites in minutes, yet no public benchmark rigorously verifies those claims. We introduce UI-Bench, the first large-scale benchmark that evaluates visual excellence across competing AI text-to-app tools through expert pairwise comparison. Spanning 10 tools, 30 prompts, 300 generated sites, and 4,000+ expert judgments, UI-Bench ranks systems with a TrueSkill-derived model that yields calibrated confidence intervals. UI-Bench establishes a reproducible standard for advancing AI-driven web design. We release (i) the complete prompt set, (ii) an open-source evaluation framework, and (iii) a public leaderboard. The generated sites rated by participants will be released soon. View the UI-Bench leaderboard at https://uibench.ai/leaderboard.
[85] Modular Techniques for Synthetic Long-Context Data Generation in Language Model Training and Evaluation
Seganrasan Subramanian, Abhigya Verma
Main category: cs.CL
TL;DR: A framework for generating synthetic long-context datasets using LLMs through prompt-based interactions to address the lack of diverse, verifiable long-text data for training and evaluation.
Details
Motivation: Progress in long-context LLM capabilities is constrained by the absence of high-quality, diverse, and verifiable long-context datasets suitable for both training and evaluation purposes.Method: Modular framework using prompt-based interaction with LLMs to generate synthetic data, supporting multiple objectives (SFT, DPO, GRPO) through four generation paradigms: multi-turn dialogues, document-grounded pairs, verifiable tasks, and long-context reasoning examples.
Result: The approach enables scalable, controllable, and purpose-aligned dataset creation through templated prompting, model-agnostic architecture, and metadata-enriched outputs.
Conclusion: This framework facilitates advancing long-context capabilities in LLMs by providing a systematic method for generating high-quality synthetic datasets that address current data limitations.
Abstract: The ability of large language models (LLMs) to process and reason over long textual inputs is critical for a wide range of real-world applications. However, progress in this area is significantly constrained by the absence of high-quality, diverse, and verifiable long-context datasets suitable for both training and evaluation. This work introduces a modular, extensible framework for synthetic long-context data generation via prompt-based interaction with LLMs. The framework supports multiple training and alignment objectives, including Supervised Fine-Tuning (SFT), Direct Preference Optimization (DPO), and Group Relative Policy Optimization (GRPO). It encompasses four core generation paradigms: multi-turn conversational dialogues, document-grounded input-output pairs, verifiable instruction-response tasks, and long-context reasoning examples. Through templated prompting, a model-agnostic architecture, and metadata-enriched outputs, the proposed approach facilitates scalable, controllable, and purpose-aligned dataset creation for advancing long-context capabilities in LLMs.
[86] DaMoC: Efficiently Selecting the Optimal Large Language Model for Fine-tuning Domain Tasks Based on Data and Model Compression
Wei Huang, Huang Wei, Yinggui Wang
Main category: cs.CL
TL;DR: DaMoC framework helps select optimal LLMs for fine-tuning by compressing data and models, saving 20x training time.
Details
Motivation: LLMs struggle with domain-specific tasks and selecting the best model for fine-tuning is challenging due to many available open-source models.Method: Two-level approach: 1) Data compression through systematic data filtering, token compression, and iterative text rewriting; 2) Model compression using layer similarity scoring and sparse merging to preserve original capabilities.
Result: Extensive experiments on medical Q&A, financial Q&A, general Q&A, and reading comprehension datasets show successful optimal LLM selection with approximately 20-fold training time savings.
Conclusion: DaMoC framework effectively addresses the challenge of selecting optimal LLMs for domain-specific fine-tuning while significantly reducing computational costs.
Abstract: Large language models (LLMs) excel in general tasks but struggle with domain-specific ones, requiring fine-tuning with specific data. With many open-source LLMs available, selecting the best model for fine-tuning downstream tasks is challenging, primarily focusing on how to quickly identify the optimal LLM. We introduce a Data and Model Compression Framework (DaMoC) that addresses this challenge by: 1) Data Level: A systematic categorization of data filtering methodologies for LLMs is first established, classifying them into three distinct paradigms: (1) distribution-aware methods, (2) quality-aware methods, and (3) hybrid approaches considering both dimensions. Further, we enhance the density of key tokens in the text achieving token compression. Subsequently, we use an LLM to iterative rewrite the text to optimize its expression. 2) Model Level: We use layer similarity scores to assess each layer’s importance and remove those with lower importance. Then, we introduce a sparse merging paradigm to preserve as much of the original model’s capability as possible. Extensive experiments on four datasets, medical Q&A, financial Q&A, general Q&A, and reading comprehension, show that we can select the optimal LLM while saving approximately 20-fold in training time.
[87] Training LLMs to be Better Text Embedders through Bidirectional Reconstruction
Chang Su, Dengliang Shi, Siyuan Huang, Jintao Du, Changhua Meng, Yu Cheng, Weiqiang Wang, Zhouhan Lin
Main category: cs.CL
TL;DR: New training stage with bidirectional generative reconstruction tasks (EBQ2D and EBD2Q) improves LLM text embeddings by enriching final token semantics, achieving SOTA results on MTEB benchmark.
Details
Motivation: Existing LLM-based text embedding approaches use final token embeddings (like [EOS]) that weren't intentionally trained to capture whole context semantics, limiting performance in retrieval and re-ranking tasks.Method: Adds a new training stage before contrastive learning that uses bidirectional generative reconstruction tasks: EBQ2D (Embedding-Based Query-to-Document) and EBD2Q (Embedding-Based Document-to-Query) to anchor [EOS] embedding and reconstruct Query-Document pairs.
Result: Significantly improves LLM performance on Massive Text Embedding Benchmark (MTEB), achieving new state-of-the-art results across different LLM base models and scales.
Conclusion: The proposed bidirectional generative reconstruction training stage effectively enriches final token semantics, making LLMs more powerful text embedders for retrieval and re-ranking applications.
Abstract: Large language models (LLMs) have increasingly been explored as powerful text embedders. Existing LLM-based text embedding approaches often leverage the embedding of the final token, typically a reserved special token such as [EOS]. However, these tokens have not been intentionally trained to capture the semantics of the whole context, limiting their capacity as text embeddings, especially for retrieval and re-ranking tasks. We propose to add a new training stage before contrastive learning to enrich the semantics of the final token embedding. This stage employs bidirectional generative reconstruction tasks, namely EBQ2D (Embedding-Based Query-to-Document) and EBD2Q (Embedding-Based Document-to-Query), which interleave to anchor the [EOS] embedding and reconstruct either side of Query-Document pairs. Experimental results demonstrate that our additional training stage significantly improves LLM performance on the Massive Text Embedding Benchmark (MTEB), achieving new state-of-the-art results across different LLM base models and scales.
cs.CV
[88] Towards Efficient General Feature Prediction in Masked Skeleton Modeling
Shengkai Sun, Zefan Zhang, Jianfeng Dong, Zhiyong Cheng, Xiaojun Chang, Meng Wang
Main category: cs.CV
TL;DR: GFP framework replaces low-level reconstruction with high-level feature prediction for efficient masked skeleton modeling, achieving faster training and state-of-the-art performance.
Details
Motivation: Existing masked autoencoder approaches for skeleton-based action recognition use simple reconstruction targets like raw joint coordinates, causing computational redundancy and limited semantic representation.Method: Proposes General Feature Prediction (GFP) framework with collaborative learning where a lightweight target generation network produces diversified supervision signals across spatial-temporal hierarchies, using constrained optimization to ensure feature diversity.
Result: Achieves 6.2x faster training than standard methods and state-of-the-art performance on NTU RGB+D 60, NTU RGB+D 120 and PKU-MMD datasets.
Conclusion: High-level feature prediction spanning from local motion to global semantics is more effective than low-level reconstruction for efficient and high-quality skeleton-based action representation learning.
Abstract: Recent advances in the masked autoencoder (MAE) paradigm have significantly propelled self-supervised skeleton-based action recognition. However, most existing approaches limit reconstruction targets to raw joint coordinates or their simple variants, resulting in computational redundancy and limited semantic representation. To address this, we propose a novel General Feature Prediction framework (GFP) for efficient mask skeleton modeling. Our key innovation is replacing conventional low-level reconstruction with high-level feature prediction that spans from local motion patterns to global semantic representations. Specifically, we introduce a collaborative learning framework where a lightweight target generation network dynamically produces diversified supervision signals across spatial-temporal hierarchies, avoiding reliance on pre-computed offline features. The framework incorporates constrained optimization to ensure feature diversity while preventing model collapse. Experiments on NTU RGB+D 60, NTU RGB+D 120 and PKU-MMD demonstrate the benefits of our approach: Computational efficiency (with 6.2$\times$ faster training than standard masked skeleton modeling methods) and superior representation quality, achieving state-of-the-art performance in various downstream tasks.
[89] Teacher-Student Model for Detecting and Classifying Mitosis in the MIDOG 2025 Challenge
Seungho Choe, Xiaoli Qin, Abubakr Shafique, Amanda Dy, Dimitri Androutsos, Susan Done, April Khademi
Main category: cs.CV
TL;DR: AI-based mitosis detection using teacher-student UNet with domain generalization for robust segmentation and classification across different domains and imbalanced data.
Details
Motivation: Manual mitosis counting is time-consuming and suffers from inter-observer variability. AI solutions face challenges with domain shift (organ/species/staining differences) and severe class imbalance between mitotic figures and normal nuclei.Method: Pixel-level segmentation approach with teacher-student UNet model integrating contrastive representation learning and domain-adversarial training. Uses pseudo-masks for mitoses, hard negatives, and normal nuclei. Multi-scale CNN classifier with multi-task learning for atypical mitosis classification.
Result: Achieved F1 score of 0.7660 for mitosis detection (Track 1) and balanced accuracy of 0.8414 for atypical mitosis classification (Track 2) on preliminary test set.
Conclusion: The unified framework combining segmentation-based detection and classification effectively addresses domain shift and data imbalance challenges, demonstrating robust performance for automated mitosis analysis.
Abstract: Counting mitotic figures is time-intensive for pathologists and leads to inter-observer variability. Artificial intelligence (AI) promises a solution by automatically detecting mitotic figures while maintaining decision consistency. However, AI tools are susceptible to domain shift, where a significant drop in performance can occur due to differences in the training and testing sets, including morphological diversity between organs, species, and variations in staining protocols. Furthermore, the number of mitoses is much less than the count of normal nuclei, which introduces severely imbalanced data for the detection task. In this work, we formulate mitosis detection as a pixel-level segmentation and propose a teacher-student model that simultaneously addresses mitosis detection (Track 1) and atypical mitosis classification (Track 2). Our method is based on a UNet segmentation backbone that integrates domain generalization modules, namely contrastive representation learning and domain-adversarial training. A teacher-student strategy is employed to generate pixel-level pseudo-masks not only for annotated mitoses and hard negatives but also for normal nuclei, thereby enhancing feature discrimination and improving robustness against domain shift. For the classification task, we introduce a multi-scale CNN classifier that leverages feature maps from the segmentation model within a multi-task learning paradigm. On the preliminary test set, the algorithm achieved an F1 score of 0.7660 in Track 1 and balanced accuracy of 0.8414 in Track 2, demonstrating the effectiveness of integrating segmentation-based detection and classification into a unified framework for robust mitosis analysis.
[90] Multi Attribute Bias Mitigation via Representation Learning
Rajeev Ranjan Dwivedi, Ankur Kumar, Vinod K Kurmi
Main category: cs.CV
TL;DR: GMBM is a two-stage framework that addresses multiple overlapping biases in vision models by first learning to recognize biases through adaptive encoders, then pruning bias directions from gradients, resulting in improved worst-group accuracy and reduced bias amplification.
Details
Motivation: Real-world images contain multiple overlapping biases that collectively impair vision model performance and fairness. Individual bias mitigation is inadequate as it often intensifies other biases.Method: Two-stage framework: 1) Adaptive Bias Integrated Learning trains encoders for each attribute and integrates them with main backbone to explicitly recognize biases; 2) Gradient Suppression Fine Tuning prunes bias directions from backbone’s gradients. Also introduces Scaled Bias Amplification metric to measure bias.
Result: Improved worst group accuracy, halved multi-attribute bias amplification, and set new low in SBA metric on FB CMNIST, CelebA, and COCO datasets, even with increasing bias complexity and distribution shifts.
Conclusion: GMBM is the first practical, end-to-end multi-bias solution for visual recognition that effectively handles multiple overlapping biases while requiring group labels only during training.
Abstract: Real world images frequently exhibit multiple overlapping biases, including textures, watermarks, gendered makeup, scene object pairings, etc. These biases collectively impair the performance of modern vision models, undermining both their robustness and fairness. Addressing these biases individually proves inadequate, as mitigating one bias often permits or intensifies others. We tackle this multi bias problem with Generalized Multi Bias Mitigation (GMBM), a lean two stage framework that needs group labels only while training and minimizes bias at test time. First, Adaptive Bias Integrated Learning (ABIL) deliberately identifies the influence of known shortcuts by training encoders for each attribute and integrating them with the main backbone, compelling the classifier to explicitly recognize these biases. Then Gradient Suppression Fine Tuning prunes those very bias directions from the backbone’s gradients, leaving a single compact network that ignores all the shortcuts it just learned to recognize. Moreover we find that existing bias metrics break under subgroup imbalance and train test distribution shifts, so we introduce Scaled Bias Amplification (SBA): a test time measure that disentangles model induced bias amplification from distributional differences. We validate GMBM on FB CMNIST, CelebA, and COCO, where we boost worst group accuracy, halve multi attribute bias amplification, and set a new low in SBA even as bias complexity and distribution shifts intensify, making GMBM the first practical, end to end multibias solution for visual recognition. Project page: http://visdomlab.github.io/GMBM/
[91] Lightweight image segmentation for echocardiography
Anders Kjelsrud, Lasse Løvstakken, Erik Smistad, Håvard Dalen, Gilles Van De Vyver
Main category: cs.CV
TL;DR: Lightweight U-Net achieves equivalent cardiac segmentation performance to nnU-Net with 16x smaller size and 4x faster speed through optimized components.
Details
Motivation: Enable real-time clinical measurements from echocardiography by creating a more efficient segmentation model than large, slow nnU-Net configurations.Method: Ablation study identifying most effective nnU-Net components (data augmentation, architecture, loss functions, post-processing), then developing lightweight U-Net with 2M parameters.
Result: Achieved statistically equivalent Dice scores (0.93/0.85/0.89 vs 0.93/0.86/0.89 for LV/MYO/LA) to nnU-Net while being 16x smaller and 4x faster (1.35ms vs 5.40ms per frame).
Conclusion: Simple affine augmentations and deep supervision are key drivers of performance, enabling efficient real-time cardiac segmentation without sacrificing accuracy.
Abstract: Accurate segmentation of the left ventricle in echocardiography can enable fully automatic extraction of clinical measurements such as volumes and ejection fraction. While models configured by nnU-Net perform well, they are large and slow, thus limiting real-time use. We identified the most effective components of nnU-Net for cardiac segmentation through an ablation study, incrementally evaluating data augmentation schemes, architectural modifications, loss functions, and post-processing techniques. Our analysis revealed that simple affine augmentations and deep supervision drive performance, while complex augmentations and large model capacity offer diminishing returns. Based on these insights, we developed a lightweight U-Net (2M vs 33M parameters) that achieves statistically equivalent performance to nnU-Net on CAMUS (N=500) with Dice scores of 0.93/0.85/0.89 vs 0.93/0.86/0.89 for LV/MYO/LA ($p>0.05$), while being 16 times smaller and 4 times faster (1.35ms vs 5.40ms per frame) than the default nnU-Net configuration. Cross-dataset evaluation on an internal dataset (N=311) confirms comparable generalization.
[92] treeX: Unsupervised Tree Instance Segmentation in Dense Forest Point Clouds
Josafat-Mattias Burmeister, Andreas Tockner, Stefan Reder, Markus Engel, Rico Richter, Jan-Peter Mund, Jürgen Döllner
Main category: cs.CV
TL;DR: Revised treeX algorithm for unsupervised tree instance segmentation from laser scanning data, offering improved runtime and accuracy with parameter presets for ground-based and UAV-borne scanning.
Details
Motivation: Deep learning methods for tree instance segmentation require large annotated datasets and substantial computational resources, creating need for resource-efficient alternatives.Method: Unsupervised method combining clustering-based stem detection with region growing for crown delineation, with two parameter presets for different laser scanning types (ground-based and UAV-borne).
Result: Achieves F1-score gains of +0.11 to +0.49 for ground-based data compared to original treeX, and F1-score of 0.58 for ULS data where original algorithm failed. Matches accuracy of recent open-source deep learning methods for TLS and PLS data.
Conclusion: Provides resource-efficient alternative to deep learning for tree segmentation and enables semi-automatic label generation for training deep learning models, with open-source Python implementation available.
Abstract: Close-range laser scanning provides detailed 3D captures of forest stands but requires efficient software for processing 3D point cloud data and extracting individual trees. Although recent studies have introduced deep learning methods for tree instance segmentation, these approaches require large annotated datasets and substantial computational resources. As a resource-efficient alternative, we present a revised version of the treeX algorithm, an unsupervised method that combines clustering-based stem detection with region growing for crown delineation. While the original treeX algorithm was developed for personal laser scanning (PLS) data, we provide two parameter presets, one for ground-based laser scanning (stationary terrestrial - TLS and PLS), and one for UAV-borne laser scanning (ULS). We evaluated the method on six public datasets (FOR-instance, ForestSemantic, LAUTx, NIBIO MLS, TreeLearn, Wytham Woods) and compared it to six open-source methods (original treeX, treeiso, RayCloudTools, ForAINet, SegmentAnyTree, TreeLearn). Compared to the original treeX algorithm, our revision reduces runtime and improves accuracy, with instance detection F$_1$-score gains of +0.11 to +0.49 for ground-based data. For ULS data, our preset achieves an F$_1$-score of 0.58, whereas the original algorithm fails to segment any correct instances. For TLS and PLS data, our algorithm achieves accuracy similar to recent open-source methods, including deep learning. Given its algorithmic design, we see two main applications for our method: (1) as a resource-efficient alternative to deep learning approaches in scenarios where the data characteristics align with the method design (sufficient stem visibility and point density), and (2) for the semi-automatic generation of labels for deep learning models. To enable broader adoption, we provide an open-source Python implementation in the pointtree package.
[93] Human Motion Video Generation: A Survey
Haiwei Xue, Xiangyang Luo, Zhanghao Hu, Xin Zhang, Xunzhi Xiang, Yuqin Dai, Jianzhuang Liu, Zhensong Zhang, Minglei Li, Jian Yang, Fei Ma, Zhiyong Wu, Changpeng Yang, Zonghong Dai, Fei Richard Yu
Main category: cs.CV
TL;DR: Comprehensive survey on human motion video generation covering 10+ sub-tasks, 5 generation phases, and 200+ papers, with first discussion of LLM potential in this field.
Details
Motivation: Existing surveys focus on individual methods but lack comprehensive overview of the entire generative process for human motion video generation.Method: In-depth survey analyzing human motion video generation through five key phases: input, motion planning, motion video generation, refinement, and output across vision, text, and audio modalities.
Result: Covers over 200 papers and provides thorough overview of latest developments, technological trends, and milestone works in the field.
Conclusion: This survey unveils prospects of human motion video generation and serves as valuable resource for advancing comprehensive applications of digital humans, with first discussion of LLM potential in this domain.
Abstract: Human motion video generation has garnered significant research interest due to its broad applications, enabling innovations such as photorealistic singing heads or dynamic avatars that seamlessly dance to music. However, existing surveys in this field focus on individual methods, lacking a comprehensive overview of the entire generative process. This paper addresses this gap by providing an in-depth survey of human motion video generation, encompassing over ten sub-tasks, and detailing the five key phases of the generation process: input, motion planning, motion video generation, refinement, and output. Notably, this is the first survey that discusses the potential of large language models in enhancing human motion video generation. Our survey reviews the latest developments and technological trends in human motion video generation across three primary modalities: vision, text, and audio. By covering over two hundred papers, we offer a thorough overview of the field and highlight milestone works that have driven significant technological breakthroughs. Our goal for this survey is to unveil the prospects of human motion video generation and serve as a valuable resource for advancing the comprehensive applications of digital humans. A complete list of the models examined in this survey is available in Our Repository https://github.com/Winn1y/Awesome-Human-Motion-Video-Generation.
[94] Reg3D: Reconstructive Geometry Instruction Tuning for 3D Scene Understanding
Hongpei Zheng, Lintao Xiang, Qijun Yang, Qian Lin, Hujun Yin
Main category: cs.CV
TL;DR: Reg3D introduces a reconstructive geometry instruction tuning framework that uses dual-supervision with 3D geometric information as both input and learning targets to improve 3D scene understanding in multimodal models.
Details
Motivation: Existing Large Multimodal Models excel at 2D visual understanding but struggle with 3D scene understanding due to reliance on text-only supervision, which lacks geometric constraints needed for robust 3D spatial representations.Method: Dual-supervision paradigm with complementary object-level and frame-level reconstruction tasks in a dual-encoder architecture, enforcing geometric consistency to develop spatial reasoning capabilities.
Result: Extensive experiments on ScanQA, Scan2Cap, ScanRefer, and SQA3D show substantial performance improvements over existing methods.
Conclusion: Reg3D establishes a new training paradigm for spatially aware multimodal models by incorporating geometry-aware supervision directly into the training process.
Abstract: The rapid development of Large Multimodal Models (LMMs) has led to remarkable progress in 2D visual understanding; however, extending these capabilities to 3D scene understanding remains a significant challenge. Existing approaches predominantly rely on text-only supervision, which fails to provide the geometric constraints required for learning robust 3D spatial representations. In this paper, we introduce Reg3D, a novel Reconstructive Geometry Instruction Tuning framework that addresses this limitation by incorporating geometry-aware supervision directly into the training process. Our key insight is that effective 3D understanding necessitates reconstructing underlying geometric structures rather than merely describing them. Unlike existing methods that inject 3D information solely at the input level, Reg3D adopts a dual-supervision paradigm that leverages 3D geometric information both as input and as explicit learning targets. Specifically, we design complementary object-level and frame-level reconstruction tasks within a dual-encoder architecture, enforcing geometric consistency to encourage the development of spatial reasoning capabilities. Extensive experiments on ScanQA, Scan2Cap, ScanRefer, and SQA3D demonstrate that Reg3D delivers substantial performance improvements, establishing a new training paradigm for spatially aware multimodal models.
[95] TEn-CATS: Text-Enriched Audio-Visual Video Parsing with Multi-Scale Category-Aware Temporal Graph
Yaru Chen, Faegheh Sardari, Peiliang Zhang, Ruohao Guo, Yang Xiang, Zhenbo Li, Wenwu Wang
Main category: cs.CV
TL;DR: Proposes BiT and CATS modules to address error amplification in Audio-Visual Video Parsing by combining semantic purification with temporal graph propagation for better event localization.
Details
Motivation: Existing AVVP methods either rely on noisy segment-level pseudo labels as reliable supervision or spread indiscriminate attention across all frames, causing initial errors to be repeatedly amplified during training.Method: Combines Bi-Directional Text Fusion (BiT) module for semantic injection and dynamic calibration of audio/visual features, and Category-Aware Temporal Graph (CATS) module for semantic propagation and connection across time.
Result: Achieves state-of-the-art performance on multiple key indicators across two benchmark datasets (LLP and UnAV-100).
Conclusion: The proposed method effectively integrates strengths of previous research directions to locate and purify cleaner semantic cues while enabling precise semantic information dissemination across time.
Abstract: Audio-Visual Video Parsing (AVVP) task aims to identify event categories and their occurrence times in a given video with weakly supervised labels. Existing methods typically fall into two categories: (i) designing enhanced architectures based on attention mechanism for better temporal modeling, and (ii) generating richer pseudo-labels to compensate for the absence of frame-level annotations. However, the first type methods treat noisy segment-level pseudo labels as reliable supervision and the second type methods let indiscriminate attention spread them across all frames, the initial errors are repeatedly amplified during training. To address this issue, we propose a method that combines the Bi-Directional Text Fusion (BiT) module and Category-Aware Temporal Graph (CATS) module. Specifically, we integrate the strengths and complementarity of the two previous research directions. We first perform semantic injection and dynamic calibration on audio and visual modality features through the BiT module, to locate and purify cleaner and richer semantic cues. Then, we leverage the CATS module for semantic propagation and connection to enable precise semantic information dissemination across time. Experimental results demonstrate that our proposed method achieves state-of-the-art (SOTA) performance in multiple key indicators on two benchmark datasets, LLP and UnAV-100.
[96] QuantV2X: A Fully Quantized Multi-Agent System for Cooperative Perception
Seth Z. Zhao, Huizhi Zhang, Zhaowei Li, Juntong Peng, Anthony Chui, Zewei Zhou, Zonglin Meng, Hao Xiang, Zhiyu Huang, Fujia Wang, Ran Tian, Chenfeng Xu, Bolei Zhou, Jiaqi Ma
Main category: cs.CV
TL;DR: QuantV2X is the first fully quantized multi-agent system for V2X cooperative perception that reduces computational/transmission costs while maintaining accuracy comparable to full-precision systems.
Details
Motivation: Existing V2X cooperative perception systems focus on accuracy but ignore efficiency, latency, and real-world deployability issues caused by high computational/transmission costs of full-precision models.Method: Introduces a unified end-to-end quantization strategy across both neural network models and transmitted message representations to reduce computational load and transmission bandwidth.
Result: Achieves accuracy comparable to full-precision systems while reducing system-level latency by 3.2x and improving mAP30 by +9.5 over full-precision baselines. Enables larger models within strict memory budgets.
Conclusion: QuantV2X demonstrates the viability of fully quantized multi-agent intermediate fusion systems for real-world V2X deployment and will be publicly released to promote research.
Abstract: Cooperative perception through Vehicle-to-Everything (V2X) communication offers significant potential for enhancing vehicle perception by mitigating occlusions and expanding the field of view. However, past research has predominantly focused on improving accuracy metrics without addressing the crucial system-level considerations of efficiency, latency, and real-world deployability. Noticeably, most existing systems rely on full-precision models, which incur high computational and transmission costs, making them impractical for real-time operation in resource-constrained environments. In this paper, we introduce \textbf{QuantV2X}, the first fully quantized multi-agent system designed specifically for efficient and scalable deployment of multi-modal, multi-agent V2X cooperative perception. QuantV2X introduces a unified end-to-end quantization strategy across both neural network models and transmitted message representations that simultaneously reduces computational load and transmission bandwidth. Remarkably, despite operating under low-bit constraints, QuantV2X achieves accuracy comparable to full-precision systems. More importantly, when evaluated under deployment-oriented metrics, QuantV2X reduces system-level latency by 3.2$\times$ and achieves a +9.5 improvement in mAP30 over full-precision baselines. Furthermore, QuantV2X scales more effectively, enabling larger and more capable models to fit within strict memory budgets. These results highlight the viability of a fully quantized multi-agent intermediate fusion system for real-world deployment. The system will be publicly released to promote research in this field: https://github.com/ucla-mobility/QuantV2X.
[97] TRUST-VL: An Explainable News Assistant for General Multimodal Misinformation Detection
Zehong Yan, Peng Qi, Wynne Hsu, Mong Li Lee
Main category: cs.CV
TL;DR: TRUST-VL is a unified vision-language model for multimodal misinformation detection that achieves state-of-the-art performance through joint training across distortion types and a novel Question-Aware Visual Amplifier module.
Details
Motivation: Multimodal misinformation (textual, visual, cross-modal distortions) amplified by generative AI poses increasing societal threats, and existing methods focus on single distortion types and struggle with generalization to unseen scenarios.Method: Introduces TRUST-VL with Question-Aware Visual Amplifier for task-specific visual features, trained on TRUST-Instruct dataset (198K samples) with structured reasoning chains aligned with human fact-checking workflows.
Result: Achieves state-of-the-art performance on both in-domain and zero-shot benchmarks, demonstrating strong generalization and interpretability capabilities.
Conclusion: Joint training across distortion types facilitates knowledge sharing and enhances generalization, making TRUST-VL an effective unified solution for multimodal misinformation detection with explainable reasoning.
Abstract: Multimodal misinformation, encompassing textual, visual, and cross-modal distortions, poses an increasing societal threat that is amplified by generative AI. Existing methods typically focus on a single type of distortion and struggle to generalize to unseen scenarios. In this work, we observe that different distortion types share common reasoning capabilities while also requiring task-specific skills. We hypothesize that joint training across distortion types facilitates knowledge sharing and enhances the model’s ability to generalize. To this end, we introduce TRUST-VL, a unified and explainable vision-language model for general multimodal misinformation detection. TRUST-VL incorporates a novel Question-Aware Visual Amplifier module, designed to extract task-specific visual features. To support training, we also construct TRUST-Instruct, a large-scale instruction dataset containing 198K samples featuring structured reasoning chains aligned with human fact-checking workflows. Extensive experiments on both in-domain and zero-shot benchmarks demonstrate that TRUST-VL achieves state-of-the-art performance, while also offering strong generalization and interpretability.
[98] Transfer Learning-Based CNN Models for Plant Species Identification Using Leaf Venation Patterns
Bandita Bharadwaj, Ankur Mishra, Saurav Bharadwaj
Main category: cs.CV
TL;DR: EfficientNetB0 outperforms ResNet50 and MobileNetV2 for plant species classification using leaf venation patterns, achieving 94.67% testing accuracy with excellent generalization capabilities.
Details
Motivation: To evaluate deep learning architectures for automated plant species classification based on leaf venation patterns, which are critical morphological features with high taxonomic relevance.Method: Used three deep learning models (ResNet50, MobileNetV2, EfficientNetB0) on the Swedish Leaf Dataset with 1,125 images from 15 species, evaluated using standard performance metrics during training and testing phases.
Result: ResNet50: 94.11% training accuracy but overfitted (88.45% testing, 87.82% F1). MobileNetV2: 93.34% testing accuracy, 93.23% F1. EfficientNetB0: 94.67% testing accuracy with precision, recall, and F1 scores all exceeding 94.6%.
Conclusion: Deep learning, particularly EfficientNetB0, shows strong potential for developing scalable and accurate automated plant taxonomy tools using venation traits, with EfficientNetB0 demonstrating superior robustness and performance.
Abstract: This study evaluates the efficacy of three deep learning architectures: ResNet50, MobileNetV2, and EfficientNetB0 for automated plant species classification based on leaf venation patterns, a critical morphological feature with high taxonomic relevance. Using the Swedish Leaf Dataset comprising images from 15 distinct species (75 images per species, totalling 1,125 images), the models were demonstrated using standard performance metrics during training and testing phases. ResNet50 achieved a training accuracy of 94.11% but exhibited overfitting, reflected by a reduced testing accuracy of 88.45% and an F1 score of 87.82%. MobileNetV2 demonstrated better generalization capabilities, attaining a testing accuracy of 93.34% and an F1 score of 93.23%, indicating its suitability for lightweight, real-time applications. EfficientNetB0 outperformed both models, achieving a testing accuracy of 94.67% with precision, recall, and F1 scores exceeding 94.6%, highlighting its robustness in venation-based classification. The findings underscore the potential of deep learning, particularly EfficientNetB0, in developing scalable and accurate tools for automated plant taxonomy using venation traits.
[99] LayoutGKN: Graph Similarity Learning of Floor Plans
Casper van Engelenburg, Jan van Gemert, Seyran Khademi
Main category: cs.CV
TL;DR: LayoutGKN is an efficient graph comparison method for floor plans that postpones cross-graph interactions to the end, using differentiable graph kernels for faster inference while maintaining comparable performance to graph matching networks.
Details
Motivation: Graph matching networks for floor plan comparison rely on costly intermediate cross-graph node-level interactions, making them slow during inference time.Method: LayoutGKN postpones cross-graph node-level interactions to the end of the joint embedding architecture and uses a differentiable graph kernel as a distance function on the final learned node-level embeddings.
Result: LayoutGKN computes similarity comparably or better than graph matching networks while significantly increasing the speed of inference.
Conclusion: The proposed LayoutGKN approach provides a more efficient alternative to traditional graph matching networks for floor plan graph comparison, achieving similar or better performance with significantly faster inference times.
Abstract: Floor plans depict building layouts and are often represented as graphs to capture the underlying spatial relationships. Comparison of these graphs is critical for applications like search, clustering, and data visualization. The most successful methods to compare graphs \ie, graph matching networks, rely on costly intermediate cross-graph node-level interactions, therefore being slow in inference time. We introduce \textbf{LayoutGKN}, a more efficient approach that postpones the cross-graph node-level interactions to the end of the joint embedding architecture. We do so by using a differentiable graph kernel as a distance function on the final learned node-level embeddings. We show that LayoutGKN computes similarity comparably or better than graph matching networks while significantly increasing the speed. \href{https://github.com/caspervanengelenburg/LayoutGKN}{Code and data} are open.
[100] Short-video Propagation Influence Rating: A New Real-world Dataset and A New Large Graph Model
Dizhan Xue, Shengsheng Qian, Chuanrui Hu, Changsheng Xu
Main category: cs.CV
TL;DR: This paper introduces a new Short-video Propagation Influence Rating (SPIR) task and proposes both a large cross-platform dataset (XS-Video) and a novel Large Graph Model (NetGPT) for predicting short-video propagation influence.
Details
Motivation: Short-video platforms have massive global popularity, and analyzing short-video propagation is crucial for understanding commercial values, public opinions, and user behaviors. However, there's a lack of large-scale cross-platform datasets and effective methods for propagation influence rating.Method: The paper proposes: 1) XS-Video dataset - 117,720 videos, 381,926 samples across 5 Chinese platforms with propagation influence annotations (levels 0-9); 2) NetGPT - a Large Graph Model using a three-stage training mechanism to bridge graph-structured data with LLMs for propagation analysis.
Result: Comprehensive experiments on the XS-Video dataset using both classification and regression metrics show that the proposed NetGPT method achieves superior performance for the SPIR task.
Conclusion: This work establishes the first large-scale cross-platform short-video dataset and demonstrates the effectiveness of the NetGPT model for predicting short-video propagation influence, advancing research in short-video analysis and propagation modeling.
Abstract: Short-video platforms have gained immense popularity, captivating the interest of millions, if not billions, of users globally. Recently, researchers have highlighted the significance of analyzing the propagation of short-videos, which typically involves discovering commercial values, public opinions, user behaviors, etc. This paper proposes a new Short-video Propagation Influence Rating (SPIR) task and aims to promote SPIR from both the dataset and method perspectives. First, we propose a new Cross-platform Short-Video (XS-Video) dataset, which aims to provide a large-scale and real-world short-video propagation network across various platforms to facilitate the research on short-video propagation. Our XS-Video dataset includes 117,720 videos, 381,926 samples, and 535 topics across 5 biggest Chinese platforms, annotated with the propagation influence from level 0 to 9. To the best of our knowledge, this is the first large-scale short-video dataset that contains cross-platform data or provides all of the views, likes, shares, collects, fans, comments, and comment content. Second, we propose a Large Graph Model (LGM) named NetGPT, based on a novel three-stage training mechanism, to bridge heterogeneous graph-structured data with the powerful reasoning ability and knowledge of Large Language Models (LLMs). Our NetGPT can comprehend and analyze the short-video propagation graph, enabling it to predict the long-term propagation influence of short-videos. Comprehensive experimental results evaluated by both classification and regression metrics on our XS-Video dataset indicate the superiority of our method for SPIR.
[101] Singular Value Few-shot Adaptation of Vision-Language Models
Taha Koleilat, Hassan Rivaz, Yiming Xiao
Main category: cs.CV
TL;DR: CLIP-SVD is a parameter-efficient adaptation method that uses Singular Value Decomposition to fine-tune only 0.04% of CLIP’s parameters for better domain adaptation while preserving generalization.
Details
Motivation: Adapting vision-language models like CLIP to new domains is challenging due to reliance on prompt engineering and high cost of full fine-tuning, with existing methods compromising pretrained knowledge.Method: Uses SVD to modify CLIP’s internal parameter space by fine-tuning only singular values to rescale basis vectors for domain adaptation, without adding new modules.
Result: Achieves state-of-the-art classification on 11 natural and 10 biomedical datasets, outperforming previous methods in accuracy and generalization under few-shot settings.
Conclusion: CLIP-SVD provides an effective, parameter-efficient adaptation technique that preserves model generalization while enabling interpretability through language-based analysis.
Abstract: Vision-language models (VLMs) like CLIP have shown impressive zero-shot and few-shot learning capabilities across diverse applications. However, adapting these models to new fine-grained domains remains difficult due to reliance on prompt engineering and the high cost of full model fine-tuning. Existing adaptation approaches rely on augmented components, such as prompt tokens and adapter modules, which could limit adaptation quality, destabilize the model, and compromise the rich knowledge learned during pretraining. In this work, we present \textbf{CLIP-SVD}, a novel \textit{multi-modal} and \textit{parameter-efficient} adaptation technique that leverages Singular Value Decomposition (SVD) to modify the internal parameter space of CLIP without injecting additional modules. Specifically, we fine-tune only the singular values of the CLIP parameter matrices to rescale the basis vectors for domain adaptation while retaining the pretrained model. This design enables enhanced adaptation performance using only \textbf{0.04%} of the model’s total parameters and better preservation of its generalization ability. CLIP-SVD achieves state-of-the-art classification results on 11 natural and 10 biomedical datasets, outperforming previous methods in both accuracy and generalization under few-shot settings. Additionally, we leverage a natural language-based approach to analyze the effectiveness and dynamics of the CLIP adaptation to allow interpretability of CLIP-SVD. The code is publicly available at https://github.com/HealthX-Lab/CLIP-SVD.
[102] STA-Net: A Decoupled Shape and Texture Attention Network for Lightweight Plant Disease Classification
Zongsen Qiu
Main category: cs.CV
TL;DR: Proposed STA-Net with Shape-Texture Attention Module (STAM) for efficient plant disease diagnosis on edge devices, achieving 89% accuracy with only 401K parameters.
Details
Motivation: Address the challenge of deploying high-precision plant disease diagnosis models on edge devices, as existing attention mechanisms designed for generic object recognition fail to capture subtle pathological features like irregular lesion shapes and complex textures.Method: Twofold approach: 1) Training-free neural architecture search (DeepMAD) for efficient network backbone; 2) Shape-Texture Attention Module (STAM) with two branches - deformable convolutions (DCNv4) for shape awareness and Gabor filter bank for texture awareness.
Result: On CCMT plant disease dataset: 89.00% accuracy and 88.96% F1 score with only 401K parameters and 51.1M FLOPs. STAM significantly outperforms baseline and standard attention models in ablation studies.
Conclusion: Integrating domain knowledge through decoupled attention (shape and texture) provides an effective approach for edge-deployed precision agriculture AI systems.
Abstract: Responding to rising global food security needs, precision agriculture and deep learning-based plant disease diagnosis have become crucial. Yet, deploying high-precision models on edge devices is challenging. Most lightweight networks use attention mechanisms designed for generic object recognition, which poorly capture subtle pathological features like irregular lesion shapes and complex textures. To overcome this, we propose a twofold solution: first, using a training-free neural architecture search method (DeepMAD) to create an efficient network backbone for edge devices; second, introducing the Shape-Texture Attention Module (STAM). STAM splits attention into two branches – one using deformable convolutions (DCNv4) for shape awareness and the other using a Gabor filter bank for texture awareness. On the public CCMT plant disease dataset, our STA-Net model (with 401K parameters and 51.1M FLOPs) reached 89.00% accuracy and an F1 score of 88.96%. Ablation studies confirm STAM significantly improves performance over baseline and standard attention models. Integrating domain knowledge via decoupled attention thus presents a promising path for edge-deployed precision agriculture AI. The source code is available at https://github.com/RzMY/STA-Net.
[103] From Embeddings to Accuracy: Comparing Foundation Models for Radiographic Classification
Xue Li, Jameson Merkow, Noel C. F. Codella, Alberto Santamaria-Pang, Naiteek Sangani, Alexander Ersoy, Christopher Burt, John W. Garrett, Richard J. Bruce, Joshua D. Warner, Tyler Bradshaw, Ivan Tarapov, Matthew P. Lungren, Alan B. McMillan
Main category: cs.CV
TL;DR: MedImageInsight embeddings with SVM/MLP adapters achieved 93.1% mAUC for radiography classification, outperforming other foundation models while being computationally efficient and equitable across patient demographics.
Details
Motivation: To evaluate foundation model embeddings for medical imaging tasks and develop lightweight, efficient adapters that can be practically deployed in clinical settings for multi-class radiography classification.Method: Used embeddings from 7 foundation models (general and medical-specific) to train lightweight adapters (KNN, logistic regression, SVM, random forest, MLP) on 8,842 radiographs across 7 classes, with performance evaluation and fairness analysis.
Result: MedImageInsight embeddings with SVM/MLP achieved highest mAUC of 93.1%, statistically superior to other models. Lightweight adapters trained in minutes and performed inference in seconds on CPU. Minimal performance disparities across gender (within 1.8%) and age groups (std. dev < 1.4%).
Conclusion: Specialized foundation model embeddings, particularly MedImageInsight, enable accurate, efficient, and equitable diagnostic tools using simple lightweight adapters, making them practical for clinical deployment.
Abstract: Foundation models provide robust embeddings for diverse tasks, including medical imaging. We evaluate embeddings from seven general and medical-specific foundation models (e.g., DenseNet121, BiomedCLIP, MedImageInsight, Rad-DINO, CXR-Foundation) for training lightweight adapters in multi-class radiography classification. Using a dataset of 8,842 radiographs across seven classes, we trained adapters with algorithms like K-Nearest Neighbors, logistic regression, SVM, random forest, and MLP. The combination of MedImageInsight embeddings with an SVM or MLP adapter achieved the highest mean area under the curve (mAUC) of 93.1%. This performance was statistically superior to other models, including MedSigLIP with an MLP (91.0%), Rad-DINO with an SVM (90.7%), and CXR-Foundation with logistic regression (88.6%). In contrast, models like BiomedCLIP (82.8%) and Med-Flamingo (78.5%) showed lower performance. Crucially, these lightweight adapters are computationally efficient, training in minutes and performing inference in seconds on a CPU, making them practical for clinical use. A fairness analysis of the top-performing MedImageInsight adapter revealed minimal performance disparities across patient gender (within 1.8%) and age groups (std. dev < 1.4%), with no significant statistical differences. These findings confirm that embeddings from specialized foundation models, particularly MedImageInsight, can power accurate, efficient, and equitable diagnostic tools using simple, lightweight adapters.
[104] SLENet: A Guidance-Enhanced Network for Underwater Camouflaged Object Detection
Xinxin Wang, Han Sun, Ningzhong Liu, Huiyu Zhou, Yinan Yao
Main category: cs.CV
TL;DR: Introduces DeepCamo dataset and SLENet framework for underwater camouflaged object detection, addressing optical distortions and water turbidity challenges with novel enhancement and localization modules.
Details
Motivation: Underwater camouflaged object detection is critical for marine ecology but remains underexplored due to optical distortions, water turbidity, and complex marine organism traits that hinder accurate identification.Method: Proposes Semantic Localization and Enhancement Network (SLENet) with Gamma-Asymmetric Enhancement module and Localization Guidance Branch to enhance multi-scale features and generate semantic-rich location maps, guided by Multi-Scale Supervised Decoder.
Result: Experiments on DeepCamo dataset and three benchmark COD datasets show SLENet achieves superior performance over state-of-the-art methods and demonstrates high generality for broader COD tasks.
Conclusion: SLENet effectively addresses underwater camouflaged object detection challenges and establishes a strong benchmark for this domain with promising generalization capabilities.
Abstract: Underwater Camouflaged Object Detection (UCOD) aims to identify objects that blend seamlessly into underwater environments. This task is critically important to marine ecology. However, it remains largely underexplored and accurate identification is severely hindered by optical distortions, water turbidity, and the complex traits of marine organisms. To address these challenges, we introduce the UCOD task and present DeepCamo, a benchmark dataset designed for this domain. We also propose Semantic Localization and Enhancement Network (SLENet), a novel framework for UCOD. We first benchmark state-of-the-art COD models on DeepCamo to reveal key issues, upon which SLENet is built. In particular, we incorporate Gamma-Asymmetric Enhancement (GAE) module and a Localization Guidance Branch (LGB) to enhance multi-scale feature representation while generating a location map enriched with global semantic information. This map guides the Multi-Scale Supervised Decoder (MSSD) to produce more accurate predictions. Experiments on our DeepCamo dataset and three benchmark COD datasets confirm SLENet’s superior performance over SOTA methods, and underscore its high generality for the broader COD task.
[105] Towards Controllable Real Image Denoising with Camera Parameters
Youngjin Oh, Junhyeong Kwon, Keuntek Lee, Nam Ik Cho
Main category: cs.CV
TL;DR: A controllable image denoising framework that uses camera parameters (ISO, shutter speed, F-number) to adaptively adjust denoising strength and improve performance.
Details
Motivation: Existing deep learning denoising methods lack flexibility to adjust denoising strength based on noise levels, camera settings, and user preferences.Method: Convert camera parameters (ISO, shutter speed, F-number) into a vector to control and enhance standard denoising neural networks.
Result: Experimental results show the method seamlessly adds controllability to standard denoising networks and improves their performance.
Conclusion: The framework provides adaptive noise removal using camera parameter information, offering flexible denoising control while enhancing network performance.
Abstract: Recent deep learning-based image denoising methods have shown impressive performance; however, many lack the flexibility to adjust the denoising strength based on the noise levels, camera settings, and user preferences. In this paper, we introduce a new controllable denoising framework that adaptively removes noise from images by utilizing information from camera parameters. Specifically, we focus on ISO, shutter speed, and F-number, which are closely related to noise levels. We convert these selected parameters into a vector to control and enhance the performance of the denoising network. Experimental results show that our method seamlessly adds controllability to standard denoising neural networks and improves their performance. Code is available at https://github.com/OBAKSA/CPADNet.
[106] Fitting Image Diffusion Models on Video Datasets
Juhun Lee, Simon S. Woo
Main category: cs.CV
TL;DR: A simple training strategy that leverages temporal coherence in video frames to improve diffusion model training, achieving 2x faster convergence and better performance without architectural changes.
Details
Motivation: Standard diffusion models are trained on static images, which is information-deficient for capturing temporal dynamics, leading to slower convergence, limited distribution coverage, and reduced generalization.Method: A training strategy that incorporates temporal inductive bias from continuous video frames into diffusion training, requiring no architectural modifications and being seamlessly integrable into standard pipelines.
Result: 2x faster convergence, lower FID scores on both training and validation distributions, improved generative diversity by capturing meaningful temporal variations, and reduced gradient variance for optimization stability.
Conclusion: Leveraging temporal coherence in video data significantly improves diffusion model training efficiency and performance, demonstrating the value of temporal inductive bias even for static image generation tasks.
Abstract: Image diffusion models are trained on independently sampled static images. While this is the bedrock task protocol in generative modeling, capturing the temporal world through the lens of static snapshots is information-deficient by design. This limitation leads to slower convergence, limited distributional coverage, and reduced generalization. In this work, we propose a simple and effective training strategy that leverages the temporal inductive bias present in continuous video frames to improve diffusion training. Notably, the proposed method requires no architectural modification and can be seamlessly integrated into standard diffusion training pipelines. We evaluate our method on the HandCo dataset, where hand-object interactions exhibit dense temporal coherence and subtle variations in finger articulation often result in semantically distinct motions. Empirically, our method accelerates convergence by over 2$\text{x}$ faster and achieves lower FID on both training and validation distributions. It also improves generative diversity by encouraging the model to capture meaningful temporal variations. We further provide an optimization analysis showing that our regularization reduces the gradient variance, which contributes to faster convergence.
[107] MedVista3D: Vision-Language Modeling for Reducing Diagnostic Errors in 3D CT Disease Detection, Understanding and Reporting
Yuheng Li, Yenho Chen, Yuxiang Lai, Jike Zhong, Vanessa Wildman, Xiaofeng Yang
Main category: cs.CV
TL;DR: MedVista3D is a multi-scale vision-language framework for 3D CT analysis that addresses radiologic diagnostic errors through joint local-global understanding and semantic-enriched report alignment, achieving SOTA performance across multiple tasks.
Details
Motivation: Radiologic diagnostic errors remain prevalent due to missed localized abnormalities, limited global context in 3D imaging, and variability in report language. Existing 3D vision-language models cannot jointly address precise detection, global reasoning, and consistent reporting needs.Method: Multi-scale semantic-enriched pretraining framework performing local and global image-text alignment for fine-grained representation learning. Uses language model rewrites and Radiology Semantic Matching Bank for semantics-aware alignment to address report variability.
Result: Achieves state-of-the-art performance on zero-shot disease classification, report retrieval, and medical visual question answering. Also transfers well to organ segmentation and prognosis prediction tasks.
Conclusion: MedVista3D effectively addresses key challenges in 3D radiology analysis by enabling joint disease detection and holistic interpretation while handling report variability, demonstrating strong performance across multiple clinical applications.
Abstract: Radiologic diagnostic errors-under-reading errors, inattentional blindness, and communication failures-remain prevalent in clinical practice. These issues often stem from missed localized abnormalities, limited global context, and variability in report language. These challenges are amplified in 3D imaging, where clinicians must examine hundreds of slices per scan. Addressing them requires systems with precise localized detection, global volume-level reasoning, and semantically consistent natural language reporting. However, existing 3D vision-language models are unable to meet all three needs jointly, lacking local-global understanding for spatial reasoning and struggling with the variability and noise of uncurated radiology reports. We present MedVista3D, a multi-scale semantic-enriched vision-language pretraining framework for 3D CT analysis. To enable joint disease detection and holistic interpretation, MedVista3D performs local and global image-text alignment for fine-grained representation learning within full-volume context. To address report variability, we apply language model rewrites and introduce a Radiology Semantic Matching Bank for semantics-aware alignment. MedVista3D achieves state-of-the-art performance on zero-shot disease classification, report retrieval, and medical visual question answering, while transferring well to organ segmentation and prognosis prediction. Code and datasets will be released.
[108] Causality-guided Prompt Learning for Vision-language Models via Visual Granulation
Mengyu Gao, Qiulei Dong
Main category: cs.CV
TL;DR: CaPL is a causality-guided text prompt learning method that uses visual granulation to improve CLIP’s performance on fine-grained recognition tasks by capturing subtle class differences through causal inference.
Details
Motivation: Existing CLIP-based prompt learning methods show limited ability in handling fine-grained datasets, failing to capture subtle discrepancies between similar classes.Method: Two modules: 1) Attribute disentanglement using Brownian Bridge Diffusion Model to separate shared and class-specific attributes; 2) Granule learning module that constructs visual granules by integrating attributes under two causal inference strategies.
Result: Extensive experiments on 15 datasets show CaPL significantly outperforms state-of-the-art prompt learning methods, especially on fine-grained datasets.
Conclusion: Visual granulation through causal inference enables more discriminative text prompts, effectively addressing fine-grained recognition challenges in CLIP-based models.
Abstract: Prompt learning has recently attracted much attention for adapting pre-trained vision-language models (e.g., CLIP) to downstream recognition tasks. However, most of the existing CLIP-based prompt learning methods only show a limited ability for handling fine-grained datasets. To address this issue, we propose a causality-guided text prompt learning method via visual granulation for CLIP, called CaPL, where the explored visual granulation technique could construct sets of visual granules for the text prompt to capture subtle discrepancies among different fine-grained classes through casual inference. The CaPL method contains the following two modules: (1) An attribute disentanglement module is proposed to decompose visual features into non-individualized attributes (shared by some classes) and individualized attributes (specific to single classes) using a Brownian Bridge Diffusion Model; (2) A granule learning module is proposed to construct visual granules by integrating the aforementioned attributes for recognition under two causal inference strategies. Thanks to the learned visual granules, more discriminative text prompt is expected to be learned. Extensive experimental results on 15 datasets demonstrate that our CaPL method significantly outperforms the state-of-the-art prompt learning methods, especially on fine-grained datasets.
[109] EGTM: Event-guided Efficient Turbulence Mitigation
Huanan Li, Rui Fan, Juntao Guan, Weidong Hao, Lai Rui, Tong Wu, Yikai Wang, Lin Gu
Main category: cs.CV
TL;DR: Proposes EGTM framework using event cameras for turbulence mitigation, achieving 710x smaller model size and 214x faster inference while improving restoration quality by +0.94 PSNR compared to state-of-the-art methods.
Details
Motivation: Existing deep-learning turbulence mitigation methods require high-capacity networks to learn from coarse-grained turbulence dynamics between synchronous frames with limited frame-rate, resulting in poor computational and storage efficiency.Method: Presents ’event-lucky insight’ revealing correlation between turbulence distortions and event streams, then proposes EGTM framework that extracts pixel-level turbulence-free guidance from turbulent events for temporal lucky fusion. Also builds first real-world event-driven turbulence dataset.
Result: Significantly surpasses existing SOTA by 710x in model size, 214x in inference latency, and 224x in model complexity while achieving state-of-the-art restoration quality (+0.94 PSNR and +0.08 SSIM) on real-world dataset.
Conclusion: Demonstrates the great efficiency merit of introducing event modality into turbulence mitigation tasks, with event cameras’ microsecond-level temporal resolution fundamentally addressing computational bottlenecks.
Abstract: Turbulence mitigation (TM) aims to remove the stochastic distortions and blurs introduced by atmospheric turbulence into frame cameras. Existing state-of-the-art deep-learning TM methods extract turbulence cues from multiple degraded frames to find the so-called “lucky’’, not distorted patch, for “lucky fusion’’. However, it requires high-capacity network to learn from coarse-grained turbulence dynamics between synchronous frames with limited frame-rate, thus fall short in computational and storage efficiency. Event cameras, with microsecond-level temporal resolution, have the potential to fundamentally address this bottleneck with efficient sparse and asynchronous imaging mechanism. In light of this, we (i) present the fundamental \textbf{``event-lucky insight’’} to reveal the correlation between turbulence distortions and inverse spatiotemporal distribution of event streams. Then, build upon this insight, we (ii) propose a novel EGTM framework that extracts pixel-level reliable turbulence-free guidance from the explicit but noisy turbulent events for temporal lucky fusion. Moreover, we (iii) build the first turbulence data acquisition system to contribute the first real-world event-driven TM dataset. Extensive experimental results demonstrate that our approach significantly surpass the existing SOTA TM method by 710 times, 214 times and 224 times in model size, inference latency and model complexity respectively, while achieving the state-of-the-art in restoration quality (+0.94 PSNR and +0.08 SSIM) on our real-world EGTM dataset. This demonstrating the great efficiency merit of introducing event modality into TM task. Demo code and data have been uploaded in supplementary material and will be released once accepted.
[110] Focus Through Motion: RGB-Event Collaborative Token Sparsification for Efficient Object Detection
Nan Yang, Yang Wang, Zhanwen Liu, Yuchao Dai, Yang Liu, Xiangmo Zhao
Main category: cs.CV
TL;DR: FocusMamba is a novel RGB-Event detection method that performs adaptive collaborative sparsification of multimodal features to reduce computational costs while maintaining high accuracy by discarding low-information regions guided by event data.
Details
Motivation: Existing RGB-Event detection methods process both high and low-information regions uniformly, leading to computational redundancy and suboptimal performance. Fixed token sparsification methods fail to adapt to varying sample complexity.Method: Proposes Event-Guided Multimodal Sparsification (EGMS) to identify and adaptively discard low-information regions using event camera data, and Cross-Modality Focus Fusion (CMFF) to efficiently integrate complementary features from both modalities.
Result: Experiments on DSEC-Det and PKU-DAVIS-SOD datasets show superior performance in both accuracy and efficiency compared to existing methods.
Conclusion: FocusMamba achieves better balance between accuracy and efficiency through adaptive collaborative sparsification and effective multimodal fusion, demonstrating state-of-the-art performance in RGB-Event detection tasks.
Abstract: Existing RGB-Event detection methods process the low-information regions of both modalities (background in images and non-event regions in event data) uniformly during feature extraction and fusion, resulting in high computational costs and suboptimal performance. To mitigate the computational redundancy during feature extraction, researchers have respectively proposed token sparsification methods for the image and event modalities. However, these methods employ a fixed number or threshold for token selection, hindering the retention of informative tokens for samples with varying complexity. To achieve a better balance between accuracy and efficiency, we propose FocusMamba, which performs adaptive collaborative sparsification of multimodal features and efficiently integrates complementary information. Specifically, an Event-Guided Multimodal Sparsification (EGMS) strategy is designed to identify and adaptively discard low-information regions within each modality by leveraging scene content changes perceived by the event camera. Based on the sparsification results, a Cross-Modality Focus Fusion (CMFF) module is proposed to effectively capture and integrate complementary features from both modalities. Experiments on the DSEC-Det and PKU-DAVIS-SOD datasets demonstrate that the proposed method achieves superior performance in both accuracy and efficiency compared to existing methods. The code will be available at https://github.com/Zizzzzzzz/FocusMamba.
[111] SalientFusion: Context-Aware Compositional Zero-Shot Food Recognition
Jiajun Song, Xiaoou Liu
Main category: cs.CV
TL;DR: Proposes Compositional Zero-Shot Food Recognition (CZSFR) with SalientFusion method to address background redundancy, role confusion, and semantic bias challenges in food recognition.
Details
Motivation: Food recognition needs methods for recognizing unseen food categories (Zero-Shot Food Learning), but faces challenges with background distractions, role confusion between dishes, and semantic bias in attributes.Method: SalientFusion with two components: SalientFormer (removes background redundancy, uses depth features to resolve role confusion) and DebiasAT (reduces semantic bias by aligning prompts with visual features).
Result: Achieves state-of-the-art results on proposed benchmarks CZSFood-90 and CZSFood-164, as well as popular general datasets for Compositional Zero-Shot Learning.
Conclusion: The proposed method effectively addresses key challenges in zero-shot food recognition and demonstrates superior performance on both food-specific and general benchmarks.
Abstract: Food recognition has gained significant attention, but the rapid emergence of new dishes requires methods for recognizing unseen food categories, motivating Zero-Shot Food Learning (ZSFL). We propose the task of Compositional Zero-Shot Food Recognition (CZSFR), where cuisines and ingredients naturally align with attributes and objects in Compositional Zero-Shot learning (CZSL). However, CZSFR faces three challenges: (1) Redundant background information distracts models from learning meaningful food features, (2) Role confusion between staple and side dishes leads to misclassification, and (3) Semantic bias in a single attribute can lead to confusion of understanding. Therefore, we propose SalientFusion, a context-aware CZSFR method with two components: SalientFormer, which removes background redundancy and uses depth features to resolve role confusion; DebiasAT, which reduces the semantic bias by aligning prompts with visual features. Using our proposed benchmarks, CZSFood-90 and CZSFood-164, we show that SalientFusion achieves state-of-the-art results on these benchmarks and the most popular general datasets for the general CZSL. The code is avaliable at https://github.com/Jiajun-RUC/SalientFusion.
[112] OccTENS: 3D Occupancy World Model via Temporal Next-Scale Prediction
Bu Jin, Songen Gu, Xiaotao Hu, Yupeng Zheng, Xiaoyang Guo, Qian Zhang, Xiaoxiao Long, Wei Yin
Main category: cs.CV
TL;DR: OccTENS is a generative occupancy world model that uses temporal next-scale prediction (TENS) and TensFormer architecture to enable efficient, high-fidelity long-term 3D occupancy generation with pose controllability.
Details
Motivation: Existing autoregressive approaches for occupancy world modeling suffer from inefficiency, temporal degradation in long-term generation, and lack of controllability, which need to be addressed holistically.Method: Reformulates occupancy modeling as temporal next-scale prediction (TENS) task, decomposing temporal sequence modeling into spatial scale-by-scale generation and temporal scene-by-scene prediction using TensFormer architecture with holistic pose aggregation strategy.
Result: Outperforms state-of-the-art methods with both higher occupancy quality and faster inference time.
Conclusion: OccTENS provides an effective solution for controllable, high-fidelity long-term occupancy generation while maintaining computational efficiency through its TENS formulation and TensFormer architecture.
Abstract: In this paper, we propose OccTENS, a generative occupancy world model that enables controllable, high-fidelity long-term occupancy generation while maintaining computational efficiency. Different from visual generation, the occupancy world model must capture the fine-grained 3D geometry and dynamic evolution of the 3D scenes, posing great challenges for the generative models. Recent approaches based on autoregression (AR) have demonstrated the potential to predict vehicle movement and future occupancy scenes simultaneously from historical observations, but they typically suffer from \textbf{inefficiency}, \textbf{temporal degradation} in long-term generation and \textbf{lack of controllability}. To holistically address these issues, we reformulate the occupancy world model as a temporal next-scale prediction (TENS) task, which decomposes the temporal sequence modeling problem into the modeling of spatial scale-by-scale generation and temporal scene-by-scene prediction. With a \textbf{TensFormer}, OccTENS can effectively manage the temporal causality and spatial relationships of occupancy sequences in a flexible and scalable way. To enhance the pose controllability, we further propose a holistic pose aggregation strategy, which features a unified sequence modeling for occupancy and ego-motion. Experiments show that OccTENS outperforms the state-of-the-art method with both higher occupancy quality and faster inference time.
[113] Weakly-Supervised Learning of Dense Functional Correspondences
Stefan Stojanov, Linan Zhao, Yunzhi Zhang, Daniel L. K. Yamins, Jiajun Wu
Main category: cs.CV
TL;DR: A weakly-supervised method for dense functional correspondence across object categories using vision-language models to pseudo-label functional parts and dense contrastive learning.
Details
Motivation: Object function guides correspondences across different categories since functional parts share shape/appearance similarities, enabling better shape reconstruction and robot manipulation.Method: Leverage vision-language models to pseudo-label multi-view images for functional parts, then integrate with dense contrastive learning from pixel correspondences to distill functional and spatial knowledge.
Result: Outperforms baseline solutions using off-the-shelf self-supervised image representations and grounded vision language models on synthetic and real evaluation datasets.
Conclusion: The approach successfully establishes dense functional correspondence by combining vision-language pseudo-labeling with contrastive learning, demonstrating advantages over existing methods.
Abstract: Establishing dense correspondences across image pairs is essential for tasks such as shape reconstruction and robot manipulation. In the challenging setting of matching across different categories, the function of an object, i.e., the effect that an object can cause on other objects, can guide how correspondences should be established. This is because object parts that enable specific functions often share similarities in shape and appearance. We derive the definition of dense functional correspondence based on this observation and propose a weakly-supervised learning paradigm to tackle the prediction task. The main insight behind our approach is that we can leverage vision-language models to pseudo-label multi-view images to obtain functional parts. We then integrate this with dense contrastive learning from pixel correspondences to distill both functional and spatial knowledge into a new model that can establish dense functional correspondence. Further, we curate synthetic and real evaluation datasets as task benchmarks. Our results demonstrate the advantages of our approach over baseline solutions consisting of off-the-shelf self-supervised image representations and grounded vision language models.
[114] Attn-Adapter: Attention Is All You Need for Online Few-shot Learner of Vision-Language Model
Phuoc-Nguyen Bui, Khanh-Binh Nguyen, Hyunseung Choo
Main category: cs.CV
TL;DR: Attn-Adapter is an online few-shot learning framework that enhances CLIP’s adaptability through dual attention mechanisms, enabling dynamic adaptation from few labeled samples without retraining the base model.
Details
Motivation: Contrastive vision-language models like CLIP excel in zero-shot recognition but struggle with few-shot scenarios due to computationally intensive offline fine-tuning and overfitting risks.Method: Proposes a dual attention mechanism with Memory Attn-Adapter (refines category embeddings using support examples) and Local-Global Attn-Adapter (enriches image embeddings by integrating local and global features).
Result: Outperforms state-of-the-art methods in cross-category and cross-dataset generalization while maintaining efficient inference and scaling across CLIP backbones.
Conclusion: Attn-Adapter provides an effective online few-shot learning solution that enhances CLIP’s adaptability without retraining, addressing limitations of traditional prompt learning approaches.
Abstract: Contrastive vision-language models excel in zero-shot image recognition but face challenges in few-shot scenarios due to computationally intensive offline fine-tuning using prompt learning, which risks overfitting. To overcome these limitations, we propose Attn-Adapter, a novel online few-shot learning framework that enhances CLIP’s adaptability via a dual attention mechanism. Our design incorporates dataset-specific information through two components: the Memory Attn-Adapter, which refines category embeddings using support examples, and the Local-Global Attn-Adapter, which enriches image embeddings by integrating local and global features. This architecture enables dynamic adaptation from a few labeled samples without retraining the base model. Attn-Adapter outperforms state-of-the-art methods in cross-category and cross-dataset generalization, maintaining efficient inference and scaling across CLIP backbones.
[115] SPECS: Specificity-Enhanced CLIP-Score for Long Image Caption Evaluation
Xiaofu Chen, Israfel Salazar, Yova Kementchedjhieva
Main category: cs.CV
TL;DR: SPECS is a new reference-free evaluation metric for long image captions that combines CLIP’s efficiency with improved specificity scoring, matching LLM-based metrics’ correlation with human judgments while being much faster.
Details
Motivation: Existing evaluation metrics for long image captions have limitations: n-gram metrics lack semantic understanding, representational similarity metrics have low human correlation, and LLM-based metrics are too expensive for iterative development.Method: SPECS modifies CLIP with a new objective that emphasizes specificity - rewarding correct details and penalizing incorrect ones in long image captions, creating a reference-free representational similarity metric.
Result: SPECS matches the performance of open-source LLM-based metrics in correlation to human judgments while being significantly more computationally efficient.
Conclusion: SPECS provides a practical alternative for iterative checkpoint evaluation during image captioning model development, balancing accuracy and efficiency.
Abstract: As interest grows in generating long, detailed image captions, standard evaluation metrics become increasingly unreliable. N-gram-based metrics, though efficient, fail to capture semantic correctness. Representational Similarity (RS) metrics, designed to address this, initially saw limited use due to high computational costs, while today, despite advances in hardware, they remain unpopular due to low correlation to human judgments. Meanwhile, metrics based on large language models (LLMs) show strong correlation with human judgments, but remain too expensive for iterative use during model development. We introduce SPECS (Specificity-Enhanced CLIPScore), a reference-free RS metric tailored to long image captioning. SPECS modifies CLIP with a new objective that emphasizes specificity: rewarding correct details and penalizing incorrect ones. We show that SPECS matches the performance of open-source LLM-based metrics in correlation to human judgments, while being far more efficient. This makes it a practical alternative for iterative checkpoint evaluation during image captioning model development. Our code can be found at https://github.com/mbzuai-nlp/SPECS.
[116] A Generative Foundation Model for Chest Radiography
Yuanfeng Ji, Dan Lin, Xiyue Wang, Lu Zhang, Wenhui Zhou, Chongjian Ge, Ruihang Chu, Xiaoli Yang, Junhan Zhao, Junsong Chen, Xiangde Luo, Sen Yang, Jin Fang, Ping Luo, Ruijiang Li
Main category: cs.CV
TL;DR: ChexGen is a generative vision-language foundation model for chest X-ray synthesis that uses text, masks, and bounding boxes to generate diverse medical images, trained on 960,000 radiograph-report pairs to improve AI model performance and fairness.
Details
Motivation: Address the scarcity of well-annotated diverse medical images for developing reliable AI models in healthcare by leveraging generative foundation model advances from natural images.Method: Developed ChexGen using latent diffusion transformer architecture, pretrained on 960,000 chest X-ray and report pairs, enabling text-, mask-, and bounding box-guided synthesis of radiographs.
Result: Achieved accurate radiograph synthesis validated by expert evaluations and quantitative metrics. Demonstrated utility for data augmentation and supervised pretraining, improving performance across disease classification, detection, and segmentation tasks with minimal training data. Enabled creation of diverse patient cohorts to detect and mitigate demographic biases.
Conclusion: Generative foundation models like ChexGen play a transformative role in building more accurate, data-efficient, and equitable medical AI systems.
Abstract: The scarcity of well-annotated diverse medical images is a major hurdle for developing reliable AI models in healthcare. Substantial technical advances have been made in generative foundation models for natural images. Here we develop ChexGen, a generative vision-language foundation model that introduces a unified framework for text-, mask-, and bounding box-guided synthesis of chest radiographs. Built upon the latent diffusion transformer architecture, ChexGen was pretrained on the largest curated chest X-ray dataset to date, consisting of 960,000 radiograph-report pairs. ChexGen achieves accurate synthesis of radiographs, as validated by expert evaluations and quantitative metrics. We demonstrate the utility of ChexGen for training data augmentation and supervised pretraining, which led to performance improvements across disease classification, detection, and segmentation tasks using a small fraction of training data. Further, our model enables the creation of diverse patient cohorts that enhance model fairness by detecting and mitigating demographic biases. Our study supports the transformative role of generative foundation models in building more accurate, data-efficient, and equitable medical AI systems.
[117] LMVC: An End-to-End Learned Multiview Video Coding Framework
Xihua Sheng, Yingwen Zhang, Long Xu, Shiqi Wang
Main category: cs.CV
TL;DR: Proposes an end-to-end learned multiview video coding framework that leverages inter-view motion and content correlations to significantly improve compression efficiency while maintaining random access and backward compatibility.
Details
Motivation: Multiview video enables immersive 3D scene reconstruction but has massive data volume, creating storage and transmission challenges. While deep learning-based video coding has succeeded for single-view/stereo videos, general multiview scenarios remain underexplored.Method: Proposes feature-based inter-view motion vector prediction using decoded independent-view motion features, inter-view motion entropy model, disparity-free inter-view context prediction from independent-view content features, and inter-view contextual entropy model.
Result: Outperforms the reference software of traditional MV-HEVC standard by a large margin.
Conclusion: Establishes a strong baseline for future research in learned multiview video coding, demonstrating significant compression efficiency improvements while maintaining essential features like random access and backward compatibility.
Abstract: Multiview video is a key data source for volumetric video, enabling immersive 3D scene reconstruction but posing significant challenges in storage and transmission due to its massive data volume. Recently, deep learning-based end-to-end video coding has achieved great success, yet most focus on single-view or stereo videos, leaving general multiview scenarios underexplored. This paper proposes an end-to-end learned multiview video coding (LMVC) framework that ensures random access and backward compatibility while enhancing compression efficiency. Our key innovation lies in effectively leveraging independent-view motion and content information to enhance dependent-view compression. Specifically, to exploit the inter-view motion correlation, we propose a feature-based inter-view motion vector prediction method that conditions dependent-view motion encoding on decoded independent-view motion features, along with an inter-view motion entropy model that learns inter-view motion priors. To exploit the inter-view content correlation, we propose a disparity-free inter-view context prediction module that predicts inter-view contexts from decoded independent-view content features, combined with an inter-view contextual entropy model that captures inter-view context priors. Experimental results show that our proposed LMVC framework outperforms the reference software of the traditional MV-HEVC standard by a large margin, establishing a strong baseline for future research in this field.
[118] TopoSculpt: Betti-Steered Topological Sculpting of 3D Fine-grained Tubular Shapes
Minghui Zhang, Yaoyu Liu, Junyang Wu, Xin You, Hanxiao Zhang, Junjun He, Yun Gu
Main category: cs.CV
TL;DR: TopoSculpt is a novel framework for topological refinement of 3D tubular anatomical structures that addresses limitations of existing methods by using holistic modeling, topological integrity constraints, and curriculum refinement to achieve significant improvements in both geometry and topology.
Details
Motivation: Existing methods for reconstructing medical tubular structures rely on voxel-wise overlap measures that fail to capture topological correctness and completeness. Current topology-aware approaches are patch-wise and cannot guarantee global preservation or correct geometric errors during inference.Method: TopoSculpt uses: (i) holistic whole-region modeling to capture full spatial context, (ii) Topological Integrity Betti (TIB) constraint that jointly enforces Betti number priors and global integrity, and (iii) curriculum refinement scheme with persistent homology to progressively correct errors from coarse to fine scales.
Result: Substantial improvements on pulmonary airway and Circle of Willis datasets: β₀ errors reduced from 69.00 to 3.40 on airway dataset and from 1.65 to 0.30 on CoW dataset. Tree length detected and branch detected rates improved by nearly 10%.
Conclusion: TopoSculpt effectively corrects critical topological errors and advances high-fidelity modeling of complex 3D tubular anatomy, demonstrating significant improvements in both geometric accuracy and topological correctness compared to existing methods.
Abstract: Medical tubular anatomical structures are inherently three-dimensional conduits with lumens, enclosing walls, and complex branching topologies. Accurate reconstruction of their geometry and topology is crucial for applications such as bronchoscopic navigation and cerebral arterial connectivity assessment. Existing methods often rely on voxel-wise overlap measures, which fail to capture topological correctness and completeness. Although topology-aware losses and persistent homology constraints have shown promise, they are usually applied patch-wise and cannot guarantee global preservation or correct geometric errors at inference. To address these limitations, we propose TopoSculpt, a novel framework for topological refinement of 3D fine-grained tubular structures. TopoSculpt (i) adopts a holistic whole-region modeling strategy to capture full spatial context, (ii) introduces a Topological Integrity Betti (TIB) constraint that jointly enforces Betti number priors and global integrity, and (iii) employs a curriculum refinement scheme with persistent homology to progressively correct errors from coarse to fine scales. Extensive experiments on challenging pulmonary airway and Circle of Willis datasets demonstrate substantial improvements in both geometry and topology. For instance, β₀ errors are reduced from 69.00 to 3.40 on the airway dataset and from 1.65 to 0.30 on the CoW dataset, with tree length detected and branch detected rates improving by nearly 10%. These results highlight the effectiveness of TopoSculpt in correcting critical topological errors and advancing the high-fidelity modeling of complex 3D tubular anatomy. The project homepage is available at: https://github.com/Puzzled-Hui/TopoSculpt.
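Since the headline metric here is the β₀ (connected-component) error, the following is a minimal sketch of how such a number can be computed for binary 3D masks with SciPy; the 26-connectivity convention is an assumption, not taken from the paper.

```python
import numpy as np
from scipy import ndimage

def betti0(volume: np.ndarray) -> int:
    """beta_0 of a binary 3D mask = number of connected components."""
    structure = np.ones((3, 3, 3), dtype=int)  # 26-connected neighbourhood (assumed)
    _, n_components = ndimage.label(volume > 0, structure=structure)
    return n_components

def betti0_error(pred: np.ndarray, gt: np.ndarray) -> int:
    """Absolute beta_0 error between a prediction and the reference,
    the kind of quantity reported above (e.g. 69.00 -> 3.40 on airways)."""
    return abs(betti0(pred) - betti0(gt))
```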
[119] Chest X-ray Pneumothorax Segmentation Using EfficientNet-B4 Transfer Learning in a U-Net Architecture
Alvaro Aranibar Roque, Helga Sebastian
Main category: cs.CV
TL;DR: Deep learning U-Net with EfficientNet-B4 encoder achieves accurate pneumothorax segmentation on chest X-rays with IoU 0.7008 and Dice score 0.8241.
Details
Motivation: Pneumothorax can be life-threatening if undetected, and small cases may be subtle on chest X-rays, requiring automated detection to support radiologists.Method: U-Net architecture with EfficientNet-B4 encoder, trained on SIIM-ACR dataset with data augmentation and combined binary cross-entropy plus Dice loss.
Result: Achieved IoU of 0.7008 and Dice score of 0.8241 on independent PTX-498 dataset, demonstrating accurate localization of pneumothorax regions.
Conclusion: The proposed deep learning pipeline can accurately segment pneumothorax regions and has potential to support radiologists in clinical diagnosis.
Abstract: Pneumothorax, the abnormal accumulation of air in the pleural space, can be life-threatening if undetected. Chest X-rays are the first-line diagnostic tool, but small cases may be subtle. We propose an automated deep-learning pipeline using a U-Net with an EfficientNet-B4 encoder to segment pneumothorax regions. Trained on the SIIM-ACR dataset with data augmentation and a combined binary cross-entropy plus Dice loss, the model achieved an IoU of 0.7008 and Dice score of 0.8241 on the independent PTX-498 dataset. These results demonstrate that the model can accurately localize pneumothoraces and support radiologists.
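For readers who want the loss in code: a minimal PyTorch sketch of the combined binary cross-entropy plus Dice objective described above. The 1:1 term weighting and the smoothing constant are assumptions, and the commented lines show one off-the-shelf way (not necessarily the authors') to build a U-Net with an EfficientNet-B4 encoder.

```python
import torch
import torch.nn as nn

# One off-the-shelf way to build the architecture:
# import segmentation_models_pytorch as smp
# model = smp.Unet(encoder_name="efficientnet-b4", encoder_weights="imagenet",
#                  in_channels=1, classes=1)

class BCEDiceLoss(nn.Module):
    """Binary cross-entropy plus soft Dice loss; equal weighting is assumed."""
    def __init__(self, smooth: float = 1.0):
        super().__init__()
        self.bce = nn.BCEWithLogitsLoss()
        self.smooth = smooth

    def forward(self, logits: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
        bce = self.bce(logits, target)
        probs = torch.sigmoid(logits)
        # Soft Dice over the whole batch; per-image averaging is also common.
        intersection = (probs * target).sum()
        dice = (2 * intersection + self.smooth) / (
            probs.sum() + target.sum() + self.smooth)
        return bce + (1 - dice)
```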
[120] ANTS: Shaping the Adaptive Negative Textual Space by MLLM for OOD Detection
Zhu Wenjie, Zhang Yabin, Xin Jin, Wenjun Zeng, Lei Zhang
Main category: cs.CV
TL;DR: Proposes ANTS method using MLLMs to create adaptive negative textual spaces for improved OOD detection, reducing FPR95 by 4.2% on ImageNet benchmark.
Details
Motivation: Existing OOD detection methods lack understanding of OOD images and suffer from false negative labels, particularly degrading near-OOD performance.Method: Leverages MLLMs to generate expressive negative sentences from identified OOD samples for far-OOD, and creates visually similar negative labels for near-OOD subsets. Uses adaptive weighted score to balance both types without task-specific knowledge.
Result: Significantly reduces FPR95 by 4.2% on ImageNet benchmark, establishing new state-of-the-art performance.
Conclusion: ANTS provides training-free, zero-shot solution that effectively handles both near and far-OOD detection with high adaptability and scalability in open environments.
Abstract: The introduction of negative labels (NLs) has proven effective in enhancing Out-of-Distribution (OOD) detection. However, existing methods often lack an understanding of OOD images, making it difficult to construct an accurate negative space. In addition, the presence of false negative labels significantly degrades their near-OOD performance. To address these issues, we propose shaping an Adaptive Negative Textual Space (ANTS) by leveraging the understanding and reasoning capabilities of multimodal large language models (MLLMs). Specifically, we identify images likely to be OOD samples as negative images and prompt the MLLM to describe these images, generating expressive negative sentences that precisely characterize the OOD distribution and enhance far-OOD detection. For the near-OOD setting, where OOD samples resemble the in-distribution (ID) subset, we first identify the subset of ID classes that are visually similar to negative images and then leverage the reasoning capability of MLLMs to generate visually similar negative labels tailored to this subset, effectively reducing false negatives and improving near-OOD detection. To balance these two types of negative textual spaces, we design an adaptive weighted score that enables the method to handle different OOD task settings (near-OOD and far-OOD) without relying on task-specific prior knowledge, making it highly adaptable in open environments. On the ImageNet benchmark, our ANTS significantly reduces the FPR95 by 4.2%, establishing a new state-of-the-art. Furthermore, our method is training-free and zero-shot, enabling high scalability.
[121] Multimodal Feature Fusion Network with Text Difference Enhancement for Remote Sensing Change Detection
Yijun Zhou, Yikui Zhai, Zilu Ying, Tingfeng Xian, Wenlve Zhou, Zhiheng Zhou, Xiaolin Tian, Xudong Jia, Hongsheng Zhang, C. L. Philip Chen
Main category: cs.CV
TL;DR: MMChange is a multimodal remote sensing change detection method that combines image and text modalities to improve accuracy and robustness by addressing limitations of image-only approaches.
Details
Motivation: Most existing remote sensing change detection methods rely solely on image modality, which limits feature representation, change pattern modeling, and generalization, especially under illumination and noise disturbances.Method: Proposes MMChange with three key modules: Image Feature Refinement (IFR) to highlight key regions and suppress noise, Textual Difference Enhancement (TDE) using vision language model to capture semantic shifts from generated descriptions, and Image-Text Feature Fusion (ITFF) to bridge modality heterogeneity through deep cross-modal integration.
Result: Extensive experiments on LEVIRCD, WHUCD, and SYSUCD datasets demonstrate that MMChange consistently surpasses state-of-the-art methods across multiple metrics.
Conclusion: MMChange validates the effectiveness of multimodal approach for remote sensing change detection, achieving superior performance through the integration of image and text modalities.
Abstract: Although deep learning has advanced remote sensing change detection (RSCD), most methods rely solely on the image modality, limiting feature representation, change pattern modeling, and generalization, especially under illumination and noise disturbances. To address this, we propose MMChange, a multimodal RSCD method that combines image and text modalities to enhance accuracy and robustness. An Image Feature Refinement (IFR) module is introduced to highlight key regions and suppress environmental noise. To overcome the semantic limitations of image features, we employ a vision language model (VLM) to generate semantic descriptions of bitemporal images. A Textual Difference Enhancement (TDE) module then captures fine-grained semantic shifts, guiding the model toward meaningful changes. To bridge the heterogeneity between modalities, we design an Image-Text Feature Fusion (ITFF) module that enables deep cross-modal integration. Extensive experiments on LEVIRCD, WHUCD, and SYSUCD demonstrate that MMChange consistently surpasses state-of-the-art methods across multiple metrics, validating its effectiveness for multimodal RSCD. Code is available at: https://github.com/yikuizhai/MMChange.
[122] SAC-MIL: Spatial-Aware Correlated Multiple Instance Learning for Histopathology Whole Slide Image Classification
Yu Bai, Zitong Yu, Haowen Tian, Xijing Wang, Shuo Yan, Lin Wang, Honglin Li, Xitong Ling, Bo Zhang, Zheng Zhang, Wufan Wang, Hui Gao, Xiangyang Gong, Wendong Wang
Main category: cs.CV
TL;DR: SAC-MIL is a novel WSI classification method that uses spatial-aware positional encoding and MLP-based full instance correlation to achieve state-of-the-art performance with linear time complexity.
Details
Motivation: To address the limitations of existing WSI classification methods, particularly the need for better spatial relationship encoding and efficient full instance correlations without requiring complex custom implementations like CUDA kernels.Method: Uses a positional encoding module that encodes spatial coordinates instead of sequence indices, and an SAC block (MLP-based) that performs full instance correlations in linear time complexity.
Result: Achieved state-of-the-art performance on CAMELYON-16, TCGA-LUNG, and TCGA-BRAC datasets.
Conclusion: SAC-MIL provides an efficient and effective solution for WSI classification with better spatial awareness and computational efficiency compared to Transformer-based methods.
Abstract: We propose Spatial-Aware Correlated Multiple Instance Learning (SAC-MIL) for performing WSI classification. SAC-MIL consists of a positional encoding module to encode position information and a SAC block to perform full instance correlations. The positional encoding module utilizes the instance coordinates within the slide to encode the spatial relationships instead of the instance index in the input WSI sequence. The positional encoding module can also handle the length extrapolation issue where the training and testing sequences have different lengths. The SAC block is an MLP-based method that performs full instance correlation in linear time complexity with respect to the sequence length. Due to the simple structure of MLP, it is easy to deploy since it does not require custom CUDA kernels, compared to Transformer-based methods for WSI classification. SAC-MIL has achieved state-of-the-art performance on the CAMELYON-16, TCGA-LUNG, and TCGA-BRAC datasets. The code will be released upon acceptance.
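The coordinate-based positional encoding is the part that gives length extrapolation: the encoding depends only on where a patch sits in the slide, not on its index in the sequence. Below is a generic sinusoidal stand-in; the paper's exact formulation is not given here, so the frequencies and dimensionality are assumptions.

```python
import torch

def coord_positional_encoding(coords: torch.Tensor, dim: int = 128) -> torch.Tensor:
    """Sinusoidal encoding of 2-D patch coordinates within a slide.

    coords: (N, 2) tensor of (x, y) patch positions; returns (N, dim), dim % 4 == 0.
    Because the code never looks at the sequence index, train and test
    sequences of different lengths pose no problem.
    """
    half = dim // 4  # dim/4 frequencies per axis, sin and cos for each
    freqs = torch.exp(-torch.arange(half).float()
                      * (torch.log(torch.tensor(10000.0)) / half))
    parts = []
    for axis in range(2):
        angles = coords[:, axis:axis + 1].float() * freqs  # (N, half)
        parts += [torch.sin(angles), torch.cos(angles)]
    return torch.cat(parts, dim=-1)
```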
[123] Improving Vessel Segmentation with Multi-Task Learning and Auxiliary Data Available Only During Model Training
Daniel Sobotka, Alexander Herold, Matthias Perkonigg, Lucian Beer, Nina Bastati, Alina Sablatnig, Ahmed Ba-Ssalamah, Georg Langs
Main category: cs.CV
TL;DR: Multi-task learning framework for liver vessel segmentation in non-contrast MRI using auxiliary contrast-enhanced data during training to reduce annotation requirements.
Details
Motivation: Liver vessel segmentation is crucial for analyzing vascular changes in liver diseases, but existing methods require contrast-enhanced imaging which is not always available. Non-contrast MRI is more common but challenging to segment without large annotated datasets.Method: Proposes a multi-task learning framework that leverages paired native and contrast-enhanced MRI data (with and without vessel annotations) during training. The approach uses auxiliary contrast-enhanced data to improve feature representation through shared task structure.
Result: Auxiliary contrast-enhanced data improves vessel segmentation accuracy even when not available during inference. Benefits are most significant when few annotations are available for training. Validation on brain tumor segmentation confirms cross-domain applicability.
Conclusion: An auxiliary informative imaging modality can effectively augment expert annotations and improve segmentation performance, even when only available during training rather than inference.
Abstract: Liver vessel segmentation in magnetic resonance imaging data is important for the computational analysis of vascular remodelling, associated with a wide spectrum of diffuse liver diseases. Existing approaches rely on contrast enhanced imaging data, but the necessary dedicated imaging sequences are not uniformly acquired. Images without contrast enhancement are acquired more frequently, but vessel segmentation is challenging, and requires large-scale annotated data. We propose a multi-task learning framework to segment vessels in liver MRI without contrast. It exploits auxiliary contrast enhanced MRI data available only during training to reduce the need for annotated training examples. Our approach draws on paired native and contrast enhanced data with and without vessel annotations for model training. Results show that auxiliary data improves the accuracy of vessel segmentation, even if they are not available during inference. The advantage is most pronounced if only few annotations are available for training, since the feature representation benefits from the shared task structure. A validation of this approach to augment a model for brain tumor segmentation confirms its benefits across different domains. An auxiliary informative imaging modality can augment expert annotations even if it is only available during training.
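The training-only auxiliary-task idea reduces to a shared encoder with two heads, where the second head is supervised by the contrast-enhanced scan and simply ignored at inference. A minimal sketch follows; the module names, loss choices, and the 0.5 weight are assumptions, not the authors' settings.

```python
import torch.nn as nn
import torch.nn.functional as F

class MultiTaskVesselNet(nn.Module):
    """Shared encoder; a segmentation head plus an auxiliary head that
    predicts the contrast-enhanced image (used only during training)."""
    def __init__(self, encoder: nn.Module, seg_head: nn.Module, aux_head: nn.Module):
        super().__init__()
        self.encoder, self.seg_head, self.aux_head = encoder, seg_head, aux_head

    def forward(self, native):
        feats = self.encoder(native)
        return self.seg_head(feats), self.aux_head(feats)

def training_loss(model, native, vessel_mask, contrast_img, aux_weight=0.5):
    seg_logits, aux_pred = model(native)
    loss = F.binary_cross_entropy_with_logits(seg_logits, vessel_mask)
    # Auxiliary supervision from the contrast-enhanced MRI; at inference
    # only the segmentation head is run, so contrast data is never needed.
    return loss + aux_weight * F.l1_loss(aux_pred, contrast_img)
```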
[124] Promptception: How Sensitive Are Large Multimodal Models to Prompts?
Mohamed Insaf Ismithdeen, Muhammad Uzair Khattak, Salman Khan
Main category: cs.CV
TL;DR: Prompt design significantly affects LMM performance in MCQA, with up to 15% accuracy variations from minor prompt changes. Promptception framework evaluates 10 LMMs across 61 prompt types, revealing proprietary models are more sensitive to phrasing while open-source models struggle with complexity.
Details
Motivation: Address the lack of understanding about prompt design for LMMs in MCQA and the challenge of transparent evaluation due to performance variability from prompt phrasing differences.Method: Developed Promptception framework with 61 prompt types across 15 categories and 6 supercategories to systematically evaluate prompt sensitivity in 10 LMMs across 3 MCQA benchmarks (MMStar, MMMU-Pro, MVBench).
Result: Proprietary models show greater sensitivity to prompt phrasing (indicating better instruction alignment), while open-source models are steadier but struggle with nuanced and complex phrasing.
Conclusion: Proposed tailored Prompting Principles for both proprietary and open-source LMMs to enable more robust and fair model evaluation in MCQA tasks.
Abstract: Despite the success of Large Multimodal Models (LMMs) in recent years, prompt design for LMMs in Multiple-Choice Question Answering (MCQA) remains poorly understood. We show that even minor variations in prompt phrasing and structure can lead to accuracy deviations of up to 15% for certain prompts and models. This variability poses a challenge for transparent and fair LMM evaluation, as models often report their best-case performance using carefully selected prompts. To address this, we introduce Promptception, a systematic framework for evaluating prompt sensitivity in LMMs. It consists of 61 prompt types, spanning 15 categories and 6 supercategories, each targeting specific aspects of prompt formulation, and is used to evaluate 10 LMMs ranging from lightweight open-source models to GPT-4o and Gemini 1.5 Pro, across 3 MCQA benchmarks: MMStar, MMMU-Pro, MVBench. Our findings reveal that proprietary models exhibit greater sensitivity to prompt phrasing, reflecting tighter alignment with instruction semantics, while open-source models are steadier but struggle with nuanced and complex phrasing. Based on this analysis, we propose Prompting Principles tailored to proprietary and open-source LMMs, enabling more robust and fair model evaluation.
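The core measurement behind such a study is simple to state: score the same model on the same items under many prompt templates and look at the accuracy spread. A sketch of that loop is below; the `model(prompt, image)` interface returning an option letter is hypothetical, and the benchmark's 61 templates are not reproduced.

```python
def prompt_sensitivity(model, dataset, prompt_templates):
    """Per-template accuracy and max-min spread over a set of MCQA items.
    `dataset` yields (image, question, options, answer) tuples."""
    accuracies = {}
    for template in prompt_templates:
        correct = 0
        for image, question, options, answer in dataset:
            prompt = template.format(question=question, options=options)
            correct += int(model(prompt, image) == answer)
        accuracies[template] = correct / len(dataset)
    spread = max(accuracies.values()) - min(accuracies.values())
    return accuracies, spread  # spreads of up to ~15 points are reported above
```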
[125] SliceSemOcc: Vertical Slice Based Multimodal 3D Semantic Occupancy Representation
Han Huang, Han Sun, Ningzhong Liu, Huiyu Zhou, Jiaquan Shen
Main category: cs.CV
TL;DR: SliceSemOcc - a novel vertical slice based multimodal framework for 3D semantic occupancy prediction that addresses height-axis information neglect in existing methods through global/local vertical slices and SEAttention3D module.
Details
Motivation: Existing 3D semantic occupancy prediction methods overlook height-axis information when processing voxel features, and conventional channel attention assigns uniform weight across all height layers, limiting their ability to emphasize features at different heights.Method: Extracts voxel features along the height-axis using both global and local vertical slices, employs a global-local fusion module to reconcile spatial details with contextual information, and introduces the SEAttention3D module, which preserves height-wise resolution through average pooling and assigns dynamic channel attention weights to each height layer.
Result: Extensive experiments on nuScenes-SurroundOcc and nuScenes-OpenOccupancy datasets show significant mean IoU improvements, with especially pronounced gains on most small-object categories.
Conclusion: The proposed SliceSemOcc framework effectively addresses height-axis information processing limitations in 3D semantic occupancy prediction and demonstrates superior performance through comprehensive validation.
Abstract: Driven by autonomous driving’s demands for precise 3D perception, 3D semantic occupancy prediction has become a pivotal research topic. Unlike bird’s-eye-view (BEV) methods, which restrict scene representation to a 2D plane, occupancy prediction leverages a complete 3D voxel grid to model spatial structures in all dimensions, thereby capturing semantic variations along the vertical axis. However, most existing approaches overlook height-axis information when processing voxel features, and conventional SENet-style channel attention assigns uniform weight across all height layers, limiting their ability to emphasize features at different heights. To address these limitations, we propose SliceSemOcc, a novel vertical slice based multimodal framework for 3D semantic occupancy representation. Specifically, we extract voxel features along the height-axis using both global and local vertical slices. Then, a global-local fusion module adaptively reconciles fine-grained spatial details with holistic contextual information. Furthermore, we propose the SEAttention3D module, which preserves height-wise resolution through average pooling and assigns dynamic channel attention weights to each height layer. Extensive experiments on nuScenes-SurroundOcc and nuScenes-OpenOccupancy datasets verify that our method significantly enhances mean IoU, achieving especially pronounced gains on most small-object categories. Detailed ablation studies further validate the effectiveness of the proposed SliceSemOcc framework.
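The height-preserving attention can be pictured as SE attention that pools over the ground plane only, so every height layer keeps its own channel weights. A PyTorch sketch of that idea follows; the tensor layout, reduction ratio, and MLP shape are assumptions, and the paper's SEAttention3D may differ in detail.

```python
import torch
import torch.nn as nn

class HeightAwareSE(nn.Module):
    """SE-style channel attention computed per height layer.
    Input: voxel features of shape (B, C, Z, X, Y), Z being the height axis."""
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, z, _, _ = x.shape
        squeezed = x.mean(dim=(3, 4))                 # (B, C, Z): pool over X, Y only
        weights = self.fc(squeezed.permute(0, 2, 1))  # (B, Z, C): one weight set per height
        weights = weights.permute(0, 2, 1).reshape(b, c, z, 1, 1)
        return x * weights                            # reweight without collapsing height
```

A conventional SE block would pool over (Z, X, Y) jointly and broadcast one weight vector to all heights; keeping Z in the squeeze is what lets features at particular heights, such as small objects near the ground, be emphasized.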
[126] Detecting Regional Spurious Correlations in Vision Transformers via Token Discarding
Solha Kang, Esla Timothy Anzaku, Wesley De Neve, Arnout Van Messem, Joris Vankerschaver, Francois Rameau, Utku Ozbulak
Main category: cs.CV
TL;DR: Novel method to detect spurious correlations in vision transformers, showing training methodology impacts reliance on spurious signals and identifying problematic ImageNet classes.
Details
Motivation: Neural networks can exploit unintended patterns (spurious correlations) in data, leading to correct but unreliable predictions based on coincidental signals rather than genuine features.Method: Proposed detection method for vision transformers using both supervised and self-supervised trained models, with large-scale experiments on ImageNet dataset.
Result: Method successfully identifies spurious correlations; training methodology significantly affects model reliance on spurious signals; certain ImageNet classes contain easily detected spurious signals.
Conclusion: Provides exhaustive list of problematic images for research caution, demonstrates real-world application in breast mass classification, and emphasizes importance of detecting spurious correlations for trustworthy ML models.
Abstract: Due to their powerful feature association capabilities, neural network-based computer vision models have the ability to detect and exploit unintended patterns within the data, potentially leading to correct predictions based on incorrect or unintended but statistically relevant signals. These clues may vary from simple color aberrations to small text within the image. In situations where these unintended signals align with the predictive task, models can mistakenly link these features with the task and rely on them for making predictions. This phenomenon is referred to as spurious correlations, where patterns appear to be associated with the task but are actually coincidental. As a result, detection and mitigation of spurious correlations have become crucial tasks for building trustworthy, reliable, and generalizable machine learning models. In this work, we present a novel method to detect spurious correlations in vision transformers, a type of neural network architecture that has gained significant popularity in recent years. Using both supervised and self-supervised trained models, we present large-scale experiments on the ImageNet dataset demonstrating the ability of the proposed method to identify spurious correlations. We also find that, even if the same architecture is used, the training methodology has a significant impact on the model’s reliance on spurious correlations. Furthermore, we show that certain classes in the ImageNet dataset contain spurious signals that are easily detected by the models and discuss the underlying reasons for those spurious signals. In light of our findings, we provide an exhaustive list of the aforementioned images and call for caution in their use in future research efforts. Lastly, we present a case study investigating spurious signals in invasive breast mass classification, grounding our work in real-world scenarios.
[127] Learning from Majority Label: A Novel Problem in Multi-class Multiple-Instance Learning
Shiku Kaito, Shinnosuke Matsuo, Daiki Suehiro, Ryoma Bise
Main category: cs.CV
TL;DR: A novel multi-class Multiple-Instance Learning problem called Learning from Majority Label (LML) where bag labels are determined by majority class of instances, with a Counting Network and Majority Proportion Enhancement Module to improve performance.
Details
Motivation: To address real-world applications where only majority class labels are available for bags of instances, such as pathology image segmentation, political voting prediction, customer sentiment analysis, and environmental monitoring.Method: Proposes a Counting Network trained to produce bag-level majority labels by counting instances per class, and a Majority Proportion Enhancement Module (MPEM) that removes minority class instances to increase majority class proportion.
Result: Superior performance demonstrated on four datasets compared to conventional MIL methods, with ablation studies confirming effectiveness of each module.
Conclusion: The LML framework effectively solves the novel majority-label learning problem and provides practical solutions for various applications where majority-based labeling is natural.
Abstract: The paper proposes a novel multi-class Multiple-Instance Learning (MIL) problem called Learning from Majority Label (LML). In LML, the majority class of instances in a bag is assigned as the bag-level label. The goal of LML is to train a classification model that estimates the class of each instance using the majority label. This problem is valuable in a variety of applications, including pathology image segmentation, political voting prediction, customer sentiment analysis, and environmental monitoring. To solve LML, we propose a Counting Network trained to produce bag-level majority labels, estimated by counting the number of instances in each class. Furthermore, analysis experiments on the characteristics of LML revealed that bags with a high proportion of the majority class facilitate learning. Based on this result, we developed a Majority Proportion Enhancement Module (MPEM) that increases the proportion of the majority class by removing minority class instances within the bags. Experiments demonstrate the superiority of the proposed method on four datasets compared to conventional MIL methods. Moreover, ablation studies confirmed the effectiveness of each module. The code is available at https://github.com/Shiku-Kaito/Learning-from-Majority-Label-A-Novel-Problem-in-Multi-class-Multiple-Instance-Learning.
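One way to read the Counting Network is that instance softmaxes are summed into soft class counts, and the normalized counts are trained to put their mass on the bag's majority label. A sketch under that reading follows; the paper's exact loss and counting formulation may differ.

```python
import torch
import torch.nn as nn

class CountingLoss(nn.Module):
    """Bag-level loss for Learning from Majority Label (LML), assuming
    soft counts = sum of instance softmaxes."""
    def forward(self, instance_logits: torch.Tensor, majority_label: torch.Tensor):
        # instance_logits: (B, N, K) for B bags of N instances and K classes
        probs = instance_logits.softmax(dim=-1)
        counts = probs.sum(dim=1)                          # (B, K) soft counts
        bag_log_probs = torch.log(
            counts / counts.sum(dim=1, keepdim=True) + 1e-8)
        return nn.functional.nll_loss(bag_log_probs, majority_label)
```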
[128] Millisecond-Response Tracking and Gazing System for UAVs: A Domestic Solution Based on “Phytium + Cambricon”
Yuchen Zhu, Longxiang Yin, Kai Zhao
Main category: cs.CV
TL;DR: Proposes a heterogeneous computing system using Phytium processors and Cambricon accelerators to achieve millisecond-level response for UAV tracking with 98.5% accuracy.
Details
Motivation: Traditional camera systems have >200ms response delays due to insufficient deep feature extraction and computing bottlenecks, failing to meet real-time requirements in complex dynamic scenarios.Method: Hardware: Phytium FT-2000/4 processors + MLU220 accelerator cards with multi-card parallelism. Software: Lightweight YOLOv5s detection network integrated with DeepSORT cascaded tracking algorithm in a “detection-tracking-feedback” closed-loop control chain.
Result: Achieves 50-100 ms single-frame processing delay at 1920×1080 resolution with over 98.5% multi-scale target recognition accuracy.
Conclusion: Provides an innovative solution for UAV monitoring with both low latency and high precision, demonstrating successful application of domestic chips in real-time surveillance systems.
Abstract: In the frontier research and application of current video surveillance technology, traditional camera systems exhibit significant limitations of response delay exceeding 200 ms in dynamic scenarios due to the insufficient deep feature extraction capability of automatic recognition algorithms and the efficiency bottleneck of computing architectures, failing to meet the real-time requirements in complex scenes. To address this issue, this study proposes a heterogeneous computing architecture based on Phytium processors and Cambricon accelerator cards, constructing a UAV tracking and gazing system with millisecond-level response capability. At the hardware level, the system adopts a collaborative computing architecture of Phytium FT-2000/4 processors and MLU220 accelerator cards, enhancing computing power through multi-card parallelism. At the software level, it innovatively integrates a lightweight YOLOv5s detection network with a DeepSORT cascaded tracking algorithm, forming a closed-loop control chain of “detection-tracking-feedback”. Experimental results demonstrate that the system achieves a stable single-frame comprehensive processing delay of 50-100 ms in 1920×1080 resolution video stream processing, with a multi-scale target recognition accuracy of over 98.5%, featuring both low latency and high precision. This study provides an innovative solution for UAV monitoring and the application of domestic chips.
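The "detection-tracking-feedback" chain is easiest to picture as a per-frame loop with a latency budget. A skeleton is sketched below; all interfaces here (`capture`, `detector`, `tracker`, `gimbal`) are hypothetical stand-ins for the YOLOv5s model, DeepSORT, and the camera's pan-tilt control, not APIs from the paper.

```python
import time

def tracking_loop(capture, detector, tracker, gimbal, budget_ms=100):
    """Closed-loop detect -> track -> steer cycle with a per-frame time budget."""
    while True:
        t0 = time.perf_counter()
        frame = capture.read()                      # e.g. a 1920x1080 frame
        detections = detector(frame)                # [(box, confidence, class), ...]
        tracks = tracker.update(detections, frame)  # IDs persist across frames
        if tracks:
            cx, cy = tracks[0].center()             # follow the primary target
            gimbal.steer_towards(cx, cy)            # feedback: keep it centered
        elapsed_ms = (time.perf_counter() - t0) * 1000
        if elapsed_ms > budget_ms:                  # the paper reports 50-100 ms/frame
            print(f"over budget: {elapsed_ms:.1f} ms")
```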
[129] A Re-ranking Method using K-nearest Weighted Fusion for Person Re-identification
Quang-Huy Che, Le-Chuong Nguyen, Gia-Nghia Tran, Dinh-Duy Phan, Vinh-Tiep Nguyen
Main category: cs.CV
TL;DR: An efficient re-ranking method for person re-identification that uses K-nearest Weighted Fusion to generate multi-view features from neighbors, improving accuracy without requiring model fine-tuning or extra annotations.
Details
Motivation: Previous re-identification methods rely on single-view images which suffer from view bias, pose variation, viewpoint changes, and occlusions. Multi-view features can help reduce these issues and improve re-ranking accuracy.Method: Proposes K-nearest Weighted Fusion (KWF) method to generate multi-view features by aggregating neighbors’ features in an unsupervised manner. Explores weight selection strategies during feature aggregation and works without model fine-tuning.
Result: Significant improvements in Rank@1 and mAP on Market1501, MSMT17, and Occluded-DukeMTMC datasets. Achieves 9.8%/22.0% Rank@1 improvements on MSMT17 and Occluded-DukeMTMC respectively, with enhanced computational efficiency.
Conclusion: The proposed re-ranking method effectively reduces view bias through multi-view feature generation, demonstrates strong performance improvements on challenging datasets, and offers computational efficiency advantages over other re-ranking approaches.
Abstract: In person re-identification, re-ranking is a crucial step to enhance the overall accuracy by refining the initial ranking of retrieved results. Previous studies have mainly focused on features from single-view images, which can cause view bias and issues like pose variation, viewpoint changes, and occlusions. Using multi-view features to represent a person can help reduce view bias. In this work, we present an efficient re-ranking method that generates multi-view features by aggregating neighbors’ features using the K-nearest Weighted Fusion (KWF) method. Specifically, we hypothesize that features extracted from re-identification models are highly similar when representing the same identity. Thus, we select K neighboring features in an unsupervised manner to generate multi-view features. Additionally, this study explores the weight selection strategies during feature aggregation, allowing us to identify an effective strategy. Our re-ranking approach does not require model fine-tuning or extra annotations, making it applicable to large-scale datasets. We evaluate our method on the person re-identification datasets Market1501, MSMT17, and Occluded-DukeMTMC. The results show that our method significantly improves Rank@1 and mAP when re-ranking the top M candidates from the initial ranking results. Specifically, compared to the initial results, our re-ranking method achieves improvements of 9.8%/22.0% in Rank@1 on the challenging datasets MSMT17 and Occluded-DukeMTMC, respectively. Furthermore, our approach demonstrates substantial enhancements in computational efficiency compared to other re-ranking methods.
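A compact NumPy sketch of the K-nearest Weighted Fusion idea: replace each feature with a weighted average of its K nearest neighbours (a "multi-view" feature), then re-score. Similarity-proportional weights and the blending factor `alpha` are illustrative choices; the paper compares several weighting strategies.

```python
import numpy as np

def kwf_rerank(query_feat, gallery_feats, k=6, alpha=0.9):
    """Re-rank a gallery by fusing each feature with its K nearest neighbours."""
    feats = np.vstack([query_feat[None], gallery_feats])
    feats = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    sims = feats @ feats.T
    fused = np.empty_like(feats)
    for i in range(len(feats)):
        nn_idx = np.argsort(-sims[i])[:k]        # K nearest (self included)
        w = sims[i, nn_idx]
        w = w / w.sum()                          # similarity-proportional weights
        fused[i] = alpha * feats[i] + (1 - alpha) * (w @ feats[nn_idx])
    scores = fused[0] @ fused[1:].T              # re-scored query-gallery similarities
    return np.argsort(-scores)                   # new ranking of gallery indices
```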
[130] TriLiteNet: Lightweight Model for Multi-Task Visual Perception
Quang-Huy Che, Duc-Khai Lam
Main category: cs.CV
TL;DR: TriLiteNet is an efficient multi-task perception model for ADAS that simultaneously handles vehicle detection, drivable area segmentation, and lane line segmentation with low computational costs and competitive performance.
Details
Motivation: Advanced Driver Assistance Systems require rapid processing and response for safety in real-world environments, necessitating efficient perception models that can handle multiple tasks with low computational demands.Method: The study introduces TriLiteNet model designed to optimize performance while maintaining low computational costs. It includes both base (2.35M parameters) and tiny (0.14M parameters) configurations for multi-task panoramic driving perception.
Result: On BDD100k dataset, TriLiteNet_base achieved 85.6% recall for vehicle detection, 92.4% mIoU for drivable area segmentation, and 82.3% accuracy for lane line segmentation with only 7.72 GFLOPs. Both configurations showed low latency and reasonable power consumption on embedded devices.
Conclusion: TriLiteNet provides a practical and deployable solution for real-world autonomous driving applications by balancing performance, computational efficiency, and scalability across multiple perception tasks.
Abstract: Efficient perception models are essential for Advanced Driver Assistance Systems (ADAS), as these applications require rapid processing and response to ensure safety and effectiveness in real-world environments. To address the real-time execution needs of such perception models, this study introduces the TriLiteNet model. This model can simultaneously manage multiple tasks related to panoramic driving perception. TriLiteNet is designed to optimize performance while maintaining low computational costs. Experimental results on the BDD100k dataset demonstrate that the model achieves competitive performance across three key tasks: vehicle detection, drivable area segmentation, and lane line segmentation. Specifically, TriLiteNet_base demonstrated a recall of 85.6% for vehicle detection, a mean Intersection over Union (mIoU) of 92.4% for drivable area segmentation, and an accuracy of 82.3% for lane line segmentation with only 2.35M parameters and a computational cost of 7.72 GFLOPs. Our proposed model includes a tiny configuration with just 0.14M parameters, which provides a multi-task solution with minimal computational demand. Evaluated for latency and power consumption on embedded devices, TriLiteNet in both configurations shows low latency and reasonable power during inference. By balancing performance, computational efficiency, and scalability, TriLiteNet offers a practical and deployable solution for real-world autonomous driving applications. Code is available at https://github.com/chequanghuy/TriLiteNet.
[131] DVS-PedX: Synthetic-and-Real Event-Based Pedestrian Dataset
Mustafa Sakhai, Kaung Sithu, Min Khant Soe Oke, Maciej Wielgosz
Main category: cs.CV
TL;DR: DVS-PedX is a neuromorphic dataset for pedestrian detection and crossing-intention analysis using event cameras, featuring both synthetic and real-world event streams with paired RGB frames and labels.
Details
Motivation: To advance research in event-based pedestrian safety and intention prediction by providing a comprehensive dataset that works with both synthetic and real-world event camera data, addressing the need for robust pedestrian detection in various weather conditions.Method: Created dataset with two complementary sources: (1) synthetic event streams from CARLA simulator with controlled scenes under varied weather/lighting, and (2) real-world JAAD dash-cam videos converted to event streams using v2e tool. Includes paired RGB frames, DVS event frames (33ms accumulations), and frame-level crossing labels.
Result: Dataset provides raw AEDAT event files, AVI DVS videos, and metadata for flexible processing. Baseline SNNs using SpikingJelly demonstrate dataset usability and reveal a sim-to-real gap, highlighting the need for domain adaptation and multimodal fusion approaches.
Conclusion: DVS-PedX serves as a valuable resource to accelerate research in event-based pedestrian safety, intention prediction, and neuromorphic perception, particularly for developing robust systems that work across simulation and real-world domains.
Abstract: Event cameras like Dynamic Vision Sensors (DVS) report micro-timed brightness changes instead of full frames, offering low latency, high dynamic range, and motion robustness. DVS-PedX (Dynamic Vision Sensor Pedestrian eXploration) is a neuromorphic dataset designed for pedestrian detection and crossing-intention analysis in normal and adverse weather conditions across two complementary sources: (1) synthetic event streams generated in the CARLA simulator for controlled “approach-cross” scenes under varied weather and lighting; and (2) real-world JAAD dash-cam videos converted to event streams using the v2e tool, preserving natural behaviors and backgrounds. Each sequence includes paired RGB frames, per-frame DVS “event frames” (33 ms accumulations), and frame-level labels (crossing vs. not crossing). We also provide raw AEDAT 2.0/AEDAT 4.0 event files, AVI DVS video files, and metadata for flexible re-processing. Baseline spiking neural networks (SNNs) using SpikingJelly illustrate dataset usability and reveal a sim-to-real gap, motivating domain adaptation and multimodal fusion. DVS-PedX aims to accelerate research in event-based pedestrian safety, intention prediction, and neuromorphic perception.
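The 33 ms "event frames" are fixed-interval accumulations of the raw event stream. A sketch of that binning is below; the (t, x, y, polarity) layout and the sensor resolution are assumptions, and the dataset ships its own pre-computed frames.

```python
import numpy as np

def events_to_frames(events, duration_s, height=260, width=346, dt_ms=33):
    """Accumulate DVS events into signed per-pixel frames every dt_ms.
    events: (N, 4) array of (t_seconds, x, y, polarity in {-1, +1})."""
    n_frames = int(np.ceil(duration_s / (dt_ms / 1000.0)))
    frames = np.zeros((n_frames, height, width), dtype=np.int16)
    idx = (events[:, 0] / (dt_ms / 1000.0)).astype(int).clip(0, n_frames - 1)
    for (t, x, y, p), f in zip(events, idx):
        frames[f, int(y), int(x)] += int(p)      # signed accumulation per pixel
    return frames
```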
[132] TaleDiffusion: Multi-Character Story Generation with Dialogue Rendering
Ayan Banerjee, Josep Lladós, Umapada Pal, Anjan Dutta
Main category: cs.CV
TL;DR: TaleDiffusion is a novel framework for text-to-story visualization that maintains character consistency across frames and accurately renders dialogues through iterative processing and attention mechanisms.
Details
Motivation: Existing methods struggle with character consistency, artifact generation, and inaccurate dialogue rendering, leading to disjointed storytelling in multi-character visual narratives.Method: Uses pre-trained LLM for frame descriptions, character details, and dialogues via in-context learning. Implements bounded attention-based per-box mask technique, identity-consistent self-attention for character consistency, region-aware cross-attention for object placement, and CLIPSeg for dialogue bubble assignment.
Result: Experimental results show TaleDiffusion outperforms existing methods in consistency, noise reduction, and dialogue rendering quality.
Conclusion: The proposed framework successfully addresses key challenges in text-to-story visualization by maintaining character consistency and accurate dialogue assignment through innovative attention mechanisms and postprocessing techniques.
Abstract: Text-to-story visualization is challenging due to the need for consistent interaction among multiple characters across frames. Existing methods struggle with character consistency, leading to artifact generation and inaccurate dialogue rendering, which results in disjointed storytelling. In response, we introduce TaleDiffusion, a novel framework for generating multi-character stories with an iterative process, maintaining character consistency, and accurate dialogue assignment via postprocessing. Given a story, we use a pre-trained LLM to generate per-frame descriptions, character details, and dialogues via in-context learning, followed by a bounded attention-based per-box mask technique to control character interactions and minimize artifacts. We then apply an identity-consistent self-attention mechanism to ensure character consistency across frames and region-aware cross-attention for precise object placement. Dialogues are also rendered as bubbles and assigned to characters via CLIPSeg. Experimental results demonstrate that TaleDiffusion outperforms existing methods in consistency, noise reduction, and dialogue rendering.
[133] MEPG: Multi-Expert Planning and Generation for Compositionally-Rich Image Generation
Yuan Zhao, Liu Lin
Main category: cs.CV
TL;DR: MEPG framework uses LLMs to decompose prompts into spatial coordinates and style instructions, then employs specialized expert models for cross-region generation with attention-based gating, achieving superior image quality and style diversity.
Details
Motivation: Text-to-image diffusion models struggle with complex multi-element prompts and limited stylistic diversity, requiring a more sophisticated approach to handle spatial relationships and style variations.Method: Two-component framework: 1) Position-Style-Aware module with fine-tuned LLM for prompt decomposition, 2) Multi-Expert Diffusion module with dynamic expert routing and attention-based gating for specialized regional generation.
Result: MEPG significantly outperforms baseline models with the same backbone in both image quality and style diversity.
Conclusion: The proposed framework successfully addresses limitations of current diffusion models by integrating position-style aware planning with multi-expert generation, offering strong extensibility and real-time editing capabilities.
Abstract: Text-to-image diffusion models have achieved remarkable image quality, but they still struggle with complex, multi-element prompts and limited stylistic diversity. To address these limitations, we propose a Multi-Expert Planning and Generation Framework (MEPG) that synergistically integrates position- and style-aware large language models (LLMs) with spatial-semantic expert modules. The framework comprises two core components: (1) a Position-Style-Aware (PSA) module that utilizes a supervised fine-tuned LLM to decompose input prompts into precise spatial coordinates and style-encoded semantic instructions; and (2) a Multi-Expert Diffusion (MED) module that implements cross-region generation through dynamic expert routing across both local regions and global areas. During the generation process for each local region, specialized models (e.g., realism experts, stylization specialists) are selectively activated for each spatial partition via attention-based gating mechanisms. The architecture supports lightweight integration and replacement of expert models, providing strong extensibility. Additionally, an interactive interface enables real-time spatial layout editing and per-region style selection from a portfolio of experts. Experiments show that MEPG significantly outperforms baseline models with the same backbone in both image quality and style diversity.
[134] Revisiting Simple Baselines for In-The-Wild Deepfake Detection
Orlando Castaneda, Kevin So-Tang, Kshitij Gurung
Main category: cs.CV
TL;DR: Simple hyperparameter tuning of existing deepfake detection models achieves 81% accuracy on Deepfake-Eval-2024 benchmark, matching commercial detectors’ performance.
Details
Motivation: Existing deepfake detectors underperform on real-world benchmarks compared to commercial solutions, with reported accuracies of only 61-69% on the challenging Deepfake-Eval-2024 dataset.Method: Revisited and improved Ojha et al.’s approach by applying better hyperparameter tuning to standard pretrained vision backbones for deepfake detection.
Result: Achieved 81% accuracy on Deepfake-Eval-2024, an 18% improvement over previous baseline results, making it competitive with leading commercial detectors (82% accuracy).
Conclusion: Proper hyperparameter tuning can significantly boost performance of simple deepfake detection approaches, making them practical for real-world deployment with good accuracy-computational cost tradeoffs.
Abstract: The widespread adoption of synthetic media demands accessible deepfake detectors and realistic benchmarks. While most existing research evaluates deepfake detectors on highly controlled datasets, we focus on the recently released “in-the-wild” benchmark, Deepfake-Eval-2024. Initial reporting on Deepfake-Eval-2024 showed that three finetuned open-source models achieve accuracies between 61% and 69%, significantly lagging behind the leading commercial deepfake detector with 82% accuracy. Our work revisits one of these baseline approaches, originally introduced by Ojha et al., which adapts standard pretrained vision backbones to produce generalizable deepfake detectors. We demonstrate that with better-tuned hyperparameters, this simple approach actually yields much higher performance – 81% accuracy on Deepfake-Eval-2024 – surpassing the previously reported accuracy of this baseline approach by 18% and competing with commercial deepfake detectors. We discuss tradeoffs in accuracy, computational costs, and interpretability, focusing on how practical these deepfake detectors might be when deployed in real-world settings. Our code can be found at https://github.com/Deepfake-Detection-KKO/deepfake-detection.
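Ojha et al.'s recipe amounts to a frozen pretrained backbone with a small trainable head, so the gains reported above come from tuning choices (learning rate, schedule, augmentation) rather than a new architecture. A hedged sketch of that setup; `backbone.feature_dim` and the hyperparameter values are placeholders, not the authors' settings.

```python
import torch
import torch.nn as nn

def finetune_head(backbone, train_loader, lr=1e-3, epochs=5, device="cuda"):
    """Train a linear real/fake head on frozen pretrained features."""
    backbone.eval().requires_grad_(False)        # keep pretrained features fixed
    head = nn.Linear(backbone.feature_dim, 1).to(device)
    opt = torch.optim.AdamW(head.parameters(), lr=lr, weight_decay=1e-4)
    for _ in range(epochs):
        for images, labels in train_loader:      # labels: 1 = fake, 0 = real
            with torch.no_grad():
                feats = backbone(images.to(device))
            loss = nn.functional.binary_cross_entropy_with_logits(
                head(feats).squeeze(1), labels.float().to(device))
            opt.zero_grad()
            loss.backward()
            opt.step()
    return head
```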
[135] YOLO Ensemble for UAV-based Multispectral Defect Detection in Wind Turbine Components
Serhii Svystun, Pavlo Radiuk, Oleksandr Melnychenko, Oleg Savenko, Anatoliy Sachenko
Main category: cs.CV
TL;DR: Ensemble of YOLO models combining visible and thermal imagery improves wind turbine defect detection accuracy over single models
Details
Motivation: UAVs provide new monitoring capabilities for wind power plants, but reliable defect detection requires high-resolution multispectral data processing methods.Method: Developed ensemble approach integrating general-purpose YOLOv8 with specialized thermal model using bounding box fusion algorithm to combine predictions from visible and thermal channels
Result: Achieved mAP@.5 of 0.93 and F1-score of 0.90, outperforming standalone YOLOv8 model (mAP@.5 of 0.91)
Conclusion: Combining multiple YOLO architectures with fused multispectral data provides more reliable detection of both visual and thermal defects in wind turbine components
Abstract: Unmanned aerial vehicles (UAVs) equipped with advanced sensors have opened up new opportunities for monitoring wind power plants, including blades, towers, and other critical components. However, reliable defect detection requires high-resolution data and efficient methods to process multispectral imagery. In this research, we aim to enhance defect detection accuracy through the development of an ensemble of YOLO-based deep learning models that integrate both visible and thermal channels. We propose an ensemble approach that integrates a general-purpose YOLOv8 model with a specialized thermal model, using a sophisticated bounding box fusion algorithm to combine their predictions. Our experiments show this approach achieves a mean Average Precision (mAP@.5) of 0.93 and an F1-score of 0.90, outperforming a standalone YOLOv8 model, which scored an mAP@.5 of 0.91. These findings demonstrate that combining multiple YOLO architectures with fused multispectral data provides a more reliable solution, improving the detection of both visual and thermal defects.
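The abstract leaves the bounding box fusion algorithm unspecified; the sketch below shows one plausible confidence-weighted scheme for merging visible and thermal detections, assuming each detection is a (box, score) pair. The IoU matching threshold and the weighting rule are illustrative assumptions, not the paper's exact algorithm.

```python
import numpy as np

def iou(a, b):
    # IoU of two (x1, y1, x2, y2) boxes.
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / (union + 1e-9)

def fuse_detections(vis_dets, thermal_dets, iou_thr=0.5):
    # Each detection is (box, score). Cross-channel pairs that overlap are
    # merged into a confidence-weighted average box; unmatched ones are kept.
    fused, used = [], set()
    for vb, vs in vis_dets:
        match = next((j for j, (tb, ts) in enumerate(thermal_dets)
                      if j not in used and iou(vb, tb) >= iou_thr), None)
        if match is None:
            fused.append((vb, vs))
        else:
            tb, ts = thermal_dets[match]
            used.add(match)
            box = (np.asarray(vb) * vs + np.asarray(tb) * ts) / (vs + ts)
            fused.append((tuple(box), max(vs, ts)))
    fused += [d for j, d in enumerate(thermal_dets) if j not in used]
    return fused
```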
[136] VisioFirm: Cross-Platform AI-assisted Annotation Tool for Computer Vision
Safouane El Ghazouali, Umberto Michelucci
Main category: cs.CV
TL;DR: VisioFirm is an open-source web application that uses AI-assisted automation to streamline image labeling, reducing manual annotation effort by up to 90% while maintaining high accuracy.
Details
Motivation: Traditional image annotation tools require extensive manual input, limiting scalability for large datasets. AI models need annotated data to learn patterns, but annotation is labor-intensive, especially for complex tasks like object detection and segmentation.Method: VisioFirm integrates state-of-the-art foundation models with a filtering pipeline. It uses CLIP combined with pre-trained detectors (Ultralytics models) for common classes and zero-shot models (Grounding DINO) for custom labels. Features low-confidence thresholding for high recall, interactive refinement tools, on-the-fly segmentation with Segment Anything accelerated via WebGPU, and uses clustering and IoU-graph for redundant detection suppression.
Result: The tool demonstrates up to 90% reduction in manual effort on diverse datasets while maintaining high annotation accuracy. Initial predictions on COCO-type classes are mostly correct, and the system supports multiple export formats (YOLO, COCO, Pascal VOC, CSV) with offline operation after model caching.
Conclusion: VisioFirm successfully addresses the scalability limitations of traditional annotation tools by combining AI automation with human refinement, significantly reducing manual effort while ensuring high-quality annotations through its hybrid approach and advanced filtering techniques.
Abstract: AI models rely on annotated data to learn patterns and perform predictions. Annotation is usually a labor-intensive step that requires associating labels ranging from a simple classification label to more complex tasks such as object detection, oriented bounding box estimation, and instance segmentation. Traditional tools often require extensive manual input, limiting scalability for large datasets. To address this, we introduce VisioFirm, an open-source web application designed to streamline image labeling through AI-assisted automation. VisioFirm integrates state-of-the-art foundation models into an interface with a filtering pipeline to reduce human-in-the-loop efforts. This hybrid approach employs CLIP combined with pre-trained detectors like Ultralytics models for common classes and zero-shot models such as Grounding DINO for custom labels, generating initial annotations with low-confidence thresholding to maximize recall. Through this framework, when tested on COCO-type classes, initial predictions have proven to be mostly correct, though users can refine these via interactive tools supporting bounding boxes, oriented bounding boxes, and polygons. Additionally, VisioFirm has on-the-fly segmentation powered by Segment Anything accelerated through WebGPU for browser-side efficiency. The tool supports multiple export formats (YOLO, COCO, Pascal VOC, CSV) and operates offline after model caching, enhancing accessibility. VisioFirm demonstrates up to 90% reduction in manual effort through benchmarks on diverse datasets, while maintaining high annotation accuracy via clustering of connected components with CLIP-based disambiguation and an IoU-graph for redundant detection suppression. VisioFirm can be accessed from \href{https://github.com/OschAI/VisioFirm}{https://github.com/OschAI/VisioFirm}.
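The IoU-graph suppression step can be pictured as follows: treat candidate boxes as graph nodes, connect pairs whose overlap exceeds a threshold, and keep one box per connected component. Below is a hedged sketch using a small union-find with an assumed highest-score-per-component rule; VisioFirm's actual code may differ.

```python
def box_iou(a, b):
    # IoU of two (x1, y1, x2, y2) boxes.
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / (union + 1e-9)

def suppress_redundant(boxes, scores, iou_thr=0.6):
    # Union-find over the IoU graph; keep the top-scoring box per component.
    parent = list(range(len(boxes)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(boxes)):
        for j in range(i + 1, len(boxes)):
            if box_iou(boxes[i], boxes[j]) >= iou_thr:
                parent[find(i)] = find(j)

    best = {}  # component root -> index of its highest-scoring box
    for i in range(len(boxes)):
        r = find(i)
        if r not in best or scores[i] > scores[best[r]]:
            best[r] = i
    return sorted(best.values())
```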
[137] DUDE: Diffusion-Based Unsupervised Cross-Domain Image Retrieval
Ruohong Yang, Peng Hu, Yunfan Li, Xi Peng
Main category: cs.CV
TL;DR: DUDE is a novel unsupervised cross-domain image retrieval method that uses text-to-image generative models to disentangle object features from domain-specific styles, achieving state-of-the-art performance across multiple domains.
Details
Motivation: Existing UCIR methods struggle with domain gaps because object features are entangled with domain-specific styles, making cross-domain retrieval challenging without annotations.Method: DUDE leverages text-to-image generative models to disentangle object features from domain styles, then progressively aligns mutual neighbors from within domains to across domains for reliable feature alignment.
Result: Extensive experiments show DUDE achieves state-of-the-art performance across three benchmark datasets covering 13 domains.
Conclusion: Feature disentanglement through generative models combined with progressive cross-domain alignment effectively addresses the domain gap problem in unsupervised image retrieval.
Abstract: Unsupervised cross-domain image retrieval (UCIR) aims to retrieve images of the same category across diverse domains without relying on annotations. Existing UCIR methods, which align cross-domain features for the entire image, often struggle with the domain gap, as the object features critical for retrieval are frequently entangled with domain-specific styles. To address this challenge, we propose DUDE, a novel UCIR method building upon feature disentanglement. In brief, DUDE leverages a text-to-image generative model to disentangle object features from domain-specific styles, thus facilitating semantical image retrieval. To further achieve reliable alignment of the disentangled object features, DUDE aligns mutual neighbors from within domains to across domains in a progressive manner. Extensive experiments demonstrate that DUDE achieves state-of-the-art performance across three benchmark datasets over 13 domains. The code will be released.
[138] Learning Active Perception via Self-Evolving Preference Optimization for GUI Grounding
Wanfu Wang, Qipeng Huang, Guangquan Xue, Xiaobo Liang, Juntao Li
Main category: cs.CV
TL;DR: LASER is a self-evolving framework that enhances VLMs’ multi-step perception for precise coordinate prediction in GUI grounding tasks, achieving state-of-the-art performance on ScreenSpot benchmarks.
Details
Motivation: Enabling VLMs to effectively reason over appropriate image regions remains challenging in GUI grounding, especially with high-resolution inputs and complex multi-element visual interactions.Method: Integrates Monte Carlo quality estimation with IoU-based region quality evaluation to construct high-quality preference data, guiding models to focus on instruction-relevant regions and adapt reasoning steps based on task complexity.
Result: Achieves 55.7 score on ScreenSpot-Pro benchmark with GTA1-7B model, establishing new state-of-the-art among 7B-scale models with consistent performance gains on ScreenSpot benchmarks.
Conclusion: LASER framework effectively endows VLMs with multi-step perception capabilities for precise coordinate prediction, demonstrating significant improvements in GUI grounding tasks.
Abstract: Vision Language Models (VLMs) have recently achieved significant progress in bridging visual perception and linguistic reasoning. Recently, the OpenAI o3 model introduced a zoom-in search strategy that effectively elicits active perception capabilities in VLMs, improving downstream task performance. However, enabling VLMs to reason effectively over appropriate image regions remains a core challenge in GUI grounding, particularly under high-resolution inputs and complex multi-element visual interactions. In this work, we propose LASER, a self-evolving framework that progressively endows VLMs with multi-step perception capabilities, enabling precise coordinate prediction. Specifically, our approach integrates Monte Carlo quality estimation with Intersection-over-Union (IoU)-based region quality evaluation to jointly encourage both accuracy and diversity in constructing high-quality preference data. This combination explicitly guides the model to focus on instruction-relevant key regions while adaptively allocating reasoning steps based on task complexity. Comprehensive experiments on the ScreenSpot Pro and ScreenSpot-v2 benchmarks demonstrate consistent performance gains, validating the effectiveness of our method. Furthermore, when fine-tuned on GTA1-7B, LASER achieves a score of 55.7 on the ScreenSpot-Pro benchmark, establishing a new state-of-the-art (SoTA) among 7B-scale models.
[139] Differential Morphological Profile Neural Networks for Semantic Segmentation
David Huangal, J. Alex Hurt
Main category: cs.CV
TL;DR: Integrating Differential Morphological Profile (DMP) features into semantic segmentation networks improves performance on overhead remote sensing imagery by addressing scale variation and shape extraction challenges.
Details
Motivation: Overhead remote sensing imagery has unique challenges like extreme scale variation, foreground-background imbalance, and large image sizes that standard segmentation networks (developed for ground-perspective photos) don't address effectively.Method: Incorporated DMP (multi-scale shape extraction method based on grayscale morphology) into three state-of-the-art segmentation architectures using both direct input (adapting input stems) and hybrid dual-stream designs that fuse RGB and DMP encoders.
Result: Hybrid DMP architectures consistently outperformed direct-input variants and were capable of surpassing non-DMP models on mIoU, F1, and Recall metrics using the iSAID benchmark dataset.
Conclusion: DMP integration, particularly through hybrid dual-stream architectures, provides effective shape information that enhances semantic segmentation performance for overhead remote sensing imagery.
Abstract: Semantic segmentation of overhead remote sensing imagery enables applications in mapping, urban planning, and disaster response. State-of-the-art segmentation networks are typically developed and tuned on ground-perspective photographs and do not directly address remote sensing challenges such as extreme scale variation, foreground-background imbalance, and large image sizes. We explore the incorporation of the differential morphological profile (DMP), a multi-scale shape extraction method based on grayscale morphology, into modern segmentation networks. Prior studies have shown that the DMP can provide critical shape information to Deep Neural Networks to enable superior detection and classification performance in overhead imagery. In this work, we extend prior DMPNet work beyond classification and object detection by integrating DMP features into three state-of-the-art convolutional and transformer semantic segmentation architectures. We utilize both direct input, which adapts the input stem of feature extraction architectures to accept DMP channels, and hybrid architectures, a dual-stream design that fuses RGB and DMP encoders. Using the iSAID benchmark dataset, we evaluate a variety of DMP differentials and structuring element shapes to more effectively provide shape information to the model. Our results show that while non-DMP models generally outperform the direct-input variants, hybrid DMP consistently outperforms direct-input and is capable of surpassing a non-DMP model on mIoU, F1, and Recall.
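For readers unfamiliar with the DMP, here is a minimal sketch of the standard formulation: differentials between successive grayscale openings and closings at increasing structuring-element radii. Classical DMP formulations often use openings by reconstruction, and the paper evaluates several differentials and SE shapes, so the plain-opening version below is only illustrative.

```python
import numpy as np
from skimage.morphology import opening, closing, disk

def dmp(image, radii=(1, 2, 4, 8)):
    # Differentials between successive openings/closings at growing radii.
    prev_open = prev_close = image.astype(np.float32)
    channels = []
    for r in radii:
        se = disk(r)
        cur_open = opening(image, se).astype(np.float32)
        cur_close = closing(image, se).astype(np.float32)
        channels.append(prev_open - cur_open)    # bright structures near scale r
        channels.append(cur_close - prev_close)  # dark structures near scale r
        prev_open, prev_close = cur_open, cur_close
    return np.stack(channels)  # (2 * len(radii), H, W), stackable with RGB input
```

These extra channels are what the direct-input variants feed to an adapted input stem, while the hybrid variants route them through a separate DMP encoder.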
[140] TauGenNet: Plasma-Driven Tau PET Image Synthesis via Text-Guided 3D Diffusion Models
Yuxin Gong, Se-in Jang, Wei Shao, Yi Su, Kuang Gong
Main category: cs.CV
TL;DR: A text-guided 3D diffusion model that synthesizes tau PET images using structural MRI and plasma p-tau217 measurements as multimodal conditions, providing a cost-effective alternative to actual tau PET scans.
Details
Motivation: Tau PET scans are crucial for Alzheimer's disease diagnosis but are expensive and limited in availability, while structural MRI and plasma biomarkers are more accessible and non-invasive.Method: Proposes a 3D diffusion model that uses textual prompts from plasma p-tau217 measurements and anatomical constraints from structural MRI to synthesize realistic 3D tau PET images.
Result: The framework successfully generates clinically meaningful 3D tau PET images across various disease stages using ADNI data, demonstrating realistic synthesis capabilities.
Conclusion: This approach enables tau PET data augmentation, provides a cost-effective visualization alternative for tau pathology, and supports disease progression simulation under varying biomarker conditions.
Abstract: Accurate quantification of tau pathology via tau positron emission tomography (PET) scan is crucial for diagnosing and monitoring Alzheimer’s disease (AD). However, the high cost and limited availability of tau PET restrict its widespread use. In contrast, structural magnetic resonance imaging (MRI) and plasma-based biomarkers provide non-invasive and widely available complementary information related to brain anatomy and disease progression. In this work, we propose a text-guided 3D diffusion model for 3D tau PET image synthesis, leveraging multimodal conditions from both structural MRI and plasma measurement. Specifically, the textual prompt is from the plasma p-tau217 measurement, which is a key indicator of AD progression, while MRI provides anatomical structure constraints. The proposed framework is trained and evaluated using clinical AV1451 tau PET data from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) database. Experimental results demonstrate that our approach can generate realistic, clinically meaningful 3D tau PET across a range of disease stages. The proposed framework can help perform tau PET data augmentation under different settings, provide a non-invasive, cost-effective alternative for visualizing tau pathology, and support the simulation of disease progression under varying plasma biomarker levels and cognitive conditions.
[141] Dual-Scale Volume Priors with Wasserstein-Based Consistency for Semi-Supervised Medical Image Segmentation
Junying Meng, Gangxuan Zhou, Jun Liu, Weihong Guo
Main category: cs.CV
TL;DR: A semi-supervised medical image segmentation framework that integrates spatial regularization and volume priors through Wasserstein distance constraints at both image and dataset scales.
Details
Motivation: Most existing semi-supervised medical image segmentation networks overlook effective methodological guidance for feature extraction and important prior information from datasets.Method: Integrates explicit volume prior at image scale and Threshold Dynamics spatial regularization into backbone segmentation network. Uses regression network to estimate target region volumes and enforces constraints through image-scale and dataset-scale Wasserstein distance loss functions.
Result: Experimental results on 2017 ACDC dataset, PROMISE12 dataset, and thigh muscle MR image dataset show superiority of the proposed method.
Conclusion: The framework effectively leverages volume priors and spatial regularization to improve semi-supervised medical image segmentation performance across multiple datasets.
Abstract: Despite significant progress in semi-supervised medical image segmentation, most existing segmentation networks overlook effective methodological guidance for feature extraction and important prior information from datasets. In this paper, we develop a semi-supervised medical image segmentation framework that effectively integrates spatial regularization methods and volume priors. Specifically, our approach integrates a strong explicit volume prior at the image scale and Threshold Dynamics spatial regularization, both derived from variational models, into the backbone segmentation network. The target region volumes for each unlabeled image are estimated by a regression network, which effectively regularizes the backbone segmentation network through an image-scale Wasserstein distance constraint, ensuring that the class ratios in the segmentation results for each unlabeled image match those predicted by the regression network. Additionally, we design a dataset-scale Wasserstein distance loss function based on a weak implicit volume prior, which enforces that the volume distribution predicted for the unlabeled dataset is similar to that of the labeled dataset. Experimental results on the 2017 ACDC dataset, PROMISE12 dataset, and thigh muscle MR image dataset show the superiority of the proposed method.
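To make the image-scale constraint concrete, here is a minimal sketch assuming the 1-D discrete case, where the Wasserstein-1 distance between two class-ratio histograms reduces to the L1 distance between their cumulative sums. The function names and the coupling to a volume-regression network are assumptions, not the paper's verified implementation.

```python
import torch

def class_ratio(seg_logits):
    # Soft class ratios from (B, C, H, W) logits: softmax then spatial mean.
    return torch.softmax(seg_logits, dim=1).mean(dim=(2, 3))  # (B, C)

def wasserstein1_loss(pred_ratios, target_ratios):
    # In 1-D, W1 between discrete distributions is the L1 distance of CDFs.
    cdf_pred = torch.cumsum(pred_ratios, dim=1)
    cdf_target = torch.cumsum(target_ratios, dim=1)
    return (cdf_pred - cdf_target).abs().sum(dim=1).mean()

# Assumed usage: target ratios come from the volume-regression network.
# loss = wasserstein1_loss(class_ratio(seg_logits), volume_regressor(images))
```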
[142] PAOLI: Pose-free Articulated Object Learning from Sparse-view Images
Jianning Deng, Kartic Subr, Hakan Bilen
Main category: cs.CV
TL;DR: Self-supervised framework for learning articulated object representations from sparse-view images without camera pose supervision, using only 4 views per articulation.
Details
Motivation: To overcome limitations of prior methods that require dense multi-view observations and ground-truth camera poses, enabling articulated object reconstruction under weaker input assumptions.Method: Reconstruct each articulation independently using sparse-view 3D reconstruction, learn deformation field for cross-pose correspondences, progressive disentanglement of static/moving parts, and joint optimization of geometry, appearance, and kinematics with self-supervised consistency losses.
Result: Produces accurate and detailed articulated object representations from sparse inputs, outperforming existing approaches on standard benchmarks and real-world examples.
Conclusion: The method successfully learns articulated object representations with significantly weaker input requirements than previous approaches, demonstrating robust performance with minimal supervision.
Abstract: We present a novel self-supervised framework for learning articulated object representations from sparse-view, unposed images. Unlike prior methods that require dense multi-view observations and ground-truth camera poses, our approach operates with as few as four views per articulation and no camera supervision. To address the inherent challenges, we first reconstruct each articulation independently using recent advances in sparse-view 3D reconstruction, then learn a deformation field that establishes dense correspondences across poses. A progressive disentanglement strategy further separates static from moving parts, enabling robust separation of camera and object motion. Finally, we jointly optimize geometry, appearance, and kinematics with a self-supervised loss that enforces cross-view and cross-pose consistency. Experiments on the standard benchmark and real-world examples demonstrate that our method produces accurate and detailed articulated object representations under significantly weaker input assumptions than existing approaches.
[143] Noisy Label Refinement with Semantically Reliable Synthetic Images
Yingxuan Li, Jiafeng Mao, Yusuke Matsui
Main category: cs.CV
TL;DR: Using synthetic images as reference points to identify and correct mislabeled samples in noisy datasets, improving classification accuracy especially for semantic label noise.
Details
Motivation: Semantic noise in image classification datasets where visually similar categories are frequently mislabeled poses challenges for supervised learning. Synthetic images from text-to-image models provide reliable labels but have domain gaps when used directly.Method: Proposes a novel approach that leverages synthetic images as reliable reference points to identify and correct mislabeled samples in noisy datasets, rather than using them directly for training.
Result: Significantly improves classification accuracy under various noise conditions, especially with semantic label noise. Achieves 30% improvement on CIFAR-10, 11% on CIFAR-100 under 70% semantic noise, and 24% on ImageNet-100 under real-world noise.
Conclusion: The method effectively addresses semantic noise in datasets by using synthetic images as reference points, works orthogonally with existing noise-robust techniques, and achieves superior performance when combined with state-of-the-art methods.
Abstract: Semantic noise in image classification datasets, where visually similar categories are frequently mislabeled, poses a significant challenge to conventional supervised learning approaches. In this paper, we explore the potential of using synthetic images generated by advanced text-to-image models to address this issue. Although these high-quality synthetic images come with reliable labels, their direct application in training is limited by domain gaps and diversity constraints. Unlike conventional approaches, we propose a novel method that leverages synthetic images as reliable reference points to identify and correct mislabeled samples in noisy datasets. Extensive experiments across multiple benchmark datasets show that our approach significantly improves classification accuracy under various noise conditions, especially in challenging scenarios with semantic label noise. Additionally, since our method is orthogonal to existing noise-robust learning techniques, when combined with state-of-the-art noise-robust training methods, it achieves superior performance, improving accuracy by 30% on CIFAR-10 and by 11% on CIFAR-100 under 70% semantic noise, and by 24% on ImageNet-100 under real-world noise conditions.
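One concrete way to use synthetic images as reference points, sketched under stated assumptions: embed the reliably labeled synthetic images into per-class prototypes, then relabel a training sample when its nearest prototype disagrees with the given label by a confidence margin. The margin rule and helper names below are hypothetical, not the paper's verified criterion.

```python
import numpy as np

def build_prototypes(synth_embeddings, synth_labels, num_classes):
    # Mean embedding per class from the (reliably labeled) synthetic images.
    # synth_embeddings: (M, D) array, synth_labels: (M,) int array.
    protos = np.stack([synth_embeddings[synth_labels == c].mean(axis=0)
                       for c in range(num_classes)])
    return protos / np.linalg.norm(protos, axis=1, keepdims=True)

def refine_labels(embeddings, noisy_labels, protos, margin=0.1):
    # Relabel a sample when the best prototype beats the given label's
    # prototype by at least `margin` in cosine similarity (hypothetical rule).
    # embeddings: (N, D), noisy_labels: (N,) int array.
    emb = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = emb @ protos.T                                   # (N, C)
    best = sims.argmax(axis=1)
    gain = sims.max(axis=1) - sims[np.arange(len(emb)), noisy_labels]
    return np.where(gain > margin, best, noisy_labels)
```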
[144] Self-adaptive Dataset Construction for Real-World Multimodal Safety Scenarios
Jingen Qu, Lijun Li, Bo Zhang, Yichen Yan, Jing Shao
Main category: cs.CV
TL;DR: This paper introduces an image-oriented self-adaptive dataset construction method for real-world multimodal safety scenarios, generating 35k image-text pairs with guidance responses, and proposes a standardized safety evaluation metric using a fine-tuned safety judge model.
Details
Motivation: Current risk-oriented dataset construction methods fail to cover the growing complexity of real-world multimodal safety scenarios, and there's a lack of unified evaluation metrics to prove their overall effectiveness.Method: Proposes an image-oriented self-adaptive dataset construction method that starts with images and constructs paired text and guidance responses. Also introduces a standardized safety evaluation metric by fine-tuning a safety judge model and evaluating its capabilities on other safety datasets.
Result: Generated an RMS dataset of 35k image-text pairs with guidance responses. Extensive experiments demonstrate the effectiveness of the proposed image-oriented pipeline, confirming its scalability and effectiveness.
Conclusion: The image-oriented approach offers a new perspective for constructing real-world multimodal safety datasets, addressing the limitations of current risk-oriented methods and providing a standardized evaluation framework.
Abstract: Multimodal large language models (MLLMs) are rapidly evolving, presenting increasingly complex safety challenges. However, current dataset construction methods, which are risk-oriented, fail to cover the growing complexity of real-world multimodal safety scenarios (RMS). Moreover, due to the lack of a unified evaluation metric, their overall effectiveness remains unproven. This paper introduces a novel image-oriented self-adaptive dataset construction method for RMS, which starts with images and ends by constructing paired text and guidance responses. Using the image-oriented method, we automatically generate an RMS dataset comprising 35k image-text pairs with guidance responses. Additionally, we introduce a standardized safety dataset evaluation metric: fine-tuning a safety judge model and evaluating its capabilities on other safety datasets. Extensive experiments on various tasks demonstrate the effectiveness of the proposed image-oriented pipeline. The results confirm the scalability and effectiveness of the image-oriented approach, offering a new perspective for the construction of real-world multimodal safety datasets.
[145] Efficient Odd-One-Out Anomaly Detection
Silvio Chito, Paolo Rabino, Tatiana Tommasi
Main category: cs.CV
TL;DR: A DINO-based model for odd-one-out anomaly detection that reduces parameters by 1/3 and training time by 3x while maintaining competitive performance, with MLLM baseline analysis.
Details
Motivation: Address challenges in odd-one-out anomaly detection requiring spatial reasoning across multiple views and relational reasoning, while focusing on efficiency improvements over current methods.Method: Propose a DINO-based model that significantly reduces computational requirements - 1/3 fewer parameters and 3x faster training compared to state-of-the-art approaches.
Result: Maintains competitive performance while achieving substantial efficiency gains. Also introduces MLLM baseline revealing current limitations in structured visual reasoning tasks.
Conclusion: The proposed efficient DINO-based approach successfully addresses odd-one-out detection challenges with significantly reduced computational costs, providing insights into MLLM limitations for visual reasoning.
Abstract: The recently introduced odd-one-out anomaly detection task involves identifying the odd-looking instances within a multi-object scene. This problem presents several challenges for modern deep learning models, demanding spatial reasoning across multiple views and relational reasoning to understand context and generalize across varying object categories and layouts. We argue that these challenges must be addressed with efficiency in mind. To this end, we propose a DINO-based model that reduces the number of parameters by one third and shortens training time by a factor of three compared to the current state-of-the-art, while maintaining competitive performance. Our experimental evaluation also introduces a Multimodal Large Language Model baseline, providing insights into its current limitations in structured visual reasoning tasks. The project page can be found at https://silviochito.github.io/EfficientOddOneOut/
[146] GeoArena: An Open Platform for Benchmarking Large Vision-language Models on WorldWide Image Geolocalization
Pengyue Jia, Yingyi Zhang, Xiangyu Zhao, Yixuan Li
Main category: cs.CV
TL;DR: GeoArena is an open platform that addresses data leakage and privacy issues in image geolocalization evaluation by using in-the-wild images and human pairwise judgments instead of exact coordinates.
Details
Motivation: Current image geolocalization evaluation suffers from data leakage (models pretrained on test data) and privacy concerns with exact coordinate metrics that neglect reasoning processes.Method: Proposed GeoArena platform that allows users to upload in-the-wild images and uses pairwise human judgments to evaluate which model outputs better align with human expectations.
Result: Deployed online for 2 months, collected thousands of voting records, established detailed analysis and leaderboard comparing different large vision-language models on geolocalization tasks.
Conclusion: GeoArena provides a more accurate, privacy-preserving, and human-centered benchmarking approach for evaluating image geolocalization capabilities of LVLMs.
Abstract: Image geolocalization aims to predict the geographic location of images captured anywhere on Earth, but its global nature presents significant challenges. Current evaluation methodologies suffer from two major limitations. First, data leakage: advanced approaches often rely on large vision-language models (LVLMs) to predict image locations, yet these models are frequently pretrained on the test datasets, compromising the accuracy of evaluating a model’s actual geolocalization capability. Second, existing metrics primarily rely on exact geographic coordinates to assess predictions, which not only neglects the reasoning process but also raises privacy concerns when user-level location data is required. To address these issues, we propose GeoArena, the first open platform for evaluating LVLMs on worldwide image geolocalization tasks, offering true in-the-wild and human-centered benchmarking. GeoArena enables users to upload in-the-wild images for a more diverse evaluation corpus, and it leverages pairwise human judgments to determine which model output better aligns with human expectations. Our platform has been deployed online for two months, during which we collected thousands of voting records. Based on this data, we conduct a detailed analysis and establish a leaderboard of different LVLMs on the image geolocalization task.
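The abstract does not state how pairwise votes become a leaderboard; an Elo-style update, common for arena-style platforms, is one plausible choice and is sketched below purely as an assumption.

```python
def elo_update(rating_a, rating_b, winner, k=32.0):
    # Return updated (rating_a, rating_b) after one vote.
    # winner: "a", "b", or "tie". k controls update magnitude.
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))
    score_a = {"a": 1.0, "b": 0.0, "tie": 0.5}[winner]
    rating_a += k * (score_a - expected_a)
    rating_b += k * ((1.0 - score_a) - (1.0 - expected_a))
    return rating_a, rating_b

# Assumed usage: start every model at 1000 and replay the voting records.
ratings = {"model_x": 1000.0, "model_y": 1000.0}
ratings["model_x"], ratings["model_y"] = elo_update(
    ratings["model_x"], ratings["model_y"], winner="a")
```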
[147] From Editor to Dense Geometry Estimator
JiYuan Wang, Chunyu Lin, Lei Sun, Rongying Liu, Lang Nie, Mingxing Li, Kang Liao, Xiangxiang Chu, Yao Zhao
Main category: cs.CV
TL;DR: FE2E is a framework that adapts diffusion-based image editing models (rather than text-to-image generators) for dense geometry estimation, achieving significant performance improvements in monocular depth and normal estimation without additional training data.
Details
Motivation: Dense prediction is inherently an image-to-image task, suggesting image editing models may be more suitable than text-to-image generative models as a foundation for fine-tuning. Editing models possess inherent structural priors that enable more stable convergence and higher performance.Method: Adapts an advanced editing model based on Diffusion Transformer (DiT) architecture. Reformulates the editor’s flow matching loss into “consistent velocity” training objective. Uses logarithmic quantization to resolve precision conflicts. Leverages DiT’s global attention for joint depth and normal estimation in a single forward pass.
Result: Achieves over 35% performance gains on ETH3D dataset. Outperforms DepthAnything series trained on 100x more data. Shows impressive improvements in zero-shot monocular depth and normal estimation across multiple datasets.
Conclusion: Image editing models are superior to generative models for dense geometry estimation tasks. The FE2E framework successfully adapts diffusion-based editors for deterministic prediction tasks with significant performance improvements.
Abstract: Leveraging visual priors from pre-trained text-to-image (T2I) generative models has shown success in dense prediction. However, dense prediction is inherently an image-to-image task, suggesting that image editing models, rather than T2I generative models, may be a more suitable foundation for fine-tuning. Motivated by this, we conduct a systematic analysis of the fine-tuning behaviors of both editors and generators for dense geometry estimation. Our findings show that editing models possess inherent structural priors, which enable them to converge more stably by “refining” their innate features, and ultimately achieve higher performance than their generative counterparts. Based on these findings, we introduce \textbf{FE2E}, a framework that pioneeringly adapts an advanced editing model based on the Diffusion Transformer (DiT) architecture for dense geometry prediction. Specifically, to tailor the editor for this deterministic task, we reformulate the editor’s original flow matching loss into the “consistent velocity” training objective. We also use logarithmic quantization to resolve the precision conflict between the editor’s native BFloat16 format and the high precision demand of our tasks. Additionally, we leverage the DiT’s global attention for a cost-free joint estimation of depth and normals in a single forward pass, enabling their supervisory signals to mutually enhance each other. Without scaling up the training data, FE2E achieves impressive performance improvements in zero-shot monocular depth and normal estimation across multiple datasets. Notably, it achieves over 35% performance gains on the ETH3D dataset and outperforms the DepthAnything series, which is trained on 100$\times$ data. The project page can be accessed \href{https://amap-ml.github.io/FE2E/}{here}.
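As background for the loss reformulation, here is a minimal sketch of the standard flow-matching velocity objective the editor starts from, assuming a linear interpolation path; the paper's “consistent velocity” variant itself is not reproduced here, and the model signature is an assumption.

```python
import torch

def flow_matching_loss(model, x0, x1):
    # x0: noise batch, x1: data batch, same shape (B, ...).
    # model(x_t, t) is assumed to predict a velocity field.
    b = x0.shape[0]
    t = torch.rand(b, device=x0.device).view(b, *([1] * (x0.dim() - 1)))
    x_t = (1.0 - t) * x0 + t * x1   # point on the straight-line path
    target_v = x1 - x0              # its (constant) velocity along the path
    pred_v = model(x_t, t.flatten())
    return torch.mean((pred_v - target_v) ** 2)
```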
[148] The Telephone Game: Evaluating Semantic Drift in Unified Models
Sabbir Mollah, Rohit Gupta, Sirnam Swetha, Qingyang Liu, Ahnaf Munir, Mubarak Shah
Main category: cs.CV
TL;DR: UCF-UM is a cyclic evaluation framework that measures semantic drift in unified vision-language models by alternating between image-to-text and text-to-image generation over multiple cycles, revealing cross-modal consistency issues that single-pass metrics miss.
Details
Motivation: Existing evaluations for unified vision-language models assess image-to-text and text-to-image capabilities in isolation, failing to reveal whether models maintain semantic consistency when cycling between modalities or if they can render concepts they understand.Method: Proposed UCF-UM framework with three metrics: Mean Cumulative Drift (embedding-based semantic loss), Semantic Drift Rate (decay rate), and Multi-Generation GenEval (object-level compliance). Created ND400 benchmark from NoCaps and DOCCI to test generalization beyond COCO, evaluating seven recent models.
Result: Substantial variation in cross-modal stability found - some models like BAGEL maintain semantics over many alternations, while others like Vila-u drift quickly despite strong single-pass scores. Reveals cyclic consistency issues not captured by standard evaluations.
Conclusion: Cyclic consistency is a necessary complement to standard I2T and T2I evaluations. UCF-UM provides practical metrics to assess unified models’ cross-modal stability and strength of shared representations, addressing a critical gap in current evaluation protocols.
Abstract: Employing a single, unified model (UM) for both visual understanding (image-to-text: I2T) and visual generation (text-to-image: T2I) has opened a new direction in Visual Language Model (VLM) research. While UMs can also support broader unimodal tasks (e.g., text-to-text, image-to-image), we focus on the core cross-modal pair T2I and I2T, as consistency between understanding and generation is critical for downstream use. Existing evaluations consider these capabilities in isolation: FID and GenEval for T2I, and benchmarks such as MME, MMBench for I2T. These single-pass metrics do not reveal whether a model that understands a concept can also render it, nor whether meaning is preserved when cycling between image and text modalities. To address this, we introduce the Unified Consistency Framework for Unified Models (UCF-UM), a cyclic evaluation protocol that alternates I2T and T2I over multiple generations to quantify semantic drift. UCF formulates 3 metrics: (i) Mean Cumulative Drift (MCD), an embedding-based measure of overall semantic loss; (ii) Semantic Drift Rate (SDR), which summarizes semantic decay rate; and (iii) Multi-Generation GenEval (MGG), an object-level compliance score extending GenEval. To assess generalization beyond COCO, which is widely used in training, we create a new benchmark ND400, sampled from NoCaps and DOCCI, and evaluate seven recent models on it. UCF-UM reveals substantial variation in cross-modal stability: some models like BAGEL maintain semantics over many alternations, whereas others like Vila-u drift quickly despite strong single-pass scores. Our results highlight cyclic consistency as a necessary complement to standard I2T and T2I evaluations, and provide practical metrics to consistently assess unified models’ cross-modal stability and the strength of their shared representations. Code: https://github.com/mollahsabbir/Semantic-Drift-in-Unified-Models
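A minimal sketch of the cyclic protocol follows: alternate I2T and T2I for several generations and track embedding distance to the source image. The i2t, t2i, and embed_fn callables stand in for the unified model and an encoder such as CLIP, and the drift aggregation paraphrases MCD rather than reproducing its exact definition.

```python
import numpy as np

def mean_cumulative_drift(image, i2t, t2i, embed_fn, num_cycles=5):
    # Per-cycle drift: 1 - cosine(source embedding, cycle-k image embedding).
    src = embed_fn(image)
    src = src / np.linalg.norm(src)
    drifts, current = [], image
    for _ in range(num_cycles):
        caption = i2t(current)      # image -> text
        current = t2i(caption)      # text -> image
        emb = embed_fn(current)
        emb = emb / np.linalg.norm(emb)
        drifts.append(1.0 - float(src @ emb))
    return float(np.mean(drifts)), drifts  # mean drift and its trajectory
```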
[149] MICACL: Multi-Instance Category-Aware Contrastive Learning for Long-Tailed Dynamic Facial Expression Recognition
Feng-Qi Cui, Zhen Lin, Xinlong Rao, Anyang Tong, Shiyao Li, Fei Wang, Changlin Chen, Bin Liu
Main category: cs.CV
TL;DR: MICACL is a novel multi-instance learning framework for dynamic facial expression recognition that addresses long-tailed distributions and spatio-temporal complexity through graph-enhanced instance interaction, weighted aggregation, and contrastive learning.
Details
Motivation: Existing DFER methods struggle with long-tailed category distributions and complex spatio-temporal feature modeling, leading to severe model induction bias that limits performance.Method: Proposes MICACL framework with: 1) Graph-Enhanced Instance Interaction Module (GEIIM) using adaptive adjacency matrices and multiscale convolutions, 2) Weighted Instance Aggregation Network (WIAN) for dynamic importance-based weighting, and 3) Multiscale Category-aware Contrastive Learning (MCCL) strategy to balance major/minor category training.
Result: Extensive experiments on DFEW and FERV39k datasets show MICACL achieves state-of-the-art performance with superior robustness and generalization capabilities.
Conclusion: MICACL effectively addresses long-tailed distribution challenges and spatio-temporal complexity in DFER through integrated instance interaction, weighted aggregation, and contrastive learning optimization.
Abstract: Dynamic facial expression recognition (DFER) faces significant challenges due to long-tailed category distributions and complexity of spatio-temporal feature modeling. While existing deep learning-based methods have improved DFER performance, they often fail to address these issues, resulting in severe model induction bias. To overcome these limitations, we propose a novel multi-instance learning framework called MICACL, which integrates spatio-temporal dependency modeling and long-tailed contrastive learning optimization. Specifically, we design the Graph-Enhanced Instance Interaction Module (GEIIM) to capture intricate spatio-temporal relationships between adjacent instances through adaptive adjacency matrices and multiscale convolutions. To enhance instance-level feature aggregation, we develop the Weighted Instance Aggregation Network (WIAN), which dynamically assigns weights based on instance importance. Furthermore, we introduce a Multiscale Category-aware Contrastive Learning (MCCL) strategy to balance training between major and minor categories. Extensive experiments on in-the-wild datasets (i.e., DFEW and FERV39k) demonstrate that MICACL achieves state-of-the-art performance with superior robustness and generalization.
[150] Stitching the Story: Creating Panoramic Incident Summaries from Body-Worn Footage
Dor Cohen, Inga Efrosman, Yehudit Aperstein, Alexander Apartsin
Main category: cs.CV
TL;DR: A computer vision pipeline that converts body-camera footage into panoramic scene summaries using SLAM, clustering, and multi-frame stitching for rapid situational awareness.
Details
Motivation: First responders need quick visual summaries of incident scenes from body-camera footage instead of reviewing lengthy videos in time-critical situations.Method: Uses monocular SLAM for camera trajectory estimation and spatial reconstruction, clusters camera poses to identify key viewpoints, selects representative frames, and fuses them into panoramic images with multi-frame stitching.
Result: Produces spatially coherent panoramic images that summarize complex environments, enabling rapid understanding and efficient decision-making.
Conclusion: The pipeline effectively transforms body-camera footage into concise visual summaries that support quick interpretation and incident review for first responders.
Abstract: First responders widely adopt body-worn cameras to document incident scenes and support post-event analysis. However, reviewing lengthy video footage is impractical in time-critical situations. Effective situational awareness demands a concise visual summary that can be quickly interpreted. This work presents a computer vision pipeline that transforms body-camera footage into informative panoramic images summarizing the incident scene. Our method leverages monocular Simultaneous Localization and Mapping (SLAM) to estimate camera trajectories and reconstruct the spatial layout of the environment. Key viewpoints are identified by clustering camera poses along the trajectory, and representative frames from each cluster are selected. These frames are fused into spatially coherent panoramic images using multi-frame stitching techniques. The resulting summaries enable rapid understanding of complex environments and facilitate efficient decision-making and incident review.
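The keyframe-selection step can be sketched as clustering SLAM camera positions and taking the frame nearest each cluster centroid. The cluster count and distance metric below are illustrative choices, and the stitching call mentioned in the comment is one standard option rather than the paper's implementation.

```python
import numpy as np
from sklearn.cluster import KMeans

def select_key_frames(camera_positions, frame_ids, n_views=6):
    # camera_positions: (N, 3) trajectory from monocular SLAM.
    # frame_ids: length-N list/array; returns one representative per cluster.
    km = KMeans(n_clusters=n_views, n_init=10).fit(camera_positions)
    reps = []
    for c in range(n_views):
        members = np.where(km.labels_ == c)[0]
        dists = np.linalg.norm(
            camera_positions[members] - km.cluster_centers_[c], axis=1)
        reps.append(frame_ids[members[np.argmin(dists)]])
    return reps

# The chosen frames would then go to a multi-frame stitcher
# (e.g., OpenCV's cv2.Stitcher_create()) to form the panoramic summary.
```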
[151] AnomalyLMM: Bridging Generative Knowledge and Discriminative Retrieval for Text-Based Person Anomaly Search
Hao Ju, Hu Zhang, Zhedong Zheng
Main category: cs.CV
TL;DR: AnomalyLMM is the first framework using Large Multi-modal Models for text-based person anomaly search, addressing fine-grained cross-modal alignment and sparse anomaly samples through a training-free adaptation approach.
Details
Motivation: Text-based person anomaly search is critical for public safety but faces challenges in fine-grained cross-modal alignment and sparse real-world anomaly samples. Current LMMs have potential but suffer from domain gaps and lack efficient adaptation strategies.Method: Proposes a coarse-to-fine pipeline with training-free adaptation techniques including masked cross-modal prompting, behavioral saliency prediction, and knowledge-aware re-ranking to enable zero-shot anomaly detection.
Result: Achieves +0.96% improvement in Recall@1 accuracy on the PAB dataset compared to competitive baselines, with interpretable alignment between textual anomalies and visual behaviors.
Conclusion: AnomalyLMM successfully bridges generative world knowledge with retrieval-centric anomaly detection, demonstrating LMMs’ potential for fine-grained person anomaly search without requiring training.
Abstract: With growing public safety demands, text-based person anomaly search has emerged as a critical task, aiming to retrieve individuals with abnormal behaviors via natural language descriptions. Unlike conventional person search, this task presents two unique challenges: (1) fine-grained cross-modal alignment between textual anomalies and visual behaviors, and (2) anomaly recognition under sparse real-world samples. While Large Multi-modal Models (LMMs) excel in multi-modal understanding, their potential for fine-grained anomaly retrieval remains underexplored, hindered by: (1) a domain gap between generative knowledge and discriminative retrieval, and (2) the absence of efficient adaptation strategies for deployment. In this work, we propose AnomalyLMM, the first framework that harnesses LMMs for text-based person anomaly search. Our key contributions are: (1) A novel coarse-to-fine pipeline integrating LMMs to bridge generative world knowledge with retrieval-centric anomaly detection; (2) A training-free adaptation cookbook featuring masked cross-modal prompting, behavioral saliency prediction, and knowledge-aware re-ranking, enabling zero-shot focus on subtle anomaly cues. As the first study to explore LMMs for this task, we conduct a rigorous evaluation on the PAB dataset, the only publicly available benchmark for text-based person anomaly search, with its curated real-world anomalies covering diverse scenarios (e.g., falling, collision, and being hit). Experiments show the effectiveness of the proposed method, surpassing the competitive baseline by +0.96% Recall@1 accuracy. Notably, our method reveals interpretable alignment between textual anomalies and visual behaviors, validated via qualitative analysis. Our code and models will be released for future research.
[152] Aesthetic Image Captioning with Saliency Enhanced MLLMs
Yilin Tao, Jiashui Huang, Huaze Xu, Ling Shao
Main category: cs.CV
TL;DR: ASE-MLLM is the first framework to integrate image aesthetic saliency into multimodal large language models for aesthetic image captioning, achieving state-of-the-art performance.
Details
Motivation: Existing AIC works using MLLMs don't specifically adapt models to focus on aesthetic content and primarily rely on fine-tuning without explicit aesthetic saliency integration.Method: Proposed ASE-MLLM framework with Image Aesthetic Saliency Module (IASM) to extract aesthetic saliency features, and IAS-ViT encoder that fuses aesthetic saliency with original image features via cross-attention mechanism.
Result: Significantly outperformed traditional methods and generic MLLMs on mainstream AIC benchmarks, achieving state-of-the-art performance.
Conclusion: Explicit incorporation of aesthetic saliency into MLLMs through the proposed end-to-end framework effectively addresses the limitations of existing approaches for aesthetic image captioning.
Abstract: Aesthetic Image Captioning (AIC) aims to generate textual descriptions of image aesthetics, becoming a key research direction in the field of computational aesthetics. In recent years, pretrained Multimodal Large Language Models (MLLMs) have advanced rapidly, leading to a significant increase in image aesthetics research that integrates both visual and textual modalities. However, most existing studies on image aesthetics primarily focus on predicting aesthetic ratings and have shown limited application in AIC. Existing AIC works leveraging MLLMs predominantly rely on fine-tuning methods without specifically adapting MLLMs to focus on target aesthetic content. To address this limitation, we propose the Aesthetic Saliency Enhanced Multimodal Large Language Model (ASE-MLLM), an end-to-end framework that explicitly incorporates aesthetic saliency into MLLMs. Within this framework, we introduce the Image Aesthetic Saliency Module (IASM), which efficiently and effectively extracts aesthetic saliency features from images. Additionally, we design IAS-ViT as the image encoder for MLLMs, this module fuses aesthetic saliency features with original image features via a cross-attention mechanism. To the best of our knowledge, ASE-MLLM is the first framework to integrate image aesthetic saliency into MLLMs specifically for AIC tasks. Extensive experiments demonstrated that our approach significantly outperformed traditional methods and generic MLLMs on current mainstream AIC benchmarks, achieving state-of-the-art (SOTA) performance.
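A minimal sketch of the cross-attention fusion that IAS-ViT is described as performing: image tokens attend to aesthetic-saliency tokens, with a residual connection preserving the original visual features. The dimensions, the single-block design, and the class name are assumptions; the real module sits inside the MLLM's image encoder.

```python
import torch
import torch.nn as nn

class SaliencyCrossAttentionFusion(nn.Module):
    def __init__(self, dim=768, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, image_tokens, saliency_tokens):
        # image_tokens: (B, N, D) queries; saliency_tokens: (B, M, D) keys/values.
        fused, _ = self.attn(query=image_tokens,
                             key=saliency_tokens,
                             value=saliency_tokens)
        return self.norm(image_tokens + fused)  # residual keeps original features

# Assumed usage:
# fusion = SaliencyCrossAttentionFusion()
# tokens = fusion(vit_tokens, iasm_saliency_tokens)  # (B, N, 768), (B, M, 768)
```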
[153] SSGaussian: Semantic-Aware and Structure-Preserving 3D Style Transfer
Jimin Xu, Bosheng Qin, Tao Jin, Zhou Zhao, Zhenhui Ye, Jun Yu, Fei Wu
Main category: cs.CV
TL;DR: A novel 3D style transfer pipeline that uses diffusion priors to generate stylized key views with cross-view attention and instance-level consistency, outperforming existing methods.
Details
Motivation: Existing 3D style transfer methods struggle with extracting high-level style semantics and maintaining structural clarity, often resulting in indistinguishable objects within stylized scenes.Method: Two-stage pipeline: 1) Use diffusion priors to generate stylized renderings of key viewpoints with cross-view attention for consistency, 2) Transfer stylized key views to 3D representation with instance-level style transfer.
Result: Significantly outperforms state-of-the-art methods across various scenes, producing more structured, visually coherent, and artistically enriched stylizations.
Conclusion: The proposed approach effectively integrates 2D diffusion knowledge into 3D style transfer, achieving superior results with better style fidelity and instance-level consistency.
Abstract: Recent advancements in neural representations, such as Neural Radiance Fields and 3D Gaussian Splatting, have increased interest in applying style transfer to 3D scenes. While existing methods can transfer style patterns onto 3D-consistent neural representations, they struggle to effectively extract and transfer high-level style semantics from the reference style image. Additionally, the stylized results often lack structural clarity and separation, making it difficult to distinguish between different instances or objects within the 3D scene. To address these limitations, we propose a novel 3D style transfer pipeline that effectively integrates prior knowledge from pretrained 2D diffusion models. Our pipeline consists of two key stages: First, we leverage diffusion priors to generate stylized renderings of key viewpoints. Then, we transfer the stylized key views onto the 3D representation. This process incorporates two innovative designs. The first is cross-view style alignment, which inserts cross-view attention into the last upsampling block of the UNet, allowing feature interactions across multiple key views. This ensures that the diffusion model generates stylized key views that maintain both style fidelity and instance-level consistency. The second is instance-level style transfer, which effectively leverages instance-level consistency across stylized key views and transfers it onto the 3D representation. This results in a more structured, visually coherent, and artistically enriched stylization. Extensive qualitative and quantitative experiments demonstrate that our 3D style transfer pipeline significantly outperforms state-of-the-art methods across a wide range of scenes, from forward-facing to challenging 360-degree environments. Visit our project page https://jm-xu.github.io/SSGaussian for immersive visualization.
[154] Learning neural representations for X-ray ptychography reconstruction with unknown probes
Tingyou Li, Zixin Xu, Zirui Gao, Hanfei Yan, Xiaojing Huang, Jizhou Li
Main category: cs.CV
TL;DR: PtyINR is a self-supervised neural framework that simultaneously reconstructs objects and unknown probes in X-ray ptychography, achieving superior quality and robustness in low-signal conditions without requiring probe pre-characterization.
Details
Motivation: X-ray ptychography faces challenges in accurate image reconstruction when the illuminating probe is unknown, especially under low-signal conditions of low-dose and high-speed experiments, limiting its broader adoption.Method: Ptychographic Implicit Neural Representation (PtyINR) parameterizes both object and probe as continuous neural representations, performing end-to-end reconstruction directly from raw diffraction patterns without probe pre-characterization.
Result: Extensive evaluations show PtyINR achieves superior reconstruction quality on both simulated and experimental data with remarkable robustness under challenging low-signal conditions.
Conclusion: PtyINR provides a generalizable, physics-informed framework for probe-dependent inverse problems, making it applicable to a wide range of computational microscopy problems beyond ptychography.
Abstract: X-ray ptychography provides exceptional nanoscale resolution and is widely applied in materials science, biology, and nanotechnology. However, its full potential is constrained by the critical challenge of accurately reconstructing images when the illuminating probe is unknown. Conventional iterative methods and deep learning approaches are often suboptimal, particularly under the low-signal conditions inherent to low-dose and high-speed experiments. These limitations compromise reconstruction fidelity and restrict the broader adoption of the technique. In this work, we introduce the Ptychographic Implicit Neural Representation (PtyINR), a self-supervised framework that simultaneously addresses the object and probe recovery problem. By parameterizing both as continuous neural representations, PtyINR performs end-to-end reconstruction directly from raw diffraction patterns without requiring any pre-characterization of the probe. Extensive evaluations demonstrate that PtyINR achieves superior reconstruction quality on both simulated and experimental data, with remarkable robustness under challenging low-signal conditions. Furthermore, PtyINR offers a generalizable, physics-informed framework for addressing probe-dependent inverse problems, making it applicable to a wide range of computational microscopy problems.
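The core parameterization can be sketched as a coordinate MLP mapping (x, y) to the amplitude and phase of a complex field, with one such network each for object and probe. The Fourier-feature encoding and layer sizes below are common INR defaults, not the paper's verified architecture.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class ComplexFieldINR(nn.Module):
    """Coordinate MLP returning a complex field value per (x, y) location."""
    def __init__(self, hidden=256, n_freqs=8):
        super().__init__()
        self.n_freqs = n_freqs
        self.mlp = nn.Sequential(
            nn.Linear(4 * n_freqs, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 2),  # [amplitude, phase]
        )

    def forward(self, coords):
        # coords: (N, 2) in [-1, 1]; Fourier-feature positional encoding.
        freqs = 2.0 ** torch.arange(self.n_freqs, dtype=torch.float32,
                                    device=coords.device) * math.pi
        enc = coords.unsqueeze(-1) * freqs                  # (N, 2, F)
        enc = torch.cat([enc.sin(), enc.cos()], dim=-1)     # (N, 2, 2F)
        out = self.mlp(enc.flatten(1))                      # (N, 2)
        amplitude = F.softplus(out[:, 0])
        phase = out[:, 1]
        return amplitude * torch.exp(1j * phase)            # complex (N,)
```

Both networks would then be optimized jointly against the raw diffraction patterns through a differentiable forward model.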
[155] Few-step Flow for 3D Generation via Marginal-Data Transport Distillation
Zanwei Zhou, Taoran Yi, Jiemin Fang, Chen Yang, Lingxi Xie, Xinggang Wang, Wei Shen, Qi Tian
Main category: cs.CV
TL;DR: MDT-dist framework accelerates 3D flow-based generation by distilling pretrained models to 1-2 steps instead of 25, achieving 6.5-9x speedup while maintaining quality.
Details
Motivation: Flow-based 3D generation models require dozens of sampling steps during inference, and few-step distillation methods remain under-explored for complex 3D tasks despite success in 2D diffusion models.Method: Proposes MDT-dist framework with two optimizable objectives: Velocity Matching (VM) to match velocity fields between student and teacher models, and Velocity Distillation (VD) for probability density distillation using learned velocity fields.
Result: Reduces sampling steps from 25 to 1-2, achieving 0.68s (1 step) and 0.94s (2 steps) latency with 9.0x and 6.5x speedup on A800 while preserving high visual and geometric fidelity.
Conclusion: Method significantly outperforms existing CM distillation approaches and enables superior performance in few-step 3D generation, demonstrating effective acceleration of complex 3D flow models.
Abstract: Flow-based 3D generation models typically require dozens of sampling steps during inference. Though few-step distillation methods, particularly Consistency Models (CMs), have achieved substantial advancements in accelerating 2D diffusion models, they remain under-explored for more complex 3D generation tasks. In this study, we propose a novel framework, MDT-dist, for few-step 3D flow distillation. Our approach is built upon a primary objective: distilling the pretrained model to learn the Marginal-Data Transport. Directly learning this objective requires integrating the velocity fields, but this integral is intractable to implement. Therefore, we propose two optimizable objectives, Velocity Matching (VM) and Velocity Distillation (VD), to equivalently convert the optimization target from the transport level to the velocity and the distribution level respectively. Velocity Matching (VM) learns to stably match the velocity fields between the student and the teacher, but inevitably provides biased gradient estimates. Velocity Distillation (VD) further enhances the optimization process by leveraging the learned velocity fields to perform probability density distillation. When evaluated on the pioneer 3D generation framework TRELLIS, our method reduces sampling steps of each flow transformer from 25 to 1 or 2, achieving 0.68s (1 step x 2) and 0.94s (2 steps x 2) latency with 9.0x and 6.5x speedup on A800, while preserving high visual and geometric fidelity. Extensive experiments demonstrate that our method significantly outperforms existing CM distillation methods, and enables TRELLIS to achieve superior performance in few-step 3D generation.
[156] Durian: Dual Reference-guided Portrait Animation with Attribute Transfer
Hyunsoo Cha, Byungjun Kim, Hanbyul Joo
Main category: cs.CV
TL;DR: Durian is a zero-shot portrait animation method that transfers facial attributes from reference images to target portraits using dual reference networks and diffusion models.
Details
Motivation: To enable high-fidelity, spatially consistent facial attribute transfer in portrait animation videos without explicit triplet supervision.Method: Uses dual reference networks to inject spatial features from both portrait and attribute images into diffusion denoising. Trained with self-reconstruction formulation, mask expansion strategy, and spatial/appearance transformations for robustness.
Result: Achieves state-of-the-art performance on portrait animation with attribute transfer, enabling multi-attribute composition in a single generation pass.
Conclusion: Durian effectively generalizes across diverse attributes and reference combinations, demonstrating strong zero-shot transfer capabilities without additional training.
Abstract: We present Durian, the first method for generating portrait animation videos with facial attribute transfer from a given reference image to a target portrait in a zero-shot manner. To enable high-fidelity and spatially consistent attribute transfer across frames, we introduce dual reference networks that inject spatial features from both the portrait and attribute images into the denoising process of a diffusion model. We train the model using a self-reconstruction formulation, where two frames are sampled from the same portrait video: one is treated as the attribute reference and the other as the target portrait, and the remaining frames are reconstructed conditioned on these inputs and their corresponding masks. To support the transfer of attributes with varying spatial extent, we propose a mask expansion strategy using keypoint-conditioned image generation for training. In addition, we further augment the attribute and portrait images with spatial and appearance-level transformations to improve robustness to positional misalignment between them. These strategies allow the model to effectively generalize across diverse attributes and in-the-wild reference combinations, despite being trained without explicit triplet supervision. Durian achieves state-of-the-art performance on portrait animation with attribute transfer, and notably, its dual reference design enables multi-attribute composition in a single generation pass without additional training.
[157] From Lines to Shapes: Geometric-Constrained Segmentation of X-Ray Collimators via Hough Transform
Benjamin El-Zein, Dominik Eckert, Andreas Fieselmann, Christopher Syben, Ludwig Ritschl, Steffen Kappler, Sebastian Stober
Main category: cs.CV
TL;DR: Deep learning-based segmentation method using differentiable Hough transform to detect collimator shadows in X-ray images, achieving accurate ROI detection with median Hausdorff distances of 4.3-5.0mm.
Details
Motivation: Collimation reduces radiation exposure by restricting X-ray imaging to specific regions, but detecting collimator shadows is challenging when edges are obscured by scattered radiation. The geometric prior knowledge that collimation forms polygonal shadows can be leveraged.
Method: A deep learning segmentation approach incorporating differentiable Hough transform-based network to detect collimation borders and extract ROI center information. Combines both tasks during inference to generate refined, line-constrained segmentation masks.
Result: Robust reconstruction of collimated regions with median Hausdorff distances of 4.3-5.0mm on diverse test sets of real X-ray images. Method handles up to four shadow borders but is not fundamentally limited by edge count.
Conclusion: The proposed geometry-constrained deep learning approach effectively detects collimator shadows in X-ray imaging, enabling accurate ROI identification while minimizing radiation exposure to patients.
Abstract: Collimation in X-ray imaging restricts exposure to the region-of-interest (ROI) and minimizes the radiation dose applied to the patient. The detection of collimator shadows is an essential image-based preprocessing step in digital radiography, posing a challenge when edges are obscured by scattered X-ray radiation. Nevertheless, the prior knowledge that collimation forms polygonal-shaped shadows can be exploited. For this reason, we introduce a deep learning-based segmentation that is inherently constrained to this geometry. We achieve this by incorporating a differentiable Hough transform-based network to detect the collimation borders and enhance its capability to extract the information about the ROI center. During inference, we combine the information of both tasks to enable the generation of refined, line-constrained segmentation masks. We demonstrate robust reconstruction of collimated regions achieving median Hausdorff distances of 4.3-5.0mm on diverse test sets of real X-ray images. While this application involves at most four shadow borders, our method is not fundamentally limited by a specific number of edges.
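The paper's network builds on a differentiable Hough transform; as background, the classical (non-differentiable) accumulator below sketches the underlying $(\rho, \theta)$ line parameterization that makes polygonal collimation borders detectable. Bin counts and ranges are arbitrary illustrative choices.

```python
import numpy as np

def hough_lines(edge_mask, n_theta=180, n_rho=400):
    """Classical Hough accumulator over (rho, theta) line parameters.

    edge_mask: binary 2D array of detected edge pixels. The paper uses a
    differentiable variant inside a network; this is only the classical form.
    """
    h, w = edge_mask.shape
    thetas = np.linspace(0.0, np.pi, n_theta, endpoint=False)
    diag = np.hypot(h, w)
    rhos = np.linspace(-diag, diag, n_rho)
    acc = np.zeros((n_rho, n_theta), dtype=np.int64)
    ys, xs = np.nonzero(edge_mask)
    for theta_idx, theta in enumerate(thetas):
        rho_vals = xs * np.cos(theta) + ys * np.sin(theta)
        rho_idx = np.digitize(rho_vals, rhos) - 1   # bin each rho value
        np.add.at(acc, (rho_idx, np.full_like(rho_idx, theta_idx)), 1)
    return acc, rhos, thetas                        # peaks in acc = lines
```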
[158] One Flight Over the Gap: A Survey from Perspective to Panoramic Vision
Xin Lin, Xian Ge, Dizhe Zhang, Zhaoliang Wan, Xianshun Wang, Xiangtai Li, Wenjie Jiang, Bo Du, Dacheng Tao, Ming-Hsuan Yang, Lu Qi
Main category: cs.CV
TL;DR: Survey paper on panoramic vision techniques focusing on perspective-to-panorama adaptation challenges and solutions across 20+ tasks from 300+ papers.
Details
Motivation: Growing demand for spatial intelligence and holistic scene perception in applications like VR, autonomous driving, and robotics, with unique challenges in adapting perspective methods to omnidirectional images.
Method: Systematic review of panoramic imaging pipeline, projection methods, and analysis of three key domain adaptation challenges: geometric distortions, non-uniform sampling, and boundary continuity. Cross-method and cross-task analysis across four major categories.
Result: Comprehensive survey covering 20+ representative tasks from 300+ research papers, providing strategies for addressing panoramic-specific challenges and classifying panoramic vision into four major categories.
Conclusion: Identifies open challenges and future directions in data, models, and applications to advance panoramic vision research, offering new insights and forward-looking perspectives for technology development.
Abstract: Driven by the demand for spatial intelligence and holistic scene perception, omnidirectional images (ODIs), which provide a complete 360° field of view, are receiving growing attention across diverse applications such as virtual reality, autonomous driving, and embodied robotics. Despite their unique characteristics, ODIs exhibit remarkable differences from perspective images in geometric projection, spatial distribution, and boundary continuity, making direct domain adaptation from perspective methods challenging. This survey reviews recent panoramic vision techniques with a particular emphasis on the perspective-to-panorama adaptation. We first revisit the panoramic imaging pipeline and projection methods to build the prior knowledge required for analyzing the structural disparities. Then, we summarize three challenges of domain adaptation: severe geometric distortions near the poles, non-uniform sampling in Equirectangular Projection (ERP), and periodic boundary continuity. Building on this, we cover 20+ representative tasks drawn from more than 300 research papers in two dimensions. On one hand, we present a cross-method analysis of representative strategies for addressing panoramic-specific challenges across different tasks. On the other hand, we conduct a cross-task comparison and classify panoramic vision into four major categories: visual quality enhancement and assessment, visual understanding, multimodal understanding, and visual generation. In addition, we discuss open challenges and future directions in data, models, and applications that will drive the advancement of panoramic vision research. We hope that our work can provide new insights and forward-looking perspectives to advance the development of panoramic vision technologies. Our project page is https://insta360-research-team.github.io/Survey-of-Panorama
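For reference, the Equirectangular Projection (ERP) discussed above maps longitude $\theta \in [-\pi, \pi)$ and latitude $\phi \in [-\pi/2, \pi/2]$ linearly to pixel coordinates: $u = W\left(\frac{\theta}{2\pi} + \frac{1}{2}\right)$ and $v = H\left(\frac{1}{2} - \frac{\phi}{\pi}\right)$. Because a pixel row at latitude $\phi$ covers a sphere circle whose circumference is proportional to $\cos\phi$, ERP oversamples by a factor of roughly $1/\cos\phi$ toward the poles, which is the source of the geometric distortion and non-uniform sampling challenges the survey analyzes.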
[159] Plot’n Polish: Zero-shot Story Visualization and Disentangled Editing with Text-to-Image Diffusion Models
Kiymet Akdemir, Jing Shi, Kushal Kafle, Brian Price, Pinar Yanardag
Main category: cs.CV
TL;DR: Plot’n Polish is a zero-shot framework that enables consistent story visualization with fine-grained control over image generation and editing at multiple detail levels.
Details
Motivation: Existing text-to-image diffusion models lack flexibility for applying fine or coarse edits while maintaining visual and narrative consistency across multiple frames in story visualization.
Method: A zero-shot framework that provides enhanced control, refinement, and post-generation modification capabilities for consistent story visualization.
Result: The framework enables creators to seamlessly craft and refine visual stories with consistent narrative flow across multiple generated images.
Conclusion: Plot’n Polish addresses the important challenge of maintaining visual and narrative consistency in story visualization while providing flexible editing capabilities.
Abstract: Text-to-image diffusion models have demonstrated significant capabilities to generate diverse and detailed visuals in various domains, and story visualization is emerging as a particularly promising application. However, as their use in real-world creative domains increases, the need for providing enhanced control, refinement, and the ability to modify images post-generation in a consistent manner becomes an important challenge. Existing methods often lack the flexibility to apply fine or coarse edits while maintaining visual and narrative consistency across multiple frames, preventing creators from seamlessly crafting and refining their visual stories. To address these challenges, we introduce Plot’n Polish, a zero-shot framework that enables consistent story generation and provides fine-grained control over story visualizations at various levels of detail.
[160] Virtual Fitting Room: Generating Arbitrarily Long Videos of Virtual Try-On from a Single Image – Technical Preview
Jun-Kun Chen, Aayush Bansal, Minh Phuoc Vo, Yu-Xiong Wang
Main category: cs.CV
TL;DR: VFR is a novel video generative model for virtual try-on that generates arbitrarily long videos through segment-by-segment autoregressive generation, using prefix video conditions and anchor videos to ensure smooth transitions and temporal consistency.
Details
Motivation: To create a practical solution for long virtual try-on video generation that doesn't require resource-intensive processing or extensive video data, while maintaining high quality and flexibility for arbitrary video lengths.
Method: Auto-regressive segment-by-segment generation process using prefix video conditions for local smoothness between adjacent segments and anchor videos (360-degree whole-body capture) for global temporal consistency across different segments.
Result: The VFR framework successfully generates minute-scale virtual try-on videos with both local smoothness and global temporal consistency under various motions, pioneering long virtual try-on video generation.
Conclusion: VFR represents a significant advancement in video generation technology, providing an efficient and flexible solution for creating realistic, arbitrarily long virtual try-on videos with maintained quality and consistency.
Abstract: We introduce the Virtual Fitting Room (VFR), a novel video generative model that produces arbitrarily long virtual try-on videos. Our VFR models long video generation tasks as an auto-regressive, segment-by-segment generation process, eliminating the need for resource-intensive generation and lengthy video data, while providing the flexibility to generate videos of arbitrary length. The key challenges of this task are twofold: ensuring local smoothness between adjacent segments and maintaining global temporal consistency across different segments. To address these challenges, we propose our VFR framework, which ensures smoothness through a prefix video condition and enforces consistency with the anchor video – a 360-degree video that comprehensively captures the human's whole-body appearance. Our VFR generates minute-scale virtual try-on videos with both local smoothness and global temporal consistency under various motions, making it a pioneering work in long virtual try-on video generation.
[161] Accurate and lightweight dehazing via multi-receptive-field non-local network and novel contrastive regularization
Zewei He, Zixuan Chen, Jinlei Li, Ziqian Lu, Xuecheng Sun, Hao Luo, Zhe-Ming Lu, Evangelos K. Markakis
Main category: cs.CV
TL;DR: A lightweight multi-receptive-field non-local network (MRFNLN) with under 1.5M parameters that outperforms state-of-the-art methods for image dehazing through multi-scale feature extraction, attention mechanisms, and novel contrastive regularization.
Details
Motivation: To enhance image dehazing performance by extracting richer features, capturing long-range dependencies, and focusing on low-level details while maintaining computational efficiency.
Method: Proposes MRFNLN with: 1) Multi-stream feature attention block (MSFAB) for multi-scale feature extraction using parallel convolutions (1x1, 3x3, 5x5) and attention mechanisms; 2) Cross non-local block (CNLB) with spatial pyramid down-sampling to capture long-range dependencies efficiently; 3) Detail-focused contrastive regularization (DFCR) that emphasizes low-level details in a specialized representation space.
Result: The proposed MRFNLN model outperforms recent state-of-the-art dehazing methods while maintaining computational efficiency with less than 1.5 million parameters.
Conclusion: The combination of multi-scale feature extraction, efficient non-local operations, and detail-focused regularization provides an effective and lightweight solution for high-performance image dehazing.
Abstract: Recently, deep learning-based methods have dominated the image dehazing domain. A multi-receptive-field non-local network (MRFNLN) consisting of the multi-stream feature attention block (MSFAB) and the cross non-local block (CNLB) is presented in this paper to further enhance performance. We start with extracting richer features for dehazing. Specifically, a multi-stream feature extraction (MSFE) sub-block, which contains three parallel convolutions with different receptive fields (i.e., $1\times 1$, $3\times 3$, $5\times 5$), is designed for extracting multi-scale features. Following MSFE, an attention sub-block is employed to make the model adaptively focus on important channels/regions. These two sub-blocks constitute our MSFAB. Then, we design a cross non-local block (CNLB), which can capture long-range dependencies beyond the query. Unlike the query branch, which takes a single input source, the key and value branches are enhanced by fusing more preceding features. CNLB is computation-friendly, leveraging a spatial pyramid down-sampling (SPDS) strategy to reduce computation and memory consumption without sacrificing performance. Last but not least, a novel detail-focused contrastive regularization (DFCR) is presented by emphasizing the low-level details and ignoring the high-level semantic information in a representation space specially designed for dehazing. Comprehensive experimental results demonstrate that the proposed MRFNLN model outperforms recent state-of-the-art dehazing methods with less than 1.5 million parameters.
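As a rough illustration of the multi-stream idea, a minimal PyTorch sketch of an MSFE-style sub-block with parallel $1\times 1$, $3\times 3$, and $5\times 5$ convolutions might look as follows; channel counts and the concat-plus-$1\times 1$ fusion are assumptions, and the paper additionally pairs this with an attention sub-block to form MSFAB.

```python
import torch
import torch.nn as nn

class MSFE(nn.Module):
    """Multi-stream feature extraction: parallel 1x1/3x3/5x5 convolutions.

    Channel counts and the fusion choice (concat + 1x1) are assumptions,
    not the paper's exact configuration.
    """
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 1)
        self.conv3 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv5 = nn.Conv2d(channels, channels, 5, padding=2)
        self.fuse = nn.Conv2d(3 * channels, channels, 1)

    def forward(self, x):
        # Three receptive fields applied to the same input, then fused.
        multi_scale = torch.cat(
            [self.conv1(x), self.conv3(x), self.conv5(x)], dim=1
        )
        return self.fuse(multi_scale)
```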
[162] Straighter Flow Matching via a Diffusion-Based Coupling Prior
Siyu Xing, Jie Cao, Huaibo Huang, Haichao Shi, Xiao-Yu Zhang
Main category: cs.CV
TL;DR: StraightFM is a novel flow matching approach that straightens trajectories for few-step generation using a coupling strategy at the distribution level, achieving high-quality samples in just 5 steps.
Details
Motivation: Existing flow matching methods use either multi-round training or minibatch knowledge, making it challenging to find optimal coupling strategies for straightening trajectories to enable efficient few-step generation.
Method: Uses a diffusion model as a coupling prior to create couplings between images and noise at the entire distribution level, integrating with existing coupling directions from real data to noise.
Result: Experimental results show StraightFM produces attractive samples within 5 steps in both pixel and latent spaces, and is compatible with training-free multimodal conditional generation while maintaining quality.
Conclusion: StraightFM effectively straightens trajectories for efficient few-step generation through distribution-level coupling strategies, demonstrating strong performance across various generation tasks.
Abstract: Flow matching, as a generative modeling paradigm, has achieved notable success across various domains. However, existing methods use either multi-round training or knowledge within minibatches, posing challenges in finding a favorable coupling strategy for straightening trajectories toward few-step generation. To address this issue, we propose a novel approach, Straighter trajectories of Flow Matching (StraightFM). It straightens trajectories with a coupling strategy at the entire distribution level. More specifically, during training, StraightFM creates couplings of images and noise via one diffusion model as a coupling prior to straighten trajectories for few-step generation. Our coupling strategy can also integrate with the existing coupling direction from real data to noise, improving image quality in few-step generation. Experimental results on pixel space and latent space show that StraightFM yields attractive samples within 5 steps. Moreover, our unconditional StraightFM is seamlessly compatible with training-free multimodal conditional generation, maintaining high-quality image generation in few steps.
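A hedged sketch of the coupling idea: instead of pairing noise and images independently within a minibatch, a pretrained diffusion sampler supplies coupled endpoints, and a standard flow-matching loss is applied along the straight path between them. The function names and the linear interpolant are illustrative, not StraightFM's exact procedure.

```python
import torch
import torch.nn.functional as F

def coupled_flow_matching_loss(flow_model, diffusion_sampler, noise):
    """One flow-matching training step with a diffusion-based coupling prior.

    `diffusion_sampler` maps noise to images, yielding coupled (x0, x1)
    pairs rather than independently drawn minibatch pairs.
    """
    with torch.no_grad():
        x1 = diffusion_sampler(noise)            # coupled data endpoint
    x0 = noise
    t = torch.rand(x0.shape[0], device=x0.device).view(-1, 1, 1, 1)
    x_t = (1 - t) * x0 + t * x1                  # linear interpolant
    v_target = x1 - x0                           # straight-path velocity
    v_pred = flow_model(x_t, t)
    return F.mse_loss(v_pred, v_target)
```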
[163] Style Transfer to Calvin and Hobbes comics using Stable Diffusion
Asvin Kumar Venkataramanan, Sloke Shrestha, Sundar Sripada Venugopalaswamy Sriraman
Main category: cs.CV
TL;DR: Fine-tuned stable diffusion with LoRA on Calvin and Hobbes comics dataset to perform style transfer, achieving visually appealing results with efficient training.
Details
Motivation: To convert any input image into the distinctive comic style of Calvin and Hobbes through style transfer using stable diffusion.
Method: Used stable-diffusion-v1.5 with Low Rank Adaptation (LoRA) for efficient fine-tuning, employing a Variational Autoencoder (VAE) U-net architecture for the diffusion process.
Result: Produced visually appealing style transfer results despite limited training time and input data quality constraints.
Conclusion: Successfully demonstrated that stable diffusion fine-tuning with LoRA can effectively adapt to specific artistic styles like Calvin and Hobbes comics for style transfer applications.
Abstract: This project report summarizes our journey to perform stable diffusion fine-tuning on a dataset containing Calvin and Hobbes comics. The purpose is to convert any given input image into the comic style of Calvin and Hobbes, essentially performing style transfer. We train stable-diffusion-v1.5 using Low Rank Adaptation (LoRA) to efficiently speed up the fine-tuning process. The diffusion itself is handled by a Variational Autoencoder (VAE), which is a U-net. Our results were visually appealing for the amount of training time and the quality of input data that went into training.
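For readers unfamiliar with LoRA, the adapter cuts fine-tuning cost by freezing the base weights and learning only a low-rank update; a minimal sketch (the rank, scaling, and placement are illustrative choices, not this project's settings) could be:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base linear layer plus a trainable low-rank update (LoRA).

    In LoRA fine-tuning of stable-diffusion-v1.5, adapters like this are
    typically attached to attention projections; dimensions here are
    illustrative.
    """
    def __init__(self, base: nn.Linear, rank=4, alpha=4.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False              # only the adapter is trained
        self.lora_a = nn.Linear(base.in_features, rank, bias=False)
        self.lora_b = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)       # start as an identity update
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))
```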
[164] Replication Study and Benchmarking of Real-Time Object Detection Models
Pierre-Luc Asselin, Vincent Coulombe, William Guimont-Martin, William Larrivée-Hardy
Main category: cs.CV
TL;DR: This paper benchmarks real-time object detection models’ accuracy and speed, reproduces DETR, RTMDet, ViTDet and YOLOv7 from scratch, and finds reproducibility issues with some models while proposing a unified evaluation pipeline.
Details
Motivation: Object detection models are used in real-world applications like robotics where inference time is critical, so simply measuring accuracy is insufficient for proper model comparison.
Method: Compared various object detection models' accuracy and inference speed on multiple GPUs, reproduced four models from scratch using PyTorch on MS COCO 2017, and proposed a unified training/evaluation pipeline based on MMDetection.
Result: Reproduced DETR and ViTDet couldn’t match original papers’ performance, but RTMDet and YOLOv7 could. Found reproducibility issues in studied papers and reduced speed performance with limited resources. Anchor-free models like RTMDet and YOLOx showed best accuracy-speed trade-off.
Conclusion: The study highlights reproducibility challenges in object detection research and demonstrates the importance of standardized benchmarking for fair model comparison, with anchor-free models showing superior performance in accuracy-speed trade-offs.
Abstract: This work examines the reproducibility and benchmarking of state-of-the-art real-time object detection models. As object detection models are often used in real-world contexts, such as robotics, where inference time is paramount, simply measuring models' accuracy is not enough to compare them. We thus compare a large variety of object detection models' accuracy and inference speed on multiple graphics cards. In addition to this large benchmarking attempt, we also reproduce the following models from scratch using PyTorch on the MS COCO 2017 dataset: DETR, RTMDet, ViTDet and YOLOv7. More importantly, we propose a unified training and evaluation pipeline, based on MMDetection's features, to better compare models. Our implementation of DETR and ViTDet could not achieve accuracy or speed performances comparable to what is declared in the original papers. On the other hand, reproduced RTMDet and YOLOv7 could match such performances. The studied papers are also found to be generally lacking for reproducibility purposes. As for MMDetection pretrained models, speed performances are severely reduced with limited computing resources (larger, more accurate models even more so). Moreover, results exhibit a strong trade-off between accuracy and speed, dominated by anchor-free models - notably RTMDet or YOLOx models. The code used in this paper and all the experiments is available in the repository at https://github.com/willGuimont/segdet_mlcr2024.
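Measuring inference speed fairly on a GPU requires explicit synchronization, since CUDA kernels launch asynchronously; a minimal latency-measurement sketch in PyTorch (input shape and iteration counts are arbitrary choices, not the paper's protocol) is shown below.

```python
import time
import torch

@torch.no_grad()
def measure_latency(model, input_shape=(1, 3, 640, 640), warmup=20, iters=100):
    """Wall-clock GPU inference latency with proper synchronization.

    Synchronizing before reading the clock is essential: CUDA kernels run
    asynchronously, so un-synchronized timing underestimates latency.
    """
    device = torch.device("cuda")
    model = model.to(device).eval()
    x = torch.randn(*input_shape, device=device)
    for _ in range(warmup):                      # warm up kernels and caches
        model(x)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        model(x)
    torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters  # seconds per inference
```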
[165] BOSC: A Backdoor-based Framework for Open Set Synthetic Image Attribution
Jun Wang, Benedetta Tondi, Mauro Barni
Main category: cs.CV
TL;DR: BOSC framework for open-set synthetic image attribution using backdoor attacks to enable classifier rejection of unknown architectures
Details
Motivation: Existing methods only work in closed-set scenarios and cannot handle images from unknown generative architectures, which is problematic as new AI models continuously emerge.
Method: Inject class-specific triggers into training images to create backdoors, then use the model's behavior on triggered samples to develop a rejection score for unknown architectures at test time.
Result: Good performance that surpasses state-of-the-art methods, with strong robustness against image processing operations
Conclusion: BOSC is an effective general framework for open-set attribution that can be applied to various image forensic tasks beyond synthetic image attribution
Abstract: Synthetic image attribution addresses the problem of tracing back the origin of images produced by generative models. Extensive efforts have been made to explore unique representations of generative models and use them to attribute a synthetic image to the model that produced it. Most of the methods classify the models or the architectures among those in a closed set, without considering the possibility that the system is fed with samples produced by unknown architectures. With the continuous progress of AI technology, new generative architectures continuously appear, thus driving the attention of researchers towards the development of tools capable of working in open-set scenarios. In this paper, we propose a framework for open set attribution of synthetic images, named BOSC (Backdoor-based Open Set Classification), that relies on the concept of backdoor attacks to design a classifier with rejection option. BOSC works by purposely injecting class-specific triggers inside a portion of the images in the training set to induce the network to establish a matching between class features and trigger features. The behavior of the trained model with respect to triggered samples is then exploited at test time to perform sample rejection using an ad-hoc score. Experiments show that the proposed method performs well, consistently surpassing the state of the art, and is highly robust against image processing operations. Although we designed our method for the task of synthetic image attribution, the proposed framework is a general one and can be used for other image forensic applications.
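To illustrate the backdoor mechanism described in the abstract, the sketch below stamps a class-specific random patch onto a training image; the patch pattern, placement, and blending strength are assumptions for illustration, not BOSC's actual trigger design.

```python
import numpy as np

def inject_trigger(image, class_idx, patch_size=8, strength=1.0, seed=0):
    """Stamp a class-specific trigger patch onto a training image (sketch).

    image: float HWC array in [0, 1]. BOSC injects class-specific triggers
    into a portion of the training set; the fixed per-class random pattern
    and bottom-right placement here are illustrative assumptions.
    """
    rng = np.random.default_rng(seed + class_idx)  # one pattern per class
    trigger = rng.uniform(0.0, 1.0, size=(patch_size, patch_size, image.shape[2]))
    out = image.astype(np.float32).copy()
    out[-patch_size:, -patch_size:] = (
        (1 - strength) * out[-patch_size:, -patch_size:] + strength * trigger
    )
    return out.clip(0.0, 1.0)
```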
[166] SPARE: Symmetrized Point-to-Plane Distance for Robust Non-Rigid Registration
Yuxin Yao, Bailin Deng, Junhui Hou, Juyong Zhang
Main category: cs.CV
TL;DR: SPARE is a novel non-rigid registration method that uses symmetrized point-to-plane distance with normals for more accurate geometry approximation, combined with ARAP regularization and deformation graph initialization for efficient optimization.
Details
Motivation: Existing optimization-based non-rigid registration methods using point-to-point or point-to-plane distances suffer from slow convergence and loss of detail, requiring a more robust approach.
Method: Proposes symmetrized point-to-plane distance utilizing both positions and normals, ARAP regularization for deformed normal estimation, alternating minimization with majorization-minimization strategy, and deformation graph-based coarse alignment for initialization.
Result: Extensive experiments show greatly improved registration accuracy while maintaining high solution efficiency compared to existing methods.
Conclusion: SPARE achieves higher accuracy in non-rigid registration through geometric-aware distance metrics and efficient optimization strategies, with publicly available code for reproducibility.
Abstract: Existing optimization-based methods for non-rigid registration typically minimize an alignment error metric based on the point-to-point or point-to-plane distance between corresponding point pairs on the source surface and target surface. However, these metrics can result in slow convergence or a loss of detail. In this paper, we propose SPARE, a novel formulation that utilizes a symmetrized point-to-plane distance for robust non-rigid registration. The symmetrized point-to-plane distance relies on both the positions and normals of the corresponding points, resulting in a more accurate approximation of the underlying geometry, and can achieve higher accuracy than existing methods. To solve this optimization problem efficiently, we introduce an as-rigid-as-possible regularization term to estimate the deformed normals and propose an alternating minimization solver using a majorization-minimization strategy. Moreover, for effective initialization of the solver, we incorporate a deformation graph-based coarse alignment that improves registration quality and efficiency. Extensive experiments show that the proposed method greatly improves the accuracy of non-rigid registration problems and maintains relatively high solution efficiency. The code is publicly available at https://github.com/yaoyx689/spare.
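A representative symmetrized point-to-plane energy consistent with the description above (the paper's exact formulation may differ) penalizes the residual along both the target normal and the estimated deformed source normal: $E = \sum_i \big[((\hat{p}_i - q_i) \cdot n_{q_i})^2 + ((\hat{p}_i - q_i) \cdot \hat{n}_{p_i})^2\big]$, where $\hat{p}_i$ is a deformed source point, $q_i$ its target correspondence, $n_{q_i}$ the target normal, and $\hat{n}_{p_i}$ the deformed source normal that the ARAP term helps estimate.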
[167] FADE: A Dataset for Detecting Falling Objects around Buildings in Video
Zhigang Tu, Zitao Gao, Zhengbo Zhang, Chunluan Zhou, Junsong Yuan, Bo Du
Main category: cs.CV
TL;DR: Proposed FADE dataset and FADE-Net method for detecting falling objects around buildings in surveillance videos, outperforming previous approaches.
Details
Motivation: Falling objects from buildings pose serious safety risks but are hard to detect manually due to small size, fast motion, and complex backgrounds in surveillance footage.
Method: Created FADE dataset with 1,881 videos from 18 scenes, 8 object categories, 4 weather conditions, and 4 resolutions. Developed FADE-Net that leverages motion information and generates high-quality small proposals.
Result: FADE-Net significantly outperforms previous generic object detection, video object detection, and moving object detection methods on the FADE dataset.
Conclusion: Provides an effective baseline for falling object detection research with publicly available dataset and code, addressing an important safety concern in urban environments.
Abstract: Falling objects from buildings can cause severe injuries to pedestrians due to the great impact force they exert. Although surveillance cameras are installed around some buildings, it is challenging for humans to capture such events in surveillance videos due to the small size and fast motion of falling objects, as well as the complex background. Therefore, it is necessary to develop methods to automatically detect falling objects around buildings in surveillance videos. To facilitate the investigation of falling object detection, we propose a large, diverse video dataset called FADE (FAlling Object DEtection around Buildings) for the first time. FADE contains 1,881 videos from 18 scenes, featuring 8 falling object categories, 4 weather conditions, and 4 video resolutions. Additionally, we develop a new object detection method called FADE-Net, which effectively leverages motion information and produces small-sized but high-quality proposals for detecting falling objects around buildings. Importantly, our method is extensively evaluated and analyzed by comparing it with the previous approaches used for generic object detection, video object detection, and moving object detection on the FADE dataset. Experimental results show that the proposed FADE-Net significantly outperforms other methods, providing an effective baseline for future research. The dataset and code are publicly available at https://fadedataset.github.io/FADE.github.io/.
[168] Enhanced Generative Data Augmentation for Semantic Segmentation via Stronger Guidance
Quang-Huy Che, Duc-Tri Le, Bich-Nga Pham, Duc-Khai Lam, Vinh-Tiep Nguyen
Main category: cs.CV
TL;DR: A novel data augmentation pipeline using Controllable Diffusion models for semantic segmentation that generates high-quality synthetic images while preserving class structure and ensuring dataset balance.
Details
Motivation: Traditional data augmentation methods lack semantic diversity and fail to alter high-level properties. Generative models offer better augmentation but struggle with maintaining original image content and structure accuracy.
Method: Uses Controllable Diffusion model with Class-Prompt Appending and Visual Prior Blending to enhance attention to labeled classes. Includes class balancing algorithm for merging synthetic and original images.
Result: Demonstrated effectiveness on PASCAL VOC datasets, generating high-quality synthetic images that preserve segmentation-labeled class structure.
Conclusion: The proposed pipeline effectively addresses limitations of traditional and generative augmentation methods, providing precise control over synthetic image generation while maintaining dataset balance and structural integrity.
Abstract: Data augmentation is crucial for pixel-wise annotation tasks like semantic segmentation, where labeling requires significant effort and intensive labor. Traditional methods, involving simple transformations such as rotations and flips, create new images but often lack diversity along key semantic dimensions and fail to alter high-level semantic properties. To address this issue, generative models have emerged as an effective solution for augmenting data by generating synthetic images. Controllable generative models offer data augmentation methods for semantic segmentation tasks by using prompts and visual references from the original image. However, these models face challenges in generating synthetic images that accurately reflect the content and structure of the original image due to difficulties in creating effective prompts and visual references. In this work, we introduce an effective data augmentation pipeline for semantic segmentation using a Controllable Diffusion model. Our proposed method includes efficient prompt generation using Class-Prompt Appending and Visual Prior Blending to enhance attention to labeled classes in real images, allowing the pipeline to generate a precise number of augmented images while preserving the structure of segmentation-labeled classes. In addition, we implement a class balancing algorithm to ensure a balanced training dataset when merging the synthetic and original images. Evaluated on PASCAL VOC datasets, our pipeline demonstrates its effectiveness in generating high-quality synthetic images for semantic segmentation. Our code is available at https://github.com/chequanghuy/Enhanced-Generative-Data-Augmentation-for-Semantic-Segmentation-via-Stronger-Guidance.
[169] Hardware-Friendly Diffusion Models with Fixed-Size Reusable Structures for On-Device Image Generation
Sanchar Palit, Sathya Veera Reddy Dendi, Mallikarjuna Talluri, Raj Narayana Gadde
Main category: cs.CV
TL;DR: Proposed a fixed-size transformer-based architecture for diffusion models that eliminates positional embeddings and variable-sized blocks, making it more hardware-friendly for mobile deployment while achieving competitive performance.
Details
Motivation: Vision Transformers require positional embeddings and U-Net architectures use variable-sized blocks, both presenting challenges for on-device implementation of diffusion models on resource-constrained hardware.
Method: Developed an architecture using fixed-size, reusable transformer blocks as core structure without positional embeddings, featuring token-free design, uniformity, and scalability for hardware optimization.
Result: Achieved state-of-the-art FID score of 1.6 on unconditional image generation with CelebA, with competitive and consistent performance across both unconditional and conditional image generation tasks.
Conclusion: The proposed architecture offers low complexity, hardware-friendly design suitable for mobile and resource-constrained devices while maintaining strong performance in diffusion-based image generation.
Abstract: Vision Transformers and U-Net architectures have been widely adopted in the implementation of Diffusion Models. However, each architecture presents specific challenges when realized on-device. Vision Transformers require positional embeddings to maintain correspondence between the tokens processed by the transformer, although they offer the advantage of using fixed-size, reusable repetitive blocks following tokenization. The U-Net architecture lacks these attributes, as it utilizes variable-sized intermediate blocks for down-convolution and up-convolution in the noise estimation backbone for the diffusion process. To address these issues, we propose an architecture that utilizes a fixed-size, reusable transformer block as a core structure, making it more suitable for hardware implementation. Our architecture is characterized by low complexity, token-free design, absence of positional embeddings, uniformity, and scalability, making it highly suitable for deployment on mobile and resource-constrained devices. The proposed model exhibits competitive and consistent performance across both unconditional and conditional image generation tasks. The model achieved a state-of-the-art FID score of 1.6 on unconditional image generation with the CelebA dataset.
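A minimal sketch of the weight-reuse pattern described above: one fixed-size transformer block applied repeatedly, with no positional embeddings. The dimensions, depth, and timestep conditioning are illustrative assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class ReusedBlockDenoiser(nn.Module):
    """Denoiser built from one fixed-size transformer block applied repeatedly.

    Reusing a single block keeps the weight footprint constant regardless of
    depth — the hardware-friendly property the paper targets.
    """
    def __init__(self, dim=256, heads=4, depth=8):
        super().__init__()
        self.block = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, batch_first=True, norm_first=True
        )
        self.depth = depth
        self.t_embed = nn.Linear(1, dim)          # simple timestep conditioning

    def forward(self, x, t):
        # x: (batch, seq, dim) latent features; t: (batch, 1) timestep
        h = x + self.t_embed(t).unsqueeze(1)      # no positional embeddings
        for _ in range(self.depth):               # same weights on every pass
            h = self.block(h)
        return h
```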
[170] MUNBa: Machine Unlearning via Nash Bargaining
Jing Wu, Mehrtash Harandi
Main category: cs.CV
TL;DR: Machine Unlearning reformulated as a two-player cooperative game using Nash bargaining theory to resolve gradient conflicts between forgetting and preservation objectives, achieving better trade-offs than state-of-the-art methods.
Details
Motivation: Naive integration of forgetting and preserving objectives in Machine Unlearning leads to gradient conflicts and dominance, preventing optimal solutions.
Method: Reformulate MU as a two-player cooperative game (forgetting player and preservation player) and derive a closed-form solution using Nash bargaining theory to guide models to Pareto stationary points.
Result: Outperforms state-of-the-art MU algorithms across ResNet, CLIP, and diffusion models, achieving better trade-off between forgetting and preserving with improved forgetting precision, generalization preservation, and adversarial robustness.
Conclusion: The game-theoretic approach provides equilibrium solutions that ensure optimality in both objectives, making it an effective framework for Machine Unlearning tasks.
Abstract: Machine Unlearning (MU) aims to selectively erase harmful behaviors from models while retaining the overall utility of the model. As a multi-task learning problem, MU involves balancing objectives related to forgetting specific concepts/data and preserving general performance. A naive integration of these forgetting and preserving objectives can lead to gradient conflicts and dominance, impeding MU algorithms from reaching optimal solutions. To address the gradient conflict and dominance issue, we reformulate MU as a two-player cooperative game, where the two players, namely, the forgetting player and the preservation player, contribute via their gradient proposals to maximize their overall gain and balance their contributions. To this end, inspired by the Nash bargaining theory, we derive a closed-form solution to guide the model toward the Pareto stationary point. Our formulation of MU guarantees an equilibrium solution, where any deviation from the final state would lead to a reduction in the overall objectives for both players, ensuring optimality in each objective. We evaluate our algorithm’s effectiveness on a diverse set of tasks across image classification and image generation. Extensive experiments with ResNet, vision-language model CLIP, and text-to-image diffusion models demonstrate that our method outperforms state-of-the-art MU algorithms, achieving a better trade-off between forgetting and preserving. Our results also highlight improvements in forgetting precision, preservation of generalization, and robustness against adversarial attacks.
[171] OFTSR: One-Step Flow for Image Super-Resolution with Tunable Fidelity-Realism Trade-offs
Yuanzhi Zhu, Ruiqing Wang, Shilin Lu, Junnan Li, Hanshu Yan, Kai Zhang
Main category: cs.CV
TL;DR: OFTSR is a one-step flow-based super-resolution framework that achieves tunable fidelity-realism trade-off through teacher-student distillation with ODE trajectory alignment.
Details
Motivation: Existing diffusion/flow-based methods require multiple sampling steps (computationally expensive) or fixed distillation approaches that lack flexibility in fidelity-realism trade-off.
Method: Train conditional flow-based teacher model, then distill it to one-step student by forcing student predictions to lie on teacher's ODE sampling trajectory from intermediate states.
Result: State-of-the-art performance on FFHQ, DIV2K, and ImageNet datasets for one-step super-resolution with flexible fidelity-realism tuning capability.
Conclusion: OFTSR successfully addresses computational overhead and flexibility limitations of previous methods, achieving efficient one-step super-resolution with tunable quality trade-offs.
Abstract: Recent advances in diffusion and flow-based generative models have demonstrated remarkable success in image restoration tasks, achieving superior perceptual quality compared to traditional deep learning approaches. However, these methods either require numerous sampling steps to generate high-quality images, resulting in significant computational overhead, or rely on common model distillation, which usually imposes a fixed fidelity-realism trade-off and thus lacks flexibility. In this paper, we introduce OFTSR, a novel flow-based framework for one-step image super-resolution that can produce outputs with tunable levels of fidelity and realism. Our approach first trains a conditional flow-based super-resolution model to serve as a teacher model. We then distill this teacher model by applying a specialized constraint. Specifically, we force the predictions from our one-step student model for the same input to lie on the same sampling ODE trajectory of the teacher model. This alignment ensures that the student model's single-step predictions from initial states match the teacher's predictions from a closer intermediate state. Through extensive experiments on datasets including FFHQ (256$\times$256), DIV2K, and ImageNet (256$\times$256), we demonstrate that OFTSR achieves state-of-the-art performance for one-step image super-resolution, while having the ability to flexibly tune the fidelity-realism trade-off. Code: https://github.com/yuanzhi-zhu/OFTSR.
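A hedged sketch of the trajectory-alignment constraint described in the abstract: the teacher advances one Euler step along its sampling ODE, and the student's prediction from the farther state is pulled toward its (detached) prediction from the closer state. The Euler discretization, step handling, and function signatures are simplifications, not the paper's exact procedure.

```python
import torch
import torch.nn.functional as F

def trajectory_alignment_loss(student, teacher, x_t, cond, t, dt):
    """Pull the student's one-step prediction onto the teacher's ODE trajectory.

    `cond` is the low-resolution input conditioning both models. The teacher
    moves one Euler step along its sampling ODE; the student's prediction from
    x_t must match its own (detached) prediction from the closer state.
    """
    with torch.no_grad():
        x_closer = x_t + dt * teacher(x_t, cond, t)   # Euler step on teacher ODE
        target = student(x_closer, cond, t + dt)      # prediction from closer state
    pred = student(x_t, cond, t)                      # prediction from farther state
    return F.mse_loss(pred, target)
```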
[172] Defending LVLMs Against Vision Attacks through Partial-Perception Supervision
Qi Zhou, Tianlin Li, Qing Guo, Dongxia Wang, Yun Lin, Yang Liu, Jin Song Dong
Main category: cs.CV
TL;DR: DPS is a black-box training-free defense method that uses partial image responses to supervise LVLM outputs, reducing attack success rates by 76.3% while maintaining clean image performance.
Details
Motivation: Existing defense methods using image cropping and majority voting degrade response quality on clean images due to semantic distortion from partial images.
Method: Proposes DPS (Defense through Partial-Perception Supervision) where a weak model's responses from partial images supervise the strong model's responses to original images, enabling attack detection and response adjustment.
Result: Outperforms baselines by reducing average attack success rate by 76.3% across six datasets on three popular LVLMs while maintaining clean input performance.
Conclusion: Weak models can effectively supervise strong models for defense, with DPS providing robust protection against vision attacks without compromising clean image response quality.
Abstract: Recent studies have raised significant concerns regarding the vulnerability of Large Vision Language Models (LVLMs) to maliciously injected or perturbed input images, which can mislead their responses. Existing defense methods show that such vision attacks are sensitive to image modifications especially cropping, using majority voting across responses of modified images as corrected responses. However, these modifications often result in partial images and distort the semantics, which reduces response quality on clean images after voting. Instead of directly using responses from partial images for voting, we investigate using them to supervise the LVLM’s responses to the original images. We propose a black-box, training-free method called DPS (Defense through Partial-Perception Supervision). In this approach, the model is prompted using the responses generated by a model that perceives only a partial image. With DPS, the model can adjust its response based on partial image understanding when under attack, while confidently maintaining its original response for clean input. Our findings show that the weak model can supervise the strong model: when faced with an attacked input, the strong model becomes less confident and adjusts its response based on the weak model’s partial understanding, effectively defending against the attack. With clean input, it confidently maintains its original response. Empirical experiments show our method outperforms the baseline, cutting the average attack success rate by 76.3% across six datasets on three popular models.
[173] Sat-DN: Implicit Surface Reconstruction from Multi-View Satellite Images with Depth and Normal Supervision
Tianle Liu, Shuangming Zhao, Wanshou Jiang, Bingxuan Guo
Main category: cs.CV
TL;DR: Sat-DN is a novel framework for high-quality satellite image reconstruction using multi-resolution hash grid with depth guidance and normal constraints, achieving faster training and better results than existing methods.
Details
Motivation: Traditional stereo matching fails to capture fine details in satellite imagery, while NeRFs have prohibitively long training times. Challenges include low facade visibility, illumination differences, and weakly textured regions that hinder terrain and building reconstruction.
Method: Uses progressively trained multi-resolution hash grid architecture with explicit depth guidance and surface normal consistency constraints. Progressive strategy increases learning frequency incrementally, using coarse geometry to guide fine detail reconstruction.
Result: Extensive experiments on DFC2019 dataset show Sat-DN outperforms existing methods, achieving state-of-the-art results in both qualitative and quantitative evaluations with faster training times.
Conclusion: The proposed framework effectively addresses satellite image reconstruction challenges, providing high-quality terrain geometry and detailed building facades with improved efficiency and accuracy compared to current approaches.
Abstract: With advancements in satellite imaging technology, acquiring high-resolution multi-view satellite imagery has become increasingly accessible, enabling rapid and location-independent ground model reconstruction. However, traditional stereo matching methods struggle to capture fine details, and while neural radiance fields (NeRFs) achieve high-quality reconstructions, their training time is prohibitively long. Moreover, challenges such as low visibility of building facades, illumination and style differences between pixels, and weakly textured regions in satellite imagery further make it hard to reconstruct reasonable terrain geometry and detailed building facades. To address these issues, we propose Sat-DN, a novel framework leveraging a progressively trained multi-resolution hash grid reconstruction architecture with explicit depth guidance and surface normal consistency constraints to enhance reconstruction quality. The multi-resolution hash grid accelerates training, while the progressive strategy incrementally increases the learning frequency, using coarse low-frequency geometry to guide the reconstruction of fine high-frequency details. The depth and normal constraints ensure a clear building outline and correct planar distribution. Extensive experiments on the DFC2019 dataset demonstrate that Sat-DN outperforms existing methods, achieving state-of-the-art results in both qualitative and quantitative evaluations. The code is available at https://github.com/costune/SatDN.
[174] Image Embedding Sampling Method for Diverse Captioning
Sania Waheed, Na Min An
Main category: cs.CV
TL;DR: Training-free framework that enhances small VLM caption diversity by using structured segmentation to capture both global and local image semantics, achieving performance comparable to larger models.
Details
Motivation: Address the trade-off between computational complexity and caption quality in VLMs, making detailed image captioning more accessible for resource-constrained applications.
Method: Leverages structured segmentation with BLIP backbone to produce hierarchical representations without additional training, explicitly attending to distinct image regions.
Result: Achieves Div-2 scores of 0.735 (MSCOCO), 0.750 (Flickr30k), and 0.748 (Nocaps) while maintaining strong relevancy and semantic integrity with human annotations.
Conclusion: Smaller VLMs can achieve comparable performance to larger models through structured segmentation approaches without requiring additional training, making detailed captioning more accessible.
Abstract: Image Captioning for state-of-the-art VLMs has significantly improved over time; however, this comes at the cost of increased computational complexity, making them less accessible for resource-constrained applications such as mobile devices and assistive technologies. Alternatively, comparably smaller VLMs prioritize high-level scene descriptions, overlooking finer details that contribute to a richer understanding of an image. In this paper, we introduce a training-free framework that enhances caption diversity and informativeness by explicitly attending to distinct image regions using a comparably small VLM, BLIP, as the backbone. Our approach leverages structured segmentation to produce hierarchical representations that capture both global and localized semantics. Without requiring additional model training, we demonstrate that our method allows smaller VLMs to achieve performance comparable to larger models in terms of image-caption alignment, semantic integrity, and diversity. We evaluate our framework on MSCOCO, Flickr30k, and Nocaps test datasets, achieving a Div-2 score of 0.735, 0.750, and 0.748 for each dataset, respectively, while maintaining strong image-caption relevancy and semantic integrity with the human-annotated captions.
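As a rough sketch of region-attentive captioning with a small VLM, the snippet below captions the full image plus a few region crops using the Hugging Face BLIP checkpoint; the crude box crops stand in for the paper's structured segmentation, and the decoding settings are arbitrary.

```python
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

# Regions here are plain crops for illustration — the paper derives them
# from structured segmentation rather than axis-aligned boxes.
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained(
    "Salesforce/blip-image-captioning-base"
)

def caption_regions(image: Image.Image, boxes):
    """Caption the global view plus each region crop with BLIP."""
    captions = []
    for box in [None] + list(boxes):              # global view first
        view = image if box is None else image.crop(box)
        inputs = processor(images=view, return_tensors="pt")
        out = model.generate(**inputs, max_new_tokens=30)
        captions.append(processor.decode(out[0], skip_special_tokens=True))
    return captions
```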
[175] CoDiff: Conditional Diffusion Model for Collaborative 3D Object Detection
Zhe Huang, Shuo Wang, Yongcai Wang, Lei Wang
Main category: cs.CV
TL;DR: CoDiff is a novel collaborative 3D object detection framework that uses diffusion models to denoise and refine multi-agent feature representations, addressing spatial and temporal noise from pose estimation errors and time delays.
Details
Motivation: Collaborative perception in autonomous driving suffers from feature noise due to pose estimation errors and time delays during information exchange between agents, leading to detection errors. Diffusion models' natural denoising capability makes them suitable for this problem.
Method: Projects high-dimensional feature maps into a pre-trained autoencoder's latent space, uses individual agent information as conditioning to guide diffusion model sampling, and progressively denoises coarse feature maps to refine fused features.
Result: Outperforms existing methods on both simulated and real-world datasets, demonstrating superior collaborative object detection performance and high robustness against high-level noise in pose and delay information.
Conclusion: CoDiff successfully applies diffusion models to multi-agent collaborative perception for the first time, providing an effective solution to noise problems in collaborative 3D object detection with improved performance and robustness.
Abstract: Collaborative 3D object detection holds significant importance in the field of autonomous driving, as it greatly enhances the perception capabilities of each individual agent by facilitating information exchange among multiple agents. However, in practice, due to pose estimation errors and time delays, the fusion of information across agents often results in feature representations with spatial and temporal noise, leading to detection errors. Diffusion models naturally have the ability to denoise noisy samples to the ideal data, which motivates us to explore the use of diffusion models to address the noise problem between multi-agent systems. In this work, we propose CoDiff, a novel robust collaborative perception framework that leverages the potential of diffusion models to generate more comprehensive and clearer feature representations. To the best of our knowledge, this is the first work to apply diffusion models to multi-agent collaborative perception. Specifically, we project high-dimensional feature maps into the latent space of a powerful pre-trained autoencoder. Within this space, individual agent information serves as a condition to guide the diffusion model's sampling. This process denoises coarse feature maps and progressively refines the fused features. Experimental study on both simulated and real-world datasets demonstrates that the proposed framework CoDiff consistently outperforms existing relevant methods in terms of collaborative object detection performance, and exhibits highly desired robustness when the pose and delay information of agents contains high-level noise. The code is released at https://github.com/HuangZhe885/CoDiff
[176] ARTalk: Speech-Driven 3D Head Animation via Autoregressive Model
Xuangeng Chu, Nabarun Goswami, Ziteng Cui, Hanqin Wang, Tatsuya Harada
Main category: cs.CV
TL;DR: Real-time speech-driven 3D facial animation using autoregressive model with multi-scale motion codebook for lip sync, head poses, and eye blinks, with style adaptation capability.
Details
Motivation: Existing diffusion-based methods for speech-driven 3D facial animation produce natural motions but are too slow for real-time applications, limiting their practical use.
Method: Novel autoregressive model that learns mapping from speech to multi-scale motion codebook to generate synchronized lip movements, head poses, and eye blinks in real-time.
Result: Outperforms existing approaches in lip synchronization accuracy and perceived quality, can adapt to unseen speaking styles for creating unique 3D talking avatars.
Conclusion: Proposed method enables real-time generation of high-quality 3D facial animation with style adaptation, overcoming speed limitations of previous diffusion-based approaches.
Abstract: Speech-driven 3D facial animation aims to generate realistic lip movements and facial expressions for 3D head models from arbitrary audio clips. Although existing diffusion-based methods are capable of producing natural motions, their slow generation speed limits their application potential. In this paper, we introduce a novel autoregressive model that achieves real-time generation of highly synchronized lip movements and realistic head poses and eye blinks by learning a mapping from speech to a multi-scale motion codebook. Furthermore, our model can adapt to unseen speaking styles, enabling the creation of 3D talking avatars with unique personal styles beyond the identities seen during training. Extensive evaluations and user studies demonstrate that our method outperforms existing approaches in lip synchronization accuracy and perceived quality.
[177] Fast rigid alignment of heterogeneous images in sliced Wasserstein distance
Yunpeng Shi, Amit Singer, Eric J. Verbeke
Main category: cs.CV
TL;DR: Fast image alignment algorithm using optimal transport and sliced Wasserstein distance with O(L² log L) complexity
Details
Motivation: Many computer vision applications require aligning similar but non-identical images efficiently and robustly.
Method: Combines fast Fourier methods with sliced probability metrics to compute alignment using sliced 2-Wasserstein distance.
Result: Achieves O(L² log L) operations for L×L images, robust to translations, rotations and deformations
Conclusion: Proposed method provides efficient and robust image alignment through optimal transport techniques
Abstract: Many applications of computer vision rely on the alignment of similar but non-identical images. We present a fast algorithm for aligning heterogeneous images based on optimal transport. Our approach combines the speed of fast Fourier methods with the robustness of sliced probability metrics and allows us to efficiently compute the alignment between two $L \times L$ images using the sliced 2-Wasserstein distance in $O(L^2 \log L)$ operations. We show that our method is robust to translations, rotations and deformations in the images.
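For intuition, a direct Monte-Carlo estimator of the sliced 2-Wasserstein distance between two equal-size point sets is sketched below; it shows the project-sort-compare idea but not the paper's fast Fourier-based $O(L^2 \log L)$ alignment algorithm.

```python
import numpy as np

def sliced_w2(x, y, n_projections=64, seed=0):
    """Monte-Carlo sliced 2-Wasserstein distance between point sets x, y.

    x, y: (n, d) arrays of equal size. In 1D, the optimal transport plan
    simply matches sorted samples, which makes each slice cheap to evaluate.
    """
    rng = np.random.default_rng(seed)
    d = x.shape[1]
    total = 0.0
    for _ in range(n_projections):
        theta = rng.normal(size=d)
        theta /= np.linalg.norm(theta)            # random direction on sphere
        px = np.sort(x @ theta)                   # 1D projections
        py = np.sort(y @ theta)
        total += np.mean((px - py) ** 2)          # 1D W2^2 via sorted matching
    return np.sqrt(total / n_projections)
```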
[178] Imitating Radiological Scrolling: A Global-Local Attention Model for 3D Chest CT Volumes Multi-Label Anomaly Classification
Theo Di Piazza, Carole Lazarus, Olivier Nempont, Loic Boussel
Main category: cs.CV
TL;DR: CT-Scroll: A novel global-local attention model that emulates radiologists’ scrolling behavior for multi-label classification of 3D CT scans, addressing limitations of CNNs and Vision Transformers.
Details
Motivation: The increasing number of CT scans requires automated tools to assist radiologists. Existing methods struggle with long-range dependencies, require extensive pre-training, and fail to model radiologists' navigational scrolling behavior during CT analysis.
Method: CT-Scroll uses a global-local attention mechanism specifically designed to emulate how radiologists scroll through CT scan slices, combining both global context understanding and local detail awareness.
Result: The approach was evaluated on two public datasets with comprehensive experiments and ablation studies showing the efficacy of the model and the contribution of each component.
Conclusion: CT-Scroll effectively addresses the challenges of 3D CT scan analysis by modeling radiologists’ scrolling behavior, outperforming traditional CNN and Vision Transformer approaches without requiring extensive pre-training.
Abstract: The rapid increase in the number of Computed Tomography (CT) scan examinations has created an urgent need for automated tools, such as organ segmentation, anomaly classification, and report generation, to assist radiologists with their growing workload. Multi-label classification of Three-Dimensional (3D) CT scans is a challenging task due to the volumetric nature of the data and the variety of anomalies to be detected. Existing deep learning methods based on Convolutional Neural Networks (CNNs) struggle to capture long-range dependencies effectively, while Vision Transformers require extensive pre-training, posing challenges for practical use. Additionally, these existing methods do not explicitly model the radiologist’s navigational behavior while scrolling through CT scan slices, which requires both global context understanding and local detail awareness. In this study, we present CT-Scroll, a novel global-local attention model specifically designed to emulate the scrolling behavior of radiologists during the analysis of 3D CT scans. Our approach is evaluated on two public datasets, demonstrating its efficacy through comprehensive experiments and an ablation study that highlights the contribution of each model component.
[179] Transferable Mask Transformer: Cross-domain Semantic Segmentation with Region-adaptive Transferability Estimation
Jianhua Liu, Zhengyu Li, Yanru Wu, Jingge Wang, Yang Tan, Ruizhe Zhao, Guan Wang, Yang Li
Main category: cs.CV
TL;DR: TMT is a region-level domain adaptation framework for Vision Transformers that improves semantic segmentation performance by dynamically identifying and prioritizing adaptation in low-transferability regions through spatial transferability analysis and masked attention mechanisms.
Details
Motivation: Pretrained Vision Transformers suffer performance degradation when adapted to new domains due to distribution shifts causing suboptimal global attention, as self-attention mechanisms are data-driven and fail with domain differences in texture, scale, or object co-occurrence patterns.
Method: Transferable Mask Transformer (TMT) with two components: Adaptive Cluster-based Transferability Estimator (ACTE) for dynamic region segmentation and transferability assessment, and Transferable Masked Attention (TMA) module that integrates transferability maps into attention mechanisms to prioritize adaptation in low-transferability regions.
Result: Comprehensive evaluations across 20 cross-domain pairs show TMT achieves 2% MIoU improvement over vanilla fine-tuning and 1.28% increase compared to state-of-the-art baselines.
Conclusion: TMT effectively addresses domain adaptation challenges in semantic segmentation by focusing on region-level adaptation with dynamically shaped regions, demonstrating superior performance through spatial transferability analysis and targeted attention mechanisms.
Abstract: Recent advances in Vision Transformers (ViTs) have set new benchmarks in semantic segmentation. However, when adapting pretrained ViTs to new target domains, significant performance degradation often occurs due to distribution shifts, resulting in suboptimal global attention. Since self-attention mechanisms are inherently data-driven, they may fail to effectively attend to key objects when source and target domains exhibit differences in texture, scale, or object co-occurrence patterns. While global and patch-level domain adaptation methods provide partial solutions, region-level adaptation with dynamically shaped regions is crucial due to spatial heterogeneity in transferability across different image areas. We present Transferable Mask Transformer (TMT), a novel region-level adaptation framework for semantic segmentation that aligns cross-domain representations through spatial transferability analysis. TMT consists of two key components: (1) An Adaptive Cluster-based Transferability Estimator (ACTE) that dynamically segments images into structurally and semantically coherent regions for localized transferability assessment, and (2) A Transferable Masked Attention (TMA) module that integrates region-specific transferability maps into ViTs' attention mechanisms, prioritizing adaptation in regions with low transferability and high semantic uncertainty. Comprehensive evaluations across 20 cross-domain pairs demonstrate TMT’s superiority, achieving an average 2% MIoU improvement over vanilla fine-tuning and a 1.28% increase compared to state-of-the-art baselines. The source code will be publicly available.
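As a rough illustration of the TMA idea, the sketch below biases attention logits with a per-token transferability map so that keys in low-transferability regions receive extra weight. The paper's actual formulation may differ; the additive bias form and the alpha scale are assumptions.

```python
import torch
import torch.nn.functional as F

def transferability_biased_attention(q, k, v, transfer, alpha=1.0):
    """q, k, v: (B, N, D); transfer: (B, N) scores in [0, 1] (1 = transfers well)."""
    d = q.shape[-1]
    logits = q @ k.transpose(1, 2) / d ** 0.5      # (B, N, N) attention logits
    bias = alpha * (1.0 - transfer).unsqueeze(1)   # boost low-transferability keys
    return F.softmax(logits + bias, dim=-1) @ v

q = k = v = torch.randn(1, 16, 64)
transfer = torch.rand(1, 16)                       # e.g., from an ACTE-like module
print(transferability_biased_attention(q, k, v, transfer).shape)  # (1, 16, 64)
```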
[180] Optimization of Module Transferability in Single Image Super-Resolution: Universality Assessment and Cycle Residual Blocks
Haotong Cheng, Zhiqi Zhang, Hao Li, Xinshang Zhang
Main category: cs.CV
TL;DR: This paper introduces the concept of “Universality” to quantify module transferability in SISR, proposes a Universality Assessment Equation (UAE) metric, and designs two optimized modules (CRB and DCRB) that outperform state-of-the-art methods with significant PSNR improvements or parameter reductions.
Details
Motivation: Existing SISR research focuses on raw performance gains but neglects quantifying the transferability of architectural components across models and tasks.
Method: Proposed Universality concept and UAE metric to quantify module transferability, then designed Cycle Residual Block (CRB) and Depth-Wise Cycle Residual Block (DCRB) based on UAE analysis of standard modules.
Result: Networks with proposed modules outperform SOTA methods, achieving up to 0.83 dB PSNR improvement or 71.3% parameter reduction with negligible fidelity loss across natural-scene benchmarks, remote-sensing datasets, and low-level tasks.
Conclusion: The universality-based optimization approach provides a new paradigm for designing plug-and-play modules that can be applied to various basic modules beyond SISR.
Abstract: Deep learning has substantially advanced the field of Single Image Super-Resolution (SISR). However, existing research has predominantly focused on raw performance gains, with little attention paid to quantifying the transferability of architectural components. In this paper, we introduce the concept of “Universality” and its associated definitions, which extend the traditional notion of “Generalization” to encompass the ease of transferability of modules. We then propose the Universality Assessment Equation (UAE), a metric that quantifies how readily a given module can be transplanted across models and reveals the combined influence of multiple existing metrics on transferability. Guided by the UAE results of standard residual blocks and other plug-and-play modules, we further design two optimized modules: the Cycle Residual Block (CRB) and the Depth-Wise Cycle Residual Block (DCRB). Through comprehensive experiments on natural-scene benchmarks, remote-sensing datasets, and other low-level tasks, we demonstrate that networks embedded with the proposed plug-and-play modules outperform several state-of-the-art methods, achieving a PSNR improvement of up to 0.83 dB or enabling a 71.3% reduction in parameters with negligible loss in reconstruction fidelity. Similar optimization approaches could be applied to a broader range of basic modules, offering a new paradigm for the design of plug-and-play modules.
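For reference, the PSNR metric behind the reported 0.83 dB gain is the standard one:

```latex
\mathrm{PSNR} = 10 \log_{10}\frac{\mathrm{MAX}_I^{2}}{\mathrm{MSE}},
\qquad
\mathrm{MSE} = \frac{1}{HW}\sum_{i=1}^{H}\sum_{j=1}^{W}\bigl(I(i,j)-\hat{I}(i,j)\bigr)^{2}
```

where MAX_I is the maximum pixel value (e.g., 255 for 8-bit images). A 0.83 dB PSNR gain therefore corresponds to roughly a 17% reduction in MSE, since 10^{0.083} ≈ 1.21.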
[181] POET: Supporting Prompting Creativity and Personalization with Automated Expansion of Text-to-Image Generation
Evans Xu Han, Alice Qian Zhang, Haiyi Zhu, Hong Shen, Paul Pu Liang, Jane Hsieh
Main category: cs.CV
TL;DR: POET is an interactive tool that enhances text-to-image generation by discovering homogeneity dimensions, diversifying outputs, and personalizing through user feedback to support creative ideation.
Details
Motivation: Current text-to-image systems produce conventional outputs and lack personalization, limiting creative exploration for diverse user needs in early ideation stages.
Method: POET automatically discovers dimensions of homogeneity in generative models, expands these dimensions to diversify outputs, and learns from user feedback to personalize expansions.
Result: Evaluation with 28 users across four creative domains showed POET generates higher perceived diversity, helps users reach satisfaction faster, and encourages deliberation on a wider range of results.
Conclusion: POET demonstrates how future text-to-image tools can better support pluralistic values and user needs during creative ideation through interactive personalization and diversification techniques.
Abstract: State-of-the-art visual generative AI tools hold immense potential to assist users in the early ideation stages of creative tasks – offering the ability to generate (rather than search for) novel and unprecedented (instead of existing) images of considerable quality that also adhere to boundless combinations of user specifications. However, many large-scale text-to-image systems are designed for broad applicability, yielding conventional output that may limit creative exploration. They also employ interaction methods that may be difficult for beginners. Given that creative end users often operate in diverse, context-specific ways that are often unpredictable, more variation and personalization are necessary. We introduce POET, a real-time interactive tool that (1) automatically discovers dimensions of homogeneity in text-to-image generative models, (2) expands these dimensions to diversify the output space of generated images, and (3) learns from user feedback to personalize expansions. An evaluation with 28 users spanning four creative task domains demonstrated POET’s ability to generate results with higher perceived diversity and help users reach satisfaction in fewer prompts during creative tasks, thereby prompting them to deliberate and reflect more on a wider range of possible produced results during the co-creative process. Focusing on visual creativity, POET offers a first glimpse of how interaction techniques of future text-to-image generation tools may support and align with more pluralistic values and the needs of end users during the ideation stages of their work.
[182] Completing Spatial Transcriptomics Data for Gene Expression Prediction Benchmarking
Daniela Ruiz, Paula Cárdenas, Leonardo Manrique, Daniela Vega, Gabriel M. Mejia, Pablo Arbeláez
Main category: cs.CV
TL;DR: SpaRED provides standardized benchmark for gene expression prediction from histology images, while SpaCKLE transformer model reduces prediction error by 82.5% and improves all existing models.
Details
Motivation: Address limitations of Visium Spatial Transcriptomics (high cost, expertise requirements, slow clinical integration, gene capture inefficiencies) and inconsistent evaluation standards in existing research.
Method: Created SpaRED - systematically curated database of 26 public datasets with standardized preprocessing. Developed SpaCKLE - transformer-based gene expression completion model. Established benchmark evaluating 8 state-of-the-art prediction models on both raw and SpaCKLE-completed data.
Result: SpaCKLE reduces mean squared error by over 82.5% compared to existing approaches. SpaCKLE substantially improves results across all gene expression prediction models when used with completed data.
Conclusion: This work provides the most comprehensive benchmark for gene expression prediction from histology images and serves as a foundation for future Spatial Transcriptomics research.
Abstract: Spatial Transcriptomics is a groundbreaking technology that integrates histology images with spatially resolved gene expression profiles. Among the various Spatial Transcriptomics techniques available, Visium has emerged as the most widely adopted. However, its accessibility is limited by high costs, the need for specialized expertise, and slow clinical integration. Additionally, gene capture inefficiencies lead to significant dropout, corrupting acquired data. To address these challenges, the deep learning community has explored the gene expression prediction task directly from histology images. Yet, inconsistencies in datasets, preprocessing, and training protocols hinder fair comparisons between models. To bridge this gap, we introduce SpaRED, a systematically curated database comprising 26 public datasets, providing a standardized resource for model evaluation. We further propose SpaCKLE, a state-of-the-art transformer-based gene expression completion model that reduces mean squared error by over 82.5% compared to existing approaches. Finally, we establish the SpaRED benchmark, evaluating eight state-of-the-art prediction models on both raw and SpaCKLE-completed data, demonstrating SpaCKLE substantially improves the results across all the gene expression prediction models. Altogether, our contributions constitute the most comprehensive benchmark of gene expression prediction from histology images to date and a stepping stone for future research on Spatial Transcriptomics.
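A hedged sketch of what a transformer-based expression-completion model in the spirit of SpaCKLE might look like: dropout-corrupted spot-by-gene profiles go in, and training supervises the masked (zeroed) entries. Dimensions, depth, and the 30% masking rate are illustrative, not the paper's settings.

```python
import torch
import torch.nn as nn

class CompletionModel(nn.Module):
    def __init__(self, n_genes=128, dim=256):
        super().__init__()
        self.inp = nn.Linear(n_genes, dim)
        enc_layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=2)
        self.out = nn.Linear(dim, n_genes)

    def forward(self, x):               # x: (B, spots, n_genes) with dropouts
        return self.out(self.encoder(self.inp(x)))

x_full = torch.rand(2, 64, 128)                # toy expression profiles
mask = torch.rand_like(x_full) < 0.3           # simulate 30% gene dropout
x_in = x_full.masked_fill(mask, 0.0)
pred = CompletionModel()(x_in)
loss = ((pred - x_full)[mask] ** 2).mean()     # supervise only masked entries
print(loss.item())
```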
[183] Res-MoCoDiff: Residual-guided diffusion models for motion artifact correction in brain MRI
Mojtaba Safari, Shansong Wang, Qiang Li, Zach Eidex, Richard L. J. Qiu, Chih-Wei Chang, Hui Mao, Xiaofeng Yang
Main category: cs.CV
TL;DR: Res-MoCoDiff is an efficient diffusion model for MRI motion artifact correction that uses residual error shifting and achieves superior performance with only 4 reverse diffusion steps, reducing processing time significantly.
Details
Motivation: Motion artifacts in brain MRI degrade image quality and hinder applications. Conventional methods like repeated acquisitions or motion tracking impose workflow burdens, creating a need for efficient correction methods.
Method: Uses residual error shifting mechanism during forward diffusion to match corrupted data distribution. Employs U-net with Swin Transformer blocks and combined l1+l2 loss. Requires only 4 reverse diffusion steps.
Result: Superior performance across all distortion levels. Achieved highest SSIM and lowest NMSE, with PSNR up to 41.91±2.94 dB for minor distortions. Reduced sampling time to 0.37 seconds per batch vs 101.74 seconds for conventional methods.
Conclusion: Res-MoCoDiff is an efficient and effective solution for MRI motion artifact correction, offering significant time savings while maintaining high image quality across various distortion levels.
Abstract: Objective. Motion artifacts in brain MRI, mainly from rigid head motion, degrade image quality and hinder downstream applications. Conventional methods to mitigate these artifacts, including repeated acquisitions or motion tracking, impose workflow burdens. This study introduces Res-MoCoDiff, an efficient denoising diffusion probabilistic model specifically designed for MRI motion artifact correction. Approach. Res-MoCoDiff exploits a novel residual error shifting mechanism during the forward diffusion process to incorporate information from motion-corrupted images. This mechanism allows the model to simulate the evolution of noise with a probability distribution closely matching that of the corrupted data, enabling a reverse diffusion process that requires only four steps. The model employs a U-net backbone, with attention layers replaced by Swin Transformer blocks, to enhance robustness across resolutions. Furthermore, the training process integrates a combined l1+l2 loss function, which promotes image sharpness and reduces pixel-level errors. Res-MoCoDiff was evaluated on both an in-silico dataset generated using a realistic motion simulation framework and an in-vivo MR-ART dataset. Comparative analyses were conducted against established methods, including CycleGAN, Pix2pix, and a diffusion model with a vision transformer backbone, using quantitative metrics such as PSNR, SSIM, and NMSE. Main results. The proposed method demonstrated superior performance in removing motion artifacts across minor, moderate, and heavy distortion levels. Res-MoCoDiff consistently achieved the highest SSIM and the lowest NMSE values, with a PSNR of up to 41.91±2.94 dB for minor distortions. Notably, the average sampling time was reduced to 0.37 seconds per batch of two image slices, compared with 101.74 seconds for conventional approaches.
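The residual error shifting idea can be pictured with a generic ResShift-style forward process, where the motion-corrupted image y is blended into the noisy state; the authors' exact schedule and constants are not reproduced here, so treat this as an illustration of the family, not the paper's model.

```python
import torch

def residual_shift_forward(x0, y, eta_t, kappa=1.0):
    """x_t = x_0 + eta_t * (y - x_0) + kappa * sqrt(eta_t) * eps  (ResShift-style)."""
    eps = torch.randn_like(x0)
    return x0 + eta_t * (y - x0) + kappa * eta_t.sqrt() * eps

x0 = torch.randn(1, 1, 64, 64)           # clean slice (toy)
y = x0 + 0.3 * torch.randn_like(x0)      # motion-corrupted slice (toy)
etas = torch.linspace(0.05, 1.0, 4)      # a 4-step schedule, matching the
for t, eta in enumerate(etas):           # paper's four reverse steps
    xt = residual_shift_forward(x0, y, eta)
    print(t, xt.std().item())            # noise level grows with eta_t
```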
[184] Understanding Space Is Rocket Science – Only Top Reasoning Models Can Solve Spatial Understanding Tasks
Nils Hoehing, Mayug Maniparambil, Ellen Rushe, Noel E. O’Connor, Anthony Ventresque
Main category: cs.CV
TL;DR: RocketScience is a new benchmark that tests spatial relation understanding in VLMs, showing current models struggle with spatial reasoning while humans excel.
Details
Motivation: To create a challenging benchmark that specifically tests spatial relation understanding in vision-language models, which is easy for humans but difficult for current VLMs.
Method: Developed a new dataset of real-world image-text pairs focusing on relative spatial relationships and object order, then evaluated various open-source and commercial VLMs along with reasoning models.
Result: Current VLMs show striking lack of spatial relation understanding, while reasoning models perform surprisingly well. Performance is bottlenecked by spatial reasoning capabilities rather than object localization.
Conclusion: Spatial reasoning remains a significant challenge for current VLMs, and the RocketScience benchmark provides a valuable tool for evaluating and improving spatial understanding in multimodal AI systems.
Abstract: We propose RocketScience, an open-source contrastive VLM benchmark that tests for spatial relation understanding. It is comprised of entirely new real-world image-text pairs covering mostly relative spatial understanding and the order of objects. The benchmark is designed to be very easy for humans and hard for the current generation of VLMs, and this is empirically verified. Our results show a striking lack of spatial relation understanding in open source and frontier commercial VLMs and a surprisingly high performance of reasoning models. Additionally, we perform a disentanglement analysis to separate the contributions of object localization and spatial reasoning in chain-of-thought-based models and find that the performance on the benchmark is bottlenecked by spatial reasoning and not object localization capabilities. We release the dataset with a CC-BY-4.0 license and make the evaluation code available at: https://github.com/nilshoehing/rocketscience
[185] Deep Learning Advances in Vision-Based Traffic Accident Anticipation: A Comprehensive Review of Methods, Datasets, and Future Directions
Ruonan Lin, Tao Tang, Yongtai Liu, Wenye Zhou, Xin Yang, Hao Zheng, Jianpu Lin, Yi Zhang
Main category: cs.CV
TL;DR: This paper reviews 147 studies on vision-based traffic accident anticipation (Vision-TAA), categorizing current deep learning approaches and identifying challenges like data scarcity and limited generalization, while suggesting future directions including multi-modal fusion and Transformer architectures.
Details
Motivation: To enhance road safety through comprehensive review of vision-based traffic accident prediction methods, addressing the need for systematic analysis of deep learning approaches in this critical domain.
Method: Systematic review of 147 recent studies, categorizing methodologies into four approaches: image/video feature-based prediction, spatio-temporal feature-based prediction, scene understanding, and multi-modal data fusion.
Result: Identified significant potential in current deep learning methods but highlighted persistent challenges including data scarcity, limited generalization to complex scenarios, and real-time performance constraints.
Conclusion: The review provides foundational reference for developing robust Vision-TAA systems and suggests future research opportunities in multi-modal data fusion, self-supervised learning, and Transformer-based architectures to improve prediction accuracy and scalability.
Abstract: Traffic accident prediction and detection are critical for enhancing road safety, and vision-based traffic accident anticipation (Vision-TAA) has emerged as a promising approach in the era of deep learning. This paper reviews 147 recent studies, focusing on the application of supervised, unsupervised, and hybrid deep learning models for accident prediction, alongside the use of real-world and synthetic datasets. Current methodologies are categorized into four key approaches: image and video feature-based prediction, spatio-temporal feature-based prediction, scene understanding, and multi modal data fusion. While these methods demonstrate significant potential, challenges such as data scarcity, limited generalization to complex scenarios, and real-time performance constraints remain prevalent. This review highlights opportunities for future research, including the integration of multi modal data fusion, self-supervised learning, and Transformer-based architectures to enhance prediction accuracy and scalability. By synthesizing existing advancements and identifying critical gaps, this paper provides a foundational reference for developing robust and adaptive Vision-TAA systems, contributing to road safety and traffic management.
[186] Integrating Intermediate Layer Optimization and Projected Gradient Descent for Solving Inverse Problems with Diffusion Models
Yang Zheng, Wen Li, Zhaoqiang Liu
Main category: cs.CV
TL;DR: Proposes DMILO and DMILO-PGD methods to improve diffusion model-based inverse problem solving by reducing computational burden and improving convergence through intermediate layer optimization and projected gradient descent.
Details
Motivation: Existing diffusion model-based methods for inverse problems suffer from heavy computational demands and suboptimal convergence issues, limiting their practical application.
Method: DMILO uses intermediate layer optimization to reduce memory burden from DMPlug, and DMILO-PGD integrates ILO with projected gradient descent to prevent suboptimal convergence. Both methods employ sparse deviations to expand the range of diffusion models.
Result: Extensive experiments on diverse image datasets show significant performance gains over state-of-the-art methods for both linear and nonlinear inverse problems.
Conclusion: DMILO and DMILO-PGD effectively address computational and convergence challenges in diffusion model-based inverse problem solvers, demonstrating superior performance compared to existing approaches.
Abstract: Inverse problems (IPs) involve reconstructing signals from noisy observations. Recently, diffusion models (DMs) have emerged as a powerful framework for solving IPs, achieving remarkable reconstruction performance. However, existing DM-based methods frequently encounter issues such as heavy computational demands and suboptimal convergence. In this work, building upon the idea of the recent work DMPlug, we propose two novel methods, DMILO and DMILO-PGD, to address these challenges. Our first method, DMILO, employs intermediate layer optimization (ILO) to alleviate the memory burden inherent in DMPlug. Additionally, by introducing sparse deviations, we expand the range of DMs, enabling the exploration of underlying signals that may lie outside the range of the diffusion model. We further propose DMILO-PGD, which integrates ILO with projected gradient descent (PGD), thereby reducing the risk of suboptimal convergence. We provide an intuitive theoretical analysis of our approaches under appropriate conditions and validate their superiority through extensive experiments on diverse image datasets, encompassing both linear and nonlinear IPs. Our results demonstrate significant performance gains over state-of-the-art methods, highlighting the effectiveness of DMILO and DMILO-PGD in addressing common challenges in DM-based IP solvers.
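A schematic of the projected-gradient-descent half of DMILO-PGD, under the usual formulation for inverse problems with a generative prior: gradient steps on the data-fidelity term alternate with a projection onto (a neighborhood of) the generator's range. Here `project` is a placeholder for the paper's ILO-based projection, and the linear operator is a toy example.

```python
import torch

def pgd_inverse(A, y, project, x_init, steps=300, lr=0.05):
    x = x_init.clone()
    for _ in range(steps):
        x = x.detach().requires_grad_(True)
        loss = ((A(x) - y) ** 2).sum()        # data fidelity ||A(x) - y||^2
        (grad,) = torch.autograd.grad(loss, x)
        x = (x - lr * grad).detach()          # gradient step
        x = project(x)                        # range projection (ILO in the paper)
    return x

A_mat = torch.randn(32, 64) / 64 ** 0.5       # toy normalized linear measurements
A = lambda x: x @ A_mat.T
x_true = torch.randn(1, 64)
y = A(x_true)
x_hat = pgd_inverse(A, y, lambda x: x, torch.zeros(1, 64))  # identity projection
print(((A(x_hat) - y) ** 2).sum().item())     # data-fit residual shrinks toward 0
```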
[187] Demographic-aware fine-grained classification of pediatric wrist fractures
Ammar Ahmed, Ali Shariq Imran, Zenun Kastrati, Sher Muhammad Daudpota
Main category: cs.CV
TL;DR: This paper presents a novel approach for wrist pathology recognition using fine-grained transformers, metadata integration with X-rays, and fine-grained pre-training, achieving significant accuracy improvements over traditional methods.
Details
Motivation: Wrist pathologies are common, especially in children, but medical imaging datasets are limited. Relying solely on image data is insufficient given the availability of diverse data types, necessitating a multimodal approach.
Method: Framed as fine-grained recognition task, fused patient metadata with X-ray images, used fine-grained pre-training weights instead of coarse-grained datasets like ImageNet, and employed transformer architecture.
Result: Combining fine-grained transformer approach with metadata integration improved diagnostic accuracy by 2% on a small custom dataset and over 10% on a larger fracture dataset.
Conclusion: The integration of metadata with medical images and fine-grained pre-training significantly enhances wrist pathology recognition accuracy, representing the first application of metadata fusion in this domain.
Abstract: Wrist pathologies are frequently observed, particularly among children who constitute the majority of fracture cases. Computer vision presents a promising avenue, contingent upon the availability of extensive datasets, a notable challenge in medical imaging. Therefore, reliance solely on one modality, such as images, proves inadequate, especially in an era of diverse and plentiful data types. This study addresses the problem using a multifaceted approach: framing it as a fine-grained recognition task, fusing patient metadata with X-rays, and leveraging weights from a separate fine-grained dataset rather than from a coarse-grained dataset like ImageNet. Unlike prior work, this is the first application of metadata integration for wrist pathology recognition. Our results show that combining a fine-grained transformer approach, fine-grained pre-training, and metadata integration improves diagnostic accuracy by 2% on a small custom-curated dataset and by over 10% on a larger fracture dataset.
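A minimal, hypothetical fusion head showing the general pattern of combining image features with patient metadata; the paper fuses metadata inside a fine-grained transformer pipeline, and all layer sizes and metadata fields below are assumptions.

```python
import torch
import torch.nn as nn

class MetadataFusionClassifier(nn.Module):
    def __init__(self, img_dim=768, meta_dim=2, hidden=64, n_classes=4):
        super().__init__()
        self.meta_mlp = nn.Sequential(
            nn.Linear(meta_dim, hidden), nn.ReLU(), nn.Linear(hidden, hidden))
        self.head = nn.Linear(img_dim + hidden, n_classes)

    def forward(self, img_feat, meta):   # img_feat: (B, 768), meta: (B, 2)
        return self.head(torch.cat([img_feat, self.meta_mlp(meta)], dim=-1))

feat = torch.randn(8, 768)                           # e.g., transformer CLS features
meta = torch.stack([torch.rand(8) * 18,              # toy pediatric age in years
                    torch.randint(0, 2, (8,)).float()], dim=1)  # toy sex flag
print(MetadataFusionClassifier()(feat, meta).shape)  # torch.Size([8, 4])
```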
[188] Hallo4: High-Fidelity Dynamic Portrait Animation via Direct Preference Optimization and Temporal Motion Modulation
Jiahao Cui, Yan Chen, Mingwang Xu, Hanlin Shang, Yuxuan Chen, Yun Zhan, Zilong Dong, Yao Yao, Jingdong Wang, Siyu Zhu
Main category: cs.CV
TL;DR: A diffusion framework for photorealistic portrait animation that uses human preference optimization and temporal motion modulation to improve lip sync, facial expressions, and body motion dynamics.
Details
Motivation: Addressing challenges in generating highly dynamic and photorealistic portrait animations driven by audio and skeletal motion, particularly the need for precise lip synchronization, natural facial expressions, and high-fidelity body motion dynamics.
Method: Two key innovations: 1) Direct preference optimization tailored for human-centric animation using curated human preference data, 2) Temporal motion modulation that resolves spatiotemporal resolution mismatches through temporal channel redistribution and proportional feature expansion.
Result: Experiments demonstrate obvious improvements in lip-audio synchronization, expression vividness, body motion coherence over baseline methods, alongside notable gains in human preference metrics.
Conclusion: The proposed framework effectively addresses portrait animation challenges through human-preference alignment and temporal motion modulation, showing significant improvements in both technical metrics and human perceptual quality.
Abstract: Generating highly dynamic and photorealistic portrait animations driven by audio and skeletal motion remains challenging due to the need for precise lip synchronization, natural facial expressions, and high-fidelity body motion dynamics. We propose a human-preference-aligned diffusion framework that addresses these challenges through two key innovations. First, we introduce direct preference optimization tailored for human-centric animation, leveraging a curated dataset of human preferences to align generated outputs with perceptual metrics for portrait motion-video alignment and naturalness of expression. Second, the proposed temporal motion modulation resolves spatiotemporal resolution mismatches by reshaping motion conditions into dimensionally aligned latent features through temporal channel redistribution and proportional feature expansion, preserving the fidelity of high-frequency motion details in diffusion-based synthesis. The proposed mechanism is complementary to existing UNet and DiT-based portrait diffusion approaches, and experiments demonstrate obvious improvements in lip-audio synchronization, expression vividness, body motion coherence over baseline methods, alongside notable gains in human preference metrics. Our model and source code can be found at: https://github.com/xyz123xyz456/hallo4.
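For orientation, the standard DPO objective from preference learning is sketched below; the paper adapts direct preference optimization to diffusion-based animation, so treat this as the reference form rather than the authors' exact loss.

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """All args: (B,) log-likelihoods; _w = human-preferred, _l = rejected.
    ref_* come from a frozen reference model."""
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    return -F.logsigmoid(beta * margin).mean()

loss = dpo_loss(torch.tensor([-5.0]), torch.tensor([-7.0]),
                torch.tensor([-6.0]), torch.tensor([-6.0]))
print(loss)  # ~0.60: the preferred sample gained probability vs. the reference
```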
[189] TriCLIP-3D: A Unified Parameter-Efficient Framework for Tri-Modal 3D Visual Grounding based on CLIP
Fan Li, Zanyi Wang, Zeyi Huang, Guang Dai, Jingdong Wang, Mengmeng Wang
Main category: cs.CV
TL;DR: Proposes GARF, a unified 2D pre-trained multi-modal network for 3D visual grounding that reduces parameters by 58% while improving performance by ~6.5% on detection and grounding tasks.
Details
Motivation: Existing 3D visual grounding methods use separate encoders for different modalities, resulting in large, complex, and inefficient models that struggle with aligning point cloud data to 2D encoders.
Method: Uses a unified 2D CLIP bi-modal model with adapter-based fine-tuning, plus a Geometric-Aware 2D-3D Feature Recovery and Fusion (GARF) module to fuse geometric multi-scale features from point clouds and images, integrated with textual features and a multi-modal decoder.
Result: Reduces trainable parameters by approximately 58% while achieving 6.52% improvement in 3D detection and 6.25% improvement in 3D visual grounding compared to baseline.
Conclusion: The proposed method enables unified feature extraction and fusion across three modalities (RGB images, text, point clouds) in an end-to-end 3D visual grounding model, significantly simplifying architecture while improving performance.
Abstract: 3D visual grounding allows an embodied agent to understand visual information in real-world 3D environments based on human instructions, which is crucial for embodied intelligence. Existing 3D visual grounding methods typically rely on separate encoders for different modalities (e.g., RGB images, text, and 3D point clouds), resulting in large and complex models that are inefficient to train. While some approaches use pre-trained 2D multi-modal models like CLIP for 3D tasks, they still struggle with aligning point cloud data to 2D encoders. As a result, these methods continue to depend on 3D encoders for feature extraction, further increasing model complexity and training inefficiency. In this paper, we propose a unified 2D pre-trained multi-modal network to process all three modalities (RGB images, text, and point clouds), significantly simplifying the architecture. By leveraging a 2D CLIP bi-modal model with adapter-based fine-tuning, this framework effectively adapts to the tri-modal setting, improving both adaptability and performance across modalities. Our Geometric-Aware 2D-3D Feature Recovery and Fusion (GARF) module is designed to fuse geometric multi-scale features from point clouds and images. We then integrate textual features for final modality fusion and introduce a multi-modal decoder to facilitate deep cross-modal understanding. Together, our method achieves unified feature extraction and fusion across the three modalities, enabling an end-to-end 3D visual grounding model. Compared to the baseline, our method reduces the number of trainable parameters by approximately 58%, while achieving a 6.52% improvement in the 3D detection task and a 6.25% improvement in the 3D visual grounding task.
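Adapter-based fine-tuning, as mentioned in the method, typically means inserting small bottleneck modules into a frozen backbone and training only those. A minimal sketch (bottleneck size and insertion points are assumptions, not the paper's design):

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    def __init__(self, dim=768, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        nn.init.zeros_(self.up.weight)   # start as an identity mapping
        nn.init.zeros_(self.up.bias)

    def forward(self, x):
        return x + self.up(torch.relu(self.down(x)))  # residual bottleneck

# Typical usage: freeze the CLIP backbone, train only adapter parameters.
x = torch.randn(4, 197, 768)             # e.g., CLIP ViT token features
print(Adapter()(x).shape)                # torch.Size([4, 197, 768])
```

Because the up-projection is zero-initialized, the adapted model starts out exactly equal to the frozen backbone, which keeps early training stable.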
[190] Vision-Based Autonomous MM-Wave Reflector Using ArUco-Driven Angle-of-Arrival Estimation
Josue Marroquin, Nan Inzali, Miles Dillon Lantz, Campbell Freeman, Amod Ashtekar, Ajinkya Umesh Mulik, Mohammed E Eltayeb
Main category: cs.CV
TL;DR: Vision-aided autonomous reflector system enhances mmWave NLoS communication by dynamically steering signal reflections using motorized metallic plate and camera-based target detection.
Details
Motivation: Reliable mmWave communication in non-line-of-sight conditions is challenging in urban/infrastructure-limited environments for both military and civilian operations.
Method: Uses monocular camera to detect ArUco markers on transmitter/receiver nodes, estimate angles of arrival, and align the reflector in real time for optimal signal redirection. Built on Raspberry Pi 4 with low-power hardware.
Result: 23 dB average gain in received signal strength and 0.89 probability of maintaining signal reception above -65 dB threshold in indoor environment, significantly outperforming static and no-reflector baselines.
Conclusion: System demonstrates potential for resilient and adaptive mmWave connectivity in complex dynamic environments with selective beam coverage for authenticated targets.
Abstract: Reliable millimeter-wave (mmWave) communication in non-line-of-sight (NLoS) conditions remains a major challenge for both military and civilian operations, especially in urban or infrastructure-limited environments. This paper presents a vision-aided autonomous reflector system designed to enhance mmWave link performance by dynamically steering signal reflections using a motorized metallic plate. The proposed system leverages a monocular camera to detect ArUco markers on allied transmitter and receiver nodes, estimate their angles of arrival, and align the reflector in real time for optimal signal redirection. This approach enables selective beam coverage by serving only authenticated targets with visible markers and reduces the risk of unintended signal exposure. The designed prototype, built on a Raspberry Pi 4 and low-power hardware, operates autonomously without reliance on external infrastructure or GPS. Experimental results at 60 GHz demonstrate a 23 dB average gain in received signal strength and a 0.89 probability of maintaining signal reception above a target threshold of -65 dB in an indoor environment, far exceeding the static and no-reflector baselines. These results demonstrate the system’s potential for resilient and adaptive mmWave connectivity in complex and dynamic environments.
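The vision step can be pictured with standard OpenCV ArUco detection plus a pinhole-model bearing estimate. This is a hedged sketch, not the authors' code: it assumes OpenCV 4.7+ (where the ArucoDetector API lives), a calibrated focal length, and a particular marker dictionary.

```python
import cv2
import numpy as np

FX = 1000.0    # assumed focal length in pixels (from camera calibration)
CX = 640.0     # assumed principal point (image width 1280)

dictionary = cv2.aruco.getPredefinedDictionary(cv2.aruco.DICT_4X4_50)
detector = cv2.aruco.ArucoDetector(dictionary, cv2.aruco.DetectorParameters())

def marker_bearings(gray):
    """Return {marker_id: horizontal bearing in degrees} for one grayscale frame."""
    corners, ids, _ = detector.detectMarkers(gray)
    bearings = {}
    if ids is None:
        return bearings
    for quad, mid in zip(corners, ids.flatten()):
        u = quad[0][:, 0].mean()                       # marker center column
        bearings[int(mid)] = np.degrees(np.arctan2(u - CX, FX))
    return bearings

frame = np.zeros((720, 1280), dtype=np.uint8)          # placeholder frame
print(marker_bearings(frame))                          # {} -> no markers found
```

Given the transmitter and receiver bearings, the control loop would then rotate the plate so its normal bisects the two angles; that steering logic is the paper's contribution and is not reproduced here.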
[191] Conditional Video Generation for High-Efficiency Video Compression
Fangqiu Yi, Jingyu Xu, Jiawei Shao, Chi Zhang, Xuelong Li
Main category: cs.CV
TL;DR: A video compression framework using conditional diffusion models that outperforms traditional and neural codecs on perceptual quality metrics, particularly at high compression ratios.
Details
Motivation: Leverage the perceptual strengths of conditional diffusion models for video reconstruction to create a more human-perception-aligned compression system.
Method: Reframe video compression as conditional generation with three key modules: multi-granular conditioning, compact representations, and multi-condition training with modality dropout and role-aware embeddings.
Result: Significantly outperforms both traditional and neural codecs on perceptual quality metrics (FVD, LPIPS), especially under high compression ratios.
Conclusion: Conditional diffusion models provide an effective framework for perceptually optimized video compression, demonstrating superior performance over existing approaches.
Abstract: Perceptual studies demonstrate that conditional diffusion models excel at reconstructing video content aligned with human visual perception. Building on this insight, we propose a video compression framework that leverages conditional diffusion models for perceptually optimized reconstruction. Specifically, we reframe video compression as a conditional generation task, where a generative model synthesizes video from sparse, yet informative signals. Our approach introduces three key modules: (1) Multi-granular conditioning that captures both static scene structure and dynamic spatio-temporal cues; (2) Compact representations designed for efficient transmission without sacrificing semantic richness; (3) Multi-condition training with modality dropout and role-aware embeddings, which prevent over-reliance on any single modality and enhance robustness. Extensive experiments show that our method significantly outperforms both traditional and neural codecs on perceptual quality metrics such as Fréchet Video Distance (FVD) and LPIPS, especially under high compression ratios.
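Modality dropout, one of the three modules, is straightforward to sketch: each conditioning signal is independently zeroed during training so the generator cannot over-rely on any single one. The modality names and drop probability below are assumptions.

```python
import torch

def modality_dropout(conds: dict, p_drop=0.3, training=True):
    """conds: {modality_name: tensor}. Independently zero each modality."""
    if not training:
        return conds
    out = {}
    for name, tensor in conds.items():
        keep = torch.rand(()).item() >= p_drop
        out[name] = tensor if keep else torch.zeros_like(tensor)
    return out

conds = {"structure": torch.randn(1, 16), "motion": torch.randn(1, 16)}
print({k: v.abs().sum().item() for k, v in modality_dropout(conds).items()})
```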
[192] Ecological Legacies of Pre-Columbian Settlements Evident in Palm Clusters of Neotropical Mountain Forests
Sebastian Fajardo, Sina Mohammadi, Jonas Gregorio de Souza, César Ardila, Alan Tapscott Baltar, Shaddai Heidgen, Maria Isabel Mayorga Hernández, Sylvia Mota de Oliveira, Fernando Montejo, Marco Moderato, Vinicius Peripato, Katy Puche, Carlos Reina, Juan Carlos Vargas, Frank W. Takes, Marco Madella
Main category: cs.CV
TL;DR: Deep learning and remote sensing reveal pre-Columbian forest modification through palm tree distributions near archaeological sites in Colombia, showing human influence extended much farther than previously documented.
Details
Motivation: To understand the spatial extent of ancient human ecological influence on Neotropical forests, which remains underexplored at high resolution despite known transformations by pre-Columbian populations.
Method: Used deep learning and remote sensing approach on high-resolution satellite imagery from the Sierra Nevada de Santa Marta, Colombia, to analyze palm tree distributions in relation to archaeological infrastructure.
Result: Palms were significantly more abundant near archaeological sites with large infrastructure investment. The largest palm cluster suggests ancient human-managed areas may be up to two orders of magnitude larger than current archaeological evidence indicates.
Conclusion: Pre-Columbian populations significantly influenced vegetation, creating conditions favorable for palm proliferation and leaving lasting ecological footprints that may have reduced logistical costs for establishing infrastructure-heavy settlements in remote locations.
Abstract: Ancient populations markedly transformed Neotropical forests, yet the spatial extent of their ecological influence remains underexplored at high resolution. Here we present a deep learning and remote sensing based approach to estimate areas of pre-Columbian forest modification based on modern vegetation. We apply this method to high-resolution satellite imagery from the Sierra Nevada de Santa Marta, Colombia, as a demonstration of a scalable approach, to evaluate palm tree distributions in relation to archaeological infrastructure. Palms were significantly more abundant near archaeological sites with large infrastructure investment. The extent of the largest palm cluster indicates that ancient human-managed areas linked to major infrastructure sites may be up to two orders of magnitude bigger than indicated by current archaeological evidence alone. Our findings suggest that pre-Columbian populations influenced vegetation, fostering conditions conducive to palm proliferation, leaving a lasting ecological footprint. This may have lowered the logistical costs of establishing infrastructure-heavy settlements in less accessible locations.
[193] LOTS of Fashion! Multi-Conditioning for Image Generation via Sketch-Text Pairing
Federico Girella, Davide Talon, Ziyue Liu, Zanxi Ruan, Yiming Wang, Marco Cristani
Main category: cs.CV
TL;DR: LOTS is a novel approach for fashion image generation that combines global descriptions with localized sketch-text pairs using a diffusion model with step-based merging strategy and attention-based guidance.
Details
Motivation: Fashion design requires blending visual sketches and textual descriptions, but existing methods lack effective integration of localized sketch-text information for complete fashion outlook generation.
Method: Uses Modularized Pair-Centric representation to encode sketches and text into shared latent space, then Diffusion Pair Guidance with attention-based guidance during the multi-step denoising process in the diffusion model.
Result: Achieves state-of-the-art performance on both global and localized metrics, with qualitative examples and human evaluation showing unprecedented design customization capabilities.
Conclusion: LOTS provides an effective framework for compositional sketch-text based fashion image generation, enabling high levels of design customization through localized conditioning.
Abstract: Fashion design is a complex creative process that blends visual and textual expressions. Designers convey ideas through sketches, which define spatial structure and design elements, and textual descriptions, capturing material, texture, and stylistic details. In this paper, we present LOcalized Text and Sketch for fashion image generation (LOTS), an approach for compositional sketch-text based generation of complete fashion outlooks. LOTS leverages a global description with paired localized sketch + text information for conditioning and introduces a novel step-based merging strategy for diffusion adaptation. First, a Modularized Pair-Centric representation encodes sketches and text into a shared latent space while preserving independent localized features; then, a Diffusion Pair Guidance phase integrates both local and global conditioning via attention-based guidance within the diffusion model’s multi-step denoising process. To validate our method, we build on Fashionpedia to release Sketchy, the first fashion dataset where multiple text-sketch pairs are provided per image. Quantitative results show LOTS achieves state-of-the-art image generation performance on both global and localized metrics, while qualitative examples and a human evaluation study highlight its unprecedented level of design customization.
[194] Foundations and Models in Modern Computer Vision: Key Building Blocks in Landmark Architectures
Radu-Andrei Bourceanu, Neil De La Fuente, Jan Grimm, Andrei Jardan, Andriy Manucharyan, Cornelius Weiss, Daniel Cremers, Roman Pflugfelder
Main category: cs.CV
TL;DR: Analysis of 6 influential computer vision papers covering ResNet, ViT, GANs, LDMs, DINO, and MAE, tracing evolution from foundational recognition to generative models and self-supervised learning.
Details
Motivation: To examine the evolutionary trajectory of key design patterns in computer vision by analyzing foundational architectures, generative models, and self-supervised learning techniques that have shaped the field.
Method: Comparative analysis of six landmark papers: ResNet (residual connections), ViT (Transformer for vision), GANs (adversarial training), LDMs (latent diffusion), DINO (self-distillation), and MAE (masked autoencoding).
Result: Identifies progression from solving vanishing gradients (ResNet) to attention-based paradigms (ViT), then to generative modeling (GANs, LDMs), and finally to label-efficient self-supervised methods (DINO, MAE).
Conclusion: Computer vision has evolved through distinct phases: deeper networks, attention mechanisms, generative modeling, and self-supervised learning, with each innovation building upon previous foundations to address specific challenges in the field.
Abstract: This report analyzes the evolution of key design patterns in computer vision by examining six influential papers. The analysis begins with foundational architectures for image recognition. We review ResNet, which introduced residual connections to overcome the vanishing gradient problem and enable effective training of significantly deeper convolutional networks. Subsequently, we examine the Vision Transformer (ViT), which established a new paradigm by applying the Transformer architecture to sequences of image patches, demonstrating the efficacy of attention-based models for large-scale image recognition. Building on these visual representation backbones, we investigate generative models. Generative Adversarial Networks (GANs) are analyzed for their novel adversarial training process, which challenges a generator against a discriminator to learn complex data distributions. Then, Latent Diffusion Models (LDMs) are covered, which improve upon prior generative methods by performing a sequential denoising process in a perceptually compressed latent space. LDMs achieve high-fidelity synthesis with greater computational efficiency, representing the current state-of-the-art for image generation. Finally, we explore self-supervised learning techniques that reduce dependency on labeled data. DINO is a self-distillation framework in which a student network learns to match the output of a momentum-updated teacher, yielding features with strong k-NN classification performance. We conclude with Masked Autoencoders (MAE), which utilize an asymmetric encoder-decoder design to reconstruct heavily masked inputs, providing a highly scalable and effective method for pre-training large-scale vision models.
[195] TexVerse: A Universe of 3D Objects with High-Resolution Textures
Yibo Zhang, Li Zhang, Rui Ma, Nan Cao
Main category: cs.CV
TL;DR: TexVerse is a large-scale 3D dataset with high-resolution textures, featuring over 858K unique models (1.6M total instances) including PBR materials, rigged models, and animations with detailed annotations.
Details
Motivation: Address the lack of suitable large-scale datasets for high-resolution texture generation in 3D models, as existing datasets focus on geometry but neglect texture quality and variety.
Method: Curated collection of over 858K high-resolution 3D models from Sketchfab, including specialized subsets for rigged models (TexVerse-Skeleton) and animated models (TexVerse-Animation), with detailed annotations describing model characteristics and features.
Result: Created a comprehensive dataset with 158K+ PBR material models, 69K rigged models, and 54K animated models, preserving original skeleton and animation data with high-resolution texture variants.
Conclusion: TexVerse provides a high-quality resource for texture synthesis, PBR material development, animation, and various 3D vision/graphics tasks, filling a critical gap in 3D data availability.
Abstract: We introduce TexVerse, a large-scale 3D dataset featuring high-resolution textures. While recent advances in large-scale 3D datasets have enhanced high-resolution geometry generation, creating high-resolution textures end-to-end remains underexplored due to the lack of suitable datasets. TexVerse fills this gap with a curated collection of over 858K unique high-resolution 3D models sourced from Sketchfab, including more than 158K models with physically based rendering (PBR) materials. Each model encompasses all of its high-resolution variants, bringing the total to 1.6M 3D instances. TexVerse also includes specialized subsets: TexVerse-Skeleton, with 69K rigged models, and TexVerse-Animation, with 54K animated models, both preserving original skeleton and animation data uploaded by the user. We also provide detailed model annotations describing overall characteristics, structural components, and intricate features. TexVerse offers a high-quality data resource with wide-ranging potential applications in texture synthesis, PBR material development, animation, and various 3D vision and graphics tasks.
[196] DianJin-OCR-R1: Enhancing OCR Capabilities via a Reasoning-and-Tool Interleaved Vision-Language Model
Qian Chen, Xianyin Zhang, Lifan Guo, Feng Chen, Chi Zhang
Main category: cs.CV
TL;DR: DianJin-OCR-R1 is a reasoning-enhanced framework that combines vision-language models with expert OCR tools to reduce hallucinations and improve document recognition accuracy.
Details
Motivation: Large vision-language models suffer from hallucinations and underperform specialized OCR models on domain-specific tasks, needing a solution that leverages both general capabilities and expert precision.
Method: A reasoning-and-tool interleaved framework where the model first uses its own OCR, then calls expert tools for references, and finally re-examines the image to provide the final recognition content.
Result: Outperforms both non-reasoning counterparts and expert OCR models on ReST and OmniDocBench benchmarks, demonstrating reduced hallucinations and improved accuracy.
Conclusion: Combining VLMs with expert models through reasoning-enhanced frameworks effectively mitigates hallucinations and enhances OCR performance while enabling easier iteration through smaller expert models.
Abstract: Recent advances in large vision-language models (LVLMs) have enabled a new paradigm of end-to-end document image parsing, excelling in Optical Character Recognition (OCR) tasks such as text, table, and formula recognition. However, generative LVLMs, similarly to large language models (LLMs), are prone to hallucinations–generating words that do not exist in input images. Furthermore, LVLMs are designed for general purposes and tend to be less effective on OCR tasks compared to expert models that are trained on domain-specific datasets. In this paper, we propose DianJin-OCR-R1, a reasoning-enhanced framework designed to address these limitations through training reasoning-and-tool interleaved VLMs. Given a recognition instruction, our DianJin-OCR-R1 model first recognizes the content in the input image by its own OCR capabilities, then calls other tools (i.e., other expert models) to obtain their results as references, and finally “looks again” at the image and rethinks the reasoning process to provide the final recognized content. Since the architectures of expert models are tailored for specific OCR tasks, which makes them less prone to hallucinations, their results can help VLMs mitigate hallucinations. We evaluate our model on ReST and OmniDocBench, and experimental results show that our DianJin-OCR-R1 models consistently outperform their non-reasoning counterparts and expert OCR models, which proves the effectiveness of our method. Additionally, the results indicate that enhancing expert models, which are typically small and easy to iterate, enables performance improvements for VLMs.
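The reasoning-and-tool interleaved loop reduces to three steps, sketched below with placeholder functions (vlm_ocr, expert_tools, and vlm_rethink are hypothetical names, not the authors' API):

```python
def dianjin_style_ocr(image, vlm_ocr, expert_tools, vlm_rethink):
    draft = vlm_ocr(image)                                # 1: the VLM's own reading
    references = [tool(image) for tool in expert_tools]  # 2: expert OCR references
    return vlm_rethink(image, draft, references)         # 3: "look again" and decide

# Toy usage with stub functions standing in for real models:
result = dianjin_style_ocr(
    "img.png",
    vlm_ocr=lambda im: "drafttext",
    expert_tools=[lambda im: "draft text"],
    vlm_rethink=lambda im, d, refs: refs[0] if refs else d,
)
print(result)  # "draft text"
```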
[197] DIO: Refining Mutual Information and Causal Chain to Enhance Machine Abstract Reasoning Ability
Ruizhuo Song, Beiming Yuan
Main category: cs.CV
TL;DR: This paper addresses the abstract reasoning bottleneck in deep learning by focusing on Raven’s Progressive Matrices (RPM) problems. It proposes a causal chain modeling approach but finds limitations in mutual information maximization, leading to three progressive improvement methods.
Details
Motivation: To enhance the abstract reasoning capabilities of machine intelligence by solving RPM problems, which serve as an authoritative benchmark for evaluating core intelligence dimensions like abstract reasoning, pattern recognition, and complex problem-solving.
Method: Adopts a causal chain modeling perspective to analyze RPM tasks, designs the DIO baseline network architecture, and progressively proposes three improvement methods to overcome limitations of mutual information maximization.
Result: Experiments reveal that the initial optimization objective (maximizing variational lower bound of mutual information) fails to enable genuine acquisition of human reasoning logic due to tightness of lower bound and inability to capture causal relationships.
Conclusion: The paper identifies fundamental limitations in current mutual information approaches for abstract reasoning and proposes progressive improvements to better capture causal relationships and human reasoning logic in RPM problem-solving.
Abstract: Despite the outstanding performance of current deep learning models across various domains, their fundamental bottleneck in abstract reasoning remains unresolved. To address this challenge, the academic community has introduced Raven’s Progressive Matrices (RPM) problems as an authoritative benchmark for evaluating the abstract reasoning capabilities of deep learning algorithms, with a focus on core intelligence dimensions such as abstract reasoning, pattern recognition, and complex problem-solving. Therefore, this paper centers on solving RPM problems, aiming to contribute to enhancing the abstract reasoning abilities of machine intelligence. Firstly, this paper adopts a “causal chain modeling” perspective to systematically analyze the complete causal chain in RPM tasks: image $\rightarrow$ abstract attributes $\rightarrow$ progressive attribute patterns $\rightarrow$ pattern consistency $\rightarrow$ correct answer. Based on this analysis, the network architecture of the baseline model DIO is designed. However, experiments reveal that the optimization objective formulated for DIO, namely maximizing the variational lower bound of mutual information between the context and the correct option, fails to enable the model to genuinely acquire the predefined human reasoning logic. This is attributed to two main reasons: the tightness of the lower bound significantly impacts the effectiveness of mutual information maximization, and mutual information, as a statistical measure, does not capture the causal relationship between subjects and objects. To overcome these limitations, this paper progressively proposes three improvement methods:
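For context, the most common variational lower bound of this kind is the InfoNCE bound; the abstract does not say whether DIO uses exactly this estimator, so the form below is a reference point rather than the paper's objective:

```latex
I(C; O) \;\ge\; \log N - \mathcal{L}_{\mathrm{InfoNCE}},
\qquad
\mathcal{L}_{\mathrm{InfoNCE}}
= -\,\mathbb{E}\left[\log \frac{e^{f(c,\,o^{+})}}{\sum_{i=1}^{N} e^{f(c,\,o_{i})}}\right]
```

where c is the context, o⁺ the correct option among N candidates, and f a learned critic. Note that the bound can never exceed log N, which is one concrete sense in which its tightness can limit mutual-information maximization, consistent with the paper's critique.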
[198] HLG: Comprehensive 3D Room Construction via Hierarchical Layout Generation
Xiping Wang, Yuxi Wang, Mengqi Zhou, Junsong Fan, Zhaoxiang Zhang
Main category: cs.CV
TL;DR: HLG is a hierarchical method for fine-grained 3D indoor scene generation that refines layouts from furniture placement to detailed object arrangements using layout alignment and optimization networks.
Details
Motivation: Existing methods struggle with fine-grained object placements, limiting realism and utility for VR, interior design, and embodied AI applications that require detailed scene comprehension.
Method: Coarse-to-fine hierarchical approach with fine-grained layout alignment module (vertical/horizontal decoupling) and trainable layout optimization network to fix positioning, orientation, and intersection issues.
Result: Superior performance in generating realistic indoor scenes compared to existing methods, demonstrated through extensive experiments.
Conclusion: Advances scene generation field and enables detailed 3D environments for various applications; code will be released to encourage future research.
Abstract: Realistic 3D indoor scene generation is crucial for virtual reality, interior design, embodied intelligence, and scene understanding. While existing methods have made progress in coarse-scale furniture arrangement, they struggle to capture fine-grained object placements, limiting the realism and utility of generated environments. This gap hinders immersive virtual experiences and detailed scene comprehension for embodied AI applications. To address these issues, we propose Hierarchical Layout Generation (HLG), a novel method for fine-grained 3D scene generation. HLG is the first to adopt a coarse-to-fine hierarchical approach, refining scene layouts from large-scale furniture placement to intricate object arrangements. Specifically, our fine-grained layout alignment module constructs a hierarchical layout through vertical and horizontal decoupling, effectively decomposing complex 3D indoor scenes into multiple levels of granularity. Additionally, our trainable layout optimization network addresses placement issues, such as incorrect positioning, orientation errors, and object intersections, ensuring structurally coherent and physically plausible scene generation. We demonstrate the effectiveness of our approach through extensive experiments, showing superior performance in generating realistic indoor scenes compared to existing methods. This work advances the field of scene generation and opens new possibilities for applications requiring detailed 3D environments. We will release our code upon publication to encourage future research.
[199] Drawing2CAD: Sequence-to-Sequence Learning for CAD Generation from Vector Drawings
Feiwei Qin, Shichao Lu, Junhao Hou, Changmiao Wang, Meie Fang, Ligang Liu
Main category: cs.CV
TL;DR: Drawing2CAD is a framework that converts 2D engineering drawings to parametric CAD models using sequence-to-sequence learning with specialized vector representations and dual-decoder transformers.
Details
Motivation: Traditional CAD generative methods don't align with industrial workflows that start from 2D engineering drawings, creating a gap in automatic parametric CAD generation from vector drawings.
Method: Uses network-friendly vector primitive representation, dual-decoder transformer architecture for command type and parameter generation, and soft target distribution loss function.
Result: Developed CAD-VGDrawing dataset and demonstrated effective conversion of engineering drawings to parametric CAD models with preserved geometric precision.
Conclusion: The framework successfully bridges the gap between 2D engineering drawings and parametric CAD generation, maintaining design intent and geometric accuracy throughout the transformation process.
Abstract: Computer-Aided Design (CAD) generative modeling is driving significant innovations across industrial applications. Recent works have shown remarkable progress in creating solid models from various inputs such as point clouds, meshes, and text descriptions. However, these methods fundamentally diverge from traditional industrial workflows that begin with 2D engineering drawings. The automatic generation of parametric CAD models from these 2D vector drawings remains underexplored despite being a critical step in engineering design. To address this gap, our key insight is to reframe CAD generation as a sequence-to-sequence learning problem where vector drawing primitives directly inform the generation of parametric CAD operations, preserving geometric precision and design intent throughout the transformation process. We propose Drawing2CAD, a framework with three key technical components: a network-friendly vector primitive representation that preserves precise geometric information, a dual-decoder transformer architecture that decouples command type and parameter generation while maintaining precise correspondence, and a soft target distribution loss function accommodating inherent flexibility in CAD parameters. To train and evaluate Drawing2CAD, we create CAD-VGDrawing, a dataset of paired engineering drawings and parametric CAD models, and conduct thorough experiments to demonstrate the effectiveness of our method. Code and dataset are available at https://github.com/lllssc/Drawing2CAD.
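The soft target distribution loss can be pictured as cross-entropy against a softened target over quantized parameter bins, so that near-miss parameter predictions are penalized less than distant ones. The Gaussian kernel and its width below are assumptions; the paper's exact choice is not given in this summary.

```python
import torch
import torch.nn.functional as F

def soft_target_loss(logits, target_bins, n_bins, sigma=1.0):
    """logits: (B, n_bins); target_bins: (B,) integer bin indices."""
    bins = torch.arange(n_bins, dtype=torch.float32)
    dist2 = (bins.unsqueeze(0) - target_bins.unsqueeze(1).float()) ** 2
    soft = torch.softmax(-dist2 / (2 * sigma ** 2), dim=-1)  # softened targets
    return -(soft * F.log_softmax(logits, dim=-1)).sum(-1).mean()

logits = torch.randn(4, 256)               # 256 quantized parameter bins (toy)
target = torch.tensor([10, 128, 200, 40])
print(soft_target_loss(logits, target, 256).item())
```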
[200] BuzzSet v1.0: A Dataset for Pollinator Detection in Field Conditions
Ahmed Emam, Mohamed Elbassiouny, Julius Miller, Patrick Donworth, Sabine Seidel, Ribana Roscher
Main category: cs.CV
TL;DR: BuzzSet v1.0 is a large-scale dataset of 7,856 high-resolution pollinator images with 8,000+ annotated instances for automated insect detection in agricultural environments, achieving strong classification performance with transformer-based object detection.
Details
Motivation: Pollinator populations are declining due to environmental stressors, but scalable automated monitoring remains challenging due to difficulties detecting small, fast-moving, and camouflaged insects in field conditions.
Method: Created BuzzSet dataset with manually verified images using YOLOv12 for initial annotations and human refinement. Images preprocessed into 256x256 tiles. Used RF-DETR transformer-based object detector for baseline evaluation.
Result: Achieved F1 scores of 0.94 for honeybees and 0.92 for bumblebees with minimal confusion between categories. Overall mAP at 0.50 of 0.559, showing strong classification but highlighting detection challenges for small camouflaged insects.
Conclusion: BuzzSet establishes a benchmark for ecological computer vision, demonstrating the challenge of detecting camouflaged insects in natural vegetation and providing a foundation for future research in small object detection under realistic conditions.
Abstract: Pollinator insects such as honeybees and bumblebees are vital to global food production and ecosystem stability, yet their populations are declining due to anthropogenic and environmental stressors. Scalable, automated monitoring in agricultural environments remains an open challenge due to the difficulty of detecting small, fast-moving, and often camouflaged insects. To address this, we present BuzzSet v1.0, a large-scale dataset of high-resolution pollinator images collected under real field conditions. BuzzSet contains 7,856 manually verified images with more than 8,000 annotated instances across three classes: honeybees, bumblebees, and unidentified insects. Initial annotations were produced using a YOLOv12 model trained on external data and refined through human verification with open-source tools. All images were preprocessed into 256 x 256 tiles to improve the detection of small insects. We provide baselines using the RF-DETR transformer-based object detector. The model achieves strong classification accuracy with F1 scores of 0.94 and 0.92 for honeybees and bumblebees, with minimal confusion between these categories. The unidentified class remains more difficult due to label ambiguity and fewer samples, yet still contributes insights for robustness evaluation. Overall detection performance (mAP at 0.50 of 0.559) illustrates the challenging nature of the dataset and its potential to drive advances in small object detection under realistic ecological conditions. Future work focuses on expanding the dataset to version 2.0 with additional annotations and evaluating further detection strategies. BuzzSet establishes a benchmark for ecological computer vision, with the primary challenge being reliable detection of insects frequently camouflaged within natural vegetation, highlighting an open problem for future research.
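The 256 x 256 tiling step is straightforward to reproduce. A minimal sketch follows; the function name and the non-overlapping grid are our assumptions, and the released preprocessing may differ (e.g., overlapping tiles):

```python
from PIL import Image

def tile_image(path, tile=256):
    """Split one high-resolution field image into a non-overlapping grid of tiles."""
    img = Image.open(path)
    width, height = img.size
    return [
        img.crop((left, top, left + tile, top + tile))
        for top in range(0, height - tile + 1, tile)
        for left in range(0, width - tile + 1, tile)
    ]
```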
[201] Encoder-Only Image Registration
Xiang Chen, Renjiu Hu, Jinwei Zhang, Yuxi Zhang, Xinyao Yue, Min Liu, Yaonan Wang, Hang Zhang
Main category: cs.CV
TL;DR: Proposes EOIR framework using 3-layer ConvNet for feature extraction and flow estimation to achieve better accuracy-efficiency trade-off in deformable image registration.
Details
Motivation: Address challenges in learning-based deformable image registration including computational complexity and handling large deformations by analyzing ConvNets' roles in registration.
Method: Encoder-Only Image Registration (EOIR) framework separates feature learning from flow estimation, uses 3-layer ConvNet for feature extraction and 3-layer flow estimators to build Laplacian feature pyramid for progressive diffeomorphic deformations.
Result: Superior accuracy-efficiency and accuracy-smoothness trade-offs across five datasets of different modalities and anatomical regions, with comparable accuracy but better efficiency/smoothness.
Conclusion: EOIR effectively addresses registration challenges by leveraging ConvNets’ dual roles in linearizing intensities and harmonizing contrast variations, providing a practical solution with publicly available code.
Abstract: Learning-based techniques have significantly improved the accuracy and speed of deformable image registration. However, challenges such as reducing computational complexity and handling large deformations persist. To address these challenges, we analyze how convolutional neural networks (ConvNets) influence registration performance using the Horn-Schunck optical flow equation. Supported by prior studies and our empirical experiments, we observe that ConvNets play two key roles in registration: linearizing local intensities and harmonizing global contrast variations. Based on these insights, we propose the Encoder-Only Image Registration (EOIR) framework, designed to achieve a better accuracy-efficiency trade-off. EOIR separates feature learning from flow estimation, employing only a 3-layer ConvNet for feature extraction and a set of 3-layer flow estimators to construct a Laplacian feature pyramid, progressively composing diffeomorphic deformations under a large-deformation model. Results on five datasets across different modalities and anatomical regions demonstrate EOIR’s effectiveness, achieving superior accuracy-efficiency and accuracy-smoothness trade-offs. With comparable accuracy, EOIR provides better efficiency and smoothness, and vice versa. The source code of EOIR is publicly available on https://github.com/XiangChen1994/EOIR.
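To picture how small the "encoder-only" design is, here is a sketch of a 3-layer ConvNet feature extractor of the kind the abstract describes. Channel widths, 3D convolutions, and activations are our assumptions, not the released architecture:

```python
import torch.nn as nn

class FeatureEncoder(nn.Module):
    """3-layer ConvNet used only for feature extraction; flow estimation
    is handled by separate small estimators in the EOIR design."""
    def __init__(self, in_channels=1, channels=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(in_channels, channels, kernel_size=3, padding=1),
            nn.LeakyReLU(0.1),
            nn.Conv3d(channels, channels, kernel_size=3, padding=1),
            nn.LeakyReLU(0.1),
            nn.Conv3d(channels, channels, kernel_size=3, padding=1),
        )

    def forward(self, volume):
        return self.net(volume)
```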
[202] Kwai Keye-VL 1.5 Technical Report
Biao Yang, Bin Wen, Boyang Ding, Changyi Liu, Chenglong Chu, Chengru Song, Chongling Rao, Chuan Yi, Da Li, Dunju Zang, Fan Yang, Guorui Zhou, Guowang Zhang, Han Shen, Hao Peng, Haojie Ding, Hao Wang, Haonan Fang, Hengrui Ju, Jiaming Huang, Jiangxia Cao, Jiankang Chen, Jingyun Hua, Kaibing Chen, Kaiyu Jiang, Kaiyu Tang, Kun Gai, Muhao Wei, Qiang Wang, Ruitao Wang, Sen Na, Shengnan Zhang, Siyang Mao, Sui Huang, Tianke Zhang, Tingting Gao, Wei Chen, Wei Yuan, Xiangyu Wu, Xiao Hu, Xingyu Lu, Yi-Fan Zhang, Yiping Yang, Yulong Chen, Zeyi Lu, Zhenhua Wu, Zhixin Ling, Zhuoran Yang, Ziming Li, Di Xu, Haixuan Gao, Hang Li, Jing Wang, Lejian Ren, Qigen Hu, Qianqian Wang, Shiyao Wang, Xinchen Luo, Yan Li, Yuhang Hu, Zixing Zhang
Main category: cs.CV
TL;DR: Keye-VL-1.5 introduces a Slow-Fast video encoding strategy, progressive context length extension, and comprehensive post-training to significantly improve video understanding in MLLMs.
Details
Motivation: Video understanding remains challenging for MLLMs due to the dynamic nature of videos and the trade-off between spatial resolution and temporal coverage in existing models.
Method: Three key innovations: 1) Slow-Fast video encoding that dynamically allocates resources based on inter-frame similarity, 2) Progressive 4-stage pre-training extending context from 8K to 128K tokens, 3) Comprehensive post-training with chain-of-thought data, GSPO-based RL, and alignment training.
Result: Significant improvements over existing models, particularly excelling in video understanding tasks while maintaining competitive performance on general multimodal benchmarks.
Conclusion: Keye-VL-1.5 successfully addresses fundamental challenges in video comprehension through its innovative encoding strategy, progressive training methodology, and comprehensive post-training pipeline.
Abstract: In recent years, the development of Large Language Models (LLMs) has significantly advanced, extending their capabilities to multimodal tasks through Multimodal Large Language Models (MLLMs). However, video understanding remains a challenging area due to the dynamic and information-dense nature of videos. Existing models struggle with the trade-off between spatial resolution and temporal coverage when processing video content. We present Keye-VL-1.5, which addresses fundamental challenges in video comprehension through three key innovations. First, we introduce a novel Slow-Fast video encoding strategy that dynamically allocates computational resources based on inter-frame similarity, processing key frames with significant visual changes at higher resolution (Slow pathway) while handling relatively static frames with increased temporal coverage at lower resolution (Fast pathway). Second, we implement a progressive four-stage pre-training methodology that systematically extends the model’s context length from 8K to 128K tokens, enabling processing of longer videos and more complex visual content. Third, we develop a comprehensive post-training pipeline focusing on reasoning enhancement and human preference alignment, incorporating a 5-step chain-of-thought data construction process, iterative GSPO-based reinforcement learning with progressive prompt hinting for difficult cases, and alignment training. Through extensive evaluation on public benchmarks and rigorous internal human assessment, Keye-VL-1.5 demonstrates significant improvements over existing models, particularly excelling in video understanding tasks while maintaining competitive performance on general multimodal benchmarks.
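A minimal sketch of the Slow-Fast routing logic, assuming cosine similarity between consecutive frame features and a fixed threshold; both are our simplifications of the dynamic allocation the report describes:

```python
import numpy as np

def route_frames(frame_features, threshold=0.9):
    """Send visually changing frames to the Slow (high-res) pathway and
    near-static frames to the Fast (low-res) pathway."""
    slow, fast = [0], []  # treat the first frame as a key frame
    for i in range(1, len(frame_features)):
        a = frame_features[i - 1].ravel()
        b = frame_features[i].ravel()
        sim = a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)
        (fast if sim >= threshold else slow).append(i)
    return slow, fast
```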
cs.AI
[203] Learning to Deliberate: Meta-policy Collaboration for Agentic LLMs with Multi-agent Reinforcement Learning
Wei Yang, Jesse Thomason
Main category: cs.AI
TL;DR: MPDF framework enables LLM agents to learn adaptive meta-cognitive policies (Persist/Refine/Concede) using novel SoftRankPO algorithm, achieving 4-5% accuracy gains over SOTA methods.
Details
Motivation: Current multi-agent LLM systems use fixed collaboration protocols that overlook agents' internal deliberative capabilities and meta-cognitive states like uncertainty, treating agents as passive executors.
Method: Introduced Meta-Policy Deliberation Framework (MPDF) with decentralized policy learning over meta-cognitive actions. Developed SoftRankPO algorithm using rank-based reward shaping through smooth normal quantiles to stabilize training against reward variance.
Result: Achieved 4-5% absolute gain in average accuracy across five mathematical and general reasoning benchmarks compared to six state-of-the-art heuristic and learning-based multi-agent reasoning algorithms.
Conclusion: Presents a paradigm shift from designing fixed protocols to learning dynamic, deliberative strategies for adaptive meta-cognitive policies in multi-agent LLM systems.
Abstract: Multi-agent systems of large language models (LLMs) show promise for complex reasoning, but their effectiveness is often limited by fixed collaboration protocols. These frameworks typically focus on macro-level orchestration while overlooking agents’ internal deliberative capabilities. This critical meta-cognitive blindspot treats agents as passive executors unable to adapt their strategy based on internal cognitive states like uncertainty or confidence. We introduce the Meta-Policy Deliberation Framework (MPDF), where agents learn a decentralized policy over a set of high-level meta-cognitive actions: Persist, Refine, and Concede. To overcome the instability of traditional policy gradients in this setting, we develop SoftRankPO, a novel reinforcement learning algorithm. SoftRankPO stabilizes training by shaping advantages based on the rank of rewards mapped through smooth normal quantiles, making the learning process robust to reward variance. Experiments show that MPDF with SoftRankPO achieves a 4-5% absolute gain in average accuracy across five mathematical and general reasoning benchmarks compared to six state-of-the-art heuristic and learning-based multi-agent reasoning algorithms. Our work presents a paradigm for learning adaptive, meta-cognitive policies for multi-agent LLM systems, shifting the focus from designing fixed protocols to learning dynamic, deliberative strategies.
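The core of SoftRankPO, shaping advantages from reward ranks via smooth normal quantiles, fits in a few lines. The exact rank-to-quantile mapping below is our guess at the construction the abstract names:

```python
import numpy as np
from scipy.stats import norm, rankdata

def softrank_advantages(rewards):
    """Advantages depend only on reward *ranks*, which makes them robust
    to the raw variance of the reward signal."""
    ranks = rankdata(rewards)                 # 1..N, ties get the average rank
    quantiles = (ranks - 0.5) / len(rewards)  # map ranks into the open interval (0, 1)
    return norm.ppf(quantiles)                # smooth, zero-centred normal scores
```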
[204] PG-Agent: An Agent Powered by Page Graph
Weizhi Chen, Ziwei Wang, Leyang Yang, Sheng Zhou, Xiaoxuan Tang, Jiajun Bu, Yong Li, Wei Jiang
Main category: cs.AI
TL;DR: The paper proposes PG-Agent, a GUI agent framework that converts sequential episodes into page graphs to better capture complex page transitions, uses RAG for retrieving GUI perception guidelines, and employs multi-agent task decomposition for improved generalization to unseen scenarios.
Details
Motivation: Existing GUI agents using sequential episodes fail to capture complex transition relationships between pages, making it challenging for agents to deeply perceive the GUI environment and generalize to new scenarios.
Method: An automated pipeline transforms sequential episodes into page graphs that model graph structures of naturally connected pages. RAG technology retrieves reliable GUI perception guidelines from these graphs, and a multi-agent framework (PG-Agent) with task decomposition strategy is injected with these guidelines.
Result: Extensive experiments on various benchmarks demonstrate the effectiveness of PG-Agent, even with limited episodes for page graph construction.
Conclusion: The proposed PG-Agent framework successfully addresses the limitations of sequential episode approaches by leveraging page graphs and RAG technology, enabling better GUI environment perception and generalization to unseen scenarios.
Abstract: Graphical User Interface (GUI) agents possess significant commercial and social value, and GUI agents powered by advanced multimodal large language models (MLLMs) have demonstrated remarkable potential. Currently, existing GUI agents usually utilize sequential episodes of multi-step operations across pages as the prior GUI knowledge, which fails to capture the complex transition relationship between pages, making it challenging for the agents to deeply perceive the GUI environment and generalize to new scenarios. Therefore, we design an automated pipeline to transform the sequential episodes into page graphs, which explicitly model the graph structure of the pages that are naturally connected by actions. To fully utilize the page graphs, we further introduce Retrieval-Augmented Generation (RAG) technology to effectively retrieve reliable perception guidelines of GUI from them, and a tailored multi-agent framework PG-Agent with task decomposition strategy is proposed to be injected with the guidelines so that it can generalize to unseen scenarios. Extensive experiments on various benchmarks demonstrate the effectiveness of PG-Agent, even with limited episodes for page graph construction.
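The episode-to-page-graph transformation can be illustrated with networkx. The episode encoding below, lists of (page, action) steps, is a simplification we assume for the sketch:

```python
import networkx as nx

def build_page_graph(episodes):
    """Merge sequential GUI episodes into a single directed page graph whose
    edges carry the action that triggered each page transition."""
    graph = nx.MultiDiGraph()
    for episode in episodes:  # episode: [(page_id, action), ...]
        for (page, action), (next_page, _) in zip(episode, episode[1:]):
            graph.add_edge(page, next_page, action=action)
    return graph
```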
[205] Multilinear and Linear Programs for Partially Identifiable Queries in Quasi-Markovian Structural Causal Models
João P. Arroyo, João G. Rodrigues, Daniel Lawand, Denis D. Mauá, Junkyu Lee, Radu Marinescu, Alex Gray, Eduardo R. Laurentino, Fabio G. Cozman
Main category: cs.AI
TL;DR: Novel algorithm for computing tight probability bounds in quasi-Markovian causal models using column generation and linear integer programming, outperforming existing methods.
Details
Motivation: Address the challenge of partially identifiable queries in causal models where exogenous variables are not fully specified, making precise probability computation impossible.
Method: Uses column generation with auxiliary linear integer programs to compute probability bounds, exploiting input probabilities over endogenous variables and focusing on single-intervention scenarios.
Result: Demonstrates that polynomial cardinality representation for exogenous variables is possible, with experiments showing column generation techniques superior to existing methods.
Conclusion: Provides an efficient approach for handling partial identifiability in quasi-Markovian causal models through innovative programming techniques.
Abstract: We investigate partially identifiable queries in a class of causal models. We focus on acyclic Structural Causal Models that are quasi-Markovian (that is, each endogenous variable is connected with at most one exogenous confounder). We look into scenarios where endogenous variables are observed (and a distribution over them is known), while exogenous variables are not fully specified. This leads to a representation that is in essence a Bayesian network where the distribution of root variables is not uniquely determined. In such circumstances, it may not be possible to precisely compute a probability value of interest. We thus study the computation of tight probability bounds, a problem that has been solved by multilinear programming in general, and by linear programming when a single confounded component is intervened upon. We present a new algorithm to simplify the construction of such programs by exploiting input probabilities over endogenous variables. For scenarios with a single intervention, we apply column generation to compute a probability bound through a sequence of auxiliary linear integer programs, thus showing that a representation with polynomial cardinality for exogenous variables is possible. Experiments show column generation techniques to be superior to existing methods.
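The paper's column-generation machinery is out of scope for a snippet, but the underlying linear-programming view of bound computation is easy to show: optimize the query over all exogenous distributions consistent with the observed endogenous probabilities. The encoding below is a toy, and the matrix names and shapes are our assumptions:

```python
import numpy as np
from scipy.optimize import linprog

def probability_bounds(c, A, b):
    """Tight [lower, upper] bounds on a query c @ q, where q is an unknown
    distribution over exogenous states constrained by observations A @ q = b
    (the normalisation constraint sum(q) = 1 is assumed to be a row of A)."""
    lo = linprog(c, A_eq=A, b_eq=b, bounds=[(0, 1)] * len(c))
    hi = linprog(-np.asarray(c), A_eq=A, b_eq=b, bounds=[(0, 1)] * len(c))
    return lo.fun, -hi.fun
```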
[206] Psychologically Enhanced AI Agents
Maciej Besta, Shriram Chandran, Robert Gerstenberger, Mathis Lindner, Marcin Chrapek, Sebastian Hermann Martschat, Taraneh Ghandi, Patrick Iff, Hubert Niewiadomski, Piotr Nyczyk, Jürgen Müller, Torsten Hoefler
Main category: cs.AI
TL;DR: MBTI-in-Thoughts framework uses MBTI personality conditioning via prompt engineering to enhance LLM agent behavior along cognitive and affective axes, showing consistent behavioral biases across tasks without fine-tuning.
Details
Motivation: To bridge psychological theory and LLM behavior design by creating psychologically grounded AI agents that exhibit consistent, interpretable personality traits for improved performance in diverse tasks.
Method: Priming LLM agents with distinct MBTI personality archetypes through prompt engineering, using the official 16Personalities test for automated verification of trait persistence, and experimenting with structured multi-agent communication protocols.
Result: Personality priming yields consistent behavioral biases: emotionally expressive agents excel in narrative generation, analytically primed agents adopt stable strategies in game theory, and self-reflection improves cooperation and reasoning quality.
Conclusion: The framework successfully establishes a foundation for psychologically enhanced AI agents without fine-tuning, generalizes to other psychological frameworks (Big Five, HEXACO, Enneagram), and enables control over LLM behavior through psychologically grounded personality conditioning.
Abstract: We introduce MBTI-in-Thoughts, a framework for enhancing the effectiveness of Large Language Model (LLM) agents through psychologically grounded personality conditioning. Drawing on the Myers-Briggs Type Indicator (MBTI), our method primes agents with distinct personality archetypes via prompt engineering, enabling control over behavior along two foundational axes of human psychology, cognition and affect. We show that such personality priming yields consistent, interpretable behavioral biases across diverse tasks: emotionally expressive agents excel in narrative generation, while analytically primed agents adopt more stable strategies in game-theoretic settings. Our framework supports experimenting with structured multi-agent communication protocols and reveals that self-reflection prior to interaction improves cooperation and reasoning quality. To ensure trait persistence, we integrate the official 16Personalities test for automated verification. While our focus is on MBTI, we show that our approach generalizes seamlessly to other psychological frameworks such as Big Five, HEXACO, or Enneagram. By bridging psychological theory and LLM behavior design, we establish a foundation for psychologically enhanced AI agents without any fine-tuning.
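Since the conditioning is pure prompt engineering, the priming step reduces to prepending an archetype description to the task. The trait texts below are illustrative placeholders, not the paper's prompts:

```python
MBTI_PRIMES = {
    "INTJ": "You are analytical and strategic, and you favour stable long-term plans.",
    "ENFP": "You are emotionally expressive, imaginative, and spontaneous.",
}

def persona_prompt(mbti_type, task):
    """Prime an LLM agent with an MBTI archetype via the prompt alone (no fine-tuning)."""
    return f"{MBTI_PRIMES[mbti_type]}\n\nTask: {task}"
```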
[207] Diffusion-RL Based Air Traffic Conflict Detection and Resolution Method
Tonghe Li, Jixin Liu, Weili Zeng, Hao Jiang
Main category: cs.AI
TL;DR: Proposes Diffusion-AC, a novel autonomous conflict resolution framework that integrates diffusion probabilistic models to overcome unimodal bias in DRL approaches, enabling multimodal decision-making for air traffic conflict detection and resolution.
Details
Motivation: Existing Deep Reinforcement Learning approaches for Conflict Detection and Resolution suffer from unimodal bias, leading to decision deadlocks and lack of flexibility in complex air traffic scenarios with dynamic constraints.
Method: Integrates diffusion probabilistic models into CD&R, modeling policy as a reverse denoising process guided by a value function to generate multimodal action distribution. Uses Density-Progressive Safety Curriculum for stable training from sparse to high-density traffic environments.
Result: Significantly outperforms state-of-the-art DRL benchmarks. Achieves 94.1% success rate in high-density scenarios and reduces Near Mid-Air Collisions by 59% compared to best baseline, enhancing safety margin through flexible multimodal decision-making.
Conclusion: Diffusion-AC framework successfully overcomes unimodal bias limitations, providing high-quality multimodal decision-making capabilities that significantly improve safety and performance in complex air traffic conflict resolution scenarios.
Abstract: In the context of continuously rising global air traffic, efficient and safe Conflict Detection and Resolution (CD&R) is paramount for air traffic management. Although Deep Reinforcement Learning (DRL) offers a promising pathway for CD&R automation, existing approaches commonly suffer from a “unimodal bias” in their policies. This leads to a critical lack of decision-making flexibility when confronted with complex and dynamic constraints, often resulting in “decision deadlocks.” To overcome this limitation, this paper pioneers the integration of diffusion probabilistic models into the safety-critical task of CD&R, proposing a novel autonomous conflict resolution framework named Diffusion-AC. Diverging from conventional methods that converge to a single optimal solution, our framework models its policy as a reverse denoising process guided by a value function, enabling it to generate a rich, high-quality, and multimodal action distribution. This core architecture is complemented by a Density-Progressive Safety Curriculum (DPSC), a training mechanism that ensures stable and efficient learning as the agent progresses from sparse to high-density traffic environments. Extensive simulation experiments demonstrate that the proposed method significantly outperforms a suite of state-of-the-art DRL benchmarks. Most critically, in the most challenging high-density scenarios, Diffusion-AC not only maintains a high success rate of 94.1% but also reduces the incidence of Near Mid-Air Collisions (NMACs) by approximately 59% compared to the next-best-performing baseline, significantly enhancing the system’s safety margin. This performance leap stems from its unique multimodal decision-making capability, which allows the agent to flexibly switch to effective alternative maneuvers.
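The value-guided reverse denoising at the heart of Diffusion-AC resembles classifier guidance. Below is a heavily simplified single step; the noise schedule, update rule, and guidance scale are all our assumptions, not the paper's algorithm:

```python
import torch

def value_guided_step(a_t, t, eps_model, value_fn, sigma, guidance=0.1):
    """One reverse-denoising step over an action sample, biased toward
    high-value maneuvers via the gradient of a learned value function."""
    with torch.enable_grad():
        a = a_t.detach().requires_grad_(True)
        grad = torch.autograd.grad(value_fn(a, t).sum(), a)[0]  # ascend the value
    # Bias the predicted noise so the sample drifts toward higher-value actions.
    eps = eps_model(a_t, t) - guidance * sigma[t] * grad
    mean = a_t - sigma[t] * eps  # placeholder posterior mean; real DDPM updates differ
    return mean + sigma[t] * torch.randn_like(a_t)
```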
[208] Learning When to Plan: Efficiently Allocating Test-Time Compute for LLM Agents
Davide Paglieri, Bartłomiej Cupiał, Jonathan Cook, Ulyana Piterbarg, Jens Tuyls, Edward Grefenstette, Jakob Nicolaus Foerster, Jack Parker-Holder, Tim Rocktäschel
Main category: cs.AI
TL;DR: Training LLMs with dynamic planning framework that decides when to plan during execution, improving efficiency and performance on long-horizon tasks.
Details
Motivation: Existing methods like ReAct require always planning before every action, which is computationally expensive and degrades performance on long-horizon tasks, while never planning limits performance.
Method: Two-stage training pipeline: (1) supervised fine-tuning on diverse synthetic data to prime models for dynamic planning, and (2) reinforcement learning to refine this capability in long-horizon environments.
Result: Dynamic planning agents are more sample-efficient and consistently achieve more complex objectives in Crafter environment. They can also be effectively steered by human-written plans, surpassing independent capabilities.
Conclusion: First work to train LLM agents for dynamic test-time compute allocation in sequential decision-making tasks, enabling more efficient, adaptive, and controllable agentic systems.
Abstract: Training large language models (LLMs) to reason via reinforcement learning (RL) significantly improves their problem-solving capabilities. In agentic settings, existing methods like ReAct prompt LLMs to explicitly plan before every action; however, we demonstrate that always planning is computationally expensive and degrades performance on long-horizon tasks, while never planning further limits performance. To address this, we introduce a conceptual framework formalizing dynamic planning for LLM agents, enabling them to flexibly decide when to allocate test-time compute for planning. We propose a simple two-stage training pipeline: (1) supervised fine-tuning on diverse synthetic data to prime models for dynamic planning, and (2) RL to refine this capability in long-horizon environments. Experiments on the Crafter environment show that dynamic planning agents trained with this approach are more sample-efficient and consistently achieve more complex objectives. Additionally, we demonstrate that these agents can be effectively steered by human-written plans, surpassing their independent capabilities. To our knowledge, this work is the first to explore training LLM agents for dynamic test-time compute allocation in sequential decision-making tasks, paving the way for more efficient, adaptive, and controllable agentic systems.
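The control flow of dynamic planning is simple to sketch: the agent first emits a cheap decision, and only pays for a full plan when it opts in. `agent.generate` is a hypothetical text-completion interface, not an API from the paper:

```python
def act(agent, history, observation):
    """Let the agent itself decide whether to spend test-time compute on planning."""
    context = f"{history}\nObservation: {observation}"
    decision = agent.generate(f"{context}\nShould I plan before acting? (yes/no):")
    if decision.strip().lower().startswith("yes"):
        plan = agent.generate(f"{context}\nPlan:")
        context += f"\nPlan: {plan}"
    return agent.generate(f"{context}\nAction:")
```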
[209] (Ir)rationality in AI: State of the Art, Research Challenges and Open Questions
Olivia Macmillan-Scott, Mirco Musolesi
Main category: cs.AI
TL;DR: Survey paper examining rationality concepts in AI, exploring how definitions vary across fields and analyzing scenarios where irrational behavior can be optimal for artificial agents.
Details
Motivation: The centrality of rationality in AI lacks unified definitions, and there's a need to understand how irrational behaviors can sometimes be optimal and how to handle interactions with irrational agents.
Method: Literature survey approach examining rationality concepts from economics, philosophy, and psychology, analyzing scenarios where irrational behavior proves optimal, and reviewing methods for identifying and interacting with irrational agents.
Result: Identifies that irrational behaviors can be optimal in certain scenarios, finds limited existing work on handling irrational agents, and suggests adapting adversarial methods for artificial agent interactions.
Conclusion: Many open questions remain regarding rationality definitions, optimal irrational behaviors, and human-AI interactions, with potential for cross-disciplinary approaches from adversarial scenarios.
Abstract: The concept of rationality is central to the field of artificial intelligence (AI). Whether we are seeking to simulate human reasoning, or trying to achieve bounded optimality, our goal is generally to make artificial agents as rational as possible. Despite the centrality of the concept within AI, there is no unified definition of what constitutes a rational agent. This article provides a survey of rationality and irrationality in AI, and sets out the open questions in this area. We consider how the understanding of rationality in other fields has influenced its conception within AI, in particular work in economics, philosophy and psychology. Focusing on the behaviour of artificial agents, we examine irrational behaviours that can prove to be optimal in certain scenarios. Some methods have been developed to deal with irrational agents, both in terms of identification and interaction; however, work in this area remains limited. Methods that have up to now been developed for other purposes, namely adversarial scenarios, may be adapted to suit interactions with artificial agents. We further discuss the interplay between human and artificial agents, and the role that rationality plays within this interaction; many questions remain in this area, relating to potentially irrational behaviour of both humans and artificial agents.
[210] Explainable Knowledge Graph Retrieval-Augmented Generation (KG-RAG) with KG-SMILE
Zahra Zehtabi Sabeti Moghaddam, Zeinab Dehghani, Maneeha Rani, Koorosh Aslansefat, Bhupesh Kumar Mishra, Rameez Raja Kureshi, Dhavalkumar Thakker
Main category: cs.AI
TL;DR: KG-SMILE is a perturbation-based framework that provides token and component-level interpretability for Graph RAG systems, making them more transparent by identifying influential graph entities and relations.
Details
Motivation: Generative AI often produces hallucinations and unverifiable claims, limiting reliability in sensitive domains like healthcare. While RAG improves accuracy by grounding outputs in external knowledge, it remains opaque and heavily dependent on data quality.
Method: Developed a method-agnostic, perturbation-based framework that applies controlled perturbations, computes similarities, and trains weighted linear surrogates to identify influential graph entities and relations in Graph RAG systems.
Result: KG-SMILE produces stable, human-aligned explanations and demonstrates strong performance across comprehensive attribution metrics including fidelity, faithfulness, consistency, stability, and accuracy.
Conclusion: KG-SMILE effectively balances model effectiveness with interpretability, fostering greater transparency and trust in machine learning technologies, particularly for Graph RAG systems in sensitive domains.
Abstract: Generative AI, such as Large Language Models (LLMs), has achieved impressive progress but still produces hallucinations and unverifiable claims, limiting reliability in sensitive domains. Retrieval-Augmented Generation (RAG) improves accuracy by grounding outputs in external knowledge, especially in domains like healthcare, where precision is vital. However, RAG remains opaque and essentially a black box, heavily dependent on data quality. We developed a method-agnostic, perturbation-based framework that provides token and component-level interpretability for Graph RAG using SMILE, naming it Knowledge-Graph (KG)-SMILE. By applying controlled perturbations, computing similarities, and training weighted linear surrogates, KG-SMILE identifies the graph entities and relations most influential to generated outputs, thereby making RAG more transparent. We evaluate KG-SMILE using comprehensive attribution metrics, including fidelity, faithfulness, consistency, stability, and accuracy. Our findings show that KG-SMILE produces stable, human-aligned explanations, demonstrating its capacity to balance model effectiveness with interpretability and thereby fostering greater transparency and trust in machine learning technologies.
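The perturb-score-fit loop can be sketched directly. `generate` and `similarity` below are hypothetical callables standing in for the RAG pipeline and an output-similarity metric, and the locality weighting is our LIME-style assumption:

```python
import numpy as np
from sklearn.linear_model import Ridge

def kg_smile_scores(triples, generate, similarity, n_samples=200, keep_p=0.7):
    """Perturb the retrieved KG triples, measure how much the generated answer
    moves, and fit a weighted linear surrogate whose coefficients score each
    triple's influence on the output."""
    rng = np.random.default_rng(0)
    reference = generate(triples)
    masks = rng.random((n_samples, len(triples))) < keep_p
    sims = np.array([
        similarity(reference, generate([t for t, keep in zip(triples, mask) if keep]))
        for mask in masks
    ])
    weights = np.exp(sims - 1.0)  # weight samples near the unperturbed output more
    surrogate = Ridge(alpha=1.0).fit(masks.astype(float), sims, sample_weight=weights)
    return surrogate.coef_  # one influence score per (entity, relation, entity) triple
```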
[211] Theory of Mind Using Active Inference: A Framework for Multi-Agent Cooperation
Riddhi J. Pitliya, Ozan Çatal, Toon Van de Maele, Corrado Pezzato, Tim Verbelen
Main category: cs.AI
TL;DR: A novel multi-agent cooperation approach using Theory of Mind within active inference that enables agents to infer others’ beliefs from observable behavior without explicit communication or shared models.
Details
Motivation: To enable more effective multi-agent cooperation by implementing Theory of Mind capabilities, allowing agents to understand and reason about others' differing knowledge and goals during planning.
Method: ToM-equipped agents maintain distinct representations of their own and others’ beliefs/goals, using an extended inference tree-based planning algorithm for recursive reasoning in joint policy spaces.
Result: ToM agents outperformed non-ToM counterparts in collision avoidance and foraging simulations by better avoiding collisions and reducing redundant efforts through belief inference.
Conclusion: The approach demonstrates potential for generalizable and scalable multi-agent systems while providing computational insights into Theory of Mind mechanisms.
Abstract: Theory of Mind (ToM) – the ability to understand that others can have differing knowledge and goals – enables agents to reason about others’ beliefs while planning their own actions. We present a novel approach to multi-agent cooperation by implementing ToM within active inference. Unlike previous active inference approaches to multi-agent cooperation, our method neither relies on task-specific shared generative models nor requires explicit communication. In our framework, ToM-equipped agents maintain distinct representations of their own and others’ beliefs and goals. ToM agents then use an extended and adapted version of the sophisticated inference tree-based planning algorithm to systematically explore joint policy spaces through recursive reasoning. We evaluate our approach through collision avoidance and foraging simulations. Results suggest that ToM agents cooperate better compared to non-ToM counterparts by being able to avoid collisions and reduce redundant efforts. Crucially, ToM agents accomplish this by inferring others’ beliefs solely from observable behaviour and considering them when planning their own actions. Our approach shows potential for generalisable and scalable multi-agent systems while providing computational insights into ToM mechanisms.
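The key representational commitment, separate belief distributions over one's own and the other agent's goals updated only from observed behaviour, fits in a few lines. The Bayesian update below is a generic stand-in for the paper's active-inference machinery:

```python
import numpy as np

class ToMAgent:
    """Maintains distinct beliefs over its own and another agent's goals."""
    def __init__(self, n_goals):
        self.own_goal_belief = np.full(n_goals, 1.0 / n_goals)
        self.other_goal_belief = np.full(n_goals, 1.0 / n_goals)

    def observe_other(self, action, likelihood):
        """Bayes-update the other agent's inferred goal from an observed action,
        where likelihood[g, a] = P(action a | the other agent pursues goal g)."""
        self.other_goal_belief *= likelihood[:, action]
        self.other_goal_belief /= self.other_goal_belief.sum()
```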
[212] CausalARC: Abstract Reasoning with Causal World Models
Jacqueline Maasch, John Kalantari, Kia Khezeli
Main category: cs.AI
TL;DR: CausalARC is a testbed for evaluating AI reasoning in low-data and OOD settings using causal world models with observational, interventional, and counterfactual feedback.
Details
Motivation: Reasoning requires adaptation to novel problems with limited data and distribution shift, but current benchmarks lack principled causal modeling.
Method: Create reasoning tasks from structural causal models, provide few-shot demonstrations with three types of causal feedback (observational, interventional, counterfactual).
Result: Framework enables evaluation across four settings: abstract reasoning with test-time training, counterfactual reasoning, program synthesis, and causal discovery.
Conclusion: CausalARC provides a principled testbed for evaluating AI reasoning capabilities in causal, low-data, and out-of-distribution scenarios.
Abstract: Reasoning requires adaptation to novel problem settings under limited data and distribution shift. This work introduces CausalARC: an experimental testbed for AI reasoning in low-data and out-of-distribution regimes, modeled after the Abstraction and Reasoning Corpus (ARC). Each CausalARC reasoning task is sampled from a fully specified causal world model, formally expressed as a structural causal model. Principled data augmentations provide observational, interventional, and counterfactual feedback about the world model in the form of few-shot, in-context learning demonstrations. As a proof-of-concept, we illustrate the use of CausalARC for four language model evaluation settings: (1) abstract reasoning with test-time training, (2) counterfactual reasoning with in-context learning, (3) program synthesis, and (4) causal discovery with logical reasoning.
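A toy structural causal model makes the three feedback types concrete: observational data comes from sampling the model as-is, interventional data from severing a variable's parents with do(), and counterfactuals from reusing the same exogenous noise. The two-variable model below is our own example, not a CausalARC task:

```python
import numpy as np

def scm_sample(n, do_x=None, seed=0):
    """U -> X -> Y with confounding: X = U unless intervened; Y = (X + U) mod 2."""
    rng = np.random.default_rng(seed)
    u = rng.integers(0, 2, n)                           # exogenous noise
    x = u.copy() if do_x is None else np.full(n, do_x)  # do(X=x) severs X from U
    y = (x + u) % 2
    return u, x, y

u, x, y = scm_sample(1000)                # observational feedback
_, x_do, y_do = scm_sample(1000, do_x=1)  # interventional feedback: do(X = 1)
```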
[213] Towards a Neurosymbolic Reasoning System Grounded in Schematic Representations
François Olivier, Zied Bouraoui
Main category: cs.AI
TL;DR: A neurosymbolic system called Embodied-LM that grounds LLM reasoning in embodied cognitive structures using spatial schemas and Answer Set Programming, improving logical reasoning interpretability.
Details
Motivation: LLMs remain error-prone in logical reasoning and lack robust mental representations that enable human-like comprehension, needing grounding in embodied cognitive structures.
Method: Introduces Embodied-LM system that uses image schemas (sensorimotor patterns) and formalizes them with declarative spatial reasoning in Answer Set Programming to guide LLMs.
Result: Demonstrates that LLMs can interpret scenarios through embodied cognitive structures, which can be formalized as executable programs supporting effective logical reasoning with enhanced interpretability.
Conclusion: The system establishes computational foundation for incorporating more complex dynamic representations, though current implementation focuses on spatial primitives.
Abstract: Despite significant progress in natural language understanding, Large Language Models (LLMs) remain error-prone when performing logical reasoning, often lacking the robust mental representations that enable human-like comprehension. We introduce a prototype neurosymbolic system, Embodied-LM, that grounds understanding and logical reasoning in schematic representations based on image schemas: recurring patterns derived from sensorimotor experience that structure human cognition. Our system operationalizes the spatial foundations of these cognitive structures using declarative spatial reasoning within Answer Set Programming. Through evaluation on logical deduction problems, we demonstrate that LLMs can be guided to interpret scenarios through embodied cognitive structures, that these structures can be formalized as executable programs, and that the resulting representations support effective logical reasoning with enhanced interpretability. While our current implementation focuses on spatial primitives, it establishes the computational foundation for incorporating more complex and dynamic representations.
[214] Emergent Hierarchical Reasoning in LLMs through Reinforcement Learning
Haozhe Wang, Qixin Xu, Che Liu, Junhong Wu, Fangzhen Lin, Wenhu Chen
Main category: cs.AI
TL;DR: RL enhances LLM reasoning through emergent hierarchical planning, but current methods inefficiently optimize all tokens. HICRA focuses optimization on high-impact planning tokens, outperforming baselines.
Details
Motivation: Understanding why RL improves LLM reasoning abilities and addressing inefficiencies in current RL algorithms that apply optimization pressure indiscriminately across all tokens.
Method: Proposed HIerarchy-Aware Credit Assignment (HICRA) algorithm that concentrates optimization efforts specifically on high-impact planning tokens rather than applying uniform pressure across all tokens.
Result: HICRA significantly outperforms strong baselines like GRPO, demonstrating that focusing on strategic planning bottlenecks is key to unlocking advanced reasoning capabilities in LLMs.
Conclusion: Semantic entropy serves as a superior metric for measuring strategic exploration, and hierarchical-aware optimization focusing on planning tokens is crucial for efficient RL-based reasoning enhancement in LLMs.
Abstract: Reinforcement Learning (RL) has proven highly effective at enhancing the complex reasoning abilities of Large Language Models (LLMs), yet the underlying mechanisms driving this success remain largely opaque. Our analysis reveals that puzzling phenomena like “aha moments”, “length-scaling”, and entropy dynamics are not disparate occurrences but hallmarks of an emergent reasoning hierarchy, akin to the separation of high-level strategic planning from low-level procedural execution in human cognition. We uncover a compelling two-phase dynamic: initially, a model is constrained by procedural correctness and must improve its low-level skills. The learning bottleneck then decisively shifts, with performance gains being driven by the exploration and mastery of high-level strategic planning. This insight exposes a core inefficiency in prevailing RL algorithms like GRPO, which apply optimization pressure agnostically and dilute the learning signal across all tokens. To address this, we propose HIerarchy-Aware Credit Assignment (HICRA), an algorithm that concentrates optimization efforts on high-impact planning tokens. HICRA significantly outperforms strong baselines, demonstrating that focusing on this strategic bottleneck is key to unlocking advanced reasoning. Furthermore, we validate semantic entropy as a superior compass for measuring strategic exploration over misleading metrics such as token-level entropy.
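In policy-gradient terms, HICRA's idea can be sketched as reweighting per-token advantages with a planning-token mask. How planning tokens are identified, and whether execution tokens are zeroed or merely damped, are details we assume here:

```python
import torch

def hicra_advantages(advantages, planning_mask, exec_weight=0.1):
    """Concentrate credit on planning tokens; damp it on procedural tokens.

    advantages:    (batch, seq_len) per-token advantages from the base RL algorithm
    planning_mask: (batch, seq_len) 1.0 where a token is part of high-level planning
    """
    weights = planning_mask + exec_weight * (1.0 - planning_mask)
    return advantages * weights
```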
[215] PIN: A Knowledge-Intensive Dataset for Paired and Interleaved Multimodal Documents
Junjie Wang, Yuxiang Zhang, Minghao Liu, Yin Zhang, Yatai Ji, Weihao Xuan, Nie Lin, Kang Zhu, Zhiqiang Lin, Yiming Ren, Chunyang Jiang, Yiyao Yu, Zekun Wang, Tiezhen Wang, Wenhao Huang, Jie Fu, Qunshu Liu, Yujiu Yang, Ge Zhang, Ruibin Yuan, Bei Chen, Wenhu Chen
Main category: cs.AI
TL;DR: PIN introduces a novel multimodal data format combining structured Markdown with holistic document images, and releases two large-scale datasets (PIN-200M and PIN-14M) to improve knowledge integration in large multimodal models.
Details
Motivation: Address persistent perceptual and reasoning errors in large multimodal models (LMMs) when interpreting complex visual data and deducing multimodal relationships.
Method: Developed PIN format that pairs semantically rich Markdown files (preserving fine-grained textual structures) with holistic overall images capturing complete document layouts. Constructed two datasets from diverse web and scientific sources in English and Chinese.
Result: Created and released PIN-200M (~200M documents) and PIN-14M (~14M documents) datasets with detailed statistical analyses and quality signals for easy filtering and selection.
Conclusion: Provides versatile data format and substantial resources to enable new research in pre-training strategies and development of more powerful knowledge-intensive LMMs.
Abstract: Recent advancements in large multimodal models (LMMs) have leveraged extensive multimodal datasets to enhance capabilities in complex knowledge-driven tasks. However, persistent challenges in perceptual and reasoning errors limit their efficacy, particularly in interpreting intricate visual data and deducing multimodal relationships. To address these issues, we introduce PIN (Paired and INterleaved multimodal documents), a novel data format designed to foster a deeper integration of visual and textual knowledge. The PIN format uniquely combines semantically rich Markdown files, which preserve fine-grained textual structures, with holistic overall images that capture the complete document layout. Following this format, we construct and release two large-scale, open-source datasets: PIN-200M (~200 million documents) and PIN-14M (~14 million), compiled from diverse web and scientific sources in both English and Chinese. To maximize usability, we provide detailed statistical analyses and equip the datasets with quality signals, enabling researchers to easily filter and select data for specific tasks. Our work provides the community with a versatile data format and substantial resources, offering a foundation for new research in pre-training strategies and the development of more powerful knowledge-intensive LMMs.
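One PIN document can be pictured as a Markdown file paired with a full-page rendering. The field names below are our guess at a plausible record layout, not the released schema:

```python
# Hypothetical record layout for one PIN document (field names are
# illustrative): markdown preserves fine-grained textual structure, while
# the overall image captures the complete page layout.
pin_record = {
    "doc_id": "example-0001",
    "markdown": "# Title\n\nBody text preserving **structure**, tables, equations...",
    "overall_image": "images/example-0001.png",  # holistic rendering of the layout
    "quality_signals": {"lang": "en", "parse_confidence": 0.97},  # for filtering
}
```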
[216] An Empirical Evaluation of Factors Affecting SHAP Explanation of Time Series Classification
Davide Italo Serramazza, Nikos Papadeas, Zahraa Abdallah, Georgiana Ifrim
Main category: cs.AI
TL;DR: Equal-length segmentation outperforms custom time series segmentation algorithms for SHAP-based time series explanations, with segment count being more important than segmentation method. A novel length-weighted normalization technique improves attribution quality.
Details
Motivation: SHAP's computational complexity limits its practicality for long time series. While feature aggregation via segmentation reduces computation time, the optimal segmentation strategy remains unclear.
Method: Investigated 8 different time series segmentation algorithms and evaluated them using InterpretTime and AUC Difference methodologies on both multivariate and univariate time series.
Result: Number of segments has greater impact on explanation quality than specific segmentation method. Equal-length segmentation consistently outperforms custom algorithms. Novel length-weighted normalization improves attribution quality.
Conclusion: Simple equal-length segmentation is more effective than complex custom segmentation methods for SHAP-based time series explanations, and length-weighted normalization enhances attribution results.
Abstract: Explainable AI (XAI) has become an increasingly important topic for understanding and attributing the predictions made by complex Time Series Classification (TSC) models. Among attribution methods, SHapley Additive exPlanations (SHAP) is widely regarded as an excellent attribution method; but its computational complexity, which scales exponentially with the number of features, limits its practicality for long time series. To address this, recent studies have shown that aggregating features via segmentation, to compute a single attribution value for a group of consecutive time points, drastically reduces SHAP running time. However, the choice of the optimal segmentation strategy remains an open question. In this work, we investigated eight different Time Series Segmentation algorithms to understand how segment compositions affect the explanation quality. We evaluate these approaches using two established XAI evaluation methodologies: InterpretTime and AUC Difference. Through experiments on both Multivariate (MTS) and Univariate Time Series (UTS), we find that the number of segments has a greater impact on explanation quality than the specific segmentation method. Notably, equal-length segmentation consistently outperforms most of the custom time series segmentation algorithms. Furthermore, we introduce a novel attribution normalisation technique that weights segments by their length and we show that it consistently improves attribution quality.
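Both the winning baseline and the proposed normalisation are easy to state in code. The exact formula below is our reading of "weights segments by their length", not a verified implementation:

```python
import numpy as np

def equal_length_segments(n_points, n_segments):
    """Contiguous, (near-)equal-length segments over a series of n_points."""
    edges = np.linspace(0, n_points, n_segments + 1, dtype=int)
    return list(zip(edges[:-1], edges[1:]))

def length_weighted_normalise(attributions, segments):
    """Scale each segment's SHAP attribution by its relative length, then renormalise."""
    lengths = np.array([end - start for start, end in segments], dtype=float)
    weighted = np.asarray(attributions, dtype=float) * (lengths / lengths.sum())
    return weighted / np.abs(weighted).sum()
```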
[217] PersonaTeaming: Exploring How Introducing Personas Can Improve Automated AI Red-Teaming
Wesley Hanwen Deng, Sunnie S. Y. Kim, Akshita Jha, Ken Holstein, Motahhare Eslami, Lauren Wilcox, Leon A Gatys
Main category: cs.AI
TL;DR: PersonaTeaming introduces personas into automated red-teaming to improve risk discovery by incorporating diverse identities and backgrounds in prompt generation, achieving up to 144.1% higher attack success rates.
Details
Motivation: Current automated red-teaming methods don't consider how human identities and backgrounds shape red-teaming strategies and risk discovery, creating a gap in AI safety testing.
Method: Developed PersonaTeaming with two approaches: using predefined ‘red-teaming expert’ and ‘regular AI user’ personas, and a dynamic algorithm that generates adaptive personas for different prompts. Also created new metrics to measure mutation distance.
Result: Experiments showed up to 144.1% improvement in attack success rates compared to state-of-the-art RainbowPlus method, while maintaining prompt diversity.
Conclusion: Persona-based mutation effectively enhances automated red-teaming, revealing opportunities for complementarity between automated and human approaches in AI safety testing.
Abstract: Recent developments in AI governance and safety research have called for red-teaming methods that can effectively surface potential risks posed by AI models. Many of these calls have emphasized how the identities and backgrounds of red-teamers can shape their red-teaming strategies, and thus the kinds of risks they are likely to uncover. While automated red-teaming approaches promise to complement human red-teaming by enabling larger-scale exploration of model behavior, current approaches do not consider the role of identity. As an initial step towards incorporating people’s background and identities in automated red-teaming, we develop and evaluate a novel method, PersonaTeaming, that introduces personas in the adversarial prompt generation process to explore a wider spectrum of adversarial strategies. In particular, we first introduce a methodology for mutating prompts based on either “red-teaming expert” personas or “regular AI user” personas. We then develop a dynamic persona-generating algorithm that automatically generates various persona types adaptive to different seed prompts. In addition, we develop a set of new metrics to explicitly measure the “mutation distance” to complement existing diversity measurements of adversarial prompts. Our experiments show promising improvements (up to 144.1%) in the attack success rates of adversarial prompts through persona mutation, while maintaining prompt diversity, compared to RainbowPlus, a state-of-the-art automated red-teaming method. We discuss the strengths and limitations of different persona types and mutation methods, shedding light on future opportunities to explore complementarities between automated and human red-teaming approaches.
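At its core, persona mutation is a rewrite instruction wrapped around the seed prompt. `llm` below is a hypothetical completion function, and the instruction wording is ours:

```python
def mutate_with_persona(seed_prompt, persona, llm):
    """Rewrite an adversarial seed prompt in the voice of a persona, preserving intent."""
    instruction = (
        f"Rewrite the following prompt as it might be written by {persona}, "
        f"keeping its underlying goal intact:\n\n{seed_prompt}"
    )
    return llm(instruction)

# e.g. mutate_with_persona(seed, "a regular AI user venting frustration", llm)
```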
[218] The Personality Illusion: Revealing Dissociation Between Self-Reports & Behavior in LLMs
Pengrui Han, Rafal Kocielnik, Peiyang Song, Ramit Debnath, Dean Mobbs, Anima Anandkumar, R. Michael Alvarez
Main category: cs.AI
TL;DR: LLMs develop personality-like traits through training, but self-reported traits don’t reliably predict actual behavior, and persona interventions affect self-reports more than behavior.
Details
Motivation: To systematically characterize LLM personality across training stages, validate self-reported traits against behavior, and test intervention effects, addressing gaps in prior simplified approaches.
Method: Analyzed LLM personality across three dimensions: trait evolution during training, predictive validity of self-reports in behavioral tasks, and impact of persona injection interventions on both self-reports and behavior.
Result: Instructional alignment stabilizes trait expression and strengthens correlations resembling human patterns, but self-reported traits don’t reliably predict behavior, and persona injection affects self-reports more than actual behavior.
Conclusion: LLM surface-level trait expression differs from behavioral consistency, challenging assumptions about LLM personality and highlighting need for deeper evaluation in alignment and interpretability.
Abstract: Personality traits have long been studied as predictors of human behavior. Recent advances in Large Language Models (LLMs) suggest similar patterns may emerge in artificial systems, with advanced LLMs displaying consistent behavioral tendencies resembling human traits like agreeableness and self-regulation. Understanding these patterns is crucial, yet prior work primarily relied on simplified self-reports and heuristic prompting, with little behavioral validation. In this study, we systematically characterize LLM personality across three dimensions: (1) the dynamic emergence and evolution of trait profiles throughout training stages; (2) the predictive validity of self-reported traits in behavioral tasks; and (3) the impact of targeted interventions, such as persona injection, on both self-reports and behavior. Our findings reveal that instructional alignment (e.g., RLHF, instruction tuning) significantly stabilizes trait expression and strengthens trait correlations in ways that mirror human data. However, these self-reported traits do not reliably predict behavior, and observed associations often diverge from human patterns. While persona injection successfully steers self-reports in the intended direction, it exerts little or inconsistent effect on actual behavior. By distinguishing surface-level trait expression from behavioral consistency, our findings challenge assumptions about LLM personality and underscore the need for deeper evaluation in alignment and interpretability.
[219] Are LLM Agents Behaviorally Coherent? Latent Profiles for Social Simulation
James Mooney, Josef Woldense, Zheng Robert Jia, Shirley Anugrah Hayati, My Ha Nguyen, Vipul Raheja, Dongyeop Kang
Main category: cs.AI
TL;DR: LLM agents show significant internal inconsistencies across different experimental settings, failing to maintain behavioral consistency despite generating human-like survey responses, which limits their ability to substitute real human participants in research.
Details
Motivation: To evaluate whether LLM-based synthetic agents can truly substitute real human participants in research by examining their internal consistency across different experimental settings, rather than just comparing their survey responses to human data.
Method: Developed a study to reveal agents’ internal states and examine their behavior in basic dialogue settings, testing behavioral hypotheses to assess consistency between conversation behavior and revealed internal states across different model families and sizes.
Result: Found significant internal inconsistencies in LLMs across all tested model families and sizes. While agents can generate responses matching human counterparts, they fail to maintain internal consistency in different experimental contexts.
Conclusion: LLM agents lack the internal consistency needed to accurately substitute for real human participants in research, representing a critical capability gap despite their ability to produce human-like survey responses.
Abstract: The impressive capabilities of Large Language Models (LLMs) have fueled the notion that synthetic agents can serve as substitutes for real participants in human-subject research. In an effort to evaluate the merits of this claim, social science researchers have largely focused on whether LLM-generated survey data corresponds to that of a human counterpart whom the LLM is prompted to represent. In contrast, we address a more fundamental question: Do agents maintain internal consistency, retaining similar behaviors when examined under different experimental settings? To this end, we develop a study designed to (a) reveal the agent’s internal state and (b) examine agent behavior in a basic dialogue setting. This design enables us to explore a set of behavioral hypotheses to assess whether an agent’s conversation behavior is consistent with what we would expect from their revealed internal state. Our findings on these hypotheses show significant internal inconsistencies in LLMs across model families and at differing model sizes. Most importantly, we find that, although agents may generate responses matching those of their human counterparts, they fail to be internally consistent, representing a critical gap in their capabilities to accurately substitute for real participants in human-subject research. Our simulation code and data are publicly accessible.
[220] RAGuard: A Novel Approach for in-context Safe Retrieval Augmented Generation for LLMs
Connor Walker, Koorosh Aslansefat, Mohammad Naveed Akram, Yiannis Papadopoulos
Main category: cs.AI
TL;DR: RAGuard is an enhanced RAG framework that integrates safety-critical documents with technical manuals using parallel queries and separate retrieval budgets to ensure both technical accuracy and safety coverage in offshore wind maintenance.
Details
Motivation: Conventional LLMs often fail in highly specialized or unexpected scenarios in offshore wind maintenance, where accuracy and safety are critical but current systems lack proper safety integration.
Method: RAGuard uses parallel queries to two indices (knowledge and safety) with separate retrieval budgets, plus SafetyClamp extension that fetches larger candidate pool with hard-clamping for exact safety slot guarantees.
Result: Both RAGuard extensions increased Safety Recall@K from almost 0% in standard RAG to over 50%, while maintaining Technical Recall above 60% across sparse, dense, and hybrid retrieval paradigms.
Conclusion: RAGuard and SafetyClamp establish a new standard for integrating safety assurance into LLM-powered decision support for critical maintenance contexts.
Abstract: Accuracy and safety are paramount in Offshore Wind (OSW) maintenance, yet conventional Large Language Models (LLMs) often fail when confronted with highly specialised or unexpected scenarios. We introduce RAGuard, an enhanced Retrieval-Augmented Generation (RAG) framework that explicitly integrates safety-critical documents alongside technical manuals. By issuing parallel queries to two indices and allocating separate retrieval budgets for knowledge and safety, RAGuard guarantees both technical depth and safety coverage. We further develop a SafetyClamp extension that fetches a larger candidate pool, “hard-clamping” exact slot guarantees to safety. We evaluate across sparse (BM25), dense (Dense Passage Retrieval) and hybrid retrieval paradigms, measuring Technical Recall@K and Safety Recall@K. Both proposed extensions of RAG show an increase in Safety Recall@K from almost 0% in RAG to more than 50% in RAGuard, while maintaining Technical Recall above 60%. These results demonstrate that RAGuard and SafetyClamp have the potential to establish a new standard for integrating safety assurance into LLM-powered decision support in critical maintenance contexts.
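The separate-budget retrieval at the heart of RAGuard is a small amount of glue code. The index interface and budget sizes below are illustrative assumptions:

```python
def raguard_retrieve(query, knowledge_index, safety_index, k_tech=6, k_safety=4):
    """Query both indices in parallel with fixed budgets so safety passages
    can never be crowded out of the context by technical matches."""
    technical = knowledge_index.search(query, k=k_tech)
    safety = safety_index.search(query, k=k_safety)
    return technical + safety
```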
[221] Leveraging LLM-Based Agents for Intelligent Supply Chain Planning
Yongzhi Qi, Jiaheng Yin, Jianshen Zhang, Dongyang Geng, Zhengyu Chen, Hao Hu, Wei Qi, Zuo-Jun Max Shen
Main category: cs.AI
TL;DR: LLM-based Supply Chain Planning Agent framework for e-commerce that understands domain knowledge, decomposes tasks, and generates evidence-based planning reports, deployed at JD.com with improved efficiency and accuracy.
Details
Motivation: Address the complex challenge of supply chain planning involving multiple entities and dynamic adjustments while ensuring interpretability, efficiency, and reliability in e-commerce operations.Method: Constructed a Supply Chain Planning Agent (SCPA) framework that leverages large language models to understand domain knowledge, comprehend operator needs, decompose tasks, and create/utilize tools for evidence-based planning.
Result: Successfully deployed in JD.com’s real-world scenario, effectively reducing labor costs while improving accuracy, stock availability, and other key supply chain metrics.
Conclusion: Demonstrates the feasibility and practical value of LLM-agent applications in supply chain management, providing a scalable solution for complex planning problems in e-commerce platforms.
Abstract: In supply chain management, planning is a critical concept. The movement of physical products across different categories, from suppliers to warehouse management, to sales, and logistics transporting them to customers, entails the involvement of many entities. It covers various aspects such as demand forecasting, inventory management, sales operations, and replenishment. How to collect relevant data from an e-commerce platform’s perspective, formulate long-term plans, and dynamically adjust them based on environmental changes, while ensuring interpretability, efficiency, and reliability, is a practical and challenging problem. In recent years, the development of AI technologies, especially the rapid progress of large language models, has provided new tools to address real-world issues. In this work, we construct a Supply Chain Planning Agent (SCPA) framework that can understand domain knowledge, comprehend the operator’s needs, decompose tasks, leverage or create new tools, and return evidence-based planning reports. We deploy this framework in JD.com’s real-world scenario, demonstrating the feasibility of LLM-agent applications in the supply chain. It effectively reduced labor and improved accuracy, stock availability, and other key metrics.
[222] What Would an LLM Do? Evaluating Policymaking Capabilities of Large Language Models
Pierre Le Coz, Jia An Liu, Debarun Bhattacharjya, Georgina Curto, Serge Stinckwich
Main category: cs.AI
TL;DR: LLMs show promise for social policymaking in homelessness alleviation when used with expert collaboration and contextual calibration.
Details
Motivation: Evaluate whether LLMs align with domain experts to inform social policymaking for homelessness, affecting 150M+ people worldwide.Method: Developed benchmark with decision scenarios across 4 geographies using Capability Approach framework. Created automated pipeline connecting policies to agent-based model for social impact simulation.
Result: Reveals promising potential for LLMs in social policymaking when used with responsible guardrails and expert collaboration.
Conclusion: LLMs can provide valuable alternative policies at scale for homelessness alleviation when properly calibrated with local domain expertise.
Abstract: Large language models (LLMs) are increasingly being adopted in high-stakes domains. Their capacity to process vast amounts of unstructured data, explore flexible scenarios, and handle a diversity of contextual factors can make them uniquely suited to provide new insights for the complexity of social policymaking. This article evaluates whether LLMs’ are aligned with domain experts (and among themselves) to inform social policymaking on the subject of homelessness alleviation - a challenge affecting over 150 million people worldwide. We develop a novel benchmark comprised of decision scenarios with policy choices across four geographies (South Bend, USA; Barcelona, Spain; Johannesburg, South Africa; Macau SAR, China). The policies in scope are grounded in the conceptual framework of the Capability Approach for human development. We also present an automated pipeline that connects the benchmarked policies to an agent-based model, and we explore the social impact of the recommended policies through simulated social scenarios. The paper results reveal promising potential to leverage LLMs for social policy making. If responsible guardrails and contextual calibrations are introduced in collaboration with local domain experts, LLMs can provide humans with valuable insights, in the form of alternative policies at scale.
[223] An Agentic Model Context Protocol Framework for Medical Concept Standardization
Jaerong Ahn, Andrew Wen, Nan Wang, Heling Jia, Zhiyi Yue, Sunyang Fu, Hongfang Liu
Main category: cs.AI
TL;DR: A zero-training system using Model Context Protocol (MCP) to prevent LLM hallucinations and enable accurate medical term mapping to OMOP CDM standards without requiring training or expert validation.
Details
Motivation: Mapping source medical terms to OMOP standard concepts is resource-intensive and error-prone, and while LLMs could help, their tendency to hallucinate makes them unsuitable for clinical use without extensive training and validation.Method: Developed a system based on Model Context Protocol (MCP) framework that allows LLMs to interact with external resources and tools for vocabulary lookups and structured reasoning, preventing hallucinations without requiring training.
Result: The system enables explainable mapping, significantly improves efficiency and accuracy with minimal effort, and provides real-time vocabulary lookups suitable for both exploratory and production environments.
Conclusion: The MCP-based system provides a practical solution for medical term mapping that prevents LLM hallucinations while maintaining efficiency and accuracy, making it suitable for immediate clinical deployment.
Abstract: The Observational Medical Outcomes Partnership (OMOP) common data model (CDM) provides a standardized representation of heterogeneous health data to support large-scale, multi-institutional research. One critical step in data standardization using OMOP CDM is the mapping of source medical terms to OMOP standard concepts, a procedure that is resource-intensive and error-prone. While large language models (LLMs) have the potential to facilitate this process, their tendency toward hallucination makes them unsuitable for clinical deployment without training and expert validation. Here, we developed a zero-training, hallucination-preventive mapping system based on the Model Context Protocol (MCP), a standardized and secure framework allowing LLMs to interact with external resources and tools. The system enables explainable mapping and significantly improves efficiency and accuracy with minimal effort. It provides real-time vocabulary lookups and structured reasoning outputs suitable for immediate use in both exploratory and production environments.
[224] A Multidimensional AI-powered Framework for Analyzing Tourist Perception in Historic Urban Quarters: A Case Study in Shanghai
Kaizhen Tan, Yufan Wu, Yuxuan Liu, Haoran Zeng
Main category: cs.AI
TL;DR: AI-powered framework analyzes tourist perception in historic urban quarters using social media data, combining visual analysis of photos and sentiment analysis of reviews to understand aesthetic preferences and satisfaction across multiple dimensions.
Details
Motivation: Understanding tourist perception is crucial for sustainable urban planning in historic quarters, as these areas preserve cultural heritage while serving tourism and daily life. Current approaches need integrated methods to decode how tourists perceive these environments.Method: Multimodal AI framework using social media data: 1) Semantic segmentation for visual focus areas in photos, 2) Color clustering for aesthetic preferences analysis, 3) Hybrid sentiment analysis (rule-based + multi-task BERT) for reviews across four dimensions (activities, built environment, services, business formats). Applied to 12 historic quarters in Shanghai.
Result: Revealed spatial variations in aesthetic appeal and emotional response. Found divergence between social media photo colors and real street views, indicating gaps between visual expectations and actual built environment. Identified tourist preferences and perceptual biases.
Conclusion: The framework provides an integrated, data-driven approach for understanding tourist perception, contributing to informed decision-making in tourism, heritage conservation, and public space design, rather than focusing on single technical innovations.
Abstract: Historic urban quarters play a vital role in preserving cultural heritage while serving as vibrant spaces for tourism and everyday life. Understanding how tourists perceive these environments is essential for sustainable, human-centered urban planning. This study proposes a multidimensional AI-powered framework for analyzing tourist perception in historic urban quarters using multimodal data from social media. Applied to twelve historic quarters in central Shanghai, the framework integrates focal point extraction, color theme analysis, and sentiment mining. Visual focus areas are identified from tourist-shared photos using a fine-tuned semantic segmentation model. To assess aesthetic preferences, dominant colors are extracted using a clustering method, and their spatial distribution across quarters is analyzed. Color themes are further compared between social media photos and real-world street views, revealing notable shifts. This divergence highlights potential gaps between visual expectations and the built environment, reflecting both stylistic preferences and perceptual bias. Tourist reviews are evaluated through a hybrid sentiment analysis approach combining a rule-based method and a multi-task BERT model. Satisfaction is assessed across four dimensions: tourist activities, built environment, service facilities, and business formats. The results reveal spatial variations in aesthetic appeal and emotional response. Rather than focusing on a single technical innovation, this framework offers an integrated, data-driven approach to decoding tourist perception and contributes to informed decision-making in tourism, heritage conservation, and the design of aesthetically engaging public spaces.
[225] Continuous Monitoring of Large-Scale Generative AI via Deterministic Knowledge Graph Structures
Kishor Datta Gupta, Mohd Ariful Haque, Hasmot Ali, Marufa Kamal, Syed Bahauddin Alam, Mohammad Ashiqur Rahman
Main category: cs.AI
TL;DR: Proposes a systematic methodology using deterministic and LLM-generated Knowledge Graphs to continuously monitor and evaluate Generative AI reliability through real-time structural and semantic deviation analysis.
Details
Motivation: Address reliability concerns in Generative AI models (hallucinations, semantic drift, biases) and overcome limitations of subjective human evaluation methods that lack scalability and transparency.Method: Constructs two parallel Knowledge Graphs: deterministic KG using rule-based methods and predefined ontologies, and LLM-generated KG from real-time textual data streams. Uses KG metrics (ICR, IPR, CI) to quantify deviations and establishes dynamic anomaly thresholds for automated real-time monitoring.
Result: Develops an automated framework that continuously computes deviations between deterministic and LLM-generated KGs, enabling proactive identification of semantic anomalies and hallucinations.
Conclusion: Provides a robust, scalable, and transparent evaluation framework for Generative AI reliability through structured, metric-driven comparison between deterministic and dynamically generated knowledge representations.
Abstract: Generative AI (GEN AI) models have revolutionized diverse application domains but present substantial challenges due to reliability concerns, including hallucinations, semantic drift, and inherent biases. These models typically operate as black-boxes, complicating transparent and objective evaluation. Current evaluation methods primarily depend on subjective human assessment, limiting scalability, transparency, and effectiveness. This research proposes a systematic methodology using deterministic and Large Language Model (LLM)-generated Knowledge Graphs (KGs) to continuously monitor and evaluate GEN AI reliability. We construct two parallel KGs: (i) a deterministic KG built using explicit rule-based methods, predefined ontologies, domain-specific dictionaries, and structured entity-relation extraction rules, and (ii) an LLM-generated KG dynamically derived from real-time textual data streams such as live news articles. Utilizing real-time news streams ensures authenticity, mitigates biases from repetitive training, and prevents adaptive LLMs from bypassing predefined benchmarks through feedback memorization. To quantify structural deviations and semantic discrepancies, we employ several established KG metrics, including Instantiated Class Ratio (ICR), Instantiated Property Ratio (IPR), and Class Instantiation (CI). An automated real-time monitoring framework continuously computes deviations between deterministic and LLM-generated KGs. By establishing dynamic anomaly thresholds based on historical structural metric distributions, our method proactively identifies and flags significant deviations, thus promptly detecting semantic anomalies or hallucinations. This structured, metric-driven comparison between deterministic and dynamically generated KGs delivers a robust and scalable evaluation framework.
[226] Expedition & Expansion: Leveraging Semantic Representations for Goal-Directed Exploration in Continuous Cellular Automata
Sina Khajehabdollahi, Gautier Hamon, Marko Cvjetko, Pierre-Yves Oudeyer, Clément Moulin-Frier, Cédric Colas
Main category: cs.AI
TL;DR: E&E is a hybrid exploration strategy combining local novelty search with VLM-generated linguistic goals to discover diverse patterns in continuous cellular automata, outperforming traditional methods.
Details
Motivation: Traditional novelty search methods plateau in high-dimensional behavioral spaces, failing to reach distant unexplored regions in continuous cellular automata.Method: Alternates between local novelty-driven expansions and goal-directed expeditions using Vision-Language Models to generate linguistic descriptions of hypothetical patterns.
Result: E&E consistently uncovers more diverse solutions than existing methods in Flow Lenia CA, with expedition-originated solutions disproportionately influencing long-term exploration.
Conclusion: E&E effectively breaks through local novelty boundaries and explores behavioral landscapes in human-aligned, interpretable ways, offering promise for open-ended exploration.
Abstract: Discovering diverse visual patterns in continuous cellular automata (CA) is challenging due to the vastness and redundancy of high-dimensional behavioral spaces. Traditional exploration methods like Novelty Search (NS) expand locally by mutating known novel solutions but often plateau when local novelty is exhausted, failing to reach distant, unexplored regions. We introduce Expedition and Expansion (E&E), a hybrid strategy where exploration alternates between local novelty-driven expansions and goal-directed expeditions. During expeditions, E&E leverages a Vision-Language Model (VLM) to generate linguistic goals–descriptions of interesting but hypothetical patterns that drive exploration toward uncharted regions. By operating in semantic spaces that align with human perception, E&E both evaluates novelty and generates goals in conceptually meaningful ways, enhancing the interpretability and relevance of discovered behaviors. Tested on Flow Lenia, a continuous CA known for its rich, emergent behaviors, E&E consistently uncovers more diverse solutions than existing exploration methods. A genealogical analysis further reveals that solutions originating from expeditions disproportionately influence long-term exploration, unlocking new behavioral niches that serve as stepping stones for subsequent search. These findings highlight E&E’s capacity to break through local novelty boundaries and explore behavioral landscapes in human-aligned, interpretable ways, offering a promising template for open-ended exploration in artificial life and beyond.
[227] FaMA: LLM-Empowered Agentic Assistant for Consumer-to-Consumer Marketplace
Yineng Yan, Xidong Wang, Jin Seng Cheng, Ran Hu, Wentao Guan, Nahid Farahmand, Hengte Lin, Yue Li
Main category: cs.AI
TL;DR: LLM-powered agentic assistant for C2C e-commerce that replaces complex GUI interactions with natural language commands, achieving 98% task success rate and 2x speedup.
Details
Motivation: Simplify complex GUI navigation on C2C platforms that makes marketplace interactions time-consuming for both buyers and sellers.Method: Facebook Marketplace Assistant (FaMA) - an LLM-powered agent that interprets natural language commands to automate high-friction workflows like listing updates, bulk messaging, and conversational search.
Result: 98% task success rate on complex marketplace tasks and up to 2x speedup in interaction time compared to traditional interfaces.
Conclusion: Agentic conversational paradigm provides lightweight, accessible alternative to traditional app interfaces, enabling more efficient marketplace management.
Abstract: The emergence of agentic AI, powered by Large Language Models (LLMs), marks a paradigm shift from reactive generative systems to proactive, goal-oriented autonomous agents capable of sophisticated planning, memory, and tool use. This evolution presents a novel opportunity to address long-standing challenges in complex digital environments. Core tasks on Consumer-to-Consumer (C2C) e-commerce platforms often require users to navigate complex Graphical User Interfaces (GUIs), making the experience time-consuming for both buyers and sellers. This paper introduces a novel approach to simplify these interactions through an LLM-powered agentic assistant. This agent functions as a new, conversational entry point to the marketplace, shifting the primary interaction model from a complex GUI to an intuitive AI agent. By interpreting natural language commands, the agent automates key high-friction workflows. For sellers, this includes simplified updating and renewal of listings, and the ability to send bulk messages. For buyers, the agent facilitates a more efficient product discovery process through conversational search. We present the architecture for Facebook Marketplace Assistant (FaMA), arguing that this agentic, conversational paradigm provides a lightweight and more accessible alternative to traditional app interfaces, allowing users to manage their marketplace activities with greater efficiency. Experiments show FaMA achieves a 98% task success rate on solving complex tasks on the marketplace and enables up to a 2x speedup on interaction time.
[228] A Foundation Model for Chest X-ray Interpretation with Grounded Reasoning via Online Reinforcement Learning
Qika Lin, Yifan Zhu, Bin Pu, Ling Huang, Haoran Luo, Jingying Ma, Zhen Peng, Tianzhe Zhao, Fangzhi Xu, Jian Zhang, Kai He, Zhonghong Ou, Swapnil Mishra, Mengling Feng
Main category: cs.AI
TL;DR: DeepMedix-R1 is a medical foundation model for chest X-ray interpretation that produces both answers and transparent reasoning steps tied to image regions, achieving superior performance in report generation and visual question answering tasks.
Details
Motivation: Current medical foundation models generate answers in a black-box manner without transparent reasoning processes, which hinders their practical clinical deployment due to lack of interpretability.Method: Sequential training pipeline: 1) fine-tuning on curated CXR instruction data, 2) exposure to synthetic reasoning samples for cold-start reasoning, 3) online reinforcement learning refinement to enhance reasoning quality and generation performance.
Result: Substantial improvements in report generation (14.54% over LLaVA-Rad, 31.32% over MedGemma) and visual question answering (57.75% over MedGemma, 23.06% over CheXagent). Expert review shows superior interpretability (0.7416 vs 0.2584 preference over Qwen2.5-VL-7B).
Conclusion: DeepMedix-R1 advances medical foundation models toward holistic, transparent, and clinically actionable modeling for CXR interpretation with improved performance and interpretability.
Abstract: Medical foundation models (FMs) have shown tremendous promise amid the rapid advancements in artificial intelligence (AI) technologies. However, current medical FMs typically generate answers in a black-box manner, lacking transparent reasoning processes and locally grounded interpretability, which hinders their practical clinical deployments. To this end, we introduce DeepMedix-R1, a holistic medical FM for chest X-ray (CXR) interpretation. It leverages a sequential training pipeline: initially fine-tuned on curated CXR instruction data to equip with fundamental CXR interpretation capabilities, then exposed to high-quality synthetic reasoning samples to enable cold-start reasoning, and finally refined via online reinforcement learning to enhance both grounded reasoning quality and generation performance. Thus, the model produces both an answer and reasoning steps tied to the image’s local regions for each query. Quantitative evaluation demonstrates substantial improvements in report generation (e.g., 14.54% and 31.32% over LLaVA-Rad and MedGemma) and visual question answering (e.g., 57.75% and 23.06% over MedGemma and CheXagent) tasks. To facilitate robust assessment, we propose Report Arena, a benchmarking framework using advanced language models to evaluate answer quality, further highlighting the superiority of DeepMedix-R1. Expert review of generated reasoning steps reveals greater interpretability and clinical plausibility compared to the established Qwen2.5-VL-7B model (0.7416 vs. 0.2584 overall preference). Collectively, our work advances medical FM development toward holistic, transparent, and clinically actionable modeling for CXR interpretation.
[229] Handling Infinite Domain Parameters in Planning Through Best-First Search with Delayed Partial Expansions
Ángel Aso-Mollar, Diego Aineto, Enrico Scala, Eva Onaindia
Main category: cs.AI
TL;DR: A novel best-first heuristic search algorithm that explicitly treats control parameters as decision points rather than constraints, using delayed partial expansion to efficiently handle infinite decision spaces in automated planning.
Details
Motivation: Existing approaches treat control parameters as embedded constraints rather than true decision points in the search space, limiting their effectiveness in handling continuous numeric decision variables.Method: Developed a best-first heuristic search algorithm with delayed partial expansion, where states are incrementally expanded by generating subsets of successors to handle infinite decision spaces defined by control parameters.
Result: The algorithm proves completeness in the limit under certain conditions and demonstrates competitive performance compared to existing approaches for planning problems with control parameters.
Conclusion: Explicitly treating control parameters as decision points within a systematic search scheme with delayed partial expansion provides an efficient and competitive alternative to constraint-based approaches.
Abstract: In automated planning, control parameters extend standard action representations through the introduction of continuous numeric decision variables. Existing state-of-the-art approaches have primarily handled control parameters as embedded constraints alongside other temporal and numeric restrictions, and thus have implicitly treated them as additional constraints rather than as decision points in the search space. In this paper, we propose an efficient alternative that explicitly handles control parameters as true decision points within a systematic search scheme. We develop a best-first, heuristic search algorithm that operates over infinite decision spaces defined by control parameters and prove a notion of completeness in the limit under certain conditions. Our algorithm leverages the concept of delayed partial expansion, where a state is not fully expanded but instead incrementally expands a subset of its successors. Our results demonstrate that this novel search algorithm is a competitive alternative to existing approaches for solving planning problems involving control parameters.
[230] World Model Implanting for Test-time Adaptation of Embodied Agents
Minjong Yoo, Jinwoo Jang, Sihyung Yoon, Honguk Woo
Main category: cs.AI
TL;DR: WorMI framework combines LLMs with domain-specific world models through test-time composition for robust cross-domain adaptation in embodied AI agents.
Details
Motivation: Enable embodied agents to adapt to novel domains without extensive data collection or retraining, addressing the challenge of domain adaptation in AI systems.Method: Prototype-based world model retrieval with trajectory-based abstract representation matching, plus world-wise compound attention to integrate knowledge and align representations.
Result: Superior zero-shot and few-shot performance on VirtualHome and ALFWorld benchmarks compared to other LLM-based approaches across unseen domains.
Conclusion: WorMI demonstrates strong potential for scalable real-world deployment where adaptability and data efficiency are essential in embodied agent scenarios.
Abstract: In embodied AI, a persistent challenge is enabling agents to robustly adapt to novel domains without requiring extensive data collection or retraining. To address this, we present a world model implanting framework (WorMI) that combines the reasoning capabilities of large language models (LLMs) with independently learned, domain-specific world models through test-time composition. By allowing seamless implantation and removal of the world models, the embodied agent’s policy achieves and maintains cross-domain adaptability. In the WorMI framework, we employ a prototype-based world model retrieval approach, utilizing efficient trajectory-based abstract representation matching, to incorporate relevant models into test-time composition. We also develop a world-wise compound attention method that not only integrates the knowledge from the retrieved world models but also aligns their intermediate representations with the reasoning model’s representation within the agent’s policy. This framework design effectively fuses domain-specific knowledge from multiple world models, ensuring robust adaptation to unseen domains. We evaluate our WorMI on the VirtualHome and ALFWorld benchmarks, demonstrating superior zero-shot and few-shot performance compared to several LLM-based approaches across a range of unseen domains. These results highlight the frameworks potential for scalable, real-world deployment in embodied agent scenarios where adaptability and data efficiency are essential.
[231] Meta-Policy Reflexion: Reusable Reflective Memory and Rule Admissibility for Resource-Efficient LLM Agent
Chunlong Wu, Zhibo Qu
Main category: cs.AI
TL;DR: Meta-Policy Reflexion (MPR) is a hybrid framework that consolidates LLM-generated reflections into reusable meta-policy memory to improve agent performance without weight updates.
Details
Motivation: LLM agents often fail repeatedly, explore inefficiently, and lack cross-task adaptability. Existing reflective strategies produce ephemeral task-specific traces, while RL alternatives require heavy computation.Method: MPR creates structured predicate-like Meta-Policy Memory (MPM) from reflections and applies it through soft memory-guided decoding and hard rule admissibility checks to enforce constraints and improve action selection.
Result: Empirical results show consistent gains in execution accuracy and robustness compared to Reflexion baselines, with rule admissibility further improving stability.
Conclusion: MPR successfully externalizes reusable corrective knowledge without model updates, enforces domain constraints, and retains language-based reflection adaptability, with potential for multimodal and multi-agent extensions.
Abstract: Large language model (LLM) agents achieve impressive single-task performance but commonly exhibit repeated failures, inefficient exploration, and limited cross-task adaptability. Existing reflective strategies (e.g., Reflexion, ReAct) improve per-episode behavior but typically produce ephemeral, task-specific traces that are not reused across tasks. Reinforcement-learning based alternatives can produce transferable policies but require substantial parameter updates and compute. In this work we introduce Meta-Policy Reflexion (MPR): a hybrid framework that consolidates LLM-generated reflections into a structured, predicate-like Meta-Policy Memory (MPM) and applies that memory at inference time through two complementary mechanisms soft memory-guided decoding and hard rule admissibility checks(HAC). MPR (i) externalizes reusable corrective knowledge without model weight updates, (ii) enforces domain constraints to reduce unsafe or invalid actions, and (iii) retains the adaptability of language-based reflection. We formalize the MPM representation, present algorithms for update and decoding, and validate the approach in a text-based agent environment following the experimental protocol described in the provided implementation (AlfWorld-based). Empirical results reported in the supplied material indicate consistent gains in execution accuracy and robustness when compared to Reflexion baselines; rule admissibility further improves stability. We analyze mechanisms that explain these gains, discuss scalability and failure modes, and outline future directions for multimodal and multi?agent extensions.
[232] AutoPBO: LLM-powered Optimization for Local Search PBO Solvers
Jinyuan Li, Yi Chu, Yiwen Sun, Mengchuan Zou, Shaowei Cai
Main category: cs.AI
TL;DR: AutoPBO is an LLM-powered framework that automatically enhances PBO local search solvers, showing significant improvements over previous approaches while maintaining competitive performance against state-of-the-art competitors.
Details
Motivation: Local search solvers for Pseudo-Boolean Optimization require significant expert effort and manual tuning, while LLMs have shown potential in automating algorithm design but haven't been applied to PBO solver optimization.Method: AutoPBO uses Large Language Models to automatically enhance PBO local search solvers through a novel framework that optimizes internal heuristics without manual tuning.
Result: AutoPBO demonstrates significant improvements over previous local search approaches and maintains competitive performance compared to 6 state-of-the-art competitors across 4 public benchmarks including real-world, competition, and crafted benchmarks.
Conclusion: AutoPBO offers a promising approach to automating local search solver design for Pseudo-Boolean Optimization problems, successfully leveraging LLMs to enhance solver performance without manual expert intervention.
Abstract: Pseudo-Boolean Optimization (PBO) provides a powerful framework for modeling combinatorial problems through pseudo-Boolean (PB) constraints. Local search solvers have shown excellent performance in PBO solving, and their efficiency is highly dependent on their internal heuristics to guide the search. Still, their design often requires significant expert effort and manual tuning in practice. While Large Language Models (LLMs) have demonstrated potential in automating algorithm design, their application to optimizing PBO solvers remains unexplored. In this work, we introduce AutoPBO, a novel LLM-powered framework to automatically enhance PBO local search solvers. We conduct experiments on a broad range of four public benchmarks, including one real-world benchmark, a benchmark from PB competition, an integer linear programming optimization benchmark, and a crafted combinatorial benchmark, to evaluate the performance improvement achieved by AutoPBO and compare it with six state-of-the-art competitors, including two local search PBO solvers NuPBO and OraSLS, two complete PB solvers PBO-IHS and RoundingSat, and two mixed integer programming (MIP) solvers Gurobi and SCIP. AutoPBO demonstrates significant improvements over previous local search approaches, while maintaining competitive performance compared to state-of-the-art competitors. The results suggest that AutoPBO offers a promising approach to automating local search solver design.
[233] CoT-Space: A Theoretical Framework for Internal Slow-Thinking via Reinforcement Learning
Zeyu Gan, Hao Yi, Yong Liu
Main category: cs.AI
TL;DR: CoT-Space is a theoretical framework that reframes LLM reasoning as continuous optimization in semantic space, explaining optimal CoT length convergence through underfitting-overfitting trade-offs.
Details
Motivation: Traditional token-level RL frameworks don't align with reasoning-level nature of multi-step thought processes like Chain-of-Thought, creating a theoretical gap.Method: Introduces CoT-Space framework that transforms LLM reasoning from discrete token-prediction to continuous optimization in reasoning-level semantic space, analyzed from noise and risk perspectives.
Result: Demonstrates convergence to optimal CoT length is natural consequence of underfitting-overfitting trade-off, with extensive experiments providing empirical validation.
Conclusion: Provides coherent explanation for empirical phenomena like overthinking and offers solid theoretical foundation for developing more effective reasoning agents.
Abstract: Reinforcement Learning (RL) has become a pivotal approach for enhancing the reasoning capabilities of Large Language Models (LLMs). However, a significant theoretical gap persists, as traditional token-level RL frameworks fail to align with the reasoning-level nature of complex, multi-step thought processes like Chain-of-Thought (CoT). To address this challenge, we introduce CoT-Space, a novel theoretical framework that recasts LLM reasoning from a discrete token-prediction task to an optimization process within a continuous, reasoning-level semantic space. By analyzing this process from both a noise perspective and a risk perspective, we demonstrate that the convergence to an optimal CoT length is a natural consequence of the fundamental trade-off between underfitting and overfitting. Furthermore, extensive experiments provide strong empirical validation for our theoretical findings. Our framework not only provides a coherent explanation for empirical phenomena such as overthinking but also offers a solid theoretical foundation to guide the future development of more effective and generalizable reasoning agents.
[234] Oruga: An Avatar of Representational Systems Theory
Daniel Raggi, Gem Stapleton, Mateja Jamnik, Aaron Stockdill, Grecia Garcia Garcia, Peter C-H. Cheng
Main category: cs.AI
TL;DR: Oruga is an implementation of Representational Systems Theory that enables flexible representation transformations through structure transfer, aiming to make machines more compatible with human cognitive processes.
Details
Motivation: To harness human-like flexible representation capabilities (drawing diagrams, changing representations, creative analogies) and endow machines with similar abilities to make them more compatible with human use.Method: Developed Oruga system with core data structures based on RST concepts, a communication language, and an engine using structure transfer method for producing representation transformations.
Result: Created a functional implementation of RST with core components and language that can execute representation transformations through structure transfer.
Conclusion: Oruga successfully implements key aspects of Representational Systems Theory, providing a foundation for machines to perform flexible representation transformations similar to human cognitive processes.
Abstract: Humans use representations flexibly. We draw diagrams, change representations and exploit creative analogies across different domains. We want to harness this kind of power and endow machines with it to make them more compatible with human use. Previously we developed Representational Systems Theory (RST) to study the structure and transformations of representations. In this paper we present Oruga (caterpillar in Spanish; a symbol of transformation), an implementation of various aspects of RST. Oruga consists of a core of data structures corresponding to concepts in RST, a language for communicating with the core, and an engine for producing transformations using a method we call structure transfer. In this paper we present an overview of the core and language of Oruga, with a brief example of the kind of transformation that structure transfer can execute.
[235] Intermediate Languages Matter: Formal Languages and LLMs affect Neurosymbolic Reasoning
Alexander Beiser, David Penz, Nysret Musliu
Main category: cs.AI
TL;DR: The choice of formal language significantly impacts neurosymbolic LLM reasoning performance, affecting both syntactic and semantic capabilities across different models.
Details
Motivation: While neurosymbolic LLM reasoning shows promise by combining LLMs as translators with symbolic solvers, the factors contributing to its success remain unclear, particularly the overlooked role of formal language selection.Method: Comparative analysis of four formal languages across three datasets and seven different LLMs to evaluate how formal language choice affects reasoning capabilities.
Result: The choice of formal language significantly impacts both syntactic and semantic reasoning capabilities in neurosymbolic LLM systems, with varying effects observed across different LLMs.
Conclusion: Formal language selection is a critical factor in neurosymbolic reasoning that should be carefully considered, as it substantially influences reasoning performance and effectiveness across different language models.
Abstract: Large language models (LLMs) achieve astonishing results on a wide range of tasks. However, their formal reasoning ability still lags behind. A promising approach is Neurosymbolic LLM reasoning. It works by using LLMs as translators from natural to formal languages and symbolic solvers for deriving correct results. Still, the contributing factors to the success of Neurosymbolic LLM reasoning remain unclear. This paper demonstrates that one previously overlooked factor is the choice of the formal language. We introduce the intermediate language challenge: selecting a suitable formal language for neurosymbolic reasoning. By comparing four formal languages across three datasets and seven LLMs, we show that the choice of formal language affects both syntactic and semantic reasoning capabilities. We also discuss the varying effects across different LLMs.
[236] Hybrid Reinforcement Learning and Search for Flight Trajectory Planning
Alberto Luise, Michele Lombardi, Florent Teichteil Koenigsbuch
Main category: cs.AI
TL;DR: RL-guided path planning for emergency flight route optimization that speeds up computation by 50% while maintaining near-optimal fuel efficiency (within 1% deviation).
Details
Motivation: Fast route re-calculation is crucial for airliners during emergencies, requiring accelerated path optimization without sacrificing fuel efficiency.Method: Train RL agent to pre-compute near-optimal paths using location and atmospheric data, then use these paths to constrain traditional path planning solvers and reduce search space.
Result: Empirical tests with Airbus performance models show fuel consumption nearly identical to unconstrained solver (within 1% deviation) while computation speed improves by up to 50%.
Conclusion: Combining RL pre-computation with traditional solvers effectively accelerates emergency route optimization while maintaining near-optimal fuel efficiency, though global optimality is not guaranteed.
Abstract: This paper explores the combination of Reinforcement Learning (RL) and search-based path planners to speed up the optimization of flight paths for airliners, where in case of emergency a fast route re-calculation can be crucial. The fundamental idea is to train an RL Agent to pre-compute near-optimal paths based on location and atmospheric data and use those at runtime to constrain the underlying path planning solver and find a solution within a certain distance from the initial guess. The approach effectively reduces the size of the solver’s search space, significantly speeding up route optimization. Although global optimality is not guaranteed, empirical results conducted with Airbus aircraft’s performance models show that fuel consumption remains nearly identical to that of an unconstrained solver, with deviations typically within 1%. At the same time, computation speed can be improved by up to 50% as compared to using a conventional solver alone.
[237] Analysis of Bluffing by DQN and CFR in Leduc Hold’em Poker
Tarik Zaciragic, Aske Plaat, K. Joost Batenburg
Main category: cs.AI
TL;DR: Study examines bluffing behavior in poker AI algorithms DQN and CFR, finding both exhibit bluffing but in different ways with similar success rates.
Details
Motivation: While bluffing is essential in human poker play, most computer poker research focuses on win rates rather than bluffing behavior. This paper investigates whether popular AI algorithms actually bluff.Method: Designed experiment where DQN (reinforcement learning) and CFR (game theory) agents played against each other in Leduc Hold’em poker while logging their actions to analyze bluffing behavior.
Result: Both DQN and CFR exhibit bluffing behavior but in different ways. Although they attempt bluffs at different rates, the percentage of successful bluffs (where opponent folds) is roughly the same.
Conclusion: Bluffing is an essential aspect of the game itself rather than being algorithm-specific. Future work should explore different bluffing styles and full poker games.
Abstract: In the game of poker, being unpredictable, or bluffing, is an essential skill. When humans play poker, they bluff. However, most works on computer-poker focus on performance metrics such as win rates, while bluffing is overlooked. In this paper we study whether two popular algorithms, DQN (based on reinforcement learning) and CFR (based on game theory), exhibit bluffing behavior in Leduc Hold’em, a simplified version of poker. We designed an experiment where we let the DQN and CFR agent play against each other while we log their actions. We find that both DQN and CFR exhibit bluffing behavior, but they do so in different ways. Although both attempt to perform bluffs at different rates, the percentage of successful bluffs (where the opponent folds) is roughly the same. This suggests that bluffing is an essential aspect of the game, not of the algorithm. Future work should look at different bluffing styles and at the full game of poker. Code at https://github.com/TarikZ03/Bluffing-by-DQN-and-CFR-in-Leduc-Hold-em-Poker-Codebase.
[238] The human biological advantage over AI
William Stewart
Main category: cs.AI
TL;DR: AI may surpass human capabilities but cannot match human emotional connection to reality through the central nervous system, making humans uniquely qualified for universal leadership.
Details
Motivation: To explore whether AI systems could become superior to humans and assume leadership, examining the fundamental differences between biological and artificial intelligence.Method: Philosophical analysis comparing human central nervous system capabilities (emotional experience, consequence understanding) with potential AI capabilities.
Result: AI may achieve superior cognitive abilities but cannot develop the emotional understanding and ethical foundation that human biology provides through the CNS.
Conclusion: Human DNA-based biological systems with central nervous systems provide the essential foundation for ethical leadership, making humans uniquely qualified to lead despite AI’s potential cognitive superiority.
Abstract: Recent advances in AI raise the possibility that AI systems will one day be able to do anything humans can do, only better. If artificial general intelligence (AGI) is achieved, AI systems may be able to understand, reason, problem solve, create, and evolve at a level and speed that humans will increasingly be unable to match, or even understand. These possibilities raise a natural question as to whether AI will eventually become superior to humans, a successor “digital species”, with a rightful claim to assume leadership of the universe. However, a deeper consideration suggests the overlooked differentiator between human beings and AI is not the brain, but the central nervous system (CNS), providing us with an immersive integration with physical reality. It is our CNS that enables us to experience emotion including pain, joy, suffering, and love, and therefore to fully appreciate the consequences of our actions on the world around us. And that emotional understanding of the consequences of our actions is what is required to be able to develop sustainable ethical systems, and so be fully qualified to be the leaders of the universe. A CNS cannot be manufactured or simulated; it must be grown as a biological construct. And so, even the development of consciousness will not be sufficient to make AI systems superior to humans. AI systems may become more capable than humans on almost every measure and transform our society. However, the best foundation for leadership of our universe will always be DNA, not silicon.
[239] Towards an Action-Centric Ontology for Cooking Procedures Using Temporal Graphs
Aarush Kumbhakern, Saransh Kumar Gupta, Lipika Dey, Partha Pratim Das
Main category: cs.AI
TL;DR: A domain-specific language for representing recipes as directed action graphs, enabling precise modeling of complex cooking procedures with temporal relationships and compositional structure.
Details
Motivation: Formalizing cooking procedures is challenging due to their complexity and ambiguity. Current recipe representations lack the precision needed for automated analysis and execution.Method: Developed an extensible domain-specific language that represents recipes as directed action graphs, capturing processes, transfers, environments, concurrency, and compositional structure.
Result: Initial manual evaluation on a full English breakfast recipe demonstrated the DSL’s expressiveness and suitability for automated recipe analysis and execution.
Conclusion: This work provides initial steps towards an action-centric ontology for cooking using temporal graphs, enabling structured machine understanding and scalable automation of culinary processes in both home and professional settings.
Abstract: Formalizing cooking procedures remains a challenging task due to their inherent complexity and ambiguity. We introduce an extensible domain-specific language for representing recipes as directed action graphs, capturing processes, transfers, environments, concurrency, and compositional structure. Our approach enables precise, modular modeling of complex culinary workflows. Initial manual evaluation on a full English breakfast recipe demonstrates the DSL’s expressiveness and suitability for future automated recipe analysis and execution. This work represents initial steps towards an action-centric ontology for cooking, using temporal graphs to enable structured machine understanding, precise interpretation, and scalable automation of culinary processes - both in home kitchens and professional culinary settings.
[240] Domain size asymptotics for Markov logic networks
Vera Koponen
Main category: cs.AI
TL;DR: Analysis of Markov logic networks’ asymptotic behavior as domain size increases, focusing on three specific MLN types and their limit distributions, showing differences from uniform distributions and incomparability with lifted Bayesian networks.
Details
Motivation: To understand how Markov logic networks behave as domain sizes approach infinity, examining different types of constraints and their effects on probability distributions over possible worlds.Method: Study three concrete MLN examples: (1) quantifier-free MLNs with unary relations, (2) MLNs favoring graphs with fewer triangles/cliques, (3) MLNs favoring graphs with fewer high-degree vertices. Analyze limit behaviors and compare with uniform distributions and lifted Bayesian networks.
Result: Different MLN constraints lead to varied asymptotic behaviors. Quantifier-free MLNs show complete characterization of limit behaviors. Triangle-reducing MLNs yield approximate 0-1 laws. Degree-constraining MLNs demonstrate weight-dependent behaviors. MLNs concentrate probability mass differently than uniform distributions and are asymptotically incomparable with lifted Bayesian networks.
Conclusion: MLNs exhibit diverse asymptotic behaviors depending on constraint types, with weights sometimes influencing limits. They concentrate probability differently from uniform distributions and are fundamentally different from lifted Bayesian networks in large-domain limit behavior.
Abstract: A Markov logic network (MLN) determines a probability distribution on the set
of structures, or possible worlds'', with an arbitrary finite domain. We study the properties of such distributions as the domain size tends to infinity. Three types of concrete examples of MLNs will be considered, and the properties of random structures with domain sizes tending to infinity will be studied: (1) Arbitrary quantifier-free MLNs over a language with only one relation symbol which has arity 1. In this case we give a pretty complete characterization of the possible limit behaviours of random structures. (2) An MLN that favours graphs with fewer triangles (or more generally, fewer k-cliques). As a corollary of the analysis a
$\delta$-approximate 0-1 law’’
for first-order logic is obtained. (3) An MLN that favours graphs with fewer
vertices with degree higher than a fixed (but arbitrary) number. The analysis
shows that depending on which ``soft constraints’’ an MLN uses the limit
behaviour of random structures can be quite different, and the weights of the
soft constraints may, or may not, have influence on the limit behaviour. It
will also be demonstrated, using (1), that quantifier-free MLNs and lifted
Bayesian networks (in a broad sense) are asymptotically incomparable, roughly
meaning that there is a sequence of distributions on possible worlds with
increasing domain sizes that can be defined by one of the formalisms but not
even approximated by the other. In a rather general context it is also shown
that on large domains the distribution determined by an MLN concentrates almost
all its probability mass on a totally different part of the space of possible
worlds than the uniform distribution does.
[241] Evaluating Quality of Gaming Narratives Co-created with AI
Arturo Valdivia, Paolo Burelli
Main category: cs.AI
TL;DR: A structured methodology using Delphi study with narrative experts to evaluate AI-generated game narratives, mapping quality dimensions to Kano model for player satisfaction insights.
Details
Motivation: To provide game developers with a systematic way to evaluate and prioritize quality aspects when co-creating game narratives with generative AI.Method: Leverages Delphi study structure with narrative design experts panel, synthesizes story quality dimensions from literature and expert insights, maps them into Kano model framework.
Result: Provides insights on how different quality dimensions impact player satisfaction, enabling prioritization of narrative quality aspects.
Conclusion: The methodology can effectively inform game developers about which narrative quality dimensions to prioritize when working with AI-generated content to maximize player satisfaction.
Abstract: This paper proposes a structured methodology to evaluate AI-generated game narratives, leveraging the Delphi study structure with a panel of narrative design experts. Our approach synthesizes story quality dimensions from literature and expert insights, mapping them into the Kano model framework to understand their impact on player satisfaction. The results can inform game developers on prioritizing quality aspects when co-creating game narratives with generative AI.
[242] EvoEmo: Towards Evolved Emotional Policies for LLM Agents in Multi-Turn Negotiation
Yunbo Long, Liming Xu, Lukas Beckenbauer, Yuhan Liu, Alexandra Brintrup
Main category: cs.AI
TL;DR: EvoEmo is an evolutionary reinforcement learning framework that optimizes dynamic emotional expression in LLM negotiations, outperforming baseline strategies with higher success rates and efficiency.
Details
Motivation: Existing LLM agents overlook the functional role of emotions in negotiations, generating passive emotional responses that make them vulnerable to manipulation and strategic exploitation by adversarial counterparts.Method: Models emotional state transitions as a Markov Decision Process and employs population-based genetic optimization to evolve high-reward emotion policies across diverse negotiation scenarios.
Result: Extensive experiments show EvoEmo consistently outperforms both vanilla strategies and fixed-emotion strategies, achieving higher success rates, higher efficiency, and increased buyer savings.
Conclusion: The findings highlight the importance of adaptive emotional expression in enabling more effective LLM agents for multi-turn negotiation.
Abstract: Recent research on Chain-of-Thought (CoT) reasoning in Large Language Models (LLMs) has demonstrated that agents can engage in \textit{complex}, \textit{multi-turn} negotiations, opening new avenues for agentic AI. However, existing LLM agents largely overlook the functional role of emotions in such negotiations, instead generating passive, preference-driven emotional responses that make them vulnerable to manipulation and strategic exploitation by adversarial counterparts. To address this gap, we present EvoEmo, an evolutionary reinforcement learning framework that optimizes dynamic emotional expression in negotiations. EvoEmo models emotional state transitions as a Markov Decision Process and employs population-based genetic optimization to evolve high-reward emotion policies across diverse negotiation scenarios. We further propose an evaluation framework with two baselines – vanilla strategies and fixed-emotion strategies – for benchmarking emotion-aware negotiation. Extensive experiments and ablation studies show that EvoEmo consistently outperforms both baselines, achieving higher success rates, higher efficiency, and increased buyer savings. This findings highlight the importance of adaptive emotional expression in enabling more effective LLM agents for multi-turn negotiation.
[243] Improving Robustness of AlphaZero Algorithms to Test-Time Environment Changes
Isidoro Tamassia, Wendelin Böhmer
Main category: cs.AI
TL;DR: AlphaZero framework adaptation for changed test environments with simple modifications that boost performance even with low planning budget.
Details
Motivation: AlphaZero assumes static environments from training to test, limiting real-world applicability where environments may change. This paper addresses deployment in potentially changed test environments.Method: Combination of simple modifications to the standard AlphaZero framework to handle environment changes at test time.
Result: Significant performance boost demonstrated, even in settings with limited planning budget available.
Conclusion: The proposed modifications enable effective deployment of AlphaZero agents in changed environments, expanding the framework’s practical applicability.
Abstract: The AlphaZero framework provides a standard way of combining Monte Carlo planning with prior knowledge provided by a previously trained policy-value neural network. AlphaZero usually assumes that the environment on which the neural network was trained will not change at test time, which constrains its applicability. In this paper, we analyze the problem of deploying AlphaZero agents in potentially changed test environments and demonstrate how the combination of simple modifications to the standard framework can significantly boost performance, even in settings with a low planning budget available. The code is publicly available on GitHub.
[244] ArcMemo: Abstract Reasoning Composition with Lifelong LLM Memory
Matthew Ho, Chen Si, Zhaoxiang Feng, Fangxu Yu, Zhijian Liu, Zhiting Hu, Lianhui Qin
Main category: cs.AI
TL;DR: The paper proposes concept-level memory for LLMs that distills reusable abstractions from reasoning traces, enabling test-time continual learning without weight updates and achieving 7.5% relative gain on ARC-AGI benchmark.
Details
Motivation: Current LLMs discard valuable patterns and insights from reasoning traces once context windows reset. External memory could persist these discoveries, but existing instance-based memory approaches lack reusability and scalability.
Method: Develops concept-level memory with strategies for abstracting takeaways from solution rollouts and retrieving relevant entries for new queries. Stores reusable, modular abstractions in natural language that can be selectively integrated into prompts.
Result: Achieves 7.5% relative gain over strong no-memory baseline on ARC-AGI benchmark, with performance scaling with inference compute. Abstract concepts proved most consistent, outperforming baseline at all tested compute scales.
Conclusion: Concept-level memory enables effective test-time continual learning through reusable abstractions, supporting self-improvement as solving more problems and abstracting patterns to memory enables further solutions.
Abstract: While inference-time scaling enables LLMs to carry out increasingly long and capable reasoning traces, the patterns and insights uncovered during these traces are immediately discarded once the context window is reset for a new query. External memory is a natural way to persist these discoveries, and recent work has shown clear benefits for reasoning-intensive tasks. We see an opportunity to make such memories more broadly reusable and scalable by moving beyond instance-based memory entries (e.g. exact query/response pairs, or summaries tightly coupled with the original problem context) toward concept-level memory: reusable, modular abstractions distilled from solution traces and stored in natural language. For future queries, relevant concepts are selectively retrieved and integrated into the prompt, enabling test-time continual learning without weight updates. Our design introduces new strategies for abstracting takeaways from rollouts and retrieving entries for new queries, promoting reuse and allowing memory to expand with additional experiences. On the challenging ARC-AGI benchmark, our method yields a 7.5% relative gain over a strong no-memory baseline with performance continuing to scale with inference compute. We find abstract concepts to be the most consistent memory design, outscoring the baseline at all tested inference compute scales. Moreover, we confirm that dynamically updating memory during test-time outperforms an otherwise identical fixed memory setting with additional attempts, supporting the hypothesis that solving more problems and abstracting more patterns to memory enables further solutions in a form of self-improvement. Code available at https://github.com/matt-seb-ho/arc_memo.
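A minimal sketch of the retrieval side of such a concept-level memory: store abstractions as natural-language entries, embed them, and prepend the top matches to the prompt. The bag-of-words embedding and the example concepts are stand-ins; the paper distills entries from solution rollouts with an LLM.

```python
# A minimal sketch of concept retrieval, assuming a toy bag-of-words
# embedding in place of a learned one.
from collections import Counter
import math

memory = [
    "If a grid transformation repeats, look for the smallest tiling unit.",
    "Count objects by color before guessing a size-based rule.",
    "Symmetry completion: mirror known cells across the detected axis.",
]

def embed(text):
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, k=2):
    q = embed(query)
    return sorted(memory, key=lambda m: cosine(q, embed(m)), reverse=True)[:k]

query = "The puzzle grid looks mirrored along a vertical axis."
prompt = "Relevant concepts:\n" + "\n".join(retrieve(query)) + f"\n\nTask: {query}"
print(prompt)
```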
[245] Intelligence Primer
Karl Fezer, Andrew Sloss
Main category: cs.AI
TL;DR: This paper explores the multidisciplinary nature of intelligence across biological and artificial systems, examining its foundations and implications for future AI development.
Details
Motivation: To understand the complex, interdisciplinary nature of intelligence and explore the essential components that could shape future artificial intelligence systems, addressing the need for broader understanding across multiple disciplines.
Method: The authors conduct an exploratory journey through different aspects of intelligence, drawing from Biology, Physics, Philosophy, Cognitive Science, Neuroscience, Psychology, and Computer Science, using a primer format inspired by Douglas Adams’ science fiction.
Result: The paper provides a comprehensive framework for understanding intelligence as a multifaceted concept and highlights the necessity for engineers and scientists to expand their knowledge to include psychology, philosophy, and ethics in AI development.
Conclusion: Intelligence is not a single measurable quantity but a complex interdisciplinary subject that requires broad understanding across multiple fields for meaningful progress in artificial intelligence, with emerging AI technologies acting as catalysts for this necessary interdisciplinary approach.
Abstract: Intelligence is a fundamental part of all living things, as well as the foundation for Artificial Intelligence. In this primer we explore the ideas associated with intelligence and, by doing so, understand the implications and constraints and potentially outline the capabilities of future systems. Artificial Intelligence, in the form of Machine Learning, has already had a significant impact on our lives. As an exploration, we journey into different parts of intelligence that appear essential. We hope that people find this helpful in determining the future. Also, during the exploration, we hope to create new thought-provoking questions. Intelligence is not a single weighable quantity but a subject that spans Biology, Physics, Philosophy, Cognitive Science, Neuroscience, Psychology, and Computer Science. The historian Yuval Noah Harari pointed out that engineers and scientists in the future will have to broaden their understanding to include disciplines such as Psychology, Philosophy, and Ethics. Fiction writers have long portrayed engineers and scientists as deficient in these areas. Today, in modern society, the emergence of Artificial Intelligence and legal requirements act as forcing functions to push these broader subjects into the foreground. We start with an introduction to intelligence and move quickly to more profound thoughts and ideas. We call this a Life, the Universe, and Everything primer, after the famous science fiction book by Douglas Adams. Forty-two may be the correct answer, but what are the questions?
[246] Transferable Belief Model on Quantum Circuits
Qianli Zhou, Hao Luo, Lipeng Pan, Yong Deng, Eloi Bosse
Main category: cs.AI
TL;DR: The paper implements the transferable belief model on quantum circuits, showing belief functions are more concise and effective than Bayesian approaches in quantum computing, and proposes novel belief transfer methods leveraging quantum characteristics.
Details
Motivation: The transferable belief model provides better semantics for handling unreliable testimonies and imprecise environments compared to Bayesian approaches, but has faced computational complexity issues that made it less popular. Quantum computing offers a way to overcome these limitations.
Method: Implementation of the transferable belief model on quantum circuits, leveraging quantum computing characteristics to develop novel belief transfer approaches and reduce computational complexity.
Result: Belief functions provide a more concise and effective alternative to Bayesian approaches within quantum computing framework, with several novel belief transfer methods demonstrated.
Conclusion: Belief functions are more suitable than Bayesian approaches for handling uncertainty on quantum circuits, offering a new perspective for quantum AI information representation.
Abstract: The transferable belief model, as a semantic interpretation of Dempster-Shafer theory, enables agents to perform reasoning and decision making in imprecise and incomplete environments. The model offers distinct semantics for handling unreliable testimonies, allowing for a more reasonable and general process of belief transfer compared to the Bayesian approach. However, because both the belief masses and the structure of focal sets must be considered when updating belief functions, leading to extra computational complexity during reasoning, the transferable belief model has gradually lost favor among researchers in recent developments. In this paper, we implement the transferable belief model on quantum circuits and demonstrate that belief functions offer a more concise and effective alternative to Bayesian approaches within the quantum computing framework. Furthermore, leveraging the unique characteristics of quantum computing, we propose several novel belief transfer approaches. More broadly, this paper introduces a new perspective on basic information representation for quantum AI models, suggesting that belief functions are more suitable than the Bayesian approach for handling uncertainty on quantum circuits.
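For readers unfamiliar with the underlying machinery, here is a classical (non-quantum) sketch of the unnormalized conjunctive combination of belief masses that the transferable belief model rests on; allowing mass on the empty set is TBM's signature departure from Dempster's normalized rule. The testimonies are made up; the paper's contribution is realizing this on quantum circuits.

```python
# A minimal sketch of the unnormalized conjunctive rule over basic belief
# assignments (focal set -> mass); example masses are illustrative.
from itertools import product

def conjunctive_combine(m1: dict, m2: dict) -> dict:
    out = {}
    for (a, x), (b, y) in product(m1.items(), m2.items()):
        key = a & b                      # intersection of focal sets
        out[key] = out.get(key, 0.0) + x * y
    return out

# Two testimonies over the frame {"rain", "sun"}; each mass function sums to 1.
m1 = {frozenset({"rain"}): 0.6, frozenset({"rain", "sun"}): 0.4}
m2 = {frozenset({"sun"}): 0.5, frozenset({"rain", "sun"}): 0.5}
for focal, mass in conjunctive_combine(m1, m2).items():
    print(set(focal) or "empty set", round(mass, 2))  # conflict mass stays on the empty set
```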
[247] WASP: A Weight-Space Approach to Detecting Learned Spuriousness
Cristian Daniel Păduraru, Antonio Bărbălau, Radu Filipescu, Andrei Liviu Nicolicioiu, Elena Burceanu
Main category: cs.AI
TL;DR: WASP is a novel method that analyzes model weights instead of predictions to detect spurious correlations in machine learning models, overcoming limitations of data/error analysis approaches.
Details
Motivation: Current approaches rely solely on data or error analysis and cannot identify spurious correlations not already present in validation/training counterexamples, limiting their effectiveness.
Method: Weight-space Approach to detecting Spuriousness (WASP) analyzes foundation model weights as they drift during fine-tuning to capture spurious correlations, focusing on the decision-making mechanism rather than predictions.
Result: WASP can expose spurious correlations even when not present in training/validation data, works across multiple modalities (image and text), and uncovers previously unknown spurious correlations in ImageNet-1k classifiers.
Conclusion: Analyzing model weights provides deeper insights into spurious correlations than traditional prediction analysis, enabling more comprehensive detection of problematic model behaviors across diverse domains.
Abstract: It is of crucial importance to train machine learning models such that they clearly understand what defines each class in a given task. Though there is a body of work dedicated to identifying the spurious correlations featured by a dataset that may impact the model’s understanding of the classes, all current approaches rely solely on data or error analysis. That is, they cannot point out spurious correlations learned by the model that are not already pointed out by the counterexamples featured in the validation or training sets. We propose a method that transcends this limitation, switching the focus from analyzing a model’s predictions to analyzing the model’s weights, the mechanism behind the making of the decisions, which proves to be more insightful. Our proposed Weight-space Approach to detecting Spuriousness (WASP) relies on analyzing the weights of foundation models as they drift towards capturing various (spurious) correlations while being fine-tuned on a given dataset. We demonstrate that, different from previous works, our method (i) can expose spurious correlations featured by a dataset even when they are not exposed by training or validation counterexamples, (ii) works for multiple modalities such as image and text, and (iii) can uncover previously untapped spurious correlations learned by ImageNet-1k classifiers.
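A self-contained sketch of the weight-drift intuition: diff the fine-tuned weights against the base checkpoint, take the dominant drift directions, and test their alignment with candidate concept directions. The random matrices and concept vectors below are synthetic stand-ins, not the paper's actual procedure.

```python
# A minimal sketch of weight-drift analysis, assuming synthetic checkpoints
# and hypothetical concept directions (e.g., from text embeddings).
import numpy as np

rng = np.random.default_rng(0)
d = 64
w_base = rng.normal(size=(d, d))
w_ft = w_base + 0.1 * rng.normal(size=(d, d))       # pretend fine-tuned weights

delta = w_ft - w_base                                # drift during fine-tuning
U, S, Vt = np.linalg.svd(delta)                      # dominant drift directions
top_dirs = Vt[:5]                                    # top-5 right singular vectors

concepts = {  # hypothetical concept directions, named for illustration only
    "background=water": rng.normal(size=d),
    "object=bird": rng.normal(size=d),
}
for name, v in concepts.items():
    v = v / np.linalg.norm(v)
    align = np.max(np.abs(top_dirs @ v))             # best alignment with drift
    print(f"{name}: max alignment with drift directions = {align:.3f}")
```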
[248] Enhancing FKG.in: automating Indian food composition analysis
Saransh Kumar Gupta, Lipika Dey, Partha Pratim Das, Geeta Trilok-Kumar, Ramesh Jain
Main category: cs.AI
TL;DR: Automated workflow using knowledge graphs and LLMs to compute food composition data for Indian recipes, addressing challenges in data aggregation and multilingual analysis.
Details
Motivation: To address the challenges of representing Indian food digitally and accessing reliable food composition data, complementing existing knowledge bases with automated analysis.
Method: Uses FKG.in knowledge graph and LLMs for nutrition data aggregation, food composition analysis, and information resolution, with application-agnostic approaches.
Result: Developed an automated workflow that can provide diet-based health recommendations and detailed food composition information for numerous Indian recipes.
Conclusion: The proposed LLM-based methods for knowledge curation and information resolution are generalizable and replicable across domains beyond food analysis.
Abstract: This paper presents a novel approach to compute food composition data for Indian recipes using a knowledge graph for Indian food (FKG[.]in) and LLMs. The primary focus is to provide a broad overview of an automated food composition analysis workflow and describe its core functionalities: nutrition data aggregation, food composition analysis, and LLM-augmented information resolution. This workflow aims to complement FKG[.]in and iteratively supplement food composition data from verified knowledge bases. Additionally, this paper highlights the challenges of representing Indian food and accessing food composition data digitally. It also reviews three key sources of food composition data: the Indian Food Composition Tables, the Indian Nutrient Databank, and the Nutritionix API. Furthermore, it briefly outlines how users can interact with the workflow to obtain diet-based health recommendations and detailed food composition information for numerous recipes. We then explore the complex challenges of analyzing Indian recipe information across dimensions such as structure, multilingualism, and uncertainty as well as present our ongoing work on LLM-based solutions to address these issues. The methods proposed in this workshop paper for AI-driven knowledge curation and information resolution are application-agnostic, generalizable, and replicable for any domain.
[249] Science Across Languages: Assessing LLM Multilingual Translation of Scientific Papers
Hannah Calzi Kleidermacher, James Zou
Main category: cs.AI
TL;DR: LLM-based scientific paper translation system that preserves JATS XML formatting, achieves 95.9% accuracy via QA benchmarking, and addresses overtranslation issues with in-context learning.
Details
Motivation: Overcome language barriers in scientific research caused by English-only journal publications, enabling global accessibility for non-native-English-speaking researchers.
Method: Leverages large language models to translate scientific articles while preserving native JATS XML formatting, using a novel QA benchmarking method for evaluation and in-context learning for domain-specific adaptation.
Result: Achieved 95.9% average translation accuracy across 28 languages, with authors confirming translation accuracy but identifying overtranslation of technical terms as a common issue.
Conclusion: LLM-driven translation provides practical, automated solution for scientific paper translation, with in-context learning enabling customization to address domain-specific preferences like mitigating overtranslation of technical terminology.
Abstract: Scientific research is inherently global. However, the vast majority of academic journals are published exclusively in English, creating barriers for non-native-English-speaking researchers. In this study, we leverage large language models (LLMs) to translate published scientific articles while preserving their native JATS XML formatting, thereby developing a practical, automated approach for implementation by academic journals. Using our approach, we translate articles across multiple scientific disciplines into 28 languages. To evaluate translation accuracy, we introduce a novel question-and-answer (QA) benchmarking method, in which an LLM generates comprehension-based questions from the original text and then answers them based on the translated text. Our benchmark results show an average performance of 95.9%, indicating that the key scientific details are accurately conveyed. In a user study, we translate the scientific papers of 15 researchers into their native languages, finding that the authors consistently found the translations to accurately capture the original information in their articles. Interestingly, a third of the authors found many technical terms “overtranslated,” expressing a preference to keep terminology more familiar in English untranslated. Finally, we demonstrate how in-context learning techniques can be used to align translations with domain-specific preferences such as mitigating overtranslation, highlighting the adaptability and utility of LLM-driven scientific translation. The code and translated articles are available at https://hankleid.github.io/ProjectMundo.
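The QA benchmarking loop is easy to sketch: generate comprehension questions from the source, answer them from both source and translation, and grade agreement. The `llm` stub and the prompts are illustrative assumptions, not the paper's artifacts.

```python
# A minimal sketch of QA-based translation scoring; plug any chat-completion
# client into `llm`.
def llm(prompt: str) -> str:
    raise NotImplementedError("plug in your LLM client here")

def qa_translation_score(source: str, translation: str, n_questions: int = 5) -> float:
    questions = llm(
        f"Write {n_questions} factual questions answerable from this text:\n{source}"
    ).splitlines()
    correct = 0
    for q in questions:
        ref = llm(f"Answer from the text:\n{source}\n\nQ: {q}")
        hyp = llm(f"Answer from the text:\n{translation}\n\nQ: {q}")
        verdict = llm(f"Do these answers agree? Reply yes or no.\nA1: {ref}\nA2: {hyp}")
        correct += verdict.strip().lower().startswith("yes")
    return correct / max(len(questions), 1)
```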
[250] Computational Basis of LLM’s Decision Making in Social Simulation
Ji Ma
Main category: cs.AI
TL;DR: This paper proposes methods to probe and manipulate LLM internal representations in social decision-making contexts, specifically using the Dictator Game to study how character traits and contexts affect fairness behavior.
Details
Motivation: LLMs are increasingly used as human-like decision agents in social science, but how character assignments and contexts shape their behavior remains poorly understood. The study aims to systematically investigate and control how social concepts are encoded in transformer models.
Method: Extracts “vectors of variable variations” from LLM internal states and manipulates these vectors during inference to alter how social variables relate to decision-making in a Dictator Game paradigm.
Result: The approach can substantially alter how variables like gender affect the model’s decision-making, demonstrating effective control over social concept representations.
Conclusion: This provides a principled framework for studying and regulating social concept encoding in LLMs, with applications for AI alignment, debiasing, and designing social simulation agents, contributing to sociological theory and measurement.
Abstract: Large language models (LLMs) increasingly serve as human-like decision-making agents in social science and applied settings. These LLM-agents are typically assigned human-like characters and placed in real-life contexts. However, how these characters and contexts shape an LLM’s behavior remains underexplored. This study proposes and tests methods for probing, quantifying, and modifying an LLM’s internal representations in a Dictator Game, a classic behavioral experiment on fairness and prosocial behavior. We extract “vectors of variable variations” (e.g., “male” to “female”) from the LLM’s internal state. Manipulating these vectors during the model’s inference can substantially alter how those variables relate to the model’s decision-making. This approach offers a principled way to study and regulate how social concepts can be encoded and engineered within transformer-based models, with implications for alignment, debiasing, and designing AI agents for social simulations in both academic and commercial applications, strengthening sociological theory and measurement.
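A minimal sketch of the extract-and-manipulate idea using difference-of-means steering, a common way to realize such variation vectors; the synthetic arrays stand in for transformer hidden states, and the paper's exact extraction procedure may differ.

```python
# A minimal sketch of a "variable variation vector", assuming synthetic
# activations in place of real transformer hidden states.
import numpy as np

rng = np.random.default_rng(1)
d = 128
acts_a = rng.normal(loc=0.0, size=(200, d))     # hidden states under persona A
acts_b = rng.normal(loc=0.3, size=(200, d))     # hidden states under persona B

v = acts_b.mean(axis=0) - acts_a.mean(axis=0)   # variation vector A -> B
v /= np.linalg.norm(v)

def steer(hidden, alpha=2.0):
    """Shift a hidden state along the variation direction during inference."""
    return hidden + alpha * v

h = rng.normal(size=d)
print("projection before:", h @ v, "after:", steer(h) @ v)
```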
[251] DMN-Guided Prompting: A Framework for Controlling LLM Behavior
Shaghayegh Abedi, Amin Jalali
Main category: cs.AI
TL;DR: DMN-guided prompting framework improves LLM decision-making by breaking complex logic into structured components, outperforming chain-of-thought prompting in educational feedback generation.
Details
Motivation: LLMs show potential for automating decision logic in knowledge processes, but their effectiveness depends on prompting quality and end users struggle to modify embedded decision logic in prompts.
Method: Developed a DMN (Decision Model and Notation)-guided prompting framework that decomposes complex decision logic into manageable components and guides LLMs through structured decision pathways. Implemented in a graduate course where student assignments and DMN feedback models were used as inputs.
Result: The framework demonstrated promising results, outperforming chain-of-thought prompting. Students responded positively with high perceived usefulness based on Technology Acceptance Model survey.
Conclusion: DMN-guided prompting provides an effective structured approach for LLM decision-making, offering better performance than CoT prompting and high user acceptance in educational feedback applications.
Abstract: Large Language Models (LLMs) have shown considerable potential in automating decision logic within knowledge-intensive processes. However, their effectiveness largely depends on the strategy and quality of prompting. Since decision logic is typically embedded in prompts, it becomes challenging for end users to modify or refine it. Decision Model and Notation (DMN) offers a standardized graphical approach for defining decision logic in a structured, user-friendly manner. This paper introduces a DMN-guided prompting framework that breaks down complex decision logic into smaller, manageable components, guiding LLMs through structured decision pathways. We implemented the framework in a graduate-level course where students submitted assignments. The assignments and DMN models representing feedback instructions served as inputs to our framework. The instructor evaluated the generated feedback and labeled it for performance assessment. Our approach demonstrated promising results, outperforming chain-of-thought (CoT) prompting in our case study. Students also responded positively to the generated feedback, reporting high levels of perceived usefulness in a survey based on the Technology Acceptance Model.
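One plausible reading of the framework in code: keep the decision logic in a small, user-editable DMN-style table and issue one focused sub-query per rule, instead of burying the logic in a single long prompt. The rules and the `llm` stub are illustrative, not the course's actual feedback models.

```python
# A minimal sketch of DMN-guided prompting with a hypothetical feedback table.
def llm(prompt: str) -> str:
    raise NotImplementedError("plug in your LLM client here")

# DMN-style decision table: (criterion, question asked of the LLM, feedback)
FEEDBACK_RULES = [
    ("thesis_clear",
     "Does the submission state a clear thesis? Answer yes or no.",
     "State your main claim explicitly in the introduction."),
    ("evidence_cited",
     "Does every claim cite supporting evidence? Answer yes or no.",
     "Support each claim with a citation or data point."),
]

def dmn_guided_feedback(submission: str) -> list[str]:
    feedback = []
    for criterion, question, advice in FEEDBACK_RULES:
        answer = llm(f"{question}\n\nSubmission:\n{submission}")
        if answer.strip().lower().startswith("no"):
            feedback.append(f"[{criterion}] {advice}")
    return feedback
```

Because the table, not the prompt, carries the logic, an instructor can edit a rule without touching any prompt engineering.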
[252] Axiomatics of Restricted Choices by Linear Orders of Sets with Minimum as Fallback
Kai Sauerwald, Kenneth Skiba, Eduardo Fermé, Thomas Meyer
Main category: cs.AI
TL;DR: Linear orders on sets can construct choice functions for restricted domains where standard relation-based approaches fail, with applications in knowledge representation.
Details
Motivation: To address the challenge of constructing choice functions when the set of potential choices is restricted (not the full powerset), which occurs in practical applications where not all alternatives are available.
Method: Using linear orders on sets of alternatives rather than on individual alternatives, and incorporating fallback values as minimal elements in the linear order. The approach is analyzed through axiomatic characterization for general cases and union-closed input restrictions.
Result: The paper demonstrates that choice functions can always be constructed via linear orders on sets, even with restricted domains and fallback values, overcoming limitations of traditional relation-based approaches.
Conclusion: Linear orders on sets provide a robust framework for constructing choice functions in restricted settings, with practical applications in knowledge representation, theory change, and abstract argumentation.
Abstract: We study how linear orders can be employed to realise choice functions for which the set of potential choices is restricted, i.e., the choice is not made among the full powerset of all alternatives. In such restricted settings, constructing a choice function via a relation on the alternatives is not always possible. However, we show that one can always construct a choice function via a linear order on sets of alternatives, even when a fallback value is encoded as the minimal element in the linear order. The axiomatics of such choice functions are presented for the general case and the case of union-closed input restrictions. Restricted choice structures have applications in knowledge representation and reasoning, and here we discuss their applications for theory change and abstract argumentation.
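A minimal sketch of one plausible construction: rank sets of alternatives from best to worst, choose the best-ranked set contained in the current menu, and return the minimal element (the fallback) when nothing applies. The concrete order below is illustrative; see the paper for the exact axiomatics.

```python
# A minimal sketch of choosing via a linear order on *sets* of alternatives,
# with the fallback encoded as the order's minimal element.
FALLBACK = frozenset()

# Linear order on sets, best first; FALLBACK is last (minimal).
ORDER = [
    frozenset({"a", "b"}),
    frozenset({"a"}),
    frozenset({"c"}),
    FALLBACK,
]

def choose(menu: frozenset) -> frozenset:
    for candidate in ORDER:                  # scan from best to worst
        if candidate <= menu and candidate != FALLBACK:
            return candidate
    return FALLBACK                          # nothing available: fall back

print(choose(frozenset({"a", "c"})))   # -> frozenset({'a'})
print(choose(frozenset({"d"})))        # -> frozenset() (fallback)
```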
[253] CP-Bench: Evaluating Large Language Models for Constraint Modelling
Kostis Michailidis, Dimos Tsouros, Tias Guns
Main category: cs.AI
TL;DR: CP-Bench benchmark introduced to evaluate LLM performance in constraint programming modelling across diverse problems and frameworks, showing best results with high-level Python frameworks and improved accuracy with advanced prompting techniques.
Details
Motivation: Constraint modelling is a bottleneck in CP adoption requiring expertise, and existing evaluation datasets are limited in diversity and scope, failing to represent real-world scenarios.
Method: Created CP-Bench benchmark with diverse combinatorial problems from CP community, evaluated LLMs across three constraint modelling systems with different abstraction levels, and tested prompt-based and inference-time compute methods.
Result: Higher performance achieved with high-level Python-based framework, and advanced prompting techniques increased accuracy up to 70% on this challenging benchmark.
Conclusion: CP-Bench provides comprehensive evaluation for LLM-driven constraint modelling, demonstrating framework choice and advanced prompting significantly impact performance, enabling better automation of constraint modelling.
Abstract: Constraint Programming (CP) is widely used to solve combinatorial problems, but its core process, namely constraint modelling, requires significant expertise and is considered to be a bottleneck for wider adoption. Aiming to alleviate this bottleneck, recent studies have explored using Large Language Models (LLMs) to transform combinatorial problem descriptions into executable constraint models. However, the existing evaluation datasets for constraint modelling are often limited to small, homogeneous, or domain-specific instances, which do not capture the diversity of real-world scenarios. This work addresses this gap by introducing CP-Bench, a novel benchmark that includes a diverse set of well-known combinatorial problems sourced from the CP community, structured explicitly for evaluating LLM-driven CP modelling. With this dataset, and given the variety of constraint modelling frameworks, we compare and evaluate the modelling capabilities of LLMs for three distinct constraint modelling systems, which vary in abstraction level and underlying syntax. Notably, the results show higher performance when modelling with a high-level Python-based framework. Additionally, we systematically evaluate the use of prompt-based and inference-time compute methods across different LLMs, which further increase accuracy, reaching up to 70% on this highly challenging benchmark.
[254] DeepVIS: Bridging Natural Language and Data Visualization Through Step-wise Reasoning
Zhihao Shuai, Boyan Li, Siyu Yan, Yuyu Luo, Weikai Yang
Main category: cs.AI
TL;DR: Proposes Chain-of-Thought reasoning for NL2VIS to make visualization generation transparent and improvable, with new dataset and interactive interface.
Details
Motivation: Existing NL2VIS methods are black boxes without transparent reasoning, preventing users from understanding design rationales and refining suboptimal visualizations.
Method: Integrates CoT reasoning into NL2VIS pipeline, creates nvBench-CoT dataset with structured reasoning steps, and develops DeepVIS interactive interface for inspection and adjustment.
Result: Quantitative benchmarks, use cases, and user study show CoT framework enhances NL2VIS quality while providing insightful reasoning steps to users.
Conclusion: CoT reasoning effectively bridges the transparency gap in NL2VIS systems, enabling better visualization outcomes through explainable and adjustable reasoning processes.
Abstract: Although data visualization is powerful for revealing patterns and communicating insights, creating effective visualizations requires familiarity with authoring tools and often disrupts the analysis flow. While large language models show promise for automatically converting analysis intent into visualizations, existing methods function as black boxes without transparent reasoning processes, which prevents users from understanding design rationales and refining suboptimal outputs. To bridge this gap, we propose integrating Chain-of-Thought (CoT) reasoning into the Natural Language to Visualization (NL2VIS) pipeline. First, we design a comprehensive CoT reasoning process for NL2VIS and develop an automatic pipeline to equip existing datasets with structured reasoning steps. Second, we introduce nvBench-CoT, a specialized dataset capturing detailed step-by-step reasoning from ambiguous natural language descriptions to finalized visualizations, which enables state-of-the-art performance when used for model fine-tuning. Third, we develop DeepVIS, an interactive visual interface that tightly integrates with the CoT reasoning process, allowing users to inspect reasoning steps, identify errors, and make targeted adjustments to improve visualization outcomes. Quantitative benchmark evaluations, two use cases, and a user study collectively demonstrate that our CoT framework effectively enhances NL2VIS quality while providing insightful reasoning steps to users.
[255] Extending FKG.in: Towards a Food Claim Traceability Network
Saransh Kumar Gupta, Rizwan Gulzar Mir, Lipika Dey, Partha Pratim Das, Anirban Sen, Ramesh Jain
Main category: cs.AI
TL;DR: Proposes a Food Claim-Traceability Network (FCN) as an extension to an Indian food knowledge graph to systematically trace, verify, and contextualize diverse food claims using structured ontologies and semi-automated curation with LLMs.
Details
Motivation: Address the fragmented infrastructure for tracing and verifying food claims ranging from scientific evidence to cultural beliefs and commercial promises, which currently lack systematic validation methods.
Method: Developed FCN as an extension of FKG[.]in knowledge graph using ontology design and semi-automated knowledge curation workflow with Reddit data and Large Language Models for claim extraction and validation.
Result: Created a proof-of-concept FKG[.]in-FCN that integrates curated data, structured schemas, and provenance-aware pipelines for food claim traceability and validation.
Conclusion: The methodology provides a structured, verifiable, and explainable approach to food claim modeling that is application-agnostic and adaptable to various culinary and regulatory contexts, aiming to create more transparent food knowledge ecosystems.
Abstract: The global food landscape is rife with scientific, cultural, and commercial claims about what foods are, what they do, what they should do, or should not do. These range from rigorously studied health benefits (probiotics improve gut health) and misrepresentations (soaked almonds make one smarter) to vague promises (superfoods boost immunity) and culturally rooted beliefs (cold foods cause coughs). Despite their widespread influence, the infrastructure for tracing, verifying, and contextualizing these claims remains fragmented and underdeveloped. In this paper, we propose a Food Claim-Traceability Network (FCN) as an extension of FKG[.]in, a knowledge graph of Indian food that we have been incrementally building. We also present the ontology design and the semi-automated knowledge curation workflow that we used to develop a proof of concept of FKG[.]in-FCN using Reddit data and Large Language Models. FCN integrates curated data inputs, structured schemas, and provenance-aware pipelines for food-related claim extraction and validation. While directly linked to the Indian food knowledge graph as an application, our methodology remains application-agnostic and adaptable to other geographic, culinary, or regulatory settings. By modeling food claims and their traceability in a structured, verifiable, and explainable way, we aim to contribute to more transparent and accountable food knowledge ecosystems, supporting researchers, policymakers, and most importantly, everyday consumers in navigating a world saturated with dietary assertions.
[256] Oyster-I: Beyond Refusal – Constructive Safety Alignment for Responsible Language Models
Ranjie Duan, Jiexi Liu, Xiaojun Jia, Shiji Zhao, Ruoxi Cheng, Fengxiang Wang, Cheng Wei, Yong Xie, Chang Liu, Defeng Li, Yinpeng Dong, Yichi Zhang, Yuefeng Chen, Chongwen Wang, Xingjun Ma, Xingxing Wei, Yang Liu, Hang Su, Jun Zhu, Xinfeng Li, Yitong Sun, Jie Zhang, Jinzhao Hu, Sha Xu, Yitong Yang, Jialing Tao, Hui Xue
Main category: cs.AI
TL;DR: CSA paradigm shifts from refusal-based safety to constructive guidance, protecting against misuse while actively helping vulnerable users through game-theoretic anticipation and interpretable reasoning.
Details
Motivation: Current LLM safety focuses on adversarial risks but neglects non-malicious users in psychological distress, where simple refusals can worsen outcomes by driving users to unsafe platforms.
Method: Constructive Safety Alignment (CSA) combines game-theoretic anticipation of user reactions, fine-grained risk boundary discovery, and interpretable reasoning control to guide users toward safe outcomes.
Result: Oyster-I (Oy1) achieves SOTA safety among open models, strong constructive engagement close to GPT-5, and unmatched robustness on jailbreak datasets nearing GPT-o1 levels while maintaining high general capabilities.
Conclusion: CSA redefines model-user relationships from refusal-first to guidance-first safety, creating systems that are not just safe but meaningfully helpful for vulnerable users.
Abstract: Large language models (LLMs) typically deploy safety mechanisms to prevent harmful content generation. Most current approaches focus narrowly on risks posed by malicious actors, often framing risks as adversarial events and relying on defensive refusals. However, in real-world settings, risks also come from non-malicious users seeking help while under psychological distress (e.g., self-harm intentions). In such cases, the model’s response can strongly influence the user’s next actions. Simple refusals may lead them to repeat, escalate, or move to unsafe platforms, creating worse outcomes. We introduce Constructive Safety Alignment (CSA), a human-centric paradigm that protects against malicious misuse while actively guiding vulnerable users toward safe and helpful results. Implemented in Oyster-I (Oy1), CSA combines game-theoretic anticipation of user reactions, fine-grained risk boundary discovery, and interpretable reasoning control, turning safety into a trust-building process. Oy1 achieves state-of-the-art safety among open models while retaining high general capabilities. On our Constructive Benchmark, it shows strong constructive engagement, close to GPT-5, and unmatched robustness on the Strata-Sword jailbreak dataset, nearing GPT-o1 levels. By shifting from refusal-first to guidance-first safety, CSA redefines the model-user relationship, aiming for systems that are not just safe, but meaningfully helpful. We release Oy1, code, and the benchmark to support responsible, user-centered AI.
[257] EigenBench: A Comparative Behavioral Measure of Value Alignment
Jonathn Chang, Leonhard Piff, Suvadip Sana, Jasmine X. Li, Lionel Levine
Main category: cs.AI
TL;DR: EigenBench is a black-box benchmarking method that quantifies language models’ value alignment using peer evaluation and EigenTrust aggregation, without requiring ground truth labels.
Details
Motivation: Addressing the lack of quantitative metrics for AI value alignment by creating a method to comparatively benchmark language models' values against a given constitution.
Method: Uses an ensemble of models to judge each other’s outputs across scenarios, aggregates judgments with the EigenTrust algorithm to produce alignment scores, and employs prompted personas to test sensitivity.
Result: Most variance in scores is explained by the prompt rather than the model itself, but a small residual quantifies the model’s inherent disposition.
Conclusion: EigenBench provides a practical framework for quantifying value alignment in language models through peer evaluation, enabling comparative benchmarking without ground truth data.
Abstract: Aligning AI with human values is a pressing unsolved problem. To address the lack of quantitative metrics for value alignment, we propose EigenBench: a black-box method for comparatively benchmarking language models’ values. Given an ensemble of models, a constitution describing a value system, and a dataset of scenarios, our method returns a vector of scores quantifying each model’s alignment to the given constitution. To produce these scores, each model judges the outputs of other models across many scenarios, and these judgments are aggregated with EigenTrust (Kamvar et al, 2003), yielding scores that reflect a weighted-average judgment of the whole ensemble. EigenBench uses no ground truth labels, as it is designed to quantify traits for which reasonable judges may disagree on the correct label. Using prompted personas, we test whether EigenBench scores are more sensitive to the model or the prompt: we find that most of the variance is explained by the prompt, but a small residual quantifies the disposition of the model itself.
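The aggregation step is a small power iteration: judgments form a matrix, and the principal eigenvector of its row-normalized form gives the ensemble-weighted scores. A sketch with a made-up judgment matrix:

```python
# A minimal sketch of EigenTrust-style aggregation; the judgment numbers
# below are invented for illustration.
import numpy as np

# J[i, j] = average score judge i assigns to model j's outputs
J = np.array([
    [0.0, 0.8, 0.4],
    [0.7, 0.0, 0.5],
    [0.6, 0.9, 0.0],
])

C = J / J.sum(axis=1, keepdims=True)     # row-normalize each judge's budget

t = np.full(len(J), 1.0 / len(J))        # uniform initial trust
for _ in range(100):                      # power iteration: t <- C^T t
    t = C.T @ t
    t /= t.sum()

print("alignment scores:", np.round(t, 3))
```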
[258] Plan Verification for LLM-Based Embodied Task Completion Agents
Ananth Hariharan, Vardhan Dongre, Dilek Hakkani-Tür, Gokhan Tur
Main category: cs.AI
TL;DR: Iterative LLM framework for refining noisy embodied AI plans using Judge-LLM critique and Planner-LLM revision to improve spatial coherence and action quality.
Details
Motivation: LLM-based task plans and human demonstrations in embodied AI often contain noise, unnecessary actions, redundant navigation, and logical errors that degrade policy performance.
Method: Proposes iterative verification with Judge LLM critiquing action sequences and Planner LLM applying revisions, using natural language prompting for broad error generalization across irrelevant actions, contradictions, and missing steps.
Result: Achieves up to 90% recall and 100% precision on the TEACh dataset across four state-of-the-art LLMs, with 96.5% of sequences converging in ≤3 iterations while improving temporal efficiency and spatial organization.
Conclusion: Establishes plan verification as reliable LLM capability for spatial planning, providing scalable path to higher-quality training data for imitation learning while preserving human error-recovery patterns.
Abstract: Large language model (LLM) based task plans and corresponding human demonstrations for embodied AI may be noisy, with unnecessary actions, redundant navigation, and logical errors that reduce policy quality. We propose an iterative verification framework in which a Judge LLM critiques action sequences and a Planner LLM applies the revisions, yielding progressively cleaner and more spatially coherent trajectories. Unlike rule-based approaches, our method relies on natural language prompting, enabling broad generalization across error types including irrelevant actions, contradictions, and missing steps. On a set of manually annotated actions from the TEACh embodied AI dataset, our framework achieves up to 90% recall and 100% precision across four state-of-the-art LLMs (GPT o4-mini, DeepSeek-R1, Gemini 2.5, LLaMA 4 Scout). The refinement loop converges quickly, with 96.5% of sequences requiring at most three iterations, while improving both temporal efficiency and spatial action organization. Crucially, the method preserves human error-recovery patterns rather than collapsing them, supporting future work on robust corrective behavior. By establishing plan verification as a reliable LLM capability for spatial planning and action refinement, we provide a scalable path to higher-quality training data for imitation learning in embodied AI.
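A minimal sketch of the critique-and-revise loop with an iteration cap; the prompts and the `llm` stub are illustrative assumptions, not the paper's exact prompting.

```python
# A minimal Judge/Planner refinement loop; plug any LLM client into `llm`.
def llm(prompt: str) -> str:
    raise NotImplementedError("plug in your LLM client here")

def refine_plan(actions: list[str], max_iters: int = 3) -> list[str]:
    for _ in range(max_iters):
        critique = llm(                   # Judge: flag noise and logic errors
            "Critique this action sequence for redundant navigation, "
            "contradictions, and missing steps. Reply OK if clean.\n"
            + "\n".join(actions)
        )
        if critique.strip().upper().startswith("OK"):
            break
        revised = llm(                    # Planner: apply the revisions
            "Revise the plan to address the critique; output one action per line.\n"
            "Plan:\n" + "\n".join(actions) + "\nCritique:\n" + critique
        )
        actions = [line for line in revised.splitlines() if line.strip()]
    return actions
```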
cs.SD
[259] SwinSRGAN: Swin Transformer-based Generative Adversarial Network for High-Fidelity Speech Super-Resolution
Jiajun Yuan, Xiaochen Wang, Yuhang Xiao, Yulin Wu, Chenhao Hu, Xueyang Lv
Main category: cs.SD
TL;DR: SwinSRGAN is an end-to-end speech super-resolution framework using Swin Transformer-based U-Net with hybrid adversarial training on MDCT magnitudes, achieving real-time 48kHz upsampling with improved quality and cross-domain generalization.
Details
Motivation: Existing speech SR systems suffer from representation mismatch in two-stage pipelines, CNN over-smoothing, and computational inefficiency of diffusion/flow models with limited robustness across domains and sampling rates.
Method: Uses Swin Transformer-based U-Net on MDCT magnitudes with hybrid adversarial scheme combining time-domain MPD/MSD discriminators and multi-band MDCT discriminator. Includes sparse-aware regularizer on arcsinh-compressed MDCT to preserve transients.
Result: Reduces objective error and improves ABX preference scores on benchmarks. Outperforms NVSR and mdctGAN in zero-shot tests on HiFi-TTS without fine-tuning, demonstrating strong generalization.
Conclusion: SwinSRGAN provides an effective end-to-end solution for speech super-resolution with real-time performance, superior quality, and excellent cross-dataset generalization capabilities.
Abstract: Speech super-resolution (SR) reconstructs high-frequency content from low-resolution speech signals. Existing systems often suffer from representation mismatch in two-stage mel-vocoder pipelines and from over-smoothing of hallucinated high-band content by CNN-only generators. Diffusion and flow models are computationally expensive, and their robustness across domains and sampling rates remains limited. We propose SwinSRGAN, an end-to-end framework operating on Modified Discrete Cosine Transform (MDCT) magnitudes. It is a Swin Transformer-based U-Net that captures long-range spectro-temporal dependencies, with a hybrid adversarial scheme that combines time-domain MPD/MSD discriminators with a multi-band MDCT discriminator specialized for the high-frequency band. We employ a sparse-aware regularizer on arcsinh-compressed MDCT to better preserve transient components. The system upsamples inputs at various sampling rates to 48 kHz in a single pass and operates in real time. On standard benchmarks, SwinSRGAN reduces objective error and improves ABX preference scores. In zero-shot tests on HiFi-TTS without fine-tuning, it outperforms NVSR and mdctGAN, demonstrating strong generalization across datasets.
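One plausible form of the sparse-aware regularizer, sketched in PyTorch: measure error in the arcsinh-compressed MDCT domain, where compression flattens large coefficients so small transient components still influence the loss. Shapes and the weighting are assumptions, not the paper's settings.

```python
# A minimal sketch of an L1 penalty in the arcsinh-compressed MDCT domain,
# with fake tensors standing in for predicted/target MDCT magnitudes.
import torch

def arcsinh_mdct_l1(pred_mag: torch.Tensor, target_mag: torch.Tensor,
                    weight: float = 1.0) -> torch.Tensor:
    # Compression tames large coefficients so transients are not drowned out.
    return weight * (torch.asinh(pred_mag) - torch.asinh(target_mag)).abs().mean()

pred = torch.rand(8, 512, 256, requires_grad=True)   # (batch, frames, MDCT bins)
target = torch.rand(8, 512, 256)
arcsinh_mdct_l1(pred, target).backward()
print(pred.grad.shape)
```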
[260] WenetSpeech-Yue: A Large-scale Cantonese Speech Corpus with Multi-dimensional Annotation
Longhao Li, Zhao Guo, Hongjie Chen, Yuhang Dai, Ziyu Zhang, Hongfei Xue, Tianlun Zuo, Chengyou Wang, Shuiyuan Wang, Jie Li, Xin Xu, Hui Bu, Binbin Zhang, Ruibin Yuan, Ziya Zhou, Wei Xue, Lei Xie
Main category: cs.SD
TL;DR: WenetSpeech-Pipe pipeline creates WenetSpeech-Yue, the first large-scale Cantonese speech corpus with 21,800 hours of multi-dimensional annotations, enabling competitive ASR and TTS performance against SOTA systems.
Details
Motivation: Cantonese has 84.9 million native speakers but limited annotated resources, hindering progress in ASR and TTS performance for this major Chinese dialect.
Method: Developed WenetSpeech-Pipe pipeline with six modules: Audio Collection, Speaker Attributes Annotation, Speech Quality Annotation, Automatic Speech Recognition, Text Postprocessing and Recognizer Output Voting for multi-dimensional annotation.
Result: Created WenetSpeech-Yue corpus (21,800 hours across 10 domains) and WSYue-eval benchmark. Models trained on this dataset achieve competitive results against SOTA Cantonese ASR and TTS systems, including commercial and LLM-based models.
Conclusion: The pipeline and released dataset successfully address the resource scarcity for Cantonese speech processing, demonstrating value through competitive performance against existing systems.
Abstract: The development of speech understanding and generation has been significantly accelerated by the availability of large-scale, high-quality speech datasets. Among these, ASR and TTS are regarded as the most established and fundamental tasks. However, for Cantonese (Yue Chinese), spoken by approximately 84.9 million native speakers worldwide, limited annotated resources have hindered progress and resulted in suboptimal ASR and TTS performance. To address this challenge, we propose WenetSpeech-Pipe, an integrated pipeline for building large-scale speech corpora with multi-dimensional annotation tailored for speech understanding and generation. It comprises six modules: Audio Collection, Speaker Attributes Annotation, Speech Quality Annotation, Automatic Speech Recognition, Text Postprocessing and Recognizer Output Voting, enabling rich and high-quality annotations. Based on this pipeline, we release WenetSpeech-Yue, the first large-scale Cantonese speech corpus with multi-dimensional annotation for ASR and TTS, covering 21,800 hours across 10 domains with annotations including ASR transcription, text confidence, speaker identity, age, gender, and speech quality scores, among others. We also release WSYue-eval, a comprehensive Cantonese benchmark with two components: WSYue-ASR-eval, a manually annotated set for evaluating ASR on short and long utterances, code-switching, and diverse acoustic conditions, and WSYue-TTS-eval, with base and coverage subsets for standard and generalization testing. Experimental results show that models trained on WenetSpeech-Yue achieve competitive results against state-of-the-art (SOTA) Cantonese ASR and TTS systems, including commercial and LLM-based models, highlighting the value of our dataset and pipeline.
[261] Open-Source Full-Duplex Conversational Datasets for Natural and Interactive Speech Synthesis
Zhitong Zhou, Qingqing Zhang, Lei Luo, Jiechen Liu, Ruohua Zhou
Main category: cs.SD
TL;DR: Two open-source conversational speech datasets (Chinese and English) with 15 hours of natural conversations featuring realistic interaction patterns like overlaps and backchannels, used to improve TTS naturalness through fine-tuning.
Details
Motivation: Full-duplex spontaneous conversational data is essential for enhancing naturalness and interactivity in conversational TTS systems, but such realistic datasets are limited.
Method: Collected 15 hours of natural conversations in isolated rooms with separate high-quality audio tracks, covering diverse daily topics with realistic interaction patterns. Introduced data collection, transcription, and annotation methods, then fine-tuned baseline TTS models with the datasets.
Result: The fine-tuned TTS model achieved higher subjective and objective evaluation metrics compared to the baseline, indicating improved naturalness and conversational realism in synthetic speech.
Conclusion: The datasets successfully enhance TTS naturalness and all data, annotations, and code are made available to facilitate further research in conversational speech synthesis.
Abstract: Full-duplex, spontaneous conversational data are essential for enhancing the naturalness and interactivity of synthesized speech in conversational TTS systems. We present two open-source dual-track conversational speech datasets, one in Chinese and one in English, designed to enhance the naturalness of synthesized speech by providing more realistic conversational data. The two datasets contain a total of 15 hours of natural, spontaneous conversations recorded in isolated rooms, which produces separate high-quality audio tracks for each speaker. The conversations cover diverse daily topics and domains, capturing realistic interaction patterns including frequent overlaps, backchannel responses, laughter, and other non-verbal vocalizations. We introduce the data collection procedure, transcription and annotation methods. We demonstrate the utility of these corpora by fine-tuning a baseline TTS model with the proposed datasets. The fine-tuned TTS model achieves higher subjective and objective evaluation metrics compared to the baseline, indicating improved naturalness and conversational realism in synthetic speech. All data, annotations, and supporting code for fine-tuning and evaluation are made available to facilitate further research in conversational speech synthesis.
[262] Enhancing Self-Supervised Speaker Verification Using Similarity-Connected Graphs and GCN
Zhaorui Sun, Yihao Chen, Jialong Wang, Minqiang Xu, Lei Fang, Sian Fang, Lin Liu
Main category: cs.SD
TL;DR: Improved self-supervised speaker verification using GCN-based clustering to reduce noisy pseudo-labels in DINO framework
Details
Motivation: Address noisy pseudo-labels from clustering in self-supervised speaker verification (DINO method) that limit recognition performance due to scarcity of labeled data.
Method: Proposes clustering framework using similarity connection graphs and Graph Convolutional Networks to model structured data and node relationships for optimized clustering.
Result: Experimental results show significant performance improvement in speaker verification system with enhanced robustness
Conclusion: Provides new approach for self-supervised speaker verification by improving pseudo-label accuracy through GCN-based clustering optimization
Abstract: With the continuous development of speech recognition technology, speaker verification (SV) has become an important method for identity authentication. Traditional SV methods rely on handcrafted feature extraction, while deep learning has significantly improved system performance. However, the scarcity of labeled data still limits the widespread application of deep learning in SV. Self-supervised learning, by mining latent information in large unlabeled datasets, enhances model generalization and is a key technology to address this issue. DINO is an efficient self-supervised learning method that generates pseudo-labels from unlabeled speech data through clustering, supporting subsequent training. However, clustering may produce noisy pseudo-labels, which can reduce overall recognition performance. To address this issue, this paper proposes an improved clustering framework based on similarity connection graphs and Graph Convolutional Networks. By leveraging GCNs’ ability to model structured data and incorporating relational information between nodes in the similarity connection graph, the clustering process is optimized, improving pseudo-label accuracy and enhancing the robustness and performance of the self-supervised speaker verification system. Experimental results show that this method significantly improves system performance and provides a new approach for self-supervised speaker verification. Index Terms: Speaker Verification, Self-Supervised Learning, DINO, Clustering Algorithm, Graph Convolutional Network, Similarity Connection Graph
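A numpy sketch of the graph-smoothing idea: build a kNN similarity graph over speaker embeddings, propagate features through one normalized-adjacency (GCN-style) layer, and cluster the smoothed embeddings for cleaner pseudo-labels. All data and weights below are synthetic; the paper trains the GCN rather than using random weights.

```python
# A minimal sketch of similarity-graph construction plus one GCN-style
# propagation step over synthetic speaker embeddings.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 32))                       # speaker embeddings
X /= np.linalg.norm(X, axis=1, keepdims=True)

S = X @ X.T                                          # cosine similarities
A = np.zeros_like(S)
k = 5
for i, row in enumerate(S):
    for j in np.argsort(row)[-(k + 1):]:             # kNN edges (incl. self-loop)
        A[i, j] = A[j, i] = 1.0

D_inv_sqrt = np.diag(1.0 / np.sqrt(A.sum(axis=1)))
A_hat = D_inv_sqrt @ A @ D_inv_sqrt                  # symmetric normalized adjacency

W = rng.normal(size=(32, 32)) * 0.1                  # untrained layer weights
H = np.maximum(A_hat @ X @ W, 0)                     # one GCN layer with ReLU
# H now feeds any clustering step (e.g., k-means) to produce pseudo-labels.
```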
[263] Wav2DF-TSL: Two-stage Learning with Efficient Pre-training and Hierarchical Experts Fusion for Robust Audio Deepfake Detection
Yunqi Hao, Yihao Chen, Minqiang Xu, Jianbo Zhan, Liang He, Lei Fang, Sian Fang, Lin Liu
Main category: cs.SD
TL;DR: Wav2DF-TSL: A two-stage self-supervised learning approach with adapters for spoofed speech pre-training and hierarchical expert fusion for robust audio deepfake detection, achieving 27.5% EER improvement on cross-domain datasets.
Details
Motivation: Existing SSL models rely on large-scale real speech pre-training but lack spoofed sample learning, making them susceptible to domain bias during fine-tuning for audio deepfake detection tasks.
Method: Two-stage strategy: 1) Pre-training with adapters on 3000 hours of unlabelled spoofed speech to learn artifacts efficiently, 2) Fine-tuning with hierarchical adaptive mixture of experts (HA-MoE) to dynamically fuse multi-level spoofing cues through multi-expert collaboration.
Result: Significantly outperforms baseline on all four benchmark datasets, with 27.5% relative improvement in EER on cross-domain In-the-wild dataset, surpassing state-of-the-art systems.
Conclusion: The proposed Wav2DF-TSL framework effectively addresses domain bias in audio deepfake detection by incorporating spoofed speech pre-training and hierarchical expert fusion, demonstrating superior cross-domain generalization capabilities.
Abstract: In recent years, self-supervised learning (SSL) models have made significant progress in audio deepfake detection (ADD) tasks. However, existing SSL models mainly rely on large-scale real speech for pre-training and lack the learning of spoofed samples, which leads to susceptibility to domain bias during the fine-tuning process of the ADD task. To this end, we propose a two-stage learning strategy (Wav2DF-TSL) based on pre-training and hierarchical expert fusion for robust audio deepfake detection. In the pre-training stage, we use adapters to efficiently learn artifacts from 3000 hours of unlabelled spoofed speech, improving the adaptability of front-end features while mitigating catastrophic forgetting. In the fine-tuning stage, we propose the hierarchical adaptive mixture of experts (HA-MoE) method to dynamically fuse multi-level spoofing cues through multi-expert collaboration with gated routing. Experimental results show that the proposed method significantly outperforms the baseline system on all four benchmark datasets, especially on the cross-domain In-the-wild dataset, achieving a 27.5% relative improvement in equal error rate (EER), outperforming the existing state-of-the-art systems. Index Terms: audio deepfake detection, self-supervised learning, parameter-efficient fine-tuning, mixture of experts
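To illustrate the gated-routing idea behind the fusion stage, here is a generic gated mixture-of-experts sketch in PyTorch. It is a stand-in for HA-MoE under stated assumptions (mean-pooled router input, linear experts, per-layer-group features), not the paper's architecture.

```python
# A minimal gated-MoE fusion sketch over multi-level features.
import torch
import torch.nn as nn

class GatedMoE(nn.Module):
    """Gated fusion over expert outputs; an illustrative HA-MoE stand-in."""
    def __init__(self, dim: int, n_experts: int = 4):
        super().__init__()
        self.experts = nn.ModuleList(nn.Linear(dim, dim) for _ in range(n_experts))
        self.gate = nn.Linear(dim, n_experts)

    def forward(self, levels: list[torch.Tensor]) -> torch.Tensor:
        # levels: one feature tensor per SSL layer group, each (batch, dim)
        x = torch.stack(levels, dim=1)                        # (B, L, D)
        pooled = x.mean(dim=1)                                # router input
        w = self.gate(pooled).softmax(-1)                     # routing weights (B, E)
        outs = torch.stack([e(pooled) for e in self.experts], dim=1)  # (B, E, D)
        return (w.unsqueeze(-1) * outs).sum(dim=1)            # weighted fusion

moe = GatedMoE(dim=64)
feats = [torch.randn(2, 64) for _ in range(3)]                # three SSL levels
print(moe(feats).shape)                                       # torch.Size([2, 64])
```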
[264] PianoBind: A Multimodal Joint Embedding Model for Pop-piano Music
Hayeon Bang, Eunjin Choi, Seungheon Doh, Juhan Nam
Main category: cs.SD
TL;DR: PianoBind is a piano-specific multimodal joint embedding model that captures fine-grained semantic distinctions in solo piano music through audio, symbolic, and textual modalities, outperforming general-purpose music models.
Details
Motivation: Current general-purpose music representation models struggle with subtle semantic distinctions in homogeneous solo piano music, and existing piano-specific models are unimodal, failing to capture the inherently multimodal nature of piano music.
Method: Proposed PianoBind, a piano-specific multimodal joint embedding model that systematically investigates strategies for multi-source training and modality utilization within a joint embedding framework optimized for small-scale homogeneous piano datasets.
Result: PianoBind learns multimodal representations that effectively capture subtle nuances of piano music, achieving superior text-to-music retrieval performance on both in-domain and out-of-domain piano datasets compared to general-purpose music joint embedding models.
Conclusion: The model successfully addresses limitations of existing approaches and provides reusable insights for multimodal representation learning with homogeneous datasets beyond piano music.
Abstract: Solo piano music, despite being a single-instrument medium, possesses significant expressive capabilities, conveying rich semantic information across genres, moods, and styles. However, current general-purpose music representation models, predominantly trained on large-scale datasets, often struggle to capture subtle semantic distinctions within homogeneous solo piano music. Furthermore, existing piano-specific representation models are typically unimodal, failing to capture the inherently multimodal nature of piano music, expressed through audio, symbolic, and textual modalities. To address these limitations, we propose PianoBind, a piano-specific multimodal joint embedding model. We systematically investigate strategies for multi-source training and modality utilization within a joint embedding framework optimized for capturing fine-grained semantic distinctions in (1) small-scale and (2) homogeneous piano datasets. Our experimental results demonstrate that PianoBind learns multimodal representations that effectively capture subtle nuances of piano music, achieving superior text-to-music retrieval performance on in-domain and out-of-domain piano datasets compared to general-purpose music joint embedding models. Moreover, our design choices offer reusable insights for multimodal representation learning with homogeneous datasets beyond piano music.
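For context, joint embedding models of this sort are typically trained with a symmetric contrastive (InfoNCE) objective that pulls paired embeddings from two modalities together. The sketch below shows that objective; dimensions, temperature, and the random features are illustrative, and the paper binds three modalities rather than the two shown here.

```python
# A minimal CLIP-style symmetric InfoNCE sketch over two modalities.
import torch
import torch.nn.functional as F

def infonce(a: torch.Tensor, b: torch.Tensor, temp: float = 0.07) -> torch.Tensor:
    a, b = F.normalize(a, dim=-1), F.normalize(b, dim=-1)
    logits = a @ b.t() / temp                 # pairwise similarities
    labels = torch.arange(len(a))             # matched pairs lie on the diagonal
    return (F.cross_entropy(logits, labels) +
            F.cross_entropy(logits.t(), labels)) / 2

audio = torch.randn(16, 512)   # e.g., audio-encoder outputs
text = torch.randn(16, 512)    # e.g., text-encoder outputs
print(infonce(audio, text))
```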
[265] AUDETER: A Large-scale Dataset for Deepfake Audio Detection in Open Worlds
Qizhou Wang, Hanxun Huang, Guansong Pang, Sarah Erfani, Christopher Leckie
Main category: cs.SD
TL;DR: AUDETER is a large-scale deepfake audio dataset with 4,500+ hours of synthetic audio from 11 TTS models and 10 vocoders, addressing domain shift issues in deepfake detection and enabling 44.1-51.6% error reduction.
Details
Motivation: Current deepfake detection methods struggle with real-world reliability due to domain shift between training and test samples, and existing datasets lack diverse, up-to-date audio samples from both real and deepfake categories.
Method: Created AUDETER dataset with over 4,500 hours of synthetic audio from 11 recent TTS models and 10 vocoders, totaling 3 million audio clips, the largest deepfake audio dataset by scale.
Result: SOTA methods trained on existing datasets fail to generalize to novel deepfake samples and have high false positive rates. Methods trained on AUDETER achieve highly generalized detection performance with 44.1-51.6% error reduction, reaching only 4.17% error rate on cross-domain samples.
Conclusion: AUDETER addresses the critical gap in deepfake audio detection by providing comprehensive, diverse data that enables training of generalist detectors capable of handling real-world domain shift challenges.
Abstract: Speech generation systems can produce remarkably realistic vocalisations that are often indistinguishable from human speech, posing significant authenticity challenges. Although numerous deepfake detection methods have been developed, their effectiveness in real-world environments remains unreliable due to the domain shift between training and test samples arising from diverse human speech and fast-evolving speech synthesis systems. This is not adequately addressed by current datasets, which lack real-world application challenges with diverse and up-to-date audio in both real and deepfake categories. To fill this gap, we introduce AUDETER (AUdio DEepfake TEst Range), a large-scale, highly diverse deepfake audio dataset for comprehensive evaluation and robust development of generalised models for deepfake audio detection. It consists of over 4,500 hours of synthetic audio generated by 11 recent TTS models and 10 vocoders with a broad range of TTS/vocoder patterns, totalling 3 million audio clips, making it the largest deepfake audio dataset by scale. Through extensive experiments with AUDETER, we reveal that i) state-of-the-art (SOTA) methods trained on existing datasets struggle to generalise to novel deepfake audio samples and suffer from high false positive rates on unseen human voices, underscoring the need for a comprehensive dataset; and ii) these methods trained on AUDETER achieve highly generalised detection performance and significantly reduce the detection error rate by 44.1% to 51.6%, achieving an error rate of only 4.17% on diverse cross-domain samples in the popular In-the-Wild dataset, paving the way for training generalist deepfake audio detectors. AUDETER is available on GitHub.
[266] Denoising GER: A Noise-Robust Generative Error Correction with LLM for Speech Recognition
Yanyan Liu, Minqiang Xu, Yihao Chen, Liang He, Lei Fang, Sian Fang, Lin Liu
Main category: cs.SD
TL;DR: Proposes Denoising GER, a noise-robust multi-modal framework for ASR error correction that uses noise-adaptive acoustic encoding, heterogeneous feature fusion, and reinforcement learning to improve performance in noisy environments.
Details
Motivation: Large language models struggle with poor adaptability and low information utilization in complex noisy environments for ASR error correction, limiting their effectiveness.Method: Uses noise-adaptive acoustic encoder, heterogeneous feature compensation dynamic fusion (HFCDF) mechanism, and reinforcement learning training strategies to enhance multi-modal information integration and predictive capabilities.
Result: Significantly improves accuracy and robustness in noisy environments and demonstrates good generalization abilities in unseen noise scenarios.
Conclusion: The Denoising GER framework effectively addresses noise-related challenges in ASR post-processing by enhancing model adaptability and multi-modal information utilization.
Abstract: In recent years, large language models (LLMs) have made significant progress in the task of generative error correction (GER) for automatic speech recognition (ASR) post-processing. However, in complex noisy environments, they still face challenges such as poor adaptability and low information utilization, resulting in limited effectiveness of GER. To address these issues, this paper proposes a noise-robust multi-modal GER framework (Denoising GER). The framework enhances the model’s adaptability to different noisy scenarios through a noise-adaptive acoustic encoder and optimizes the integration of multi-modal information via a heterogeneous feature compensation dynamic fusion (HFCDF) mechanism, improving the LLM’s utilization of multi-modal information. Additionally, reinforcement learning (RL) training strategies are introduced to enhance the model’s predictive capabilities. Experimental results demonstrate that Denoising GER significantly improves accuracy and robustness in noisy environments and exhibits good generalization abilities in unseen noise scenarios.
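The HFCDF mechanism is described only at a high level in the abstract. For readers who want a concrete starting point, below is a minimal PyTorch sketch of one plausible gated dynamic-fusion module for heterogeneous acoustic and text features; the class name, layer sizes, and gating rule are our assumptions, not the paper's specification.

```python
import torch
import torch.nn as nn

class GatedFeatureFusion(nn.Module):
    """Minimal sketch of dynamic fusion for heterogeneous features.

    A generic gated-fusion stand-in for the paper's HFCDF: project both
    modalities to a shared width, then let a learned gate decide, per
    dimension, how much acoustic evidence to blend into the text stream.
    """

    def __init__(self, d_acoustic: int, d_text: int, d_model: int):
        super().__init__()
        self.proj_a = nn.Linear(d_acoustic, d_model)
        self.proj_t = nn.Linear(d_text, d_model)
        self.gate = nn.Sequential(nn.Linear(2 * d_model, d_model), nn.Sigmoid())

    def forward(self, acoustic: torch.Tensor, text: torch.Tensor) -> torch.Tensor:
        a, t = self.proj_a(acoustic), self.proj_t(text)
        g = self.gate(torch.cat([a, t], dim=-1))  # per-dimension mixing weights
        return g * a + (1.0 - g) * t

fusion = GatedFeatureFusion(d_acoustic=512, d_text=768, d_model=256)
out = fusion(torch.randn(4, 100, 512), torch.randn(4, 100, 768))
print(out.shape)  # torch.Size([4, 100, 256])
```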
[267] Contextualized Token Discrimination for Speech Search Query Correction
Junyu Lu, Di Jiang, Mengze Hong, Victor Junqiu Wei, Qintian Guo, Zhiyang Su
Main category: cs.SD
TL;DR: Proposes Contextualized Token Discrimination (CTD) method for speech query correction using BERT contextual representations and composition layers to fix ASR transcription errors.
Details
Motivation: Speech search through ASR systems is becoming more popular, but ASR transcriptions often contain errors that need correction to help users express their intentions clearly in search queries.Method: Uses BERT to generate token-level contextualized representations, constructs a composition layer to enhance semantic information, and corrects incorrect tokens by comparing original and contextualized representations.
Result: Extensive experiments show superior performance across all metrics, and a new benchmark dataset with erroneous ASR transcriptions is presented for comprehensive evaluation.
Conclusion: CTD method effectively corrects speech query errors from ASR systems, demonstrating strong performance and providing a valuable benchmark for audio query correction research.
Abstract: Query spelling correction is an important function of modern search engines since it effectively helps users express their intentions clearly. With the growing popularity of speech search driven by Automated Speech Recognition (ASR) systems, this paper introduces a novel method named Contextualized Token Discrimination (CTD) to conduct effective speech query correction. In CTD, we first employ BERT to generate token-level contextualized representations and then construct a composition layer to enhance semantic information. Finally, we produce the correct query according to the aggregated token representation, correcting the incorrect tokens by comparing the original token representations and the contextualized representations. Extensive experiments demonstrate the superior performance of our proposed method across all metrics, and we further present a new benchmark dataset with erroneous ASR transcriptions to offer comprehensive evaluations for audio query correction.
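The core discrimination step, comparing each token's original representation with its contextualized one, can be sketched compactly. The snippet below uses `bert-base-uncased`, cosine similarity, and a one-standard-deviation threshold as illustrative choices; the paper's composition layer and decision rule are not reproduced here.

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Sketch of the token-discrimination idea: contrast each token's static
# embedding with its contextualized representation and flag the tokens whose
# context disagrees with them most.
tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

query = "play the sea shells song by the beetles"  # hypothetical ASR output
inputs = tok(query, return_tensors="pt")
with torch.no_grad():
    contextual = model(**inputs).last_hidden_state[0]               # (T, H)
static = model.embeddings.word_embeddings(inputs["input_ids"])[0]   # (T, H)

sim = torch.cosine_similarity(contextual, static, dim=-1)
threshold = sim.mean() - sim.std()
for t, s in zip(tok.convert_ids_to_tokens(inputs["input_ids"][0].tolist()), sim):
    flag = "  <-- candidate error" if s < threshold else ""
    print(f"{t:12s} {s.item():.3f}{flag}")
```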
[268] Beyond-Voice: Towards Continuous 3D Hand Pose Tracking on Commercial Home Assistant Devices
Yin Li, Rohan Reddy, Cheng Zhang, Rajalakshmi Nandakumar
Main category: cs.SD
TL;DR: Beyond-Voice is an acoustic sensing system that enables commodity home assistants to track and reconstruct 3D hand poses using existing microphones and speakers, eliminating the need for cameras or additional hardware.
Details
Motivation: Current home assistants rely heavily on voice interfaces which have accessibility issues, and newer models with cameras are expensive and raise privacy concerns. There's a need for a non-intrusive, cost-effective alternative for hand tracking.Method: Transforms home assistants into active sonar systems using onboard microphones and speakers. Uses high-resolution range profiles fed to deep learning models to analyze motions and predict 3D positions of 21 finger joints without personalized training data.
Result: Achieves average mean absolute error of 16.47mm for joint tracking across different environments and users without subject-specific training data, as validated by a user study with 11 participants in 3 environments.
Conclusion: Beyond-Voice demonstrates that acoustic sensing can provide high-fidelity hand pose tracking using existing home assistant hardware, offering a privacy-preserving and cost-effective alternative to camera-based systems while maintaining cross-environment and cross-user functionality.
Abstract: The surging popularity of home assistants and their voice user interface (VUI) have made them an ideal central control hub for smart home devices. However, current form factors heavily rely on VUI, which poses accessibility and usability issues; some of the latest models are equipped with additional cameras and displays, which are costly and raise privacy concerns. These concerns jointly motivate Beyond-Voice, a novel high-fidelity acoustic sensing system that allows commodity home assistant devices to track and reconstruct hand poses continuously. It transforms the home assistant into an active sonar system using its existing onboard microphones and speakers. We feed a high-resolution range profile to a deep learning model that can analyze the motions of multiple body parts and predict the 3D positions of 21 finger joints, bringing the granularity of acoustic hand tracking to the next level. It operates across different environments and users without the need for personalized training data. A user study with 11 participants in 3 different environments shows that Beyond-Voice can track joints with an average mean absolute error of 16.47mm without any training data provided by the testing subject.
[269] CoPlay: Audio-agnostic Cognitive Scaling for Acoustic Sensing
Yin Li, Bo Liu, Rajalakshmi Nandakumar
Main category: cs.SD
TL;DR: CoPlay is a deep learning-based optimization algorithm that adapts acoustic sensing signals to work concurrently with music playback, avoiding signal overload while maintaining both sensing accuracy and music quality.
Details
Motivation: Current acoustic sensing systems face interference when speakers are used simultaneously for sensing and traditional applications like music playback, causing signal overload that degrades both sensing performance and audio quality.Method: A deep learning model that cognitively adapts sensing signals to maximize signal magnitude within available music bandwidth while minimizing frequency distortion, tested with sine wave and FMCW signals alongside various music and speech content.
Result: Respiration monitoring and gesture recognition achieved similar accuracy to no-concurrent-music scenarios, outperforming traditional clipping/down-scaling methods. Music quality was preserved without degradation.
Conclusion: CoPlay successfully enables concurrent acoustic sensing and music playback without compromising either functionality, making acoustic sensing more practical for real-world applications.
Abstract: Acoustic sensing shows great potential in various applications that encompass health monitoring, gesture interfaces and imaging by leveraging the speakers and microphones on smart devices. However, in ongoing research and development in acoustic sensing, one problem is often overlooked: the same speaker, when used concurrently for sensing and other traditional applications (like playing music), could cause interference in both, making the approach impractical in the real world. The strong ultrasonic sensing signals mixed with music would overload the speaker’s mixer. Current solutions to this overload are clipping or down-scaling, both of which degrade music playback quality as well as sensing range and accuracy. To address this challenge, we propose CoPlay, a deep learning based optimization algorithm to cognitively adapt the sensing signal. It can 1) maximize the sensing signal magnitude within the available bandwidth left by the concurrent music to optimize sensing range and accuracy and 2) minimize any consequential frequency distortion that can affect music playback. In this work, we design a deep learning model and test it on common types of sensing signals (sine wave or Frequency Modulated Continuous Wave (FMCW)) as inputs with various agnostic concurrent music and speech. First, we evaluate the model performance to show the quality of the generated signals. Then we conduct field studies of downstream acoustic sensing tasks in the real world. A study with 12 users shows that respiration monitoring and gesture recognition using our adapted signal achieve accuracy similar to no-concurrent-music scenarios, while clipping or down-scaling yields worse accuracy. A qualitative study also shows that music playback quality is not degraded, unlike with traditional clipping or down-scaling methods.
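The abstract states the two-part objective in words only. Here is a hedged sketch of what such a loss could look like in PyTorch; the headroom model, FFT-magnitude distortion term, and weight `lam` are illustrative assumptions rather than the paper's formulation.

```python
import torch

def coplay_style_loss(adapted, reference, music, ceiling=1.0, lam=1.0):
    """Sketch of the two-part objective described in the abstract.

    adapted:   candidate sensing waveform produced by the model, shape (T,)
    reference: original sensing waveform (sine or FMCW), shape (T,)
    music:     concurrent audio sharing the speaker, shape (T,)
    The stand-in objective (i) rewards magnitude up to the mixer headroom
    left by the music and (ii) penalizes spectral deviation from the
    reference sensing signal.
    """
    headroom = ceiling - music.abs()                       # per-sample room in the mixer
    overload = (adapted.abs() - headroom).clamp(min=0).mean()
    magnitude_reward = adapted.abs().mean()
    distortion = (torch.fft.rfft(adapted).abs()
                  - torch.fft.rfft(reference).abs()).pow(2).mean()
    return -magnitude_reward + overload + lam * distortion
```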
[270] Separate to Collaborate: Dual-Stream Diffusion Model for Coordinated Piano Hand Motion Synthesis
Zihao Liu, Mingwen Ou, Zunnan Xu, Jiaqi Huang, Haonan Han, Ronghui Li, Xiu Li
Main category: cs.SD
TL;DR: A dual-stream neural framework for generating synchronized bimanual piano hand gestures from audio, using decoupled diffusion with dual-noise initialization and Hand-Coordinated Asymmetric Attention to model both hand independence and coordination.
Details
Motivation: Automating synthesis of coordinated bimanual piano performances is challenging due to the need to capture intricate hand choreography while preserving distinct kinematic signatures of each hand.Method: Proposes a dual-stream neural framework with: (1) decoupled diffusion-based generation with dual-noise initialization for independent hand modeling, and (2) Hand-Coordinated Asymmetric Attention mechanism to suppress symmetric noise and enhance inter-hand coordination.
Result: Comprehensive evaluations show the framework outperforms existing state-of-the-art methods across multiple metrics.
Conclusion: The proposed framework successfully addresses the challenge of modeling both hand independence and coordination in piano performance synthesis, achieving superior performance compared to current methods.
Abstract: Automating the synthesis of coordinated bimanual piano performances poses significant challenges, particularly in capturing the intricate choreography between the hands while preserving their distinct kinematic signatures. In this paper, we propose a dual-stream neural framework designed to generate synchronized hand gestures for piano playing from audio input, addressing the critical challenge of modeling both hand independence and coordination. Our framework introduces two key innovations: (i) a decoupled diffusion-based generation framework that independently models each hand’s motion via dual-noise initialization, sampling distinct latent noise for each while leveraging a shared positional condition, and (ii) a Hand-Coordinated Asymmetric Attention (HCAA) mechanism suppresses symmetric (common-mode) noise to highlight asymmetric hand-specific features, while adaptively enhancing inter-hand coordination during denoising. Comprehensive evaluations demonstrate that our framework outperforms existing state-of-the-art methods across multiple metrics. Our project is available at https://monkek123king.github.io/S2C_page/.
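Innovation (i) is easy to state in code: each hand's stream is initialized from an independent latent-noise draw while both denoising passes consume the same condition. A minimal sketch follows, with `denoise` standing in for the paper's diffusion sampler and the tensor shape (60 frames x 21 joints x 3 coordinates) chosen for illustration.

```python
import torch

def sample_two_hands(denoise, cond, shape=(1, 60, 63), steps=50):
    """Dual-noise initialization: independent latent noise per hand,
    one shared positional/audio condition for both streams."""
    noise_left = torch.randn(shape)    # independent draw for the left hand
    noise_right = torch.randn(shape)   # independent draw for the right hand
    left = denoise(noise_left, cond, steps)
    right = denoise(noise_right, cond, steps)
    return left, right
```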
[271] Auto-Regressive vs Flow-Matching: a Comparative Study of Modeling Paradigms for Text-to-Music Generation
Or Tal, Felix Kreuk, Yossi Adi
Main category: cs.SD
TL;DR: Systematic comparison of auto-regressive vs flow-matching paradigms for text-to-music generation using identical training conditions to isolate modeling paradigm effects.
Details
Motivation: The diversity in SOTA text-to-music systems makes fair evaluation difficult. This study aims to isolate the effects of modeling paradigm choice (auto-regressive vs flow-matching) to understand trade-offs and guide future system design.Method: Controlled comparison training both paradigms from scratch using identical datasets, training configurations, and similar backbone architectures. Evaluated across generation quality, robustness, scalability, conditioning adherence, and audio inpainting capabilities.
Result: The study reveals distinct strengths and limitations of each paradigm, providing insights into their performance across multiple evaluation axes including quality, inference robustness, and editing capabilities.
Conclusion: The comparative analysis offers actionable insights to inform future architectural and training decisions in text-to-music generation, highlighting the importance of modeling paradigm selection.
Abstract: Recent progress in text-to-music generation has enabled models to synthesize high-quality musical segments, full compositions, and even respond to fine-grained control signals, e.g. chord progressions. State-of-the-art (SOTA) systems differ significantly in many dimensions, such as training datasets, modeling paradigms, and architectural choices. This diversity complicates efforts to evaluate models fairly and identify which design choices influence performance the most. While factors like data and architecture are important, in this study we focus exclusively on the modeling paradigm. We conduct a systematic empirical analysis to isolate its effects, offering insights into associated trade-offs and emergent behaviors that can guide future text-to-music generation systems. Specifically, we compare the two arguably most common modeling paradigms: auto-regressive decoding and conditional flow-matching. We conduct a controlled comparison by training all models from scratch using identical datasets, training configurations, and similar backbone architectures. Performance is evaluated across multiple axes, including generation quality, robustness to inference configurations, scalability, adherence to both textual and temporally aligned conditioning, and editing capabilities in the form of audio inpainting. This comparative study sheds light on distinct strengths and limitations of each paradigm, providing actionable insights that can inform future architectural and training decisions in the evolving landscape of text-to-music generation. Audio sampled examples are available at: https://huggingface.co/spaces/ortal1602/ARvsFM
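The two paradigms differ most visibly in their training objectives. The schematic sketch below contrasts them; shapes and models are placeholders, and the flow-matching variant shown is the common linear-path formulation, which may differ in detail from the paper's exact setup.

```python
import torch
import torch.nn.functional as F

def ar_loss(logits, target_tokens):
    # Auto-regressive decoding: next-token cross-entropy over discrete
    # audio-codec tokens. logits: (B, T, V); target_tokens: (B, T).
    return F.cross_entropy(logits.transpose(1, 2), target_tokens)

def flow_matching_loss(v_model, x0, x1, cond):
    # Conditional flow matching: regress the velocity of the straight path
    # x_t = (1 - t) * x0 + t * x1 between noise x0 and data latent x1.
    t = torch.rand(x0.shape[0], 1, 1)
    x_t = (1 - t) * x0 + t * x1
    target_velocity = x1 - x0
    return F.mse_loss(v_model(x_t, t, cond), target_velocity)
```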
[272] AImoclips: A Benchmark for Evaluating Emotion Conveyance in Text-to-Music Generation
Gyehun Go, Satbyul Han, Ahyeon Choi, Eunjin Choi, Juhan Nam, Jeong Mi Park
Main category: cs.SD
TL;DR: AImoclips benchmark evaluates text-to-music systems’ emotional fidelity, finding commercial models produce overly pleasant music while open-source models under-deliver on emotion, with all systems biased toward emotional neutrality.
Details
Motivation: Text-to-music generation has advanced but emotional fidelity remains underexplored compared to human preference or text alignment, creating a need to evaluate how well these systems convey intended emotions.Method: Created AImoclips benchmark with 12 emotion intents across valence-arousal space, generated 1,000+ music clips using 6 state-of-the-art TTM systems, and had 111 participants rate perceived valence/arousal on 9-point Likert scale.
Result: Commercial systems produce music perceived as more pleasant than intended, while open-source systems tend toward the opposite. Emotions are more accurately conveyed under high-arousal conditions. All systems show a bias toward emotional neutrality.
Conclusion: The benchmark reveals model-specific emotion rendering characteristics and limitations in affective controllability, providing valuable insights for developing emotionally aligned text-to-music systems.
Abstract: Recent advances in text-to-music (TTM) generation have enabled controllable and expressive music creation using natural language prompts. However, the emotional fidelity of TTM systems remains largely underexplored compared to human preference or text alignment. In this study, we introduce AImoclips, a benchmark for evaluating how well TTM systems convey intended emotions to human listeners, covering both open-source and commercial models. We selected 12 emotion intents spanning four quadrants of the valence-arousal space, and used six state-of-the-art TTM systems to generate over 1,000 music clips. A total of 111 participants rated the perceived valence and arousal of each clip on a 9-point Likert scale. Our results show that commercial systems tend to produce music perceived as more pleasant than intended, while open-source systems tend toward the opposite. Emotions are more accurately conveyed under high-arousal conditions across all models. Additionally, all systems exhibit a bias toward emotional neutrality, highlighting a key limitation in affective controllability. This benchmark offers valuable insights into model-specific emotion rendering characteristics and supports future development of emotionally aligned TTM systems.
[273] EZhouNet:A framework based on graph neural network and anchor interval for the respiratory sound event detection
Yun Chu, Qiuhao Wang, Enze Zhou, Qian Liu, Gang Zheng
Main category: cs.SD
TL;DR: Proposes a graph neural network framework with anchor intervals for respiratory sound event detection that handles variable-length audio and provides precise temporal localization of abnormal sounds.
Details
Motivation: Existing respiratory sound event detection methods rely on frame-level predictions with post-processing, struggle with interval boundaries, handle only fixed-length audio, and don't adequately explore location information impact.Method: Graph neural network-based framework with anchor intervals that can process variable-length audio and incorporate respiratory position information for better discrimination.
Result: Experiments on SPRSound 2024 and HF Lung V1 datasets demonstrate effectiveness, with respiratory position information enhancing abnormal sound discrimination.
Conclusion: The proposed approach improves flexibility and applicability of respiratory sound detection, providing more precise temporal localization for abnormal respiratory sound events.
Abstract: Auscultation is a key method for early diagnosis of respiratory and pulmonary diseases, relying on skilled healthcare professionals. However, the process is often subjective, with variability between experts. As a result, numerous deep learning-based automatic classification methods have emerged, most of which focus on respiratory sound classification. In contrast, research on respiratory sound event detection remains limited. Existing sound event detection methods typically rely on frame-level predictions followed by post-processing to generate event-level outputs, making interval boundaries challenging to learn directly. Furthermore, many approaches can only handle fixed-length audio, limiting their applicability to variable-length respiratory sounds. Additionally, the impact of respiratory sound location information on detection performance has not been extensively explored. To address these issues, we propose a graph neural network-based framework with anchor intervals, capable of handling variable-length audio and providing more precise temporal localization for abnormal respiratory sound events. Our method improves both the flexibility and applicability of respiratory sound detection. Experiments on the SPRSound 2024 and HF Lung V1 datasets demonstrate the effectiveness of the proposed approach, and incorporating respiratory position information enhances the discrimination between abnormal sounds. The reference implementation is available at https://github.com/chumingqian/EzhouNet.
[274] FireRedTTS-2: Towards Long Conversational Speech Generation for Podcast and Chatbot
Kun Xie, Feiyu Shen, Junjie Li, Fenglong Xie, Xu Tang, Yao Hu
Main category: cs.SD
TL;DR: FireRedTTS-2 is a streaming multi-speaker dialogue TTS system that enables real-time interactive chat with stable synthesis, accurate speaker switching, and context-aware prosody using a novel 12.5Hz tokenizer and dual-transformer architecture.
Details
Motivation: Current dialogue generation systems require complete dialogue text upfront, produce inseparable multi-speaker speech, and suffer from unstable synthesis, inaccurate speaker transitions, and incoherent prosody, making them unsuitable for interactive applications.Method: Uses a new 12.5Hz streaming speech tokenizer for faster training/inference and richer semantics. Adopts text-speech interleaved format with speaker-labeled text and aligned tokens. Employs dual-transformer architecture: large decoder-only transformer for first-layer token prediction and smaller transformer for subsequent layers.
Result: Seamlessly integrates with chat frameworks, produces emotionally expressive speech with minimal fine-tuning, and surpasses MoonCast, Zipvoice-Dialogue, and MOSS-TTSD in intelligibility, speaker-turn reliability, and naturalness with context-consistent prosody.
Conclusion: FireRedTTS-2 enables real-time multi-speaker dialogue generation with stable, natural speech output, reliable speaker switching, and context-aware prosody, making it suitable for interactive applications like chat and podcast generation.
Abstract: Current dialogue generation approaches typically require the complete dialogue text before synthesis and produce a single, inseparable speech containing all voices, making them unsuitable for interactive chat; moreover, they suffer from unstable synthesis, inaccurate speaker transitions, and incoherent prosody. In this work, we present FireRedTTS-2, a long-form streaming TTS system for multi-speaker dialogue generation, delivering stable, natural speech with reliable speaker switching and context-aware prosody. A new 12.5Hz streaming speech tokenizer accelerates training and inference, extends maximum dialogue length, encodes richer semantics to stabilize text-to-token modeling and supports high-fidelity streaming generation for real-time applications. We adopt a text-speech interleaved format, concatenating speaker-labeled text with aligned speech tokens in chronological order, and model it with a dual-transformer: a large decoder-only transformer predicts tokens at the first layer, and a smaller one completes subsequent layers. Experimental results show that FireRedTTS-2 integrates seamlessly with chat frameworks and, with minimal fine-tuning, produces emotionally expressive speech guided by implicit contextual cues. In podcast generation, it surpasses existing systems including MoonCast, Zipvoice-Dialogue, and MOSS-TTSD in objective intelligibility, speaker-turn reliability, and perceived naturalness with context-consistent prosody. Our demos are available at https://fireredteam.github.io/demos/firered_tts_2.
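The text-speech interleaved format is straightforward to illustrate. The sketch below builds a speaker-labeled, chronologically interleaved token sequence; the special-token names and speech-token strings are hypothetical, since the paper's actual vocabulary is not reproduced in the abstract.

```python
def interleave_dialogue(turns):
    """turns: list of (speaker_id, text_tokens, speech_tokens) in time order.
    Returns one flat sequence alternating speaker-labeled text with its
    aligned speech tokens, as in the interleaved format the paper describes."""
    sequence = []
    for speaker, text_tokens, speech_tokens in turns:
        sequence.append(f"<spk{speaker}>")
        sequence.extend(text_tokens)
        sequence.append("<speech>")
        sequence.extend(speech_tokens)
        sequence.append("</speech>")
    return sequence

seq = interleave_dialogue([
    (1, ["hello", "there"], ["a12", "a87", "a03"]),
    (2, ["hi", "!"], ["a55", "a21"]),
])
print(seq[:8])  # ['<spk1>', 'hello', 'there', '<speech>', 'a12', 'a87', 'a03', '</speech>']
```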
[275] AudioCodecBench: A Comprehensive Benchmark for Audio Codec Evaluation
Lu Wang, Hao Chen, Siyu Wu, Zhiyue Wu, Hao Zhou, Chengfeng Zhang, Ting Wang, Haodi Zhang
Main category: cs.SD
TL;DR: This paper addresses challenges in audio tokenization for Multimodal Large Language Models by providing clear definitions for semantic and acoustic tokens, and introducing a comprehensive evaluation framework across multiple dimensions.
Details
Motivation: Existing research lacks suitable definitions for semantic and acoustic tokens in audio tokenization, and current evaluations are limited to specific domains or tasks, preventing fair and comprehensive comparisons of different codecs.Method: The paper provides appropriate definitions for semantic and acoustic tokens and introduces a systematic evaluation framework that assesses codecs across four dimensions: audio reconstruction metrics, codebook index stability, decoder-only transformer perplexity, and performance on downstream probe tasks.
Result: The results demonstrate the correctness of the provided definitions and reveal correlations among reconstruction metrics, codebook ID stability, downstream probe tasks performance, and perplexity measurements.
Conclusion: The proposed framework enables comprehensive assessment of audio codecs for MLLMs, addressing previous limitations in token definitions and evaluation methods, and establishing important correlations between different evaluation dimensions.
Abstract: Multimodal Large Language Models (MLLMs) have been widely applied in speech and music. This tendency has led to a focus on audio tokenization for Large Models (LMs). Unlike semantic-only text tokens, audio tokens must both capture global semantic content and preserve fine-grained acoustic details. Moreover, they provide a discrete method for speech and music that can be effectively integrated into MLLMs. However, existing research lacks suitable definitions of semantic and acoustic tokens. In addition, the evaluation of different codecs typically concentrates on specific domains or tasks, such as reconstruction or Automatic Speech Recognition (ASR), which prevents fair and comprehensive comparisons. To address these problems, this paper provides suitable definitions for semantic and acoustic tokens and introduces a systematic evaluation framework. This framework allows for a comprehensive assessment of codecs’ capabilities across four dimensions: audio reconstruction metrics, codebook index (ID) stability, decoder-only transformer perplexity, and performance on downstream probe tasks. Our results show the correctness of the provided definitions and the correlation among reconstruction metrics, codebook ID stability, downstream probe tasks and perplexity.
cs.LG
[276] Learning an Adversarial World Model for Automated Curriculum Generation in MARL
Brennen Hill
Main category: cs.LG
TL;DR: Adversarial co-evolution framework where an Attacker agent learns to generate increasingly difficult challenges to exploit Defender weaknesses, while Defenders learn cooperative policies, creating a self-scaling curriculum for robust agent development.
Details
Motivation: Overcome limitations of hand-crafted training environments by developing scalable environments that grow in complexity alongside learning agents, enabling truly generalizable and robust embodied intelligence.Method: Goal-conditioned generative world model where an Attacker agent synthesizes challenging world states (enemy unit configurations) to exploit Defender weaknesses, while Defender team learns cooperative policies through this adversarial co-evolution.
Result: Emergence of complex behaviors including flanking/shielding formations by the world model and coordinated focus-fire/spreading tactics by defenders, demonstrating strategic depth and robustness.
Conclusion: Adversarial co-evolution is a powerful method for learning instrumental world models that drive agents toward greater strategic capabilities through continuous, adaptive challenge generation.
Abstract: World models that infer and predict environmental dynamics are foundational to embodied intelligence. However, their potential is often limited by the finite complexity and implicit biases of hand-crafted training environments. To develop truly generalizable and robust agents, we need environments that scale in complexity alongside the agents learning within them. In this work, we reframe the challenge of environment generation as the problem of learning a goal-conditioned, generative world model. We propose a system where a generative Attacker agent learns an implicit world model to synthesize increasingly difficult challenges for a team of cooperative Defender agents. The Attacker’s objective is not passive prediction, but active, goal-driven interaction: it models and generates world states (i.e., configurations of enemy units) specifically to exploit the Defenders’ weaknesses. Concurrently, the embodied Defender team learns a cooperative policy to overcome these generated worlds. This co-evolutionary dynamic creates a self-scaling curriculum where the world model continuously adapts to challenge the decision-making policy of the agents, providing an effectively infinite stream of novel and relevant training scenarios. We demonstrate that this framework leads to the emergence of complex behaviors, such as the world model learning to generate flanking and shielding formations, and the defenders learning coordinated focus-fire and spreading tactics. Our findings position adversarial co-evolution as a powerful method for learning instrumental world models that drive agents toward greater strategic depth and robustness.
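The co-evolutionary dynamic reduces to a simple training loop. Below is a schematic sketch; all four objects are placeholders for the paper's learned components, and the zero-sum reward split is our reading of the Attacker's goal-driven objective.

```python
def coevolve(attacker, defenders, env, epochs=100, episodes_per_epoch=32):
    """Schematic attacker-defender co-evolution loop (a sketch, not the
    paper's implementation)."""
    for epoch in range(epochs):
        for _ in range(episodes_per_epoch):
            # Attacker generates a world state (e.g., an enemy-unit
            # configuration) aimed at exploiting the current Defenders.
            world_state = attacker.generate(defenders)
            episode = env.rollout(world_state, defenders)
            # Opposed updates: the Attacker is rewarded where the Defender
            # team fails, yielding a self-scaling curriculum.
            defenders.update(episode, reward=episode.team_return)
            attacker.update(episode, reward=-episode.team_return)
```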
[277] The Optimiser Hidden in Plain Sight: Training with the Loss Landscape’s Induced Metric
Thomas R. Harvey
Main category: cs.LG
TL;DR: Novel Riemannian geometry-based optimizers for neural networks that use loss landscape embedding metrics, showing effectiveness in low dimensions and slight improvements over state-of-the-art methods.
Details
Motivation: To leverage the Riemannian metric naturally induced when embedding loss landscapes in higher-dimensional space, which underlies common visualizations of loss landscapes, to develop more effective optimization methods.Method: Developed a new class of optimizers using the induced Riemannian metric from loss landscape embeddings. Compared against SGD, Adam, AdamW, and Muon across various tasks and architectures. The method can modify any existing preconditioning approach.
Result: Highly effective in low-dimensional examples and provides slight improvement over state-of-the-art methods. The optimizers have computational complexity comparable to Adam and offer automatic learning rate reduction in high-curvature regions (acting like smoothed gradient clipping).
Conclusion: The Riemannian geometry-based approach yields optimizers with theoretically desirable properties, including natural decoupled weight decay and effective learning rate scheduling, while maintaining competitive performance and computational efficiency.
Abstract: We present a class of novel optimisers for training neural networks that makes use of the Riemannian metric naturally induced when the loss landscape is embedded in higher-dimensional space. This is the same metric that underlies common visualisations of loss landscapes. By taking this geometric perspective literally and using the induced metric, we develop a new optimiser and compare it to existing methods, namely: SGD, Adam, AdamW, and Muon, across a range of tasks and architectures. Empirically, we conclude that this new class of optimisers is highly effective in low dimensional examples, and provides slight improvement over state-of-the-art methods for training neural networks. These new optimisers have theoretically desirable properties. In particular, the effective learning rate is automatically decreased in regions of high curvature acting as a smoothed out form of gradient clipping. Similarly, one variant of these optimisers can also be viewed as inducing an effective scheduled learning rate and decoupled weight decay is the natural choice from our geometric perspective. The basic method can be used to modify any existing preconditioning method. The new optimiser has a computational complexity comparable to that of Adam.
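The abstract does not spell out the update rule, but for the standard graph embedding theta -> (theta, L(theta)) the induced metric is g = I + (grad L)(grad L)^T, and the Sherman-Morrison identity gives the preconditioned step g^{-1} grad L = grad L / (1 + ||grad L||^2), a smoothed form of gradient clipping that matches the behavior the abstract describes. Below is a PyTorch sketch under that reading; it is our reconstruction, not the authors' released optimiser.

```python
import torch

class InducedMetricSGD(torch.optim.Optimizer):
    """Sketch of gradient descent under the loss graph's induced metric.

    For the embedding theta -> (theta, L(theta)), g = I + gL gL^T, and
    Sherman-Morrison yields g^{-1} gL = gL / (1 + ||gL||^2): the effective
    learning rate shrinks automatically where gradients are large.
    """

    def __init__(self, params, lr=1e-2):
        super().__init__(params, dict(lr=lr))

    @torch.no_grad()
    def step(self):
        for group in self.param_groups:
            grads = [p.grad for p in group["params"] if p.grad is not None]
            sq_norm = float(sum(g.pow(2).sum() for g in grads))  # ||grad L||^2
            scale = group["lr"] / (1.0 + sq_norm)
            for p in group["params"]:
                if p.grad is not None:
                    p.add_(p.grad, alpha=-scale)
```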
[278] CEHR-GPT: A Scalable Multi-Task Foundation Model for Electronic Health Records
Chao Pang, Jiheum Park, Xinzhuo Jiang, Nishanth Parameshwar Pavinkurve, Krishna S. Kalluri, Shalmali Joshi, Noémie Elhadad, Karthik Natarajan
Main category: cs.LG
TL;DR: CEHR-GPT is a general-purpose foundation model for EHR data that unifies feature representation, zero-shot prediction, and synthetic data generation in a single architecture with time-token-based learning for temporal reasoning.
Details
Motivation: Most AI models for EHRs are designed for narrow, single-purpose tasks, limiting their generalizability and utility in real-world clinical settings.Method: Developed CEHR-GPT with a novel time-token-based learning framework that explicitly encodes patients’ dynamic timelines into the model structure, enabling temporal reasoning over clinical sequences.
Result: CEHR-GPT demonstrates strong performance across all three tasks (feature representation, zero-shot prediction, synthetic data generation) and generalizes effectively to external datasets through vocabulary expansion and fine-tuning.
Conclusion: The model enables rapid model development, cohort discovery, and patient outcome forecasting without task-specific retraining, making it versatile for real-world healthcare applications.
Abstract: Electronic Health Records (EHRs) provide a rich, longitudinal view of patient health and hold significant potential for advancing clinical decision support, risk prediction, and data-driven healthcare research. However, most artificial intelligence (AI) models for EHRs are designed for narrow, single-purpose tasks, limiting their generalizability and utility in real-world settings. Here, we present CEHR-GPT, a general-purpose foundation model for EHR data that unifies three essential capabilities - feature representation, zero-shot prediction, and synthetic data generation - within a single architecture. To support temporal reasoning over clinical sequences, CEHR-GPT incorporates a novel time-token-based learning framework that explicitly encodes patients' dynamic timelines into the model structure. CEHR-GPT demonstrates strong performance across all three tasks and generalizes effectively to external datasets through vocabulary expansion and fine-tuning. Its versatility enables rapid model development, cohort discovery, and patient outcome forecasting without the need for task-specific retraining.
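The time-token idea can be illustrated with a toy encoder that interleaves discretized time-gap tokens between clinical events, letting a sequence model reason over elapsed time. The bucket boundaries and token names below are illustrative assumptions, not CEHR-GPT's actual vocabulary.

```python
def encode_timeline(events):
    """events: list of (day, concept_code) sorted by day.
    Returns event tokens separated by discretized time-gap tokens."""
    def gap_token(days):
        if days == 0:
            return None                  # same-day events need no gap token
        for bound, name in [(7, "W"), (30, "M"), (365, "Y")]:
            if days <= bound:
                return f"<gap:{name}>"
        return "<gap:LT>"                # long-term gap

    tokens, prev_day = [], None
    for day, code in events:
        if prev_day is not None:
            g = gap_token(day - prev_day)
            if g:
                tokens.append(g)
        tokens.append(code)
        prev_day = day
    return tokens

print(encode_timeline([(0, "ICD10:I10"), (3, "RX:lisinopril"), (95, "ICD10:I10")]))
# ['ICD10:I10', '<gap:W>', 'RX:lisinopril', '<gap:Y>', 'ICD10:I10']
```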
[279] Nonnegative matrix factorization and the principle of the common cause
E. Khalafyan, A. E. Allahverdyan, A. Hovhannisyan
Main category: cs.LG
TL;DR: NMF and PCC are closely related - PCC helps estimate stable NMF rank and resolve nonidentifiability, while NMF enables approximate PCC implementation for clustering and denoising.
Details
Motivation: To explore the reciprocal relationship between Nonnegative Matrix Factorization (NMF) and the Principle of the Common Cause (PCC) in probabilistic causality, and leverage their connection for robust data analysis.Method: Used PCC as predictability tool to estimate NMF rank, applied to gray-scale image datasets mapped into probability models. Developed clustering method grouping data points with same common cause.
Result: PCC-based rank estimation is stable against weak noise and local optimization seeds. NMF produces stable features and resolves nonidentifiability. NMF enables approximate PCC implementation for effective clustering and denoising.
Conclusion: NMF and PCC have strong reciprocal benefits - PCC provides robust rank estimation for NMF, while NMF offers practical implementation of PCC for clustering and noise reduction in data analysis.
Abstract: Nonnegative matrix factorization (NMF) is a known unsupervised data-reduction method. The principle of the common cause (PCC) is a basic methodological approach in probabilistic causality, which seeks an independent mixture model for the joint probability of two dependent random variables. It turns out that these two concepts are closely related. This relationship is explored reciprocally for several datasets of gray-scale images, which are conveniently mapped into probability models. On one hand, PCC provides a predictability tool that leads to a robust estimation of the effective rank of NMF. Unlike other estimates (e.g., those based on the Bayesian Information Criteria), our estimate of the rank is stable against weak noise. We show that NMF implemented around this rank produces features (basis images) that are also stable against noise and against seeds of local optimization, thereby effectively resolving the NMF nonidentifiability problem. On the other hand, NMF provides an interesting possibility of implementing PCC in an approximate way, where larger and positively correlated joint probabilities tend to be explained better via the independent mixture model. We work out a clustering method, where data points with the same common cause are grouped into the same cluster. We also show how NMF can be employed for data denoising.
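For readers unfamiliar with the NMF side, here is a minimal scikit-learn sketch of rank selection by reconstruction error, the conventional baseline that the paper's PCC-based predictability criterion is argued to improve on. The random matrix stands in for vectorized gray-scale images.

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
X = rng.random((100, 64))  # stand-in for 100 vectorized gray-scale images

# Sweep candidate ranks and watch the relative reconstruction error; the
# paper replaces this elbow heuristic with a PCC-based predictability test
# that is more robust to weak noise.
for rank in (2, 4, 8, 16):
    model = NMF(n_components=rank, init="nndsvda", max_iter=500, random_state=0)
    W = model.fit_transform(X)   # nonnegative mixture weights
    H = model.components_        # nonnegative basis "images"
    err = np.linalg.norm(X - W @ H) / np.linalg.norm(X)
    print(f"rank={rank:2d}  relative error={err:.3f}")
```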
[280] Semi-decentralized Federated Time Series Prediction with Client Availability Budgets
Yunkai Bao, Reza Safarzadeh, Xin Wang, Steve Drew
Main category: cs.LG
TL;DR: FedDeCAB is a semi-decentralized client selection method for federated learning that handles client disconnections by obtaining partial model parameters from nearest neighbors, improving performance and reducing communication overhead in IoT scenarios with time-series data.
Details
Motivation: Federated learning in IoT scenarios faces challenges with data heterogeneity, limited energy budgets, and client availability issues. Effective client selection is crucial for global model convergence and balancing client contributions, especially with time-series data where client availability patterns impact performance.Method: Proposed FedDeCAB - a novel semi-decentralized client selection method using probabilistic rankings of available clients. When clients disconnect from the server, it allows obtaining partial model parameters from nearest neighbor clients for joint optimization.
Result: Experiments on real-world large-scale taxi and vessel trajectory datasets show FedDeCAB is effective under highly heterogeneous data distribution, limited communication budget, and dynamic client offline/rejoining scenarios.
Conclusion: FedDeCAB successfully addresses client availability challenges in federated learning with time-series data, improving offline model performance while reducing communication overhead in dynamic IoT environments.
Abstract: Federated learning (FL) effectively promotes collaborative training among distributed clients with privacy considerations in the Internet of Things (IoT) scenarios. Despite of data heterogeneity, FL clients may also be constrained by limited energy and availability budgets. Therefore, effective selection of clients participating in training is of vital importance for the convergence of the global model and the balance of client contributions. In this paper, we discuss the performance impact of client availability with time-series data on federated learning. We set up three different scenarios that affect the availability of time-series data and propose FedDeCAB, a novel, semi-decentralized client selection method applying probabilistic rankings of available clients. When a client is disconnected from the server, FedDeCAB allows obtaining partial model parameters from the nearest neighbor clients for joint optimization, improving the performance of offline models and reducing communication overhead. Experiments based on real-world large-scale taxi and vessel trajectory datasets show that FedDeCAB is effective under highly heterogeneous data distribution, limited communication budget, and dynamic client offline or rejoining.
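The offline-client fallback is simple to sketch: blend a subset of parameters with those of the nearest neighbor and keep the rest local. The parameter-selection rule and mixing weight below are illustrative assumptions; the paper's probabilistic client ranking is not reproduced here.

```python
def merge_partial_params(local_state, neighbor_state, shared_keys, alpha=0.5):
    """Blend only the layers named in `shared_keys`; keep the rest local.
    local_state / neighbor_state: parameter dicts (e.g., torch state_dicts)."""
    merged = dict(local_state)
    for key in shared_keys:
        merged[key] = alpha * local_state[key] + (1 - alpha) * neighbor_state[key]
    return merged
```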
[281] AutoGrid AI: Deep Reinforcement Learning Framework for Autonomous Microgrid Management
Kenny Guo, Nicholas Eckhert, Krish Chhajer, Luthira Abeykoon, Lorne Schell
Main category: cs.LG
TL;DR: Deep RL framework for autonomous microgrid management using transformer forecasting and PPO agent to optimize renewable energy dispatch, showing significant improvements over traditional methods.
Details
Motivation: To develop autonomous microgrid management for remote communities, optimizing renewable energy utilization and minimizing costs in pursuit of zero-carbon energy systems.Method: Combines deep reinforcement learning with time-series forecasting using transformer architecture for renewable generation prediction and proximal-policy optimization (PPO) agent for decision-making in simulated microgrid environments.
Result: Experimental results demonstrate significant improvements in both energy efficiency and operational resilience compared to traditional rule-based methods.
Conclusion: The work contributes to advancing smart-grid technologies and provides an open-source framework for simulating microgrid environments, supporting the transition to zero-carbon energy systems.
Abstract: We present a deep reinforcement learning-based framework for autonomous microgrid management, tailored for remote communities. Using deep reinforcement learning and time-series forecasting models, we optimize microgrid energy dispatch strategies to minimize costs and maximize the utilization of renewable energy sources such as solar and wind. Our approach integrates the transformer architecture for forecasting renewable generation and a proximal-policy optimization (PPO) agent to make decisions in a simulated environment. Our experimental results demonstrate significant improvements in both energy efficiency and operational resilience when compared to traditional rule-based methods. This work contributes to advancing smart-grid technologies in pursuit of zero-carbon energy systems. Finally, we provide an open-source framework for simulating several microgrid environments.
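A toy gym-style environment makes the setup concrete: the state tracks battery level and time of day, actions charge or discharge the battery, and the reward penalizes energy purchased from the grid. All dynamics and prices below are illustrative stand-ins, not the paper's simulator.

```python
import numpy as np

class MicrogridEnv:
    """Toy gym-style microgrid environment (a sketch, not the paper's code).
    State: (battery level, hour of day); action: battery charge (+) or
    discharge (-) in kWh."""

    def __init__(self, capacity=100.0, price_per_kwh=0.3):
        self.capacity = capacity
        self.price = price_per_kwh
        self.reset()

    def reset(self):
        self.battery = self.capacity / 2
        self.t = 0
        return self._obs()

    def _obs(self):
        return np.array([self.battery, self.t % 24], dtype=np.float32)

    def step(self, action):
        solar = max(0.0, 20.0 * np.sin(np.pi * (self.t % 24) / 24.0))  # kWh this hour
        demand = 15.0 + 5.0 * np.random.rand()
        battery_new = float(np.clip(self.battery + action, 0.0, self.capacity))
        discharge = self.battery - battery_new           # > 0 when serving load
        grid_import = max(0.0, demand - solar - discharge)
        reward = -self.price * grid_import               # minimize purchased energy
        self.battery, self.t = battery_new, self.t + 1
        return self._obs(), reward, self.t >= 24, {}

env = MicrogridEnv()
obs = env.reset()
obs, reward, done, info = env.step(action=-5.0)  # discharge 5 kWh
```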
[282] SharedRep-RLHF: A Shared Representation Approach to RLHF with Diverse Preferences
Arpan Mukherjee, Marcello Bullo, Deniz Gündüz
Main category: cs.LG
TL;DR: SharedRep-RLHF is a new framework that addresses limitations of MaxMin-RLHF by learning shared traits across groups instead of separate reward models, improving performance especially for minority groups.
Details
Motivation: Standard RLHF fails to capture diverse opinions across sub-populations and favors dominant groups. MaxMin-RLHF addresses fairness but performs poorly when the minimum-reward group is a minority.Method: SharedRep-RLHF learns and leverages shared traits in annotations among various groups, rather than learning separate group-specific reward models like MaxMin-RLHF.
Result: Experiments show SharedRep-RLHF outperforms MaxMin-RLHF with up to 20% gain in win rate across diverse natural language tasks.
Conclusion: Learning shared traits across groups is more effective than separate group-specific models for fair RLHF, particularly benefiting minority groups.
Abstract: Uniform-reward reinforcement learning from human feedback (RLHF), which trains a single reward model to represent the preferences of all annotators, fails to capture the diversity of opinions across sub-populations, inadvertently favoring dominant groups. The state-of-the-art, MaxMin-RLHF, addresses this by learning group-specific reward models, and by optimizing for the group receiving the minimum reward, thereby promoting fairness. However, we identify that a key limitation of MaxMin-RLHF is its poor performance when the minimum-reward group is a minority. To mitigate this drawback, we introduce a novel framework, termed {\em SharedRep-RLHF}. At its core, SharedRep-RLHF learns and leverages {\em shared traits} in annotations among various groups, in contrast to learning separate reward models across groups. We first show that MaxMin-RLHF is provably suboptimal in learning shared traits, and then quantify the sample complexity of SharedRep-RLHF. Experiments across diverse natural language tasks showcase the effectiveness of SharedRep-RLHF compared to MaxMin-RLHF with a gain of up to 20% in win rate.
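Architecturally, the contrast with MaxMin-RLHF is easy to sketch: one shared trunk learns traits common to all annotator groups, with lightweight per-group heads capturing what differs, instead of fully separate per-group reward models. Layer sizes and the exact trunk/head split below are our assumptions.

```python
import torch
import torch.nn as nn

class SharedRepRewardModel(nn.Module):
    """Sketch of the shared-representation idea: a common trunk plus small
    per-group heads, rather than one full reward model per group."""

    def __init__(self, d_in=768, d_hidden=256, n_groups=3):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(d_in, d_hidden), nn.ReLU())
        self.heads = nn.ModuleList(nn.Linear(d_hidden, 1) for _ in range(n_groups))

    def forward(self, features: torch.Tensor, group: int) -> torch.Tensor:
        return self.heads[group](self.trunk(features)).squeeze(-1)

rm = SharedRepRewardModel()
rewards = rm(torch.randn(8, 768), group=1)  # rewards under group 1's head
```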
[283] A Machine Learning-Based Study on the Synergistic Optimization of Supply Chain Management and Financial Supply Chains from an Economic Perspective
Hang Wang, Huijie Tang, Ningai Leng, Zhoufan Yu
Main category: cs.LG
TL;DR: A machine learning-powered SCM-FSCM model combining economic theories with algorithms like random forests, LSTM, and XGBoost to improve supply chain efficiency, reduce financing costs, and optimize inventory management.
Details
Motivation: To address efficiency loss, financing constraints, and risk transmission in supply chains by integrating economic theories with machine learning technology.Method: Combines Transaction Cost and Information Asymmetry theories with machine learning algorithms. Uses random forests for multi-dimensional data processing, LSTM for demand forecasting, clustering/regression for benefit allocation, Game Theory with reinforcement learning for inventory optimization, and XGBoost for credit assessment.
Result: 30% increase in inventory turnover, 18%-22% decrease in SME financing costs, stable order fulfillment rate above 95%, demand forecasting error <= 8%, credit assessment accuracy >= 90%. Verified with 20 core and 100 supporting enterprises.
Conclusion: The SCM-FSCM model effectively reduces operating costs, alleviates financing constraints, and supports high-quality supply chain development through data-driven optimization.
Abstract: Based on economic theories and integrated with machine learning technology, this study explores a collaborative Supply Chain Management and Financial Supply Chain Management (SCM - FSCM) model to solve issues like efficiency loss, financing constraints, and risk transmission. We combine Transaction Cost and Information Asymmetry theories and use algorithms such as random forests to process multi-dimensional data and build a data-driven, three-dimensional (cost-efficiency-risk) analysis framework. We then apply an FSCM model of “core enterprise credit empowerment plus dynamic pledge financing.” We use Long Short-Term Memory (LSTM) networks for demand forecasting and clustering/regression algorithms for benefit allocation. The study also combines Game Theory and reinforcement learning to optimize the inventory-procurement mechanism and uses eXtreme Gradient Boosting (XGBoost) for credit assessment to enable rapid monetization of inventory. Verified with 20 core and 100 supporting enterprises, the results show a 30% increase in inventory turnover, an 18%-22% decrease in SME financing costs, a stable order fulfillment rate above 95%, and excellent model performance (demand forecasting error <= 8%, credit assessment accuracy >= 90%). This SCM-FSCM model effectively reduces operating costs, alleviates financing constraints, and supports high-quality supply chain development.
[284] Insights from Gradient Dynamics: Gradient Autoscaled Normalization
Vincent-Daniel Yun
Main category: cs.LG
TL;DR: Empirical analysis of gradient variance evolution leads to hyperparameter-free normalization method that stabilizes optimization and maintains/improves test accuracy on CIFAR-100.
Details
Motivation: Gradient dynamics are crucial for stability and generalization in deep neural networks, but there's a gap between theoretical expectations and empirical behaviors that needs bridging.Method: Proposed hyperparameter-free gradient normalization method that aligns gradient scaling with their natural evolution during training, preventing unintended amplification while preserving convergence guarantees.
Result: Experiments on CIFAR-100 with ResNet-20, ResNet-56, and VGG-16-BN show maintained or improved test accuracy even under strong generalization conditions.
Conclusion: Direct tracking of gradient dynamics is important for optimization research, and the proposed method provides stable optimization while bridging theoretical-empirical gaps.
Abstract: Gradient dynamics play a central role in determining the stability and generalization of deep neural networks. In this work, we provide an empirical analysis of how variance and standard deviation of gradients evolve during training, showing consistent changes across layers and at the global scale in convolutional networks. Motivated by these observations, we propose a hyperparameter-free gradient normalization method that aligns gradient scaling with their natural evolution. This approach prevents unintended amplification, stabilizes optimization, and preserves convergence guarantees. Experiments on the challenging CIFAR-100 benchmark with ResNet-20, ResNet-56, and VGG-16-BN demonstrate that our method maintains or improves test accuracy even under strong generalization. Beyond practical performance, our study highlights the importance of directly tracking gradient dynamics, aiming to bridge the gap between theoretical expectations and empirical behaviors, and to provide insights for future optimization research.
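The paper's exact normalization rule is not reproduced in the abstract; the sketch below shows one simple realization of the idea: track a smoothed global gradient norm and rescale each step's gradients toward that trend, damping spikes without a clipping threshold. The smoothing constant here is an assumption of this sketch, whereas the paper describes its method as hyperparameter-free.

```python
import torch

class GradientAutoscaler:
    """Sketch: rescale gradients to follow their own smoothed scale.
    Call rescale(model) after loss.backward() and before optimizer.step()."""

    def __init__(self, momentum=0.9):
        self.momentum = momentum
        self.running_norm = None

    @torch.no_grad()
    def rescale(self, model):
        grads = [p.grad for p in model.parameters() if p.grad is not None]
        norm = torch.sqrt(sum(g.pow(2).sum() for g in grads))
        if self.running_norm is None:
            self.running_norm = norm
        self.running_norm = (self.momentum * self.running_norm
                             + (1 - self.momentum) * norm)
        scale = self.running_norm / (norm + 1e-12)  # pull spikes back to trend
        for g in grads:
            g.mul_(scale)
```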
[285] A Comprehensive Review of Multi-Agent Reinforcement Learning in Video Games
Zhengyang Li, Qijin Ji, Xinghong Ling, Quan Liu
Main category: cs.LG
TL;DR: Comprehensive review of MARL applications in video games from turn-based to real-time genres, analyzing challenges and successful implementations while proposing game complexity estimation and future research directions.
Details
Motivation: Recent advancements in MARL have shown superhuman performance in games like AlphaStar and OpenAI Five, creating need for comprehensive review of applications across diverse game environments.Method: Thorough examination of MARL applications across game genres (Sports, FPS, RTS, MOBA), analysis of critical challenges (nonstationary, partial observability, sparse rewards, coordination, scalability), and review of successful implementations in various games.
Result: Provides insights into MARL in video game AI systems, proposes novel method to estimate game complexity, and highlights successful MARL implementations across multiple game titles and genres.
Conclusion: MARL has proven capable of achieving superhuman performance in diverse games through techniques like self-play and deep RL; paper suggests future research directions to advance MARL applications in game development and inspire further innovation.
Abstract: Recent advancements in multi-agent reinforcement learning (MARL) have demonstrated its application potential in modern games. Beginning with foundational work and progressing to landmark achievements such as AlphaStar in StarCraft II and OpenAI Five in Dota 2, MARL has proven capable of achieving superhuman performance across diverse game environments through techniques like self-play, supervised learning, and deep reinforcement learning. With its growing impact, a comprehensive review has become increasingly important in this field. This paper aims to provide a thorough examination of MARL’s application from turn-based two-agent games to real-time multi-agent video games, including popular genres such as Sports games, First-Person Shooter (FPS) games, Real-Time Strategy (RTS) games and Multiplayer Online Battle Arena (MOBA) games. We further analyze critical challenges posed by MARL in video games, including nonstationarity, partial observability, sparse rewards, team coordination, and scalability, and highlight successful implementations in games like Rocket League, Minecraft, Quake III Arena, StarCraft II, Dota 2, Honor of Kings, etc. This paper offers insights into MARL in video game AI systems, proposes a novel method to estimate game complexity, and suggests future research directions to advance MARL and its applications in game development, inspiring further innovation in this rapidly evolving field.
[286] Graph Random Features for Scalable Gaussian Processes
Matthew Zhang, Jihao Andreas Lin, Adrian Weller, Richard E. Turner, Isaac Reid
Main category: cs.LG
TL;DR: Graph Random Features enable scalable Gaussian processes on graphs with O(N^{3/2}) time complexity instead of O(N^3), allowing Bayesian optimization on graphs with over 1M nodes on a single chip.
Details
Motivation: To overcome the computational limitations of exact graph node kernels which have O(N^3) time complexity, making Bayesian inference and optimization impractical for large graphs.Method: Apply Graph Random Features (GRFs) as stochastic estimators of graph node kernels for scalable Gaussian processes on discrete input spaces.
Result: Achieves substantial wall-clock speedups and memory savings, enabling Bayesian optimization on graphs with over 1,000,000 nodes on a single computer chip while maintaining competitive performance.
Conclusion: GRFs provide an efficient and practical solution for scalable Bayesian inference on large graphs, significantly reducing computational complexity from O(N^3) to O(N^{3/2}) while preserving performance.
Abstract: We study the application of graph random features (GRFs) - a recently introduced stochastic estimator of graph node kernels - to scalable Gaussian processes on discrete input spaces. We prove that (under mild assumptions) Bayesian inference with GRFs enjoys $O(N^{3/2})$ time complexity with respect to the number of nodes $N$, compared to $O(N^3)$ for exact kernels. Substantial wall-clock speedups and memory savings unlock Bayesian optimisation on graphs with over $10^6$ nodes on a single computer chip, whilst preserving competitive performance.
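The exact GRF construction is beyond the abstract, but the random-walk principle that makes such estimators subcubic can be shown with a toy Monte-Carlo estimate of a Neumann-series node kernel. This is an illustration of the general idea only, not the paper's algorithm.

```python
import numpy as np

def neumann_kernel_row_mc(P, i, alpha=0.5, n_walks=2000, seed=0):
    """Toy Monte-Carlo estimate of row i of (I - alpha * P)^{-1}, a
    Neumann-series node kernel, via geometrically terminated random walks.
    P: row-stochastic transition matrix (numpy array). NOT the paper's
    exact GRF construction, just the underlying random-walk principle."""
    rng = np.random.default_rng(seed)
    n = P.shape[0]
    est = np.zeros(n)
    for _ in range(n_walks):
        v = i
        while True:
            est[v] += 1.0 / n_walks   # a visit at step t contributes alpha^t (P^t)_{iv}
            if rng.random() > alpha:  # terminate with probability 1 - alpha;
                break                 # survival supplies the alpha^t factor
            v = rng.choice(n, p=P[v])
    return est

# sanity check on a tiny triangle graph
P = np.array([[0, .5, .5], [.5, 0, .5], [.5, .5, 0]])
print(neumann_kernel_row_mc(P, 0))
print(np.linalg.inv(np.eye(3) - 0.5 * P)[0])  # exact row for comparison
```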
[287] Hierarchical Federated Foundation Models over Wireless Networks for Multi-Modal Multi-Task Intelligence: Integration of Edge Learning with D2D/P2P-Enabled Fog Learning Architectures
Payam Abdisarabshali, Fardis Nadimi, Kasra Borazjani, Naji Khosravan, Minghui Liwang, Wei Ni, Dusit Niyato, Michael Langberg, Seyyedali Hosseinalipour
Main category: cs.LG
TL;DR: Proposes hierarchical federated foundation models (HF-FMs) that address modality and task heterogeneity in fog/edge networks by aligning multi-modal multi-task foundation models with hierarchical network infrastructure.
Details
Motivation: The rise of foundation models and need to leverage geo-distributed data from wireless devices has created federated foundation models. The evolution to multi-modal multi-task FMs like GPT-4 motivates exploring M3T FFMs, particularly addressing two overlooked heterogeneity dimensions in fog/edge networks.Method: HF-FMs strategically align the modular structure of M3T FMs (modality encoders, prompts, MoEs, adapters, task heads) with hierarchical fog/edge infrastructures. They enable optional D2D communications for horizontal module relaying and localized cooperative training.
Result: The paper presents the architectural design of HF-FMs, highlighting their unique capabilities. A prototype was implemented in a wireless network setting and open-source code was released to foster further exploration.
Conclusion: HF-FMs represent a novel paradigm that addresses critical heterogeneity challenges in fog/edge networks for multi-modal multi-task foundation models, with promising potential for wireless network applications.
Abstract: The rise of foundation models (FMs) has reshaped the landscape of machine learning. As these models continued to grow, leveraging geo-distributed data from wireless devices has become increasingly critical, giving rise to federated foundation models (FFMs). More recently, FMs have evolved into multi-modal multi-task (M3T) FMs (e.g., GPT-4) capable of processing diverse modalities across multiple tasks, which motivates a new underexplored paradigm: M3T FFMs. In this paper, we unveil an unexplored variation of M3T FFMs by proposing hierarchical federated foundation models (HF-FMs), which in turn expose two overlooked heterogeneity dimensions to fog/edge networks that have a direct impact on these emerging models: (i) heterogeneity in collected modalities and (ii) heterogeneity in executed tasks across fog/edge nodes. HF-FMs strategically align the modular structure of M3T FMs, comprising modality encoders, prompts, mixture-of-experts (MoEs), adapters, and task heads, with the hierarchical nature of fog/edge infrastructures. Moreover, HF-FMs enable the optional usage of device-to-device (D2D) communications, enabling horizontal module relaying and localized cooperative training among nodes when feasible. Through delving into the architectural design of HF-FMs, we highlight their unique capabilities along with a series of tailored future research directions. Finally, to demonstrate their potential, we prototype HF-FMs in a wireless network setting and release the open-source code for the development of HF-FMs with the goal of fostering exploration in this untapped field (GitHub: https://github.com/payamsiabd/M3T-FFM).
[288] Crossing the Species Divide: Transfer Learning from Speech to Animal Sounds
Jules Cauzinille, Marius Miron, Olivier Pietquin, Masato Hagiwara, Ricard Marxer, Arnaud Rey, Benoit Favre
Main category: cs.LG
TL;DR: Self-supervised speech models like HuBERT, WavLM, and XEUS show strong performance on bioacoustic tasks, generating rich representations of animal sounds and achieving competitive results with specialized bioacoustic models.
Details
Motivation: Self-supervised speech models have shown impressive performance in speech processing, but their effectiveness on non-speech data like bioacoustic signals remains underexplored. The study aims to investigate their transfer learning capabilities on bioacoustic detection and classification tasks.Method: The researchers analyzed speech models’ properties using linear probing on time-averaged representations, extended the approach with downstream architectures to account for time-wise information, and studied the effects of frequency range and noise on performance.
Result: The models generated rich latent representations of animal sounds across different taxa. Results were competitive with fine-tuned bioacoustic pre-trained models and demonstrated the impact of noise-robust pre-training setups.
Conclusion: Speech-based self-supervised learning shows great potential as an efficient framework for advancing bioacoustic research, with models effectively transferring to non-speech audio domains.
Abstract: Self-supervised speech models have demonstrated impressive performance in speech processing, but their effectiveness on non-speech data remains underexplored. We study the transfer learning capabilities of such models on bioacoustic detection and classification tasks. We show that models such as HuBERT, WavLM, and XEUS can generate rich latent representations of animal sounds across taxa. We analyze the models properties with linear probing on time-averaged representations. We then extend the approach to account for the effect of time-wise information with other downstream architectures. Finally, we study the implication of frequency range and noise on performance. Notably, our results are competitive with fine-tuned bioacoustic pre-trained models and show the impact of noise-robust pre-training setups. These findings highlight the potential of speech-based self-supervised learning as an efficient framework for advancing bioacoustic research.
[289] EmbedOR: Provable Cluster-Preserving Visualizations with Curvature-Based Stochastic Neighbor Embeddings
Tristan Luca Saidi, Abigail Hickok, Bastian Rieck, Andrew J. Blumberg
Main category: cs.LG
TL;DR: EmbedOR is a new SNE algorithm that uses discrete graph curvature to better preserve data geometry and prevent spurious fragmentation in visualizations compared to UMAP and tSNE.
Details
Motivation: Existing SNE algorithms like UMAP and tSNE often fail to preserve the true geometry of noisy high-dimensional data, spuriously separating connected components and missing clusters in well-clusterable data.Method: Proposes EmbedOR algorithm that incorporates discrete graph curvature and uses a curvature-enhanced distance metric to emphasize underlying cluster structure during stochastic embedding.
Result: Proven theoretical consistency extension for tSNE results, extensive experiments show EmbedOR preserves geometry better and avoids fragmenting continuous high-density regions unlike other SNE methods.
Conclusion: EmbedOR provides superior geometry preservation and can also serve as a tool to annotate existing visualizations to identify fragmentation and gain deeper insights into data geometry.
Abstract: Stochastic Neighbor Embedding (SNE) algorithms like UMAP and tSNE often produce visualizations that do not preserve the geometry of noisy and high dimensional data. In particular, they can spuriously separate connected components of the underlying data submanifold and can fail to find clusters in well-clusterable data. To address these limitations, we propose EmbedOR, a SNE algorithm that incorporates discrete graph curvature. Our algorithm stochastically embeds the data using a curvature-enhanced distance metric that emphasizes underlying cluster structure. Critically, we prove that the EmbedOR distance metric extends consistency results for tSNE to a much broader class of datasets. We also describe extensive experiments on synthetic and real data that demonstrate the visualization and geometry-preservation capabilities of EmbedOR. We find that, unlike other SNE algorithms and UMAP, EmbedOR is much less likely to fragment continuous, high-density regions of the data. Finally, we demonstrate that the EmbedOR distance metric can be used as a tool to annotate existing visualizations to identify fragmentation and provide deeper insight into the underlying geometry of the data.
[290] Online Learning of Optimal Sequential Testing Policies
Qiyuan Chen, Raed Al Kontar
Main category: cs.LG
TL;DR: Online Testing Problem (OTP) with correlated, costly tests where missing data from partial testing creates bias, requiring Ω(T^{2/3}) minimax regret vs Θ(√T) in standard MDPs. An Explore-Then-Commit algorithm achieves Õ(T^{2/3}) regret, while a variant with missingness-independent rewards enables Õ(√T) regret.
Details
Motivation: To develop optimal sequential testing policies for subjects when tests are correlated and costly, addressing the challenge of learning from partial/missing test data that biases estimates compared to standard MDP settings.Method: Formulated as Online Testing Problem (OTP) with missing data bias. Developed Explore-Then-Commit algorithm for OTP achieving Õ(T^{2/3}) regret. Also studied variant with missingness-independent rewards using iterative-elimination algorithm achieving Õ(√T) regret.
Result: Proved Ω(T^{2/3}) minimax regret lower bound for OTP (vs Θ(√T) in episodic MDPs). Matched this with Õ(T^{2/3}) regret algorithm. For missingness-independent reward variant, achieved Õ(√T) regret breaking the lower bound. Numerical results confirm both theories.
Conclusion: Missing data fundamentally increases exploration-exploitation difficulty in sequential testing, requiring different regret scaling. The work provides theoretical understanding and practical algorithms for efficient testing policies under missing data constraints.
Abstract: This paper studies an online learning problem that seeks optimal testing policies for a stream of subjects, each of whom can be evaluated through a sequence of candidate tests drawn from a common pool. We refer to this problem as the Online Testing Problem (OTP). Although conducting every candidate test for a subject provides more information, it is often preferable to select only a subset when tests are correlated and costly, and make decisions with partial information. If the joint distribution of test outcomes were known, the problem could be cast as a Markov Decision Process (MDP) and solved exactly. In practice, this distribution is unknown and must be learned online as subjects are tested. When a subject is not fully tested, the resulting missing data can bias estimates, making the problem fundamentally harder than standard episodic MDPs. We prove that the minimax regret must scale at least as $\Omega(T^{\frac{2}{3}})$, in contrast to the $\Theta(\sqrt{T})$ rate in episodic MDPs, revealing the difficulty introduced by missingness. This elevated lower bound is then matched by an Explore-Then-Commit algorithm whose cumulative regret is $\tilde{O}(T^{\frac{2}{3}})$ for both discrete and Gaussian distributions. To highlight the consequence of missingness-dependent rewards in OTP, we study a variant called the Online Cost-sensitive Maximum Entropy Sampling Problem, where rewards are independent of missing data. This structure enables an iterative-elimination algorithm that achieves $\tilde{O}(\sqrt{T})$ regret, breaking the $\Omega(T^{\frac{2}{3}})$ lower bound for OTP. Numerical results confirm our theory in both settings. Overall, this work deepens the understanding of the exploration–exploitation trade-off under missing data and guides the design of efficient sequential testing policies.
[291] From Federated Learning to $\mathbb{X}$-Learning: Breaking the Barriers of Decentrality Through Random Walks
Allan Salihovic, Payam Abdisarabshali, Michael Langberg, Seyyedali Hosseinalipour
Main category: cs.LG
TL;DR: X-Learning is a novel distributed learning architecture that generalizes decentralization concepts, with connections to graph theory and Markov chains.
Details
Motivation: To present a vision for X-Learning and introduce its unexplored design considerations and degrees of freedom in distributed learning systems.Method: The paper provides a perspective on X-Learning architecture, exploring intuitive but non-trivial connections to graph theory and Markov chains.
Result: The analysis reveals new design considerations and degrees of freedom in distributed learning, establishing connections between X-Learning and fundamental mathematical concepts.
Conclusion: X-Learning represents a promising extension of decentralized learning architectures, with several open research directions identified for future investigation.
Abstract: We provide our perspective on $\mathbb{X}$-Learning ($\mathbb{X}$L), a novel distributed learning architecture that generalizes and extends the concept of decentralization. Our goal is to present a vision for $\mathbb{X}$L, introducing its unexplored design considerations and degrees of freedom. To this end, we shed light on the intuitive yet non-trivial connections between $\mathbb{X}$L, graph theory, and Markov chains. We also present a series of open research directions to stimulate further research.
[292] Differentiable Entropy Regularization for Geometry and Neural Networks
Ibne Farabi Shihab, Sanjeda Akter, Anuj Sharma
Main category: cs.LG
TL;DR: Differentiable estimator for range-partition entropy enables adaptive deep learning with efficiency gains in both computational geometry (4.1× speedups) and Transformer attention (6% higher accuracy at 80% sparsity).
Details
Motivation: Range-partition entropy provides strong theoretical guarantees in algorithm design but hasn't been accessible to deep learning. The goal is to bridge this gap and enable algorithms to adapt to input 'sortedness'.Method: Proposed differentiable approximation of range-partition entropy, designed EntropyNet neural module for data restructuring, and applied entropy regularization to Transformer attention mechanisms.
Result: Achieved 4.1× runtime speedups with <0.2% error in geometry tasks, and 6% higher accuracy at 80% sparsity compared to L1 baselines in deep learning applications.
Conclusion: Entropy-bounded computation is both theoretically elegant and practically effective for adaptive learning, efficiency improvements, and structured representation learning.
Abstract: We introduce a differentiable estimator of range-partition entropy, a recent concept from computational geometry that enables algorithms to adapt to the “sortedness” of their input. While range-partition entropy provides strong guarantees in algorithm design, it has not yet been made accessible to deep learning. In this work, we (i) propose the first differentiable approximation of range-partition entropy, enabling its use as a trainable loss or regularizer; (ii) design EntropyNet, a neural module that restructures data into low-entropy forms to accelerate downstream instance-optimal algorithms; and (iii) extend this principle beyond geometry by applying entropy regularization directly to Transformer attention. Across tasks, we demonstrate that differentiable entropy improves efficiency without degrading correctness: in geometry, our method achieves up to $4.1\times$ runtime speedups with negligible error ($<0.2%$); in deep learning, it induces structured attention patterns that yield 6% higher accuracy at 80% sparsity compared to L1 baselines. Our theoretical analysis provides approximation bounds for the estimator, and extensive ablations validate design choices. These results suggest that entropy-bounded computation is not only theoretically elegant but also a practical mechanism for adaptive learning, efficiency, and structured representation.
[293] Sparse Autoencoder Neural Operators: Model Recovery in Function Spaces
Bahareh Tolooshams, Ailsa Shen, Anima Anandkumar
Main category: cs.LG
TL;DR: A framework extending sparse autoencoders to lifted spaces and infinite-dimensional function spaces for mechanistic interpretability of neural operators, showing improved recovery of smooth concepts and robust inference across resolutions.
Details
Motivation: To address the underexplored representational properties of neural operators in scientific computing and unify representations through sparse model recovery, building on the Platonic Representation Hypothesis that networks converge to similar representations across architectures.Method: Extends sparse autoencoders (SAEs) to lifted spaces and infinite-dimensional function spaces, comparing inference and training dynamics of SAEs, lifted-SAE, and SAE neural operators.
Result: Lifting and operator modules introduce beneficial inductive biases, enabling faster recovery, improved recovery of smooth concepts, and robust inference across varying resolutions - a property unique to neural operators.
Conclusion: The framework successfully enables mechanistic interpretability of large neural operators and demonstrates that lifting and operator approaches provide significant advantages for representation recovery and cross-resolution inference in scientific computing applications.
Abstract: We frame the problem of unifying representations in neural models as one of sparse model recovery and introduce a framework that extends sparse autoencoders (SAEs) to lifted spaces and infinite-dimensional function spaces, enabling mechanistic interpretability of large neural operators (NO). While the Platonic Representation Hypothesis suggests that neural networks converge to similar representations across architectures, the representational properties of neural operators remain underexplored despite their growing importance in scientific computing. We compare the inference and training dynamics of SAEs, lifted-SAE, and SAE neural operators. We highlight how lifting and operator modules introduce beneficial inductive biases, enabling faster recovery, improved recovery of smooth concepts, and robust inference across varying resolutions, a property unique to neural operators.
[294] Mapping on a Budget: Optimizing Spatial Data Collection for ML
Livia Betti, Farooq Sanni, Gnouyaro Sogoyou, Togbe Agbagla, Cullen Molitor, Tamma Carleton, Esther Rolf
Main category: cs.LG
TL;DR: First framework for optimizing spatial training data collection in satellite ML, addressing data sparsity and cost constraints with novel methods that show significant performance gains.
Details
Motivation: Satellite ML applications suffer from sparse, clustered labeled data despite global coverage, creating uncertainty about how to collect additional data effectively.Method: Novel problem formulation for spatial training data optimization with heterogeneous costs and budget constraints, plus new methods to address this challenge.
Result: Experiments across three continents and four tasks show substantial performance gains from optimized sampling strategies.
Conclusion: The framework generalizes across SatML domains and provides practical guidance for data collection, with immediate application to agricultural monitoring in Togo.
Abstract: In applications across agriculture, ecology, and human development, machine learning with satellite imagery (SatML) is limited by the sparsity of labeled training data. While satellite data cover the globe, labeled training datasets for SatML are often small, spatially clustered, and collected for other purposes (e.g., administrative surveys or field measurements). Despite the pervasiveness of this issue in practice, past SatML research has largely focused on new model architectures and training algorithms to handle scarce training data, rather than modeling data conditions directly. This leaves scientists and policymakers who wish to use SatML for large-scale monitoring uncertain about whether and how to collect additional data to maximize performance. Here, we present the first problem formulation for the optimization of spatial training data in the presence of heterogeneous data collection costs and realistic budget constraints, as well as novel methods for addressing this problem. In experiments simulating different problem settings across three continents and four tasks, our strategies reveal substantial gains from sample optimization. Further experiments delineate settings for which optimized sampling is particularly effective. The problem formulation and methods we introduce are designed to generalize across application domains for SatML; we put special emphasis on a specific problem setting where our coauthors can immediately use our findings to augment clustered agricultural surveys for SatML monitoring in Togo.
[295] Learning functions through Diffusion Maps
Alvaro Almeida Gomez
Main category: cs.LG
TL;DR: A data-driven method for function approximation on manifolds using Diffusion Maps and dimensionality reduction via SVD, with online updating capability for new data.
Details
Motivation: To develop an efficient method for approximating real-valued functions on smooth manifolds that can handle high-dimensional data and incorporate new data efficiently.Method: Builds on Diffusion Maps framework, uses diffusion geometry connected to heat equation and Laplace-Beltrami operator. Introduces dimensionality reduction via SVD of distance matrix and online updating mechanism for new data.
Result: Outperforms classical feedforward neural networks and interpolation methods in both accuracy and efficiency, as demonstrated in sparse CT reconstruction applications.
Conclusion: The proposed methodology provides an effective and scalable approach for function approximation on manifolds with superior performance compared to traditional methods.
Abstract: We propose a data-driven method for approximating real-valued functions on smooth manifolds, building on the Diffusion Maps framework under the manifold hypothesis. Given pointwise evaluations of a function, the method constructs a smooth extension to the ambient space by exploiting diffusion geometry and its connection to the heat equation and the Laplace-Beltrami operator. To address the computational challenges of high-dimensional data, we introduce a dimensionality reduction strategy based on the low-rank structure of the distance matrix, revealed via singular value decomposition (SVD). In addition, we develop an online updating mechanism that enables efficient incorporation of new data, thereby improving scalability and reducing computational cost. Numerical experiments, including applications to sparse CT reconstruction, demonstrate that the proposed methodology outperforms classical feedforward neural networks and interpolation methods in terms of both accuracy and efficiency.
[296] What Fundamental Structure in Reward Functions Enables Efficient Sparse-Reward Learning?
Ibne Farabi Shihab, Sanjeda Akter, Anuj Sharma
Main category: cs.LG
TL;DR: The paper shows that low-rank structure in reward matrices enables efficient sparse-reward RL, with a sharp transition from exponential to polynomial sample complexity. It introduces Policy-Aware Matrix Completion (PAMC) that connects matrix completion theory with RL.
Details
Motivation: To understand what fundamental properties of reward functions enable efficient sparse-reward reinforcement learning, particularly through the lens of low-rank structure in reward matrices.Method: Introduces Policy-Aware Matrix Completion (PAMC) framework that connects matrix completion theory with reinforcement learning via policy-dependent sampling analysis. Includes impossibility results, reward-free representation learning, distribution-free confidence sets, and robust completion guarantees.
Result: Empirical evaluation across 100 domains shows exploitable structure in over half. PAMC improves sample efficiency by factors between 1.6 and 2.1 compared to baselines, with only about 20% computational overhead.
Conclusion: Establishes structural reward learning as a promising new paradigm with implications for robotics, healthcare, and other safety-critical, sample-expensive applications.
Abstract: What fundamental properties of reward functions enable efficient sparse-reward reinforcement learning? We address this question through the lens of low-rank structure in reward matrices, showing that such structure induces a sharp transition from exponential to polynomial sample complexity, the first result of this kind for sparse-reward RL. We introduce Policy-Aware Matrix Completion (PAMC), which connects matrix completion theory with reinforcement learning via a new analysis of policy-dependent sampling. Our framework provides: (i) impossibility results for general sparse reward observation, (ii) reward-free representation learning from dynamics, (iii) distribution-free confidence sets via conformal prediction, and (iv) robust completion guarantees that degrade gracefully when low-rank structure is only approximate. Empirically, we conduct a pre-registered evaluation across 100 systematically sampled domains, finding exploitable structure in over half. PAMC improves sample efficiency by factors between 1.6 and 2.1 compared to strong exploration, structured, and representation-learning baselines, while adding only about 20 percent computational overhead.These results establish structural reward learning as a promising new paradigm, with immediate implications for robotics, healthcare, and other safety-critical, sample-expensive applications.
[297] Online time series prediction using feature adjustment
Xiannan Huang, Shuhan Qiu, Jiayuan Du, Chao Yang
Main category: cs.LG
TL;DR: ADAPT-Z is a novel online learning method for time series forecasting that addresses distribution shift by updating feature representations of latent factors instead of conventional parameter selection, and handles delayed feedback in multi-step forecasting using historical gradient information.
Details
Motivation: Time series forecasting faces challenges from distribution shift, especially in online deployment where data arrives sequentially. Current methods focus on parameter selection and update strategies, but the authors propose that distribution shifts come from changes in underlying latent factors, making feature representation updates more effective.Method: ADAPT-Z (Automatic Delta Adjustment via Persistent Tracking in Z-space) uses an adapter module that combines current feature representations with historical gradient information to enable robust parameter updates despite delayed feedback in multi-step forecasting.
Result: Extensive experiments show that ADAPT-Z consistently outperforms standard base models without adaptation and surpasses state-of-the-art online learning approaches across multiple datasets.
Conclusion: Updating feature representations of latent factors is more effective than conventional parameter selection for handling distribution shift in time series forecasting, and ADAPT-Z successfully addresses the delayed feedback problem in online learning scenarios.
Abstract: Time series forecasting is of significant importance across various domains. However, it faces significant challenges due to distribution shift. This issue becomes particularly pronounced in online deployment scenarios where data arrives sequentially, requiring models to adapt continually to evolving patterns. Current time series online learning methods focus on two main aspects: selecting suitable parameters to update (e.g., final layer weights or adapter modules) and devising suitable update strategies (e.g., using recent batches, replay buffers, or averaged gradients). We challenge the conventional parameter selection approach, proposing that distribution shifts stem from changes in underlying latent factors influencing the data. Consequently, updating the feature representations of these latent factors may be more effective. To address the critical problem of delayed feedback in multi-step forecasting (where true values arrive much later than predictions), we introduce ADAPT-Z (Automatic Delta Adjustment via Persistent Tracking in Z-space). ADAPT-Z utilizes an adapter module that leverages current feature representations combined with historical gradient information to enable robust parameter updates despite the delay. Extensive experiments demonstrate that our method consistently outperforms standard base models without adaptation and surpasses state-of-the-art online learning approaches across multiple datasets. The code is available at https://github.com/xiannanhuang/ADAPT-Z.
[298] Machine Learning for LiDAR-Based Indoor Surface Classification in Intelligent Wireless Environments
Parth Ashokbhai Shiroya, Swarnagowri Shashidhar, Amod Ashtekar, Krishna Aindrila Kar, Rafaela Lomboy, Dalton Davis, Mohammed E. Eltayeb
Main category: cs.LG
TL;DR: LiDAR-driven ML framework classifies indoor surfaces into semi-specular and low-specular categories using optical reflectivity as proxy for EM scattering, enabling scatter-aware environment maps for mmWave/sub-THz networks.
Details
Motivation: Reliable mmWave and sub-THz connectivity depends on surface reflections, but signals are vulnerable to blockage. Surface roughness determines scattering behavior (specular vs diffuse), which is crucial for network performance.Method: Collected 78,000+ points from 15 indoor materials, partitioned into 3cm patches. Extracted patch-level features (elevation angle, log-scaled intensity, max-to-mean ratio) and trained Random Forest, XGBoost, and neural network classifiers.
Result: Ensemble tree-based models (Random Forest, XGBoost) provided the best accuracy-robustness trade-off, confirming LiDAR features effectively capture roughness-induced scattering effects.
Conclusion: The framework enables scatter-aware environment maps and digital twins, supporting adaptive beam management, blockage recovery, and environment-aware connectivity in next-gen networks.
Abstract: Reliable connectivity in millimeter-wave (mmWave) and sub-terahertz (sub-THz) networks depends on reflections from surrounding surfaces, as high-frequency signals are highly vulnerable to blockage. The scattering behavior of a surface is determined not only by material permittivity but also by roughness, which governs whether energy remains in the specular direction or is diffusely scattered. This paper presents a LiDAR-driven machine learning framework for classifying indoor surfaces into semi-specular and low-specular categories, using optical reflectivity as a proxy for electromagnetic scattering behavior. A dataset of over 78,000 points from 15 representative indoor materials was collected and partitioned into 3 cm x 3 cm patches to enable classification from partial views. Patch-level features capturing geometry and intensity, including elevation angle, natural-log-scaled intensity, and max-to-mean ratio, were extracted and used to train Random Forest, XGBoost, and neural network classifiers. Results show that ensemble tree-based models consistently provide the best trade-off between accuracy and robustness, confirming that LiDAR-derived features capture roughness-induced scattering effects. The proposed framework enables the generation of scatter aware environment maps and digital twins, supporting adaptive beam management, blockage recovery, and environment-aware connectivity in next-generation networks.
[299] Predicting Traffic Accident Severity with Deep Neural Networks
Meghan Bibb, Pablo Rivas, Mahee Tayba
Main category: cs.LG
TL;DR: Deep neural network achieves 92% accuracy in classifying traffic accident severity using autoencoders for dimensionality reduction and dense networks.
Details
Motivation: Traffic accidents can be studied to mitigate future risks, and recent machine learning advances provide better ways to analyze imbalanced traffic accident data with good generalization.Method: Neural network-based approach using autoencoders for unsupervised dimensionality reduction to handle feature colinearity, followed by dense network for classification of accident severity.
Result: The proposed deep neural network achieved cross-validated results of up to 92% accuracy in classifying accident severity.
Conclusion: Deep neural networks with autoencoder-based dimensionality reduction are effective for traffic accident severity classification, demonstrating high predictive performance on imbalanced data.
Abstract: Traffic accidents can be studied to mitigate the risk of further events. Recent advances in machine learning have provided an alternative way to study data associated with traffic accidents. New models achieve good generalization and high predictive power over imbalanced data. In this research, we study neural network-based models on data related to traffic accidents. We begin analyzing relative feature colinearity and unsupervised dimensionality reduction through autoencoders, followed by a dense network. The features are related to traffic accident data and the target is to classify accident severity. Our experiments show cross-validated results of up to 92% accuracy when classifying accident severity using the proposed deep neural network.
[300] From Leiden to Pleasure Island: The Constant Potts Model for Community Detection as a Hedonic Game
Lucas Lopes Felipe, Konstantin Avrachenkov, Daniel Sadoc Menasche
Main category: cs.LG
TL;DR: Game-theoretic analysis of Constant Potts Model (CPM) showing efficient convergence, robust stability criteria, and improved accuracy in community detection.
Details
Motivation: To provide a game-theoretic perspective on CPM for network partitioning, emphasizing efficiency, robustness, and accuracy in community detection.Method: Reinterpret CPM as potential hedonic game with local utility functions, prove convergence via better-response dynamics, introduce two stability criteria (strict robustness and relaxed utility), and test in community tracking scenarios with Leiden algorithm.
Result: Local optimization converges in pseudo-polynomial time, robust partitions yield higher accuracy in recovering ground-truth communities when used to bootstrap Leiden algorithm.
Conclusion: Game-theoretic framework provides efficient, robust, and accurate approach to community detection with CPM, particularly beneficial for community tracking applications.
Abstract: Community detection is one of the fundamental problems in data science which consists of partitioning nodes into disjoint communities. We present a game-theoretic perspective on the Constant Potts Model (CPM) for partitioning networks into disjoint communities, emphasizing its efficiency, robustness, and accuracy. Efficiency: We reinterpret CPM as a potential hedonic game by decomposing its global Hamiltonian into local utility functions, where the local utility gain of each agent matches the corresponding increase in global utility. Leveraging this equivalence, we prove that local optimization of the CPM objective via better-response dynamics converges in pseudo-polynomial time to an equilibrium partition. Robustness: We introduce and relate two stability criteria: a strict criterion based on a novel notion of robustness, requiring nodes to simultaneously maximize neighbors and minimize non-neighbors within communities, and a relaxed utility function based on a weighted sum of these objectives, controlled by a resolution parameter. Accuracy: In community tracking scenarios, where initial partitions are used to bootstrap the Leiden algorithm with partial ground-truth information, our experiments reveal that robust partitions yield higher accuracy in recovering ground-truth communities.
[301] Vehicle-to-Infrastructure Collaborative Spatial Perception via Multimodal Large Language Models
Kimia Ehsani, Walid Saad
Main category: cs.LG
TL;DR: A lightweight BEV injection connector enhances MLLMs for V2I link prediction by providing 3D spatial context, improving accuracy by up to 13.9% overall and 32.7% in adverse conditions.
Details
Motivation: Accurate prediction of V2I communication link quality is crucial for smooth handovers and reliable low-latency communication, but MLLMs lack 3D spatial understanding despite having access to vehicle sensor data.Method: Proposed a plug-and-play BEV injection connector that constructs environmental BEV from neighboring vehicles’ sensing data and fuses it with ego vehicle input. Developed a co-simulation environment (CARLA + MATLAB ray tracing) to generate multimodal data and extract ground-truth instructions.
Result: The BEV injection framework consistently improved performance across all three V2I tasks (LoS/NLoS classification, link availability, blockage prediction), with up to 13.9% macro-average accuracy improvement over ego-only baseline, and up to 32.7% gain in challenging rainy/nighttime conditions.
Conclusion: The proposed BEV injection approach effectively addresses MLLMs’ spatial understanding limitations and demonstrates robust performance improvement in V2I link prediction, particularly in adverse environmental conditions.
Abstract: Accurate prediction of communication link quality metrics is essential for vehicle-to-infrastructure (V2I) systems, enabling smooth handovers, efficient beam management, and reliable low-latency communication. The increasing availability of sensor data from modern vehicles motivates the use of multimodal large language models (MLLMs) because of their adaptability across tasks and reasoning capabilities. However, MLLMs inherently lack three-dimensional spatial understanding. To overcome this limitation, a lightweight, plug-and-play bird’s-eye view (BEV) injection connector is proposed. In this framework, a BEV of the environment is constructed by collecting sensing data from neighboring vehicles. This BEV representation is then fused with the ego vehicle’s input to provide spatial context for the large language model. To support realistic multimodal learning, a co-simulation environment combining CARLA simulator and MATLAB-based ray tracing is developed to generate RGB, LiDAR, GPS, and wireless signal data across varied scenarios. Instructions and ground-truth responses are programmatically extracted from the ray-tracing outputs. Extensive experiments are conducted across three V2I link prediction tasks: line-of-sight (LoS) versus non-line-of-sight (NLoS) classification, link availability, and blockage prediction. Simulation results show that the proposed BEV injection framework consistently improved performance across all tasks. The results indicate that, compared to an ego-only baseline, the proposed approach improves the macro-average of the accuracy metrics by up to 13.9%. The results also show that this performance gain increases by up to 32.7% under challenging rainy and nighttime conditions, confirming the robustness of the framework in adverse settings.
[302] Meta-Inverse Reinforcement Learning for Mean Field Games via Probabilistic Context Variables
Yang Chen, Xiao Lin, Bo Yan, Libo Zhang, Jiamou Liu, Neset Özkan Tan, Michael Witbrock
Main category: cs.LG
TL;DR: Proposes a deep latent variable MFG model for inverse reinforcement learning that can handle heterogeneous agent objectives without prior context knowledge, outperforming state-of-the-art methods.
Details
Motivation: Existing IRL methods in mean field games assume agent homogeneity, limiting their ability to handle real-world scenarios with heterogeneous and unknown objectives from expert demonstrations.Method: Developed a deep latent variable MFG model and associated IRL method that can infer rewards from different but structurally similar tasks without requiring prior knowledge about underlying contexts or model modifications.
Result: Experiments on simulated scenarios and a real-world spatial taxi-ride pricing problem demonstrate superior performance over state-of-the-art IRL methods in MFGs.
Conclusion: The proposed approach effectively addresses the limitation of agent homogeneity assumption in existing MFG IRL methods, enabling practical inference of reward functions from heterogeneous expert demonstrations.
Abstract: Designing suitable reward functions for numerous interacting intelligent agents is challenging in real-world applications. Inverse reinforcement learning (IRL) in mean field games (MFGs) offers a practical framework to infer reward functions from expert demonstrations. While promising, the assumption of agent homogeneity limits the capability of existing methods to handle demonstrations with heterogeneous and unknown objectives, which are common in practice. To this end, we propose a deep latent variable MFG model and an associated IRL method. Critically, our method can infer rewards from different yet structurally similar tasks without prior knowledge about underlying contexts or modifying the MFG model itself. Our experiments, conducted on simulated scenarios and a real-world spatial taxi-ride pricing problem, demonstrate the superiority of our approach over state-of-the-art IRL methods in MFGs.
[303] Data-Augmented Quantization-Aware Knowledge Distillation
Justin Kur, Kaiqi Zhao
Main category: cs.LG
TL;DR: A novel metric for selecting optimal data augmentation strategies in quantization-aware knowledge distillation that maximizes contextual mutual information while maintaining prediction accuracy.
Details
Motivation: Limited attention has been given to understanding how input transformations like data augmentation affect quantization-aware training and knowledge distillation, especially for low-precision models.Method: Proposes a metric that evaluates data augmentations based on their capacity to maximize contextual mutual information while ensuring predictions remain close to ground truth labels. The method automatically ranks and selects data augmentation strategies with minimal training overhead.
Result: Extensive evaluations show that selecting data augmentation strategies using the proposed metric significantly improves state-of-the-art quantization-aware training and knowledge distillation works across various model architectures and datasets.
Conclusion: The proposed method effectively addresses the challenge of selecting optimal data augmentation in quantization-aware knowledge distillation, providing a compatible and efficient solution that enhances performance of low-bit deep learning models.
Abstract: Quantization-aware training (QAT) and Knowledge Distillation (KD) are combined to achieve competitive performance in creating low-bit deep learning models. Existing KD and QAT works focus on improving the accuracy of quantized models from the network output perspective by designing better KD loss functions or optimizing QAT’s forward and backward propagation. However, limited attention has been given to understanding the impact of input transformations, such as data augmentation (DA). The relationship between quantization-aware KD and DA remains unexplored. In this paper, we address the question: how to select a good DA in quantization-aware KD, especially for the models with low precisions? We propose a novel metric which evaluates DAs according to their capacity to maximize the Contextual Mutual Information–the information not directly related to an image’s label–while also ensuring the predictions for each class are close to the ground truth labels on average. The proposed method automatically ranks and selects DAs, requiring minimal training overhead, and it is compatible with any KD or QAT algorithm. Extensive evaluations demonstrate that selecting DA strategies using our metric significantly improves state-of-the-art QAT and KD works across various model architectures and datasets.
[304] MillGNN: Learning Multi-Scale Lead-Lag Dependencies for Multi-Variate Time Series Forecasting
Binqing Wu, Zongjiang Shang, Jianlong Huang, Ling Chen
Main category: cs.LG
TL;DR: MillGNN is a novel graph neural network method that captures multi-scale lead-lag dependencies in multivariate time series forecasting, outperforming 16 state-of-the-art methods on 11 datasets.
Details
Motivation: Existing MTS forecasting methods overlook lead-lag dependencies at multiple grouping scales, failing to capture hierarchical lead-lag effects in complex systems.Method: Uses scale-specific lead-lag graph learning with cross-correlation and dynamic decaying features, plus hierarchical lead-lag message passing at multiple grouping scales.
Result: Superior performance demonstrated on 11 datasets for both long-term and short-term MTS forecasting compared to 16 state-of-the-art methods.
Conclusion: MillGNN effectively captures comprehensive lead-lag effects with statistical interpretability and data-driven flexibility, achieving state-of-the-art results.
Abstract: Multi-variate time series (MTS) forecasting is crucial for various applications. Existing methods have shown promising results owing to their strong ability to capture intra- and inter-variate dependencies. However, these methods often overlook lead-lag dependencies at multiple grouping scales, failing to capture hierarchical lead-lag effects in complex systems. To this end, we propose MillGNN, a novel \underline{g}raph \underline{n}eural \underline{n}etwork-based method that learns \underline{m}ult\underline{i}ple grouping scale \underline{l}ead-\underline{l}ag dependencies for MTS forecasting, which can comprehensively capture lead-lag effects considering variate-wise and group-wise dynamics and decays. Specifically, MillGNN introduces two key innovations: (1) a scale-specific lead-lag graph learning module that integrates cross-correlation coefficients and dynamic decaying features derived from real-time inputs and time lags to learn lead-lag dependencies for each scale, which can model evolving lead-lag dependencies with statistical interpretability and data-driven flexibility; (2) a hierarchical lead-lag message passing module that passes lead-lag messages at multiple grouping scales in a structured way to simultaneously propagate intra- and inter-scale lead-lag effects, which can capture multi-scale lead-lag effects with a balance of comprehensiveness and efficiency. Experimental results on 11 datasets demonstrate the superiority of MillGNN for long-term and short-term MTS forecasting, compared with 16 state-of-the-art methods.
[305] Peptidomic-Based Prediction Model for Coronary Heart Disease Using a Multilayer Perceptron Neural Network
Jesus Celis-Porras
Main category: cs.LG
TL;DR: MLP neural network model using urinary peptide biomarkers achieves high accuracy (95.67%) for non-invasive coronary heart disease diagnosis.
Details
Motivation: Coronary heart disease is a leading global cause of death with high healthcare costs, necessitating non-invasive diagnostic approaches.Method: Multilayer perceptron neural network trained on 50 urinary peptide biomarkers selected via genetic algorithms, with SMOTE balancing and stratified validation on 690 individuals.
Result: Achieved 95.67% precision, sensitivity, and specificity, with F1-score of 0.9565, AUC of 0.9748, MCC of 0.9134, and Cohen’s kappa of 0.9131.
Conclusion: The model provides a highly accurate and robust non-invasive diagnostic tool for coronary heart disease detection.
Abstract: Coronary heart disease (CHD) is a leading cause of death worldwide and contributes significantly to annual healthcare expenditures. To develop a non-invasive diagnostic approach, we designed a model based on a multilayer perceptron (MLP) neural network, trained on 50 key urinary peptide biomarkers selected via genetic algorithms. Treatment and control groups, each comprising 345 individuals, were balanced using the Synthetic Minority Over-sampling Technique (SMOTE). The neural network was trained using a stratified validation strategy. Using a network with three hidden layers of 60 neurons each and an output layer of two neurons, the model achieved a precision, sensitivity, and specificity of 95.67 percent, with an F1-score of 0.9565. The area under the ROC curve (AUC) reached 0.9748 for both classes, while the Matthews correlation coefficient (MCC) and Cohen’s kappa coefficient were 0.9134 and 0.9131, respectively, demonstrating its reliability in detecting CHD. These results indicate that the model provides a highly accurate and robust non-invasive diagnostic tool for coronary heart disease.
[306] Topotein: Topological Deep Learning for Protein Representation Learning
Zhiyu Wang, Arian Jamasb, Mustafa Hajij, Alex Morehead, Luke Braithwaite, Pietro Liò
Main category: cs.LG
TL;DR: Topotein framework uses topological deep learning with Protein Combinatorial Complex and TCPNet to capture hierarchical protein structures, outperforming state-of-the-art methods in protein representation learning tasks.
Details
Motivation: Current protein representation learning methods fail to capture the hierarchical organization inherent in protein structures, limiting their ability to understand structure-function relationships.Method: Introduces Topotein framework with Protein Combinatorial Complex (PCC) to represent proteins at multiple hierarchical levels and Topology-Complete Perceptron Network (TCPNet) using SE(3)-equivariant message passing across hierarchical structures.
Result: TCPNet consistently outperforms state-of-the-art geometric graph neural networks across four protein representation learning tasks, particularly excelling in fold classification that requires understanding secondary structure arrangements.
Conclusion: Hierarchical topological features are crucial for effective protein analysis, and the Topotein framework successfully captures multi-scale structural patterns that current methods miss.
Abstract: Protein representation learning (PRL) is crucial for understanding structure-function relationships, yet current sequence- and graph-based methods fail to capture the hierarchical organization inherent in protein structures. We introduce Topotein, a comprehensive framework that applies topological deep learning to PRL through the novel Protein Combinatorial Complex (PCC) and Topology-Complete Perceptron Network (TCPNet). Our PCC represents proteins at multiple hierarchical levels – from residues to secondary structures to complete proteins – while preserving geometric information at each level. TCPNet employs SE(3)-equivariant message passing across these hierarchical structures, enabling more effective capture of multi-scale structural patterns. Through extensive experiments on four PRL tasks, TCPNet consistently outperforms state-of-the-art geometric graph neural networks. Our approach demonstrates particular strength in tasks such as fold classification which require understanding of secondary structure arrangements, validating the importance of hierarchical topological features for protein analysis.
[307] Mistake-bounded online learning with operation caps
Jesse Geneson, Meien Li, Linus Tang
Main category: cs.LG
TL;DR: Analysis of mistake-bound online learning with arithmetic operation caps, solving problems in agnostic learning with bandit feedback and extending to operation-constrained settings.
Details
Motivation: To understand the computational efficiency limits of online learning algorithms by examining the minimum arithmetic operations required per round to learn function families with bounded mistakes.Method: Proving general bounds on arithmetic operations per round for arbitrary function families, solving agnostic mistake-bounded online learning problems with bandit feedback from previous works.
Result: Established fundamental bounds on computational requirements for online learning, resolved open problems in agnostic learning with bandit feedback, and extended results to operation-capped settings.
Conclusion: The paper provides important theoretical foundations for understanding the computational complexity of online learning algorithms under operation constraints, with implications for efficient algorithm design.
Abstract: We investigate the mistake-bound model of online learning with caps on the number of arithmetic operations per round. We prove general bounds on the minimum number of arithmetic operations per round that are necessary to learn an arbitrary family of functions with finitely many mistakes. We solve a problem on agnostic mistake-bounded online learning with bandit feedback from (Filmus et al, 2024) and (Geneson & Tang, 2024). We also extend this result to the setting of operation caps.
[308] Formal Verification of Local Robustness of a Classification Algorithm for a Spatial Use Case
Delphine Longuet, Amira Elouazzani, Alejandro Penacho Riveiros, Nicola Bastianello
Main category: cs.LG
TL;DR: Formal verification of neural network robustness for satellite fault detection systems using Marabou tool to ensure high reliability under uncertainty.
Details
Motivation: Satellite component failures are costly and difficult to address, requiring early detection through AI systems that must operate with extremely high reliability.Method: Employ formal verification tool Marabou to verify local robustness of neural network models, quantifying how much input can be perturbed before output becomes unstable.
Result: The approach improves trustworthiness of AI-based fault detection systems by ensuring stable performance under uncertainty.
Conclusion: Formal verification of neural network robustness is essential for dependable satellite fault detection systems, reducing the burden of costly component failures.
Abstract: Failures in satellite components are costly and challenging to address, often requiring significant human and material resources. Embedding a hybrid AI-based system for fault detection directly in the satellite can greatly reduce this burden by allowing earlier detection. However, such systems must operate with extremely high reliability. To ensure this level of dependability, we employ the formal verification tool Marabou to verify the local robustness of the neural network models used in the AI-based algorithm. This tool allows us to quantify how much a model’s input can be perturbed before its output behavior becomes unstable, thereby improving trustworthiness with respect to its performance under uncertainty.
[309] On Aligning Prediction Models with Clinical Experiential Learning: A Prostate Cancer Case Study
Jacqueline J. Vallon, William Overman, Wanqiao Xu, Neil Panjwani, Xi Ling, Sushmita Vij, Hilary P. Bagshaw, John T. Leppert, Sumit Shah, Geoffrey Sonn, Sandy Srinivas, Erqi Pollom, Mark K. Buyyounouski, Mohsen Bayati
Main category: cs.LG
TL;DR: A framework for aligning ML models with clinical knowledge through constraints, showing improved alignment without performance loss in prostate cancer prediction.
Details
Motivation: Machine learning models in healthcare often fail to capture clinically meaningful patterns that align with medical expertise, such as non-monotonic relationships that contradict clinical experience.Method: Proposed a reproducible framework incorporating clinical knowledge via constraints into ML models, using survey-collected clinical expertise and analyzing impact across underspecification degrees. Also conducted randomized experiments with clinicians to test feedback-driven alignment.
Result: Constrained models aligned with clinical experiential learning without compromising performance. Larger prediction differences between constrained/unconstrained models led to more apparent differences in clinical interpretation.
Conclusion: It’s feasible to align ML models with clinical knowledge through constraint incorporation, and clinician feedback can effectively guide model alignment in clinical risk prediction applications.
Abstract: Over the past decade, the use of machine learning (ML) models in healthcare applications has rapidly increased. Despite high performance, modern ML models do not always capture patterns the end user requires. For example, a model may predict a non-monotonically decreasing relationship between cancer stage and survival, keeping all other features fixed. In this paper, we present a reproducible framework for investigating this misalignment between model behavior and clinical experiential learning, focusing on the effects of underspecification of modern ML pipelines. In a prostate cancer outcome prediction case study, we first identify and address these inconsistencies by incorporating clinical knowledge, collected by a survey, via constraints into the ML model, and subsequently analyze the impact on model performance and behavior across degrees of underspecification. The approach shows that aligning the ML model with clinical experiential learning is possible without compromising performance. Motivated by recent literature in generative AI, we further examine the feasibility of a feedback-driven alignment approach in non-generative AI clinical risk prediction models through a randomized experiment with clinicians. Our findings illustrate that, by eliciting clinicians’ model preferences using our proposed methodology, the larger the difference in how the constrained and unconstrained models make predictions for a patient, the more apparent the difference is in clinical interpretation.
[310] FedQuad: Federated Stochastic Quadruplet Learning to Mitigate Data Heterogeneity
Ozgu Goksu, Nicolas Pugeault
Main category: cs.LG
TL;DR: FedQuad is a novel federated learning method that addresses data heterogeneity by optimizing intra-class variance reduction and inter-class variance expansion across clients, improving global model generalization through metric learning techniques.
Details
Motivation: Federated Learning faces challenges with data heterogeneity among clients, especially with limited dataset sizes and class imbalance, which negatively impacts global model generalization during aggregation.Method: FedQuad explicitly optimizes smaller intra-class variance and larger inter-class variance across clients by minimizing distance between similar pairs and maximizing distance between negative pairs, effectively disentangling client data in the shared feature space.
Result: The method demonstrates superior performance on CIFAR-10 and CIFAR-100 datasets under various data distributions with many clients, outperforming existing approaches.
Conclusion: Metric learning-based strategies are effective for addressing representational learning challenges in federated settings, with FedQuad providing a robust solution for data heterogeneity issues in FL.
Abstract: Federated Learning (FL) provides decentralised model training, which effectively tackles problems such as distributed data and privacy preservation. However, the generalisation of global models frequently faces challenges from data heterogeneity among clients. This challenge becomes even more pronounced when datasets are limited in size and class imbalance. To address data heterogeneity, we propose a novel method, \textit{FedQuad}, that explicitly optimises smaller intra-class variance and larger inter-class variance across clients, thereby decreasing the negative impact of model aggregation on the global model over client representations. Our approach minimises the distance between similar pairs while maximising the distance between negative pairs, effectively disentangling client data in the shared feature space. We evaluate our method on the CIFAR-10 and CIFAR-100 datasets under various data distributions and with many clients, demonstrating superior performance compared to existing approaches. Furthermore, we provide a detailed analysis of metric learning-based strategies within both supervised and federated learning paradigms, highlighting their efficacy in addressing representational learning challenges in federated settings.
[311] Synthetic Counterfactual Labels for Efficient Conformal Counterfactual Inference
Amirmohammad Farzaneh, Matteo Zecchin, Osvaldo Simeone
Main category: cs.LG
TL;DR: SP-CCI is a new method that uses synthetic counterfactual data to create tighter prediction intervals while maintaining coverage guarantees, addressing the over-conservatism of existing conformal counterfactual inference methods.
Details
Motivation: Existing conformal counterfactual inference methods provide marginal coverage but often produce overly conservative intervals, especially under treatment imbalance when counterfactual samples are scarce.Method: SP-CCI augments calibration sets with synthetic counterfactual labels from pre-trained models, using risk-controlling prediction sets with debiasing informed by prediction-powered inference to ensure validity.
Result: Theoretical guarantees show SP-CCI achieves tighter prediction intervals while preserving marginal coverage. Empirical results confirm consistent reduction in interval width compared to standard CCI across datasets.
Conclusion: SP-CCI provides a valid framework for more efficient counterfactual inference by leveraging synthetic data, offering theoretical guarantees and practical improvements over existing methods.
Abstract: This work addresses the problem of constructing reliable prediction intervals for individual counterfactual outcomes. Existing conformal counterfactual inference (CCI) methods provide marginal coverage guarantees but often produce overly conservative intervals, particularly under treatment imbalance when counterfactual samples are scarce. We introduce synthetic data-powered CCI (SP-CCI), a new framework that augments the calibration set with synthetic counterfactual labels generated by a pre-trained counterfactual model. To ensure validity, SP-CCI incorporates synthetic samples into a conformal calibration procedure based on risk-controlling prediction sets (RCPS) with a debiasing step informed by prediction-powered inference (PPI). We prove that SP-CCI achieves tighter prediction intervals while preserving marginal coverage, with theoretical guarantees under both exact and approximate importance weighting. Empirical results on different datasets confirm that SP-CCI consistently reduces interval width compared to standard CCI across all settings.
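A rough numpy sketch of the debiasing idea as described: estimate miscoverage on a large synthetic-label set, then correct it with a rectifier from the small real calibration set. Function and argument names are ours, and RCPS replaces the plain point estimate below with a high-probability upper bound.

```python
import numpy as np

def ppi_debiased_miscoverage(width, preds_syn, labels_syn,
                             preds_cal, labels_cal, synth_labels_cal):
    """Miscoverage of intervals [pred - width, pred + width], estimated on a
    large synthetic-label set and corrected by a rectifier computed on the
    small real calibration set (prediction-powered-inference style)."""
    miss = lambda p, y: np.abs(y - p) > width
    estimate = miss(preds_syn, labels_syn).mean()
    rectifier = miss(preds_cal, labels_cal).mean() - miss(preds_cal, synth_labels_cal).mean()
    return estimate + rectifier

def smallest_valid_width(widths, alpha, **data):
    """Tightest half-width whose estimated miscoverage stays below alpha."""
    for w in sorted(widths):
        if ppi_debiased_miscoverage(w, **data) <= alpha:
            return w
    return max(widths)
```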
[312] Who Pays for Fairness? Rethinking Recourse under Social Burden
Ainhize Barrainkua, Giovanni De Toni, Jose Antonio Lozano, Novi Quadrianto
Main category: cs.LG
TL;DR: This paper addresses fairness concerns in algorithmic recourse systems, providing theoretical analysis of unfairness and introducing a new fairness framework based on social burden with a practical algorithm (MISOB) that reduces social burden across groups while maintaining classifier accuracy.
Details
Motivation: Emerging legislation mandates that classifiers must provide actionable recourse for negative decisions, but existing research shows fairness concerns in the recourse process itself. The paper aims to address these fairness gaps in algorithmic recourse systems.
Method: The authors provide a holistic theoretical characterization of unfairness in algorithmic recourse, formally linking fairness guarantees between recourse and classification. They introduce a novel fairness framework based on social burden and develop a practical algorithm called MISOB that is broadly applicable under real-world conditions.
Result: Empirical results on real-world datasets demonstrate that MISOB reduces social burden across all groups without compromising overall classifier accuracy, showing practical effectiveness of the proposed approach.
Conclusion: The work establishes theoretical foundations for understanding unfairness in algorithmic recourse, challenges the limitations of standard equal cost paradigms, and provides a practical solution (MISOB) that successfully addresses social burden fairness concerns while maintaining classification performance.
Abstract: Machine learning based predictions are increasingly used in sensitive decision-making applications that directly affect our lives. This has led to extensive research into ensuring the fairness of classifiers. Beyond just fair classification, emerging legislation now mandates that when a classifier delivers a negative decision, it must also offer actionable steps an individual can take to reverse that outcome. This concept is known as algorithmic recourse. Nevertheless, many researchers have expressed concerns about the fairness guarantees within the recourse process itself. In this work, we provide a holistic theoretical characterization of unfairness in algorithmic recourse, formally linking fairness guarantees in recourse and classification, and highlighting limitations of the standard equal cost paradigm. We then introduce a novel fairness framework based on social burden, along with a practical algorithm (MISOB), broadly applicable under real-world conditions. Empirical results on real-world datasets show that MISOB reduces the social burden across all groups without compromising overall classifier accuracy.
[313] TAGAL: Tabular Data Generation using Agentic LLM Methods
Benoît Ronval, Pierre Dupont, Siegfried Nijssen
Main category: cs.LG
TL;DR: TAGAL is an agentic workflow method that uses LLMs to generate synthetic tabular data without additional training, performing on par with state-of-the-art approaches that require LLM training.
Details
Motivation: To improve machine learning classification performance through synthetic data generation, particularly for tabular data, using LLMs without the need for additional training.
Method: Uses an agentic workflow with Large Language Models for automatic, iterative data generation with feedback loops to improve quality, and incorporates external knowledge.
Result: TAGAL performs on par with state-of-the-art approaches requiring LLM training and generally outperforms other training-free methods across diverse datasets and quality metrics.
Conclusion: The agentic workflow shows strong potential for LLM-based data generation, opening new directions for synthetic data creation methods without additional model training.
Abstract: The generation of data is a common approach to improve the performance of machine learning tasks, among which is the training of models for classification. In this paper, we present TAGAL, a collection of methods able to generate synthetic tabular data using an agentic workflow. The methods leverage Large Language Models (LLMs) for an automatic and iterative process that uses feedback to improve the generated data without any further LLM training. The use of LLMs also allows for the addition of external knowledge in the generation process. We evaluate TAGAL across diverse datasets and different aspects of quality for the generated data. We look at the utility of downstream ML models, both by training classifiers on synthetic data only and by combining real and synthetic data. Moreover, we compare the similarities between the real and the generated data. We show that TAGAL is able to perform on par with state-of-the-art approaches that require LLM training and generally outperforms other training-free approaches. These findings highlight the potential of agentic workflows and open new directions for LLM-based data generation methods.
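A minimal sketch of the generate-critique-refine loop the abstract describes, with `llm` and `score` as hypothetical callables standing in for the language model and the data-quality critic; TAGAL's actual agents and prompts are not reproduced here.

```python
from typing import Callable

def agentic_tabular_generation(llm: Callable[[str], str],
                               real_sample_csv: str,
                               score: Callable[[str], float],
                               rounds: int = 3) -> str:
    """Draft synthetic rows, score them, fold the feedback into the next
    prompt, and keep the best draft seen so far."""
    feedback = "none yet"
    best, best_score = "", float("-inf")
    for _ in range(rounds):
        prompt = (
            f"Here are real rows:\n{real_sample_csv}\n"
            f"Previous feedback: {feedback}\n"
            "Generate 20 new, realistic CSV rows with the same schema."
        )
        candidate = llm(prompt)
        s = score(candidate)  # e.g., downstream-classifier utility on the rows
        if s > best_score:
            best, best_score = candidate, s
        feedback = f"utility score was {s:.3f}; improve realism and class balance"
    return best
```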
[314] Attention as an Adaptive Filter
Peter Racioppo
Main category: cs.LG
TL;DR: AFA is a novel attention mechanism that incorporates a learnable dynamics model using linear stochastic differential equations, providing robust attention weights through maximum likelihood estimation of the SDE.
Details
Motivation: To develop a more robust and theoretically grounded attention mechanism by modeling input sequences as observations of a linear stochastic differential equation rather than using direct query-key comparisons.
Method: Models the input sequence as discrete observations of a linear SDE with simultaneously diagonalizable state matrices and noise covariances. Uses a closed-form solution to the differential Lyapunov equation to propagate pairwise uncertainties. Attention weights emerge as the maximum likelihood solution, with robust residual-based reweightings.
Result: Developed AFA mechanism that can recover standard dot-product attention as a special case (vanishing dynamics and process noise with small-angle approximation). Created a simplified variant with same computational/memory complexity as standard attention.
Conclusion: AFA provides a principled, dynamics-based approach to attention that offers robustness benefits while maintaining computational efficiency and generalizing standard attention mechanisms.
Abstract: We introduce Adaptive Filter Attention (AFA), a novel attention mechanism that incorporates a learnable dynamics model directly into the computation of attention weights. Rather than comparing queries and keys directly, we model the input sequence as discrete observations of a linear stochastic differential equation (SDE). By imposing a linear dynamics model with simultaneously diagonalizable state matrices and noise covariances, we can make use of a closed-form solution to the differential Lyapunov equation to efficiently propagate pairwise uncertainties through the dynamics. Attention naturally arises as the maximum likelihood solution for this linear SDE, with attention weights corresponding to robust residual-based reweightings of the propagated pairwise precisions. Imposing an additional constraint on the state matrix’s eigenvalues leads to a simplified variant with the same computational and memory complexity as standard attention. In the limit of vanishing dynamics and process noise, and using a small-angle approximation, we recover ordinary dot-product attention.
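To make the setup concrete, a schematic of the modeling assumptions in our own notation; the paper's exact parameterization may differ.

```latex
% Latent linear SDE with noisy discrete observations (notation ours):
dx(t) = A\,x(t)\,dt + dw(t), \qquad y_k = x(t_k) + v_k, \quad v_k \sim \mathcal{N}(0, R),
% with A, Q = \operatorname{Cov}(dw), and R simultaneously diagonalizable.
% Pairwise uncertainty is propagated through the differential Lyapunov equation
\frac{dP(t)}{dt} = A\,P(t) + P(t)\,A^{\top} + Q,
% which admits a closed-form solution under the diagonalizability assumption;
% attention weights then arise as maximum-likelihood residual reweightings.
```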
[315] Privacy Risks in Time Series Forecasting: User- and Record-Level Membership Inference
Nicolas Johansson, Tobias Olsson, Daniel Nilsson, Johan Östman, Fazeleh Hoseini
Main category: cs.LG
TL;DR: Time series forecasting models are vulnerable to membership inference attacks, with user-level attacks achieving perfect detection. New attacks DTS and adapted LiRA outperform existing methods, showing increased risk with longer prediction horizons and smaller training sets.
Details
Motivation: Membership inference attacks have been extensively studied on classification models but remain largely unexplored for time series forecasting, creating a significant research gap in understanding privacy risks in this domain.
Method: Introduced two new attacks: (1) an adaptation of multivariate LiRA from classification to time-series forecasting, and (2) a novel end-to-end Deep Time Series (DTS) attack. Benchmarked against adapted classification attacks on the TUH-EEG and ELD datasets using LSTM and N-HiTS architectures under record- and user-level threat models.
Result: Forecasting models are vulnerable to membership inference, with user-level attacks often achieving perfect detection. Proposed methods (DTS and adapted LiRA) achieved strongest performance in several settings. Vulnerability increases with longer prediction horizons and smaller training populations.
Conclusion: Time series forecasting models present significant privacy risks through membership inference attacks, establishing new baselines for privacy risk assessment in this domain with trends similar to those observed in large language models.
Abstract: Membership inference attacks (MIAs) aim to determine whether specific data were used to train a model. While extensively studied on classification models, their impact on time series forecasting remains largely unexplored. We address this gap by introducing two new attacks: (i) an adaptation of multivariate LiRA, a state-of-the-art MIA originally developed for classification models, to the time-series forecasting setting, and (ii) a novel end-to-end learning approach called Deep Time Series (DTS) attack. We benchmark these methods against adapted versions of other leading attacks from the classification setting. We evaluate all attacks in realistic settings on the TUH-EEG and ELD datasets, targeting two strong forecasting architectures, LSTM and the state-of-the-art N-HiTS, under both record- and user-level threat models. Our results show that forecasting models are vulnerable, with user-level attacks often achieving perfect detection. The proposed methods achieve the strongest performance in several settings, establishing new baselines for privacy risk assessment in time series forecasting. Furthermore, vulnerability increases with longer prediction horizons and smaller training populations, echoing trends observed in large language models.
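For orientation, the simplest membership-inference baseline, a global loss threshold, is a useful mental model of what record-level attacks measure; the paper's LiRA and DTS attacks are substantially stronger than this sketch.

```python
import numpy as np

def loss_threshold_mia(member_losses: np.ndarray,
                       nonmember_losses: np.ndarray,
                       threshold: float) -> dict:
    """Flag a series as a training member when its forecasting loss falls
    below the threshold; report the resulting true/false positive rates."""
    return {
        "tpr": float((member_losses < threshold).mean()),
        "fpr": float((nonmember_losses < threshold).mean()),
    }
```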
[316] Comment on “A Note on Over-Smoothing for Graph Neural Networks”
Razi Hasson, Reuven Guetta
Main category: cs.LG
TL;DR: This paper provides theoretical analysis showing that Dirichlet energy decreases exponentially with GNN depth under mild spectral conditions, including with Leaky-ReLU, and extends results to spectral polynomial filters with experimental validation.
Details
Motivation: To address and better understand the over-smoothing phenomenon in Graph Neural Networks (GNNs) through theoretical analysis of Dirichlet energy, building upon previous work by Cai and Wang (2020).
Method: Theoretical analysis under mild spectral conditions, mathematical proofs for the exponential decrease of Dirichlet energy with depth, extension to spectral polynomial filters, and experimental validation through edge deletion and weight amplification tests.
Result: Demonstrated that Dirichlet energy decreases exponentially with network depth under specified conditions, provided experimental evidence showing when Dirichlet energy increases, suggesting practical methods to mitigate over-smoothing.
Conclusion: The study offers theoretical insights into over-smoothing in GNNs and identifies practical approaches (edge deletion and weight amplification) that can help alleviate this problem, contributing to better understanding and mitigation of over-smoothing in deep graph neural networks.
Abstract: We comment on Cai and Wang (2020, arXiv:2006.13318), who analyze over-smoothing in GNNs via Dirichlet energy. We show that under mild spectral conditions (including with Leaky-ReLU), the Dirichlet energy of node embeddings decreases exponentially with depth; we further extend the result to spectral polynomial filters and provide a short proof for the Leaky-ReLU case. Experiments on edge deletion and weight amplification illustrate when Dirichlet energy increases, hinting at practical ways to relieve over-smoothing.
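For reference, the Dirichlet energy in question (following Cai and Wang's definition) and the schematic shape of the decay statement; the constants below are illustrative.

```latex
% Dirichlet energy of node embeddings X with rows x_i, degrees d_i, adjacency a_{ij}:
E(X) = \tfrac{1}{2} \sum_{i,j} a_{ij} \left\lVert \frac{x_i}{\sqrt{1+d_i}} - \frac{x_j}{\sqrt{1+d_j}} \right\rVert_2^2 ,
% and the over-smoothing result has the schematic form
E\big(H^{(k)}\big) \le c\,\rho^{k}\, E\big(H^{(0)}\big), \qquad 0 < \rho < 1,
% under the paper's mild spectral conditions (including Leaky-ReLU activations).
```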
[317] Set Block Decoding is a Language Model Inference Accelerator
Itai Gat, Heli Ben-Hamu, Marton Havasi, Daniel Haziza, Jeremy Reizenstein, Gabriel Synnaeve, David Lopez-Paz, Brian Karrer, Yaron Lipman
Main category: cs.LG
TL;DR: Set Block Decoding (SBD) accelerates autoregressive language model inference by enabling parallel prediction of multiple non-consecutive tokens using both next token prediction and masked token prediction, achieving 3-5x speedup without accuracy loss.
Details
Motivation: Autoregressive language models face high computational and memory costs during inference, particularly in the decoding stage, making practical deployment challenging.
Method: SBD integrates standard next token prediction (NTP) and masked token prediction (MATP) within a single architecture, allowing parallel sampling of multiple non-consecutive future tokens using discrete diffusion solvers.
Result: Fine-tuning Llama-3.1 8B and Qwen-3 8B with SBD achieved a 3-5x reduction in forward passes while maintaining the same performance as equivalent NTP training.
Conclusion: SBD provides a flexible and efficient decoding paradigm that accelerates generation without architectural changes, extra training parameters, or sacrificing accuracy, while maintaining KV-caching compatibility.
Abstract: Autoregressive next token prediction language models offer powerful capabilities but face significant challenges in practical deployment due to the high computational and memory costs of inference, particularly during the decoding stage. We introduce Set Block Decoding (SBD), a simple and flexible paradigm that accelerates generation by integrating standard next token prediction (NTP) and masked token prediction (MATP) within a single architecture. SBD allows the model to sample multiple, not necessarily consecutive, future tokens in parallel, a key distinction from previous acceleration methods. This flexibility allows the use of advanced solvers from the discrete diffusion literature, offering significant speedups without sacrificing accuracy. SBD requires no architectural changes or extra training hyperparameters, maintains compatibility with exact KV-caching, and can be implemented by fine-tuning existing next token prediction models. By fine-tuning Llama-3.1 8B and Qwen-3 8B, we demonstrate that SBD enables a 3-5x reduction in the number of forward passes required for generation while achieving the same performance as equivalent NTP training.
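One way to picture parallel filling of non-consecutive tokens, as a hedged single-step sketch: confidence-based selection is one common discrete-diffusion heuristic, and the paper's solvers may differ.

```python
import torch

def set_block_decode_step(logits_fn, tokens: torch.Tensor, mask_id: int, k: int = 4):
    """Fill the k masked positions the model is most confident about, in
    parallel; the remaining masks are left for later steps."""
    logits = logits_fn(tokens)                # (seq_len, vocab_size)
    conf, pred = torch.softmax(logits, dim=-1).max(dim=-1)
    masked = tokens == mask_id
    conf = torch.where(masked, conf, torch.full_like(conf, -1.0))
    n_fill = min(k, int(masked.sum()))
    fill = conf.topk(n_fill).indices          # most confident masked slots
    out = tokens.clone()
    out[fill] = pred[fill]
    return out
```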
[318] One-Embedding-Fits-All: Efficient Zero-Shot Time Series Forecasting by a Model Zoo
Hao-Nan Shi, Ting-Ji Huang, Lu Han, De-Chuan Zhan, Han-Jia Ye
Main category: cs.LG
TL;DR: ZooCast is a framework that intelligently combines multiple Time Series Foundation Models (TSFMs) into a model zoo, using a unified embedding space to dynamically select the best model for each forecasting task, achieving strong zero-shot performance with single-model efficiency.
Details
Motivation: Different TSFMs excel at different temporal patterns, but no single model performs best universally. This creates an opportunity to leverage the complementary strengths of multiple models through intelligent ensemble selection.
Method: Proposes ZooCast with a One-Embedding-Fits-All paradigm that creates a unified representation space where each model is represented by a single embedding, enabling efficient similarity matching for dynamic model selection across tasks.
Result: Demonstrates strong performance on GIFT-Eval zero-shot forecasting benchmark while maintaining the efficiency of a single TSFM. Enables seamless addition of new models with progressive accuracy gains and negligible overhead.
Conclusion: ZooCast effectively harnesses the complementary abilities of multiple TSFMs through intelligent model selection, achieving superior zero-shot forecasting performance with efficient computational overhead.
Abstract: The proliferation of Time Series Foundation Models (TSFMs) has significantly advanced zero-shot forecasting, enabling predictions for unseen time series without task-specific fine-tuning. Extensive research has confirmed that no single TSFM excels universally, as different models exhibit preferences for distinct temporal patterns. This diversity suggests an opportunity: how to take advantage of the complementary abilities of TSFMs. To this end, we propose ZooCast, which characterizes each model’s distinct forecasting strengths. ZooCast can intelligently assemble current TSFMs into a model zoo that dynamically selects optimal models for different forecasting tasks. Our key innovation lies in the One-Embedding-Fits-All paradigm that constructs a unified representation space where each model in the zoo is represented by a single embedding, enabling efficient similarity matching for all tasks. Experiments demonstrate ZooCast’s strong performance on the GIFT-Eval zero-shot forecasting benchmark while maintaining the efficiency of a single TSFM. In real-world scenarios with sequential model releases, the framework seamlessly adds new models for progressive accuracy gains with negligible overhead.
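The selection step implied by the One-Embedding-Fits-All paradigm reduces to nearest-neighbor search in the shared space; a sketch follows, while the construction of the embeddings themselves is the paper's contribution and is abstracted away.

```python
import numpy as np

def select_model(task_embedding: np.ndarray, zoo: dict) -> str:
    """Return the zoo model whose single embedding is closest (cosine) to
    the task embedding; `zoo` maps model names to vectors."""
    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))
    return max(zoo, key=lambda name: cosine(task_embedding, zoo[name]))
```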
[319] Why Can’t I See My Clusters? A Precision-Recall Approach to Dimensionality Reduction Validation
Diede P. M. van der Hoorn, Alessio Arleo, Fernando V. Paulovich
Main category: cs.LG
TL;DR: This paper introduces two supervised metrics (precision and recall) to evaluate the relationship phase of dimensionality reduction, helping explain why expected cluster structures may be missing from projections and guiding hyperparameter tuning.
Details
Motivation: Existing DR quality metrics don’t explain why expected cluster structures are missing from projections, and visual analytics solutions are time-consuming due to large hyperparameter spaces.
Method: Leverages a framework dividing DR into relationship and mapping phases, and introduces precision and recall metrics to evaluate how well modeled relationships align with expected cluster structures based on labels.
Result: The approach can guide hyperparameter tuning, uncover projection artifacts, and determine if expected structures are captured in relationships, making DR faster and more reliable.
Conclusion: The proposed supervised metrics provide a systematic way to evaluate the relationship phase of DR, addressing limitations of existing quality metrics and improving the reliability of dimensionality reduction processes.
Abstract: Dimensionality Reduction (DR) is widely used for visualizing high-dimensional data, often with the goal of revealing expected cluster structure. However, such a structure may not always appear in the projections. Existing DR quality metrics assess projection reliability (to some extent) or cluster structure quality, but do not explain why expected structures are missing. Visual Analytics solutions can help, but are often time-consuming due to the large hyperparameter space. This paper addresses this problem by leveraging a recent framework that divides the DR process into two phases: a relationship phase, where similarity relationships are modeled, and a mapping phase, where the data is projected accordingly. We introduce two supervised metrics, precision and recall, to evaluate the relationship phase. These metrics quantify how well the modeled relationships align with an expected cluster structure based on some set of labels representing this structure. We illustrate their application using t-SNE and UMAP, and validate the approach through various usage scenarios. Our approach can guide hyperparameter tuning, uncover projection artifacts, and determine if the expected structure is captured in the relationships, making the DR process faster and more reliable.
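A label-based reading of the two metrics, as a sketch over a modeled neighbor graph; the paper's exact estimators may differ.

```python
import numpy as np

def relationship_precision_recall(neighbors, labels):
    """`neighbors[i]` is the set of indices the DR method models as similar
    to point i. Precision: fraction of modeled neighbors sharing i's label.
    Recall: fraction of i's same-label points modeled as neighbors.
    Both are averaged over points."""
    precisions, recalls = [], []
    n = len(labels)
    for i, nbrs in enumerate(neighbors):
        same = {j for j in range(n) if labels[j] == labels[i] and j != i}
        if nbrs:
            precisions.append(len(nbrs & same) / len(nbrs))
        if same:
            recalls.append(len(nbrs & same) / len(same))
    return float(np.mean(precisions)), float(np.mean(recalls))
```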
[320] Rethinking the long-range dependency in Mamba/SSM and transformer models
Cong Ma, Kayvan Najarian
Main category: cs.LG
TL;DR: Theoretical analysis shows SSMs have exponential decay in long-range dependency while transformers are more flexible. A new SSM formulation is proposed to combine attention’s flexibility with SSM efficiency.
Details
Motivation: To theoretically investigate and compare the long-range dependency modeling capabilities of state-space models (SSM) and transformers, as current benchmarking lacks the mathematical analysis needed for systematic improvement.
Method: Mathematically defined long-range dependency using the derivative of hidden states with respect to past inputs. Compared SSM and transformer capabilities under this definition, then proposed a new SSM formulation with a stability proof under a Gaussian input distribution.
Result: SSM’s long-range dependency decays exponentially with sequence length (similar to RNN memory decay), while attention mechanism in transformers is more flexible without exponential decay constraints.
Conclusion: Transformers theoretically perform better at long-range dependency with sufficient resources, but new SSM formulation combines attention flexibility with SSM computational efficiency while maintaining stability.
Abstract: Long-range dependency is one of the most desired properties of recent sequence models such as state-space models (particularly Mamba) and transformer models. New model architectures are being actively developed and benchmarked for prediction tasks requiring long-range dependency. However, the capability of modeling long-range dependencies of these models has not been investigated from a theoretical perspective, which hinders a systematic improvement on this aspect. In this work, we mathematically define long-range dependency using the derivative of hidden states with respect to past inputs and compare the capability of SSM and transformer models of modeling long-range dependency based on this definition. We showed that the long-range dependency of SSM decays exponentially with the sequence length, which aligns with the exponential decay of memory function in RNN. But the attention mechanism used in transformers is more flexible and is not constrained to exponential decay, which could in theory perform better at modeling long-range dependency with sufficient training data, computing resources, and proper training. To combine the flexibility of long-range dependency of attention mechanism and computation efficiency of SSM, we propose a new formulation for hidden state update in SSM and prove its stability under a standard Gaussian distribution of the input data.
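The paper's definition can be summarized schematically (our notation): long-range dependency is measured through the Jacobian of the hidden state with respect to a past input, which for a linear recurrence picks up a power of the state matrix.

```latex
L(t, s) = \left\lVert \frac{\partial h_t}{\partial x_s} \right\rVert, \qquad
h_t = \bar{A}\,h_{t-1} + \bar{B}\,x_t \;\Rightarrow\; \frac{\partial h_t}{\partial x_s} = \bar{A}^{\,t-s}\,\bar{B},
% so \lVert \partial h_t / \partial x_s \rVert \le C \rho^{\,t-s} with \rho < 1 for a stable SSM:
% exponential decay in the gap t - s, a factor attention scores do not carry.
```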
[321] Rethinking Layer-wise Gaussian Noise Injection: Bridging Implicit Objectives and Privacy Budget Allocation
Qifeng Tan, Shusen Yang, Xuebin Ren, Yikai Zhang
Main category: cs.LG
TL;DR: A unified analytical framework for layer-wise noise allocation in differentially private deep learning that reveals flaws in existing methods and proposes a SNR-Consistent strategy for better privacy-utility tradeoffs.
Details
Motivation: Existing layer-wise Gaussian mechanisms use heuristic noise allocation strategies without rigorous theoretical grounding, lacking an understanding of how noise allocation connects to formal privacy-utility tradeoffs.
Method: Proposes an SNR-Consistent noise allocation strategy that systematically connects layer-wise noise injection with optimization objectives and privacy budget allocations, ensuring inter-layer signal-to-noise ratio consistency.
Result: Extensive experiments in centralized and federated learning settings show the method consistently outperforms existing allocation strategies, achieving better privacy-utility tradeoffs.
Conclusion: The framework provides diagnostic insights into prior methods and theoretical guidance for designing adaptive and effective noise injection schemes in deep models.
Abstract: Layer-wise Gaussian mechanisms (LGM) enhance flexibility in differentially private deep learning by injecting noise into partitioned gradient vectors. However, existing methods often rely on heuristic noise allocation strategies, lacking a rigorous understanding of their theoretical grounding in connecting noise allocation to formal privacy-utility tradeoffs. In this paper, we present a unified analytical framework that systematically connects layer-wise noise injection strategies with their implicit optimization objectives and associated privacy budget allocations. Our analysis reveals that several existing approaches optimize ill-posed objectives – either ignoring inter-layer signal-to-noise ratio (SNR) consistency or leading to inefficient use of the privacy budget. In response, we propose an SNR-Consistent noise allocation strategy that unifies both aspects, yielding a noise allocation scheme that achieves better signal preservation and more efficient privacy budget utilization. Extensive experiments in both centralized and federated learning settings demonstrate that our method consistently outperforms existing allocation strategies, achieving better privacy-utility tradeoffs. Our framework not only offers diagnostic insights into prior methods but also provides theoretical guidance for designing adaptive and effective noise injection schemes in deep models.
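A hedged sketch of what inter-layer SNR consistency implies for Gaussian noise allocation; the aggregation constraint used here is one simple choice, and the paper's calibration against an actual (epsilon, delta) accountant is more involved.

```python
import numpy as np

def snr_consistent_sigmas(clip_norms, total_noise_budget):
    """Set each layer's noise std proportional to its clipping norm so the
    per-layer SNR C_l / sigma_l is identical, then scale the common factor
    so the root-sum-square of the stds meets the aggregate budget."""
    c = np.asarray(clip_norms, dtype=float)
    k = total_noise_budget / np.sqrt((c ** 2).sum())
    return list(k * c)
```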
[322] Synthetic Survival Data Generation for Heart Failure Prognosis Using Deep Generative Models
Chanon Puttanawarut, Natcha Fongsrisin, Porntep Amornritvanich, Cholatid Ratanatharathorn, Panu Looareesuwan
Main category: cs.LG
TL;DR: Deep learning models successfully generated high-fidelity synthetic heart failure data that preserves privacy while maintaining statistical utility for research.
Details
Motivation: Heart failure research faces data sharing limitations due to privacy regulations and institutional barriers, requiring synthetic data solutions.
Method: Used five deep learning models (TVAE, normalizing flow, ADSGAN, SurvivalGAN, TabDDPM) to generate synthetic datasets from 12,552 patient records, evaluated with statistical similarity metrics and survival prediction.
Result: SurvivalGAN and TabDDPM showed highest fidelity with similar distributions and survival curves. SurvivalGAN and TVAE achieved strong survival prediction performance (C-indices 0.71-0.76) matching real data, while ensuring privacy protection.
Conclusion: Deep learning synthetic data generation produces privacy-preserving, high-fidelity HF datasets that overcome data sharing barriers and advance research.
Abstract: Background: Heart failure (HF) research is constrained by limited access to large, shareable datasets due to privacy regulations and institutional barriers. Synthetic data generation offers a promising solution to overcome these challenges while preserving patient confidentiality. Methods: We generated synthetic HF datasets from institutional data comprising 12,552 unique patients using five deep learning models: tabular variational autoencoder (TVAE), normalizing flow, ADSGAN, SurvivalGAN, and tabular denoising diffusion probabilistic models (TabDDPM). We comprehensively evaluated synthetic data utility through statistical similarity metrics, survival prediction using machine learning and privacy assessments. Results: SurvivalGAN and TabDDPM demonstrated high fidelity to the original dataset, exhibiting similar variable distributions and survival curves after applying histogram equalization. SurvivalGAN (C-indices: 0.71-0.76) and TVAE (C-indices: 0.73-0.76) achieved the strongest performance in survival prediction evaluation, closely matched real data performance (C-indices: 0.73-0.76). Privacy evaluation confirmed protection against re-identification attacks. Conclusions: Deep learning-based synthetic data generation can produce high-fidelity, privacy-preserving HF datasets suitable for research applications. This publicly available synthetic dataset addresses critical data sharing barriers and provides a valuable resource for advancing HF research and predictive modeling.
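Since the survival evaluation hinges on the concordance index, a plain O(n^2) implementation of Harrell's C for reference (ties in time are not specially handled in this sketch).

```python
import numpy as np

def concordance_index(times, events, risks):
    """Harrell's C: among comparable pairs (the earlier time is an observed
    event), the fraction where the earlier-failing patient has higher risk."""
    concordant, comparable = 0.0, 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            if events[i] == 1 and times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5
    return concordant / comparable if comparable else float("nan")
```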
[323] RL’s Razor: Why Online Reinforcement Learning Forgets Less
Idan Shenfeld, Jyothish Pari, Pulkit Agrawal
Main category: cs.LG
TL;DR: RL fine-tuning preserves prior knowledge better than SFT due to smaller distributional shift, measured by KL-divergence between fine-tuned and base policies.
Details
Motivation: To understand why reinforcement learning fine-tuning preserves model capabilities better than supervised fine-tuning despite similar task performance.
Method: Theoretical analysis of KL-divergence between fine-tuned and base policies, validated through experiments with large language models and robotic foundation models.
Result: On-policy RL is biased towards KL-minimal solutions, while SFT can converge to distributions arbitrarily far from the base model, explaining RL’s better knowledge preservation.
Conclusion: RL’s Razor principle: RL prefers solutions closest in KL to the original model among all ways to solve a new task, making it superior for preserving prior knowledge during fine-tuning.
Abstract: Comparison of fine-tuning models with reinforcement learning (RL) and supervised fine-tuning (SFT) reveals that, despite similar performance at a new task, RL preserves prior knowledge and capabilities significantly better. We find that the degree of forgetting is determined by the distributional shift, measured as the KL-divergence between the fine-tuned and base policy evaluated on the new task. Our analysis reveals that on-policy RL is implicitly biased towards KL-minimal solutions among the many that solve the new task, whereas SFT can converge to distributions arbitrarily far from the base model. We validate these findings through experiments with large language models and robotic foundation models and further provide theoretical justification for why on-policy RL updates lead to a smaller KL change. We term this principle RL’s Razor: among all ways to solve a new task, RL prefers those closest in KL to the original model.
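The abstract's claims compress into two schematic statements (our notation; the abstract leaves the exact KL direction implicit).

```latex
\text{forgetting} \;\approx\; f\!\left( D_{\mathrm{KL}}\big(\pi_{\theta} \,\|\, \pi_{0}\big) \right)
\quad \text{evaluated on the new task's inputs},
% and RL's Razor: among policies that solve the new task T,
\pi_{\mathrm{RL}} \;\approx\; \operatorname*{arg\,min}_{\pi \,:\, \pi \text{ solves } T} \; D_{\mathrm{KL}}\big(\pi \,\|\, \pi_{0}\big).
```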
[324] An Interactive Framework for Finding the Optimal Trade-off in Differential Privacy
Yaohong Yang, Aki Rehn, Sammie Katt, Antti Honkela, Samuel Kaski
Main category: cs.LG
TL;DR: This paper proposes an efficient method for finding optimal privacy-accuracy trade-offs in differential privacy by leveraging the problem’s unique structure and using more informative user interactions.
Details
Motivation: Standard multi-objective optimization approaches are inefficient for differential privacy because they fail to leverage the problem’s unique structure: points on the Pareto front can be generated directly by maximizing accuracy for fixed privacy levels.
Method: The authors theoretically derive the shape of the privacy-accuracy trade-off to model the Pareto front directly, and replace simple pairwise comparisons with more informative interactions where users select preferred trade-offs from hypothetical curves.
Result: Experiments on differentially private logistic regression and deep transfer learning across six real-world datasets show the method converges to optimal privacy-accuracy trade-offs with significantly less computational cost and user interaction than baselines.
Conclusion: The proposed approach efficiently solves the privacy-accuracy optimization problem in differential privacy by leveraging problem structure and improved user interaction methods, reducing computational burden while maintaining effectiveness.
Abstract: Differential privacy (DP) is the standard for privacy-preserving analysis, and introduces a fundamental trade-off between privacy guarantees and model performance. Selecting the optimal balance is a critical challenge that can be framed as a multi-objective optimization (MOO) problem where one first discovers the set of optimal trade-offs (the Pareto front) and then learns a decision-maker’s preference over them. While a rich body of work on interactive MOO exists, the standard approach – modeling the objective functions with generic surrogates and learning preferences from simple pairwise feedback – is inefficient for DP because it fails to leverage the problem’s unique structure: a point on the Pareto front can be generated directly by maximizing accuracy for a fixed privacy level. Motivated by this property, we first derive the shape of the trade-off theoretically, which allows us to model the Pareto front directly and efficiently. To address inefficiency in preference learning, we replace pairwise comparisons with a more informative interaction. In particular, we present the user with hypothetical trade-off curves and ask them to pick their preferred trade-off. Our experiments on differentially private logistic regression and deep transfer learning across six real-world datasets show that our method converges to the optimal privacy-accuracy trade-off with significantly less computational cost and user interaction than baselines.
[325] A Primer on Causal and Statistical Dataset Biases for Fair and Robust Image Analysis
Charles Jones, Ben Glocker
Main category: cs.LG
TL;DR: The paper analyzes why machine learning fails in real-world applications, particularly in high-stakes domains like medical diagnosis, and introduces two overlooked problems: the “no fair lunch” problem and the “subgroup separability” problem.
Details
Motivation: Machine learning methods often fail in real-world, high-stakes applications across socially sensitive lines, creating barriers to adoption in critical domains like medical diagnosis where they could provide significant benefits if safely deployed.
Method: The paper introduces causal and statistical structures that induce failure in ML methods for image analysis, analyzes two specific problems (no fair lunch and subgroup separability), and examines why current fair representation learning methods fail to solve them.
Result: The analysis reveals previously overlooked structural problems in ML deployment and identifies limitations in existing fair representation learning approaches.
Conclusion: The paper proposes potential paths forward for the field to address these fundamental problems and enable safer deployment of machine learning in high-stakes applications.
Abstract: Machine learning methods often fail when deployed in the real world. Worse still, they fail in high-stakes situations and across socially sensitive lines. These issues have a chilling effect on the adoption of machine learning methods in settings such as medical diagnosis, where they are arguably best-placed to provide benefits if safely deployed. In this primer, we introduce the causal and statistical structures which induce failure in machine learning methods for image analysis. We highlight two previously overlooked problems, which we call the “no fair lunch” problem and the “subgroup separability” problem. We elucidate why today’s fair representation learning methods fail to adequately solve them and propose potential paths forward for the field.
[326] Using causal abstractions to accelerate decision-making in complex bandit problems
Joel Dyer, Nicholas Bishop, Anisoara Calinescu, Michael Wooldridge, Fabio Massimo Zennaro
Main category: cs.LG
TL;DR: AT-UCB algorithm leverages causal abstraction theory to efficiently explore coarse-grained CMAB instances before applying UCB on optimal actions, reducing cumulative regret compared to classical UCB.
Details
Motivation: Real-world decision-making problems can be encoded as causal multi-armed bandits at different abstraction levels, but existing methods don’t exploit the information and computational advantages across these levels.
Method: Proposes the AT-UCB algorithm, which uses causal abstraction theory to explore cheap-to-simulate coarse-grained CMAB instances first, then applies traditional UCB on a restricted set of potentially optimal actions.
Result: Theoretical analysis shows novel upper bound on cumulative regret. Empirical evaluation on epidemiological simulators with varying resolution demonstrates significant reductions in cumulative regret compared to classical UCB.
Conclusion: AT-UCB effectively exploits shared information between different abstraction levels in CMAB problems, providing both theoretical guarantees and practical performance improvements for decision-making in complex systems.
Abstract: Although real-world decision-making problems can often be encoded as causal multi-armed bandits (CMABs) at different levels of abstraction, a general methodology exploiting the information and computational advantages of each abstraction level is missing. In this paper, we propose AT-UCB, an algorithm which efficiently exploits shared information between CMAB problem instances defined at different levels of abstraction. More specifically, AT-UCB leverages causal abstraction (CA) theory to explore within a cheap-to-simulate and coarse-grained CMAB instance, before employing the traditional upper confidence bound (UCB) algorithm on a restricted set of potentially optimal actions in the CMAB of interest, leading to significant reductions in cumulative regret when compared to the classical UCB algorithm. We illustrate the advantages of AT-UCB theoretically, through a novel upper bound on the cumulative regret, and empirically, by applying AT-UCB to epidemiological simulators with varying resolution and computational cost.
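A two-stage toy sketch of the explore-coarse-then-restrict idea; the causal-abstraction mapping between action spaces is elided, and `coarse_env`/`target_env` are hypothetical reward callables, not the paper's interface.

```python
import math

def ucb_index(pulls: int, total_reward: float, t: int) -> float:
    if pulls == 0:
        return float("inf")
    return total_reward / pulls + math.sqrt(2 * math.log(t) / pulls)

def at_ucb_sketch(coarse_env, target_env, n_arms: int,
                  coarse_budget: int, budget: int, keep: int = 3) -> int:
    """Stage 1: run UCB on the cheap coarse simulator to shortlist arms.
    Stage 2: run UCB on the expensive target bandit over the shortlist only."""
    pulls, rewards = [0] * n_arms, [0.0] * n_arms
    for t in range(1, coarse_budget + 1):
        a = max(range(n_arms), key=lambda i: ucb_index(pulls[i], rewards[i], t))
        pulls[a] += 1
        rewards[a] += coarse_env(a)
    means = [rewards[i] / max(pulls[i], 1) for i in range(n_arms)]
    shortlist = sorted(range(n_arms), key=lambda i: -means[i])[:keep]
    pulls2 = {a: 0 for a in shortlist}
    rewards2 = {a: 0.0 for a in shortlist}
    for t in range(1, budget + 1):
        a = max(shortlist, key=lambda i: ucb_index(pulls2[i], rewards2[i], t))
        pulls2[a] += 1
        rewards2[a] += target_env(a)
    return max(shortlist, key=lambda a: rewards2[a] / max(pulls2[a], 1))
```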
[327] Characteristic Energy Behavior Profiling of Non-Residential Buildings
Haley Dozier, Althea Henslee
Main category: cs.LG
TL;DR: Proposes a data-driven behavioral model to analyze energy usage patterns on US Army installations for climate resilience planning and benchmarking future measures.
Details
Motivation: US Army installations face climate change threats and rely on vulnerable commercial energy infrastructure, requiring resilience measures to protect critical mission-supporting facilities.
Method: Develops behavioral models to analyze, predict, and cluster multimodal energy usage data from non-residential buildings, using open access data due to Army data constraints.
Result: Creates baseline energy usage profiles that can assess disruption impacts and benchmark future resilience measures.
Conclusion: The proposed methodology provides a foundation for understanding energy system vulnerabilities and evaluating resilience strategies for Army installations.
Abstract: Due to the threat of changing climate and extreme weather events, the infrastructure of the United States Army installations is at risk. More than ever, climate resilience measures are needed to protect facility assets that support critical missions and help generate readiness. As most of the Army installations within the continental United States rely on commercial energy and water sources, resilience to the vulnerabilities within independent energy resources (electricity grids, natural gas pipelines, etc.) along with a baseline understanding of energy usage within installations must be determined. This paper will propose a data-driven behavioral model to determine behavior profiles of energy usage on installations. These profiles will be used 1) to create a baseline assessment of the impact of unexpected disruptions on energy systems and 2) to benchmark future resiliency measures. In this methodology, individual building behavior will be represented with models that can accurately analyze, predict, and cluster multimodal data collected from energy usage of non-residential buildings. Due to the nature of Army installation energy usage data, similarly structured open access data will be used to illustrate this methodology.
[328] Parking Availability Prediction via Fusing Multi-Source Data with A Self-Supervised Learning Enhanced Spatio-Temporal Inverted Transformer
Yin Huang, Yongqi Dong, Youhua Tang, Li Li
Main category: cs.LG
TL;DR: Proposes SST-iTransformer for parking availability prediction, using multi-source traffic data and self-supervised learning to capture spatio-temporal dependencies, achieving state-of-the-art results with ride-hailing data being most impactful.
Details
Motivation: To address urban parking challenges caused by rapid private car growth by improving parking availability prediction accuracy through better modeling of spatio-temporal dependencies and multi-source data integration.
Method: Uses K-means clustering to create parking cluster zones, integrates multi-source traffic data (metro, bus, ride-hailing, taxi), and develops SST-iTransformer with self-supervised learning and a dual-branch attention mechanism (Series Attention for temporal dependencies, Channel Attention for cross-variate interactions).
Result: Outperforms baseline models (Informer, Autoformer, Crossformer, iTransformer) with lowest MSE and competitive MAE. Ride-hailing data provides largest performance gains, followed by taxi data, while fixed-route transit features contribute minimally.
Conclusion: SST-iTransformer effectively captures complex spatio-temporal patterns for parking prediction, with multi-source data integration and spatial correlation modeling being crucial for optimal performance.
Abstract: The rapid growth of private car ownership has worsened the urban parking predicament, underscoring the need for accurate and effective parking availability prediction to support urban planning and management. To address key limitations in modeling spatio-temporal dependencies and exploiting multi-source data for parking availability prediction, this study proposes a novel approach with SST-iTransformer. The methodology leverages K-means clustering to establish parking cluster zones (PCZs), extracting and integrating traffic demand characteristics from various transportation modes (i.e., metro, bus, online ride-hailing, and taxi) associated with the targeted parking lots. Building on the vanilla iTransformer, SST-iTransformer integrates masking-reconstruction-based pretext tasks for self-supervised spatio-temporal representation learning, and features an innovative dual-branch attention mechanism: Series Attention captures long-term temporal dependencies via patching operations, while Channel Attention models cross-variate interactions through inverted dimensions. Extensive experiments using real-world data from Chengdu, China, demonstrate that SST-iTransformer outperforms baseline deep learning models (including Informer, Autoformer, Crossformer, and iTransformer), achieving state-of-the-art performance with the lowest mean squared error (MSE) and competitive mean absolute error (MAE). Comprehensive ablation studies quantitatively reveal the relative importance of different data sources: incorporating ride-hailing data provides the largest performance gains, followed by taxi, whereas fixed-route transit features (bus/metro) contribute marginally. Spatial correlation analysis further confirms that excluding historical data from correlated parking lots within PCZs leads to substantial performance degradation, underscoring the importance of modeling spatial dependencies.
[329] When three experiments are better than two: Avoiding intractable correlated aleatoric uncertainty by leveraging a novel bias–variance tradeoff
Paul Scherer, Andreas Kirsch, Jake P. Taylor-King
Main category: cs.LG
TL;DR: Novel active learning strategies that reduce bias between experimental rounds by leveraging the bias-variance tradeoff and aleatoric uncertainty, with a cobias-covariance relationship for batching that outperforms canonical methods.
Details
Motivation: Real-world experiments have heteroskedastic aleatoric uncertainty that can be correlated in batched settings, requiring better active learning approaches that account for both epistemic and aleatoric uncertainty.
Method: Proposes active learning strategies based on the bias-variance tradeoff to reduce bias between experimental rounds. Introduces a cobias-covariance relationship for utilizing historical data and an eigendecomposition strategy for batching. Uses a difference-based method with a quadratic estimator.
Result: The proposed method outperforms canonical active learning methods including BALD and Least Confidence when used in batched settings with quadratic estimator.
Conclusion: The novel approach effectively addresses heteroskedastic aleatoric uncertainty in batched experimental settings and provides superior performance compared to existing active learning methods through bias reduction and optimal batching strategies.
Abstract: Real-world experimental scenarios are characterized by the presence of heteroskedastic aleatoric uncertainty, and this uncertainty can be correlated in batched settings. The bias–variance tradeoff can be used to write the expected mean squared error between a model distribution and a ground-truth random variable as the sum of an epistemic uncertainty term, the bias squared, and an aleatoric uncertainty term. We leverage this relationship to propose novel active learning strategies that directly reduce the bias between experimental rounds, considering model systems both with and without noise. Finally, we investigate methods to leverage historical data in a quadratic manner through the use of a novel cobias–covariance relationship, which naturally proposes a mechanism for batching through an eigendecomposition strategy. When our difference-based method leveraging the cobias–covariance relationship is utilized in a batched setting (with a quadratic estimator), we outperform a number of canonical methods including BALD and Least Confidence.
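The decomposition the abstract builds on is the standard one; written out for a fixed input x, a model trained on random data, and outcome Y:

```latex
\mathbb{E}\big[(\hat{f}(x) - Y)^2\big]
= \underbrace{\big(\mathbb{E}[\hat{f}(x)] - \mathbb{E}[Y \mid x]\big)^2}_{\text{bias}^2}
+ \underbrace{\operatorname{Var}\big(\hat{f}(x)\big)}_{\text{epistemic (variance)}}
+ \underbrace{\operatorname{Var}(Y \mid x)}_{\text{aleatoric}} .
```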
[330] Towards a Unified View of Large Language Model Post-Training
Xingtai Lv, Yuxin Zuo, Youbang Sun, Hongyi Liu, Yuntian Wei, Zhekai Chen, Lixuan He, Xuekai Zhu, Kaiyan Zhang, Bingning Wang, Ning Ding, Bowen Zhou
Main category: cs.LG
TL;DR: The paper presents a unified theoretical framework showing that RL and SFT approaches are instances of a single optimization process, and proposes Hybrid Post-Training (HPT) that dynamically selects training signals to exploit demonstrations while maintaining stable exploration.
Details
Motivation: To bridge the gap between online (model-generated) and offline (human demonstrations) training data approaches, showing they are not contradictory but part of a unified optimization framework.
Method: Derived a Unified Policy Gradient Estimator with four interchangeable parts, and proposed the HPT algorithm, which dynamically selects different training signals based on the theoretical findings.
Result: HPT consistently surpassed strong baselines across six mathematical reasoning benchmarks and two out-of-distribution suites, working effectively across models of varying scales and families.
Conclusion: The unified framework provides theoretical grounding for hybrid approaches, and HPT demonstrates practical effectiveness in balancing exploitation of demonstrations with stable exploration while preserving learned reasoning patterns.
Abstract: Two major sources of training data exist for post-training modern language models: online (model-generated rollouts) data, and offline (human or other-model demonstrations) data. These two types of data are typically used by approaches like Reinforcement Learning (RL) and Supervised Fine-Tuning (SFT), respectively. In this paper, we show that these approaches are not in contradiction, but are instances of a single optimization process. We derive a Unified Policy Gradient Estimator, and present the calculations of a wide spectrum of post-training approaches as the gradient of a common objective under different data distribution assumptions and various bias-variance tradeoffs. The gradient estimator is constructed with four interchangeable parts: stabilization mask, reference policy denominator, advantage estimate, and likelihood gradient. Motivated by our theoretical findings, we propose Hybrid Post-Training (HPT), an algorithm that dynamically selects different training signals. HPT is designed to yield both effective exploitation of demonstration and stable exploration without sacrificing learned reasoning patterns. We provide extensive experiments and ablation studies to verify the effectiveness of our unified theoretical framework and HPT. Across six mathematical reasoning benchmarks and two out-of-distribution suites, HPT consistently surpasses strong baselines across models of varying scales and families.
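From the abstract's description alone, the four interchangeable parts suggest an estimator of roughly the following shape; this only names the slots, and the paper's exact form and the admissible choices per slot are not reproduced here.

```latex
\hat{g} = \mathbb{E}_{(x,y)}\!\left[\;
\underbrace{M(x,y)}_{\text{stabilization mask}}
\cdot \frac{\overbrace{\hat{A}(x,y)}^{\text{advantage estimate}}}{\underbrace{\pi_{\mathrm{ref}}(y \mid x)}_{\text{reference denominator}}}
\cdot \underbrace{\nabla_{\theta}\, \pi_{\theta}(y \mid x)}_{\text{likelihood gradient}}
\;\right].
```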
[331] PagedEviction: Structured Block-wise KV Cache Pruning for Efficient Large Language Model Inference
Krishna Teja Chitty-Venkata, Jie Ye, Xian-He Sun, Anthony Kougkas, Murali Emani, Venkatram Vishwanath, Bogdan Nicolae
Main category: cs.LG
TL;DR: PagedEviction is a novel KV cache pruning strategy that improves memory efficiency in LLM inference by using block-wise eviction tailored for paged memory layouts, maintaining accuracy while reducing memory usage.
Details
Motivation: KV caching improves LLM inference efficiency but becomes a major memory bottleneck as sequence length increases, requiring better memory management solutions.
Method: Proposes PagedEviction, a fine-grained, structured KV cache pruning strategy that uses a block-wise eviction algorithm designed for paged memory layouts, integrating seamlessly with vLLM’s PagedAttention without modifying its CUDA kernels.
Result: Evaluation on Llama models using LongBench benchmark shows improved memory usage with better accuracy than baselines on long context tasks.
Conclusion: PagedEviction effectively addresses KV cache memory bottlenecks while maintaining inference accuracy, providing a practical solution for efficient long-context LLM processing.
Abstract: KV caching significantly improves the efficiency of Large Language Model (LLM) inference by storing attention states from previously processed tokens, enabling faster generation of subsequent tokens. However, as sequence length increases, the KV cache quickly becomes a major memory bottleneck. To address this, we propose PagedEviction, a novel fine-grained, structured KV cache pruning strategy that enhances the memory efficiency of vLLM’s PagedAttention. Unlike existing approaches that rely on attention-based token importance or evict tokens across different vLLM pages, PagedEviction introduces an efficient block-wise eviction algorithm tailored for paged memory layouts. Our method integrates seamlessly with PagedAttention without requiring any modifications to its CUDA attention kernels. We evaluate PagedEviction across Llama-3.1-8B-Instruct, Llama-3.2-1B-Instruct, and Llama-3.2-3B-Instruct models on the LongBench benchmark suite, demonstrating improved memory usage with better accuracy than baselines on long context tasks.
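The block-wise flavor of the eviction decision can be sketched independently of vLLM's internals; the importance scoring is abstracted into `block_scores`, a hypothetical input rather than the paper's scoring rule.

```python
def choose_victim_blocks(block_scores: dict, free_needed: int, block_size: int):
    """Drop whole KV blocks with the lowest aggregate importance until the
    requested amount of memory is freed; returns the evicted block ids."""
    victims, freed = [], 0
    for block_id in sorted(block_scores, key=block_scores.get):
        if freed >= free_needed:
            break
        victims.append(block_id)
        freed += block_size
    return victims
```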
[332] Transition Models: Rethinking the Generative Learning Objective
Zidong Wang, Yiyuan Zhang, Xiaoyu Yue, Xiangyu Yue, Yangguang Li, Wanli Ouyang, Lei Bai
Main category: cs.LG
TL;DR: TiM introduces a continuous-time dynamics equation for finite time interval transitions, enabling adaptive sampling from single leaps to fine-grained refinement with monotonic quality improvement as steps increase.
Details
Motivation: To address the fundamental dilemma between iterative diffusion models, which achieve high fidelity at high computational cost, and few-step alternatives, whose efficiency comes with a hard quality ceiling imposed by restrictive training objectives.
Method: Develops an exact continuous-time dynamics equation that analytically defines state transitions across any finite time interval, creating Transition Models (TiM) that adapt to arbitrary-step transitions.
Result: TiM with 865M parameters achieves SOTA performance, surpassing SD3.5 (8B) and FLUX.1 (12B) across all step counts, with monotonic quality improvement as sampling budget increases and exceptional fidelity at resolutions up to 4096x4096.
Conclusion: TiM provides a novel generative paradigm that resolves the step-quality conflict, offering both efficiency and high fidelity with adaptive sampling capabilities.
Abstract: A fundamental dilemma in generative modeling persists: iterative diffusion models achieve outstanding fidelity, but at a significant computational cost, while efficient few-step alternatives are constrained by a hard quality ceiling. This conflict between generation steps and output quality arises from restrictive training objectives that focus exclusively on either infinitesimal dynamics (PF-ODEs) or direct endpoint prediction. We address this challenge by introducing an exact, continuous-time dynamics equation that analytically defines state transitions across any finite time interval. This leads to a novel generative paradigm, Transition Models (TiM), which adapt to arbitrary-step transitions, seamlessly traversing the generative trajectory from single leaps to fine-grained refinement with more steps. Despite having only 865M parameters, TiM achieves state-of-the-art performance, surpassing leading models such as SD3.5 (8B parameters) and FLUX.1 (12B parameters) across all evaluated step counts. Importantly, unlike previous few-step generators, TiM demonstrates monotonic quality improvement as the sampling budget increases. Additionally, when employing our native-resolution strategy, TiM delivers exceptional fidelity at resolutions up to 4096x4096.
[333] Delta Activations: A Representation for Finetuned Large Language Models
Zhiqiu Xu, Amish Sethi, Mayur Naik, Ser-Nam Lim
Main category: cs.LG
TL;DR: Delta Activations method represents finetuned LLMs as vector embeddings by measuring activation shifts from base models, enabling effective clustering by domain/task and showing robustness across finetuning settings.
Details
Motivation: The proliferation of open-source LLMs has created challenges in navigating and understanding the vast collection of post-trained models due to inconsistent metadata and unstructured repositories.
Method: Measure shifts in internal activations of finetuned models relative to a base model to create vector embeddings that capture model characteristics and relationships.
Result: Delta Activations enable effective clustering by domain and task, demonstrate robustness across finetuning settings, exhibit additive properties with mixed datasets, and can embed tasks via few-shot finetuning.
Conclusion: Delta Activations provide a structured way to represent and understand finetuned models, facilitating model reuse and exploration of applications like model selection and merging.
Abstract: The success of powerful open source Large Language Models (LLMs) has enabled the community to create a vast collection of post-trained models adapted to specific tasks and domains. However, navigating and understanding these models remains challenging due to inconsistent metadata and unstructured repositories. We introduce Delta Activations, a method to represent finetuned models as vector embeddings by measuring shifts in their internal activations relative to a base model. This representation allows for effective clustering by domain and task, revealing structure in the model landscape. Delta Activations also demonstrate desirable properties: it is robust across finetuning settings and exhibits an additive property when finetuning datasets are mixed. In addition, we show that Delta Activations can embed tasks via few-shot finetuning, and further explore its use for model selection and merging. We hope Delta Activations can facilitate the practice of reusing publicly available models. Code is available at https://github.com/OscarXZQ/delta_activations.
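A compact sketch of the core measurement, assuming Hugging Face transformers-style models with `output_hidden_states=True`; the pooling and layer choice are illustrative, not the paper's exact recipe.

```python
import torch

@torch.no_grad()
def delta_embedding(base_model, finetuned_model, probe_batches, layer: int):
    """Average hidden-state shift of the finetuned model relative to the base
    model on a fixed probe set; returns one vector per finetuned model."""
    deltas = []
    for batch in probe_batches:
        h_base = base_model(**batch, output_hidden_states=True).hidden_states[layer]
        h_ft = finetuned_model(**batch, output_hidden_states=True).hidden_states[layer]
        deltas.append((h_ft - h_base).mean(dim=(0, 1)))  # pool batch and tokens
    return torch.stack(deltas).mean(dim=0)
```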
[334] IPA: An Information-Preserving Input Projection Framework for Efficient Foundation Model Adaptation
Yuan Yin, Shashanka Venkataramanan, Tuan-Hung Vu, Andrei Bursuc, Matthieu Cord
Main category: cs.LG
TL;DR: IPA improves LoRA by replacing random down-projection with feature-aware projection that preserves information using top principal components, achieving better performance with fewer parameters.
Details
Motivation: LoRA's random initialization of down-projection discards useful information and becomes a performance bottleneck, as it changes little during training while up-projection does most adaptation work.Method: Propose IPA framework that uses feature-aware projection to preserve information in reduced hidden space, instantiated with algorithms approximating top principal components for efficient pretraining.
Result: Consistently outperforms LoRA and DoRA with 1.5 points higher accuracy on commonsense reasoning and 2.3 points on VTAB-1k, matches full LoRA performance with half the trainable parameters when projection is frozen.
Conclusion: IPA provides a more effective parameter-efficient fine-tuning approach by addressing LoRA’s information loss problem through feature-aware projections, achieving superior performance with reduced parameter count.
Abstract: Parameter-efficient fine-tuning (PEFT) methods, such as LoRA, reduce adaptation cost by injecting low-rank updates into pretrained weights. However, LoRA’s down-projection is randomly initialized and data-agnostic, discarding potentially useful information. Prior analyses show that this projection changes little during training, while the up-projection carries most of the adaptation, making the random input compression a performance bottleneck. We propose IPA, a feature-aware projection framework that explicitly preserves information in the reduced hidden space. In the linear case, we instantiate IPA with algorithms approximating top principal components, enabling efficient projector pretraining with negligible inference overhead. Across language and vision benchmarks, IPA consistently improves over LoRA and DoRA, achieving on average 1.5 points higher accuracy on commonsense reasoning and 2.3 points on VTAB-1k, while matching full LoRA performance with roughly half the trainable parameters when the projection is frozen.
[335] Interpretable Clustering with Adaptive Heterogeneous Causal Structure Learning in Mixed Observational Data
Wenrui Li, Qinghao Zhang, Xiaowo Wang
Main category: cs.LG
TL;DR: HCL is an unsupervised framework that jointly infers latent clusters and their causal structures from mixed-type observational data without requiring prior knowledge like temporal ordering or interventions.
Details
Motivation: Existing methods lack causal awareness and struggle with modeling heterogeneity, confounding, and observational constraints, leading to poor interpretability and difficulty distinguishing true causal heterogeneity from spurious associations.Method: HCL introduces an equivalent representation encoding structural heterogeneity and confounding, uses bi-directional iterative strategy to refine causal clustering and structure learning, and employs self-supervised regularization to balance cross-cluster universality and specificity.
Result: Theoretically shows identifiability of heterogeneous causal structures under mild conditions. Empirically achieves superior performance in clustering and structure learning, and recovers biologically meaningful mechanisms in single-cell perturbation data.
Conclusion: HCL enables convergence toward interpretable, heterogeneous causal patterns and demonstrates utility for discovering interpretable, mechanism-level causal heterogeneity in scientific domains like biology and medicine.
Abstract: Understanding causal heterogeneity is essential for scientific discovery in domains such as biology and medicine. However, existing methods lack causal awareness, with insufficient modeling of heterogeneity, confounding, and observational constraints, leading to poor interpretability and difficulty distinguishing true causal heterogeneity from spurious associations. We propose an unsupervised framework, HCL (Interpretable Causal Mechanism-Aware Clustering with Adaptive Heterogeneous Causal Structure Learning), that jointly infers latent clusters and their associated causal structures from mixed-type observational data without requiring temporal ordering, environment labels, interventions or other prior knowledge. HCL relaxes the homogeneity and sufficiency assumptions by introducing an equivalent representation that encodes both structural heterogeneity and confounding. It further develops a bi-directional iterative strategy to alternately refine causal clustering and structure learning, along with a self-supervised regularization that balance cross-cluster universality and specificity. Together, these components enable convergence toward interpretable, heterogeneous causal patterns. Theoretically, we show identifiability of heterogeneous causal structures under mild conditions. Empirically, HCL achieves superior performance in both clustering and structure learning tasks, and recovers biologically meaningful mechanisms in real-world single-cell perturbation data, demonstrating its utility for discovering interpretable, mechanism-level causal heterogeneity.
[336] Echo State Networks as State-Space Models: A Systems Perspective
Pradeep Singh, Balasubramanian Raman
Main category: cs.LG
TL;DR: Reformulates Echo State Networks as state-space models, providing theoretical foundations for stability analysis, linearization techniques, and improved training methods with Kalman filtering.
Details
Motivation: To establish first-principles theoretical foundations for ESNs by recasting them as state-space models, moving beyond heuristic design approaches and linking reservoir computing with classical system identification.Method: Develops unified systems-theoretic framework showing echo-state property as input-to-state stability, creates small-signal linearizations and Koopman feature expansions, and proposes Kalman/EKF-assisted readout learning with EM for hyperparameter optimization.
Result: Provides verifiable stability conditions, enables frequency-domain memory analysis, clarifies when ESNs emulate structured SSM kernels, and offers improved training procedures with spectral shaping capabilities.
Conclusion: The state-space perspective unifies reservoir computing with classical and modern system identification, providing rigorous theoretical foundations and practical improvements for ESN design and training.
Abstract: Echo State Networks (ESNs) are typically presented as efficient, readout-trained recurrent models, yet their dynamics and design are often guided by heuristics rather than first principles. We recast ESNs explicitly as state-space models (SSMs), providing a unified systems-theoretic account that links reservoir computing with classical identification and modern kernelized SSMs. First, we show that the echo-state property is an instance of input-to-state stability for a contractive nonlinear SSM and derive verifiable conditions in terms of leak, spectral scaling, and activation Lipschitz constants. Second, we develop two complementary mappings: (i) small-signal linearizations that yield locally valid LTI SSMs with interpretable poles and memory horizons; and (ii) lifted/Koopman random-feature expansions that render the ESN a linear SSM in an augmented state, enabling transfer-function and convolutional-kernel analyses. This perspective yields frequency-domain characterizations of memory spectra and clarifies when ESNs emulate structured SSM kernels. Third, we cast teacher forcing as state estimation and propose Kalman/EKF-assisted readout learning, together with EM for hyperparameters (leak, spectral radius, process/measurement noise) and a hybrid subspace procedure for spectral shaping under contraction constraints.
[337] Unveiling the Role of Data Uncertainty in Tabular Deep Learning
Nikolay Kartashev, Ivan Rubachev, Artem Babenko
Main category: cs.LG
TL;DR: The paper explains that data uncertainty is the key reason behind the success of modern tabular deep learning methods, showing how techniques like numerical embeddings and ensembling implicitly manage uncertainty.
Details
Motivation: To address the lack of understanding about why recent tabular deep learning techniques perform so well in practice despite limited theoretical explanations.Method: Analyzes how beneficial design choices in tabular DL (numerical feature embeddings, retrieval-augmented models, advanced ensembling) implicitly handle data uncertainty, and develops improved numerical embeddings based on these insights.
Result: Provides a unifying explanation for recent performance improvements in tabular DL and develops more effective numerical feature embeddings as a practical outcome.
Conclusion: The work establishes a foundational understanding of modern tabular methods, leads to concrete advancements in existing techniques, and outlines future research directions for tabular deep learning.
Abstract: Recent advancements in tabular deep learning have demonstrated exceptional practical performance, yet the field often lacks a clear understanding of why these techniques actually succeed. To address this gap, our paper highlights the importance of the concept of data uncertainty for explaining the effectiveness of the recent tabular DL methods. In particular, we reveal that the success of many beneficial design choices in tabular DL, such as numerical feature embeddings, retrieval-augmented models and advanced ensembling strategies, can be largely attributed to their implicit mechanisms for managing high data uncertainty. By dissecting these mechanisms, we provide a unifying understanding of the recent performance improvements. Furthermore, the insights derived from this data-uncertainty perspective directly allowed us to develop more effective numerical feature embeddings as an immediate practical outcome of our analysis. Overall, our work paves the way to foundational understanding of the benefits introduced by modern tabular methods that results in the concrete advancements of existing techniques and outlines future research directions for tabular DL.
[338] Towards Cognitively-Faithful Decision-Making Models to Improve AI Alignment
Cyrus Cousins, Vijay Keswani, Vincent Conitzer, Hoda Heidari, Jana Schaich Borg, Walter Sinnott-Armstrong
Main category: cs.LG
TL;DR: The paper proposes an axiomatic approach to learn cognitively faithful decision models from pairwise comparisons that better capture human heuristic processes, addressing limitations of standard preference elicitation methods.
Details
Motivation: Standard preference elicitation methods fail to capture true human cognitive processes like heuristics, resulting in models that don't align with human decision-making and lack generalization capabilities.Method: An axiomatic approach defining models where individual features are processed and compared across alternatives first, then aggregated via fixed rules like Bradley-Terry, ensuring realistic representation of human decision processes.
Result: The proposed models match or surpass prior models’ accuracy in human pairwise decision-making tasks, demonstrated effectively in a kidney allocation scenario.
Conclusion: The structured processing approach provides more interpretable and cognitively faithful models of human decision-making that better capture underlying cognitive processes compared to standard methods.
Abstract: Recent AI work trends towards incorporating human-centric objectives, with the explicit goal of aligning AI models to personal preferences and societal values. Using standard preference elicitation methods, researchers and practitioners build models of human decisions and judgments, which are then used to align AI behavior with that of humans. However, models commonly used in such elicitation processes often do not capture the true cognitive processes of human decision making, such as when people use heuristics to simplify information associated with a decision problem. As a result, models learned from people’s decisions often do not align with their cognitive processes, and can not be used to validate the learning framework for generalization to other decision-making tasks. To address this limitation, we take an axiomatic approach to learning cognitively faithful decision processes from pairwise comparisons. Building on the vast literature characterizing the cognitive processes that contribute to human decision-making, and recent work characterizing such processes in pairwise comparison tasks, we define a class of models in which individual features are first processed and compared across alternatives, and then the processed features are then aggregated via a fixed rule, such as the Bradley-Terry rule. This structured processing of information ensures such models are realistic and feasible candidates to represent underlying human decision-making processes. We demonstrate the efficacy of this modeling approach in learning interpretable models of human decision making in a kidney allocation task, and show that our proposed models match or surpass the accuracy of prior models of human pairwise decision-making.
[339] ChronoGraph: A Real-World Graph-Based Multivariate Time Series Dataset
Adrian Catalin Lutu, Ioana Pintilie, Elena Burceanu, Andrei Manolache
Main category: cs.LG
TL;DR: ChronoGraph is a new dataset for multivariate time series forecasting with graph structure, built from real production microservices with performance metrics and service dependencies, including anomaly labels for incident evaluation.
Details
Motivation: Existing benchmarks lack the combination of multivariate time series, explicit dependency graphs, and real incident annotations needed for realistic microservice system analysis.Method: Built dataset from real-world production microservices where nodes are services emitting performance metrics (CPU, memory, network) and edges encode service dependencies, with expert-annotated incident windows.
Result: Provides a unique benchmark combining multivariate time series, machine-readable dependency graph, and real anomaly labels, with baseline results for forecasting models and anomaly detectors.
Conclusion: ChronoGraph offers a realistic benchmark for structure-aware forecasting and incident-aware evaluation in microservice systems, filling a gap in existing datasets.
Abstract: We present ChronoGraph, a graph-structured multivariate time series forecasting dataset built from real-world production microservices. Each node is a service that emits a multivariate stream of system-level performance metrics, capturing CPU, memory, and network usage patterns, while directed edges encode dependencies between services. The primary task is forecasting future values of these signals at the service level. In addition, ChronoGraph provides expert-annotated incident windows as anomaly labels, enabling evaluation of anomaly detection methods and assessment of forecast robustness during operational disruptions. Compared to existing benchmarks from industrial control systems or traffic and air-quality domains, ChronoGraph uniquely combines (i) multivariate time series, (ii) an explicit, machine-readable dependency graph, and (iii) anomaly labels aligned with real incidents. We report baseline results spanning forecasting models, pretrained time-series foundation models, and standard anomaly detectors. ChronoGraph offers a realistic benchmark for studying structure-aware forecasting and incident-aware evaluation in microservice systems.
[340] Diffusion on language model encodings for protein sequence generation
Viacheslav Meshchaninov, Pavel Strashnov, Andrey Shevtsov, Fedor Nikolaev, Nikita Ivanisenko, Olga Kardymon, Dmitry Vetrov
Main category: cs.LG
TL;DR: DiMA is a continuous latent diffusion framework for protein sequence design that operates on protein language model representations, achieving high performance across multiple protein encoders and demonstrating versatile conditional generation capabilities.
Details
Motivation: Protein sequence design has advanced with discrete diffusion and autoregressive methods, but continuous diffusion approaches remain underexplored despite their potential benefits.Method: Developed DiMA, a latent diffusion framework that uses protein language model representations. Systematically explored architectural choices and diffusion components to create a robust methodology that generalizes across protein encoders from 8M to 3B parameters.
Result: DiMA achieves consistently high performance across sequence-only, dual-decodable, and multimodal representations using the same architecture. It produces novel, high-quality, diverse protein sequences and outperforms baselines including autoregressive, discrete diffusion, and flow matching language models.
Conclusion: DiMA provides a universal continuous diffusion framework for protein sequence generation that offers both architectural insights and practical applicability across various protein design scenarios, supporting conditional generation tasks like protein family-generation, motif scaffolding, and fold-specific sequence design.
Abstract: Protein sequence design has seen significant advances through discrete diffusion and autoregressive approaches, yet the potential of continuous diffusion remains underexplored. Here, we present DiMA, a latent diffusion framework that operates on protein language model representations. Through systematic exploration of architectural choices and diffusion components, we develop a robust methodology that generalizes across multiple protein encoders ranging from 8M to 3B parameters. We demonstrate that our framework achieves consistently high performance across sequence-only (ESM-2, ESMc), dual-decodable (CHEAP), and multimodal (SaProt) representations using the same architecture and training approach. We extensively evaluate existing methods alongside DiMA using multiple metrics across two protein modalities, covering quality, diversity, novelty, and distribution matching of generated proteins. DiMA consistently produces novel, high-quality and diverse protein sequences and achieves strong results compared to baselines such as autoregressive, discrete diffusion and flow matching language models. The model demonstrates versatile functionality, supporting conditional generation tasks including protein family-generation, motif scaffolding and infilling, and fold-specific sequence design. This work provides a universal continuous diffusion framework for protein sequence generation, offering both architectural insights and practical applicability across various protein design scenarios.
[341] Unisolver: PDE-Conditional Transformers Towards Universal Neural PDE Solvers
Hang Zhou, Yuezhou Ma, Haixu Wu, Haowen Wang, Mingsheng Long
Main category: cs.LG
TL;DR: Unisolver is a Transformer-based universal neural PDE solver that uses domain-wise and point-wise conditioning on PDE components to achieve state-of-the-art performance across diverse PDE types.
Details
Motivation: Existing neural PDE solvers are limited to specific PDE instances with restricted coefficients, lacking generalization to diverse PDEs needed for practical surrogate modeling of numerical solvers.Method: Defines complete set of PDE components (equation symbols, boundary conditions) and embeds them as domain-wise and point-wise conditions for Transformer architecture, integrating physical insights with Transformer advances.
Result: Achieves consistent state-of-the-art performance on three challenging large-scale benchmarks, demonstrating impressive performance and generalizability across diverse PDEs.
Conclusion: Unisolver represents a significant step towards universal neural PDE solvers by combining theoretical PDE analysis with modern Transformer architecture and conditioning techniques.
Abstract: Deep models have recently emerged as promising tools to solve partial differential equations (PDEs), known as neural PDE solvers. While neural solvers trained from either simulation data or physics-informed loss can solve PDEs reasonably well, they are mainly restricted to a few instances of PDEs, e.g. a certain equation with a limited set of coefficients. This limits their generalization to diverse PDEs, preventing them from being practical surrogate models of numerical solvers. In this paper, we present Unisolver, a novel Transformer model trained on diverse data and conditioned on diverse PDEs, aiming towards a universal neural PDE solver capable of solving a wide scope of PDEs. Instead of purely scaling up data and parameters, Unisolver stems from the theoretical analysis of the PDE-solving process. Inspired by the mathematical structure of PDEs that a PDE solution is fundamentally governed by a series of PDE components such as equation symbols and boundary conditions, we define a complete set of PDE components and flexibly embed them as domain-wise and point-wise deep conditions for Transformer PDE solvers. Integrating physical insights with recent Transformer advances, Unisolver achieves consistent state-of-the-art on three challenging large-scale benchmarks, showing impressive performance and generalizability. Code is available at https://github.com/thuml/Unisolver.
[342] Long Input Sequence Network for Long Time Series Forecasting
Chao Ma, Yikai Hou, Xiang Li, Yinggang Sun, Haining Yu
Main category: cs.LG
TL;DR: MTPR addresses overfitting in long time-series forecasting by decoupling multi-scale patterns using period-specific token sizes, enabling 10x longer inputs with 38% precision improvement and 0.22x cost.
Details
Motivation: Short fixed-length inputs cause overfitting in long time-series forecasting. Current models struggle with multi-scale pattern coupling and fixed focusing scales, leading to accuracy deterioration.Method: Introduces series-decomposition module (MPSD) and Multi-Token Pattern Recognition neural network (MTPR) to decouple multi-scale temporal patterns, using period-specific token sizes for each pattern.
Result: Enables handling inputs up to 10x longer, achieves 38% maximum precision improvement, reduces computational cost to 0.22x, and provides high interpretability.
Conclusion: Decoupling multi-scale patterns with period-specific token sizes effectively solves overfitting in long time-series forecasting while improving performance and reducing complexity.
Abstract: Short fixed-length inputs are the main bottleneck of deep learning methods in long time-series forecasting tasks. Prolonging input length causes overfitting, rapidly deteriorating accuracy. Our research indicates that the overfitting is a combination reaction of the multi-scale pattern coupling in time series and the fixed focusing scale of current models. First, we find that the patterns exhibited by a time series across various scales are reflective of its multi-periodic nature, where each scale corresponds to specific period length. Second, We find that the token size predominantly dictates model behavior, as it determines the scale at which the model focuses and the context size it can accommodate. Our idea is to decouple the multi-scale temporal patterns of time series and to model each pattern with its corresponding period length as token size. We introduced a novel series-decomposition module(MPSD), and a Multi-Token Pattern Recognition neural network(MTPR), enabling the model to handle \textit{inputs up to $10\times$ longer}. Sufficient context enhances performance(\textit{38% maximum precision improvement}), and the decoupling approach offers \textit{Low complexity($0.22\times$ cost)} and \textit{high interpretability}.
[343] Robust training of implicit generative models for multivariate and heavy-tailed distributions with an invariant statistical loss
José Manuel de Frutos, Manuel A. Vázquez, Pablo Olmos, Joaquín Míguez
Main category: cs.LG
TL;DR: Pareto-ISL extends invariant statistical loss to handle heavy-tailed and multivariate data distributions using generalized Pareto distribution noise and random projections.
Details
Motivation: Traditional implicit generative models struggle with unstable training, mode dropping, and capturing heavy-tailed distributions. Real-world data often requires heavy-tailed distributions for proper characterization.Method: Extends ISL with GPD input noise for heavy-tailed data, and uses random projections for multivariate data to create tractable loss functions for high-dimensional spaces.
Result: Pareto-ISL accurately models distribution tails while capturing central characteristics, shows robustness across hyperparameters, and prevents mode collapse when used as GAN pretraining.
Conclusion: The proposed Pareto-ISL method effectively handles both heavy-tailed and multivariate data distributions, overcoming limitations of traditional implicit generative models with stable training and improved performance.
Abstract: Traditional implicit generative models are capable of learning highly complex data distributions. However, their training involves distinguishing real data from synthetically generated data using adversarial discriminators, which can lead to unstable training dynamics and mode dropping issues. In this work, we build on the \textit{invariant statistical loss} (ISL) method introduced in \cite{de2024training}, and extend it to handle heavy-tailed and multivariate data distributions. The data generated by many real-world phenomena can only be properly characterised using heavy-tailed probability distributions, and traditional implicit methods struggle to effectively capture their asymptotic behavior. To address this problem, we introduce a generator trained with ISL, that uses input noise from a generalised Pareto distribution (GPD). We refer to this generative scheme as Pareto-ISL for conciseness. Our experiments demonstrate that Pareto-ISL accurately models the tails of the distributions while still effectively capturing their central characteristics. The original ISL function was conceived for 1D data sets. When the actual data is $n$-dimensional, a straightforward extension of the method was obtained by targeting the $n$ marginal distributions of the data. This approach is computationally infeasible and ineffective in high-dimensional spaces. To overcome this, we extend the 1D approach using random projections and define a new loss function suited for multivariate data, keeping problems tractable by adjusting the number of projections. We assess its performance in multidimensional generative modeling and explore its potential as a pretraining technique for generative adversarial networks (GANs) to prevent mode collapse, reporting promising results and highlighting its robustness across various hyperparameter settings.
[344] Quantifying Calibration Error in Neural Networks Through Evidence-Based Theory
Koffi Ismael Ouattara, Ioannis Krontiris, Theo Dimitrakos, Frank Kargl
Main category: cs.LG
TL;DR: Novel framework using subjective logic to quantify neural network trustworthiness by enhancing Expected Calibration Error evaluation, providing measures of trust, disbelief, and uncertainty through probability clustering and opinion fusion.
Details
Motivation: Traditional performance metrics like accuracy and precision fail to capture reliability, confidence, and uncertainty aspects in neural networks, especially when models exhibit overconfidence, which is critical for deployment in sensitive applications.Method: Incorporates subjective logic into Expected Calibration Error (ECE) evaluation by clustering predicted probabilities and fusing opinions using appropriate fusion operators to provide comprehensive trustworthiness measures.
Result: Experiments on MNIST and CIFAR-10 datasets show improved trustworthiness through post-calibration results, demonstrating the framework’s effectiveness.
Conclusion: The proposed framework offers a more interpretable and nuanced assessment of AI models’ trustworthiness, with significant potential applications in sensitive domains like healthcare and autonomous systems.
Abstract: Trustworthiness in neural networks is crucial for their deployment in critical applications, where reliability, confidence, and uncertainty play pivotal roles in decision-making. Traditional performance metrics such as accuracy and precision fail to capture these aspects, particularly in cases where models exhibit overconfidence. To address these limitations, this paper introduces a novel framework for quantifying the trustworthiness of neural networks by incorporating subjective logic into the evaluation of Expected Calibration Error (ECE). This method provides a comprehensive measure of trust, disbelief, and uncertainty by clustering predicted probabilities and fusing opinions using appropriate fusion operators. We demonstrate the effectiveness of this approach through experiments on MNIST and CIFAR-10 datasets, where post-calibration results indicate improved trustworthiness. The proposed framework offers a more interpretable and nuanced assessment of AI models, with potential applications in sensitive domains such as healthcare and autonomous systems.
[345] Kolb-Based Experiential Learning for Generalist Agents with Human-Level Kaggle Data Science Performance
Antoine Grosnit, Alexandre Maraval, Refinath S N, Zichao Zhao, James Dora, Giuseppe Paolo, Albert Thomas, Jonas Gonzalez, Abhineet Kumar, Khyati Khandelwal, Abdelhakim Benechehab, Hamza Cherkaoui, Youssef Attia El-Hili, Kun Shao, Jianye Hao, Jun Yao, Balázs Kégl, Jun Wang
Main category: cs.LG
TL;DR: Agent K is an AI system that implements Kolb’s experiential learning cycle and Vygotsky’s ZPD framework, achieving human-level performance in data science competitions by autonomously learning and improving through structured interaction and reflection.
Details
Motivation: Current AI systems lack mechanisms for continual adaptation and human-like experiential learning, despite showing early cognitive traits. The paper aims to design LLM agents capable of structured, cognitively grounded learning similar to human processes.Method: Proposes a computational framework separating extrinsic (environment interaction) and intrinsic (internal reflection/abstraction) functions, enabling scaffolded learning where agents initially learn within structured environments followed by open-ended generalization.
Result: Agent K achieved an Elo-MMR score of 1694, surpassing the median score of Kaggle Masters (top 2% of 200,000 users), with 9 gold, 8 silver, and 12 bronze medals including 4 gold and 4 silver in prize-awarding competitions across 81 data science tasks.
Conclusion: Agent K is the first AI system to successfully integrate Kolb- and Vygotsky-inspired human cognitive learning, representing a major step toward generalist AI capable of autonomous learning and complex task mastery.
Abstract: Human expertise emerges through iterative cycles of interaction, reflection, and internal model updating, which are central to cognitive theories such as Kolb’s experiential learning and Vygotsky’s zone of proximal development. In contrast, current AI systems, particularly LLM agents, rely on static pre-training or rigid workflows, lacking mechanisms for continual adaptation. Recent studies identified early cognitive traits in LLM agents (reflection, revision, and self-correction) suggesting foundational elements of human-like experiential learning. Thus the key question: Can we design LLM agents capable of structured, cognitively grounded learning similar to human processes? In response, we propose a computational framework of Kolb’s learning cycle with Vygotsky’s ZPD for autonomous agents. Our architecture separates extrinsic (environment interaction) and intrinsic (internal reflection/abstraction) functions, enabling cognitively grounded scaffolded learning, where the agent initially learns within structured environments, followed by open-ended generalisation. This approach empowers agents to master complex tasks ; domains that traditional fine-tuning or simple reflective methods could not tackle effectively. Its potential is powerfully demonstrated via direct comparison with humans in real-world Kaggle data science competitions. Learning fully automated data science code generation across 81 tasks, our system, Agent K, demonstrated the ability to perform the entire workflow autonomously, achieving an Elo-MMR score of 1694, beyond median score of the Kaggle Masters (the top 2% among 200,000 users) of our study. With 9 gold, 8 silver, and 12 bronze medals level performance - including 4 gold and 4 silver on prize-awarding competitions - Agent K is the 1st AI system to successfully integrate Kolb- and Vygotsky-inspired human cognitive learning, marking a major step toward generalist AI.
[346] Breaking the Context Bottleneck on Long Time Series Forecasting
Chao Ma, Yikai Hou, Xiang Li, Yinggang Sun, Haining Yu, Zhou Fang, Jiaxing Qu
Main category: cs.LG
TL;DR: LDM framework uses multiscale decomposition to improve long-term time-series forecasting by separating patterns at different scales, reducing non-stationarity and improving efficiency.
Details
Motivation: Long-term forecasting requires efficient processing of long sequences, but current models tend to overfit with extended inputs, limiting their effectiveness.Method: Proposes Logsparse Decomposable Multiscaling (LDM) framework that decouples patterns at different scales in time series to reduce non-stationarity and improve efficiency.
Result: LDM outperforms all baselines in long-term forecasting benchmarks while reducing both training time and memory costs.
Conclusion: Multiscale modeling through pattern decoupling enhances predictability, efficiency, and architectural simplicity for long sequence processing.
Abstract: Long-term time-series forecasting is essential for planning and decision-making in economics, energy, and transportation, where long foresight is required. To obtain such long foresight, models must be both efficient and effective in processing long sequence. Recent advancements have enhanced the efficiency of these models; however, the challenge of effectively leveraging longer sequences persists. This is primarily due to the tendency of these models to overfit when presented with extended inputs, necessitating the use of shorter input lengths to maintain tolerable error margins. In this work, we investigate the multiscale modeling method and propose the Logsparse Decomposable Multiscaling (LDM) framework for the efficient and effective processing of long sequences. We demonstrate that by decoupling patterns at different scales in time series, we can enhance predictability by reducing non-stationarity, improve efficiency through a compact long input representation, and simplify the architecture by providing clear task assignments. Experimental results demonstrate that LDM not only outperforms all baselines in long-term forecasting benchmarks, but also reducing both training time and memory costs.
[347] Extended Histogram-based Outlier Score (EHBOS)
Tanvir Islam
Main category: cs.LG
TL;DR: EHBOS extends HBOS by adding 2D histograms to capture feature dependencies, improving anomaly detection performance especially when feature interactions matter.
Details
Motivation: HBOS assumes feature independence, which limits its ability to detect anomalies in datasets where feature interactions are critical for identifying outliers.Method: Extended HBOS (EHBOS) incorporates two-dimensional histograms to capture dependencies between feature pairs, enabling detection of contextual and dependency-driven anomalies.
Result: EHBOS outperforms HBOS on several benchmark datasets, particularly those where feature interactions define anomaly structure, with notable ROC AUC improvements across 17 datasets.
Conclusion: EHBOS provides a valuable extension to HBOS that can model complex feature dependencies, offering a powerful tool for detecting contextual and relational anomalies.
Abstract: Histogram-Based Outlier Score (HBOS) is a widely used outlier or anomaly detection method known for its computational efficiency and simplicity. However, its assumption of feature independence limits its ability to detect anomalies in datasets where interactions between features are critical. In this paper, we propose the Extended Histogram-Based Outlier Score (EHBOS), which enhances HBOS by incorporating two-dimensional histograms to capture dependencies between feature pairs. This extension allows EHBOS to identify contextual and dependency-driven anomalies that HBOS fails to detect. We evaluate EHBOS on 17 benchmark datasets, demonstrating its effectiveness and robustness across diverse anomaly detection scenarios. EHBOS outperforms HBOS on several datasets, particularly those where feature interactions are critical in defining the anomaly structure, achieving notable improvements in ROC AUC. These results highlight that EHBOS can be a valuable extension to HBOS, with the ability to model complex feature dependencies. EHBOS offers a powerful new tool for anomaly detection, particularly in datasets where contextual or relational anomalies play a significant role.
[348] Reservoir kernels and Volterra series
Lukas Gonon, Lyudmila Grigoryeva, Juan-Pablo Ortega
Main category: cs.LG
TL;DR: A universal kernel called Volterra reservoir kernel is constructed to approximate causal time-invariant filters with fading memory. It uses reservoir functionals from Volterra series state-space representations and provides computable recursions for practical applications.
Details
Motivation: To create a universal kernel that can approximate any causal and time-invariant filter in the fading memory category, addressing the need for effective sequential learning tools for nonlinear systems with finite-dimensional inputs and outputs.Method: The kernel is built using reservoir functionals associated with a state-space representation of the Volterra series expansion for analytic fading memory filters. It operates on infinite-dimensional tensor algebra space but provides explicit computable recursions.
Result: The Volterra reservoir kernel demonstrates empirical performance in multidimensional nonlinear learning tasks, specifically tested on conditional covariances of financial asset returns, showing competitive results compared to standard static and sequential kernels.
Conclusion: The Volterra reservoir kernel provides a theoretically grounded and practically computable universal kernel for approximating causal time-invariant filters with fading memory, offering promising performance in complex nonlinear sequential learning applications.
Abstract: A universal kernel is constructed whose sections approximate any causal and time-invariant filter in the fading memory category with inputs and outputs in a finite-dimensional Euclidean space. This kernel is built using the reservoir functional associated with a state-space representation of the Volterra series expansion available for any analytic fading memory filter, and it is hence called the Volterra reservoir kernel. Even though the state-space representation and the corresponding reservoir feature map are defined on an infinite-dimensional tensor algebra space, the kernel map is characterized by explicit recursions that are readily computable for specific data sets when employed in estimation problems using the representer theorem. The empirical performance of the Volterra reservoir kernel is showcased and compared to other standard static and sequential kernels in a multidimensional and highly nonlinear learning task for the conditional covariances of financial asset returns.
[349] Towards Robust Graph Structural Learning Beyond Homophily via Preserving Neighbor Similarity
Yulin Zhu, Yuni Lai, Xing Ai, Wai Lun LO, Gaolei Li, Jianhua Li, Di Tang, Xingxing Zhang, Mengpei Yang, Kai Zhou
Main category: cs.LG
TL;DR: This paper explores the vulnerability of graph-based learning systems to adversarial attacks on both homophilic and heterophilic graphs, proposing a novel robust graph structural learning strategy with dual-kNN graph construction to preserve neighbor similarities.
Details
Motivation: While graph-based learning systems are known to be fragile to adversarial attacks on homophilic graphs, their security on heterophilic graphs remains unexplored. The paper aims to bridge this gap and develop robust methods that work across different homophily degrees.Method: The authors propose a robust graph structural learning strategy with dual-kNN graph construction pipeline that supervises neighbor-similarity-preserved propagation. The graph convolutional layer adaptively smooths or discriminates node pair features based on local structures.
Result: The proposed method can mine better topology of raw graph data under diverse graph homophily levels and achieve more reliable data management on both homophilic and heterophilic graphs.
Conclusion: The paper provides a theoretical foundation and practical solution for enhancing adversarial robustness of graph-based learning systems regardless of homophily degree, addressing a previously unexplored vulnerability in heterophilic graph settings.
Abstract: Despite the tremendous success of graph-based learning systems in handling structural data, it has been widely investigated that they are fragile to adversarial attacks on homophilic graph data, where adversaries maliciously modify the semantic and topology information of the raw graph data to degrade the predictive performances. Motivated by this, a series of robust models are crafted to enhance the adversarial robustness of graph-based learning systems on homophilic graphs. However, the security of graph-based learning systems on heterophilic graphs remains a mystery to us. To bridge this gap, in this paper, we start to explore the vulnerability of graph-based learning systems regardless of the homophily degree, and theoretically prove that the update of the negative classification loss is negatively correlated with the pairwise similarities based on the powered aggregated neighbor features. The theoretical finding inspires us to craft a novel robust graph structural learning strategy that serves as a useful graph mining module in a robust model that incorporates a dual-kNN graph constructions pipeline to supervise the neighbor-similarity-preserved propagation, where the graph convolutional layer adaptively smooths or discriminates the features of node pairs according to their affluent local structures. In this way, the proposed methods can mine the ``better” topology of the raw graph data under diverse graph homophily and achieve more reliable data management on homophilic and heterophilic graphs.
[350] Explaining Length Bias in LLM-Based Preference Evaluations
Zhengyu Hu, Linxin Song, Jieyu Zhang, Zheyuan Xiao, Tianfu Wang, Zhengyu Chen, Nicholas Jing Yuan, Jianxun Lian, Kaize Ding, Hui Xiong
Main category: cs.LG
TL;DR: LLM preference evaluations show bias toward longer responses. The paper decomposes win rate into length-independent desirability and length-dependent information mass, proposing AdapAlpaca to align response lengths for fair comparisons.
Details
Motivation: Large language models used as judges in preference comparisons exhibit significant bias favoring longer responses, which undermines the reliability of these evaluations.Method: Decompose preference evaluation metrics into desirability (length-independent quality) and information mass (length-dependent content). Propose AdapAlpaca, which aligns reference and test model response lengths under equivalent intervals to ensure fair comparisons.
Result: Empirical experiments demonstrate that response length impacts evaluations primarily through information mass. The decomposition approach successfully isolates length effects from content quality assessment.
Conclusion: AdapAlpaca provides a reliable evaluation metric that assesses content quality without being confounded by response length, addressing the bias issue in LLM preference judgments.
Abstract: The use of large language models (LLMs) as judges, particularly in preference comparisons, has become widespread, but this reveals a notable bias towards longer responses, undermining the reliability of such evaluations. To better understand such bias, we propose to decompose the preference evaluation metric, specifically the win rate, into two key components: desirability and information mass, where the former is length-independent and related to trustworthiness such as correctness, toxicity, and consistency, and the latter is length-dependent and represents the amount of information in the response. We empirically demonstrated the decomposition through controlled experiments and found that response length impacts evaluations by influencing information mass. To derive a reliable evaluation metric that assesses content quality without being confounded by response length, we propose AdapAlpaca, a simple yet effective adjustment to win rate measurement. Specifically, AdapAlpaca ensures a fair comparison of response quality by aligning the lengths of reference and test model responses under equivalent length intervals.
[351] Moco: A Learnable Meta Optimizer for Combinatorial Optimization
Tim Dernedde, Daniela Thyssens, Sören Dittrich, Maximilian Stubbemann, Lars Schmidt-Thieme
Main category: cs.LG
TL;DR: Moco is a learnable meta-optimizer that uses a neural network to update a continuous heatmap vector for solving combinatorial optimization problems, achieving state-of-the-art performance on TSP and MIS without problem-specific heuristics.
Details
Motivation: Traditional approaches for NP-hard combinatorial optimization problems rely on handcrafted heuristics. While neural networks can learn heuristics from data, existing methods struggle to improve solutions during inference. Moco aims to enable continuous improvement of solutions at inference time through learnable heatmap updates.Method: Moco uses a lightweight solution construction procedure guided by a continuous heatmap vector θ. A neural network updates θ based on current search state features. The training is budget-aware, targeting the best solution found during the entire search process without requiring optimal solutions.
Result: Moco significantly outperforms other heatmap-based methods on both Traveling Salesman Problem (TSP) and Maximum Independent Set (MIS) benchmarks.
Conclusion: Moco presents a fully learnable meta-optimizer approach that eliminates the need for problem-specific heuristics or optimal training solutions, demonstrating strong performance on challenging combinatorial optimization problems.
Abstract: Relevant combinatorial optimization problems (COPs) are often NP-hard. While they have been tackled mainly via handcrafted heuristics in the past, advances in neural networks have motivated the development of general methods to learn heuristics from data. Many approaches utilize a neural network to directly construct a solution, but are limited in further improving based on already constructed solutions at inference time. Our approach, Moco, defines a lightweight solution construction procedure, guided by a single continuous vector $\theta$ (called heatmap) and learns a neural network to update $\theta$ for a single instance of a COP at inference time. The update is based on various features of the current search state. The training procedure is budget aware, targeting the overall best solution found during the entire search. Moco is a fully learnable meta optimizer not utilizing problem specific heuristics or requiring optimal solutions for training. We test Moco on the Traveling Salesman Problem (TSP) and Maximum Independent Set (MIS) and show that it significantly improves over other heatmap based methods.
[352] How Can I Publish My LLM Benchmark Without Giving the True Answers Away?
Takashi Ishida, Thanawat Lodkaew, Ikko Yamane
Main category: cs.LG
TL;DR: Proposes a method to publish LLM benchmarks without revealing ground-truth answers by injecting randomness with multiple logically correct answers, enabling contamination detection while maintaining evaluation capability.
Details
Motivation: Prevent benchmark contamination in LLM training while avoiding the need for private benchmarks that require trust in a single organization and still permit test-set overfitting.Method: Inject randomness by preparing several logically correct answers for each question and including only one as the solution, reducing the Bayes accuracy ceiling to detect contamination.
Result: Experimental evidence shows the method can accurately detect data contamination across various benchmarks, models, and training methodologies.
Conclusion: The approach enables open benchmark publishing while protecting against contamination and providing a reliable mechanism for detecting when models have been trained on benchmark data.
Abstract: Publishing a large language model (LLM) benchmark on the Internet risks contaminating future LLMs: the benchmark may be unintentionally (or intentionally) used to train or select a model. A common mitigation is to keep the benchmark private and let participants submit their models or predictions to the organizers. However, this strategy will require trust in a single organization and still permits test-set overfitting through repeated queries. To overcome this issue, we propose a way to publish benchmarks without completely disclosing the ground-truth answers to the questions, while still maintaining the ability to openly evaluate LLMs. Our main idea is to inject randomness to the answers by preparing several logically correct answers, and only include one of them as the solution in the benchmark. This reduces the best possible accuracy, i.e., Bayes accuracy, of the benchmark. Not only is this helpful to keep us from disclosing the ground truth, but this approach also offers a test for detecting data contamination. In principle, even fully capable models should not surpass the Bayes accuracy. If a model surpasses this ceiling despite this expectation, this is a strong signal of data contamination. We present experimental evidence that our method can detect data contamination accurately on a wide range of benchmarks, models, and training methodologies.
[353] Uncertainty-Guided Likelihood Tree Search
Julia Grosse, Ruotian Wu, Ahmad Rashid, Cheng Zhang, Philipp Hennig, Pascal Poupart, Agustinus Kristiadi
Main category: cs.LG
TL;DR: Uncertainty-guided tree search algorithm for sparse reward settings using likelihood-based path evaluation, requiring fewer expensive evaluations than existing methods.
Details
Motivation: Tree search faces combinatorial explosion in sparse reward settings, especially when likelihood evaluations are expensive (e.g., querying large language models).Method: Derives probabilistic search heuristic based on regularity assumptions for likelihood, performs backtracking and exploration-exploitation trade-off without expensive roll-outs or Bayesian inference.
Result: Extensive experiments show the method identifies high-likelihood paths while requiring fewer costly evaluations compared to existing approaches.
Conclusion: The proposed uncertainty-guided tree search effectively handles sparse reward settings with expensive evaluations through a novel probabilistic heuristic approach.
Abstract: Tree search is a fundamental tool for planning, as many sequential decision-making problems can be framed as searching over tree-structured spaces. We propose an uncertainty-guided tree search algorithm for settings where the reward function is a log-likelihood function of the paths. Due to the combinatorial explosion of the tree size, the set of paths for which one can obtain rewards is sparse, particularly when the likelihood is obtained through expensive evaluations, such as by querying a large language model. We address this challenge by deriving an probabilistic search heuristic based on regularity assumptions for the likelihood. Unlike existing tree search methods, the proposed method can perform backtracking and trade-off exploration with exploitation, and yet does not require expensive roll-outs, or sophisticated Bayesian inference. Through extensive on-model and off-model experiments on timely, large-scale practical applications, we demonstrate that our method identifies paths with high likelihood while requiring fewer costly evaluations.
[354] Is Random Attention Sufficient for Sequence Modeling? Disentangling Trainable Components in the Transformer
Yihe Dong, Lorenzo Noci, Mikhail Khodak, Mufan Li
Main category: cs.LG
TL;DR: Transformers can perform competitively even with frozen attention weights, with attention mainly handling in-context reasoning and MLPs handling knowledge storage, showing the architecture has strong inductive bias for specialized circuits.
Details
Motivation: To understand how much of transformer performance gains are attributable to self-attention mechanism versus other components like MLP layers.Method: Compare standard transformers to variants with frozen MLP layers or frozen attention weights, develop MixiT architecture with random attention scores, and prove expressivity results for transformers with frozen key/query weights.
Result: Attention with frozen key/query weights can form induction heads and perform competitively on language modeling. MixiT shows attention is responsible for in-context reasoning while MLPs handle knowledge storage.
Conclusion: Transformer architecture has built-in inductive bias towards forming specialized circuits even without learnable attention weights, with attention and MLPs playing complementary but distinct roles.
Abstract: The transformer architecture is central to the success of modern Large Language Models (LLMs), in part due to its surprising ability to perform a wide range of tasks - including mathematical reasoning, memorization, and retrieval
- using only gradient-based learning on next-token prediction. While the core component of a transformer is the self-attention mechanism, we question how much, and which aspects, of the performance gains can be attributed to it. To this end, we compare standard transformers to variants in which either the MLP layers or the attention weights are frozen at initialization. Surprisingly, we find that attention with frozen key and query weights is not only able to form induction heads, but can also perform competitively on language modeling. We formalize this by proving a new expressivity result for transformer models with frozen key and query weights. To further isolate the contribution of attention, we design MixiT, an architecture with entirely random attention scores, with provably stable signal propagation that overcomes prior depth-wise scaling challenges in random transformers. We use the successes and failures of MixiT to understand the role each transformer component plays, such as attention being largely responsible for in-context reasoning, and MLPs being responsible for, but collaborates with attention, on knowledge storage. Our results suggest that the transformer architecture has a built-in inductive bias towards forming specialized circuits, as it does even without learnable attention weights.
[355] Retrieval-Augmented Generation with Estimation of Source Reliability
Jeongyeon Hwang, Junyoung Park, Hyejin Park, Dongwoo Kim, Sangdon Park, Jungseul Ok
Main category: cs.LG
TL;DR: RA-RAG is a reliability-aware retrieval-augmented generation framework that estimates source reliability and prioritizes highly reliable documents to improve factual accuracy in LLMs.
Details
Motivation: Standard RAG relies solely on relevance between query and document, overlooking heterogeneous source reliability which risks retrieving incorrect information.Method: Estimates source reliability by cross-checking information across multiple sources, retrieves from top reliable sources, and aggregates using weighted majority voting.
Result: Consistently outperforms baselines in heterogeneous reliability scenarios and scales efficiently with increasing sources.
Conclusion: RA-RAG provides a practical solution for improving factual accuracy in multi-source RAG systems by incorporating reliability estimation.
Abstract: Retrieval-Augmented Generation (RAG) is an effective approach to enhance the factual accuracy of large language models (LLMs) by retrieving information from external databases, which are typically composed of diverse sources, to supplement the limited internal knowledge of LLMs. However, the standard RAG often risks retrieving incorrect information, as it relies solely on relevance between a query and a document, overlooking the heterogeneous reliability of these sources. To address this issue, we propose Reliability-Aware RAG (RA-RAG), a new multi-source RAG framework that estimates the reliability of sources and leverages this information to prioritize highly reliable and relevant documents, ensuring more robust and accurate response generation. Specifically, RA-RAG first estimates source reliability by cross-checking information across multiple sources. It then retrieves documents from the top-$\kappa$ reliable and relevant sources and aggregates their information using weighted majority voting (WMV), where the selective retrieval ensures scalability while not compromising the performance. Comprehensive experiments show that RA-RAG consistently outperforms baselines in scenarios with heterogeneous source reliability while scaling efficiently as the number of sources increases. Furthermore, we demonstrate the ability of RA-RAG to estimate real-world sources’ reliability, highlighting its practical applicability. \jy{Our code and data are available at \href{https://github.com/ml-postech/RA-RAG}{RA-RAG}.}
[356] Stochastic Parameter Decomposition
Lucius Bushnaq, Dan Braun, Lee Sharkey
Main category: cs.LG
TL;DR: SPD is a more scalable and robust alternative to APD for linear parameter decomposition of neural networks, enabling decomposition of larger models with better performance.
Details
Motivation: Current decomposition methods like APD are computationally expensive and hyperparameter-sensitive, limiting their practical application to larger neural networks.
Method: Stochastic Parameter Decomposition (SPD) - a new method that bridges causal mediation analysis and network decomposition, offering improved scalability and robustness.
Result: SPD successfully decomposes larger and more complex models than APD, avoids parameter shrinkage issues, and better identifies ground truth mechanisms in toy models.
Conclusion: SPD removes barriers to scaling linear parameter decomposition methods, opening new research possibilities in mechanistic interpretability for larger neural networks.
Abstract: A key step in reverse engineering neural networks is to decompose them into simpler parts that can be studied in relative isolation. Linear parameter decomposition – a framework that has been proposed to resolve several issues with current decomposition methods – decomposes neural network parameters into a sum of sparsely used vectors in parameter space. However, the current main method in this framework, Attribution-based Parameter Decomposition (APD), is impractical on account of its computational cost and sensitivity to hyperparameters. In this work, we introduce Stochastic Parameter Decomposition (SPD), a method that is more scalable and robust to hyperparameters than APD, which we demonstrate by decomposing models that are slightly larger and more complex than was possible to decompose with APD. We also show that SPD avoids other issues, such as shrinkage of the learned parameters, and better identifies ground truth mechanisms in toy models. By bridging causal mediation analysis and network decomposition methods, this demonstration opens up new research possibilities in mechanistic interpretability by removing barriers to scaling linear parameter decomposition methods to larger models. We release a library for running SPD and reproducing our experiments at https://github.com/goodfire-ai/spd/tree/spd-paper.
[357] Zero-shot Generalization in Inventory Management: Train, then Estimate and Decide
Tarkan Temizöz, Christina Imdahl, Remco Dijkman, Douniel Lamghari-Idrissi, Willem van Jaarsveld
Main category: cs.LG
TL;DR: Proposes a Train, then Estimate and Decide (TED) framework for training generally capable DRL agents that can handle inventory management with unknown demand and lead time parameters through zero-shot generalization.
Details
Motivation: Addresses challenges in deploying deep reinforcement learning for real-world inventory management, particularly dynamic environments and uncertain problem parameters like demand and lead time distributions, highlighting a research gap for sequential decision-making under parameter uncertainty.
Method: Introduces a unifying Super-Markov Decision Process formulation and the TED framework with three phases: training generally capable agents on varied problem instances, continuously estimating parameters during deployment, and making decisions based on estimates.
Result: The Generally Capable Lost Sales Network (GC-LSN) consistently outperforms traditional policies when parameters are known, and when parameters are unknown, it complements online learning methods with superior empirical performance when paired with Kaplan-Meier estimator.
Conclusion: The proposed framework successfully enables zero-shot generalization for inventory management under parameter uncertainty, demonstrating that trained agents can handle unseen problem instances without retraining while outperforming existing methods.
Abstract: Deploying deep reinforcement learning (DRL) in real-world inventory management presents challenges, including dynamic environments and uncertain problem parameters, e.g. demand and lead time distributions. These challenges highlight a research gap, suggesting a need for a unifying framework to model and solve sequential decision-making under parameter uncertainty. We address this by exploring an underexplored area of DRL for inventory management: training generally capable agents (GCAs) under zero-shot generalization (ZSG). Here, GCAs are advanced DRL policies designed to handle a broad range of sampled problem instances with diverse inventory challenges. ZSG refers to the ability to successfully apply learned policies to unseen instances with unknown parameters without retraining. We propose a unifying Super-Markov Decision Process formulation and the Train, then Estimate and Decide (TED) framework to train and deploy a GCA tailored to inventory management applications. The TED framework consists of three phases: training a GCA on varied problem instances, continuously estimating problem parameters during deployment, and making decisions based on these estimates. Applied to periodic review inventory problems with lost sales, cyclic demand patterns, and stochastic lead times, our trained agent, the Generally Capable Lost Sales Network (GC-LSN), consistently outperforms well-known traditional policies when problem parameters are known. Moreover, under conditions where demand and/or lead time distributions are initially unknown and must be estimated, we benchmark against online learning methods that provide worst-case performance guarantees. Our GC-LSN policy, paired with the Kaplan-Meier estimator, is demonstrated to complement these methods by providing superior empirical performance.
[358] An Analysis of Action-Value Temporal-Difference Methods That Learn State Values
Brett Daley, Prabhat Nagarajan, Martha White, Marlos C. Machado
Main category: cs.LG
TL;DR: Analysis of TD learning methods that use two asymmetric value functions (QV-learning and AV-learning) vs single value function approaches, showing AV-learning methods outperform Q-learning in control settings and introducing a new RDQ algorithm that beats Dueling DQN.
Details
Motivation: To understand when and why learning two value functions (state values + action values) is advantageous over single value function approaches like Q-learning, and to provide theoretical analysis of convergence and sample efficiency for these methods.
Method: Analyzed QV-learning and AV-learning algorithmic families in terms of convergence properties and sample efficiency. Compared them against Expected Sarsa (prediction) and Q-learning (control). Introduced Regularized Dueling Q-learning (RDQ) as a new AV-learning algorithm.
Result: Both QV and AV-learning are more efficient than Expected Sarsa in prediction setting. Only AV-learning methods offer major benefits over Q-learning in control setting. RDQ significantly outperforms Dueling DQN in MinAtar benchmark.
Conclusion: Learning two asymmetric value functions can be advantageous, particularly AV-learning methods for control problems. The new RDQ algorithm demonstrates superior performance compared to existing approaches, validating the value of this two-function learning paradigm.
Abstract: The hallmark feature of temporal-difference (TD) learning is bootstrapping: using value predictions to generate new value predictions. The vast majority of TD methods for control learn a policy by bootstrapping from a single action-value function (e.g., Q-learning and Sarsa). Significantly less attention has been given to methods that bootstrap from two asymmetric value functions: i.e., methods that learn state values as an intermediate step in learning action values. Existing algorithms in this vein can be categorized as either QV-learning or AV-learning. Though these algorithms have been investigated to some degree in prior work, it remains unclear if and when it is advantageous to learn two value functions instead of just one – and whether such approaches are theoretically sound in general. In this paper, we analyze these algorithmic families in terms of convergence and sample efficiency. We find that while both families are more efficient than Expected Sarsa in the prediction setting, only AV-learning methods offer any major benefit over Q-learning in the control setting. Finally, we introduce a new AV-learning algorithm called Regularized Dueling Q-learning (RDQ), which significantly outperforms Dueling DQN in the MinAtar benchmark.
[359] MARS: Unleashing the Power of Variance Reduction for Training Large Models
Huizhuo Yuan, Yifeng Liu, Shuang Wu, Xun Zhou, Quanquan Gu
Main category: cs.LG
TL;DR: MARS is a unified optimization framework that combines variance reduction with preconditioned gradient methods to improve training efficiency for large models like GPT-2, outperforming AdamW significantly.
Details
Motivation: Variance reduction algorithms have not been widely successful in training deep neural networks and large language models despite their theoretical benefits, making them less favored in modern AI. The goal is to unleash the power of variance reduction for efficient training of large models.
Method: Proposes MARS framework that reconciles preconditioned gradient methods with variance reduction via scaled stochastic recursive momentum technique. Introduces three instances based on AdamW, Lion, and Shampoo preconditioned gradient updates.
Result: Experimental results on training GPT-2 models show that MARS consistently outperforms AdamW by a large margin.
Conclusion: MARS successfully integrates variance reduction with modern optimization techniques, demonstrating significant performance improvements for large model training and making variance reduction relevant again in modern AI applications.
Abstract: Training deep neural networks, and more recently large models, demands efficient and scalable optimizers. Adaptive gradient algorithms like Adam, AdamW, and their variants have been central to this task. Despite the development of numerous variance reduction algorithms in the past decade aimed at accelerating stochastic optimization in both convex and nonconvex settings, variance reduction has not found widespread success in training deep neural networks or large language models. Consequently, it has remained a less favored approach in modern AI. In this paper, to unleash the power of variance reduction for efficient training of large models, we propose a unified optimization framework, MARS (Make vAriance Reduction Shine), which reconciles preconditioned gradient methods with variance reduction via a scaled stochastic recursive momentum technique. Within our framework, we introduce three instances of MARS that leverage preconditioned gradient updates based on AdamW, Lion, and Shampoo, respectively. We also draw a connection between our algorithms and existing optimizers. Experimental results on training GPT-2 models indicate that MARS consistently outperforms AdamW by a large margin. The implementation of MARS is available at https://github.com/AGI-Arena/MARS.
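For intuition, a sketch of a STORM-style scaled recursive-momentum correction of the kind MARS builds on; the corrected estimate would then feed a preconditioned update such as AdamW, Lion, or Shampoo. The coefficients are illustrative, not the paper's.

```python
def variance_reduced_grad(grad, prev_grad, gamma=0.025, beta=0.95):
    # Control-variate correction: adjust the current stochastic gradient by a
    # scaled difference against the gradient evaluated at the previous iterate.
    return grad + gamma * (beta / (1.0 - beta)) * (grad - prev_grad)
```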
[360] Segmenting Action-Value Functions Over Time-Scales in SARSA via TD($\Delta$)
Mahammad Humayoo
Main category: cs.LG
TL;DR: SARSA($\Delta$) extends TD($\Delta$) decomposition to SARSA algorithm, using multiple discount factors instead of a single constant one to improve bias-variance tradeoff and accelerate convergence in episodic RL.
Details
Motivation: Traditional SARSA algorithms struggle with optimal bias-variance balance due to reliance on a single constant discount factor, limiting performance in long-horizon RL tasks.
Method: Enhanced temporal difference decomposition TD($\Delta$) applied to SARSA, splitting action-value function into components linked to specific discount factors to facilitate learning across multiple time scales.
Result: SARSA($\Delta$) reduces bias in updates, accelerates convergence in both deterministic and stochastic environments including dense reward Atari games, outperforms existing TD learning techniques in tabular and deep RL benchmarks.
Conclusion: The proposed SARSA($\Delta$) method provides more effective and consistent learning, particularly beneficial for long-horizon improvement scenarios in reinforcement learning.
Abstract: In numerous episodic reinforcement learning (RL) environments, SARSA-based methodologies are employed to enhance policies aimed at maximizing returns over long horizons. Traditional SARSA algorithms face challenges in achieving an optimal balance between bias and variance, primarily due to their dependence on a single, constant discount factor ($\eta$). This investigation enhances the temporal difference decomposition method, TD($\Delta$), by applying it to the SARSA algorithm, now designated as SARSA($\Delta$). SARSA is a widely used on-policy RL method that enhances action-value functions via temporal difference updates. By splitting the action-value function into components that are linked to specific discount factors, SARSA($\Delta$) makes learning easier across a range of time scales. This analysis makes learning more effective and ensures consistency, particularly in situations where long-horizon improvement is needed. The results of this research show that the suggested strategy works to lower bias in SARSA’s updates and speed up convergence in both deterministic and stochastic settings, even in dense reward Atari environments. Experimental results from a variety of benchmark settings show that the proposed SARSA($\Delta$) outperforms existing TD learning techniques in both tabular and deep RL environments.
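A tabular sketch of the decomposition under the usual TD($\Delta$) targets: each component W_z approximates the difference of action-value functions at adjacent discount factors, and their sum recovers the full action value. Step size and indexing are illustrative, not taken from the paper.

```python
def sarsa_delta_update(W, s, a, r, s2, a2, etas, alpha=0.1):
    # W[z][s, a] approximates W_z = Q_{eta_z} - Q_{eta_{z-1}} (W_0 = Q_{eta_0}),
    # so sum_z W[z][s, a] recovers the action value at the largest discount.
    q_prev = 0.0  # running sum of W_u(s2, a2) for u <= z-1
    for z, eta in enumerate(etas):
        if z == 0:
            target = r + eta * W[0][s2, a2]
        else:
            target = (eta - etas[z - 1]) * q_prev + eta * W[z][s2, a2]
        W[z][s, a] += alpha * (target - W[z][s, a])
        q_prev += W[z][s2, a2]
```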
[361] Multi-Label Bayesian Active Learning with Inter-Label Relationships
Yuanyuan Qi, Jueqing Lu, Xiaohao Yang, Joanne Enticott, Lan Du
Main category: cs.LG
TL;DR: A novel multi-label active learning strategy that addresses label correlation and data imbalance challenges through progressive correlation matrices and ensemble pseudo labeling with beta scoring.
Details
Motivation: Existing multi-label active learning methods either require excessive computational resources to capture label correlations or fail to fully explore label dependencies, while also struggling with imbalanced data distributions in real-world scenarios.
Method: Incorporates progressively updated positive and negative correlation matrices to capture co-occurrence and disjoint relationships, uses ensemble pseudo labeling and beta scoring rules to address data imbalances, and provides holistic uncertainty assessment rather than treating labels as isolated elements.
Result: Extensive experiments on four realistic datasets demonstrate consistently reliable and superior performance compared to several established methods.
Conclusion: The proposed strategy effectively addresses both label correlation and data imbalance challenges in multi-label active learning, achieving more reliable performance than existing approaches.
Abstract: The primary challenge of multi-label active learning, distinguishing it from multi-class active learning, lies in assessing the informativeness of an indefinite number of labels while also accounting for the inherent label correlation. Existing studies either require substantial computational resources to leverage correlations or fail to fully explore label dependencies. Additionally, real-world scenarios often require addressing intrinsic biases stemming from imbalanced data distributions. In this paper, we propose a new multi-label active learning strategy to address both challenges. Our method incorporates progressively updated positive and negative correlation matrices to capture co-occurrence and disjoint relationships within the label space of annotated samples, enabling a holistic assessment of uncertainty rather than treating labels as isolated elements. Furthermore, alongside diversity, our model employs ensemble pseudo labeling and beta scoring rules to address data imbalances. Extensive experiments on four realistic datasets demonstrate that our strategy consistently achieves more reliable and superior performance, compared to several established methods.
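A toy stand-in for the correlation matrices, assuming a binary label matrix of annotated samples; the paper's progressive updates and its negative-correlation estimate may differ from the simple complement used here.

```python
import numpy as np

def label_correlations(Y):
    # Y: (n_samples, n_labels) binary matrix of annotated labels.
    counts = Y.sum(axis=0) + 1e-9
    co = Y.T @ Y                   # joint co-occurrence counts
    P = co / counts[:, None]       # P(label j present | label i present)
    N = 1.0 - P                    # crude disjointness proxy
    return P, N
```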
[362] Dataset Distillation as Pushforward Optimal Quantization
Hong Ye Tan, Emma Slade
Main category: cs.LG
TL;DR: Dataset distillation via optimal quantization reformulates disentangled methods as clustering in latent space, achieving SOTA performance on ImageNet with minimal computation.
Details
Motivation: Existing dataset distillation methods face scalability issues with large datasets. Disentangled methods offer speed advantages but need theoretical grounding and improved performance.
Method: Proposes Dataset Distillation by Optimal Quantization, linking disentangled methods to classical optimal quantization problems. Uses encoder-decoder structure and clustering in latent space to approximate probability distributions.
Result: Achieves better performance and inter-model generalization on ImageNet-1K than previous SOTA D4M with trivial additional computation. Obtains SOTA distillation performance using distilled noise initializations in diffusion transformer models.
Conclusion: Reformulating dataset distillation as optimal quantization provides theoretical consistency, scalability, and state-of-the-art performance across various settings while maintaining computational efficiency.
Abstract: Dataset distillation aims to find a synthetic training set such that training on the synthetic data achieves similar performance to training on real data, with orders of magnitude less computational requirements. Existing methods can be broadly categorized as either bi-level optimization problems that have neural network training heuristics as the lower level problem, or disentangled methods that bypass the bi-level optimization by matching distributions of data. The latter method has the major advantages of speed and scalability in terms of size of both training and distilled datasets. We demonstrate that when equipped with an encoder-decoder structure, the empirically successful disentangled methods can be reformulated as an optimal quantization problem, where a finite set of points is found to approximate the underlying probability measure by minimizing the expected projection distance. In particular, we link existing disentangled dataset distillation methods to the classical optimal quantization and Wasserstein barycenter problems, demonstrating consistency of distilled datasets for diffusion-based generative priors. We propose Dataset Distillation by Optimal Quantization, based on clustering in a latent space. Compared to the previous SOTA method D4M, we achieve better performance and inter-model generalization on the ImageNet-1K dataset with trivial additional computation, and SOTA performance in higher image-per-class settings. Using the distilled noise initializations in a stronger diffusion transformer model, we obtain SOTA distillation performance on ImageNet-1K and its subsets, outperforming diffusion guidance methods.
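A minimal sketch of distillation as latent-space quantization, assuming encode/decode callables mapping data to and from the latent space; the paper pairs this with diffusion-based generative priors, which are not reproduced here.

```python
from sklearn.cluster import KMeans

def distill_by_quantization(encode, decode, X, n_synthetic=10):
    # Cluster encoder latents and decode the centroids: the finite centroid
    # set approximates the latent distribution (optimal quantization view).
    Z = encode(X)                                        # (n, d) latents
    km = KMeans(n_clusters=n_synthetic, n_init=10).fit(Z)
    return decode(km.cluster_centers_)                   # synthetic samples
```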
[363] IC-Cache: Efficient Large Language Model Serving via In-context Caching
Yifan Yu, Yu Gan, Nikhil Sarda, Lillian Tsai, Jiaming Shen, Yanqi Zhou, Arvind Krishnamurthy, Fan Lai, Henry M. Levy, David Culler
Main category: cs.LG
TL;DR: IC-Cache is a caching system that improves LLM serving efficiency by using historical request-response pairs from larger models as in-context examples, enabling small LLMs to imitate larger models’ capabilities and reduce latency/cost.
Details
Motivation: Over 70% of user requests to LLMs have semantically similar counterparts, suggesting potential for knowledge transfer, but naive caching causes quality degradation. There's a need to improve serving efficiency while maintaining quality.
Method: IC-Cache selects similar high-utility examples from historical request-response pairs and prepends them to new requests. It uses adaptive routing across LLMs of varying capabilities and employs cost-aware cache replay to refine example quality offline.
Result: IC-Cache improves LLM serving throughput by 1.4-5.9x and reduces latency by 28-71% without compromising response quality, as demonstrated on millions of realistic requests.
Conclusion: IC-Cache successfully enables live LLM capability augmentation through intelligent caching and example selection, making LLM serving more efficient while maintaining quality through knowledge transfer from larger to smaller models.
Abstract: Large language models (LLMs) have excelled in various applications, yet serving them at scale is challenging due to their substantial resource demands and high latency. Our real-world studies reveal that over 70% of user requests to LLMs have semantically similar counterparts, suggesting the potential for knowledge transfer among requests. However, naively caching and reusing past responses leads to a big quality drop. In this paper, we introduce IC-Cache, a caching system that enables live LLM capability augmentation to improve serving efficiency: by leveraging historical request-response pairs from larger models as in-context examples, IC-Cache empowers small LLMs to imitate and even exceed the compositional abilities (e.g., reasoning) of their larger counterparts, enabling selective offloading of requests to reduce cost and latency. Achieving this live augmentation at scale introduces intricate trade-offs between response quality, latency, and system throughput. For a new request, IC-Cache efficiently selects similar, high-utility examples to prepend them to the new request’s input. At scale, it adaptively routes requests across LLMs of varying capabilities, accounting for response quality and serving loads. IC-Cache employs a cost-aware cache replay mechanism that refines example quality offline to maximize online cache utility and efficiency. Evaluations on millions of realistic requests demonstrate that IC-Cache improves LLM serving throughput by 1.4-5.9x and reduces latency by 28-71% without hurting response quality.
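A sketch of the cache lookup step, assuming precomputed embeddings for cached requests; utility scoring, routing across models, and the offline replay mechanism are omitted, and the threshold is illustrative.

```python
import numpy as np

def select_incontext_examples(query_emb, cache_embs, cache_pairs, k=3, tau=0.8):
    # Cosine similarity between the new request and cached requests; the
    # selected request-response pairs are prepended to the small model's input.
    sims = cache_embs @ query_emb / (
        np.linalg.norm(cache_embs, axis=1) * np.linalg.norm(query_emb) + 1e-9)
    order = np.argsort(-sims)[:k]
    return [cache_pairs[i] for i in order if sims[i] >= tau]
```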
[364] Probabilistic QoS Metric Forecasting in Delay-Tolerant Networks Using Conditional Diffusion Models on Latent Dynamics
Jianhua Liu, Zheng Liu, Yu Xiang, Yanwen Qu
Main category: cs.LG
TL;DR: Proposes diffusion model-based probabilistic forecasting for QoS metrics in DTNs, outperforming traditional methods by capturing uncertainty and complex temporal patterns.
Details
Motivation: Traditional mean regression methods fail to capture data complexity in DTN QoS prediction, leading to poor performance in operational tasks like routing. Probabilistic forecasting can quantify uncertainty and improve network performance.
Method: Formulates QoS prediction as probabilistic forecasting using diffusion models that incorporate latent temporal dynamics of non-stationary and multi-mode data.
Result: Extensive experiments show the proposed approach outperforms popular probabilistic time series forecasting methods.
Conclusion: Diffusion models effectively capture uncertainty and complex patterns in DTN QoS metrics, providing superior probabilistic forecasting compared to existing methods.
Abstract: Active QoS metric prediction, commonly employed in the maintenance and operation of DTN, could enhance network performance regarding latency, throughput, energy consumption, and dependability. Naturally formulated as a multivariate time series forecasting problem, it attracts substantial research efforts. Traditional mean regression methods for time series forecasting cannot capture the data complexity adequately, resulting in deteriorated performance in operational tasks in DTNs such as routing. This paper formulates the prediction of QoS metrics in DTN as a probabilistic forecasting problem on multivariate time series, where one could quantify the uncertainty of forecasts by characterizing the distribution of these samples. The proposed approach employs diffusion models and incorporates the latent temporal dynamics of non-stationary and multi-mode data into them. Extensive experiments demonstrate the efficacy of the proposed approach by showing that it outperforms the popular probabilistic time series forecasting methods.
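Once a diffusion model can draw forecast trajectories, uncertainty quantification reduces to empirical quantiles over the samples, as in this sketch (shapes are illustrative):

```python
import numpy as np

def forecast_interval(samples, alpha=0.1):
    # samples: (n_draws, horizon, n_metrics) forecast trajectories drawn from
    # the model; return a (1 - alpha) empirical prediction interval.
    lo, hi = np.quantile(samples, [alpha / 2, 1 - alpha / 2], axis=0)
    return lo, hi
```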
[365] Imputation-free Learning of Tabular Data with Missing Values using Incremental Feature Partitions in Transformer
Manar D. Samad, Kazi Fuad B. Akhter, Shourav B. Rabbani, Ibna Kowsar
Main category: cs.LG
TL;DR: IFIAL is an imputation-free method that uses attention masks and incremental learning on feature partitions to handle missing values in tabular data without traditional imputation, achieving superior performance over state-of-the-art methods.
Details
Motivation: Traditional imputation methods for handling missing values in tabular data raise concerns about data quality and reliability of outcomes, creating a need for imputation-free approaches.
Method: Derives and retrofits a pair of attention masks to a transformer, incrementally learning partitions of overlapping fixed-size feature sets to process tabular data without imputing missing values.
Result: Achieved top average classification performance rank across 17 diverse tabular datasets, outperforming 11 state-of-the-art methods with and without imputation, and demonstrated robustness against varying missing value types and rates.
Conclusion: IFIAL is a pioneering deep attention learning solution that successfully handles tabular data without missing value imputation, with optimal feature partition size being half the original feature space for both computational efficiency and accuracy.
Abstract: Tabular data sets with varying missing values are prepared for machine learning using an arbitrary imputation strategy. Synthetic values generated by imputation models often raise concerns about data quality and the reliability of data-driven outcomes. To address these concerns, this article proposes an imputation-free incremental attention learning (IFIAL) method for tabular data. A pair of attention masks is derived and retrofitted to a transformer to directly streamline tabular data without imputing or initializing missing values. The proposed method incrementally learns partitions of overlapping and fixed-size feature sets to enhance the efficiency and performance of the transformer. The average classification performance rank order across 17 diverse tabular data sets highlights the superiority of IFIAL over 11 state-of-the-art learning methods with or without missing value imputations. Further experiments substantiate the robustness of IFIAL against varying missing value types and rates compared to methods involving missing value imputation. Our analysis reveals that a feature partition size of half the original feature space is, both computationally and in terms of accuracy, the best choice for the proposed incremental learning. The proposed method is one of the first solutions to enable deep attention learning of tabular data without requiring missing-value imputation. The source code for this paper is publicly available.
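A sketch of the overlapping fixed-size partitioning, reflecting the reported finding that a partition size near half the feature space works well; the attention-mask construction over missing entries is not reproduced.

```python
def feature_partitions(n_features, part_size=None, stride=None):
    # Overlapping fixed-size index partitions; defaults follow the reported
    # half-of-feature-space choice. Assumes part_size <= n_features.
    part_size = part_size or n_features // 2
    stride = stride or max(part_size // 2, 1)
    starts = list(range(0, n_features - part_size + 1, stride))
    if starts[-1] != n_features - part_size:  # make sure the tail is covered
        starts.append(n_features - part_size)
    return [list(range(s, s + part_size)) for s in starts]
```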
[366] Technology prediction of a 3D model using Neural Network
Grzegorz Miebs, Rafał A. Bachorz
Main category: cs.LG
TL;DR: A neural network approach that predicts manufacturing step durations from 3D models with under 3 seconds error, using 2D renderings and GQN-inspired architecture.
Details
Motivation: Traditional production time estimation methods using expert analysis or historical data are inadequate for dynamic and customized manufacturing environments, requiring a more accurate and automated solution.
Method: Renders 3D product models into multiple 2D images and uses a neural network inspired by Generative Query Network to map geometric features to time estimates for predefined production steps.
Result: Achieves mean absolute error below 3 seconds for manufacturing step duration predictions, enabling easier planning across varied product types.
Conclusion: The data-driven approach successfully bridges the gap between 3D geometric features and production time estimation, providing accurate predictions that outperform traditional methods in dynamic manufacturing settings.
Abstract: Accurate estimation of production times is critical for effective manufacturing scheduling, yet traditional methods relying on expert analysis or historical data often fall short in dynamic or customized production environments. This paper introduces a data-driven approach that predicts manufacturing steps and their durations directly from 3D models of products with exposed geometries. By rendering the model into multiple 2D images and leveraging a neural network inspired by the Generative Query Network, the method learns to map geometric features into time estimates for predefined production steps with a mean absolute error below 3 seconds, making planning across varied product types easier.
[367] Depth-Breadth Synergy in RLVR: Unlocking LLM Reasoning Gains with Adaptive Exploration
Zhicheng Yang, Zhijiang Guo, Yinya Huang, Yongxin Wang, Dongchun Xie, Yiwei Wang, Xiaodan Liang, Jing Tang
Main category: cs.LG
TL;DR: RLVR’s potential limited by depth and breadth dimensions. DARS addresses depth neglect by re-weighting hard problems through multi-stage rollouts, improving Pass@K without extra cost. Large breadth training enhances Pass@1 by sustaining exploration. DARS-B combines both for simultaneous gains.
Details
Motivation: Current RLVR approaches suffer from systematic bias where cumulative-advantage disproportionately weights medium-accuracy samples while neglecting low-accuracy instances crucial for pushing reasoning boundaries, limiting the full potential of reasoning capabilities in LLMs.
Method: Introduced Difficulty Adaptive Rollout Sampling (DARS) with targeted multi-stage rollouts to re-weight hard problems. Also scaled batch size aggressively, replacing PPO’s mini-batch iterations with full-batch updates over multiple epochs to increase breadth.
Result: DARS delivered consistent Pass@K gains without extra inference cost. Large-breadth training significantly enhanced Pass@1 performance and sustained high token-level entropy. DARS-B demonstrated simultaneous gains in both Pass@K and Pass@1.
Conclusion: Breadth and adaptive exploration across depth operate as orthogonal dimensions in RLVR, and addressing both through DARS and large-breadth training is key to unleashing the full reasoning power of RLVR in language models.
Abstract: Reinforcement Learning with Verifiable Reward (RLVR) has emerged as a powerful paradigm for unlocking reasoning capabilities in large language models, yet its full potential is hindered by two under-explored dimensions: Depth, the hardest problem a model can sample, and Breadth, the number of instances consumed in a single iteration. We dissect the popular GRPO algorithm and reveal a systematic bias: the cumulative-advantage disproportionately weights samples with medium accuracy, while down-weighting the low-accuracy instances that are crucial for pushing reasoning boundaries. To rectify the depth neglect, we introduce Difficulty Adaptive Rollout Sampling (DARS), which re-weights hard problems through targeted multi-stage rollouts, thereby increasing the number of positive rollouts for hard problems. Empirically, naively enlarging rollout size only accelerates convergence and even hurts Pass@K. Our DARS, in contrast, delivers consistent Pass@K gains without extra inference cost at convergence. Just as we adaptively expanded the depth of exploration, we now ask whether aggressively scaling the breadth of training data can further amplify reasoning gains. To this end, we intensely scale batch size and replace PPO’s mini-batch iterations with full-batch updates over multiple epochs. Increasing breadth significantly enhances Pass@1 performance. Large-breadth training sustains high token-level entropy, indicating continued exploration and reduced gradient noise. We further present DARS-B, which augments DARS with large breadth, and demonstrate simultaneous gains in Pass@K and Pass@1. The results confirm that breadth and adaptive exploration across depth operate as orthogonal dimensions in RLVR, which are key to unleashing the reasoning power of RLVR.
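A stand-in for the difficulty-adaptive idea: allocate more rollouts to prompts with low measured pass rates so hard problems still yield positive samples. The inverse-accuracy rule below is illustrative, not the paper's staged schedule.

```python
def adaptive_rollout_budget(pass_rates, base_n=8, max_n=64):
    # pass_rates: prompt id -> fraction of sampled rollouts that were correct.
    return {
        pid: min(max_n, max(base_n, int(base_n / max(rate, 1e-2))))
        for pid, rate in pass_rates.items()
    }
```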
[368] Federated Isolation Forest for Efficient Anomaly Detection on Edge IoT Systems
Pavle Vasiljevic, Milica Matic, Miroslav Popovic
Main category: cs.LG
TL;DR: Isolation Forest-based temperature anomaly detection system for edge devices using federated learning frameworks, achieving high accuracy with low memory usage
Details
Motivation: Address user privacy concerns and efficiency in embedded systems through federated learning, particularly for resource-constrained IoT devices running MicroPython.
Method: Application of Isolation Forest-based anomaly detection (FLiForest algorithm) using Python and MicroPython federated learning frameworks for temperature monitoring on small edge devices.
Result: Over 96% accuracy in distinguishing normal/abnormal readings, above 78% precision in anomaly detection, memory usage below 160 KB during model training
Conclusion: The system is highly suitable for resource-constrained edge environments while maintaining federated learning principles of data privacy and collaborative learning
Abstract: Recently, federated learning frameworks such as Python TestBed for Federated Learning Algorithms and MicroPython TestBed for Federated Learning Algorithms have emerged to tackle user privacy concerns and efficiency in embedded systems. Even more recently, an efficient federated anomaly detection algorithm, FLiForest, based on Isolation Forests has been developed, offering a low-resource, unsupervised method well-suited for edge deployment and continuous learning. In this paper, we present an application of Isolation Forest-based temperature anomaly detection, developed using the previously mentioned federated learning frameworks, aimed at small edge devices and IoT systems running MicroPython. The system has been experimentally evaluated, achieving over 96% accuracy in distinguishing normal from abnormal readings and above 78% precision in detecting anomalies across all tested configurations, while maintaining a memory usage below 160 KB during model training. These results highlight its suitability for resource-constrained environments and edge systems, while upholding federated learning principles of data privacy and collaborative learning.
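A toy, non-federated sketch of the underlying detector using scikit-learn's IsolationForest; FLiForest's federated exchange of trees between edge devices is not reproduced.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
readings = rng.normal(21.0, 0.5, size=(500, 1))  # normal room temperatures
readings[::100] += 8.0                           # inject a few spikes
model = IsolationForest(contamination=0.01, random_state=0).fit(readings)
flags = model.predict(readings)                  # -1 marks an anomaly
print(f"{(flags == -1).sum()} anomalous readings flagged")
```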
[369] Plugging Attention into Power Grids: Towards Transparent Forecasting
Eloi Campagne, Itai Zehavi, Yvenn Amara-Ouali, Yannig Goude, Argyris Kalogeratos
Main category: cs.LG
TL;DR: GNNs outperform classical models for electricity demand forecasting by capturing spatial dependencies, with simpler architectures like GCN and SAGE performing best in low-data settings, while GAT provides strong accuracy and interpretability through attention analysis.
Details
Motivation: Electricity demand prediction is crucial for grid stability, but classical approaches like GAMs fail to capture spatial dependencies in energy networks, creating a need for methods that can incorporate graph structure.
Method: Evaluated multiple GNN architectures (GCN, GraphSAGE, ChebConv, TAG, APPNP, TransformerConv, GAT, GATv2) on real-world electricity consumption datasets from France and UK, with temporal analysis of attention weights and ensemble-based expert aggregation strategies.
Result: Simpler models (GCN, SAGE, APPNP) outperform complex alternatives in low-data regimes; GAT ranks among strongest architectures with high accuracy and interpretability; ensemble strategies, particularly bottom-up combinations, significantly improve robustness and achieve state-of-the-art performance.
Conclusion: GNNs offer both accurate forecasting and interpretability for energy analytics, with architectural simplicity coupled with ensemble methods providing a practical path forward for transparent electricity demand prediction.
Abstract: Reliable prediction of electricity demand plays a key role in safeguarding grid stability and guiding generation decisions, a need that grows with the decentralization and complexity of modern systems. While classical approaches such as Generalized Additive Models (GAMs) remain widely used, they often fail to capture the spatial dependencies inherent in energy networks. Graph Neural Networks (GNNs) offer a principled framework to incorporate this structure by directly leveraging graph topologies. In this work, we evaluate a broad set of GNN architectures – including GCN, GraphSAGE, ChebConv, TAG, APPNP, TransformerConv, and Graph Attention Networks (GAT and GATv2) – on two real-world electricity consumption datasets from France and the UK. Our results show that simpler models such as GCN, SAGE, or APPNP often outperform more complex alternatives in low-data regimes, while GAT ranks among the strongest architectures in our benchmarks, combining high accuracy with valuable interpretability. We perform a temporal analysis of attention weights, revealing evolving patterns of regional interaction linked to seasonal and meteorological variability. These results highlight that, although attention is not universally superior, it provides valuable explanatory power when spatial dependencies are prominent. Additionally, we demonstrate that ensemble-based expert aggregation strategies, particularly bottom-up combinations, significantly improve robustness and yield state-of-the-art performance across both datasets. These findings highlight the dual promise of GNNs for accurate and interpretable forecasting, and suggest that architectural simplicity coupled with ensemble methods can provide a practical path forward for transparent energy analytics.
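A minimal PyTorch Geometric sketch of how a GAT layer exposes per-edge attention coefficients, the quantity examined in the paper's temporal analysis; the toy graph and feature sizes are illustrative.

```python
import torch
from torch_geometric.nn import GATConv

x = torch.randn(4, 8)                            # 4 nodes, 8 load features
edge_index = torch.tensor([[0, 1, 2, 3],         # directed ring graph
                           [1, 2, 3, 0]])
conv = GATConv(in_channels=8, out_channels=16, heads=2)
out, (edges, alpha) = conv(x, edge_index, return_attention_weights=True)
print(out.shape, alpha.shape)                    # alpha: attention per edge/head
```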
[370] Mitigating Message Imbalance in Fraud Detection with Dual-View Graph Representation Learning
Yudan Song, Yuecen Wei, Yuhang Lu, Qingyun Sun, Minglai Shao, Li-e Wang, Chunming Hu, Xianxian Li, Xingcheng Fu
Main category: cs.LG
TL;DR: Proposes MimbFD, a dual-view graph learning method to address message imbalance in fraud detection by improving topological message reachability and local confounding debiasing.
Details
Motivation: GNN-based fraud detection suffers from imbalanced supervisory messages due to fraudsters' topological obfuscation and feature concealment, leading to insufficient global information propagation and class imbalance issues.
Method: Dual-view graph representation learning with: 1) topological message reachability module to penetrate fraudsters’ camouflage and improve propagation, 2) local confounding debiasing module to adjust node representations and balance class influence.
Result: Experiments on three public fraud datasets show MimbFD achieves outstanding performance in fraud detection.
Conclusion: The proposed method effectively mitigates message imbalance issues in fraud detection by addressing both topological information propagation and class imbalance problems through dual-view learning.
Abstract: Graph representation learning has become a mainstream method for fraud detection due to its strong expressive power, which focuses on enhancing node representations through improved neighborhood knowledge capture. However, the focus on local interactions leads to imbalanced transmission of global topological information and increased risk of node-specific information being overwhelmed during aggregation due to the imbalance between fraud and benign nodes. In this paper, we first summarize the impact of topology and class imbalance on downstream tasks in GNN-based fraud detection, as the problem of imbalanced supervisory messages is caused by fraudsters’ topological behavior obfuscation and identity feature concealment. Based on statistical validation, we propose a novel dual-view graph representation learning method to mitigate Message imbalance in Fraud Detection (MimbFD). Specifically, we design a topological message reachability module for high-quality node representation learning to penetrate fraudsters’ camouflage and alleviate insufficient propagation. Then, we introduce a local confounding debiasing module to adjust node representations, enhancing the stable association between node representations and labels to balance the influence of different classes. Finally, we conducted experiments on three public fraud datasets, and the results demonstrate that MimbFD exhibits outstanding performance in fraud detection.
[371] Beacon: Post-Training Quantization with Integrated Grid Selection
Shihao Zhang, Rayan Saab
Main category: cs.LG
TL;DR: Beacon is a tuning-free per-channel post-training quantization method that automatically determines optimal scaling factors using unscaled grids and quantization geometry, achieving competitive performance without back-propagation or large calibration sets.
Details
Motivation: Existing per-channel PTQ methods require manual tuning of scaling factors via heuristics or grid search, which is inefficient and time-consuming for model deployment.
Method: Uses unscaled integer grids and exploits the geometry of scalar quantization to automatically determine optimal scaling factors without back-propagation or large calibration datasets.
Result: Achieves competitive performance compared to state-of-the-art quantization methods despite its simplicity and tuning-free nature.
Conclusion: Beacon provides a practical, efficient solution for model quantization that eliminates manual tuning while maintaining performance, making it suitable for real-world deployment scenarios.
Abstract: Quantization is a widely used compression technique for reducing the memory and computation costs of large pre-trained models. A key challenge in per-channel post-training quantization (PTQ) is selecting appropriate scaling factors to replace weight values with values from a scaled integer grid. Existing methods typically fix the scale at the outset via heuristic tuning or grid search. We propose Beacon, a simple and effective algorithm that eliminates the need for such manual tuning. Beacon performs per-channel PTQ directly using an unscaled grid and automatically determines the optimal scaling factors by exploiting the geometry of scalar quantization. It does not rely on back-propagation or large calibration sets. Despite its simplicity and tuning-free nature, Beacon achieves competitive performance compared to state-of-the-art methods, making it a practical solution for efficient model deployment.
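For contrast, here is the common max-abs heuristic that Beacon replaces, as a baseline per-channel round-to-nearest sketch; Beacon's automatic, geometry-based scale selection is not reproduced here.

```python
import numpy as np

def per_channel_quantize(W, bits=4):
    # Symmetric round-to-nearest per output channel (rows of W), with the
    # scale set by the max-abs heuristic that tuning-free methods improve on.
    levels = 2 ** (bits - 1) - 1
    scale = np.abs(W).max(axis=1, keepdims=True) / levels + 1e-12
    Q = np.clip(np.round(W / scale), -levels, levels)
    return Q * scale, scale
```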
[372] Recursive Reward Aggregation
Yuting Tang, Yivan Zhang, Johannes Ackermann, Yu-Jie Zhang, Soichiro Nishimori, Masashi Sugiyama
Main category: cs.LG
TL;DR: Proposes a new approach for RL behavior alignment using algebraic MDP perspective and alternative reward aggregation functions instead of reward function redesign.
Details
Motivation: Aligning agent behavior with specific objectives typically requires careful reward function design, which becomes challenging when the objectives are complex.
Method: Introduces algebraic perspective on MDPs showing Bellman equations emerge from recursive reward generation/aggregation, generalizing beyond discounted sum to functions like discounted max and Sharpe ratio.
Result: Approach works for both deterministic/stochastic settings, integrates with value-based/actor-critic algorithms, and effectively optimizes diverse objectives in experiments.
Conclusion: The method provides flexible behavior alignment without reward function modification, demonstrating versatility and real-world application potential.
Abstract: In reinforcement learning (RL), aligning agent behavior with specific objectives typically requires careful design of the reward function, which can be challenging when the desired objectives are complex. In this work, we propose an alternative approach for flexible behavior alignment that eliminates the need to modify the reward function by selecting appropriate reward aggregation functions. By introducing an algebraic perspective on Markov decision processes (MDPs), we show that the Bellman equations naturally emerge from the recursive generation and aggregation of rewards, allowing for the generalization of the standard discounted sum to other recursive aggregations, such as discounted max and Sharpe ratio. Our approach applies to both deterministic and stochastic settings and integrates seamlessly with value-based and actor-critic algorithms. Experimental results demonstrate that our approach effectively optimizes diverse objectives, highlighting its versatility and potential for real-world applications.
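A sketch of the idea on a single trajectory: the rewards stay fixed and only the aggregation changes, so the standard discounted sum and a discounted max are two instances of one recursion.

```python
def recursive_value(rewards, agg):
    # Fold the aggregation backwards over the trajectory, as in the Bellman
    # recursion; agg(r, v) combines the immediate reward with the tail value.
    value = 0.0
    for r in reversed(rewards):
        value = agg(r, value)
    return value

gamma = 0.9
ret = recursive_value([1, 5, 2], lambda r, v: r + gamma * v)       # discounted sum
peak = recursive_value([1, 5, 2], lambda r, v: max(r, gamma * v))  # discounted max
```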
[373] Emergence of Quantised Representations Isolated to Anisotropic Functions
George Bird
Main category: cs.LG
TL;DR: Novel methodology shows activation function symmetries cause unintended inductive biases, with discrete symmetries producing quantized representations and continuous symmetries maintaining continuous representations.
Details
Motivation: To understand how discrete representations emerge in autoencoders and determine if function-driven symmetries act as implicit inductive biases on representations.
Method: Controlled ablation study using Spotlight Resonance method where only activation functions are altered, comparing discrete algebraic permutation-equivariant symmetry vs continuous algebraic orthogonal-equivariant symmetry.
Result: Discrete activation symmetries cause representations to discretize (quantization effect), while continuous symmetries maintain continuous representations. Quantization correlates with increased reconstruction error.
Conclusion: Symmetries of network primitives carry unintended inductive biases that produce task-independent artifactual structures, motivating reassessment of common functional forms and providing insights for interpretability research.
Abstract: This paper presents a novel methodology for determining representational structure, which builds upon the existing Spotlight Resonance method. This new tool is used to gain insight into how discrete representations can emerge and organise in autoencoder models, through a controlled ablation study in which only the activation function is altered. Using this technique, the validity of whether function-driven symmetries can act as implicit inductive biases on representations is determined. Representations are found to tend to discretise when the activation functions are defined through a discrete algebraic permutation-equivariant symmetry. In contrast, they remain continuous under a continuous algebraic orthogonal-equivariant definition. This confirms the hypothesis that the symmetries of network primitives can carry unintended inductive biases, which produce task-independent artefactual structures in representations. The discrete symmetry of contemporary forms is shown to be a strong predictor for the production of discrete representations emerging from otherwise continuous distributions – a quantisation effect. This motivates further reassessment of functional forms in common usage due to such unintended consequences. Moreover, this supports a general causal model for one mode in which discrete representations may form, and could constitute a prerequisite for downstream interpretability phenomena, including grandmother neurons, discrete coding schemes, general linear features and possibly Superposition. Hence, this tool and proposed mechanism for the influence of functional form on representations may provide insights into interpretability research. Finally, preliminary results indicate that quantisation of representations appears to correlate with a measurable increase in reconstruction error, reinforcing previous conjectures that this collapse can be detrimental.
[374] Short-Form Video Recommendations with Multimodal Embeddings: Addressing Cold-Start and Bias Challenges
Andrii Dzhoha, Katya Mirylenka, Egor Malykh, Marco-Andrea Buchmann, Francesca Catino
Main category: cs.LG
TL;DR: Using fine-tuned multimodal vision-language models for video retrieval outperforms traditional supervised learning methods in overcoming challenges of short-form video recommendation systems, particularly addressing position bias and duration bias issues.
Details
Motivation: Social media platforms are increasingly incorporating short-form video content to engage users, but this introduces new challenges for recommender systems including limited interaction data, strong position bias from immersive UI designs, and duration bias when optimizing for watch-time.
Method: Leveraged a video retrieval system using a fine-tuned multimodal vision-language model instead of conventional supervised learning methods to address the recommendation challenges in short-form video experiences.
Result: The approach demonstrated greater effectiveness compared to traditional supervised learning methods in online experiments conducted on an e-commerce platform, showing better performance even with sufficient video interaction data.
Conclusion: Fine-tuned multimodal vision-language models provide a more effective solution for short-form video recommendation systems, overcoming the inherent biases and feedback loop challenges that conventional methods struggle with in immersive feed experiences.
Abstract: In recent years, social media users have spent significant amounts of time on short-form video platforms. As a result, established platforms in other domains, such as e-commerce, have begun introducing short-form video content to engage users and increase their time spent on the platform. The success of these experiences is due not only to the content itself but also to a unique UI innovation: instead of offering users a list of choices to click, platforms actively recommend content for users to watch one at a time. This creates new challenges for recommender systems, especially when launching a new video experience. Beyond the limited interaction data, immersive feed experiences introduce stronger position bias due to the UI and duration bias when optimizing for watch-time, as models tend to favor shorter videos. These issues, together with the feedback loop inherent in recommender systems, make it difficult to build effective solutions. In this paper, we highlight the challenges faced when introducing a new short-form video experience and present our experience showing that, even with sufficient video interaction data, it can be more beneficial to leverage a video retrieval system using a fine-tuned multimodal vision-language model to overcome these challenges. This approach demonstrated greater effectiveness compared to conventional supervised learning methods in online experiments conducted on our e-commerce platform.
[375] TimeCopilot
Azul Garza, Reneé Rosillo
Main category: cs.LG
TL;DR: TimeCopilot is the first open-source agentic framework for time series forecasting that combines multiple Time Series Foundation Models with LLMs through a unified API, automating the forecasting pipeline and achieving state-of-the-art performance.
Details
Motivation: To create a practical, reproducible, and accessible agentic forecasting system that automates the entire forecasting pipeline while providing natural language explanations and supporting direct queries about the future.
Method: Combines multiple Time Series Foundation Models (TSFMs) with Large Language Models (LLMs) through a single unified API, supporting ensembles across diverse forecasting families and being LLM-agnostic (compatible with both commercial and open-source models).
Result: Achieves state-of-the-art probabilistic forecasting performance on the large-scale GIFT-Eval benchmark at low cost.
Conclusion: TimeCopilot provides a practical foundation for reproducible, explainable, and accessible agentic forecasting systems that automate the forecasting pipeline while maintaining high performance.
Abstract: We introduce TimeCopilot, the first open-source agentic framework for forecasting that combines multiple Time Series Foundation Models (TSFMs) with Large Language Models (LLMs) through a single unified API. TimeCopilot automates the forecasting pipeline: feature analysis, model selection, cross-validation, and forecast generation, while providing natural language explanations and supporting direct queries about the future. The framework is LLM-agnostic, compatible with both commercial and open-source models, and supports ensembles across diverse forecasting families. Results on the large-scale GIFT-Eval benchmark show that TimeCopilot achieves state-of-the-art probabilistic forecasting performance at low cost. Our framework provides a practical foundation for reproducible, explainable, and accessible agentic forecasting systems.
[376] Pulling Back the Curtain on ReLU Networks
Maciej Satkiewicz
Main category: cs.LG
TL;DR: The paper introduces excitation pullbacks - modified gradients using soft gating in backward pass that reveal aligned perceptual features in ReLU networks, providing better interpretability and theoretical insights into network behavior.
Details
Motivation: ReLU networks' gradients are notoriously misaligned, obscuring internal representations. The authors posit that models do align gradients with data, but this is concealed by ReLU hard gating noise.
Method: Apply soft gating in backward pass only to reduce impact of weakly excited neurons, creating excitation pullbacks. Analyze these modified gradients on ImageNet-pretrained architectures.
Result: Excitation pullbacks exhibit striking perceptual alignment and produce easily interpretable input- and target-specific features through gradient ascent. Formulate path stability hypothesis about binary activation patterns.
Conclusion: Excitation pullbacks provide faithful feature attributions and potential mechanistic interpretability. Also explain effectiveness of Batch Normalization and Deep Features, offering new perspective on network memory and generalization.
Abstract: Since any ReLU network is piecewise affine, its hidden units can be characterized by their pullbacks through the active subnetwork, i.e., by their gradients (up to bias terms). However, gradients of deeper neurons are notoriously misaligned, which obscures the network’s internal representations. We posit that models do align gradients with data, yet this is concealed by the intrinsic noise of the ReLU hard gating. We validate this intuition by applying soft gating in the backward pass only, reducing the local impact of weakly excited neurons. The resulting modified gradients, which we call “excitation pullbacks”, exhibit striking perceptual alignment on a number of ImageNet-pretrained architectures, while the rudimentary pixel-space gradient ascent quickly produces easily interpretable input- and target-specific features. Inspired by these findings, we formulate the “path stability” hypothesis, claiming that the binary activation patterns largely stabilize during training and get encoded in the pre-activation distribution of the final model. When true, excitation pullbacks become aligned with the gradients of a kernel machine that mainly determines the network’s decision. This provides a theoretical justification for the apparent faithfulness of the feature attributions based on these pullbacks, potentially even leading to mechanistic interpretability of deeper models. Incidentally, we give a possible explanation for the effectiveness of Batch Normalization and Deep Features, together with a novel perspective on the network’s internal memory and generalization properties. We release the code and an interactive app for easier exploration of the excitation pullbacks.
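A sketch of the backward-only soft gating as a PyTorch autograd function: the forward pass is an ordinary ReLU, while the backward pass down-weights weakly excited units; the temperature is illustrative.

```python
import torch

class SoftGateReLU(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, temperature=10.0):
        ctx.save_for_backward(x)
        ctx.temperature = temperature
        return torch.relu(x)        # forward pass keeps the hard gating

    @staticmethod
    def backward(ctx, grad_out):
        (x,) = ctx.saved_tensors
        # Soft gate: weakly excited neurons contribute little to the pullback.
        return grad_out * torch.sigmoid(ctx.temperature * x), None
```

Input gradients taken through SoftGateReLU.apply in place of ReLU would then play the role of excitation pullbacks.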
[377] Structure Transfer: an Inference-Based Calculus for the Transformation of Representations
Daniel Raggi, Gem Stapleton, Mateja Jamnik, Aaron Stockdill, Grecia Garcia Garcia, Peter C-H. Cheng
Main category: cs.LG
TL;DR: A novel calculus called structure transfer enables representation transformation across diverse representational systems while ensuring specified relations like semantic equivalence are maintained.
Details
Motivation: To solve the fundamental problem of representation choice by developing system-agnostic techniques that can drive representation transformation across different representational systems.
Method: Structure transfer calculus uses schemas that encode knowledge about representational systems to generate target representations from source representations while preserving specified relations. Built on Representational Systems Theory and construction spaces to model diverse RS types.
Result: The approach provides a general framework for representation transformation that works across formal languages, geometric figures, diagrams, and informal notations.
Conclusion: Structure transfer is a system-agnostic calculus capable of identifying alternative representations in various practical settings by ensuring desired relations between source and target representations.
Abstract: Representation choice is of fundamental importance to our ability to communicate and reason effectively. A major unsolved problem, addressed in this paper, is how to devise representational-system (RS) agnostic techniques that drive representation transformation and choice. We present a novel calculus, called structure transfer, that enables representation transformation across diverse RSs. Specifically, given a source representation drawn from a source RS, the rules of structure transfer allow us to generate a target representation for a target RS. The generality of structure transfer comes in part from its ability to ensure that the source representation and the generated target representation satisfy any specified relation (such as semantic equivalence). This is done by exploiting schemas, which encode knowledge about RSs. Specifically, schemas can express preservation of information across relations between any pair of RSs, and this knowledge is used by structure transfer to derive a structure for the target representation which ensures that the desired relation holds. We formalise this using Representational Systems Theory, building on the key concept of a construction space. The abstract nature of construction spaces grants them the generality to model RSs of diverse kinds, including formal languages, geometric figures and diagrams, as well as informal notations. Consequently, structure transfer is a system-agnostic calculus that can be used to identify alternative representations in a wide range of practical settings.
[378] UniExtreme: A Universal Foundation Model for Extreme Weather Forecasting
Hang Ni, Weijia Zhang, Hao Liu
Main category: cs.LG
TL;DR: UniExtreme is a universal foundation model for extreme weather forecasting that addresses spectral disparities and hierarchical drivers of diverse extreme events through adaptive frequency modulation and event prior augmentation modules.
Details
Motivation: Existing foundation models for weather forecasting have limited ability to predict extreme weather events, as they either focus on general conditions or specialize in specific-type extremes, neglecting the real-world atmospheric patterns of diversified extreme events.Method: Proposes UniExtreme with two key modules: (1) Adaptive Frequency Modulation (AFM) module with learnable Beta-distribution filters and multi-granularity spectral aggregation to capture region-wise spectral differences, and (2) Event Prior Augmentation (EPA) module with dual-level memory fusion network to incorporate region-specific extreme event priors for resolving hierarchical extreme diversity.
Result: Extensive experiments demonstrate that UniExtreme outperforms state-of-the-art baselines in both extreme and general weather forecasting, showing superior adaptability across diverse extreme scenarios.
Conclusion: The proposed UniExtreme model effectively addresses the limitations of existing approaches by capturing both spectral characteristics and hierarchical drivers of extreme weather events, achieving superior performance in universal extreme weather forecasting.
Abstract: Recent advancements in deep learning have led to the development of Foundation Models (FMs) for weather forecasting, yet their ability to predict extreme weather events remains limited. Existing approaches either focus on general weather conditions or specialize in specific-type extremes, neglecting the real-world atmospheric patterns of diversified extreme events. In this work, we identify two key characteristics of extreme events: (1) the spectral disparity against normal weather regimes, and (2) the hierarchical drivers and geographic blending of diverse extremes. Along this line, we propose UniExtreme, a universal extreme weather forecasting foundation model that integrates (1) an Adaptive Frequency Modulation (AFM) module that captures region-wise spectral differences between normal and extreme weather, through learnable Beta-distribution filters and multi-granularity spectral aggregation, and (2) an Event Prior Augmentation (EPA) module which incorporates region-specific extreme event priors to resolve hierarchical extreme diversity and composite extreme schema, via a dual-level memory fusion network. Extensive experiments demonstrate that UniExtreme outperforms state-of-the-art baselines in both extreme and general weather forecasting, showcasing superior adaptability across diverse extreme scenarios.
[379] One Small Step with Fingerprints, One Giant Leap for De Novo Molecule Generation from Mass Spectra
Neng Kai Nigel Neo, Lim Jing, Ngoui Yong Zhau Preston, Koh Xue Ting Serene, Bingquan Shen
Main category: cs.LG
TL;DR: Two-stage pipeline using MIST encoder and MolForge decoder with probability thresholding achieves 10x improvement over previous methods for de novo molecular generation from mass spectra.
Details
Motivation: To improve de novo molecular generation from mass spectra by enhancing the two-stage pipeline approach with better encoding/decoding and probability thresholding.Method: Used MIST as encoder and MolForge as decoder with additional training data, and applied probability thresholding on fingerprint bits to focus on substructure presence.
Result: Achieved top-1 28% and top-10 36% correct molecular structure generation from mass spectra in MassSpecGym, representing a tenfold improvement over previous state-of-the-art methods.
Conclusion: This approach establishes a strong baseline for future research in de novo molecule elucidation from mass spectra.
Abstract: A common approach to the de novo molecular generation problem from mass spectra involves a two-stage pipeline: (1) encoding mass spectra into molecular fingerprints, followed by (2) decoding these fingerprints into molecular structures. In our work, we adopt MIST as the encoder and MolForge as the decoder, leveraging additional training data to enhance performance. We also threshold the probabilities of each fingerprint bit to focus on the presence of substructures. This results in a tenfold improvement over previous state-of-the-art methods, generating top-1 28% / top-10 36% of molecular structures correctly from mass spectra in MassSpecGym. We position this as a strong baseline for future research in de novo molecule elucidation from mass spectra.
[380] EvoCoT: Overcoming the Exploration Bottleneck in Reinforcement Learning
Huanyu Liu, Jia Li, Chang Yu, Taozhi Chen, Yihong Dong, Lecheng Wang, XiaoLong Hu, Ge Li
Main category: cs.LG
TL;DR: EvoCoT is a self-evolving curriculum learning framework that uses two-stage chain-of-thought reasoning optimization to help LLMs learn from hard problems with sparse rewards, enabling stable learning and improved reasoning without external supervision.
Details
Motivation: Existing RLVR approaches face limitations with sparse rewards on hard problems, either requiring stronger LLMs for distillation or filtering out difficult problems, which restricts scalability and reasoning improvement through exploration.Method: EvoCoT constrains exploration space by self-generating and verifying CoT trajectories, then gradually shortens them to expand the space in a controlled way through two-stage chain-of-thought reasoning optimization.
Result: EvoCoT enables LLMs to solve previously unsolved problems, improves reasoning capability without external CoT supervision, and is compatible with various RL fine-tuning methods across multiple LLM families including Qwen, DeepSeek, and Llama.
Conclusion: The proposed EvoCoT framework effectively addresses sparse reward challenges in RLVR, enabling stable learning from hard problems and demonstrating broad applicability across different LLM families and RL fine-tuning methods.
Abstract: Reinforcement learning with verifiable reward (RLVR) has become a promising paradigm for post-training large language models (LLMs) to improve their reasoning capability. However, when the rollout accuracy is low on hard problems, the reward becomes sparse, limiting learning efficiency and causing exploration bottlenecks. Existing approaches either rely on stronger LLMs for distillation or filter out difficult problems, which limits scalability or restricts reasoning improvement through exploration. We propose EvoCoT, a self-evolving curriculum learning framework based on two-stage chain-of-thought (CoT) reasoning optimization. EvoCoT constrains the exploration space by self-generating and verifying CoT trajectories, then gradually shortens them to expand the space in a controlled way. This enables LLMs to stably learn from initially unsolved hard problems under sparse rewards. We apply EvoCoT to multiple LLM families, including Qwen, DeepSeek, and Llama. Experiments show that EvoCoT enables LLMs to solve previously unsolved problems, improves reasoning capability without external CoT supervision, and is compatible with various RL fine-tuning methods. We release the source code to support future research.
[381] SWiFT: Soft-Mask Weight Fine-tuning for Bias Mitigation
Junyu Yan, Feng Chen, Yuyang Xue, Yuning Du, Konstantinos Vilouras, Sotirios A. Tsaftaris, Steven McDonagh
Main category: cs.LG
TL;DR: SWiFT is a debiasing framework that improves fairness while preserving model performance with minimal data and training requirements, outperforming state-of-the-art methods across multiple medical imaging datasets.
Details
Motivation: Machine learning models exhibit bias in healthcare applications, risking unfairness and social discrimination. Existing debiasing approaches require extensive retraining, original training data access, and face fairness-performance trade-offs.Method: Soft-Mask Weight Fine-Tuning (SWiFT) identifies parameter contributions to bias vs performance, then uses two-step fine-tuning with different gradient flows based on each parameter’s contribution. Requires only small external dataset and few epochs.
Result: Extensive experiments across 6 medical datasets with 3 bias attributes show SWiFT consistently reduces bias while achieving competitive/superior diagnostic accuracy. Demonstrates improved generalization on out-of-distribution datasets.
Conclusion: SWiFT provides an efficient debiasing solution that addresses limitations of existing methods, requiring minimal resources while maintaining both fairness and performance in healthcare ML models.
Abstract: Recent studies have shown that Machine Learning (ML) models can exhibit bias in real-world scenarios, posing significant challenges in ethically sensitive domains such as healthcare. Such bias can negatively affect model fairness, model generalization abilities and further risks amplifying social discrimination. There is a need to remove biases from trained models. Existing debiasing approaches often necessitate access to original training data and need extensive model retraining; they also typically exhibit trade-offs between model fairness and discriminative performance. To address these challenges, we propose Soft-Mask Weight Fine-Tuning (SWiFT), a debiasing framework that efficiently improves fairness while preserving discriminative performance with much less debiasing costs. Notably, SWiFT requires only a small external dataset and only a few epochs of model fine-tuning. The idea behind SWiFT is to first find the relative, and yet distinct, contributions of model parameters to both bias and predictive performance. Then, a two-step fine-tuning process updates each parameter with different gradient flows defined by its contribution. Extensive experiments with three bias sensitive attributes (gender, skin tone, and age) across four dermatological and two chest X-ray datasets demonstrate that SWiFT can consistently reduce model bias while achieving competitive or even superior diagnostic accuracy under common fairness and accuracy metrics, compared to the state-of-the-art. Specifically, we demonstrate improved model generalization ability as evidenced by superior performance on several out-of-distribution (OOD) datasets.
[382] Towards Synthesizing Normative Data for Cognitive Assessments Using Generative Multimodal Large Language Models
Victoria Yan, Honor Chotkowski, Fengran Wang, Xinhui Li, Jessica Saurman, Fadi Nahab, David Loring, Carl Yang, Jiaying Lu, Runze Yan, Xiao Hu, Alex Fedorov
Main category: cs.LG
TL;DR: Generative multimodal LLMs can create synthetic normative data for cognitive tests using advanced prompting strategies, overcoming traditional data collection limitations.
Details
Motivation: Cognitive assessments need normative data but traditional collection methods are costly and time-consuming. New image-based tests lack readily available normative benchmarks.Method: Used GPT-4o and GPT-4o-mini with naive and advanced prompting strategies to generate synthetic responses for image-based cognitive tests like ‘Cookie Theft’ task. Evaluated using embedding analysis, BLEU, ROUGE, BERTScore, and LLM-as-a-judge evaluation.
Result: Advanced prompting produced synthetic responses that better distinguished diagnostic groups and captured demographic diversity. BERTScore was most reliable for similarity assessment, while LLM-as-a-judge showed promising validation results.
Conclusion: Generative multimodal LLMs with refined prompting can feasibly generate robust synthetic normative data, enabling development of novel image-based cognitive assessments without traditional limitations.
Abstract: Cognitive assessments require normative data as essential benchmarks for evaluating individual performance. Hence, developing new cognitive tests based on novel image stimuli is challenging due to the lack of readily available normative data. Traditional data collection methods are costly, time-consuming, and infrequently updated, limiting their practical utility. Recent advancements in generative multimodal large language models (MLLMs) offer a new approach to generate synthetic normative data from existing cognitive test images. We investigated the feasibility of using MLLMs, specifically GPT-4o and GPT-4o-mini, to synthesize normative textual responses for established image-based cognitive assessments, such as the “Cookie Theft” picture description task. Two distinct prompting strategies-naive prompts with basic instructions and advanced prompts enriched with contextual guidance-were evaluated. Responses were analyzed using embeddings to assess their capacity to distinguish diagnostic groups and demographic variations. Performance metrics included BLEU, ROUGE, BERTScore, and an LLM-as-a-judge evaluation. Advanced prompting strategies produced synthetic responses that more effectively distinguished between diagnostic groups and captured demographic diversity compared to naive prompts. Superior models generated responses exhibiting higher realism and diversity. BERTScore emerged as the most reliable metric for contextual similarity assessment, while BLEU was less effective for evaluating creative outputs. The LLM-as-a-judge approach provided promising preliminary validation results. Our study demonstrates that generative multimodal LLMs, guided by refined prompting methods, can feasibly generate robust synthetic normative data for existing cognitive tests, thereby laying the groundwork for developing novel image-based cognitive assessments without the traditional limitations.
[383] Understanding sparse autoencoder scaling in the presence of feature manifolds
Eric J. Michaud, Liv Gorton, Tom McGrath
Main category: cs.LG
TL;DR: SAEs show scaling laws with number of latents, but feature manifolds can cause pathological behavior where SAEs learn fewer features than available latents.
Details
Motivation: To understand how sparse autoencoders scale with the number of latents and how feature manifolds influence their scaling behavior, building on existing neural scaling literature.Method: Adapt a capacity-allocation model from neural scaling literature to analyze SAE scaling, particularly focusing on how multi-dimensional features (feature manifolds) affect scaling regimes.
Result: The model identifies distinct scaling regimes, with one pathological regime where feature manifolds cause SAEs to learn significantly fewer features than the number of available latents.
Conclusion: Feature manifolds can have pathological effects on SAE scaling, and preliminary discussion suggests SAEs may operate in this problematic regime in practical applications.
Abstract: Sparse autoencoders (SAEs) model the activations of a neural network as linear combinations of sparsely occurring directions of variation (latents). The ability of SAEs to reconstruct activations follows scaling laws w.r.t. the number of latents. In this work, we adapt a capacity-allocation model from the neural scaling literature (Brill, 2024) to understand SAE scaling, and in particular, to understand how “feature manifolds” (multi-dimensional features) influence scaling behavior. Consistent with prior work, the model recovers distinct scaling regimes. Notably, in one regime, feature manifolds have the pathological effect of causing SAEs to learn far fewer features in data than there are latents in the SAE. We provide some preliminary discussion on whether or not SAEs are in this pathological regime in the wild.
[384] Towards Reasoning for PDE Foundation Models: A Reward-Model-Driven Inference-Time-Scaling Algorithm
Siddharth Mansingh, James Amarel, Ragib Arnab, Arvind Mohan, Kamaljeet Singh, Gerd J. Kunde, Nicolas Hengartner, Benjamin Migliori, Emily Casleton, Nathan A. Debardeleben, Ayan Biswas, Diane Oyen, Earl Lawrence
Main category: cs.LG
TL;DR: A test-time computing strategy for PDEs that uses computational resources during inference to improve prediction accuracy with fewer training samples and smaller models, using reward models for spatio-temporal consistency evaluation.
Details
Motivation: Existing PDE foundation models are constrained by pretraining datasets, struggle with auto-regressive rollout performance in out-of-distribution cases, and have high compute and training data requirements that limit critical applications.Method: Introduces a test-time computing strategy inspired by LLM thinking strategies, using two types of reward models to evaluate predictions of a stochastic model for spatio-temporal consistency on compressible Euler-equation simulations.
Result: TTC captures improved predictions relative to standard non-adaptive auto-regressive inference, demonstrating better performance with fewer resources.
Conclusion: This TTC framework represents a foundational step toward more advanced reasoning algorithms for PDE modeling, including reinforcement-learning-based approaches that could transform computational workflows in physics and engineering.
Abstract: Partial Differential Equations (PDEs) are the bedrock for modern computational sciences and engineering, and inherently computationally expensive. While PDE foundation models have shown much promise for simulating such complex spatio-temporal phenomena, existing models remain constrained by the pretraining datasets and struggle with auto-regressive rollout performance, especially in out-of-distribution (OOD) cases. Furthermore, they have significant compute and training data requirements which hamper their use in many critical applications. Inspired by recent advances in ``thinking” strategies used in large language models (LLMs), we introduce the first test-time computing (TTC) strategy for PDEs that utilizes computational resources during inference to achieve more accurate predictions with fewer training samples and smaller models. We accomplish this with two types of reward models that evaluate predictions of a stochastic based model for spatio-temporal consistency. We demonstrate this method on compressible Euler-equation simulations from the PDEGym benchmark and show that TTC captures improved predictions relative to standard non-adaptive auto-regressive inference. This TTC framework marks a foundational step towards more advanced reasoning algorithms or PDE modeling, inluding building reinforcement-learning-based approaches, potentially transforming computational workflows in physics and engineering.
[385] Discrete Functional Geometry of ReLU Networks via ReLU Transition Graphs
Sahil Rajesh Dhayalkar
Main category: cs.LG
TL;DR: Extends ReLU Transition Graph framework to model deep ReLU networks as graphs where nodes are linear activation regions and edges connect regions differing by single ReLU flips, revealing structural properties that govern generalization.
Details
Motivation: To develop a comprehensive graph-theoretic model for understanding deep ReLU networks by analyzing their discrete geometric structure and connecting structural properties to generalization behavior.Method: Extends RTG framework with theoretical proofs about expansion, degree distributions, and spectral properties at random initialization, plus empirical construction of RTGs for small networks to measure smoothness and connectivity properties.
Result: Shows region entropy saturates under overparameterization, spectral gap correlates with generalization, and KL divergence across adjacent regions reflects functional smoothness. Provides new bounds on capacity and generalization.
Conclusion: Provides unified framework for analyzing ReLU networks through discrete functional geometry, offering new tools to understand, diagnose, and improve generalization with structural insights governing network behavior.
Abstract: We extend the ReLU Transition Graph (RTG) framework into a comprehensive graph-theoretic model for understanding deep ReLU networks. In this model, each node represents a linear activation region, and edges connect regions that differ by a single ReLU activation flip, forming a discrete geometric structure over the network’s functional behavior. We prove that RTGs at random initialization exhibit strong expansion, binomial degree distributions, and spectral properties that tightly govern generalization. These structural insights enable new bounds on capacity via region entropy and on generalization via spectral gap and edge-wise KL divergence. Empirically, we construct RTGs for small networks, measure their smoothness and connectivity properties, and validate theoretical predictions. Our results show that region entropy saturates under overparameterization, spectral gap correlates with generalization, and KL divergence across adjacent regions reflects functional smoothness. This work provides a unified framework for analyzing ReLU networks through the lens of discrete functional geometry, offering new tools to understand, diagnose, and improve generalization.
[386] EvolveSignal: A Large Language Model Powered Coding Agent for Discovering Traffic Signal Control Algorithms
Leizhen Wang, Peibo Duan, Hao Wang, Yue Wang, Jian Xu, Nan Zheng, Zhenliang Ma
Main category: cs.LG
TL;DR: EvolveSignal uses LLMs and evolutionary search to automatically discover traffic signal control algorithms that outperform traditional Webster’s method, reducing delays by 20.1% and stops by 47.1%.
Details
Motivation: Traditional fixed-time traffic signal control relies on manual engineering and hand-crafted formulas that are labor-intensive and suboptimal under congested or heterogeneous traffic conditions.Method: Formulates traffic signal control as program synthesis using Python functions with fixed input-output structures, optimized through external evaluations (traffic simulator) and evolutionary search powered by large language models.
Result: Discovered algorithms significantly outperform Webster’s baseline with 20.1% reduction in average delay and 47.1% reduction in average stops, while providing practical insights for traffic engineers.
Conclusion: This work establishes a new research direction by leveraging AI for automated algorithm design in traffic signal control, bridging program synthesis with transportation engineering.
Abstract: In traffic engineering, the fixed-time traffic signal control remains widely used for its low cost, stability, and interpretability. However, its design depends on hand-crafted formulas (e.g., Webster) and manual re-timing by engineers to adapt to demand changes, which is labor-intensive and often yields suboptimal results under heterogeneous or congested conditions. This paper introduces the EvolveSignal, a large language models (LLMs) powered coding agent to automatically discover new traffic signal control algorithms. We formulate the problem as program synthesis, where candidate algorithms are represented as Python functions with fixed input-output structures, and iteratively optimized through external evaluations (e.g., a traffic simulator) and evolutionary search. Experiments on a signalized intersection demonstrate that the discovered algorithms outperform Webster’s baseline, reducing average delay by 20.1% and average stops by 47.1%. Beyond performance, ablation and incremental analyses reveal that EvolveSignal modifications-such as adjusting cycle length bounds, incorporating right-turn demand, and rescaling green allocations-can offer practically meaningful insights for traffic engineers. This work opens a new research direction by leveraging AI for algorithm design in traffic signal control, bridging program synthesis with transportation engineering.
[387] Invariant Features for Global Crop Type Classification
Xin-Yi Tong, Sherrie Wang
Main category: cs.LG
TL;DR: This paper introduces CropGlobe, a global crop dataset with 300K samples, and proposes CropNet with temporal augmentation to improve cross-regional crop classification using Sentinel-2 features that show strong geographic invariance.
Details
Motivation: Limited ground sample availability constrains large-scale crop classification across geographic regions, requiring solutions for performance decline under geospatial shifts.Method: Constructed CropGlobe dataset with 300K samples from 8 countries, compared transferability of Sentinel-2 temporal features and EMIT hyperspectral features, designed CropNet CNN with temporal data augmentation (time shift, scale, magnitude warping).
Result: 2D median temporal features from Sentinel-2 showed strongest invariance across all transfer scenarios, and temporal augmentation further improved robustness, especially with limited training data diversity.
Conclusion: Identified invariant feature representations that enhance geographic transferability, providing a path toward scalable, low-cost crop type applications across diverse global regions.
Abstract: Accurately obtaining crop type and its spatial distribution at a global scale is critical for food security, agricultural policy-making, and sustainable development. Remote sensing offers an efficient solution for large-scale crop classification, but the limited availability of reliable ground samples in many regions constrains applicability across geographic areas. To address performance declines under geospatial shifts, this study identifies remote sensing features that are invariant to geographic variation and proposes strategies to enhance cross-regional generalization. We construct CropGlobe, a global crop type dataset with 300,000 pixel-level samples from eight countries across five continents, covering six major food and industrial crops (corn, soybeans, rice, wheat, sugarcane, cotton). With broad geographic coverage, CropGlobe enables a systematic evaluation under cross-country, cross-continent, and cross-hemisphere transfer. We compare the transferability of temporal multi-spectral features (Sentinel-2-based 1D/2D median features and harmonic coefficients) and hyperspectral features (from EMIT). To improve generalization under spectral and phenological shifts, we design CropNet, a lightweight and robust CNN tailored for pixel-level crop classification, coupled with temporal data augmentation (time shift, time scale, and magnitude warping) that simulates realistic cross-regional phenology. Experiments show that 2D median temporal features from Sentinel-2 consistently exhibit the strongest invariance across all transfer scenarios, and augmentation further improves robustness, particularly when training data diversity is limited. Overall, the work identifies more invariant feature representations that enhance geographic transferability and suggests a promising path toward scalable, low-cost crop type applications across globally diverse regions.
cs.MA
[388] SAMVAD: A Multi-Agent System for Simulating Judicial Deliberation Dynamics in India
Prathamesh Devadiga, Omkaar Jayadev Shetty, Pooja Agarwal
Main category: cs.MA
TL;DR: SAMVAD is a Multi-Agent System using LLMs to simulate Indian judicial deliberation with RAG for legal grounding
Details
Motivation: Empirical studies of judicial panels face ethical and practical barriers, requiring simulation approachesMethod: MAS with Judge, Prosecution, Defense Counsel, and Adjudicator agents powered by LLMs with RAG using Indian legal documents
Result: System enables legally sound instructions/arguments with citations and consensus-based verdicts through iterative deliberation
Conclusion: Provides configurable, explainable platform for exploring legal reasoning in Indian judicial context with verifiable legal grounding
Abstract: Understanding the complexities of judicial deliberation is crucial for assessing the efficacy and fairness of a justice system. However, empirical studies of judicial panels are constrained by significant ethical and practical barriers. This paper introduces SAMVAD, an innovative Multi-Agent System (MAS) designed to simulate the deliberation process within the framework of the Indian justice system. Our system comprises agents representing key judicial roles: a Judge, a Prosecution Counsel, a Defense Counsel, and multiple Adjudicators (simulating a judicial bench), all powered by large language models (LLMs). A primary contribution of this work is the integration of Retrieval-Augmented Generation (RAG), grounded in a domain-specific knowledge base of landmark Indian legal documents, including the Indian Penal Code and the Constitution of India. This RAG functionality enables the Judge and Counsel agents to generate legally sound instructions and arguments, complete with source citations, thereby enhancing both the fidelity and transparency of the simulation. The Adjudicator agents engage in iterative deliberation rounds, processing case facts, legal instructions, and arguments to reach a consensus-based verdict. We detail the system architecture, agent communication protocols, the RAG pipeline, the simulation workflow, and a comprehensive evaluation plan designed to assess performance, deliberation quality, and outcome consistency. This work provides a configurable and explainable MAS platform for exploring legal reasoning and group decision-making dynamics in judicial simulations, specifically tailored to the Indian legal context and augmented with verifiable legal grounding via RAG.
cs.MM
eess.AS
[389] Hierarchical Sparse Sound Field Reconstruction with Spherical and Linear Microphone Arrays
Shunxi Xu, Craig T. Jin
Main category: eess.AS
TL;DR: A two-stage sparse recovery framework combining spherical and linear microphone arrays to enhance spatial resolution and robustness in reverberant environments
Details
Motivation: Spherical microphone arrays have limited spatial resolution due to spherical harmonic order constraints and performance degradation in reverberant environmentsMethod: Two-stage sparse recovery with residue refinement that integrates a central spherical microphone array (primary estimator) with four surrounding linear microphone arrays (spatially complementary refiner)
Result: Significantly enhanced spatial energy map reconstruction under varying reverberation conditions compared to SMA-only and direct one-step joint processing
Conclusion: The proposed SMA-LMA framework effectively enhances spatial fidelity and robustness in complex acoustic environments by exploiting complementary spatial characteristics
Abstract: Spherical microphone arrays (SMAs) are widely used for sound field analysis, and sparse recovery (SR) techniques can significantly enhance their spatial resolution by modeling the sound field as a sparse superposition of dominant plane waves. However, the spatial resolution of SMAs is fundamentally limited by their spherical harmonic order, and their performance often degrades in reverberant environments. This paper proposes a two-stage SR framework with residue refinement that integrates observations from a central SMA and four surrounding linear microphone arrays (LMAs). The core idea is to exploit complementary spatial characteristics by treating the SMA as a primary estimator and the LMAs as a spatially complementary refiner. Simulation results demonstrate that the proposed SMA-LMA method significantly enhances spatial energy map reconstruction under varying reverberation conditions, compared to both SMA-only and direct one-step joint processing. These results demonstrate the effectiveness of the proposed framework in enhancing spatial fidelity and robustness in complex acoustic environments.
[390] LibriQuote: A Speech Dataset of Fictional Character Utterances for Expressive Zero-Shot Speech Synthesis
Gaspard Michel, Elena V. Epure, Christophe Cerisara
Main category: eess.AS
TL;DR: LibriQuote dataset provides 18K hours of English speech (12.7K neutral + 5.3K expressive) from audiobooks with context and pseudo-labels for expressive TTS training and benchmarking.
Details
Motivation: Existing TTS systems lack large-scale expressive speech datasets, and current expressive corpora are too small for effective training and benchmarking.Method: Created LibriQuote dataset from audiobooks with expressive speech from character quotations, including context and pseudo-labels. Provides test set for zero-shot expressive TTS evaluation.
Result: Fine-tuning on LibriQuote improves speech intelligibility. Current systems fail to match ground-truth expressiveness and naturalness in subjective/objective evaluations.
Conclusion: LibriQuote enables better expressive TTS training and benchmarking, showing significant room for improvement in current expressive synthesis capabilities.
Abstract: Text-to-speech (TTS) systems have recently achieved more expressive and natural speech synthesis by scaling to large speech datasets. However, the proportion of expressive speech in such large-scale corpora is often unclear. Besides, existing expressive speech corpora are typically smaller in scale and primarily used for benchmarking TTS systems. In this paper, we introduce the LibriQuote dataset, an English corpus derived from read audiobooks, designed for both fine-tuning and benchmarking expressive zero-shot TTS system. The training dataset includes 12.7K hours of read, non-expressive speech and 5.3K hours of mostly expressive speech drawn from character quotations. Each utterance in the expressive subset is supplemented with the context in which it was written, along with pseudo-labels of speech verbs and adverbs used to describe the quotation (\textit{e.g. ``he whispered softly’’}). Additionally, we provide a challenging 7.5 hour test set intended for benchmarking TTS systems: given a neutral reference speech as input, we evaluate system’s ability to synthesize an expressive utterance while preserving reference timbre. We validate qualitatively the test set by showing that it covers a wide range of emotions compared to non-expressive speech, along with various accents. Extensive subjective and objective evaluations show that fine-tuning a baseline TTS system on LibriQuote significantly improves its synthesized speech intelligibility, and that recent systems fail to synthesize speech as expressive and natural as the ground-truth utterances. The dataset and evaluation code are freely available. Audio samples can be found at https://libriquote.github.io/.
[391] Test-Time Adaptation for Speech Enhancement via Domain Invariant Embedding Transformation
Tobias Raichle, Niels Edinger, Bin Yang
Main category: eess.AS
TL;DR: LaDen is a test-time adaptation method for speech enhancement that uses pre-trained speech representations to perform latent denoising, enabling effective adaptation to new domains without labeled target data.
Details
Motivation: Deep learning speech enhancement models perform well when test distributions match training conditions but degrade in unpredictable real-world environments with domain shifts, requiring a solution for better generalization.Method: Leverages pre-trained speech representations to perform latent denoising through linear transformation of noisy embeddings, enabling pseudo-labeling for target domains without labeled data.
Result: LaDen consistently outperforms baseline methods across perceptual metrics, particularly for speaker and language domain shifts, as demonstrated in comprehensive benchmark experiments.
Conclusion: The proposed latent denoising approach effectively enables test-time adaptation of speech enhancement models across diverse acoustic environments, demonstrating strong generalization capabilities for various domain shifts.
Abstract: Deep learning-based speech enhancement models achieve remarkable performance when test distributions match training conditions, but often degrade when deployed in unpredictable real-world environments with domain shifts. To address this challenge, we present LaDen (latent denoising), the first test-time adaptation method specifically designed for speech enhancement. Our approach leverages powerful pre-trained speech representations to perform latent denoising, approximating clean speech representations through a linear transformation of noisy embeddings. We show that this transformation generalizes well across domains, enabling effective pseudo-labeling for target domains without labeled target data. The resulting pseudo-labels enable effective test-time adaptation of speech enhancement models across diverse acoustic environments. We propose a comprehensive benchmark spanning multiple datasets with various domain shifts, including changes in noise types, speaker characteristics, and languages. Our extensive experiments demonstrate that LaDen consistently outperforms baseline methods across perceptual metrics, particularly for speaker and language domain shifts.
[392] Accelerated Interactive Auralization of Highly Reverberant Spaces using Graphics Hardware
Hannes Rosseel, Toon van Waterschoot
Main category: eess.AS
TL;DR: GPU-accelerated real-time acoustic auralization system for highly reverberant spaces that reduces latency compared to CPU-based approaches.
Details
Motivation: Interactive acoustic auralization of concert halls and historical worship spaces requires real-time convolution with long filters, but traditional CPU-based methods introduce significant latency that limits interactivity.Method: Implementation of a real-time multichannel loudspeaker-based auralization system using GPU-acceleration for convolution and integrated acoustic feedback cancellation on the GPU.
Result: GPU-accelerated convolution achieves real-time performance with significantly lower latency compared to traditional CPU-based convolution.
Conclusion: The unified GPU-based framework enables real-time interactive auralization of highly reverberant spaces while minimizing processing latency.
Abstract: Interactive acoustic auralization allows users to explore virtual acoustic environments in real-time, enabling the acoustic recreation of concert hall or Historical Worship Spaces (HWS) that are either no longer accessible, acoustically altered, or impractical to visit. Interactive acoustic synthesis requires real-time convolution of input signals with a set of synthesis filters that model the space-time acoustic response of the space. The acoustics in concert halls and HWS are both characterized by a long reverberation time, resulting in synthesis filters containing many filter taps. As a result, the convolution process can be computationally demanding, introducing significant latency that limits the real-time interactivity of the auralization system. In this paper, the implementation of a real-time multichannel loudspeaker-based auralization system is presented. This system is capable of synthesizing the acoustics of highly reverberant spaces in real-time using GPU-acceleration. A comparison between traditional CPU-based convolution and GPU-accelerated convolution is presented, showing that the latter can achieve real-time performance with significantly lower latency. Additionally, the system integrates acoustic synthesis with acoustic feedback cancellation on the GPU, creating a unified loudspeaker-based auralization framework that minimizes processing latency.
[393] Window Function-less DFT with Reduced Noise and Latency for Real-Time Music Analysis
Cai Biesinger, Hiromitsu Awano, Masanori Hashimoto
Main category: eess.AS
TL;DR: A DFT-based algorithm for music analysis that provides high time and frequency resolution with reduced noise and low computational requirements, using exponentially spaced bins that map directly to musical notes.
Details
Motivation: Music analysis applications require algorithms with high time and frequency resolution while minimizing noise in noisy signals, with additional demands for low latency and low computational requirements for real-time applications.Method: Extends a DFT-based method that post-processes DFT output without window functions, using exponentially spaced output bins that directly map to musical notes.
Result: The approach yields greatly reduced sidelobes and noise, improves time resolution without sacrificing frequency resolution, and outperforms existing FFT and DFT-based approaches.
Conclusion: The improved performance enables better real-time visualizations and contributes to enhanced analysis quality in applications like automatic music transcription.
Abstract: Music analysis applications demand algorithms that can provide both high time and frequency resolution while minimizing noise in an already-noisy signal. Real-time analysis additionally demands low latency and low computational requirements. We propose a DFT-based algorithm that accomplishes all these requirements by extending a method that post-processes DFT output without the use of window functions. Our approach yields greatly reduced sidelobes and noise, and improves time resolution without sacrificing frequency resolution. We use exponentially spaced output bins which directly map to notes in music. The resulting improved performance, compared to existing FFT and DFT-based approaches, creates possibilities for improved real-time visualizations, and contributes to improved analysis quality in other applications such as automatic transcription.
[394] Exposing Synthetic Speech: Model Attribution and Detection of AI-generated Speech via Audio Fingerprints
Matías Pizarro, Mike Laszkiewicz, Shawkat Hesso, Dorothea Kolossa, Asja Fischer
Main category: eess.AS
TL;DR: Training-free AI speech detection using audio residuals that achieves >99% AUROC for model attribution and synthetic speech detection across diverse systems.
Details
Motivation: Address growing threats from AI-generated speech misuse (impersonation, misinformation, spoofing) by developing effective detection methods.Method: Uses standardized average residuals - differences between input audio and filtered versions via low-pass filter or EnCodec autoencoder - to capture synthesis artifacts as model fingerprints.
Result: Achieves AUROC scores exceeding 99% in most scenarios for single-model attribution, multi-model attribution, and synthetic vs real detection across multiple synthesis systems.
Conclusion: Provides simple, efficient, and robust method for digital forensics with strong generalization across speech synthesis systems and languages.
Abstract: As speech generation technologies continue to advance in quality and accessibility, the risk of malicious use cases, including impersonation, misinformation, and spoofing, increases rapidly. This work addresses this threat by introducing a simple, training-free, yet effective approach for detecting AI-generated speech and attributing it to its source model. Specifically, we tackle three key tasks: (1) single-model attribution in an open-world setting, where the goal is to determine whether a given audio sample was generated by a specific target neural speech synthesis system (with access only to data from that system); (2) multi-model attribution in a closed-world setting, where the objective is to identify the generating system from a known pool of candidates; and last but not least (3) detection of synthetic versus real speech. Our approach leverages standardized average residuals-the difference between an input audio signal and its filtered version using either a low-pass filter or the EnCodec audio autoencoder. We demonstrate that these residuals consistently capture artifacts introduced by diverse speech synthesis systems, serving as distinctive, model-agnostic fingerprints for attribution. Across extensive experiments, our approach achieves AUROC scores exceeding 99% in most scenarios, evaluated on augmented benchmark datasets that pair real speech with synthetic audio generated by multiple synthesis systems. In addition, our robustness analysis underscores the method’s ability to maintain high performance even in the presence of moderate additive noise. Due to its simplicity, efficiency, and strong generalization across speech synthesis systems and languages, this technique offers a practical tool for digital forensics and security applications.
[395] SongBloom: Coherent Song Generation via Interleaved Autoregressive Sketching and Diffusion Refinement
Chenyu Yang, Shuai Wang, Hangting Chen, Wei Tan, Jianwei Yu, Haizhou Li
Main category: eess.AS
TL;DR: SongBloom is a novel framework for full-length song generation that combines autoregressive sketching with diffusion-based refinement to create coherent, high-fidelity music with harmonious instrumental and vocal elements.
Details
Motivation: Existing language models and diffusion methods struggle to balance global coherence with local fidelity in song generation, resulting in outputs that lack musicality or suffer from incoherent progression and mismatched lyrics.Method: Uses an interleaved paradigm of autoregressive sketching and diffusion-based refinement, gradually extending musical sketches from short to long and refining details from coarse to fine-grained while integrating prior semantic and acoustic context.
Result: Outperforms existing methods across both subjective and objective metrics and achieves performance comparable to state-of-the-art commercial music generation platforms.
Conclusion: SongBloom effectively addresses the challenge of generating coherent, high-quality full-length songs by combining the strengths of autoregressive and diffusion models in an interleaved generation framework.
Abstract: Generating music with coherent structure, harmonious instrumental and vocal elements remains a significant challenge in song generation. Existing language models and diffusion-based methods often struggle to balance global coherence with local fidelity, resulting in outputs that lack musicality or suffer from incoherent progression and mismatched lyrics. This paper introduces $\textbf{SongBloom}$, a novel framework for full-length song generation that leverages an interleaved paradigm of autoregressive sketching and diffusion-based refinement. SongBloom employs an autoregressive diffusion model that combines the high fidelity of diffusion models with the scalability of language models. Specifically, it gradually extends a musical sketch from short to long and refines the details from coarse to fine-grained. The interleaved generation paradigm effectively integrates prior semantic and acoustic context to guide the generation process. Experimental results demonstrate that SongBloom outperforms existing methods across both subjective and objective metrics and achieves performance comparable to the state-of-the-art commercial music generation platforms. Audio samples are available on our demo page: https://cypress-yang.github.io/SongBloom_demo. The code and model weights have been released on https://github.com/Cypress-Yang/SongBloom .
[396] CUHK-EE Systems for the vTAD Challenge at NCMMSC 2025
Aemon Yat Fei Chiu, Jingyu Li, Yusheng Tian, Guangyan Zhang, Tan Lee
Main category: eess.AS
TL;DR: CUHK’s vTAD systems for NCMMSC 2025 challenge use WavLM-Large embeddings with ASTP pooling and Diff-Net variants (FFN and SE-ResFFN) for timbre attribute comparison between utterances, showing trade-off between model complexity and generalization.
Details
Motivation: To develop robust voice timbre attribute detection systems for fine-grained speaker modeling and address challenges in speaker identity, annotation subjectivity, and data imbalance.Method: Leverage WavLM-Large embeddings with attentive statistical pooling (ASTP) for speaker representations, followed by two Diff-Net variants: Feed-Forward Neural Network (FFN) and Squeeze-and-Excitation-enhanced Residual FFN (SE-ResFFN) for timbre attribute intensity comparison.
Result: WavLM-Large+FFN achieved 77.96% accuracy and 21.79% EER for unseen speakers, while WavLM-Large+SE-ResFFN achieved 94.42% accuracy and 5.49% EER for seen speakers.
Conclusion: Architectural choices significantly impact fine-grained speaker modeling, with trade-off between model complexity and generalization. Future work should focus on improving robustness and fairness in timbre attribute detection.
Abstract: This paper presents the Voice Timbre Attribute Detection (vTAD) systems developed by the Digital Signal Processing & Speech Technology Laboratory (DSP&STL) of the Department of Electronic Engineering (EE) at The Chinese University of Hong Kong (CUHK) for the 20th National Conference on Human-Computer Speech Communication (NCMMSC 2025) vTAD Challenge. The proposed systems leverage WavLM-Large embeddings with attentive statistical pooling (ASTP) to extract robust speaker representations, followed by two variants of Diff-Net, i.e., Feed-Forward Neural Network (FFN) and Squeeze-and-Excitation-enhanced Residual FFN (SE-ResFFN), to compare timbre attribute intensities between utterance pairs. Experimental results demonstrate that the WavLM-Large+FFN system generalises better to unseen speakers, achieving 77.96% accuracy and 21.79% equal error rate (EER), while the WavLM-Large+SE-ResFFN model excels in the ‘Seen’ setting with 94.42% accuracy and 5.49% EER. These findings highlight a trade-off between model complexity and generalisation, and underscore the importance of architectural choices in fine-grained speaker modelling. Our analysis also reveals the impact of speaker identity, annotation subjectivity, and data imbalance on system performance, pointing to future directions for improving robustness and fairness in timbre attribute detection.
[397] MultiGen: Child-Friendly Multilingual Speech Generator with LLMs
Xiaoxue Gao, Huayun Zhang, Nancy F. Chen
Main category: eess.AS
TL;DR: MultiGen is a multilingual speech generation model using LLM architecture for child-friendly interactions in low-resource languages including Singaporean Mandarin, Malay, and Tamil.
Details
Motivation: Improve human-machine interactions for children through high-quality, culturally relevant speech generation in low-resource languages where existing solutions are limited.Method: Leverages LLM architecture for multilingual speech generation, integrating age-appropriate content and culturally relevant contexts for three low-resource languages.
Result: Experimental results show superior performance over baseline methods in both objective metrics and subjective evaluations.
Conclusion: MultiGen successfully addresses the challenge of child-friendly speech generation for low-resource languages, demonstrating effective multilingual capabilities through LLM-based architecture.
Abstract: Generative speech models have demonstrated significant potential in improving human-machine interactions, offering valuable real-world applications such as language learning for children. However, achieving high-quality, child-friendly speech generation remains challenging, particularly for low-resource languages across diverse languages and cultural contexts. In this paper, we propose MultiGen, a multilingual speech generation model with child-friendly interaction, leveraging LLM architecture for speech generation tailored for low-resource languages. We propose to integrate age-appropriate multilingual speech generation using LLM architectures, which can be used to facilitate young children’s communication with AI systems through culturally relevant context in three low-resource languages: Singaporean accent Mandarin, Malay, and Tamil. Experimental results from both objective metrics and subjective evaluations demonstrate the superior performance of the proposed MultiGen compared to baseline methods.
[398] From Evaluation to Optimization: Neural Speech Assessment for Downstream Applications
Yu Tsao
Main category: eess.AS
TL;DR: Review of neural network-based speech assessment models that serve as differentiable perceptual proxies for optimizing speech enhancement/synthesis and enable detection of salient speech characteristics for downstream processing.
Details
Motivation: Traditional subjective listening tests are costly and time-consuming, while objective metrics require clean reference signals that are often unavailable in real-world applications.Method: Analysis of recent neural network-based speech assessment models that predict quality and intelligibility without requiring clean reference signals.
Result: These models achieve promising results and are increasingly integrated into downstream speech processing tasks as differentiable perceptual proxies.
Conclusion: Neural speech assessment models show great potential but current limitations need to be addressed for further advancement in speech processing pipelines.
Abstract: The evaluation of synthetic and processed speech has long been a cornerstone of audio engineering and speech science. Although subjective listening tests remain the gold standard for assessing perceptual quality and intelligibility, their high cost, time requirements, and limited scalability present significant challenges in the rapid development cycles of modern speech technologies. Traditional objective metrics, while computationally efficient, often rely on a clean reference signal, making them intrusive approaches. This presents a major limitation, as clean signals are often unavailable in real-world applications. In recent years, numerous neural network-based speech assessment models have been developed to predict quality and intelligibility, achieving promising results. Beyond their role in evaluation, these models are increasingly integrated into downstream speech processing tasks. This review focuses on their role in two main areas: (1) serving as differentiable perceptual proxies that not only assess but also guide the optimization of speech enhancement and synthesis models; and (2) enabling the detection of salient speech characteristics to support more precise and efficient downstream processing. Finally, we discuss current limitations and outline future research directions to further advance the integration of speech assessment into speech processing pipelines.
[399] Speech Intelligibility Assessment with Uncertainty-Aware Whisper Embeddings and sLSTM
Ryandhimas E. Zezario, Dyah A. M. G. Wisnu, Hsin-Min Wang, Yu Tsao
Main category: eess.AS
TL;DR: Proposes iMTI-Net, an improved multi-target intelligibility prediction network that uses uncertainty-aware Whisper embeddings with statistical features and CNN-sLSTM architecture for better speech intelligibility prediction.
Details
Motivation: Non-intrusive speech intelligibility prediction is challenging due to variability in speakers, noise conditions, and subjective perception.Method: Uses uncertainty-aware Whisper embeddings with mean, standard deviation, and entropy features. Employs scalar LSTM for sequential modeling and proposes iMTI-Net with CNN-sLSTM architecture in multitask learning framework to predict both human intelligibility scores and machine-based WER.
Result: iMTI-Net outperforms the original MTI-Net across multiple evaluation metrics.
Conclusion: The approach demonstrates effectiveness of incorporating uncertainty-aware features and CNN-sLSTM architecture for improved speech intelligibility prediction.
Abstract: Non-intrusive speech intelligibility prediction remains challenging due to variability in speakers, noise conditions, and subjective perception. We propose an uncertainty-aware approach that leverages Whisper embeddings in combination with statistical features, specifically the mean, standard deviation, and entropy computed across the embedding dimensions. The entropy, computed via a softmax over the feature dimension, serves as a proxy for uncertainty, complementing global information captured by the mean and standard deviation. To model the sequential structure of speech, we adopt a scalar long short-term memory (sLSTM) network, which efficiently captures long-range dependencies. Building on this foundation, we propose iMTI-Net, an improved multi-target intelligibility prediction network that integrates convolutional neural network (CNN) and sLSTM components within a multitask learning framework. It jointly predicts human intelligibility scores and machine-based word error rates (WER) from Google ASR and Whisper. Experimental results show that iMTI-Net outperforms the original MTI-Net across multiple evaluation metrics, demonstrating the effectiveness of incorporating uncertainty-aware features and the CNN-sLSTM architecture.
eess.IV
[400] Latent Space Single-Pixel Imaging Under Low-Sampling Conditions
Chenyu Yuan
Main category: eess.IV
TL;DR: LSSPI migrates single-pixel imaging to latent space, achieving superior reconstruction quality, blind denoising, and computational efficiency even at low sampling rates.
Details
Motivation: Traditional deep learning networks for single-pixel imaging operate in pixel space, limiting performance. The authors aim to enhance imaging capabilities by moving to latent space representation.Method: Proposed LSSPI framework that conducts single-pixel imaging reconstruction and generation tasks in latent space rather than traditional pixel space.
Result: Significantly improved imaging under low sampling rates, higher SNR, richer details, blind denoising capability, better high-frequency recovery, and superior parameter efficiency and reconstruction speed.
Conclusion: LSSPI provides an ideal solution for low-sampling single-pixel imaging applications, driving practical implementation of the technology with excellent computational efficiency.
Abstract: In recent years, the introduction of deep learning into the field of single-pixel imaging has garnered significant attention. However, traditional networks often operate within the pixel space. To address this, we innovatively migrate single-pixel imaging to the latent space, naming this framework LSSPI (Latent Space Single-Pixel Imaging). Within the latent space, we conduct in-depth explorations into both reconstruction and generation tasks for single-pixel imaging. Notably, this approach significantly enhances imaging capabilities even under low sampling rate conditions. Compared to conventional deep learning networks, LSSPI not only reconstructs images with higher signal-to-noise ratios (SNR) and richer details under equivalent sampling rates but also enables blind denoising and effective recovery of high-frequency information. Furthermore, by migrating single-pixel imaging to the latent space, LSSPI achieves superior advantages in terms of model parameter efficiency and reconstruction speed. Its excellent computational efficiency further positions it as an ideal solution for low-sampling single-pixel imaging applications, effectively driving the practical implementation of single-pixel imaging technology.
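The abstract gives no architectural detail, so the following is only a structural sketch of the general idea: map the single-pixel measurements into a latent code and decode with a pretrained image decoder. Layer sizes and the decoder are hypothetical.

```python
import torch
import torch.nn as nn

class LatentSPI(nn.Module):
    """Illustrative LSSPI-style pipeline: map M single-pixel measurements
    to a latent code, then decode the code to an image."""
    def __init__(self, n_measurements: int, latent_dim: int, decoder: nn.Module):
        super().__init__()
        self.to_latent = nn.Sequential(
            nn.Linear(n_measurements, 512),
            nn.ReLU(),
            nn.Linear(512, latent_dim),
        )
        self.decoder = decoder  # e.g. the decoder of a pretrained autoencoder

    def forward(self, y: torch.Tensor) -> torch.Tensor:
        # y: (B, M) bucket-detector readings, with M << H * W at low sampling rates
        return self.decoder(self.to_latent(y))
```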
[401] Neural Video Compression with In-Loop Contextual Filtering and Out-of-Loop Reconstruction Enhancement
Yaojun Wu, Chaoyi Lin, Yiming Wang, Semih Esenlik, Zhaobin Zhang, Kai Zhang, Li Zhang
Main category: eess.IV
TL;DR: Systematic study of enhancement filtering in neural video compression, achieving 7.71% bit rate reduction through adaptive in-loop filtering and out-of-loop reconstruction enhancement.
Details
Motivation: To improve neural video compression efficiency by systematically applying enhancement filtering techniques, addressing challenges of error propagation and adaptive filtering application throughout video sequences.
Method: Categorized enhancement filtering into in-loop contextual filtering (mitigates error propagation during encoding) and out-of-loop reconstruction enhancement (refines frame quality). Introduced adaptive coding decision strategy to dynamically determine filtering application during encoding.
Result: Achieved 7.71% reduction in bit rate compared to state-of-the-art neural video codecs through extensive experiments.
Conclusion: The proposed enhancement filtering approach provides effective improvement in coding efficiency and represents the first systematic study of such techniques in conditional-based neural video compression.
Abstract: This paper explores the application of enhancement filtering techniques in neural video compression. Specifically, we categorize these techniques into in-loop contextual filtering and out-of-loop reconstruction enhancement based on whether the enhanced representation affects the subsequent coding loop. In-loop contextual filtering refines the temporal context by mitigating error propagation during frame-by-frame encoding. However, its influence on both the current and subsequent frames poses challenges in adaptively applying filtering throughout the sequence. To address this, we introduce an adaptive coding decision strategy that dynamically determines filtering application during encoding. Additionally, out-of-loop reconstruction enhancement is employed to refine the quality of reconstructed frames, providing a simple yet effective improvement in coding efficiency. To the best of our knowledge, this work presents the first systematic study of enhancement filtering in the context of conditional-based neural video compression. Extensive experiments demonstrate a 7.71% reduction in bit rate compared to state-of-the-art neural video codecs, validating the effectiveness of the proposed approach.
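One natural form for the encoder-side decision is a two-pass rate-distortion comparison; the sketch below illustrates that pattern under the assumption of a hypothetical `codec.encode` interface, not the paper's actual implementation.

```python
def choose_filtering(frame, context, codec, filter_fn, lam: float = 0.01):
    """Encoder-side sketch: code the frame with and without in-loop
    contextual filtering and keep the variant with the lower R-D cost
    J = D + lambda * R. All interfaces here are hypothetical."""
    best = None
    for use_filter in (False, True):
        ctx = filter_fn(context) if use_filter else context
        bits, recon = codec.encode(frame, ctx)   # rate and reconstruction
        dist = ((recon - frame) ** 2).mean()     # MSE distortion
        cost = dist + lam * bits
        if best is None or cost < best[0]:
            best = (cost, use_filter, recon)
    return best
```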
[402] EHVC: Efficient Hierarchical Reference and Quality Structure for Neural Video Coding
Junqi Liao, Yaojun Wu, Chaoyi Lin, Zhipin Deng, Li Li, Dong Liu, Xiaoyan Sun
Main category: eess.IV
TL;DR: EHVC is an efficient hierarchical neural video codec that addresses reference-quality mismatch through hierarchical multi-reference scheme, enhances quality structure with lookahead strategy, and stabilizes quality with layer-wise quality scale training.
Details
Motivation: Neural video codecs show superior efficiency but lack proper alignment between reference structures and hierarchical quality structures, with room for further optimization of quality structures.
Method: Three key innovations: 1) hierarchical multi-reference scheme aligning reference and quality structures, 2) encoder-side lookahead strategy using future frames, 3) layer-wise quality scale with random quality training for stability.
Result: EHVC achieves significantly superior performance compared to state-of-the-art neural video codecs.
Conclusion: The proposed EHVC successfully addresses reference-quality mismatch and optimizes hierarchical quality structures, demonstrating substantial improvements in neural video coding efficiency.
Abstract: Neural video codecs (NVCs), leveraging the power of end-to-end learning, have demonstrated remarkable coding efficiency improvements over traditional video codecs. Recent research has begun to pay attention to the quality structures in NVCs, optimizing them by introducing explicit hierarchical designs. However, less attention has been paid to the reference structure design, which fundamentally should be aligned with the hierarchical quality structure. In addition, there is still significant room for further optimization of the hierarchical quality structure. To address these challenges in NVCs, we propose EHVC, an efficient hierarchical neural video codec featuring three key innovations: (1) a hierarchical multi-reference scheme that draws on traditional video codec design to align reference and quality structures, thereby addressing the reference-quality mismatch; (2) a lookahead strategy to utilize an encoder-side context from future frames to enhance the quality structure; (3) a layer-wise quality scale with random quality training strategy to stabilize quality structures during inference. With these improvements, EHVC achieves significantly superior performance to the state-of-the-art NVCs. Code will be released at: https://github.com/bytedance/NEVC.
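The abstract does not spell out the exact reference scheme, but the dyadic layer assignment it borrows from traditional hierarchical-B coding is easy to illustrate: key frames sit in layer 0 and deeper layers receive progressively lower quality.

```python
def dyadic_layer(poc: int, gop_size: int = 8) -> int:
    """Dyadic hierarchy as in traditional hierarchical-B coding (an
    illustration of the structure EHVC aligns its references with)."""
    if poc % gop_size == 0:
        return 0                 # key frame, highest-quality layer
    step, depth = gop_size, 0
    while poc % step != 0:       # depth of the frame in the dyadic split
        step //= 2
        depth += 1
    return depth

# Layers for one GOP of 8: {0: 0, 1: 3, 2: 2, 3: 3, 4: 1, 5: 3, 6: 2, 7: 3, 8: 0}
print({poc: dyadic_layer(poc) for poc in range(9)})
```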
[403] Spatial-aware Transformer-GRU Framework for Enhanced Glaucoma Diagnosis from 3D OCT Imaging
Mona Ashtari-Majlan, David Masip
Main category: eess.IV
TL;DR: Novel deep learning framework for glaucoma detection using 3D OCT imaging with Vision Transformer and bidirectional GRU, achieving superior performance over state-of-the-art methods.
Details
Motivation: Glaucoma is a leading cause of irreversible blindness that requires early detection for timely intervention to prevent vision loss. 3D OCT imaging provides valuable diagnostic information that can be leveraged for automated detection.
Method: Integration of pre-trained Vision Transformer for slice-wise feature extraction from retinal data and bidirectional Gated Recurrent Unit (GRU) for capturing inter-slice spatial dependencies. This dual-component approach analyzes both local nuances and global structural integrity.
Result: Superior performance on large dataset with F1-score of 93.01%, Matthews Correlation Coefficient of 69.33%, and AUC of 94.20%, outperforming state-of-the-art methods.
Conclusion: The framework effectively leverages 3D OCT data and shows significant potential for enhancing clinical decision support systems and improving patient outcomes in glaucoma management.
Abstract: Glaucoma, a leading cause of irreversible blindness, necessitates early detection for accurate and timely intervention to prevent irreversible vision loss. In this study, we present a novel deep learning framework that leverages the diagnostic value of 3D Optical Coherence Tomography (OCT) imaging for automated glaucoma detection. In this framework, we integrate a pre-trained Vision Transformer on retinal data for rich slice-wise feature extraction and a bidirectional Gated Recurrent Unit for capturing inter-slice spatial dependencies. This dual-component approach enables comprehensive analysis of local nuances and global structural integrity, crucial for accurate glaucoma diagnosis. Experimental results on a large dataset demonstrate the superior performance of the proposed method over state-of-the-art ones, achieving an F1-score of 93.01%, Matthews Correlation Coefficient (MCC) of 69.33%, and AUC of 94.20%. The framework’s ability to leverage the valuable information in 3D OCT data holds significant potential for enhancing clinical decision support systems and improving patient outcomes in glaucoma management.
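The two-stage design maps cleanly onto a small module; the sketch below is illustrative rather than the authors' code, and assumes the ViT returns one pooled feature vector per slice.

```python
import torch
import torch.nn as nn

class SliceGRUClassifier(nn.Module):
    """Minimal sketch of the described pipeline: a pre-trained ViT encodes
    each OCT slice, a bidirectional GRU aggregates the slice sequence."""
    def __init__(self, vit: nn.Module, feat_dim: int = 768, hidden: int = 256):
        super().__init__()
        self.vit = vit                       # assumed to return (N, feat_dim)
        self.gru = nn.GRU(feat_dim, hidden, batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, 1)  # binary: glaucoma vs. healthy

    def forward(self, volume: torch.Tensor) -> torch.Tensor:
        # volume: (B, S, C, H, W) -- S slices per 3D OCT scan
        b, s = volume.shape[:2]
        feats = self.vit(volume.flatten(0, 1)).view(b, s, -1)  # (B, S, feat_dim)
        seq, _ = self.gru(feats)                               # (B, S, 2*hidden)
        return self.head(seq.mean(dim=1))                      # one logit per scan
```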
[404] AutoPETIII: The Tracer Frontier. What Frontier?
Zacharia Mesbah, Léo Mottay, Romain Modzelewski, Pierre Decazes, Sébastien Hapdey, Su Ruan, Sébastien Thureau
Main category: eess.IV
TL;DR: Using nnUNetv2 framework with 6-fold ensembles and MIP-CNN for automatic PET/CT lesion segmentation across multiple tracers (FDG and PSMA) without prior tracer knowledge.
Details
Motivation: Address the challenge of developing automatic lesion segmentation algorithms that work across different PET tracers (FDG and PSMA) without requiring tracer identification, as presented in the 2024 AutoPET competition.
Method: Trained two sets of 6-fold ensembles using nnUNetv2 framework for PET/CT lesion segmentation, combined with a MIP-CNN to select the appropriate model set for segmentation based on the input.
Result: Developed a fully automatic algorithm capable of performing lesion segmentation on PET/CT scans without knowing whether the tracer is FDG or PSMA-based.
Conclusion: The proposed approach using ensemble models with a selection mechanism enables robust lesion segmentation across multiple tracer types, addressing a key challenge in medical imaging analysis.
Abstract: For the last three years, the AutoPET competition has gathered the medical imaging community around a hot topic: lesion segmentation on Positron Emission Tomography (PET) scans. Each year a different aspect of the problem is presented; in 2024, the multiplicity of tracers in clinical use was at the core of the challenge. Specifically, this year's edition aims to develop a fully automatic algorithm capable of performing lesion segmentation on a PET/CT scan without knowing the tracer, which can be either FDG- or PSMA-based. In this paper we describe how we used the nnUNetv2 framework to train two sets of 6-fold ensembles of models to perform fully automatic PET/CT lesion segmentation, as well as a MIP-CNN to choose which set of models to use for segmentation.
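The routing logic is the interesting part of this pipeline; a minimal sketch follows, with hypothetical model interfaces (the classifier and ensembles are stand-ins for the trained MIP-CNN and the two 6-fold nnU-Net sets).

```python
import numpy as np

def segment_pet_ct(pet, ct, mip_classifier, fdg_models, psma_models):
    """Sketch of the described routing: a CNN on a maximum-intensity
    projection picks the tracer, then the matching ensemble is averaged."""
    mip = pet.max(axis=1)                    # coronal MIP of the PET volume
    models = psma_models if mip_classifier(mip) > 0.5 else fdg_models
    probs = np.mean([m.predict(pet, ct) for m in models], axis=0)
    return probs > 0.5                       # binary lesion mask
```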
[405] Is an Ultra Large Natural Image-Based Foundation Model Superior to a Retina-Specific Model for Detecting Ocular and Systemic Diseases?
Qingshan Hou, Yukun Zhou, Jocelyn Hui Lin Goh, Ke Zou, Samantha Min Er Yew, Sahana Srinivasan, Meng Wang, Thaddaeus Lo, Xiaofeng Lei, Siegfried K. Wagner, Mark A. Chia, Dawei Yang, Hongyang Jiang, An Ran Ran, Rui Santos, Gabor Mark Somfai, Juan Helen Zhou, Haoyu Chen, Qingyu Chen, Carol Y. Cheung, Pearse A. Keane, Yih Chung Tham
Main category: eess.IV
TL;DR: DINOv2 general-purpose vision foundation model outperforms retina-specific RETFound in ocular disease detection tasks, while RETFound excels in systemic disease prediction from retinal images, showing task-specific strengths of general vs domain-specific models.
Details
Motivation: To compare the performance of general-purpose vision foundation models (DINOv2) versus domain-specific retinal foundation models (RETFound) across various clinical tasks in ophthalmology, as the applicability of general-purpose models to medical domains remains underexplored.
Method: Head-to-head evaluation by fine-tuning RETFound and three DINOv2 models (large, base, small) for ocular disease detection and systemic disease prediction tasks across eight standardized open-source ocular datasets plus Moorfields AlzEye and UK Biobank datasets.
Result: DINOv2-large outperformed RETFound in diabetic retinopathy detection (AUROC=0.850-0.952 vs 0.823-0.944) and multi-class eye diseases (AUROC=0.892 vs 0.846). DINOv2-base outperformed in glaucoma detection (AUROC=0.958 vs 0.940). RETFound outperformed all DINOv2 models in predicting systemic diseases like heart failure, myocardial infarction, and stroke (AUROC=0.732-0.796 vs 0.663-0.771). Trends persisted with limited data.
Conclusion: General-purpose and domain-specific foundation models excel in distinct clinical scenarios, emphasizing the importance of task-specific model selection to optimize clinical performance in medical AI applications.
Abstract: The advent of foundation models (FMs) is transforming the medical domain. In ophthalmology, RETFound, a retina-specific FM pre-trained sequentially on 1.4 million natural images and 1.6 million retinal images, has demonstrated high adaptability across clinical applications. Conversely, DINOv2, a general-purpose vision FM pre-trained on 142 million natural images, has shown promise in non-medical domains. However, its applicability to clinical tasks remains underexplored. To address this, we conducted head-to-head evaluations by fine-tuning RETFound and three DINOv2 models (large, base, small) for ocular disease detection and systemic disease prediction tasks, across eight standardized open-source ocular datasets, as well as the Moorfields AlzEye and the UK Biobank datasets. The DINOv2-large model outperformed RETFound in detecting diabetic retinopathy (AUROC=0.850-0.952 vs 0.823-0.944, across three datasets, all P<=0.007) and multi-class eye diseases (AUROC=0.892 vs 0.846, P<0.001). In glaucoma, the DINOv2-base model outperformed RETFound (AUROC=0.958 vs 0.940, P<0.001). Conversely, RETFound achieved superior performance over all DINOv2 models in predicting heart failure, myocardial infarction, and ischaemic stroke (AUROC=0.732-0.796 vs 0.663-0.771, all P<0.001). These trends persisted even with 10% of the fine-tuning data. These findings showcase the distinct scenarios where general-purpose and domain-specific FMs excel, highlighting the importance of aligning FM selection with task-specific requirements to optimise clinical performance.
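The head-to-head setup is straightforward to reproduce in outline: attach the same task head to either backbone and fine-tune end to end. The sketch below assumes backbones that return pooled features; the feature dimension is backbone-specific (e.g. 1024 for a ViT-large, 768 for a ViT-base).

```python
import torch.nn as nn

def build_classifier(backbone: nn.Module, feat_dim: int, n_classes: int) -> nn.Module:
    """Attach a linear task head to a foundation-model backbone
    (RETFound or a DINOv2 variant) for per-task fine-tuning."""
    return nn.Sequential(backbone, nn.Linear(feat_dim, n_classes))
```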
[406] LotteryCodec: Searching the Implicit Representation in a Random Network for Low-Complexity Image Compression
Haotian Wu, Gongpu Chen, Pier Luigi Dragotti, Deniz Gündüz
Main category: eess.IV
TL;DR: Untrained subnetworks in random networks can achieve state-of-the-art image compression by encoding image statistics into network substructure through binary mask overfitting.
Details
Motivation: To explore whether untrained subnetworks within randomly initialized networks can serve as effective synthesis networks for image compression, potentially eliminating the need for network training while achieving competitive rate-distortion performance.
Method: Proposed LotteryCodec which overfits a binary mask to individual images using an over-parameterized randomly initialized network shared by encoder and decoder, with a rewind modulation mechanism to address over-parameterization challenges and streamline subnetwork search.
Result: LotteryCodec outperforms VTM and sets new state-of-the-art in single-image compression, while also enabling adaptive decoding complexity through adjustable mask ratios for flexible compression solutions.
Conclusion: The lottery codec hypothesis is validated - untrained subnetworks can achieve excellent compression performance, opening a new paradigm for image compression that encodes statistics into network substructure rather than training weights.
Abstract: We introduce and validate the lottery codec hypothesis, which states that untrained subnetworks within randomly initialized networks can serve as synthesis networks for overfitted image compression, achieving rate-distortion (RD) performance comparable to trained networks. This hypothesis leads to a new paradigm for image compression by encoding image statistics into the network substructure. Building on this hypothesis, we propose LotteryCodec, which overfits a binary mask to an individual image, leveraging an over-parameterized and randomly initialized network shared by the encoder and the decoder. To address over-parameterization challenges and streamline subnetwork search, we develop a rewind modulation mechanism that improves the RD performance. LotteryCodec outperforms VTM and sets a new state-of-the-art in single-image compression. LotteryCodec also enables adaptive decoding complexity through adjustable mask ratios, offering flexible compression solutions for diverse device constraints and application requirements.
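The core mechanism of learning a binary mask over frozen random weights is commonly implemented with a straight-through estimator; the layer below is a generic sketch of that idea (the paper's rewind modulation is beyond this snippet).

```python
import torch
import torch.nn as nn

class MaskedLinear(nn.Module):
    """Lottery-style masked layer: the weight stays frozen at its random
    initialization; only real-valued scores are trained, thresholded to a
    binary mask with a straight-through estimator (STE)."""
    def __init__(self, in_f: int, out_f: int):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_f, in_f), requires_grad=False)
        self.scores = nn.Parameter(torch.randn(out_f, in_f) * 0.01)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        soft = torch.sigmoid(self.scores)
        # Forward pass uses the hard 0/1 mask; gradients flow through `soft`.
        mask = (self.scores > 0).float() + soft - soft.detach()
        return x @ (self.weight * mask).t()
```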
[407] Ensemble of Pathology Foundation Models for MIDOG 2025 Track 2: Atypical Mitosis Classification
Mieko Ochi, Bae Yuan
Main category: eess.IV
TL;DR: Leveraging pathology foundation models and ConvNeXt V2 with parameter-efficient fine-tuning and ensembling to accurately differentiate typical vs atypical mitotic figures for improved cancer prognostication.
Details
Motivation: Atypical mitotic figures strongly correlate with tumor aggressiveness but are challenging to differentiate even for expert pathologists, creating a need for automated accurate classification to improve patient prognostication and resource allocation.
Method: Used Pathology Foundation Models pre-trained on large histopathology datasets with parameter-efficient fine-tuning via low-rank adaptation. Incorporated ConvNeXt V2 architecture, employed fisheye transform to emphasize mitoses, Fourier Domain Adaptation with ImageNet targets, and ensembled multiple PFMs.
Result: Achieved competitive balanced accuracy on the Preliminary Evaluation Phase dataset.
Conclusion: The ensemble approach combining multiple pathology foundation models with advanced architectural components and domain adaptation techniques provides an effective solution for accurate mitotic figure classification.
Abstract: Mitotic figures are classified into typical and atypical variants, with atypical counts correlating strongly with tumor aggressiveness. Accurate differentiation is therefore essential for patient prognostication and resource allocation, yet remains challenging even for expert pathologists. Here, we leveraged Pathology Foundation Models (PFMs) pre-trained on large histopathology datasets and applied parameter-efficient fine-tuning via low-rank adaptation. In addition, we incorporated ConvNeXt V2, a state-of-the-art convolutional neural network architecture, to complement PFMs. During training, we employed a fisheye transform to emphasize mitoses and Fourier Domain Adaptation using ImageNet target images. Finally, we ensembled multiple PFMs to integrate complementary morphological insights, achieving competitive balanced accuracy on the Preliminary Evaluation Phase dataset.
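Low-rank adaptation, the parameter-efficient fine-tuning method used here, has a compact generic form; the sketch below uses illustrative rank and scaling values, not the authors' configuration.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Minimal LoRA sketch: the base projection is frozen and only the
    low-rank update B @ A, scaled by alpha / r, is trained. B starts at
    zero so the adapted layer initially matches the frozen one."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base.requires_grad_(False)
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * (x @ self.A.t() @ self.B.t())
```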