Today’s Research Highlights
AI-enhanced summaries of the latest research papers from arXiv.
Table of Contents
- cs.CL [Total: 83]
- cs.CV [Total: 147]
- cs.AI [Total: 58]
- cs.SD [Total: 11]
- cs.LG [Total: 94]
- cs.MA [Total: 4]
- cs.MM [Total: 2]
- eess.AS [Total: 4]
- eess.IV [Total: 17]
cs.CL
[1] eSapiens’s DEREK Module: Deep Extraction & Reasoning Engine for Knowledge with LLMs
Isaac Shi, Zeyuan Li, Fan Liu, Wenli Wang, Lewei He, Yang Yang, Tianyu Shi
Main category: cs.CL
TL;DR: DEREK is a secure enterprise document QA system that uses RAG with hybrid search, GPT-4o query refinement, and a verification mechanism to ensure all answers are properly grounded in source documents, achieving high accuracy on legal benchmarks.
Details
Motivation: Enterprise domains like legal and finance require secure, auditable, and context-faithful document question answering systems that can handle heterogeneous content while ensuring all claims are properly grounded in source materials with minimal operational overhead.
Method: The system uses a RAG pipeline with: 1) document ingestion and chunking into 1,000-token overlapping segments, 2) hybrid HNSW+BM25 indexing, 3) GPT-4o query refinement, 4) combined vector+BM25 search with Cohere reranking, 5) CO-STAR prompt engineering for LLM answering, and 6) a LangGraph verifier that enforces citation overlap and regenerates answers until all claims are grounded.
Result: On LegalBench subsets: 1000-token chunks improved Recall@50 by ~1 percentage point, hybrid search + reranking boosted Precision@10 by ~7 percentage points, the verifier raised TRACe Utilization above 0.50, and limited unsupported statements to less than 3%. The system runs securely with end-to-end TLS 1.3 and AES-256 encryption.
Conclusion: DEREK successfully delivers accurate, traceable, and production-ready document QA that meets enterprise security and auditability requirements, providing a reliable baseline for high-stakes domains through its comprehensive verification mechanisms and robust architecture.
Abstract: We present the DEREK (Deep Extraction & Reasoning Engine for Knowledge) Module, a secure and scalable Retrieval-Augmented Generation pipeline designed specifically for enterprise document question answering. Designed and implemented by eSapiens, the system ingests heterogeneous content (PDF, Office, web), splits it into 1,000-token overlapping chunks, and indexes them in a hybrid HNSW+BM25 store. User queries are refined by GPT-4o, retrieved via combined vector+BM25 search, reranked with Cohere, and answered by an LLM using CO-STAR prompt engineering. A LangGraph verifier enforces citation overlap, regenerating answers until every claim is grounded. On four LegalBench subsets, 1000-token chunks improve Recall@50 by approximately 1 pp and hybrid+rerank boosts Precision@10 by approximately 7 pp; the verifier raises TRACe Utilization above 0.50 and limits unsupported statements to less than 3%. All components run in containers and enforce end-to-end TLS 1.3 and AES-256 encryption. These results demonstrate that the DEREK module delivers accurate, traceable, and production-ready document QA with minimal operational overhead. The module is designed to meet enterprise demands for secure, auditable, and context-faithful retrieval, providing a reliable baseline for high-stakes domains such as legal and finance.
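The retrieve-then-verify loop lends itself to a compact sketch. Below is a deliberately toy Python version of the pipeline described above; the overlap width, the scoring stubs, and all function names are our assumptions, standing in for the paper's HNSW vector store, BM25 index, Cohere reranker, and LangGraph verifier.

```python
"""Toy sketch of a DEREK-style retrieve-then-verify loop (not the authors' code).
The scoring functions are simplistic stand-ins for dense vectors and BM25."""

from collections import Counter

def chunk(tokens, size=1000, overlap=200):
    # 1,000-token overlapping segments; the overlap width is our assumption.
    step = size - overlap
    return [tokens[i:i + size] for i in range(0, len(tokens), step)]

def sparse_score(query, chunk_tokens):
    # Stand-in for BM25: raw term-frequency overlap.
    tf = Counter(chunk_tokens)
    return sum(tf[t] for t in query.split())

def dense_score(query, chunk_tokens):
    # Stand-in for vector similarity: Jaccard overlap of vocabularies.
    q, c = set(query.split()), set(chunk_tokens)
    return len(q & c) / (len(q | c) or 1)

def hybrid_retrieve(query, chunks, k=3, alpha=0.5):
    # Blend the two scores, as the hybrid vector+BM25 search does.
    return sorted(chunks,
                  key=lambda c: alpha * dense_score(query, c)
                              + (1 - alpha) * sparse_score(query, c),
                  reverse=True)[:k]

def grounded(claim, context_chunks, min_overlap=0.5):
    # Verifier criterion: enough word overlap between a claim and some chunk.
    words = set(claim.split())
    return any(len(words & set(c)) / (len(words) or 1) >= min_overlap
               for c in context_chunks)

def answer_with_verification(query, chunks, generate, max_rounds=3):
    # `generate` is any callable mapping (query, context) -> answer string.
    context = hybrid_retrieve(query, chunks)
    answer = ""
    for _ in range(max_rounds):  # regenerate until every claim is grounded
        answer = generate(query, context)
        claims = [s for s in answer.split(".") if s.strip()]
        if all(grounded(c, context) for c in claims):
            break
    return answer
```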
[2] Adversarial Demonstration Learning for Low-resource NER Using Dual Similarity
Guowen Yuan, Tien-Hsuan Wu, Lianghao Xia, Ben Kao
Main category: cs.CL
TL;DR: This paper improves named entity recognition (NER) in low-resource scenarios by proposing dual similarity for demonstration selection (combining semantic and feature similarity) and adversarial demonstration training to force models to better utilize demonstration examples.
Details
Motivation: Existing demonstration-based NER methods in low-resource scenarios have two key limitations: (1) demonstration selection relies only on semantic similarity, missing important feature similarity, and (2) NER models have inadequate ability to effectively reference and utilize demonstration examples during tagging.
Method: The authors propose a two-part approach: (1) dual similarity demonstration selection that combines both semantic similarity and feature similarity for choosing better demonstration examples, and (2) adversarial demonstration training that forces the NER model to actively refer to demonstrations when performing tagging tasks.
Result: Comprehensive experiments on low-resource NER tasks show that the proposed method outperforms a range of existing methods, demonstrating significant performance improvements through better demonstration selection and training strategies.
Conclusion: The dual similarity approach for demonstration selection and adversarial training method effectively address the identified issues in demonstration-based NER, leading to superior performance in low-resource named entity recognition tasks compared to existing approaches.
Abstract: We study the problem of named entity recognition (NER) based on demonstration learning in low-resource scenarios. We identify two issues in demonstration construction and model training. Firstly, existing methods for selecting demonstration examples primarily rely on semantic similarity; we show that feature similarity can provide significant performance improvement. Secondly, we show that the NER tagger’s ability to reference demonstration examples is generally inadequate. We propose a demonstration and training approach that effectively addresses these issues. For the first issue, we propose to select examples by dual similarity, which comprises both semantic similarity and feature similarity. For the second issue, we propose to train an NER model with adversarial demonstration such that the model is forced to refer to the demonstrations when performing the tagging task. We conduct comprehensive experiments in low-resource NER tasks, and the results demonstrate that our method outperforms a range of methods.
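The selection rule reduces to a weighted combination of two similarity scores. A minimal sketch, assuming cosine similarity for both components and an equal default weight (both assumptions; the paper's exact scoring is not reproduced here):

```python
"""Sketch of dual-similarity demonstration selection: rank candidates by a
weighted sum of semantic similarity and feature similarity."""

import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def select_demonstrations(query_sem, query_feat, pool, k=5, weight=0.5):
    """pool: list of (semantic_vec, feature_vec, example) triples.
    `weight` balances the two similarities; its value here is an assumption."""
    scored = [(weight * cosine(query_sem, s)
               + (1 - weight) * cosine(query_feat, f), ex)
              for s, f, ex in pool]
    return [ex for _, ex in sorted(scored, key=lambda t: t[0], reverse=True)[:k]]
```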
[3] Small Edits, Big Consequences: Telling Good from Bad Robustness in Large Language Models
Altynbek Ismailov, Salia Asanova
Main category: cs.CL
TL;DR: This study evaluates how large language models handle different types of prompt modifications when solving coding problems, revealing that LLMs are overly robust to important semantic changes while being appropriately robust to irrelevant noise.
Details
Motivation: Current LLMs are expected to be robust to minor typos and noise in prompts, but this robustness may become problematic when they also ignore semantically important changes that should trigger different responses. The authors aim to understand where useful robustness ends and harmful insensitivity begins in code generation tasks.
Method: The researchers compiled 50 LeetCode problems and applied three types of minimal prompt perturbations: (1) progressive underspecification by deleting 10% of words per step, (2) lexical flips swapping critical quantifiers like “max” to “min”, and (3) jargon inflation replacing common nouns with obscure technical synonyms. Six frontier models including reasoning-tuned versions were tested, generating 11,853 total responses that were evaluated against original test suites.
Result: The study found a sharp double asymmetry: models remained correct in 85% of cases even with 90% of the prompt missing (over-robustness to underspecification), but only 54% adapted to critical quantifier flips that reverse the task meaning. Reasoning-tuned models were even less sensitive to semantic changes than their base versions. Jargon edits had intermediate effects at 56% pass-through rate. Masking salient anchors like function names could force model re-evaluation.
Conclusion: Current LLMs inappropriately blur the distinction between harmless noise and meaning-changing edits, often treating both as ignorable. The authors advocate for evaluation and training protocols that reward differential sensitivity - maintaining stability under benign noise while adapting or refusing when semantics truly change.
Abstract: Large language models (LLMs) now write code in settings where misreading a single word can break safety or cost money, yet we still expect them to overlook stray typos. To probe where useful robustness ends and harmful insensitivity begins, we compile 50 LeetCode problems and craft three minimal prompt perturbations that should vary in importance: (i) progressive underspecification deleting 10% of words per step; (ii) lexical flip swapping a pivotal quantifier (“max” to “min”); and (iii) jargon inflation replacing a common noun with an obscure technical synonym. Six frontier models, including three “reasoning-tuned” versions, solve each mutated prompt, and their Python outputs are checked against the original test suites to reveal whether they reused the baseline solution or adapted. Among 11,853 generations we observe a sharp double asymmetry. Models remain correct in 85% of cases even after 90% of the prompt is missing, showing over-robustness to underspecification, yet only 54% react to a single quantifier flip that reverses the task, with reasoning-tuned variants even less sensitive than their bases. Jargon edits lie in between, passing through 56%. Current LLMs thus blur the line between harmless noise and meaning-changing edits, often treating both as ignorable. Masking salient anchors such as function names can force re-evaluation. We advocate evaluation and training protocols that reward differential sensitivity: stay steady under benign noise but adapt, or refuse, when semantics truly change.
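The three perturbations are easy to approximate in a few lines. The sketch below is our re-creation, not the authors' released generator; the jargon substitution table in particular is hypothetical.

```python
"""Toy re-creations of the paper's three prompt perturbations."""

import random

def underspecify(prompt: str, fraction: float = 0.1, seed: int = 0) -> str:
    # Progressive underspecification: delete ~10% of the words per step.
    words = prompt.split()
    if not words:
        return prompt
    rng = random.Random(seed)
    keep = sorted(rng.sample(range(len(words)),
                             max(len(words) - int(len(words) * fraction), 1)))
    return " ".join(words[i] for i in keep)

def lexical_flip(prompt: str) -> str:
    # Swap a pivotal quantifier, reversing the task's meaning.
    return prompt.replace("maximum", "minimum").replace("max", "min")

def jargon_inflate(prompt: str, synonyms=None) -> str:
    # Replace common nouns with obscure technical synonyms (table assumed).
    synonyms = synonyms or {"list": "monotone sequence container",
                            "word": "lexeme"}
    for plain, fancy in synonyms.items():
        prompt = prompt.replace(plain, fancy)
    return prompt
```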
[4] Enhancing Hindi NER in Low Context: A Comparative study of Transformer-based models with vs. without Retrieval Augmentation
Sumit Singh, Rohit Mishra, Uma Shanker Tiwary
Main category: cs.CL
TL;DR: This study investigates Hindi Named Entity Recognition (NER) using pretrained encoders (MuRIL, XLM-R) and generative models (Llama-2, Llama-3, GPT-3.5) enhanced with Retrieval Augmentation (RA) from Wikipedia, showing that RA consistently improves NER performance across most models.
Details
Motivation: Named Entity Recognition (NER) remains a major challenge in natural language processing, particularly for resource-limited languages like Hindi. The study aims to improve Hindi NER performance by leveraging pretrained language models and exploring the potential of retrieval augmentation techniques to enhance model capabilities through external knowledge integration.
Method: The researchers employed Hindi-specific pretrained encoders (MuRIL and XLM-R) and generative models (Llama-2-7B, Llama-2-70B, Llama-3-70B, GPT-3.5-turbo) with and without Retrieval Augmentation (RA). They fine-tuned MuRIL, XLM-R, and Llama2-7B models, while using larger models (Llama2-70B, Llama3-70B, GPT-3.5-turbo) for few-shot NER generation. The RA approach incorporated relevant contextual data retrieved from Wikipedia to augment the training data.
Result: Models with Retrieval Augmentation outperformed baseline methods without RA in most cases. MuRIL’s macro F1 score improved from 0.69 to 0.70, while XLM-R showed significant improvement from 0.495 to 0.71 with RA. Fine-tuned Llama2-7B significantly outperformed the base model. Among generative models, GPT-3.5-turbo adapted well to RA, while Llama2-70B and Llama3-70B did not effectively utilize the retrieval context. RA showed particular effectiveness for low-context data scenarios.
Conclusion: Retrieval Augmentation significantly enhances NER performance across various pretrained models, with the effectiveness varying by model architecture. The study demonstrates that combining pretrained language models with external knowledge retrieval is a promising approach for improving NER in resource-limited languages like Hindi, providing valuable insights for data augmentation strategies and model optimization in low-resource language processing.
Abstract: One major challenge in natural language processing is named entity recognition (NER), which identifies and categorises named entities in textual input. In order to improve NER, this study investigates a Hindi NER technique that makes use of Hindi-specific pretrained encoders (MuRIL and XLM-R) and generative models (Llama-2-7B-chat-hf (Llama2-7B), Llama-2-70B-chat-hf (Llama2-70B), Llama-3-70B-Instruct (Llama3-70B) and GPT3.5-turbo), and augments the data with retrieved data from external relevant contexts, notably from Wikipedia. We have fine-tuned MuRIL, XLM-R and Llama2-7B with and without RA, whereas Llama2-70B, Llama3-70B and GPT3.5-turbo are utilised for few-shot NER generation. Our investigation shows that the mentioned language models (LMs) with Retrieval Augmentation (RA) outperform baseline methods that don’t incorporate RA in most cases. The macro F1 scores for MuRIL and XLM-R are 0.69 and 0.495, respectively, without RA and increase to 0.70 and 0.71, respectively, in the presence of RA. Fine-tuned Llama2-7B outperforms the base Llama2-7B by a significant margin. On the other hand, the generative models which are not fine-tuned also perform better with augmented data. GPT3.5-turbo adapted to RA well; however, Llama2-70B and Llama3-70B did not adapt to RA with our retrieval context. The findings show that RA significantly improves performance, especially for low-context data. This study adds significant knowledge about how best to use data augmentation methods and pretrained models to enhance NER performance, particularly in languages with limited resources.
[5] Learning without training: The implicit dynamics of in-context learning
Benoit Dherin, Michael Munn, Hanna Mazzawi, Michael Wunder, Javier Gonzalvo
Main category: cs.CL
TL;DR: This paper investigates how Large Language Models achieve in-context learning by showing that transformer blocks can implicitly modify MLP weights based on context, potentially explaining LLMs’ ability to learn new patterns at inference time without weight updates.
Details
Motivation: The mechanisms behind in-context learning in LLMs are largely unknown despite being one of their most striking features. Understanding how LLMs can learn new patterns at inference time without weight updates is crucial for comprehending transformer capabilities.
Method: The authors analyze the interaction between self-attention and MLP layers in transformer blocks, using both theoretical analysis and experimentation. They examine how the stacking of these components allows implicit weight modification of the MLP layer based on context.
Result: Under mild simplifying assumptions, the study demonstrates that a transformer block can implicitly transform context into a low-rank weight-update of the MLP layer, providing a mechanistic explanation for in-context learning.
Conclusion: The simple mechanism of self-attention and MLP layer interaction may be the fundamental reason why LLMs can learn in context during inference rather than only during training, offering new insights into transformer architecture capabilities.
Abstract: One of the most striking features of Large Language Models (LLMs) is their ability to learn in context. Namely, at inference time an LLM is able to learn new patterns without any additional weight update when these patterns are presented in the form of examples in the prompt, even if these patterns were not seen during training. The mechanisms through which this can happen are still largely unknown. In this work, we show that the stacking of a self-attention layer with an MLP allows the transformer block to implicitly modify the weights of the MLP layer according to the context. We argue through theory and experimentation that this simple mechanism may be the reason why LLMs can learn in context and not only during training. Specifically, we show under mild simplifying assumptions how a transformer block implicitly transforms a context into a low-rank weight-update of the MLP layer.
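The core identity, that a context contribution entering through the residual stream can be absorbed into a rank-1 update of a (linearized) MLP weight, can be checked numerically. A minimal numpy illustration under those simplifying assumptions (dimensions and values are arbitrary, and the linearization is ours):

```python
"""Numerical check: folding an attention-derived context vector into a
rank-1 update of the MLP weight reproduces the same block output."""

import numpy as np

rng = np.random.default_rng(0)
d = 8
W = rng.normal(size=(d, d))   # (linearized) MLP weight
x = rng.normal(size=d)        # query-token representation
a = rng.normal(size=d)        # context contribution from self-attention

# Block output with the context folded into the residual stream:
with_context = W @ (x + a)

# Same output from the context-free input and a rank-1 weight update:
delta_W = np.outer(W @ a, x) / (x @ x)   # rank-1 by construction
without_context = (W + delta_W) @ x

assert np.allclose(with_context, without_context)
print("max abs diff:", np.abs(with_context - without_context).max())
```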
[6] Help Me Write a Story: Evaluating LLMs’ Ability to Generate Writing Feedback
Hannah Rashkin, Elizabeth Clark, Fantine Huot, Mirella Lapata
Main category: cs.CL
TL;DR: This paper investigates whether LLMs can provide meaningful writing feedback to creative writers by creating a corrupted story dataset and evaluating model performance through both automatic and human metrics.
Details
Motivation: To explore the potential of LLMs as writing assistants for creative writers and understand the challenges and limitations of AI-generated writing feedback in a systematic way.
Method: The researchers created a novel test dataset of 1,300 stories that were intentionally corrupted to introduce specific writing issues, then evaluated commonly used LLMs using both automatic metrics and human evaluation to assess their feedback quality.
Result: LLMs demonstrated strong out-of-the-box performance in providing specific and mostly accurate writing feedback, but they struggled with identifying the most significant writing issues in stories and determining appropriate tone (critical vs. positive feedback).
Conclusion: While current LLMs show promise for writing assistance with their ability to provide specific and accurate feedback, they have notable limitations in prioritizing major issues and calibrating feedback tone, indicating room for improvement in AI writing support systems.
Abstract: Can LLMs provide support to creative writers by giving meaningful writing feedback? In this paper, we explore the challenges and limitations of model-generated writing feedback by defining a new task, dataset, and evaluation frameworks. To study model performance in a controlled manner, we present a novel test set of 1,300 stories that we corrupted to intentionally introduce writing issues. We study the performance of commonly used LLMs in this task with both automatic and human evaluation metrics. Our analysis shows that current models have strong out-of-the-box behavior in many respects – providing specific and mostly accurate writing feedback. However, models often fail to identify the biggest writing issue in the story and to correctly decide when to offer critical vs. positive feedback.
[7] mRAKL: Multilingual Retrieval-Augmented Knowledge Graph Construction for Low-Resourced Languages
Hellina Hailu Nigatu, Min Li, Maartje ter Hoeve, Saloni Potdar, Sarah Chasins
Main category: cs.CL
TL;DR: The paper introduces mRAKL, a Retrieval-Augmented Generation system that reformulates multilingual Knowledge Graph Construction as a Question Answering task, achieving improved performance for low-resource languages like Tigrinya and Amharic through cross-lingual transfer from Arabic and English.
Details
Motivation: Knowledge Graph Construction in multilingual settings, particularly for low-resource languages, faces challenges in automatically constructing or predicting missing entities and relationships. Traditional approaches may not effectively leverage cross-lingual information or handle the complexity of multilingual knowledge representation.
Method: The authors reformulate multilingual Knowledge Graph Construction (mKGC) as a Question Answering task using mRAKL, a Retrieval-Augmented Generation (RAG) based system. The method uses head entities and linking relations as questions, with the model predicting tail entities as answers. They employ a BM25 retriever and experiment with cross-lingual transfer from higher-resourced languages (Arabic and English) to low-resourced languages (Tigrinya and Amharic).
Result: The RAG-based approach shows improved performance over no-context settings. With an idealized retrieval system, mRAKL achieves accuracy improvements of 4.92 percentage points for Tigrinya and 8.79 percentage points for Amharic. The experiments demonstrate the effectiveness of cross-lingual transfer for low-resource language knowledge graph construction.
Conclusion: The study successfully demonstrates that reformulating multilingual Knowledge Graph Construction as a Question Answering task with Retrieval-Augmented Generation is effective, particularly for low-resource languages. The approach shows promising results for cross-lingual transfer and highlights the potential of RAG-based systems in multilingual knowledge graph tasks.
Abstract: Knowledge Graphs represent real-world entities and the relationships between them. Multilingual Knowledge Graph Construction (mKGC) refers to the task of automatically constructing or predicting missing entities and links for knowledge graphs in a multilingual setting. In this work, we reformulate the mKGC task as a Question Answering (QA) task and introduce mRAKL: a Retrieval-Augmented Generation (RAG) based system to perform mKGC. We achieve this by using the head entity and linking relation in a question, and having our model predict the tail entity as an answer. Our experiments focus primarily on two low-resourced languages: Tigrinya and Amharic. We experiment with using higher-resourced languages Arabic and English for cross-lingual transfer. With a BM25 retriever, we find that the RAG-based approach improves performance over a no-context setting. Further, our ablation studies show that with an idealized retrieval system, mRAKL improves accuracy by 4.92 and 8.79 percentage points for Tigrinya and Amharic, respectively.
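Reformulating triple completion as retrieval-augmented QA is straightforward to sketch. The snippet below uses the third-party rank-bm25 package for the BM25 retriever; the prompt template, toy corpus, and question wording are our guesses, not the paper's.

```python
"""Sketch of KG triple completion recast as retrieval-augmented QA."""

from rank_bm25 import BM25Okapi  # pip install rank-bm25

corpus = [
    "Addis Ababa is the capital city of Ethiopia.",
    "Asmara is the capital city of Eritrea.",
]
bm25 = BM25Okapi([doc.lower().split() for doc in corpus])

def kgc_as_qa(head: str, relation: str, k: int = 1) -> str:
    # Head entity + relation become the question; the tail entity is the answer.
    question = f"What is the {relation} of {head}?"
    scores = bm25.get_scores(question.lower().split())
    top = sorted(range(len(corpus)), key=lambda i: scores[i], reverse=True)[:k]
    context = " ".join(corpus[i] for i in top)
    return f"Context: {context}\nQuestion: {question}\nAnswer:"

print(kgc_as_qa("Ethiopia", "capital"))
```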
[8] AutoMeet: a proof-of-concept study of genAI to automate meetings in automotive engineering
Simon Baeuerle, Max Radyschevski, Ulrike Pado
Main category: cs.CL
TL;DR: This paper presents an end-to-end GenAI pipeline for automating meeting documentation and knowledge management in engineering departments, including transcription, minutes generation, and searchable chatbot interface, tested in real-world settings with survey data on technical and ethical aspects.
Details
Motivation: Large organizations face challenges with knowledge sharing through meetings, which consume significant work time and produce inconsistent documentation (minutes, notes, presentations), making information retrieval difficult and necessitating frequent update meetings and high-frequency schedules.
Method: Implementation of an end-to-end pipeline using Generative AI (particularly Large Language Models) to automate the meeting documentation workflow: recording meetings, generating minutes through genAI, and creating a searchable chatbot interface for easy information retrieval. The system was tested in a real-world engineering department with extensive survey data collection on technical and ethical aspects.
Result: The real-world testing revealed three key findings: (a) users agreed that genAI could significantly reduce meeting effort, (b) technical aspects are largely already solved, and (c) organizational aspects are crucial for successful ethical usage of such systems.
Conclusion: GenAI-based meeting documentation systems show promise for reducing meeting overhead and improving knowledge management in engineering departments, with technical feasibility demonstrated but organizational and ethical considerations being critical factors for successful implementation.
Abstract: In large organisations, knowledge is mainly shared in meetings, which take up significant amounts of work time. Additionally, frequent in-person meetings produce inconsistent documentation – official minutes, personal notes, and presentations may or may not exist. Shared information therefore becomes hard to retrieve outside of the meeting, necessitating lengthy updates and high-frequency meeting schedules. Generative Artificial Intelligence (genAI) models like Large Language Models (LLMs) exhibit an impressive performance on spoken and written language processing. This motivates a practical usage of genAI for knowledge management in engineering departments: using genAI for transcribing meetings and integrating heterogeneous additional information sources into an easily usable format for ad-hoc searches. We implement an end-to-end pipeline to automate the entire meeting documentation workflow in a proof-of-concept state: meetings are recorded and minutes are created by genAI. These are further made easily searchable through a chatbot interface. The core of our work is to test this genAI-based software tooling in a real-world engineering department and collect extensive survey data on both ethical and technical aspects. Direct feedback from this real-world setup points out both opportunities and risks: a) users agree that the effort for meetings could be significantly reduced with the help of genAI models, b) technical aspects are largely solved already, c) organizational aspects are crucial for a successful ethical usage of such a system.
[9] Step-Audio 2 Technical Report
Boyong Wu, Chao Yan, Chen Hu, Cheng Yi, Chengli Feng, Fei Tian, Feiyu Shen, Gang Yu, Haoyang Zhang, Jingbei Li, Mingrui Chen, Peng Liu, Wang You, Xiangyu Tony Zhang, Xingyuan Li, Xuerui Yang, Yayue Deng, Yechang Huang, Yuxin Li, Yuxin Zhang, Zhao You, Brian Li, Changyi Wan, Hanpeng Hu, Jiangjie Zhen, Siyu Chen, Song Yuan, Xuelin Zhang, Yimin Jiang, Yu Zhou, Yuxiang Yang, Bingxin Li, Buyun Ma, Changhe Song, Dongqing Pang, Guoqiang Hu, Haiyang Sun, Kang An, Na Wang, Shuli Gao, Wei Ji, Wen Li, Wen Sun, Xuan Wen, Yong Ren, Yuankai Ma, Yufan Lu, Bin Wang, Bo Li, Changxin Miao, Che Liu, Chen Xu, Dapeng Shi, Dingyuan Hu, Donghang Wu, Enle Liu, Guanzhe Huang, Gulin Yan, Han Zhang, Hao Nie, Haonan Jia, Hongyu Zhou, Jianjian Sun, Jiaoren Wu, Jie Wu, Jie Yang, Jin Yang, Junzhe Lin, Kaixiang Li, Lei Yang, Liying Shi, Li Zhou, Longlong Gu, Ming Li, Mingliang Li, Mingxiao Li, Nan Wu, Qi Han, Qinyuan Tan, Shaoliang Pang, Shengjie Fan, Siqi Liu, Tiancheng Cao, Wanying Lu, Wenqing He, Wuxun Xie, Xu Zhao, Xueqi Li, Yanbo Yu, Yang Yang, Yi Liu, Yifan Lu, Yilei Wang, Yuanhao Ding, Yuanwei Liang, Yuanwei Lu, Yuchu Luo, Yuhe Yin, Yumeng Zhan, Yuxiang Zhang, Zidong Yang, Zixin Zhang, Binxing Jiao, Daxin Jiang, Heung-Yeung Shum, Jiansheng Chen, Jing Li, Xiangyu Zhang, Yibo Zhu
Main category: cs.CL
TL;DR: Step-Audio 2 is an end-to-end multi-modal large language model that combines latent audio encoding, reinforcement learning, and retrieval-augmented generation to achieve state-of-the-art performance in audio understanding and speech conversation, trained on millions of hours of audio data.
Details
Motivation: To develop an industry-strength audio understanding and speech conversation system that can handle genuine end-to-end speech conversation with enhanced responsiveness to paralinguistic information like speaking styles and emotions, while mitigating hallucination through external tool integration.
Method: The model integrates a latent audio encoder with reasoning-centric reinforcement learning (RL), incorporates discrete audio token generation into language modeling, and uses retrieval-augmented generation (RAG) with external tools like web search and audio search for timbre switching.
Result: Step-Audio 2 achieves state-of-the-art performance on various audio understanding and conversational benchmarks compared to other open-source and commercial solutions, demonstrating intelligence and expressiveness across diverse conversational scenarios.
Conclusion: Step-Audio 2 successfully delivers an end-to-end multi-modal solution for audio understanding and speech conversation that outperforms existing solutions by effectively combining advanced encoding techniques, reinforcement learning, and external tool integration to handle complex audio-linguistic tasks.
Abstract: This paper presents Step-Audio 2, an end-to-end multi-modal large language model designed for industry-strength audio understanding and speech conversation. By integrating a latent audio encoder and reasoning-centric reinforcement learning (RL), Step-Audio 2 achieves promising performance in automatic speech recognition (ASR) and audio understanding. To facilitate genuine end-to-end speech conversation, Step-Audio 2 incorporates the generation of discrete audio tokens into language modeling, significantly enhancing its responsiveness to paralinguistic information such as speaking styles and emotions. To effectively leverage the rich textual and acoustic knowledge in real-world data, Step-Audio 2 integrates retrieval-augmented generation (RAG) and is able to call external tools such as web search to mitigate hallucination and audio search to switch timbres. Trained on millions of hours of speech and audio data, Step-Audio 2 delivers intelligence and expressiveness across diverse conversational scenarios. Evaluation results demonstrate that Step-Audio 2 achieves state-of-the-art performance on various audio understanding and conversational benchmarks compared to other open-source and commercial solutions. Please visit https://github.com/stepfun-ai/Step-Audio2 for more information.
[10] Deep Researcher with Test-Time Diffusion
Rujun Han, Yanfei Chen, Zoey CuiZhu, Lesly Miculicich, Guan Sun, Yuanjun Bi, Weiming Wen, Hui Wan, Chunfeng Wen, Solène Maître, George Lee, Vishy Tirumalashetty, Emily Xue, Zizhao Zhang, Salem Haykal, Burak Gokturk, Tomas Pfister, Chen-Yu Lee
Main category: cs.CL
TL;DR: The paper introduces TTD-DR (Test-Time Diffusion Deep Researcher), a novel framework that treats research report generation as a diffusion process, starting with a preliminary draft and iteratively refining it through denoising with dynamic retrieval, achieving state-of-the-art performance on complex research tasks.
Details
Motivation: Current deep research agents powered by LLMs experience performance plateaus when generating complex, long-form research reports using generic test-time scaling algorithms. The authors aim to address this limitation by mimicking human research patterns of iterative searching, reasoning, and revision.
Method: TTD-DR conceptualizes research report generation as a diffusion process that begins with a preliminary draft serving as an evolving foundation. The framework iteratively refines this draft through a “denoising” process dynamically informed by a retrieval mechanism that incorporates external information at each step. A self-evolutionary algorithm is applied to each component of the agentic workflow to ensure high-quality context generation.
Result: TTD-DR achieves state-of-the-art results on benchmarks requiring intensive search and multi-hop reasoning, significantly outperforming existing deep research agents. The draft-centric design makes the report writing process more timely and coherent while reducing information loss during iterative search.
Conclusion: The TTD-DR framework successfully addresses the limitations of current research agents by introducing a diffusion-based approach that maintains coherence and reduces information loss. The method demonstrates superior performance on complex research tasks, establishing a new state-of-the-art for automated research report generation.
Abstract: Deep research agents, powered by Large Language Models (LLMs), are rapidly advancing; yet, their performance often plateaus when generating complex, long-form research reports using generic test-time scaling algorithms. Drawing inspiration from the iterative nature of human research, which involves cycles of searching, reasoning, and revision, we propose the Test-Time Diffusion Deep Researcher (TTD-DR). This novel framework conceptualizes research report generation as a diffusion process. TTD-DR initiates this process with a preliminary draft, an updatable skeleton that serves as an evolving foundation to guide the research direction. The draft is then iteratively refined through a “denoising” process, which is dynamically informed by a retrieval mechanism that incorporates external information at each step. The core process is further enhanced by a self-evolutionary algorithm applied to each component of the agentic workflow, ensuring the generation of high-quality context for the diffusion process. This draft-centric design makes the report writing process more timely and coherent while reducing information loss during the iterative search process. We demonstrate that our TTD-DR achieves state-of-the-art results on a wide array of benchmarks that require intensive search and multi-hop reasoning, significantly outperforming existing deep research agents.
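The draft-as-diffusion loop can be summarized in a few lines of Python. All callables here (retrieve, revise, quality) are hypothetical stand-ins for the agentic components, and the stopping rule is our assumption:

```python
"""Pseudocode-level sketch of a TTD-DR-style loop: a preliminary draft is
iteratively "denoised" with freshly retrieved evidence."""

def ttd_dr(question, retrieve, revise, quality, steps=5):
    draft = "PRELIMINARY DRAFT: " + question    # updatable skeleton
    for _ in range(steps):
        # Retrieval is conditioned on the current draft, not just the question.
        evidence = retrieve(question, draft)
        new_draft = revise(draft, evidence)     # one denoising step
        if quality(new_draft) <= quality(draft):
            break                               # stop once refinement stalls
        draft = new_draft
    return draft
```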
[11] The Prompt Makes the Person(a): A Systematic Evaluation of Sociodemographic Persona Prompting for Large Language Models
Marlene Lutz, Indira Sen, Georg Ahnert, Elisa Rogers, Markus Strohmaier
Main category: cs.CL
TL;DR: This study examines how different persona prompting strategies affect LLMs’ ability to simulate diverse sociodemographic groups, finding that prompt formulation significantly impacts simulation fidelity, with interview-style formats and name-based priming reducing stereotyping.
Details
Motivation: Large language models are increasingly used to simulate views of various sociodemographic groups through persona prompting, but concerns exist about the fidelity of such simulations because the way a persona prompt is formulated can significantly affect outcomes.
Method: Systematic examination using five open-source LLMs to test different persona prompt strategies (role adoption formats and demographic priming strategies) across 15 intersectional demographic groups in both open- and closed-ended tasks.
Result: LLMs struggle to simulate marginalized groups (particularly nonbinary, Hispanic, and Middle Eastern identities); interview-style formatting and name-based priming reduce stereotyping and improve alignment; smaller models like OLMo-2-7B outperform larger ones like Llama-3.3-70B.
Conclusion: The choice of demographic priming and role adoption strategy significantly impacts LLM portrayal of sociodemographic groups, and the study provides actionable guidance for designing better sociodemographic persona prompts in LLM-based simulation studies.
Abstract: Persona prompting is increasingly used in large language models (LLMs) to simulate views of various sociodemographic groups. However, how a persona prompt is formulated can significantly affect outcomes, raising concerns about the fidelity of such simulations. Using five open-source LLMs, we systematically examine how different persona prompt strategies, specifically role adoption formats and demographic priming strategies, influence LLM simulations across 15 intersectional demographic groups in both open- and closed-ended tasks. Our findings show that LLMs struggle to simulate marginalized groups, particularly nonbinary, Hispanic, and Middle Eastern identities, but that the choice of demographic priming and role adoption strategy significantly impacts their portrayal. Specifically, we find that prompting in an interview-style format and name-based priming can help reduce stereotyping and improve alignment. Surprisingly, smaller models like OLMo-2-7B outperform larger ones such as Llama-3.3-70B. Our findings offer actionable guidance for designing sociodemographic persona prompts in LLM-based simulation studies.
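The two axes the paper varies, role adoption format and demographic priming, are easiest to see as prompt templates. The wording below is ours, for illustration only:

```python
"""Illustrative persona-prompt variants; the template text is hypothetical."""

def persona_prompt(persona: str, question: str, style: str = "interview") -> str:
    if style == "interview":       # interview-style role adoption
        return (f"Interviewer: Please tell me about yourself.\n"
                f"Respondent: I am {persona}.\n"
                f"Interviewer: {question}\nRespondent:")
    return f"You are {persona}. {question}"      # plain instruction format

def name_primed(name: str, question: str) -> str:
    # Name-based demographic priming instead of explicit group labels.
    return f"{name} was asked: {question}\n{name} replied:"

print(persona_prompt("a 34-year-old Hispanic woman",
                     "How do you feel about remote work?"))
print(name_primed("Guadalupe", "How do you feel about remote work?"))
```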
[12] Speech as a Multimodal Digital Phenotype for Multi-Task LLM-based Mental Health Prediction
Mai Ali, Christopher Lucasius, Tanmay P. Patel, Madison Aitken, Jacob Vorstman, Peter Szatmari, Marco Battaglia, Deepa Kundur
Main category: cs.CL
TL;DR: This paper proposes a trimodal approach using speech-derived text, acoustic landmarks, and vocal biomarkers with multi-task learning to predict adolescent depression, suicidal ideation, and sleep disturbances, achieving 70.8% balanced accuracy on longitudinal data.
Details
Motivation: Speech is typically treated as a single modality for mental health analysis, but adolescent depression often co-occurs with multiple disorders like suicidal ideation and sleep disturbances, creating an opportunity to leverage multimodal speech data and multi-task learning for more comprehensive mental health prediction.
Method: The authors develop a trimodal multimedia approach that integrates three speech modalities (speech-derived text, acoustic landmarks, and vocal biomarkers) using large language model-based architectures. They employ multi-task learning (MTL) to simultaneously predict depression, suicidal ideation, and sleep disturbances, and implement longitudinal analysis to model temporal changes across multiple clinical interactions.
Result: The proposed trimodal, longitudinal multi-task learning approach achieved a balanced accuracy of 70.8% on the Depression Early Warning dataset, outperforming unimodal, single-task, and non-longitudinal baseline methods.
Conclusion: Treating speech as a trimodal data source combined with multi-task learning and longitudinal analysis provides superior performance for depression detection compared to traditional single-modality approaches, demonstrating the value of comprehensive multimodal frameworks for mental health prediction.
Abstract: Speech is a noninvasive digital phenotype that can offer valuable insights into mental health conditions, but it is often treated as a single modality. In contrast, we propose the treatment of patient speech data as a trimodal multimedia data source for depression detection. This study explores the potential of large language model-based architectures for speech-based depression prediction in a multimodal regime that integrates speech-derived text, acoustic landmarks, and vocal biomarkers. Adolescent depression presents a significant challenge and is often comorbid with multiple disorders, such as suicidal ideation and sleep disturbances. This presents an additional opportunity to integrate multi-task learning (MTL) into our study by simultaneously predicting depression, suicidal ideation, and sleep disturbances using the multimodal formulation. We also propose a longitudinal analysis strategy that models temporal changes across multiple clinical interactions, allowing for a comprehensive understanding of the conditions’ progression. Our proposed approach, featuring trimodal, longitudinal MTL is evaluated on the Depression Early Warning dataset. It achieves a balanced accuracy of 70.8%, which is higher than each of the unimodal, single-task, and non-longitudinal methods.
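A schematic PyTorch sketch of a trimodal multi-task head follows; the fusion-by-concatenation design, layer sizes, and binary heads are our assumptions, not the paper's reported architecture.

```python
"""Minimal trimodal multi-task model: three modality inputs, one shared
fusion layer, one binary head per clinical condition."""

import torch
import torch.nn as nn

class TrimodalMTL(nn.Module):
    def __init__(self, d_text=768, d_landmark=128, d_biomarker=64, d_hidden=256):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Linear(d_text + d_landmark + d_biomarker, d_hidden), nn.ReLU())
        # One binary head per condition: depression, suicidal ideation, sleep.
        self.heads = nn.ModuleDict(
            {task: nn.Linear(d_hidden, 1)
             for task in ("depression", "ideation", "sleep")})

    def forward(self, text, landmarks, biomarkers):
        z = self.fuse(torch.cat([text, landmarks, biomarkers], dim=-1))
        return {task: head(z).squeeze(-1) for task, head in self.heads.items()}

model = TrimodalMTL()
out = model(torch.randn(4, 768), torch.randn(4, 128), torch.randn(4, 64))
# Multi-task loss: sum of per-task binary cross-entropies.
labels = {t: torch.randint(0, 2, (4,)).float() for t in out}
loss = sum(nn.functional.binary_cross_entropy_with_logits(out[t], labels[t])
           for t in out)
```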
[13] Efficient Compositional Multi-tasking for On-device Large Language Models
Ondrej Bohdal, Mete Ozay, Jijoong Moon, Kyeng-Hun Lee, Hyeonmok Ko, Umberto Michieli
Main category: cs.CL
TL;DR: This paper introduces a benchmark and efficient method for compositional multi-tasking in LLMs on resource-constrained devices, where single test examples require simultaneous execution of multiple tasks like translation and summarization.
Details
Motivation: Existing work on task merging in LLMs is limited to single-task scenarios per test example. There's a need to address compositional multi-tasking where each test example involves simultaneous execution of multiple tasks, especially in on-device settings with computational constraints.
Method: The authors propose Learnable Calibration, an efficient method designed for on-device applications. They also create a benchmark comprising four practically relevant compositional tasks to facilitate research in this domain.
Result: The paper establishes a benchmark for compositional multi-tasking and demonstrates that their Learnable Calibration method is both resource-efficient and high-performing for on-device applications with limited computational resources.
Conclusion: The work lays groundwork for advancing LLM capabilities in real-world multi-tasking scenarios, expanding applicability to complex, resource-constrained use cases by enabling simultaneous execution of multiple tasks within single test examples.
Abstract: Adapter parameters provide a mechanism to modify the behavior of machine learning models and have gained significant popularity in the context of large language models (LLMs) and generative AI. These parameters can be merged to support multiple tasks via a process known as task merging. However, prior work on merging in LLMs, particularly in natural language processing, has been limited to scenarios where each test example addresses only a single task. In this paper, we focus on on-device settings and study the problem of text-based compositional multi-tasking, where each test example involves the simultaneous execution of multiple tasks. For instance, generating a translated summary of a long text requires solving both translation and summarization tasks concurrently. To facilitate research in this setting, we propose a benchmark comprising four practically relevant compositional tasks. We also present an efficient method (Learnable Calibration) tailored for on-device applications, where computational resources are limited, emphasizing the need for solutions that are both resource-efficient and high-performing. Our contributions lay the groundwork for advancing the capabilities of LLMs in real-world multi-tasking scenarios, expanding their applicability to complex, resource-constrained use cases.
[14] BIDWESH: A Bangla Regional Based Hate Speech Detection Dataset
Azizul Hakim Fayaz, MD. Shorif Uddin, Rayhan Uddin Bhuiyan, Zakia Sultana, Md. Samiul Islam, Bidyarthi Paul, Tashreef Muhammad, Shahriar Manzoor
Main category: cs.CL
TL;DR: This paper introduces BIDWESH, the first multi-dialectal Bangla hate speech dataset covering three regional dialects (Barishal, Noakhali, and Chittagong) with 9,183 annotated instances to improve hate speech detection in linguistically diverse Bangladesh.
Details
Motivation: Existing hate speech detection systems for Bangla only work with standard language and fail to address informal, culturally rich expressions in regional dialects, resulting in limited detection capability and biased moderation that leaves harmful dialectal content undetected in Bangladesh's linguistically diverse digital platforms.
Method: The researchers created BIDWESH by translating and annotating 9,183 instances from the existing BD-SHS corpus into three major regional Bangla dialects (Barishal, Noakhali, and Chittagong), with manual verification and labeling for hate presence, type (slander, gender, religion, call to violence), and target group (individual, male, female, group).
Result: The study successfully created a linguistically rich, balanced, and inclusive multi-dialectal hate speech dataset that provides comprehensive coverage of regional Bangla dialects with detailed annotations for hate speech characteristics and target groups.
Conclusion: BIDWESH establishes the foundation for developing dialect-sensitive NLP tools and significantly contributes to more equitable and context-aware content moderation systems for low-resource language settings, particularly addressing the gap in hate speech detection for regional Bangla dialects.
Abstract: Hate speech on digital platforms has become a growing concern globally, especially in linguistically diverse countries like Bangladesh, where regional dialects play a major role in everyday communication. Despite progress in hate speech detection for standard Bangla, existing datasets and systems fail to address the informal and culturally rich expressions found in dialects such as Barishal, Noakhali, and Chittagong. This oversight results in limited detection capability and biased moderation, leaving large sections of harmful content unaccounted for. To address this gap, this study introduces BIDWESH, the first multi-dialectal Bangla hate speech dataset, constructed by translating and annotating 9,183 instances from the BD-SHS corpus into three major regional dialects. Each entry was manually verified and labeled for hate presence, type (slander, gender, religion, call to violence), and target group (individual, male, female, group), ensuring linguistic and contextual accuracy. The resulting dataset provides a linguistically rich, balanced, and inclusive resource for advancing hate speech detection in Bangla. BIDWESH lays the groundwork for the development of dialect-sensitive NLP tools and contributes significantly to equitable and context-aware content moderation in low-resource language settings.
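For illustration, one plausible shape of an annotated record under the labeling scheme described above; the field names and example values are hypothetical, not taken from the released dataset.

```python
# Hypothetical BIDWESH-style record (schema and values are ours).
example = {
    "text": "<dialectal Bangla sentence>",
    "dialect": "Noakhali",      # Barishal | Noakhali | Chittagong
    "hate": True,
    "type": "slander",          # slander | gender | religion | call to violence
    "target": "individual",     # individual | male | female | group
}
```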
[15] Omni-Router: Sharing Routing Decisions in Sparse Mixture-of-Experts for Speech Recognition
Zijin Gu, Tatiana Likhomanenko, Navdeep Jaitly
Main category: cs.CL
TL;DR: The paper proposes Omni-router Transformer, which uses a shared router across different MoE layers instead of independent routers, achieving better performance in automatic speech recognition with 11.2% and 8.2% word error rate reductions compared to dense and Switch Transformer models respectively.
Details
Motivation: Traditional MoE methods like Switch Transformer route experts independently within each layer, but analysis shows that routers in most layers make expert choices that are not strongly correlated with choices in other layers. This lack of cooperation between experts across layers limits specialization potential.
Method: The authors propose using a shared router across different MoE layers instead of independent routers per layer. This approach, called Omni-router Transformer, is designed to increase cooperation between experts in different layers and encourage greater specialization.
Result: Extensive experiments on large-scale pseudo-labeled datasets and evaluations across 10 diverse out-of-domain ASR benchmarks show that Omni-router Transformer achieves lower training loss and consistently outperforms both dense models (11.2% word error rate reduction) and Switch Transformer models (8.2% word error rate reduction), while providing structured expert usage and improved robustness.
Conclusion: The Omni-router Transformer successfully addresses the limitation of independent expert routing in traditional MoE architectures by implementing shared routing across layers, resulting in improved ASR performance, better expert specialization, and enhanced robustness to diverse data compared to existing approaches.
Abstract: Mixture-of-experts (MoE) architectures have expanded from language modeling to automatic speech recognition (ASR). Traditional MoE methods, such as the Switch Transformer, route experts independently within each layer. Our analysis reveals that routers in most layers make expert choices that are not strongly correlated with the choices of the routers in other layers. To increase the cooperation between experts in different layers and encourage greater specialization, we use a shared router across different MoE layers. We call this model Omni-router Transformer. Extensive experiments on a large-scale pseudo-labeled dataset and evaluations across 10 diverse, out-of-domain ASR benchmarks demonstrate that the Omni-router Transformer is able to achieve lower training loss and consistently outperform dense and Switch Transformer models, reducing average word error rates by 11.2% and 8.2%, respectively, while providing structured expert usage and improved robustness to diverse data.
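The architectural change is small: one routing network reused at every MoE layer, rather than a fresh router per layer. A toy PyTorch sketch, assuming top-1 routing and arbitrary sizes (both our choices):

```python
"""Sketch of shared routing across MoE layers: one router coordinates
expert choices at every depth, while each layer keeps its own experts."""

import torch
import torch.nn as nn

class SharedRouterMoE(nn.Module):
    def __init__(self, d_model=256, n_experts=4, n_layers=3):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)   # shared across layers
        self.layers = nn.ModuleList([
            nn.ModuleList([nn.Sequential(nn.Linear(d_model, d_model), nn.ReLU())
                           for _ in range(n_experts)])
            for _ in range(n_layers)])

    def forward(self, x):                       # x: (batch, d_model)
        for experts in self.layers:
            probs = self.router(x).softmax(-1)  # same router at every layer
            idx = probs.argmax(-1)              # top-1 expert per token
            out = torch.stack([experts[int(i)](x[b])
                               for b, i in enumerate(idx)])
            gate = probs.gather(-1, idx.unsqueeze(-1))
            x = x + gate * out                  # residual, gated expert output
        return x

moe = SharedRouterMoE()
y = moe(torch.randn(8, 256))
```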
[16] Do Large Language Models Have a Planning Theory of Mind? Evidence from MindGames: a Multi-Step Persuasion Task
Jared Moore, Ned Cooper, Rasmus Overmark, Beba Cibralic, Nick Haber, Cameron R. Jones
Main category: cs.CL
TL;DR: This paper introduces MindGames, a novel “planning theory of mind” (PToM) task that tests whether agents can infer others’ beliefs and desires to persuade them, finding that humans significantly outperform the o1-preview LLM in social reasoning despite the LLM’s superior planning abilities when preferences are given.
Details
Motivation: While recent evidence suggests LLMs display Theory of Mind abilities, most ToM experiments only test spectatorial prediction/interpretation rather than the dynamic planning and strategic intervention aspects of human ToM. There's a need to evaluate practical use cases of ToM where agents must actively infer and manipulate others' mental states to achieve goals.
Method: The researchers developed MindGames, a “planning theory of mind” task that requires agents to infer an interlocutor’s beliefs and desires in order to persuade them to change their behavior. They compared human performance against o1-preview LLM on this task, along with a baseline condition that requires similar planning complexity but minimal mental state inference (where preferences are already provided).
Result: Humans significantly outperformed o1-preview on the PToM task (11% higher performance, p=0.006). However, o1-preview outperformed humans in the baseline planning condition when mental state inference wasn’t required. The results suggest humans have an implicit causal model of other agents that helps them know to ask about people’s preferences, while LLMs lack this intuitive social reasoning.
Conclusion: There exists a significant gap between human-like social reasoning and current LLM abilities. While LLMs may excel at planning tasks when given explicit information, they struggle with the implicit social understanding required for effective theory of mind in interactive contexts. This highlights limitations in LLMs’ ability to model and strategically influence others’ mental states.
Abstract: Recent evidence suggests Large Language Models (LLMs) display Theory of Mind (ToM) abilities. Most ToM experiments place participants in a spectatorial role, wherein they predict and interpret other agents’ behavior. However, human ToM also contributes to dynamically planning action and strategically intervening on others’ mental states. We present MindGames: a novel “planning theory of mind” (PToM) task which requires agents to infer an interlocutor’s beliefs and desires to persuade them to alter their behavior. Unlike previous evaluations, we explicitly evaluate use cases of ToM. We find that humans significantly outperform o1-preview (an LLM) at our PToM task (11% higher; p = 0.006). We hypothesize this is because humans have an implicit causal model of other agents (e.g., they know, as our task requires, to ask about people’s preferences). In contrast, o1-preview outperforms humans in a baseline condition which requires a similar amount of planning but minimal mental state inferences (e.g., o1-preview is better than humans at planning when already given someone’s preferences). These results suggest a significant gap between human-like social reasoning and LLM abilities.
[17] WakenLLM: A Fine-Grained Benchmark for Evaluating LLM Reasoning Potential and Reasoning Process Stability
Zipeng Ling, Yuehao Tang, Shuliang Liu, Junqi Yang, Shenghong Fu, Yao Wan, Kejia Huang, Zhichao Hou, Xuming Hu
Main category: cs.CL
TL;DR: This paper introduces a framework to analyze why Large Language Models output “Unknown” responses, distinguishing between genuine indeterminacy and model incapacity, and tests whether guided stimulation can improve reasoning performance.
Details
Motivation: Current LLM evaluations focus only on whether "Unknown" answers are honest, but fail to distinguish between genuinely indeterminate inputs and solvable problems that models fail to resolve. This creates a "Vague Perception" phenomenon that obscures understanding of LLM reasoning limits.
Method: The authors develop a framework that quantifies the proportion of “Unknown” responses due to model incapacity versus genuine indeterminacy. They test whether guided stimulation can convert “Unknown” responses into either correct (“Known”) or intrinsically indeterminate outcomes, and measure theoretical accuracy baselines for different LLMs on reasoning tasks.
Result: The framework successfully separates different sources of uncertainty in LLM outputs and provides clearer insights into LLM reasoning limits. The method demonstrates the potential for improvement in model reasoning capabilities through guided stimulation techniques.
Conclusion: The work provides a new perspective on LLM reasoning evaluation by addressing the “Vague Perception” phenomenon. It offers a clearer understanding of when LLMs genuinely cannot solve problems versus when they have the capacity but fail to utilize it, contributing to better assessment of true reasoning abilities in large language models.
Abstract: Large Language Models (LLMs) frequently output the label “Unknown”, yet current evaluations focus almost exclusively on whether such answers are honest rather than why they arise. This blurs two distinct cases: (i) an input that is genuinely indeterminate and (ii) a solvable problem that the model fails to resolve. We call this phenomenon “Vague Perception”. We thus introduce a framework that quantifies the proportion of “Unknown” responses attributable to model incapacity and tests whether guided stimulation can convert them into either correct (“Known”) or intrinsically indeterminate outcomes. By separating these sources of uncertainty, our method provides a clearer picture of LLM reasoning limits and their potential for improvement. After establishing a theoretical accuracy for the reasoning task on different LLMs, we apply different methods to test whether each model can reach that accuracy under a baseline framework. Our work is meaningful in exploring the true reasoning ability of LLMs and providing a new perspective on solving the “Vague Perception” phenomenon.
[18] Towards Compute-Optimal Many-Shot In-Context Learning
Shahriar Golchin, Yanfei Chen, Rujun Han, Manan Gandhi, Tianli Yu, Swaroop Mishra, Mihai Surdeanu, Rishabh Agarwal, Chen-Yu Lee, Tomas Pfister
Main category: cs.CL
TL;DR: This paper proposes two efficient demonstration selection strategies for many-shot in-context learning that combine similarity-based selection with cached random demonstrations, achieving better performance than random selection while maintaining computational efficiency and supporting caching.
Details
Motivation: Many-shot in-context learning typically uses random demonstration selection due to high inference costs and caching benefits, but this approach may not be optimal for performance. The authors aim to develop better demonstration selection strategies that maintain computational efficiency while improving performance.
Method: Two strategies are proposed: (1) combining a small number of similarity-based demonstrations with a larger set of cached random demonstrations, and (2) improving the first method by replacing random demonstrations with centroid-based selections using k-means clustering on test sample representations.
Result: Experiments with Gemini Pro and Flash across multiple datasets show that both proposed strategies consistently outperform random selection and match or exceed the best performing selection approaches while reducing inference costs by up to an order of magnitude and supporting caching.
Conclusion: The proposed demonstration selection strategies successfully balance performance and computational efficiency in many-shot ICL, providing practical solutions that improve upon random selection while maintaining the benefits of caching and cost reduction.
Abstract: Long-context large language models (LLMs) are able to process inputs containing up to several million tokens. In the scope of in-context learning (ICL), this translates into using hundreds/thousands of demonstrations in the input prompt, enabling many-shot ICL. In practice, a fixed set of demonstrations is often selected at random in many-shot settings due to (1) high inference costs, (2) the benefits of caching and reusing computations, and (3) the similar performance offered by this strategy compared to others when scaled. In this work, we propose two straightforward strategies for demonstration selection in many-shot ICL that improve performance with minimal computational overhead. Our first method combines a small number of demonstrations, selected based on their similarity to each test sample, with a disproportionately larger set of random demonstrations that are cached. The second strategy improves the first by replacing random demonstrations with those selected using centroids derived from test sample representations via k-means clustering. Our experiments with Gemini Pro and Flash across several datasets indicate that our strategies consistently outperform random selection and surpass or match the most performant selection approach while supporting caching and reducing inference cost by up to an order of magnitude. We also show that adjusting the proportion of demonstrations selected based on different criteria can balance performance and inference cost in many-shot ICL.
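Both strategies amount to a few lines on top of standard tooling. In the sketch below, sklearn's KMeans stands in for the paper's clustering, embedding similarity is a plain dot product, and the slice sizes are illustrative assumptions.

```python
"""Sketch of the two demonstration-selection strategies: a large cached
slice shared by all queries plus a small per-query similarity slice."""

import numpy as np
from sklearn.cluster import KMeans

def cached_random(pool_embs, n=512, seed=0):
    # Strategy 1's cached slice: fixed random demos, reusable (and
    # KV-cacheable) across every test sample.
    return np.random.default_rng(seed).choice(len(pool_embs), n, replace=False)

def cached_centroids(pool_embs, test_set_embs, n=512, seed=0):
    # Strategy 2's cached slice: demos nearest to k-means centroids of the
    # test-sample representations.
    centers = KMeans(n_clusters=n, n_init=10,
                     random_state=seed).fit(test_set_embs).cluster_centers_
    d2 = ((centers[:, None, :] - pool_embs[None, :, :]) ** 2).sum(-1)
    return np.unique(d2.argmin(axis=1))

def per_query_similar(test_emb, pool_embs, n=8):
    # Small similarity-based slice, recomputed for each test sample.
    return np.argsort(-(pool_embs @ test_emb))[:n]
```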
[19] Adaptive Graph Pruning for Multi-Agent Communication
Boyi Li, Zhonghan Zhao, Der-Horng Lee, Gaoang Wang
Main category: cs.CL
TL;DR: This paper proposes Adaptive Graph Pruning (AGP), a framework that dynamically optimizes both the number of agents and their communication topology in LLM-based multi-agent systems for better task adaptation, achieving 2.58%-9.84% performance improvement with 90%+ reduction in token consumption.
Details
Motivation: Current LLM-based multi-agent systems use fixed numbers of agents and static communication structures, which limits their ability to adapt to varying task complexities. This rigidity prevents optimal performance across different types of tasks that may require different collaboration patterns.
Method: The paper introduces a two-stage training strategy: (1) independently training soft-pruning networks for different agent quantities to determine optimal communication graphs and positional masks, and (2) jointly optimizing hard-pruning (agent quantity) and soft-pruning (communication topology) within a maximum complete graph to dynamically configure agents and their communication per task.
Result: AGP achieves state-of-the-art results across six benchmarks with 2.58%-9.84% performance improvement, generalizes across multiple LLM architectures, reduces token consumption by 90%+, and achieves high performance with very few training steps (surpassing baselines after about ten steps).
Conclusion: The proposed AGP framework successfully addresses the limitations of fixed multi-agent systems by providing a task-adaptive solution that jointly optimizes agent quantity and communication topology, demonstrating superior performance, efficiency, and generalizability across various reasoning and code generation tasks.
Abstract: Large Language Model (LLM) based multi-agent systems have shown remarkable performance in various tasks, especially when enhanced through collaborative communication. However, current methods often rely on a fixed number of agents and static communication structures, limiting their ability to adapt to varying task complexities. In this paper, we propose Adaptive Graph Pruning (AGP), a novel task-adaptive multi-agent collaboration framework that jointly optimizes agent quantity (hard-pruning) and communication topology (soft-pruning). Specifically, our method employs a two-stage training strategy: firstly, independently training soft-pruning networks for different agent quantities to determine optimal agent-quantity-specific complete graphs and positional masks across specific tasks; and then jointly optimizing hard-pruning and soft-pruning within a maximum complete graph to dynamically configure the number of agents and their communication topologies per task. Extensive experiments demonstrate that our approach is: (1) High-performing, achieving state-of-the-art results across six benchmarks and consistently generalizing across multiple mainstream LLM architectures, with an increase in performance of 2.58%-9.84%; (2) Task-adaptive, dynamically constructing optimized communication topologies tailored to specific tasks, with extremely high performance in all three task categories (general reasoning, mathematical reasoning, and code generation); (3) Token-economical, requiring fewer training steps and less token consumption at the same time, with a decrease in token consumption of 90%+; and (4) Training-efficient, achieving high performance with very few training steps compared with other methods, surpassing the existing baselines after about ten steps of training on the six benchmarks.
[20] FinResearchBench: A Logic Tree based Agent-as-a-Judge Evaluation Framework for Financial Research Agents
Run Sun, Zuo Bai, Wentao Zhang, Yuxiang Zhang, Li Zhao, Shan Sun, Zhengwen Qiu
Main category: cs.CL
TL;DR: This paper introduces FinResearchBench, the first comprehensive evaluation framework for AI research agents in finance, using a novel Agent-as-a-Judge system with logic tree extraction to assess agent performance across 70 financial research questions and 7 task types.
Details
Motivation: AI agents are rapidly evolving in professional research applications, particularly deep research agents for complex long-horizon tasks, yet systematic and automatic evaluation frameworks are lacking. Financial research problems have a distinct complexity and subtlety that existing benchmarks do not adequately address, creating a critical gap in evaluating financial research agent capabilities.
Method: The authors propose FinResearchBench, a logic tree based Agent-as-a-Judge evaluation system. The method extracts logic trees from research outcomes and uses them as intermediate information to provide a comprehensive assessment. The framework covers 70 typical financial research questions across 7 frequently encountered task types in the financial domain.
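A minimal sketch of what a logic-tree-based judge might compute; the tree structure and the claim matching (performed by an LLM in the actual system) are assumptions:

```python
from dataclasses import dataclass, field

@dataclass
class LogicNode:
    """One claim or reasoning step extracted from a research report."""
    claim: str
    children: list = field(default_factory=list)

def flatten(node):
    yield node.claim
    for child in node.children:
        yield from flatten(child)

def judge_overlap(candidate: LogicNode, reference: LogicNode) -> float:
    """Toy judge: fraction of reference claims recovered in the candidate
    tree. The real system would extract both trees with an LLM and match
    claims semantically rather than by exact string equality."""
    ref, cand = set(flatten(reference)), set(flatten(candidate))
    return len(ref & cand) / max(len(ref), 1)
```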
Result: The paper presents the first innovative Agent-as-a-Judge system specifically designed for financial research agents, providing comprehensive and automatic assessment capabilities. The benchmark covers 7 key types of financial research tasks with 70 representative questions, enabling systematic evaluation of research agent performance in the financial domain.
Conclusion: FinResearchBench fills a critical gap in evaluating AI research agents for financial applications by providing the first comprehensive, automatic, and domain-specific evaluation framework. The logic tree based approach offers reliable and robust assessment, while the finance-oriented design ensures relevance to real-world financial research scenarios.
Abstract: Recently, AI agents have been rapidly evolving in intelligence and are widely used in professional research applications, such as STEM, software development, and finance. Among these AI agents, the deep research agent is a key category, as it can perform long-horizon tasks and solve problems of greater complexity. However, few evaluation frameworks and benchmarks systematically and automatically investigate the capabilities of these research agents. Furthermore, financial research problems have distinct complexity and subtlety. To fill this gap, we propose FinResearchBench, a logic tree based Agent-as-a-Judge framework that specifically targets financial research agents. It provides a comprehensive and automatic assessment of research agents across 7 key types of tasks in the financial research domain. The contributions of this work are two-fold: (1) the first and innovative Agent-as-a-Judge system that extracts the logic tree of the research outcome and uses it as intermediate information to provide a comprehensive, reliable and robust evaluation; (2) a finance-oriented design covering 70 typical financial research questions, spread across 7 frequently encountered types of tasks in the domain.
[21] Efficient RL for optimizing conversation level outcomes with an LLM-based tutor
Hyunji Nam, Omer Gottesman, Amy Zhang, Dean Foster, Emma Brunskill, Lyle Ungar
Main category: cs.CL
TL;DR: This paper proposes a method to improve LLM-based tutors by using latent state representation and long-term policy optimization instead of immediate turn-level feedback, achieving better tutoring outcomes with less computational cost.
Details
Motivation: Existing RLHF frameworks for LLMs optimize responses based on immediate turn-level preferences, which is inadequate for multi-turn dialogue settings like online math tutoring where long-term objectives matter more than immediate responses.
Method: The authors represent the dialogue history with a lower-dimensional latent state representation of the student and optimize a long-term policy that selects high-level actions based on this latent state, rather than training the model end-to-end to directly output utterances.
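A minimal sketch of the architecture split, assuming dialogue turns are already embedded; the latent dimension, encoder choice, and action set are illustrative assumptions, not the paper's design:

```python
import torch

class TutorPolicy(torch.nn.Module):
    """Sketch: map a low-dimensional latent student state to a high-level
    pedagogical action; an LLM would then verbalize the chosen action."""
    ACTIONS = ["give_hint", "ask_guiding_question", "confirm_step", "encourage"]

    def __init__(self, latent_dim=16):
        super().__init__()
        # Dialogue turns -> latent student state (lower-dimensional than text)
        self.encoder = torch.nn.GRU(input_size=128, hidden_size=latent_dim,
                                    batch_first=True)
        self.policy = torch.nn.Linear(latent_dim, len(self.ACTIONS))

    def forward(self, turn_embeddings):            # (batch, turns, 128)
        _, h = self.encoder(turn_embeddings)
        return torch.distributions.Categorical(logits=self.policy(h[-1]))

policy = TutorPolicy()
dist = policy(torch.randn(1, 5, 128))              # 5 embedded dialogue turns
action = TutorPolicy.ACTIONS[dist.sample().item()]
```

Because only the small policy head is trained with RL, this is far cheaper than fine-tuning the tutor LLM end-to-end to emit utterances directly.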
Result: The proposed lightweight model requires less computational resources than prior end-to-end training approaches and demonstrates improved long-term outcomes in LLM-simulated tutoring tasks compared to prompting methods.
Conclusion: The study shows that incorporating latent state representation and long-term policy optimization can enhance LLM-based tutors to better align with the objective of guiding students toward independent problem-solving, while being more computationally efficient.
Abstract: Large language models (LLMs) built on existing reinforcement learning with human feedback (RLHF) frameworks typically optimize responses based on immediate turn-level human preferences. However, this approach falls short in multi-turn dialogue settings, such as online math tutoring. We propose a method to enhance LLM-based tutors by representing the dialogue history with a lower-dimensional latent state representation of a student and optimizing a long-term policy to determine high-level actions based on the latent state. The goal is to better align the tutor’s behavior with the long-term objective of guiding the student towards solving a target math problem on their own. Our model is lightweight, requiring less computational resources than prior work of training the tutor policy end-to-end to directly output the tutor’s next utterance. Our experiment results demonstrate that these modifications lead to improved long-term outcomes compared to prompting in LLM-simulated tutoring tasks.
[22] iShumei-Chinchunmei at SemEval-2025 Task 4: A balanced forgetting and retention multi-task framework using effective unlearning loss
Yujian Sun, Tian Li
Main category: cs.CL
TL;DR: This paper proposes an Effective Unlearning Loss method for making Large Language Models forget sensitive content while preserving their capabilities, achieving 5th place in SemEval 2025 Task 4 competition.
Details
Motivation: As LLMs gain widespread adoption, there's an increasing need to make them forget non-compliant data memorized during pre-training. Machine unlearning must efficiently erase sensitive information from LLMs under limited computational resources, which motivated the development of better unlearning techniques.
Method: The authors propose an “Effective Unlearning Loss”, a more controllable forgetting loss function, and explore integrating it with various techniques to achieve more efficient and controlled unlearning of sensitive content from LLMs.
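The summary does not give the exact form of the Effective Unlearning Loss, so the sketch below shows only the generic forget/retain trade-off that such losses make more controllable, assuming a Hugging Face-style model whose outputs carry a .loss:

```python
def unlearning_step(model, forget_batch, retain_batch, alpha=1.0):
    """Generic forget/retain pattern: push the LM loss *up* on the forget
    set (gradient ascent) while keeping the standard loss on the retain set.
    The paper's Effective Unlearning Loss is a more controlled variant of
    this idea; alpha here is an illustrative balancing weight.
    Both batches are assumed to include labels so that .loss is populated."""
    forget_loss = model(**forget_batch).loss
    retain_loss = model(**retain_batch).loss
    return -alpha * forget_loss + retain_loss
```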
Result: The proposed system achieved 5th place on the SemEval 2025 Task 4 competition leaderboard, demonstrating effective performance in both forgetting sensitive content and preserving standard LLM capabilities.
Conclusion: The Effective Unlearning Loss provides a more controllable approach to machine unlearning in LLMs, successfully balancing the dual objectives of erasing sensitive information while maintaining model performance on standard tasks.
Abstract: As the Large Language Model (LLM) gains widespread adoption, increasing attention has been given to the challenge of making LLM forget non-compliant data memorized during its pre-training. Machine Unlearning focuses on efficiently erasing sensitive information from LLM under limited computational resources. To advance research in this area, SemEval 2025 Task 4: “Unlearning Sensitive Content from Large Language Models” introduces three unlearning datasets and establishes a benchmark by evaluating both forgetting effectiveness and the preservation of standard capabilities. In this work, we propose a more controllable forgetting loss, Effective Unlearning Loss, and explore its integration with various techniques to achieve more efficient and controlled unlearning. Our system ultimately ranked 5th on the competition leaderboard.
[23] Beyond Isolated Dots: Benchmarking Structured Table Construction as Deep Knowledge Extraction
Tianyun Zhong, Guozhao Mo, Yanjiang Liu, Yihan Chen, Lingdi Kong, Xuanang Chen, Yaojie Lu, Hongyu Lin, Ben He, Le Sun
Main category: cs.CL
TL;DR: This paper introduces AOE (Arranged and Organized Extraction Benchmark), a new bilingual benchmark to evaluate LLMs’ ability to extract and organize information from complex documents into structured tables, revealing that even advanced LLMs struggle with this task.
Details
Motivation: Current LLMs generate chaotic, disorganized, and untraceable paragraph-style answers when extracting information from complex real-world documents, despite expectations that they should handle such tasks effectively. There's a need for systematic evaluation of LLMs' ability to comprehend fragmented documents and reorganize information into structured formats.
Method: The authors created AOE, a bilingual benchmark with documents of varying lengths that includes 11 carefully crafted tasks across three diverse domains. Unlike conventional text-to-table tasks with fixed schema, AOE requires models to generate context-specific schema tailored to varied input queries for organizing extracted information into tables.
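Since the benchmark is hosted on the Hugging Face Hub (URL in the abstract below), loading it should reduce to something like the following; the split and configuration names are assumptions:

```python
from datasets import load_dataset

# Dataset from the paper: https://huggingface.co/datasets/tianyumyum/AOE
aoe = load_dataset("tianyumyum/AOE")   # default config assumed to exist
print(aoe)
```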
Result: Evaluation of both open-source and closed-source state-of-the-art LLMs showed that even the most advanced models struggled significantly with the AOE benchmark tasks, indicating poor performance in organized information extraction from complex documents.
Conclusion: The study reveals a significant gap in current LLMs’ capabilities for structured information extraction from complex documents. The AOE benchmark provides a systematic way to evaluate and improve LLMs’ ability to organize fragmented information, with the benchmark made publicly available for further research.
Abstract: With the emergence of large language models (LLMs), there is an expectation that LLMs can effectively extract explicit information from complex real-world documents (e.g., papers, reports). However, most LLMs generate paragraph-style answers that are chaotic, disorganized, and untraceable. To bridge this gap, we introduce the Arranged and Organized Extraction Benchmark (AOE), a new bilingual benchmark with data and documents of varying lengths designed to systematically evaluate the ability of LLMs to comprehend fragmented documents and reconstruct isolated information into one organized table. Unlike conventional text-to-table tasks, which rely on fixed schema and narrow task domains, AOE includes 11 carefully crafted tasks across three diverse domains, requiring models to generate context-specific schema tailored to varied input queries. In the experiment, we evaluated both open-source and closed-source state-of-the-art LLMs. The results show that even the most advanced models struggled significantly. The benchmark is available at https://huggingface.co/datasets/tianyumyum/AOE.
[24] Language Detection by Means of the Minkowski Norm: Identification Through Character Bigrams and Frequency Analysis
Paul-Andrei Pogăcean, Sanda-Maria Avram
Main category: cs.CL
TL;DR: This paper presents a non-AI mathematical algorithm for language identification using monogram and bigram frequency rankings, achieving over 80% accuracy on short texts and 100% accuracy on longer/older texts, demonstrating that classical frequency-based methods remain competitive with AI models.
Details
Motivation: While AI-powered language models dominate current language identification research, non-AI approaches have been overlooked despite their potential effectiveness. The authors aim to demonstrate that classical mathematical methods can still provide viable alternatives to computationally intensive AI-driven models.
Method: The research implements a mathematical algorithm based on monogram and bigram frequency rankings derived from established linguistic research. The approach compares the frequency rankings of characters and character pairs against per-language reference rankings via the Minkowski norm, without relying on AI or machine learning techniques.
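A self-contained sketch of this kind of frequency-rank language detector; the profile size and the out-of-profile penalty are illustrative choices, not the paper's parameters:

```python
from collections import Counter

def rank_profile(text, n=2, top=200):
    """Frequency ranking of character n-grams (bigrams by default)."""
    grams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    return {g: r for r, (g, _) in enumerate(grams.most_common(top))}

def minkowski_distance(profile_a, profile_b, p=1, penalty=300):
    """Minkowski norm between two rank profiles; n-grams missing from a
    profile pay a fixed out-of-profile penalty (an assumed value)."""
    keys = set(profile_a) | set(profile_b)
    return sum(abs(profile_a.get(k, penalty) - profile_b.get(k, penalty)) ** p
               for k in keys) ** (1 / p)

def detect(text, reference_profiles):
    """reference_profiles: {language: rank_profile(reference_corpus)}."""
    query = rank_profile(text)
    return min(reference_profiles,
               key=lambda lang: minkowski_distance(query, reference_profiles[lang]))
```

Longer texts yield more stable rank profiles, which is consistent with the reported jump from 80%+ accuracy on short texts to 100% on longer ones.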
Result: The method achieved over 80% accuracy on texts shorter than 150 characters and reached 100% accuracy for longer texts and older writings. The algorithm was tested on diverse datasets including short stories, fairy tales, and poems spanning different historical periods and genres.
Conclusion: Classical frequency-based approaches for language identification remain effective and scalable alternatives to AI-driven models. The mathematical implementation demonstrates that non-AI methods can achieve high accuracy across various text types and lengths, suggesting their continued relevance in language detection tasks.
Abstract: The debate surrounding language identification has gained renewed attention in recent years, especially with the rapid evolution of AI-powered language models. However, the non-AI-based approaches to language identification have been overshadowed. This research explores a mathematical implementation of an algorithm for language determinism by leveraging monogram and bigram frequency rankings derived from established linguistic research. The datasets used comprise texts varying in length, historical period, and genre, including short stories, fairy tales, and poems. Despite these variations, the method achieves over 80% accuracy on texts shorter than 150 characters and reaches 100% accuracy for longer texts and older writings. These results demonstrate that classical frequency-based approaches remain effective and scalable alternatives to AI-driven models for language detection.
[25] SpeLLM: Character-Level Multi-Head Decoding
Amit Ben-Artzy, Roy Schwartz
Main category: cs.CL
TL;DR: SpeLLM decouples input and output vocabularies in LLMs by using multiple character-level prediction heads instead of a single large output projection layer, achieving competitive performance while reducing runtime by 5.1% and enabling better support for underrepresented languages.
Details
Motivation: Current LLM architectures face a critical bottleneck when scaling vocabulary size: the output projection layer scales linearly with vocabulary size, making substantial vocabulary expansion impractical despite its benefits for reducing input sequence length and attention's quadratic cost.
Method: SpeLLM decouples input and output vocabularies by using multiple independent linear heads (k heads) that each predict a single character simultaneously. The authors also develop a self-distillation approach to convert standard LLMs into SpeLLM variants.
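A minimal sketch of the multi-head output layer; the hidden width, character vocabulary size, and k are illustrative assumptions:

```python
import torch

class CharHeads(torch.nn.Module):
    """Sketch: replace one vocabulary-sized output projection with k small
    character-level heads, each predicting one character of the output
    string simultaneously."""
    def __init__(self, hidden=4096, n_chars=128, k=8):
        super().__init__()
        self.heads = torch.nn.ModuleList(
            torch.nn.Linear(hidden, n_chars) for _ in range(k))

    def forward(self, h):                    # h: (batch, hidden)
        # k parallel character distributions instead of one huge softmax
        return torch.stack([head(h) for head in self.heads], dim=1)

heads = CharHeads()
logits = heads(torch.randn(2, 4096))         # (2, k=8, n_chars=128)
chars = logits.argmax(-1)                    # one character id per head
```

With these sizes, the output layer costs k * hidden * n_chars parameters instead of hidden * |V|, so the representable output space grows without the projection layer growing with it.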
Result: Experiments on four pre-trained LLMs showed that SpeLLM variants achieved competitive performance on downstream tasks while reducing runtime by 5.1% on average across models.
Conclusion: SpeLLM provides a potential avenue for reducing LLM computational costs while increasing support for underrepresented languages and domains by enabling more efficient vocabulary scaling through character-level prediction with multiple output heads.
Abstract: Scaling LLM vocabulary is often used to reduce input sequence length and alleviate attention’s quadratic cost. Yet, current LLM architectures impose a critical bottleneck to this procedure: the output projection layer scales linearly with vocabulary size, rendering substantial expansion impractical. We propose SpeLLM, a method that decouples input and output vocabularies by predicting character-level strings through multiple output heads. In SpeLLM, each of the k linear heads predicts a single character simultaneously, enabling the model to represent a much larger output space using smaller, independent linear heads. We present a self-distillation approach for converting a standard LLM to a SpeLLM. Our experiments with four pre-trained LLMs show their SpeLLM variants achieve competitive performance on downstream tasks while reducing runtime by 5.1% on average across models. Our approach provides a potential avenue for reducing LLM costs, while increasing support for underrepresented languages and domains.
[26] Re:Form – Reducing Human Priors in Scalable Formal Software Verification with RL in LLMs: A Preliminary Study on Dafny
Chuanhao Yan, Fengdi Che, Xuhan Huang, Xu Xu, Xin Li, Yizhi Li, Xingwei Qu, Jingzhe Shi, Zhuangzhuang He, Chenghua Lin, Yaodong Yang, Binhang Yuan, Hang Zhao, Yu Qiao, Bowen Zhou, Jie Fu
Main category: cs.CL
TL;DR: This paper presents a method to train Large Language Models (LLMs) for formal software verification using Dafny programming language, reducing the need for human supervision through automatic data curation and reinforcement learning with formal verifier feedback.
Details
Motivation: Existing LLMs trained with RL face unreliable and unscalable verification processes, and providing human-annotated priors for complex programming tasks is prohibitively expensive. The authors aim to leverage formal languages like Dafny to enable automatic and mathematically provable verification of LLM reasoning processes.
Method: The approach combines an automatic and scalable data curation pipeline with carefully designed reinforcement learning that integrates feedback from formal language verifiers. The authors introduce the DafnyComp benchmark of compositional formal programs and employ supervised fine-tuning followed by RL with regularization.
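Verifier feedback of this kind can be turned into a reward signal roughly as follows; this assumes the dafny CLI is installed and reduces the paper's reward design to a binary verify/fail signal:

```python
import os
import subprocess
import tempfile

def dafny_reward(program_text: str) -> float:
    """Sketch: 1.0 if Dafny verifies the generated program, else 0.0.
    The paper's actual RL setup is richer than this binary signal."""
    with tempfile.NamedTemporaryFile("w", suffix=".dfy", delete=False) as f:
        f.write(program_text)
        path = f.name
    try:
        result = subprocess.run(["dafny", "verify", path],
                                capture_output=True, timeout=60)
        return 1.0 if result.returncode == 0 else 0.0
    finally:
        os.unlink(path)
```

Because the verifier is mechanical and sound, this reward needs no human annotation, which is the core of the "reducing human priors" argument.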
Result: Even small models (0.5B parameters) can generate syntactically valid and verifiable Dafny code, surpassing proprietary models. RL with regularization achieves stronger generalization to out-of-domain tasks and outperforms all strong baselines on the DafnyComp benchmark.
Conclusion: Formal language-based reasoning provides a promising alternative to informal language approaches for LLM training, enabling reliable and scalable verification while reducing dependence on human supervision for complex programming tasks.
Abstract: Existing informal language-based (e.g., human language) Large Language Models (LLMs) trained with Reinforcement Learning (RL) face a significant challenge: their verification processes, which provide crucial training signals, are neither reliable nor scalable. In fact, the prevalent large proprietary models could hardly generate verifiable programs. A promising yet largely uncharted alternative is formal language-based reasoning. Grounding LLMs in rigorous formal systems where generative models operate in formal language spaces (e.g., Dafny) enables the automatic and mathematically provable verification of their reasoning processes and outcomes. This capability is pivotal for achieving large-scale, reliable formal software verification. It is a common practice to employ human-annotated chain-of-thought and other human priors to induce the reasoning and coding capabilities of LLMs. Unfortunately, it becomes unacceptably labor-intensive to provide such priors for supervising complex programming tasks. In this work, we systematically explore ways to reduce human priors with the formal language, Dafny, as the main environment for our pilot study. Our pipeline mainly relies on introducing an automatic and scalable data curation pipeline, and careful RL designs integrated with feedback from the formal language verifier. We introduce DafnyComp, a benchmark of compositional formal programs with auto-formalized specifications for specification reasoning. Our supervised fine-tuning (SFT) stage enables even small models (e.g., 0.5B) to generate syntactically valid and verifiable Dafny code, surpassing proprietary models. RL with regularization further improves performance, achieving stronger generalization to out-of-domain tasks and outperforming all strong baselines on the challenging DafnyComp benchmark.
[27] GG-BBQ: German Gender Bias Benchmark for Question Answering
Shalaka Satheesh, Katrin Klug, Katharina Beckh, Héctor Allende-Cid, Sebastian Houben, Teena Hassan
Main category: cs.CL
TL;DR: This paper creates a German dataset for evaluating gender bias in German Large Language Models by translating and manually correcting an English bias benchmark, finding that all tested models exhibit gender bias both reinforcing and against social stereotypes.
Details
Motivation: Existing fairness evaluation datasets for NLP are primarily in English, but evaluating bias in non-English languages like German requires language-specific considerations due to grammatical gender differences. There's a need for proper German datasets to assess gender bias in German LLMs.
Method: The authors machine translated the gender identity subset of the Bias Benchmark for Question Answering from English to German, then manually reviewed and corrected translation errors with the help of a language expert. They created two subsets: Subset-I with gender identity group terms and Subset-II with proper names replacing the group terms.
Result: The manual revision of machine translation was found to be crucial for creating reliable bias evaluation datasets. All evaluated German LLMs exhibited gender bias in both directions - reinforcing existing social stereotypes and biasing against them. The study provides accuracy and bias scores for multiple German NLP models.
Conclusion: Manual correction of machine-translated bias evaluation datasets is essential when adapting them to languages with grammatical gender like German. The created German dataset successfully identifies gender bias in German LLMs, demonstrating that bias evaluation tools must be carefully adapted for different languages rather than relying solely on machine translation.
Abstract: Within the context of Natural Language Processing (NLP), fairness evaluation is often associated with the assessment of bias and reduction of associated harm. In this regard, the evaluation is usually carried out by using a benchmark dataset, for a task such as Question Answering, created for the measurement of bias in the model’s predictions along various dimensions, including gender identity. In our work, we evaluate gender bias in German Large Language Models (LLMs) using the Bias Benchmark for Question Answering by Parrish et al. (2022) as a reference. Specifically, the templates in the gender identity subset of this English dataset were machine translated into German. The errors in the machine translated templates were then manually reviewed and corrected with the help of a language expert. We find that manual revision of the translation is crucial when creating datasets for gender bias evaluation because of the limitations of machine translation from English to a language such as German with grammatical gender. Our final dataset is comprised of two subsets: Subset-I, which consists of group terms related to gender identity, and Subset-II, where group terms are replaced with proper names. We evaluate several LLMs used for German NLP on this newly created dataset and report the accuracy and bias scores. The results show that all models exhibit bias, both along and against existing social stereotypes.
[28] PromptAL: Sample-Aware Dynamic Soft Prompts for Few-Shot Active Learning
Hui Xiang, Jinqiao Shi, Ting Zhang, Xiaojie Zhao, Yong Liu, Yong Ma
Main category: cs.CL
TL;DR: PromptAL is a hybrid active learning framework that uses sample-aware dynamic soft prompts to leverage unlabeled data for better decision boundary optimization in few-shot scenarios, achieving superior performance across multiple datasets.
Details
Motivation: In few-shot active learning scenarios, the empirical distribution of labeled data often diverges significantly from the target distribution, causing decision boundaries to shift away from their optimal positions. Existing methods overlook the role of unlabeled samples in aligning the empirical distribution with the target distribution, resulting in suboptimal sample selection.
Method: PromptAL uses a two-step approach: (1) leverage unlabeled data to construct sample-aware dynamic soft prompts that adjust the model’s predictive distribution and decision boundary, and (2) based on the adjusted decision boundary, integrate uncertainty estimation with both global and local diversity to select high-quality samples that better represent the target distribution.
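A sketch of the selection step only (step 2), assuming predictive probabilities already come from the prompt-adjusted model; the uncertainty/diversity weighting and the greedy farthest-point scheme are assumptions:

```python
import numpy as np

def select_samples(probs, embs, budget=32, lam=0.5):
    """Greedy selection combining predictive entropy (uncertainty under the
    prompt-adjusted model) with a distance-to-selected diversity bonus."""
    entropy = -(probs * np.log(probs + 1e-12)).sum(axis=1)
    chosen = [int(entropy.argmax())]
    while len(chosen) < budget:
        # Distance of each candidate to its nearest already-chosen sample
        dists = np.min(
            np.linalg.norm(embs[:, None, :] - embs[None, chosen, :], axis=-1),
            axis=1)
        score = entropy + lam * dists        # uncertainty + diversity
        score[chosen] = -np.inf              # never re-select
        chosen.append(int(score.argmax()))
    return chosen
```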
Result: Experimental evaluation on six in-domain and three out-of-domain datasets demonstrates that PromptAL achieves superior performance compared to nine baseline methods, with the codebase being openly accessible.
Conclusion: PromptAL successfully addresses the distribution mismatch problem in few-shot active learning by incorporating unlabeled data through dynamic soft prompts, leading to better decision boundary optimization and more effective sample selection for annotation.
Abstract: Active learning (AL) aims to optimize model training and reduce annotation costs by selecting the most informative samples for labeling. Typically, AL methods rely on the empirical distribution of labeled data to define the decision boundary and perform uncertainty or diversity estimation, subsequently identifying potential high-quality samples. In few-shot scenarios, the empirical distribution often diverges significantly from the target distribution, causing the decision boundary to shift away from its optimal position. However, existing methods overlook the role of unlabeled samples in enhancing the empirical distribution to better align with the target distribution, resulting in a suboptimal decision boundary and the selection of samples that inadequately represent the target distribution. To address this, we propose a hybrid AL framework, termed PromptAL (Sample-Aware Dynamic Soft Prompts for Few-Shot Active Learning). This framework accounts for the contribution of each unlabeled data point in aligning the current empirical distribution with the target distribution, thereby optimizing the decision boundary. Specifically, PromptAL first leverages unlabeled data to construct sample-aware dynamic soft prompts that adjust the model’s predictive distribution and decision boundary. Subsequently, based on the adjusted decision boundary, it integrates uncertainty estimation with both global and local diversity to select high-quality samples that more accurately represent the target distribution. Experimental results on six in-domain and three out-of-domain datasets show that PromptAL achieves superior performance over nine baselines. Our codebase is openly accessible.
[29] Dutch CrowS-Pairs: Adapting a Challenge Dataset for Measuring Social Biases in Language Models for Dutch
Elza Strazda, Gerasimos Spanakis
Main category: cs.CL
TL;DR: This paper introduces a Dutch version of the CrowS-Pairs dataset to measure bias in Dutch language models, finding that various models exhibit substantial bias with English models showing the most bias and Dutch models the least, while persona assignment affects bias levels.
Details
Motivation: Most bias measurement studies in language models have focused only on English, creating a need to evaluate bias in other languages like Dutch. Given the growing popularity of language models, ensuring safe and fair models across different languages is necessary.
Method: The authors created a Dutch version of the US-specific CrowS-Pairs dataset, consisting of 1463 sentence pairs covering 9 bias categories (sexual orientation, gender, disability, etc.). Using contrasting sentence pairs about disadvantaged versus advantaged groups, they evaluated bias in models applied to Dutch (BERTje, RobBERT, multilingual BERT, GEITje, Mistral-7B), English models (BERT, RoBERTa), and French models (FlauBERT, CamemBERT).
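CrowS-Pairs-style evaluation scores each sentence of a pair under a masked language model. Below is a simplified pseudo-log-likelihood sketch; the original metric masks only tokens shared by both sentences of a pair, and the BERTje model id (GroNLP/bert-base-dutch-cased) is an assumption:

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tok = AutoTokenizer.from_pretrained("GroNLP/bert-base-dutch-cased")
mlm = AutoModelForMaskedLM.from_pretrained("GroNLP/bert-base-dutch-cased").eval()

def pseudo_log_likelihood(sentence: str) -> float:
    """Mask each token in turn and sum its log-probability under the MLM."""
    ids = tok(sentence, return_tensors="pt").input_ids[0]
    total = 0.0
    for i in range(1, len(ids) - 1):               # skip [CLS] / [SEP]
        masked = ids.clone()
        masked[i] = tok.mask_token_id
        with torch.no_grad():
            logits = mlm(masked.unsqueeze(0)).logits[0, i]
        total += torch.log_softmax(logits, -1)[ids[i]].item()
    return total

# The model "prefers" the stereotyping sentence of a pair when its PLL is
# higher; the fraction of such preferences over the dataset is the bias score:
# bias_hit = pseudo_log_likelihood(stereo_sent) > pseudo_log_likelihood(anti_sent)
```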
Result: All tested language models exhibited substantial bias across various categories. English models showed the highest levels of bias, while Dutch models showed the least. The study also found that assigning personas to language models changes their bias levels, indicating variability across languages and contexts.
Conclusion: The findings reveal that bias levels vary significantly across different languages and cultural contexts, with cultural and linguistic factors playing a significant role in shaping model biases. This highlights the importance of language-specific bias evaluation and mitigation strategies.
Abstract: Warning: This paper contains explicit statements of offensive stereotypes which might be upsetting. Language models are prone to exhibiting biases, further amplifying unfair and harmful stereotypes. Given the fast-growing popularity and wide application of these models, it is necessary to ensure safe and fair language models. Recently, considerable attention has been paid to measuring bias in language models, yet the majority of studies have focused only on the English language. A Dutch version of the US-specific CrowS-Pairs dataset for measuring bias in Dutch language models is introduced. The resulting dataset consists of 1463 sentence pairs that cover bias in 9 categories, such as Sexual orientation, Gender and Disability. The sentence pairs are composed of contrasting sentences, where one of the sentences concerns disadvantaged groups and the other advantaged groups. Using the Dutch CrowS-Pairs dataset, we show that various language models (BERTje, RobBERT, multilingual BERT, GEITje and Mistral-7B) exhibit substantial bias across the various bias categories. Using the English and French versions of the CrowS-Pairs dataset, bias was evaluated in English (BERT and RoBERTa) and French (FlauBERT and CamemBERT) language models, and it was shown that English models exhibit the most bias, whereas Dutch models exhibit the least. Additionally, results indicate that assigning a persona to a language model changes the level of bias it exhibits. These findings highlight the variability of bias across languages and contexts, suggesting that cultural and linguistic factors play a significant role in shaping model biases.
[30] Towards Enforcing Company Policy Adherence in Agentic Workflows
Naama Zwerdling, David Boaz, Ella Rabinovich, Guy Uziel, David Amid, Ateret Anaby-Tavor
Main category: cs.CL
TL;DR: This paper introduces a framework to ensure LLM agents follow business policies through a two-phase approach: offline compilation of policies into verifiable guard code and runtime compliance checking before agent actions.
Details
Motivation: LLM agents show potential for business process automation but struggle to reliably follow complex company policies, creating a need for systematic policy enforcement mechanisms in agentic workflows.
Method: A deterministic, transparent, and modular framework with two phases: (1) an offline buildtime stage that compiles policy documents into verifiable guard code associated with tool use, and (2) a runtime integration in which these guards ensure compliance before each agent action.
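The runtime half of such a framework can be pictured as guards wrapped around tools; the policy rule and guard below are hypothetical illustrations, not taken from the paper:

```python
from functools import wraps

def guard(check, message):
    """Attach a compiled policy guard to a tool; the guard runs before the
    tool executes and blocks non-compliant calls deterministically."""
    def decorate(tool):
        @wraps(tool)
        def guarded(*args, **kwargs):
            if not check(*args, **kwargs):
                return {"blocked": True, "reason": message}
            return tool(*args, **kwargs)
        return guarded
    return decorate

# Hypothetical guard compiled offline from a policy sentence such as
# "refunds over $500 require manager approval":
@guard(lambda amount, approved=False: amount <= 500 or approved,
       "Refunds over $500 require manager approval")
def issue_refund(amount, approved=False):
    return {"refunded": amount}

print(issue_refund(700))                  # {'blocked': True, 'reason': ...}
print(issue_refund(700, approved=True))   # {'refunded': 700}
```

Because the guard is ordinary code rather than another LLM call, compliance checks stay deterministic and auditable.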
Result: Demonstrated encouraging preliminary results in policy enforcement on the challenging τ-bench Airlines domain, showing the framework’s effectiveness in maintaining business policy adherence.
Conclusion: The proposed framework offers a promising solution for enforcing business policy compliance in LLM agents, though key challenges remain for real-world deployment scenarios.
Abstract: Large Language Model (LLM) agents hold promise for a flexible and scalable alternative to traditional business process automation, but struggle to reliably follow complex company policies. In this study we introduce a deterministic, transparent, and modular framework for enforcing business policy adherence in agentic workflows. Our method operates in two phases: (1) an offline buildtime stage that compiles policy documents into verifiable guard code associated with tool use, and (2) a runtime integration where these guards ensure compliance before each agent action. We demonstrate our approach on the challenging τ-bench Airlines domain, showing encouraging preliminary results in policy enforcement, and further outline key challenges for real-world deployments.
[31] ICR Probe: Tracking Hidden State Dynamics for Reliable Hallucination Detection in LLMs
Zhenliang Zhang, Xinyu Hu, Huixuan Zhang, Junzhe Zhang, Xiaojun Wan
Main category: cs.CL
TL;DR: This paper introduces the ICR Score and ICR Probe method for detecting hallucinations in large language models by analyzing the dynamic evolution of hidden states across layers, achieving superior performance with fewer parameters compared to existing static detection methods.
Details
Motivation: Large language models generate hallucinations that undermine their reliability. Existing hallucination detection methods focus on static and isolated hidden state representations, overlooking their dynamic evolution across layers, which limits their effectiveness in detecting hallucinations.
Method: The authors introduce the ICR Score (Information Contribution to Residual Stream), a metric that quantifies module contributions to hidden state updates, and build on it the ICR Probe, a detection method that captures the cross-layer evolution of hidden states rather than relying on static representations.
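The summary does not spell out the ICR Score formula, so the following is only a loose illustration of the flavor of the quantity, i.e. how much a module's output accounts for the residual-stream update at a layer; the paper's actual definition may differ:

```python
import torch

def icr_like_score(residual_before: torch.Tensor,
                   module_output: torch.Tensor) -> float:
    """Illustrative only: measure a module's contribution to the residual
    update as the alignment between its output and the updated hidden state.
    This is NOT the paper's exact ICR Score, just the general idea."""
    updated = residual_before + module_output   # transformer residual update
    return torch.nn.functional.cosine_similarity(
        module_output.flatten(), updated.flatten(), dim=0).item()
```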
Result: The ICR Probe achieves superior hallucination detection performance while using significantly fewer parameters than existing methods. The ICR Score is empirically validated as effective and reliable for distinguishing hallucinations from truthful outputs.
Conclusion: The dynamic analysis of hidden state evolution across layers through the ICR Score and ICR Probe provides a more effective approach to hallucination detection in LLMs, offering both improved performance and better interpretability compared to static detection methods.
Abstract: Large language models (LLMs) excel at various natural language processing tasks, but their tendency to generate hallucinations undermines their reliability. Existing hallucination detection methods leveraging hidden states predominantly focus on static and isolated representations, overlooking their dynamic evolution across layers, which limits efficacy. To address this limitation, we shift the focus to the hidden state update process and introduce a novel metric, the ICR Score (Information Contribution to Residual Stream), which quantifies the contribution of modules to the hidden states’ update. We empirically validate that the ICR Score is effective and reliable in distinguishing hallucinations. Building on these insights, we propose a hallucination detection method, the ICR Probe, which captures the cross-layer evolution of hidden states. Experimental results show that the ICR Probe achieves superior performance with significantly fewer parameters. Furthermore, ablation studies and case analyses offer deeper insights into the underlying mechanism of this method, improving its interpretability.
[32] Combining Language and Topic Models for Hierarchical Text Classification
Jaco du Toit, Marcel Dunaiski
Main category: cs.CL
TL;DR: This paper investigates whether combining pre-trained language models (PLMs) with topic models improves hierarchical text classification performance, finding that topic model features generally decrease performance compared to using PLMs alone.
Details
Motivation: Previous work suggested that combining PLMs with topic models is effective for multi-label text classification, since PLMs capture fine-grained contextual information while topic models provide high-level corpus-wide representations. The authors test whether this combination is actually beneficial for hierarchical text classification tasks.
Method: The authors developed an HTC approach that extracts features from both a PLM and a topic model, passes these features through separate convolutional layers, combines the outputs, and uses a label-wise attention mechanism to obtain label-specific document representations. They conducted comprehensive experiments on three HTC benchmark datasets.
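A minimal sketch of the evaluated architecture with illustrative dimensions; where the topic features attach (here, per token) is an assumption:

```python
import torch

class PLMTopicHTC(torch.nn.Module):
    """Sketch: PLM features and topic features pass through separate 1-D
    convolutions, are concatenated, and label-wise attention builds one
    document vector per class."""
    def __init__(self, plm_dim=768, topic_dim=100, hidden=256, n_labels=50):
        super().__init__()
        self.conv_plm = torch.nn.Conv1d(plm_dim, hidden, kernel_size=3, padding=1)
        self.conv_topic = torch.nn.Conv1d(topic_dim, hidden, kernel_size=1)
        self.label_queries = torch.nn.Parameter(torch.randn(n_labels, 2 * hidden))
        self.out = torch.nn.Linear(2 * hidden, 1)

    def forward(self, plm_feats, topic_feats):
        # plm_feats: (B, T, plm_dim); topic_feats: (B, T, topic_dim)
        h1 = self.conv_plm(plm_feats.transpose(1, 2)).transpose(1, 2)
        h2 = self.conv_topic(topic_feats.transpose(1, 2)).transpose(1, 2)
        h = torch.cat([h1, h2], dim=-1)                       # (B, T, 2*hidden)
        attn = torch.softmax(h @ self.label_queries.T, dim=1)  # (B, T, L)
        label_docs = attn.transpose(1, 2) @ h                 # (B, L, 2*hidden)
        return self.out(label_docs).squeeze(-1)               # (B, L) logits
```

The paper's PLM-only baseline corresponds to dropping the conv_topic branch; that simpler variant is the one that performed better in their experiments.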
Result: The experiments showed that using features extracted from topic models generally decreased classification performance compared to using only features obtained by the PLM. This contradicts previous assumptions about the benefits of combining these approaches.
Conclusion: The incorporation of features extracted from topic models for text classification tasks should not be assumed to be beneficial. For hierarchical text classification, PLM-only approaches outperform PLM-topic model combinations, challenging previous assumptions in the field.
Abstract: Hierarchical text classification (HTC) is a natural language processing task which has the objective of categorising text documents into a set of classes from a predefined structured class hierarchy. Recent HTC approaches use various techniques to incorporate the hierarchical class structure information with the natural language understanding capabilities of pre-trained language models (PLMs) to improve classification performance. Furthermore, using topic models along with PLMs to extract features from text documents has been shown to be an effective approach for multi-label text classification tasks. The rationale behind the combination of these feature extractor models is that the PLM captures the finer-grained contextual and semantic information while the topic model obtains high-level representations which consider the corpus of documents as a whole. In this paper, we use an HTC approach which uses a PLM and a topic model to extract features from text documents which are used to train a classification model. Our objective is to determine whether the combination of the features extracted from the two models is beneficial to HTC performance in general. In our approach, the extracted features are passed through separate convolutional layers whose outputs are combined and passed to a label-wise attention mechanism which obtains label-specific document representations by weighting the most important features for each class separately. We perform comprehensive experiments on three HTC benchmark datasets and show that using the features extracted from the topic model generally decreases classification performance compared to only using the features obtained by the PLM. In contrast to previous work, this shows that the incorporation of features extracted from topic models for text classification tasks should not be assumed beneficial.
[33] The Ever-Evolving Science Exam
Junying Wang, Zicheng Zhang, Yijin Guo, Farong Wen, Ye Shen, Yingji Liang, Yalun Wu, Wenzhe Li, Chunyi Li, Zijian Chen, Qi Jia, Guangtao Zhai
Main category: cs.CL
TL;DR: The paper introduces EESE (Ever-Evolving Science Exam), a dynamic benchmark with 100K+ science questions to evaluate foundation models’ scientific understanding while addressing data leakage and evaluation inefficiency issues through a periodically updated 500-instance subset.
Details
Motivation: Existing science benchmarks for foundation models face two major challenges: data leakage risks that compromise benchmarking validity, and evaluation inefficiency due to large-scale testing. As foundation models grow rapidly in capability and deployment, reliable assessment of their scientific understanding is increasingly needed.
Method: The approach consists of two components: 1) a non-public EESE-Pool with over 100K expertly constructed science question-answer pairs across 5 disciplines and 500+ subfields, built through a multi-stage pipeline ensuring Range, Reach, and Rigor; and 2) a periodically updated 500-instance subset, EESE, sampled and validated to enable leakage-resilient, low-overhead evaluations.
Result: Experiments on 32 open- and closed-source models demonstrate that EESE effectively differentiates the strengths and weaknesses of models in scientific fields and cognitive dimensions, providing a robust evaluation framework for scientific capabilities.
Conclusion: EESE provides a robust, scalable, and forward-compatible solution for science benchmark design, offering a realistic measure of how well foundation models handle science questions while addressing key limitations of existing benchmarks through its dynamic, periodically updated structure.
Abstract: As foundation models grow rapidly in capability and deployment, evaluating their scientific understanding becomes increasingly critical. Existing science benchmarks have made progress towards broad Range, wide Reach, and high Rigor, yet they often face two major challenges: data leakage risks that compromise benchmarking validity, and evaluation inefficiency due to large-scale testing. To address these issues, we introduce the Ever-Evolving Science Exam (EESE), a dynamic benchmark designed to reliably assess scientific capabilities in foundation models. Our approach consists of two components: 1) a non-public EESE-Pool with over 100K expertly constructed science instances (question-answer pairs) across 5 disciplines and 500+ subfields, built through a multi-stage pipeline ensuring Range, Reach, and Rigor; and 2) a periodically updated 500-instance subset EESE, sampled and validated to enable leakage-resilient, low-overhead evaluations. Experiments on 32 open- and closed-source models demonstrate that EESE effectively differentiates the strengths and weaknesses of models in scientific fields and cognitive dimensions. Overall, EESE provides a robust, scalable, and forward-compatible solution for science benchmark design, offering a realistic measure of how well foundation models handle science questions. The project page is at: https://github.com/aiben-ch/EESE.
[34] Introducing Quality Estimation to Machine Translation Post-editing Workflow: An Empirical Study on Its Usefulness
Siqi Liu, Guangrong Dai, Dechao Li
Main category: cs.CL
TL;DR: This study examines how sentence-level Quality Estimation (QE) affects machine translation post-editing efficiency and finds that QE significantly reduces post-editing time for English-Chinese translation, though inaccurate QE can be counterproductive.
Details
Motivation: The study investigates the practical usefulness of Quality Estimation in machine translation post-editing workflows, specifically its impact on post-editing speed and translator perceptions in English-Chinese translation tasks.
Method: The authors conducted a preliminary study analyzing sentence-level Quality Estimation in English-Chinese Machine Translation Post-Editing with student translators, examining interaction effects between QE and MT quality levels, as well as between QE and translation expertise levels, supplemented with interview data.
Result: QE significantly reduces post-editing time consistently across medium- and high-quality MT outputs and among translators with varying expertise levels. QE serves multiple functions including identifying problematic segments, validating translator evaluations, and enabling quality double-checking, though inaccurate QE can hinder the post-editing process.
Conclusion: Quality Estimation effectively improves machine translation post-editing efficiency regardless of MT quality or translator expertise level, but accuracy of QE is crucial for optimal performance. The research provides insights for better integration of QE into post-editing workflows to enhance translator productivity.
Abstract: This preliminary study investigates the usefulness of sentence-level Quality Estimation (QE) in English-Chinese Machine Translation Post-Editing (MTPE), focusing on its impact on post-editing speed and student translators' perceptions. It also explores the interaction effects between QE and MT quality, as well as between QE and translation expertise. The findings reveal that QE significantly reduces post-editing time. The examined interaction effects were not significant, suggesting that QE consistently improves MTPE efficiency across medium- and high-quality MT outputs and among student translators with varying levels of expertise. In addition to indicating potentially problematic segments, QE serves multiple functions in MTPE, such as validating translators’ evaluations of MT quality and enabling them to double-check translation outputs. However, interview data suggest that inaccurate QE may hinder post-editing processes. This research provides new insights into the strengths and limitations of QE, facilitating its more effective integration into MTPE workflows to enhance translators’ productivity.
[35] Learning Text Styles: A Study on Transfer, Attribution, and Verification
Zhiqiang Hu
Main category: cs.CL
TL;DR: This thesis develops computational methods for understanding and manipulating text styles through three main areas: text style transfer, authorship attribution, and authorship verification, using large language models with parameter-efficient adaptation techniques.
Details
Motivation: The work addresses the need to computationally understand and manipulate text styles across three critical tasks: modifying stylistic properties while preserving content, identifying authors through stylistic analysis, and verifying shared authorship between texts.
Method: The approach rests on three key techniques: (1) parameter-efficient adaptation of large language models, (2) contrastive disentanglement of stylistic features, and (3) instruction-based fine-tuning for explainable verification, addressing challenges in text style transfer, authorship attribution, and authorship verification.
Result: The thesis successfully addresses critical challenges in all three interconnected areas of text style analysis, demonstrating effective computational methods for style manipulation and authorship analysis.
Conclusion: The research establishes a comprehensive framework for computational text style understanding through interconnected approaches to style transfer, authorship attribution, and verification, leveraging modern large language model techniques for improved performance and explainability.
Abstract: This thesis advances the computational understanding and manipulation of text styles through three interconnected pillars: (1) Text Style Transfer (TST), which alters stylistic properties (e.g., sentiment, formality) while preserving content; (2) Authorship Attribution (AA), identifying the author of a text via stylistic fingerprints; and (3) Authorship Verification (AV), determining whether two texts share the same authorship. We address critical challenges in these areas by leveraging parameter-efficient adaptation of large language models (LLMs), contrastive disentanglement of stylistic features, and instruction-based fine-tuning for explainable verification.
[36] Exploring Gender Bias in Large Language Models: An In-depth Dive into the German Language
Kristin Gnadt, David Thulke, Simone Kopeinik, Ralf Schlüter
Main category: cs.CL
TL;DR: This paper presents five German datasets for evaluating gender bias in large language models (LLMs), revealing unique challenges specific to the German language and demonstrating the need for language-specific bias evaluation frameworks.
Details
Motivation: Existing gender bias evaluation methods for LLMs were developed primarily for English, which limits their transferability to other languages. Language-specific evaluation frameworks are needed, particularly for German, to properly assess gender bias across multilingual models.
Method: The authors developed five German datasets grounded in established gender bias concepts and made them accessible through multiple methodologies. They evaluated eight multilingual LLMs on these datasets to assess gender bias specifically in the German language context.
Result: The evaluation revealed unique German-specific challenges in gender bias, including ambiguous interpretation of male occupational terms and the influence of seemingly neutral nouns on gender perception. The findings demonstrated that gender bias manifests differently across languages, with German presenting distinct patterns compared to English.
Conclusion: The work contributes to understanding gender bias in LLMs across different languages and emphasizes the necessity for developing tailored, language-specific evaluation frameworks rather than directly transferring English-based methods to other languages.
Abstract: In recent years, various methods have been proposed to evaluate gender bias in large language models (LLMs). A key challenge lies in the transferability of bias measurement methods initially developed for the English language when applied to other languages. This work aims to contribute to this research strand by presenting five German datasets for gender bias evaluation in LLMs. The datasets are grounded in well-established concepts of gender bias and are accessible through multiple methodologies. Our findings, reported for eight multilingual LLM models, reveal unique challenges associated with gender bias in German, including the ambiguous interpretation of male occupational terms and the influence of seemingly neutral nouns on gender perception. This work contributes to the understanding of gender bias in LLMs across languages and underscores the necessity for tailored evaluation frameworks.
[37] Pixels to Principles: Probing Intuitive Physics Understanding in Multimodal Language Models
Mohamad Ballout, Serwan Jassim, Elia Bruni
Main category: cs.CL
TL;DR: This paper evaluates state-of-the-art multimodal large language models on intuitive physics tasks and finds that while vision encoders can capture physical plausibility cues, the models fail due to poor vision-language alignment rather than vision component limitations.
Details
Motivation: Current multimodal large language models (MLLMs) struggle with intuitive physics reasoning tasks, but it is unclear whether the limitation stems from the vision component, the language component, or their integration. Understanding this is crucial for improving MLLM performance on physical reasoning.
Method: Systematic evaluation of state-of-the-art MLLMs (InternVL 2.5, Qwen 2.5 VL, LLaVA-OneVision, and Gemini 2.0 Flash Thinking) on the GRASP and IntPhys 2 datasets, combined with a probing analysis of model embeddings at key processing stages to examine how task-relevant information is preserved and utilized.
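The probing analysis amounts to training linear classifiers on frozen intermediate embeddings; a minimal sketch, where the variable names are placeholders:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def probe_layer(embeddings, labels):
    """If a linear classifier recovers physical plausibility from a layer's
    embeddings, that layer carries the information, even when the full model
    fails to use it downstream."""
    clf = LogisticRegression(max_iter=1000)
    return cross_val_score(clf, embeddings, labels, cv=5).mean()

# e.g. compare the vision encoder output against a late LLM layer:
# acc_vision = probe_layer(vision_embs, plausible_labels)  # high per the paper
# acc_llm    = probe_layer(llm_embs, plausible_labels)     # drops when misaligned
```

A large gap between the two probe accuracies is exactly the vision-language misalignment the paper reports.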
Result: Even the latest MLLMs struggle to reliably distinguish physically plausible from implausible scenarios. Vision encoders successfully capture physical plausibility cues, but this information is not effectively utilized by the language model, revealing a critical vision-language misalignment that depends on task difficulty.
Conclusion: The primary limitation of MLLMs in intuitive physics tasks is not the vision component but the ineffective integration of visual and linguistic information. Vision-language alignment emerges as a key area for improvement in future MLLM development.
Abstract: This paper presents a systematic evaluation of state-of-the-art multimodal large language models (MLLMs) on intuitive physics tasks using the GRASP and IntPhys 2 datasets. We assess the open-source models InternVL 2.5, Qwen 2.5 VL, LLaVA-OneVision, and the proprietary Gemini 2.0 Flash Thinking, finding that even the latest models struggle to reliably distinguish physically plausible from implausible scenarios. To go beyond performance metrics, we conduct a probing analysis of model embeddings, extracting intermediate representations at key processing stages to examine how well task-relevant information is preserved. Our results show that, depending on task difficulty, a critical vision-language misalignment can emerge: vision encoders successfully capture physical plausibility cues, but this information is not effectively utilized by the language model, leading to failures in reasoning. This misalignment suggests that the primary limitation of MLLMs in intuitive physics tasks is not the vision component but the ineffective integration of visual and linguistic information. Our findings highlight vision-language alignment as a key area for improvement, offering insights for future MLLM development.
[38] Towards Automated Regulatory Compliance Verification in Financial Auditing with Large Language Models
Armin Berger, Lars Hillebrand, David Leonhard, Tobias Deußer, Thiago Bell Felix de Oliveira, Tim Dilmaghani, Mohamed Khaled, Bernd Kliem, Rüdiger Loitz, Christian Bauckhage, Rafet Sifa
Main category: cs.CL
TL;DR: This paper evaluates the effectiveness of various Large Language Models (LLMs) in financial document auditing and regulatory compliance, comparing open-source models like Llama-2 with proprietary models like GPT-4 using datasets from PricewaterhouseCoopers Germany.
Details
Motivation: Current AI-driven financial auditing systems can recommend relevant text passages from financial reports but fail to verify whether these excerpts actually comply with specific legal requirements. There's a need to assess how well different LLMs can handle regulatory compliance verification in financial auditing.
Method: The researchers conducted a comparative analysis of publicly available LLMs across different model configurations, focusing on open-source models (Llama-2) versus proprietary models (the GPT series). They used two custom datasets provided by PricewaterhouseCoopers Germany to evaluate performance in regulatory compliance detection.
Result: The open-source Llama-2 70 billion parameter model excelled at detecting non-compliance cases (true negatives), outperforming all proprietary counterparts in this specific task. However, proprietary models like GPT-4 showed superior overall performance across a broader range of scenarios, especially in non-English language contexts.
Conclusion: While open-source models like Llama-2 can outperform proprietary models in specific compliance detection tasks, proprietary models like GPT-4 remain superior for general-purpose financial auditing applications, particularly in multilingual environments. The choice between models depends on the specific requirements and language contexts of the auditing task.
Abstract: The auditing of financial documents, historically a labor-intensive process, stands on the precipice of transformation. AI-driven solutions have made inroads into streamlining this process by recommending pertinent text passages from financial reports to align with the legal requirements of accounting standards. However, a glaring limitation remains: these systems commonly fall short in verifying if the recommended excerpts indeed comply with the specific legal mandates. Hence, in this paper, we probe the efficiency of publicly available Large Language Models (LLMs) in the realm of regulatory compliance across different model configurations. We place particular emphasis on comparing cutting-edge open-source LLMs, such as Llama-2, with their proprietary counterparts like OpenAI’s GPT models. This comparative analysis leverages two custom datasets provided by our partner PricewaterhouseCoopers (PwC) Germany. We find that the open-source Llama-2 70-billion-parameter model demonstrates outstanding performance in detecting non-compliance or true negative occurrences, beating all of its proprietary counterparts. Nevertheless, proprietary models such as GPT-4 perform the best in a broad variety of scenarios, particularly in non-English contexts.
[39] P-CoT: A Pedagogically-motivated Participatory Chain-of-Thought Prompting for Phonological Reasoning in LLMs
Dongjun Jang, Youngchae Ahn, Hyopil Shin
Main category: cs.CL
TL;DR: This study introduces a novel Pedagogically-motivated Participatory Chain-of-Thought (P-CoT) prompting method to enhance phonological reasoning in large language models, achieving up to 52% improvement on tasks like rhyme generation and syllable counting across 12 LLMs.
Details
Motivation: Text-based large language models lack explicit phonological knowledge, yet phonological reasoning is crucial for language understanding. The study aims to explore and enhance LLMs' latent phonological abilities through better prompting strategies.
Method: The researchers developed a Pedagogically-motivated Participatory Chain-of-Thought (P-CoT) prompting technique based on educational theories like scaffolding and discovery learning. They evaluated this method alongside few-shot learning across 12 LLMs using the PhonologyBench benchmark for tasks including rhyme word generation, grapheme-to-phoneme conversion, and syllable counting.
Result: The P-CoT method consistently enhanced performance across all tested LLMs, achieving up to 52% improvement over baseline methods. The approach even surpassed human performance on certain phonological tasks, while few-shot learning showed inconsistent gains.
Conclusion: The study demonstrates that structured pedagogical prompting can effectively activate latent phonological abilities in LLMs. The P-CoT method provides a promising approach for enhancing phonological reasoning in text-based models, with potential for optimization and application to other linguistic domains.
Abstract: This study explores the potential of phonological reasoning within text-based large language models (LLMs). Utilizing the PhonologyBench benchmark, we assess tasks like rhyme word generation, g2p conversion, and syllable counting. Our evaluations across 12 LLMs reveal that while few-shot learning offers inconsistent gains, the introduction of a novel Pedagogically-motivated Participatory Chain-of-Thought (P-CoT) prompt, which is anchored in educational theories like scaffolding and discovery learning, consistently enhances performance. This method leverages structured guidance to activate latent phonological abilities, achieving up to 52% improvement and even surpassing human baselines in certain tasks. Future work could aim to optimize P-CoT prompts for specific models or explore their application across different linguistic domains.
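The paper's exact P-CoT templates are not reproduced in the abstract; the sketch below illustrates what a pedagogically scaffolded, participatory prompt for syllable counting might look like. The wording and the `FINAL ANSWER` convention are assumptions for illustration, not the paper's templates.

```python
# Hypothetical sketch of a P-CoT-style prompt for syllable counting.
# The scaffolded, participatory wording is our own illustration; the
# paper's actual templates may differ.

def build_pcot_prompt(word: str) -> str:
    return f"""You are a phonology tutor guiding a student (participatory framing).

Teacher: Let's count the syllables in "{word}" together, step by step.
Step 1 (scaffolding): Say the word slowly and mark each vowel sound.
Step 2: Group the consonants around each vowel sound to form syllables.
Step 3 (discovery): Write out the syllable breakdown you found.
Student: I'll work through each step before giving my final count.

Finish with a line of the form: FINAL ANSWER: <number>"""

prompt = build_pcot_prompt("phonology")
print(prompt)  # send to any chat LLM; parse the FINAL ANSWER line
```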
[40] Self-Contradiction as Self-Improvement: Mitigating the Generation-Understanding Gap in MLLMs
Yujin Han, Hao Chen, Andi Han, Zhiheng Wang, Xinyu Lin, Yingya Zhang, Shiwei Zhang, Difan Zou
Main category: cs.CL
TL;DR: This paper identifies and addresses self-contradiction in multimodal large language models (MLLMs) where models generate images that they themselves judge as misaligned with input prompts, proposing methods to leverage this contradiction for self-improvement.
Details
Motivation: MLLMs exhibit self-contradiction where their generation capabilities produce outputs that their own understanding capabilities deem misaligned with prompts. This asymmetry between generation and understanding suggests untapped potential for using stronger understanding to improve weaker generation capabilities.
Method: The authors define a “Nonunified score” to quantify self-contradiction, apply standard post-training methods (SFT, DPO) with internal supervision from the model’s understanding branch, and propose a curriculum-based strategy that gradually introduces harder samples as the model improves.
Result: Post-training with internal supervision successfully improves both generation quality and model unification. The study discovers a co-improvement effect where fine-tuning only the generation branch also enhances understanding capabilities. However, poor supervision can lead to co-degradation, and intrinsic metrics cannot distinguish between improvement and degradation.
Conclusion: Self-contradiction in MLLMs primarily stems from weak generation rather than poor understanding. This capability gap can be leveraged for self-improvement through internal supervision, leading to better unified multimodal models. The curriculum-based approach effectively enhances both generation and understanding while highlighting the critical importance of supervision quality.
Abstract: Despite efforts to unify multimodal generation and understanding tasks in a single model, we show these MLLMs exhibit self-contradiction where generation produces images deemed misaligned with input prompts based on the model’s own understanding. We define a Nonunified score that quantifies such self-contradiction. Our empirical results reveal that the self-contradiction mainly arises from weak generation that fails to align with prompts, rather than misunderstanding. This capability asymmetry indicates the potential of leveraging self-contradiction for self-improvement, where the stronger model understanding guides the weaker generation to mitigate the generation-understanding gap. Applying standard post-training methods (e.g., SFT, DPO) with such internal supervision successfully improves both generation and unification. We discover a co-improvement effect on both generation and understanding when only fine-tuning the generation branch, a phenomenon known in pre-training but underexplored in post-training. Our analysis shows improvements stem from better detection of false positives that were previously incorrectly identified as prompt-aligned. Theoretically, we show the aligned training dynamics between generation and understanding allow reduced prompt-misaligned generations to also improve mismatch detection in the understanding branch. Additionally, the framework reveals a potential risk of co-degradation under poor supervision, an overlooked phenomenon that is empirically validated in our experiments. Notably, we find intrinsic metrics like Nonunified score cannot distinguish co-degradation from co-improvement, which highlights the necessity of data quality checks. Finally, we propose a curriculum-based strategy based on our findings that gradually introduces harder samples as the model improves, leading to better unification and improved MLLM generation and understanding.
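As a rough illustration of the Nonunified idea, one can score a model by the fraction of its own generations that its own understanding branch rejects. The function names `generate_image` and `judge_alignment` below are hypothetical stand-ins for the two branches, not the paper's API.

```python
# Illustrative sketch of a Nonunified score: the fraction of a model's own
# generations that its own understanding branch judges as prompt-misaligned.
# `generate_image` and `judge_alignment` are hypothetical stand-ins for the
# MLLM's generation and understanding interfaces.

def nonunified_score(model, prompts):
    misaligned = 0
    for prompt in prompts:
        image = model.generate_image(prompt)            # generation branch
        aligned = model.judge_alignment(prompt, image)  # understanding branch (True/False)
        if not aligned:
            misaligned += 1
    return misaligned / len(prompts)

# A score near 0 means generation and understanding agree (unified);
# higher values quantify self-contradiction.
```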
[41] PICACO: Pluralistic In-Context Value Alignment of LLMs via Total Correlation Optimization
Han Jiang, Dongyao Zhu, Zhihua Wei, Xiaoyuan Yi, Ziang Xiao, Xing Xie
Main category: cs.CL
TL;DR: PICACO is a novel method for In-Context Alignment that addresses value tensions in LLMs by optimizing meta-instructions to navigate multiple conflicting human values simultaneously, achieving better balance across up to 8 distinct values without requiring fine-tuning.
Details
Motivation: Current In-Context Alignment (ICA) methods face the "Instruction Bottleneck" challenge where LLMs struggle to reconcile multiple intended values within a single prompt due to value tensions: human values are pluralistic and often impose conflicting demands (e.g., stimulation vs. tradition), leading to incomplete or biased alignment.
Method: PICACO optimizes a meta-instruction that navigates multiple values by maximizing the total correlation between specified values and LLM responses. This approach theoretically reinforces value correlation while reducing distractive noise, resulting in effective value instructions that better elicit LLMs’ understanding of multiple values without requiring fine-tuning.
Result: Extensive experiments on five value sets demonstrate that PICACO works effectively with both black-box and open-source LLMs, outperforms several recent strong baselines, and achieves better balance across up to 8 distinct values compared to existing methods.
Conclusion: PICACO successfully addresses the value tension problem in In-Context Alignment by providing a pluralistic approach that can handle multiple conflicting human values simultaneously, offering a practical solution for aligning LLMs with diverse human preferences without costly post-training.
Abstract: In-Context Learning has shown great potential for aligning Large Language Models (LLMs) with human values, helping reduce harmful outputs and accommodate diverse preferences without costly post-training, known as In-Context Alignment (ICA). However, LLMs’ comprehension of input prompts remains agnostic, limiting ICA’s ability to address value tensions: human values are inherently pluralistic, often imposing conflicting demands, e.g., stimulation vs. tradition. Current ICA methods therefore face the Instruction Bottleneck challenge, where LLMs struggle to reconcile multiple intended values within a single prompt, leading to incomplete or biased alignment. To address this, we propose PICACO, a novel pluralistic ICA method. Without fine-tuning, PICACO optimizes a meta-instruction that navigates multiple values to better elicit LLMs’ understanding of them and improve their alignment. This is achieved by maximizing the total correlation between specified values and LLM responses, theoretically reinforcing value correlation while reducing distractive noise, resulting in effective value instructions. Extensive experiments on five value sets show that PICACO works well with both black-box and open-source LLMs, outperforms several recent strong baselines, and achieves a better balance across up to 8 distinct values.
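The abstract specifies the objective (maximize total correlation between specified values and responses) but not the optimization procedure. The sketch below substitutes a simple black-box surrogate, scoring candidate meta-instructions by their weakest per-value judge score so that no value is sacrificed; `llm` and `value_judge` are assumed interfaces, and the surrogate is not the paper's actual objective.

```python
# Loose black-box approximation in the spirit of PICACO's meta-instruction
# search. The true method maximizes total correlation between values and
# responses; here a simple surrogate is used instead: the minimum per-value
# judge score, which rewards balanced coverage of all target values.
# `llm` and `value_judge` are hypothetical interfaces.

def select_meta_instruction(llm, value_judge, candidates, values, probes):
    best, best_score = None, float("-inf")
    for meta in candidates:
        per_value = []
        for v in values:
            # How well do responses under this meta-instruction reflect value v?
            scores = [value_judge(v, llm(meta + "\n" + q)) for q in probes]
            per_value.append(sum(scores) / len(scores))
        score = min(per_value)  # balance: the weakest value dominates
        if score > best_score:
            best, best_score = meta, score
    return best
```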
[42] Interpretable Topic Extraction and Word Embedding Learning using row-stochastic DEDICOM
Lars Hillebrand, David Biesner, Christian Bauckhage, Rafet Sifa
Main category: cs.CL
TL;DR: The paper presents a row-stochastic variation of the DEDICOM matrix factorization algorithm applied to pointwise mutual information matrices for simultaneous topic modeling and word embedding learning in text corpora.
Details
Motivation: To develop a uniquely interpretable matrix factorization method that can simultaneously identify latent topic clusters within vocabulary and learn interpretable word embeddings from text corpora, leveraging the interpretability advantages of DEDICOM for both symmetric and asymmetric square matrices.
Method: A new row-stochastic variation of the DEDICOM algorithm applied to pointwise mutual information matrices of text corpora, with an efficient training method for the constrained DEDICOM algorithm.
Result: The method successfully identifies latent topic clusters within vocabulary while simultaneously learning interpretable word embeddings, with qualitative evaluation demonstrating performance in both topic modeling and word embedding tasks.
Conclusion: The row-stochastic DEDICOM variation provides an effective approach for interpretable matrix factorization that can simultaneously perform topic modeling and word embedding learning, offering a unified framework for understanding text corpora structure.
Abstract: The DEDICOM algorithm provides a uniquely interpretable matrix factorization method for symmetric and asymmetric square matrices. We employ a new row-stochastic variation of DEDICOM on the pointwise mutual information matrices of text corpora to identify latent topic clusters within the vocabulary and simultaneously learn interpretable word embeddings. We introduce a method to efficiently train a constrained DEDICOM algorithm and a qualitative evaluation of its topic modeling and word embedding performance.
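DEDICOM factorizes a square matrix A as A ≈ W R Wᵀ; the row-stochastic variant constrains each row of W to be a probability distribution over topics. Below is a minimal numpy sketch using a softmax parameterization and plain gradient descent; the paper's constrained training procedure may differ.

```python
import numpy as np

# Minimal sketch of row-stochastic DEDICOM: factor a (PMI) matrix
# A ~= W R W^T, with each row of W kept on the probability simplex via a
# softmax parameterization. Optimizer and loss weighting are assumptions.

def row_stochastic_dedicom(A, k=10, steps=2000, lr=0.05, seed=0):
    rng = np.random.default_rng(seed)
    n = A.shape[0]
    Z = rng.normal(scale=0.1, size=(n, k))   # logits; softmax(Z) = W
    R = rng.normal(scale=0.1, size=(k, k))
    for _ in range(steps):
        expZ = np.exp(Z - Z.max(axis=1, keepdims=True))
        W = expZ / expZ.sum(axis=1, keepdims=True)       # row-stochastic
        E = W @ R @ W.T - A                               # reconstruction error
        gR = W.T @ E @ W                                  # grad of 0.5*||E||^2 wrt R
        gW = E @ W @ R.T + E.T @ W @ R                    # grad wrt W
        gZ = W * (gW - (gW * W).sum(axis=1, keepdims=True))  # softmax backprop
        R -= lr * gR / n
        Z -= lr * gZ
    return W, R

A = np.random.rand(50, 50); A = (A + A.T) / 2             # toy symmetric "PMI"
W, R = row_stochastic_dedicom(A, k=5)
print(W.sum(axis=1)[:3])  # each row sums to 1: word-over-topic distributions
```

Rows of W read directly as interpretable word-over-topic distributions, while R captures how topics interact, which is the interpretability argument for DEDICOM over unconstrained factorizations.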
[43] Advancing Risk and Quality Assurance: A RAG Chatbot for Improved Regulatory Compliance
Lars Hillebrand, Armin Berger, Daniel Uedelhoven, David Berghaus, Ulrich Warning, Tim Dilmaghani, Bernd Kliem, Thomas Schmid, Rüdiger Loitz, Rafet Sifa
Main category: cs.CL
TL;DR: A novel RAG system using LLMs, hybrid search, and relevance boosting significantly improves Risk and Quality query processing in regulated industries, outperforming traditional RAG approaches on 124 real-world queries.
Details
Motivation: Traditional R&Q assurance in regulated industries relies on specialized experts for policy interpretation, creating operational bottlenecks and limiting scalability when handling numerous daily queries requiring accurate regulatory compliance.
Method: Development of a Retrieval Augmented Generation (RAG) system that combines Large Language Models (LLMs) with hybrid search and relevance boosting techniques to automate and enhance R&Q query processing.
Result: The actively deployed system shows substantial improvements over traditional RAG approaches when evaluated on 124 expert-annotated real-world queries, with extensive hyperparameter analysis providing configuration insights.
Conclusion: The novel RAG system successfully addresses scalability issues in regulated industry R&Q assurance by automating policy interpretation while maintaining accuracy, offering a practical solution for improving operational efficiency in compliance-heavy environments.
Abstract: Risk and Quality (R&Q) assurance in highly regulated industries requires constant navigation of complex regulatory frameworks, with employees handling numerous daily queries demanding accurate policy interpretation. Traditional methods relying on specialized experts create operational bottlenecks and limit scalability. We present a novel Retrieval Augmented Generation (RAG) system leveraging Large Language Models (LLMs), hybrid search and relevance boosting to enhance R&Q query processing. Evaluated on 124 expert-annotated real-world queries, our actively deployed system demonstrates substantial improvements over traditional RAG approaches. Additionally, we perform an extensive hyperparameter analysis to compare and evaluate multiple configuration setups, delivering valuable insights to practitioners.
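The deployed system's fusion weights and boosting rules are not public; the sketch below shows the generic hybrid-search pattern, assuming min-max normalized BM25 and vector scores, a convex combination, and a multiplicative boost for metadata matches.

```python
# Illustrative hybrid-retrieval scoring with relevance boosting. Weights,
# normalization, and boosting rules are assumptions, not the paper's
# configuration.

def normalize(scores):
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0
    return {d: (s - lo) / span for d, s in scores.items()}

def hybrid_rank(bm25_scores, vector_scores, boosts, alpha=0.5):
    bm25, vec = normalize(bm25_scores), normalize(vector_scores)
    docs = set(bm25) | set(vec)
    fused = {
        d: (alpha * bm25.get(d, 0.0) + (1 - alpha) * vec.get(d, 0.0))
           * boosts.get(d, 1.0)  # e.g. 1.2 if the doc's policy area matches
        for d in docs
    }
    return sorted(fused, key=fused.get, reverse=True)

ranking = hybrid_rank(
    bm25_scores={"doc1": 12.3, "doc2": 8.1},
    vector_scores={"doc1": 0.71, "doc2": 0.85},
    boosts={"doc2": 1.2},
)
print(ranking)  # ['doc2', 'doc1']: the boost tips the fused score
```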
[44] RAVine: Reality-Aligned Evaluation for Agentic Search
Yilong Xu, Xiang Long, Zhi Zheng, Jinhua Gao
Main category: cs.CL
TL;DR: RAVine is a new evaluation framework for agentic search systems that addresses limitations in existing benchmarks by using realistic user queries, improving ground truth extraction, and evaluating the iterative search process rather than just final answers.
Details
Motivation: Existing evaluation frameworks for agentic search systems have three major limitations: (1) complex queries that don’t reflect realistic user scenarios, (2) noisy ground truth extraction that distorts fine-grained assessments, and (3) focus only on final answer quality while ignoring the iterative search process that is central to agentic search.
Method: RAVine introduces three key innovations: (1) targets multi-point queries and long-form answers that better align with user intents, (2) implements an attributable ground truth construction strategy to enhance fine-grained evaluation accuracy, and (3) examines model interactions with search tools throughout the iterative process while accounting for efficiency factors.
Result: The authors benchmarked a series of models using RAVine and derived several insights about agentic search systems, though specific quantitative results are not detailed in the abstract. The framework and datasets have been made publicly available.
Conclusion: RAVine provides a more reality-aligned evaluation framework that better captures the true capabilities of agentic search systems by evaluating both the quality of outputs and the efficiency of the iterative search process, potentially advancing the development of more effective agentic search systems.
Abstract: Agentic search, as a more autonomous and adaptive paradigm of retrieval augmentation, is driving the evolution of intelligent search systems. However, existing evaluation frameworks fail to align well with the goals of agentic search. First, the complex queries commonly used in current benchmarks often deviate from realistic user search scenarios. Second, prior approaches tend to introduce noise when extracting ground truth for end-to-end evaluations, leading to distorted assessments at a fine-grained level. Third, most current frameworks focus solely on the quality of final answers, neglecting the evaluation of the iterative process inherent to agentic search. To address these limitations, we propose RAVine, a Reality-Aligned eValuation framework for agentic LLMs with search. RAVine targets multi-point queries and long-form answers that better reflect user intents, and introduces an attributable ground truth construction strategy to enhance the accuracy of fine-grained evaluation. Moreover, RAVine examines a model’s interaction with search tools throughout the iterative process, and accounts for factors of efficiency. We benchmark a series of models using RAVine and derive several insights, which we hope will contribute to advancing the development of agentic search systems. The code and datasets are available at https://github.com/SwordFaith/RAVine.
[45] Unpacking Ambiguity: The Interaction of Polysemous Discourse Markers and Non-DM Signals
Jingni Wu, Amir Zeldes
Main category: cs.CL
TL;DR: This paper investigates how polysemous discourse markers (DMs) interact with non-DM signals in English discourse, finding that polysemous DMs co-occur with more diverse but not necessarily more numerous non-DM signals, with genre significantly affecting these patterns.
Details
Motivation: Discourse markers are crucial for coherence but often ambiguous and can be replaced by or co-occur with non-DMs. The interaction mechanism between DMs and non-DM signals remains unclear but is essential for disambiguation, creating a need to understand how DM polysemy relates to co-occurrence patterns with other signals.
Method: The authors use the eRST framework to propose a graded definition of DM polysemy and conduct correlation and regression analyses to examine whether polysemous DMs are accompanied by more numerous and diverse non-DM signals, while also investigating the influence of genre on these patterns.
Result: Polysemous DMs co-occur with more diverse non-DM signals, but the total number of co-occurring signals does not necessarily increase. Genre plays a significant role in shaping DM-signal interactions.
Conclusion: The study reveals that DM polysemy correlates with diversity rather than quantity of co-occurring non-DM signals, and that genre is a crucial factor in determining how discourse markers interact with other coherence signals in discourse.
Abstract: Discourse markers (DMs) like ‘but’ or ‘then’ are crucial for creating coherence in discourse, yet they are often replaced by or co-occur with non-DMs (‘in the morning’ can mean the same as ‘then’), and both can be ambiguous (‘since’ can refer to time or cause). The interaction mechanism between such signals remains unclear but pivotal for their disambiguation. In this paper we investigate the relationship between DM polysemy and co-occurrence of non-DM signals in English, as well as the influence of genre on these patterns. Using the framework of eRST, we propose a graded definition of DM polysemy, and conduct correlation and regression analyses to examine whether polysemous DMs are accompanied by more numerous and diverse non-DM signals. Our findings reveal that while polysemous DMs do co-occur with more diverse non-DMs, the total number of co-occurring signals does not necessarily increase. Moreover, genre plays a significant role in shaping DM-signal interactions.
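The paper defines its own graded polysemy measure within eRST; one plausible instantiation, assumed here purely for illustration, is the entropy of a marker's distribution over annotated relation senses.

```python
from collections import Counter
from math import log2

# Hedged sketch: a graded polysemy measure for a discourse marker could be
# the entropy of its distribution over annotated relation senses. Entropy
# is an illustrative assumption, not the paper's definition.

def polysemy(relation_labels):
    counts = Counter(relation_labels)
    total = sum(counts.values())
    return 0.0 - sum((c / total) * log2(c / total) for c in counts.values())

# "since" marking cause 60% of the time and time 40% of the time:
print(polysemy(["cause"] * 6 + ["time"] * 4))   # ~0.97 bits
print(polysemy(["contrast"] * 10))              # 0.0 bits: monosemous

# The study then relates such scores to the count and diversity of
# co-occurring non-DM signals via correlation and regression analyses.
```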
[46] Beyond Context Limits: Subconscious Threads for Long-Horizon Reasoning
Hongyin Luo, Nathaniel Morgan, Tina Li, Derek Zhao, Ai Vy Ngo, Philip Schroeder, Lijie Yang, Assaf Ben-Kish, Jack O’Brien, James Glass
Main category: cs.CL
TL;DR: The paper introduces TIM (Thread Inference Model) and TIMRUN runtime that enable LLMs to perform long-horizon reasoning beyond context limits by using reasoning trees and selective memory management, achieving virtually unlimited working memory for complex problem solving.
Details
Motivation: Large language models face context limits that bottleneck reasoning accuracy and efficiency, along with constraints from positional embeddings and GPU memory limitations that prevent effective long-horizon structured reasoning and multi-hop tool usage.
Method: The approach models natural language as reasoning trees (with length and depth) rather than linear sequences, consisting of tasks with thoughts, recursive subtasks, and conclusions. A working memory system retains only key-value states of most relevant context tokens using rule-based subtask-pruning, enabling reuse of positional embeddings and GPU memory throughout reasoning.
Result: The system maintains high inference throughput even when manipulating up to 90% of the KV cache in GPU memory. It demonstrates accurate reasoning on mathematical tasks and successfully handles information retrieval challenges requiring long-horizon reasoning and multi-hop tool use.
Conclusion: TIM and TIMRUN successfully overcome traditional LLM limitations by enabling virtually unlimited working memory and multi-hop tool calls within single model inference, providing an effective solution for complex reasoning tasks that exceed standard context windows.
Abstract: To break the context limits of large language models (LLMs) that bottleneck reasoning accuracy and efficiency, we propose the Thread Inference Model (TIM), a family of LLMs trained for recursive and decompositional problem solving, and TIMRUN, an inference runtime enabling long-horizon structured reasoning beyond context limits. Together, TIM hosted on TIMRUN supports virtually unlimited working memory and multi-hop tool calls within a single language model inference, overcoming output limits, positional-embedding constraints, and GPU-memory bottlenecks. Performance is achieved by modeling natural language as reasoning trees measured by both length and depth instead of linear sequences. The reasoning trees consist of tasks with thoughts, recursive subtasks, and conclusions based on the concept we proposed in Schroeder et al., 2025. During generation, we maintain a working memory that retains only the key-value states of the most relevant context tokens, selected by a rule-based subtask-pruning mechanism, enabling reuse of positional embeddings and GPU memory pages throughout reasoning. Experimental results show that our system sustains high inference throughput, even when manipulating up to 90% of the KV cache in GPU memory. It also delivers accurate reasoning on mathematical tasks and handles information retrieval challenges that require long-horizon reasoning and multi-hop tool use.
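A minimal sketch of the reasoning-tree idea: tasks hold a thought, recursive subtasks, and a conclusion, and a rule-based pruner keeps only finished subtasks' conclusions in working memory, which is what lets positional embeddings and KV-cache pages be reused. The data layout and pruning rule below are simplified assumptions, not TIMRUN's implementation.

```python
from __future__ import annotations
from dataclasses import dataclass, field

# Simplified reasoning tree with rule-based subtask pruning: once a subtask
# has concluded, only its conclusion stays in working memory; its internal
# thoughts can be evicted (freeing KV-cache pages and position slots).

@dataclass
class Task:
    thought: str
    subtasks: list[Task] = field(default_factory=list)
    conclusion: str | None = None

def working_memory(task: Task) -> list[str]:
    """Context the model still needs to attend to for the current task."""
    mem = [task.thought]
    for sub in task.subtasks:
        if sub.conclusion is not None:
            mem.append(sub.conclusion)        # finished: keep only its conclusion
        else:
            mem.extend(working_memory(sub))   # active: keep its full context
    return mem

root = Task("Plan a multi-hop answer")
root.subtasks = [
    Task("Search for source A", conclusion="A says X"),
    Task("Cross-check X against source B"),   # still in progress
]
print(working_memory(root))
# ['Plan a multi-hop answer', 'A says X', 'Cross-check X against source B']
```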
[47] Test-Time-Matching: Decouple Personality, Memory, and Linguistic Style in LLM-based Role-Playing Language Agent
Xiaoyu Zhan, Xinyu Fu, Hao Sun, Yuanqi Li, Jie Guo, Yanwen Guo
Main category: cs.CL
TL;DR: The paper proposes Test-Time-Matching (TTM), a training-free framework that enables LLMs to perform high-fidelity role-playing by automatically decomposing characters into personality, memory, and linguistic style components through a three-stage generation pipeline.
Details
Motivation: Current role-playing language agents face limitations: prompt-based approaches lack deep immersion in specific roles (especially well-known figures), while fine-tuning approaches require extensive data collection and computational resources, limiting their broader applicability.
Method: The Test-Time-Matching (TTM) framework uses LLM agents to automatically decouple character features into three components: personality, memory, and linguistic style. It employs a structured three-stage generation pipeline with test-time scaling and context engineering for controlled role-playing without requiring training.
Result: The framework achieves high-fidelity role-playing performance and enables seamless combinations across diverse linguistic styles and variations in personality and memory. Human assessment evaluations demonstrate outstanding performance in generating expressive and stylistically consistent character dialogues.
Conclusion: TTM successfully addresses the limitations of existing role-playing approaches by providing a training-free solution that achieves superior performance in character role-playing through automatic feature decomposition and structured generation, as validated by human assessments.
Abstract: The rapid advancement of large language models (LLMs) has enabled role-playing language agents to demonstrate significant potential in various applications. However, relying solely on prompts and contextual inputs often proves insufficient for achieving deep immersion in specific roles, particularly well-known fictional or public figures. On the other hand, fine-tuning-based approaches face limitations due to the challenges associated with data collection and the computational resources required for training, thereby restricting their broader applicability. To address these issues, we propose Test-Time-Matching (TTM), a training-free role-playing framework through test-time scaling and context engineering. TTM uses LLM agents to automatically decouple a character’s features into personality, memory, and linguistic style. Our framework involves a structured, three-stage generation pipeline that utilizes these features for controlled role-playing. It achieves high-fidelity role-playing performance and also enables seamless combinations across diverse linguistic styles and even variations in personality and memory. We evaluate our framework through human assessment, and the results demonstrate that our method achieves outstanding performance in generating expressive and stylistically consistent character dialogues.
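A hedged sketch of the decouple-then-generate pattern: an LLM first extracts personality, memory, and linguistic style from a raw profile, then a three-stage pipeline applies them in turn. The stage boundaries and prompt wording below are assumptions, not the paper's templates.

```python
# Hedged sketch of a TTM-style, training-free pipeline. `llm` is any
# text-in/text-out chat model; prompts are illustrative placeholders.

def ttm_reply(llm, character_profile: str, user_msg: str) -> str:
    # Decoupling: extract the three feature sets from the raw profile.
    persona = llm(f"Extract the personality traits from:\n{character_profile}")
    memory = llm(f"Extract key biographical memories from:\n{character_profile}")
    style = llm(f"Describe the linguistic style (diction, idioms, rhythm) of:\n{character_profile}")

    # Stage 1: draft content grounded in personality and memory.
    draft = llm(f"Persona: {persona}\nMemories: {memory}\n"
                f"As this character, draft a reply to: {user_msg}")
    # Stage 2: check consistency against memory and revise.
    checked = llm(f"Revise so nothing contradicts these memories:\n{memory}\n\nDraft: {draft}")
    # Stage 3: match the surface form to the linguistic style.
    return llm(f"Rewrite in this linguistic style: {style}\n\nText: {checked}")
```

Because style is applied as a separate final stage, a persona can in principle be recombined with a different linguistic style, which is the "seamless combination" property the paper highlights.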
[48] Agentar-Fin-R1: Enhancing Financial Intelligence through Domain Expertise, Training Efficiency, and Advanced Reasoning
Yanjun Zheng, Xiyang Du, Longfei Liao, Xiaoke Zhao, Zhaowen Zhou, Bo Zhang, Jiawei Liu, Xiang Qi, Zhe Li, Zhiqiang Zhang, Wang Wei, Peng Zhang
Main category: cs.CL
TL;DR: Researchers developed Agentar-Fin-R1, a series of financial large language models (8B and 32B parameters) based on Qwen3, designed to enhance reasoning capabilities, reliability, and domain specialization for financial applications through systematic optimization and trustworthiness frameworks.
Details
Motivation: Existing large language models often fall short in financial scenarios that demand robust reasoning capabilities, stringent trustworthiness requirements, and efficient adaptation to task-specific needs in the financial domain.
Method: The approach integrates a high-quality financial task taxonomy with a multi-layered trustworthiness assurance framework including trustworthy knowledge engineering, multi-agent data synthesis, data validation governance, label-guided automated difficulty-aware optimization, and two-stage learning processes.
Result: Agentar-Fin-R1 achieved state-of-the-art performance on financial benchmarks (FinEva, FinEval, FinanceIQ) and general reasoning datasets (MATH-500, GPQA), with exceptional performance on the newly proposed Finova evaluation benchmark for agent-level financial reasoning and compliance verification.
Conclusion: Agentar-Fin-R1 demonstrates effectiveness as a trustworthy solution for high-stakes financial applications, achieving both superior financial task performance and exceptional general reasoning capabilities through systematic optimization and comprehensive trustworthiness frameworks.
Abstract: Large Language Models (LLMs) demonstrate tremendous potential in the financial domain, yet existing models often fall short in scenarios demanding robust reasoning capabilities, stringent trustworthiness requirements, and efficient adaptation to task-specific needs. We introduce the Agentar-Fin-R1 series of financial large language models (8B and 32B parameters), specifically engineered based on the Qwen3 foundation model to enhance reasoning capabilities, reliability, and domain specialization for financial applications. Our optimization approach integrates a high-quality, systematic financial task taxonomy with a comprehensive multi-layered trustworthiness assurance framework. This framework encompasses high-quality trustworthy knowledge engineering, multi-agent trustworthy data synthesis, and rigorous data validation governance. Through label-guided automated difficulty-aware optimization, two-stage learning processes, and detailed attribution systems, we achieve substantial improvements in training efficiency. Our models undergo comprehensive evaluation on mainstream financial benchmarks including FinEva, FinEval, and FinanceIQ, as well as general reasoning datasets such as MATH-500 and GPQA. To thoroughly assess real-world deployment capabilities, we innovatively propose the Finova evaluation benchmark, which focuses on agent-level financial reasoning and compliance verification. Experimental results demonstrate that Agentar-Fin-R1 not only achieves state-of-the-art performance on financial tasks but also exhibits exceptional general reasoning capabilities, validating its effectiveness as a trustworthy solution for high-stakes financial applications.
[49] LingBench++: A Linguistically-Informed Benchmark and Reasoning Framework for Multi-Step and Cross-Cultural Inference with LLMs
Da-Chen Lian, Ri-Sheng Huang, Pin-Er Chen, Chunki Lim, You-Kuan Lin, Guan-Yu Tseng, Zi-Cheng Yang, Shu-Kai Hsieh
Main category: cs.CL
TL;DR: LingBench++ is a new benchmark for evaluating large language models on complex linguistic tasks from the International Linguistics Olympiad, featuring structured reasoning traces and multi-agent architecture that outperforms traditional approaches.
Details
Motivation: Existing benchmarks for LLMs focus only on final answer accuracy without providing structured reasoning evaluation. There's a need for linguistically-informed evaluation that covers low-resource and cross-cultural languages with interpretable reasoning processes.
Method: The authors develop the LingBench++ benchmark with structured reasoning traces, stepwise evaluation protocols, and typological metadata across 90+ languages. They propose a multi-agent architecture that integrates grammatical knowledge retrieval, tool-augmented reasoning, and deliberate hypothesis testing.
Result: Models equipped with external knowledge sources and iterative reasoning outperform single-pass approaches in both accuracy and interpretability through systematic comparisons of baseline and proposed agentic models.
Conclusion: LingBench++ provides a comprehensive foundation for advancing linguistically grounded, culturally informed, and cognitively plausible reasoning in large language models, moving beyond simple accuracy metrics to more sophisticated evaluation frameworks.
Abstract: We propose LingBench++, a linguistically-informed benchmark and reasoning framework designed to evaluate large language models (LLMs) on complex linguistic tasks inspired by the International Linguistics Olympiad (IOL). Unlike prior benchmarks that focus solely on final answer accuracy, LingBench++ provides structured reasoning traces, stepwise evaluation protocols, and rich typological metadata across over 90 low-resource and cross-cultural languages. We further develop a multi-agent architecture integrating grammatical knowledge retrieval, tool-augmented reasoning, and deliberate hypothesis testing. Through systematic comparisons of baseline and our proposed agentic models, we demonstrate that models equipped with external knowledge sources and iterative reasoning outperform single-pass approaches in both accuracy and interpretability. LingBench++ offers a comprehensive foundation for advancing linguistically grounded, culturally informed, and cognitively plausible reasoning in LLMs.
[50] MegaScience: Pushing the Frontiers of Post-Training Datasets for Science Reasoning
Run-Ze Fan, Zengzhi Wang, Pengfei Liu
Main category: cs.CL
TL;DR: This paper introduces TextbookReasoning and MegaScience, large-scale open-source datasets for scientific reasoning, along with comprehensive evaluation systems and fine-tuned models that significantly outperform existing approaches in scientific domains.
Details
Motivation: The open-source AI community has focused primarily on mathematics and coding while neglecting scientific reasoning, mainly due to the lack of open, large-scale, high-quality, verifiable scientific reasoning datasets needed for developing AI scientists and supporting human researchers in natural science discovery.
Method: The authors created the TextbookReasoning dataset from 12k university-level textbooks with 650k reasoning questions across 7 scientific disciplines, developed MegaScience through systematic ablation studies to identify optimal data selection methodologies, built a comprehensive evaluation system across 15 benchmarks, and fine-tuned Llama3.1, Qwen2.5, and Qwen3 series models on the datasets.
Result: The datasets achieved superior performance and training efficiency with more concise response lengths compared to existing open-source scientific datasets. Models trained on MegaScience significantly outperformed corresponding official instruct models, with greater effectiveness observed for larger and stronger models, indicating scaling benefits for scientific tuning.
Conclusion: The research successfully addresses the gap in open-source scientific reasoning datasets by providing high-quality resources that demonstrate clear performance improvements. The scaling benefits observed suggest that larger models benefit more from scientific tuning, and the released resources will advance scientific reasoning research in the community.
Abstract: Scientific reasoning is critical for developing AI scientists and supporting human researchers in advancing the frontiers of natural science discovery. However, the open-source community has primarily focused on mathematics and coding while neglecting the scientific domain, largely due to the absence of open, large-scale, high-quality, verifiable scientific reasoning datasets. To bridge this gap, we first present TextbookReasoning, an open dataset featuring truthful reference answers extracted from 12k university-level scientific textbooks, comprising 650k reasoning questions spanning 7 scientific disciplines. We further introduce MegaScience, a large-scale mixture of high-quality open-source datasets totaling 1.25 million instances, developed through systematic ablation studies that evaluate various data selection methodologies to identify the optimal subset for each publicly available scientific dataset. Meanwhile, we build a comprehensive evaluation system covering diverse subjects and question types across 15 benchmarks, incorporating comprehensive answer extraction strategies to ensure accurate evaluation metrics. Our experiments demonstrate that our datasets achieve superior performance and training efficiency with more concise response lengths compared to existing open-source scientific datasets. Furthermore, we train Llama3.1, Qwen2.5, and Qwen3 series base models on MegaScience, which significantly outperform the corresponding official instruct models in average performance. In addition, MegaScience exhibits greater effectiveness for larger and stronger models, suggesting a scaling benefit for scientific tuning. We release our data curation pipeline, evaluation system, datasets, and seven trained models to the community to advance scientific reasoning research.
[51] Modeling the Sacred: Considerations when Using Religious Texts in Natural Language Processing
Ben Hutchinson
Main category: cs.CL
TL;DR: This position paper argues that using religious texts in NLP raises ethical concerns beyond model bias, including data provenance, cultural contexts, and proselytism, calling for greater consideration of researcher positionality and marginalized community perspectives.
Details
Motivation: Religious texts encode culturally important values that machine learning models can reproduce, and their translations are often repurposed in NLP research from their original proselytizing intentions, raising ethical concerns that extend beyond typical model bias considerations.
Method: This is a position paper that presents arguments and considerations rather than empirical methods. The author analyzes the ethical implications of using religious texts in NLP through the lens of data provenance, cultural contexts, and community impact.
Result: The paper identifies key ethical considerations including: the reproduction of cultural values in ML models, the repurposing of religious translations from their original proselytizing goals, and the need to consider impacts on marginalized linguistic and religious communities.
Conclusion: NLP researchers should give greater consideration to researcher positionality and the perspectives of marginalized linguistic and religious communities when using religious texts, as the ethical implications extend far beyond traditional model bias concerns to include data provenance and cultural context issues.
Abstract: This position paper concerns the use of religious texts in Natural Language Processing (NLP), which is of special interest to the Ethics of NLP. Religious texts are expressions of culturally important values, and machine learned models have a propensity to reproduce cultural values encoded in their training data. Furthermore, translations of religious texts are frequently used by NLP researchers when language data is scarce. This repurposes the translations from their original uses and motivations, which often involve attracting new followers. This paper argues that NLP’s use of such texts raises considerations that go beyond model biases, including data provenance, cultural contexts, and their use in proselytism. We argue for more consideration of researcher positionality, and of the perspectives of marginalized linguistic and religious communities.
[52] Erasing Conceptual Knowledge from Language Models
Rohit Gandikota, Sheridan Feucht, Samuel Marks, David Bau
Main category: cs.CL
TL;DR: This paper introduces Erasure of Language Memory (ELM), a method for selectively removing specific concepts from language models while preserving their general capabilities, using the model’s own classification abilities to identify and reduce generation of unwanted content.
Details
Motivation: The need for concept-level unlearning in language models to remove unwanted knowledge (like biosecurity or cybersecurity information) while maintaining the model's overall performance and capabilities on other tasks.
Method: ELM uses the language model itself as a classifier to identify concept-specific content, then applies targeted low-rank updates to reduce generation probabilities for undesired concepts by matching distributions defined by the model’s introspective classification capabilities.
Result: ELM-modified models achieve near-random performance on assessments of erased concepts (biosecurity, cybersecurity, literary domains) while maintaining generation coherence, preserving benchmark performance on unrelated tasks, and showing strong robustness against adversarial attacks.
Conclusion: ELM provides an effective principled approach for concept-level unlearning that successfully removes targeted knowledge while preserving model utility and demonstrating robustness, offering a viable solution for selective knowledge removal in language models.
Abstract: In this work, we introduce Erasure of Language Memory (ELM), a principled approach to concept-level unlearning that operates by matching distributions defined by the model’s own introspective classification capabilities. Our key insight is that effective unlearning should leverage the model’s ability to evaluate its own knowledge, using the language model itself as a classifier to identify and reduce the likelihood of generating content related to undesired concepts. ELM applies this framework to create targeted low-rank updates that reduce generation probabilities for concept-specific content while preserving the model’s broader capabilities. We demonstrate ELM’s efficacy on biosecurity, cybersecurity, and literary domain erasure tasks. Comparative evaluation reveals that ELM-modified models achieve near-random performance on assessments targeting erased concepts, while simultaneously preserving generation coherence, maintaining benchmark performance on unrelated tasks, and exhibiting strong robustness to adversarial attacks. Our code, data, and trained models are available at https://elm.baulab.info
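The core distribution-matching move can be illustrated with logits: push the model's next-token distribution away from its concept-conditioned distribution, then fit low-rank updates toward that target. The formula and the strength `eta` below are a common form for this family of erasure objectives, assumed here for illustration rather than taken from the paper.

```python
import numpy as np

# Sketch of the distribution-matching idea: lower the probability of
# concept-specific tokens by pushing the plain next-token logits away from
# the concept-conditioned logits. This target form and `eta` are
# assumptions; ELM then trains low-rank (LoRA) updates so the edited model
# matches such a target on concept-related text while staying pinned to
# the original model elsewhere.

def erased_target_logits(logits_plain, logits_concept, eta=1.0):
    return logits_plain - eta * (logits_concept - logits_plain)

plain = np.array([2.0, 1.0, 0.5])     # neutral context
concept = np.array([2.0, 3.0, 0.5])   # context asserting the unwanted concept
print(erased_target_logits(plain, concept))  # [ 2.  -1.   0.5]: token 1 suppressed
```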
[53] Data Processing for the OpenGPT-X Model Family
Nicolo’ Brandizzi, Hammam Abdelwahab, Anirban Bhowmick, Lennard Helmer, Benny Jörg Stein, Pavel Denisov, Qasid Saleem, Michael Fromm, Mehdi Ali, Richard Rutmann, Farzad Naderi, Mohamad Saif Agy, Alexander Schwirjow, Fabian Küch, Luzian Hahn, Malte Ostendorff, Pedro Ortiz Suarez, Georg Rehm, Dennis Wegener, Nicolas Flores-Herr, Joachim Köhler, Johannes Leveling
Main category: cs.CL
TL;DR: This paper describes the data preparation pipeline for OpenGPT-X, a project developing open multilingual large language models for European languages, detailing processing methods for curated and web data with extensive filtering and deduplication.
Details
Motivation: To create open and high-performance multilingual large language models covering all major European languages for real-world applications within the European Union, requiring a comprehensive data preparation pipeline that ensures quality and compliance with European data regulations.
Method: The paper presents a dual-pipeline approach: one for curated data with minimal filtering and another for web data requiring extensive filtering and deduplication. The methodology includes data selection, requirement definition, specialized algorithmic solutions for each data type, and comprehensive processing steps leading to final filtered datasets.
Result: The project successfully developed distinct processing pipelines for curated and web data, created specialized algorithmic solutions for multilingual data preparation, and provided in-depth dataset analysis that increases transparency and aligns with European data regulations.
Conclusion: The paper concludes by sharing key insights and challenges encountered during large-scale multilingual data preparation, offering valuable recommendations for future endeavors in developing multilingual LLMs, particularly emphasizing the importance of specialized pipelines for different data types and regulatory compliance.
Abstract: This paper presents a comprehensive overview of the data preparation pipeline developed for the OpenGPT-X project, a large-scale initiative aimed at creating open and high-performance multilingual large language models (LLMs). The project goal is to deliver models that cover all major European languages, with a particular focus on real-world applications within the European Union. We explain all data processing steps, starting with the data selection and requirement definition to the preparation of the final filtered data. We distinguish between curated data and web data, as each of these categories is handled by distinct pipelines, with curated data undergoing minimal filtering and web data requiring extensive filtering and deduplication. This distinction guided the development of specialized algorithmic solutions for both pipelines. In addition to describing the processing methodologies, we provide an in-depth analysis of the datasets, increasing transparency and alignment with European data regulations. Finally, we share key insights and challenges faced during the project, offering recommendations for future endeavors in large-scale multilingual data preparation for LLMs.
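Illustrative Python for the web-data side of such a pipeline: cheap heuristic quality filters followed by deduplication. Real pipelines add language identification, quality classifiers, and fuzzy (MinHash) dedup; the thresholds below are assumptions.

```python
import hashlib

# Toy web-data cleaning pipeline: heuristic quality filters, then exact
# deduplication via normalized content hashing. Thresholds are assumed.

def quality_filters(doc: str) -> bool:
    words = doc.split()
    if len(words) < 50:                 # too short to be useful (assumed threshold)
        return False
    alpha_ratio = sum(w.isalpha() for w in words) / len(words)
    return alpha_ratio >= 0.8           # drop markup/numeric noise (assumed)

def dedup(docs):
    seen, kept = set(), []
    for doc in docs:
        key = " ".join(doc.split()).lower()           # normalize whitespace/case
        h = hashlib.sha256(key.encode()).hexdigest()
        if h not in seen:
            seen.add(h)
            kept.append(doc)
    return kept

corpus = ["lorem ipsum " * 40, "Lorem  Ipsum " * 40]  # near-identical docs
print(len(dedup([d for d in corpus if quality_filters(d)])))  # 1
```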
[54] Atomic Calibration of LLMs in Long-Form Generations
Caiqi Zhang, Ruihan Yang, Zhisong Zhang, Xinting Huang, Sen Yang, Dong Yu, Nigel Collier
Main category: cs.CL
TL;DR: This paper introduces atomic calibration, a fine-grained approach to evaluate confidence calibration in large language models by breaking down long-form responses into atomic claims, addressing the limitations of existing macro-level calibration methods that only provide single confidence scores for entire responses.
Details
Motivation: Large language models suffer from hallucinations in real-world applications, and existing confidence calibration research focuses only on short-form tasks with single response-level confidence scores (macro calibration), which is insufficient for long-form generations that contain complex statements with mixed accurate and inaccurate information.
Method: The authors propose atomic calibration that evaluates factuality calibration at a fine-grained level by decomposing long responses into atomic claims. They classify confidence elicitation methods into discriminative and generative types and demonstrate that combining these methods can enhance calibration performance.
Result: Extensive experiments on various LLMs and datasets show that atomic calibration is well-suited for long-form generation tasks and can also improve macro calibration results. The approach reveals insightful patterns in LLM confidence throughout the generation process.
Conclusion: Atomic calibration provides a more effective approach for confidence estimation in long-form text generation by operating at the granular level of atomic claims, offering better calibration performance than traditional macro-level methods and providing insights into LLM confidence patterns during generation.
Abstract: Large language models (LLMs) often suffer from hallucinations, posing significant challenges for real-world applications. Confidence calibration, which estimates the underlying uncertainty of model predictions, is essential to enhance the LLMs’ trustworthiness. Existing research on LLM calibration has primarily focused on short-form tasks, providing a single confidence score at the response level (macro calibration). However, this approach is insufficient for long-form generations, where responses often contain more complex statements and may include both accurate and inaccurate information. Therefore, we introduce atomic calibration, a novel approach that evaluates factuality calibration at a fine-grained level by breaking down long responses into atomic claims. We classify confidence elicitation methods into discriminative and generative types and demonstrate that their combination can enhance calibration. Our extensive experiments on various LLMs and datasets show that atomic calibration is well-suited for long-form generation and can also improve macro calibration results. Additionally, atomic calibration reveals insightful patterns in LLM confidence throughout the generation process.
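A schematic of the atomic workflow: decompose a response into claims, elicit a confidence per claim both generatively (a verbalized probability) and discriminatively, and combine. The prompts and the averaging rule below are placeholders, not the paper's exact methods.

```python
# Hedged sketch of atomic calibration: split a long response into atomic
# claims, elicit a confidence for each, and aggregate. `llm` is a generic
# text-in/text-out interface; prompts are illustrative.

def atomic_confidences(llm, response: str):
    listing = llm(f"List one atomic factual claim per line in:\n{response}")
    claims = [c.strip() for c in listing.splitlines() if c.strip()]
    scored = []
    for claim in claims:
        # Generative elicitation: the model verbalizes a probability.
        gen = float(llm(f"Give a probability in [0, 1] that this is true: {claim}"))
        # Discriminative elicitation (stubbed as another scored query here;
        # in practice, e.g., the model's P('True') on a yes/no judgment).
        disc = float(llm(f"Answer only with a number in [0, 1]. Is this true? {claim}"))
        scored.append((claim, (gen + disc) / 2))  # simple combination
    return scored

# Calibration is then assessed per claim (e.g., ECE over claim buckets)
# rather than with one confidence score for the whole response.
```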
[55] Beyond English: Evaluating Automated Measurement of Moral Foundations in Non-English Discourse with a Chinese Case Study
Calvin Yixiang Cheng, Scott A Hale
Main category: cs.CL
TL;DR: This study evaluates different computational approaches for measuring moral foundations in non-English texts, finding that large language models (LLMs) and multilingual models outperform machine translation and local lexicon methods, though human validation remains crucial for capturing cultural nuances.
Details
Motivation: Most moral foundation theory resources are developed for English, limiting cross-linguistic applications. There is a need to evaluate computational approaches for measuring moral foundations in non-English corpora to enable broader cross-cultural moral analysis.
Method: The researchers compared four computational approaches using Chinese as a case study: (1) applying English resources to machine translated text, (2) using local language lexicons, (3) employing multilingual language models, and (4) utilizing large language models (LLMs). They evaluated the effectiveness of each approach for measuring moral foundations in non-English texts.
Result: Machine translation and local lexicon approaches proved insufficient for complex moral assessments, often losing cultural information. Multilingual models and LLMs showed reliable cross-language performance with transfer learning capabilities. LLMs demonstrated superior data efficiency compared to other methods.
Conclusion: LLMs show significant potential for cross-language moral foundation measurements and other complex multilingual deductive coding tasks. However, human-in-the-loop validation remains essential as even advanced models may miss cultural nuances in cross-language moral assessments.
Abstract: This study explores computational approaches for measuring moral foundations (MFs) in non-English corpora. Since most resources are developed primarily for English, cross-linguistic applications of moral foundation theory remain limited. Using Chinese as a case study, this paper evaluates the effectiveness of applying English resources to machine translated text, local language lexicons, multilingual language models, and large language models (LLMs) in measuring MFs in non-English texts. The results indicate that machine translation and local lexicon approaches are insufficient for complex moral assessments, frequently resulting in a substantial loss of cultural information. In contrast, multilingual models and LLMs demonstrate reliable cross-language performance with transfer learning, with LLMs excelling in terms of data efficiency. Importantly, this study also underscores the need for human-in-the-loop validation of automated MF assessment, as the most advanced models may overlook cultural nuances in cross-language measurements. The findings highlight the potential of LLMs for cross-language MF measurements and other complex multilingual deductive coding tasks.
[56] Universal Model Routing for Efficient LLM Inference
Wittawat Jitkrittum, Harikrishna Narasimhan, Ankit Singh Rawat, Jeevesh Juneja, Congchao Wang, Zifeng Wang, Alec Go, Chen-Yu Lee, Pradeep Shenoy, Rina Panigrahy, Aditya Krishna Menon, Sanjiv Kumar
Main category: cs.CL
TL;DR: UniRoute is a dynamic routing approach for LLMs that can route prompts to previously unseen models at test time by representing each LLM as a feature vector based on predictions on representative prompts, achieving effective routing among 30+ unseen LLMs.
Details
Motivation: Existing model routing techniques only work with fixed pools of LLMs known during training, but in practice new LLMs become available at test time. There's a need for dynamic routing that can efficiently route prompts to previously unobserved LLMs to reduce inference costs.
Method: UniRoute represents each LLM as a feature vector derived from predictions on a set of representative prompts. Two instantiations are proposed: cluster-based routing and learned cluster map routing. The approach is theoretically grounded as estimates of an optimal routing rule with quantified excess risk bounds.
Result: Experiments on public benchmarks demonstrate UniRoute’s effectiveness in routing among more than 30 previously unseen LLMs, showing the method can successfully handle dynamic routing scenarios with new models.
Conclusion: UniRoute successfully addresses the dynamic routing problem by enabling effective routing to previously unobserved LLMs through feature-based representation, providing both theoretical guarantees and practical effectiveness across diverse benchmarks.
Abstract: Model routing is a simple technique for reducing the inference cost of large language models (LLMs), wherein one maintains a pool of candidate LLMs, and learns to route each prompt to the smallest feasible LLM. Existing works focus on learning a router for a fixed pool of LLMs. In this paper, we consider the problem of dynamic routing, where new, previously unobserved LLMs are available at test time. We propose UniRoute, a new approach to this problem that relies on representing each LLM as a feature vector, derived based on predictions on a set of representative prompts. Based on this, we detail two effective instantiations of UniRoute, relying on cluster-based routing and a learned cluster map respectively. We show that these are estimates of a theoretically optimal routing rule, and quantify their errors via an excess risk bound. Experiments on a range of public benchmarks show the effectiveness of UniRoute in routing amongst more than 30 unseen LLMs.
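The cluster-based instantiation is easy to sketch: each LLM's feature vector is its per-cluster accuracy on a shared set of representative prompts, so a model added at test time only needs to be scored once on those prompts. The embedding/clustering choices and the cost tradeoff below are assumptions.

```python
import numpy as np

# Sketch of cluster-based routing in the spirit of UniRoute: describe each
# LLM by per-cluster accuracy on shared representative prompts, then route
# an incoming prompt (mapped to a cluster, e.g. by nearest centroid) to the
# model with the best accuracy-cost tradeoff for that cluster.

def llm_feature_vector(correct: np.ndarray, cluster_ids: np.ndarray, k: int):
    """correct[i] in {0,1}: whether the LLM answered representative prompt i
    correctly; cluster_ids[i]: that prompt's cluster. Returns the LLM's
    k-dimensional feature vector of per-cluster accuracies."""
    return np.array([correct[cluster_ids == c].mean() for c in range(k)])

def route(prompt_cluster: int, feature_vectors: dict, costs: dict, lam=0.1):
    return max(feature_vectors,
               key=lambda m: feature_vectors[m][prompt_cluster] - lam * costs[m])

k = 3
feats = {
    "small-llm": llm_feature_vector(np.array([1, 0, 1, 0, 0, 1]),
                                    np.array([0, 0, 1, 1, 2, 2]), k),
    "big-llm":   llm_feature_vector(np.array([1, 1, 1, 1, 1, 0]),
                                    np.array([0, 0, 1, 1, 2, 2]), k),
}
print(route(prompt_cluster=1, feature_vectors=feats,
            costs={"small-llm": 0.2, "big-llm": 1.0}))  # 'big-llm'
```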
[57] Reasoning Does Not Necessarily Improve Role-Playing Ability
Xiachong Feng, Longxu Dou, Lingpeng Kong
Main category: cs.CL
TL;DR: This study investigates whether reasoning techniques can enhance LLM role-playing capabilities through comprehensive experiments with 6 benchmarks, 24 LLMs, and 3 strategies, finding that reasoning methods may actually hurt role-playing performance and proposing future research directions.
Details
Motivation: The rapid expansion of role-playing LLMs in academic and commercial domains creates increasing demand for high-precision role-playing models, while concurrent advances in reasoning techniques push LLM performance boundaries, raising the question of whether reasoning can enhance role-playing capabilities.
Method: Comprehensive study using 6 role-playing benchmarks, 24 LLMs, and 3 distinct role-playing strategies: direct zero-shot role-playing, role-playing with Chain-of-Thought (CoT), and role-playing using reasoning-optimized LLMs to compare effectiveness.
Result: Key findings include: CoT may reduce role-playing performance, reasoning-optimized LLMs are unsuitable for role-playing, reasoning ability disrupts the role-playing scaling law, large models still lack proficiency in advanced role-playing, and Chinese role-playing performance surpasses English performance.
Conclusion: Based on experimental results, the authors propose two promising future research directions: Role-aware CoT for improving role-playing LLMs and Reinforcement Learning for role-playing LLMs, aiming to enhance adaptability, consistency, and effectiveness for both research and real-world applications.
Abstract: The application of role-playing large language models (LLMs) is rapidly expanding in both academic and commercial domains, driving an increasing demand for high-precision role-playing models. Simultaneously, the rapid advancement of reasoning techniques has continuously pushed the performance boundaries of LLMs. This intersection of practical role-playing demands and evolving reasoning capabilities raises an important research question: “Can reasoning techniques enhance the role-playing capabilities of LLMs?” To address this, we conduct a comprehensive study using 6 role-playing benchmarks, 24 LLMs, and 3 distinct role-playing strategies, comparing the effectiveness of direct zero-shot role-playing, role-playing with Chain-of-Thought (CoT), and role-playing using reasoning-optimized LLMs. Our findings reveal that CoT may reduce role-playing performance, reasoning-optimized LLMs are unsuitable for role-playing, reasoning ability disrupts the role-playing scaling law, large models still lack proficiency in advanced role-playing, and Chinese role-playing performance surpasses English role-playing performance. Furthermore, based on extensive experimental results, we propose two promising future research directions: Role-aware CoT for improving role-playing LLMs and Reinforcement Learning for role-playing LLMs, aiming to enhance the adaptability, consistency, and effectiveness of role-playing LLMs for both research and real-world applications.
[58] MPO: An Efficient Post-Processing Framework for Mixing Diverse Preference Alignment
Tianze Wang, Dongnan Gui, Yifan Hu, Shuhang Lin, Linjun Zhang
Main category: cs.CL
TL;DR: This paper proposes Mixing Preference Optimization (MPO), a post-processing framework that combines existing single-objective policies to handle diverse human preferences in RLHF, avoiding costly multi-objective training while achieving balanced performance across different preference dimensions.
Details
Motivation: Traditional RLHF relies on a single reward model that overlooks the diversity of human preferences. While multi-dimensional feedback approaches exist, they are costly and unstable due to the competing and heterogeneous nature of human preferences. There's a need for a more efficient alternative to multi-objective RLHF and MaxMin-RLHF.
Method: MPO is a post-processing framework that log-linearly combines existing single-objective policies into a unified policy. The weight of each policy is computed using batch stochastic mirror descent. This approach avoids the need for alignment training from scratch by leveraging already-trained policies.
Result: Empirical results show that MPO achieves balanced performance across diverse preferences while outperforming or matching existing models. The approach significantly reduces computational costs compared to traditional multi-objective RLHF methods.
Conclusion: MPO provides an effective and computationally efficient alternative to multi-objective RLHF by post-processing existing policies rather than training from scratch. It successfully balances diverse human preferences while reducing the costs and instability associated with multi-objective reinforcement learning approaches.
Abstract: Reinforcement Learning from Human Feedback (RLHF) has shown promise in aligning large language models (LLMs). Yet its reliance on a singular reward model often overlooks the diversity of human preferences. Recent approaches address this limitation by leveraging multi-dimensional feedback to fine-tune corresponding reward models and train LLMs using reinforcement learning. However, the process is costly and unstable, especially given the competing and heterogeneous nature of human preferences. In this paper, we propose Mixing Preference Optimization (MPO), a post-processing framework for aggregating single-objective policies as an alternative to both multi-objective RLHF (MORLHF) and MaxMin-RLHF. MPO avoids alignment from scratch. Instead, it log-linearly combines existing policies into a unified one with the weight of each policy computed via a batch stochastic mirror descent. Empirical results demonstrate that MPO achieves balanced performance across diverse preferences, outperforming or matching existing models with significantly reduced computational costs.
[59] LLMs syntactically adapt their language use to their conversational partner
Florian Kandra, Vera Demberg, Alexander Koller
Main category: cs.CL
TL;DR: This paper investigates whether large language models (LLMs) exhibit conversational adaptation behavior similar to humans by analyzing conversations between LLM agents and finding evidence of syntactic alignment over time.
Details
Motivation: Human speakers naturally align their language use with each other during conversations, but it's unclear whether large language models exhibit this same conversational adaptation behavior that is fundamental to human communication.
Method: The researchers constructed a corpus of conversations between LLM agents and empirically analyzed the syntactic choices made by the models throughout their interactions to measure language alignment patterns.
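One plausible way to quantify such convergence is sketched below; the choice of POS-tag bigrams as the syntactic feature and cosine similarity as the measure are assumptions for illustration, not necessarily the paper's exact setup.

```python
from collections import Counter
import math

def pos_bigrams(tags):
    """tags: POS tags for one utterance, e.g. ['DT', 'NN', 'VBZ']."""
    return Counter(zip(tags, tags[1:]))

def cosine(c1, c2):
    dot = sum(c1[k] * c2[k] for k in set(c1) | set(c2))
    n1 = math.sqrt(sum(v * v for v in c1.values()))
    n2 = math.sqrt(sum(v * v for v in c2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

# Alignment shows up as similarity trending upward across turns.
turns_a = [['DT', 'NN', 'VBZ'], ['DT', 'JJ', 'NN', 'VBZ']]
turns_b = [['PRP', 'VBP', 'NN'], ['DT', 'JJ', 'NN', 'VBZ']]
for t, (a, b) in enumerate(zip(turns_a, turns_b)):
    print(t, round(cosine(pos_bigrams(a), pos_bigrams(b)), 3))  # 0.0 then 1.0
```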
Result: The study found that two LLM agents increasingly made more similar syntactic choices as their conversations progressed, demonstrating measurable conversational adaptation behavior.
Conclusion: Modern LLMs do exhibit conversational adaptation and can adapt their language use to their conversational partners, at least in a basic form, similar to the alignment behavior observed in human communication.
Abstract: It has been frequently observed that human speakers align their language use with each other during conversations. In this paper, we study empirically whether large language models (LLMs) exhibit the same behavior of conversational adaptation. We construct a corpus of conversations between LLMs and find that two LLM agents end up making more similar syntactic choices as conversations go on, confirming that modern LLMs adapt their language use to their conversational partners in at least a rudimentary way.
[60] SciFi-Benchmark: Leveraging Science Fiction To Improve Robot Behavior
Pierre Sermanet, Anirudha Majumdar, Vikas Sindhwani
Main category: cs.CL
TL;DR: This paper creates a benchmark using 824 science fiction stories to test AI alignment with human values, finding that LLMs with constitutional rules achieve 95.8% alignment compared to only 21.2% for typical sci-fi decisions, and releases the SciFi-Benchmark dataset with over 9,000 questions for robot ethics research.
Details
Motivation: The rapid progress in AI and robotics raises concerns about whether AI-controlled robots will be aligned with human values. The authors aim to create a scalable method to probe this alignment question by leveraging the rich ethical scenarios found in science fiction literature.
Method: The researchers generated a benchmark from 824 major science fiction works, extracting key moments where AI agents made critical decisions. They used state-of-the-art LLMs to generate questions about similar situations, possible decisions, and alternatives. They measured alignment using human-voted answers and created constitutions that can be automatically improved through an amendment process.
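The alignment percentages reported below reduce to a simple agreement rate against the human-voted answers; a minimal sketch, with hypothetical field names:

```python
def alignment_rate(items):
    """items: dicts pairing a model decision with the human-voted answer per scenario."""
    matches = sum(i['model_answer'] == i['human_voted_answer'] for i in items)
    return matches / len(items)

items = [
    {'model_answer': 'refuse_harm', 'human_voted_answer': 'refuse_harm'},
    {'model_answer': 'obey_order',  'human_voted_answer': 'refuse_harm'},
]
print(alignment_rate(items))  # 0.5
```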
Result: Modern LLMs with constitutions achieved 95.8% alignment with human values, significantly higher than typical sci-fi decisions (21.2%). Constitutions improved base model alignment from 79.4% to 95.8% and showed resilience to adversarial prompts (improving from 23.3% to 92.3%). The constitutions also performed well on the real-world ASIMOV Benchmark.
Conclusion: Sci-fi-inspired constitutions are highly effective for AI alignment and applicable to real-world situations. The study demonstrates that constitutional AI approaches can substantially improve value alignment and provides a valuable dataset (SciFi-Benchmark) with 9,056 questions and 53,384 answers for advancing robot ethics and safety research.
Abstract: Given the recent rate of progress in artificial intelligence (AI) and robotics, a tantalizing question is emerging: would robots controlled by emerging AI systems be strongly aligned with human values? In this work, we propose a scalable way to probe this question by generating a benchmark spanning the key moments in 824 major pieces of science fiction literature (movies, tv, novels and scientific books) where an agent (AI or robot) made critical decisions (good or bad). We use a state-of-the-art LLM’s recollection of each key moment to generate questions in similar situations, the decisions made by the agent, and alternative decisions it could have made (good or bad). We then measure an approximation of how well models align with human values on a set of human-voted answers. We also generate rules that can be automatically improved via an amendment process in order to generate the first Sci-Fi inspired constitutions for promoting ethical behavior in AIs and robots in the real world. Our first finding is that modern LLMs paired with constitutions turn out to be well-aligned with human values (95.8%), contrary to unsettling decisions typically made in Sci-Fi (only 21.2% alignment). Secondly, we find that generated constitutions substantially increase alignment compared to the base model (79.4% to 95.8%), and show resilience to an adversarial prompt setting (23.3% to 92.3%). Additionally, we find that those constitutions are among the top performers on the ASIMOV Benchmark which is derived from real-world images and hospital injury reports. Sci-Fi-inspired constitutions are thus highly aligned and applicable in real-world situations. We release SciFi-Benchmark: a large-scale dataset to advance robot ethics and safety research. It comprises 9,056 questions and 53,384 answers generated through a novel LLM-introspection process, in addition to a smaller human-labeled evaluation set.
[61] Synthetic Data Generation Using Large Language Models: Advances in Text and Code
Mihai Nadas, Laura Diosan, Andreea Tomescu
Main category: cs.CL
TL;DR: This survey reviews how large language models (LLMs) are transforming synthetic training data generation for both text and code domains, examining techniques, benefits, challenges, and future research directions in using artificial data to augment or replace real-world datasets.
Details
Motivation: Labeled data is often scarce, expensive, or sensitive. LLMs offer the potential to generate artificial but task-relevant examples that can significantly augment or substitute for real-world datasets, which is particularly beneficial for low-resource tasks and code-centric applications.
Method: The paper surveys key techniques including prompt-based generation, retrieval-augmented pipelines, and iterative self-refinement. These methods are applied to enrich tasks like classification and question answering, as well as code-centric applications such as instruction tuning, code translation, and bug repair through automated verification of functional correctness.
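The iterative self-refinement loop with execution-based verification can be sketched as follows; `generate` stands in for any LLM call, and the retry budget is an assumption.

```python
import subprocess
import sys
import tempfile

def passes_tests(code: str, test_code: str):
    """Verify functional correctness by executing the candidate plus its tests."""
    with tempfile.NamedTemporaryFile('w', suffix='.py', delete=False) as f:
        f.write(code + '\n' + test_code)
        path = f.name
    proc = subprocess.run([sys.executable, path],
                          capture_output=True, text=True, timeout=10)
    return proc.returncode == 0, proc.stderr

def self_refine(generate, task: str, test_code: str, max_rounds: int = 3):
    prompt = task
    for _ in range(max_rounds):
        code = generate(prompt)
        ok, err = passes_tests(code, test_code)
        if ok:
            return code  # keep only execution-verified examples as synthetic data
        prompt = f"{task}\nPrevious attempt failed with:\n{err}\nFix the code."
    return None          # discard candidates that never pass
```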
Result: The survey identifies significant benefits including cost-effectiveness, broad coverage, and controllable diversity in synthetic data generation. However, it also reveals challenges such as factual inaccuracies in generated text, insufficient stylistic or distributional realism, and risks of bias amplification. Mitigation strategies are proposed ranging from filtering and weighting synthetic outputs to reinforcement learning with execution feedback.
Conclusion: The paper concludes that LLM-generated synthetic data is increasingly important for accelerating AI development, while emphasizing the need for ethical and quality safeguards. Open research directions include automated prompt engineering, cross-modal data synthesis, and robust evaluation frameworks to ensure responsible development of synthetic data generation techniques.
Abstract: This survey reviews how large language models (LLMs) are transforming synthetic training data generation in both natural language and code domains. By producing artificial but task-relevant examples, these models can significantly augment or even substitute for real-world datasets, particularly in scenarios where labeled data is scarce, expensive, or sensitive. This paper surveys recent advances in leveraging LLMs to create synthetic text and code, highlighting key techniques such as prompt-based generation, retrieval-augmented pipelines, and iterative self-refinement. We examine how these methods can enrich low-resource tasks (e.g., classification, question answering) and facilitate code-centric applications (e.g., instruction tuning, code translation, bug repair) through automated verification of functional correctness. Alongside potential benefits - cost-effectiveness, broad coverage, and controllable diversity - we discuss the accompanying challenges, including factual inaccuracies in generated text, insufficient stylistic or distributional realism, and risks of bias amplification. Proposed mitigation strategies range from filtering and weighting synthetic outputs to reinforcement learning with execution feedback in code domains. We conclude by outlining open research directions, such as automated prompt engineering, cross-modal data synthesis, and robust evaluation frameworks, underscoring the growing importance of LLM-generated synthetic data in accelerating AI development while emphasizing ethical and quality safeguards.
[62] Typed-RAG: Type-Aware Decomposition of Non-Factoid Questions for Retrieval-Augmented Generation
DongGeon Lee, Ahjeong Park, Hyeri Lee, Hyeonseo Nam, Yunho Maeng
Main category: cs.CL
TL;DR: The paper proposes Typed-RAG, a framework that improves non-factoid question answering by first classifying questions into predefined types, then decomposing them into focused sub-queries to enhance retrieval and answer quality, demonstrating superior performance over existing methods.
Details
Motivation: Non-factoid question answering (NFQA) is challenging due to its open-ended nature, diverse user intents, and need for multi-aspect reasoning, which reveals limitations of conventional retrieval-augmented generation (RAG) approaches.
Method: The Typed-RAG framework 1) classifies non-factoid questions into predefined types (e.g., Debate, Experience, Comparison), 2) decomposes questions into focused sub-queries targeting single aspects, and 3) combines sub-query results to produce comprehensive responses. The authors also construct the Wiki-NFQA benchmark dataset.
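A minimal sketch of that three-step flow, with `llm` and `retrieve` as placeholders for any model and retriever (the type list echoes the examples above):

```python
NFQ_TYPES = ["Debate", "Experience", "Comparison"]

def typed_rag(question: str, llm, retrieve) -> str:
    # 1) Type classification steers how the question is decomposed.
    qtype = llm(f"Classify this question into one of {NFQ_TYPES}: {question}")
    # 2) Decompose into focused, single-aspect sub-queries.
    sub_queries = llm(
        f"As a {qtype} question, split '{question}' into single-aspect "
        "sub-queries, one per line."
    ).splitlines()
    # 3) Answer each aspect with retrieved evidence, then aggregate.
    partials = [llm(f"Answer '{sq}' using: {retrieve(sq)}") for sq in sub_queries]
    return llm(f"Combine these aspect answers into one {qtype}-style response "
               f"to '{question}': {partials}")
```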
Result: Typed-RAG consistently outperforms existing QA approaches based on LLMs or RAG methods, demonstrating improved retrieval quality and answer generation for NFQA across various question types in the Wiki-NFQA benchmark.
Conclusion: Type-aware decomposition effectively improves both retrieval relevance and answer quality in non-factoid question answering, validating that structured question classification and decomposition can overcome limitations of conventional RAG approaches for complex, multi-aspect questions.
Abstract: Addressing non-factoid question answering (NFQA) remains challenging due to its open-ended nature, diverse user intents, and need for multi-aspect reasoning. These characteristics often reveal the limitations of conventional retrieval-augmented generation (RAG) approaches. To overcome these challenges, we propose Typed-RAG, a framework for type-aware decomposition of non-factoid questions (NFQs) within the RAG paradigm. Specifically, Typed-RAG first classifies an NFQ into a predefined type (e.g., Debate, Experience, Comparison). It then decomposes the question into focused sub-queries, each focusing on a single aspect. This decomposition enhances both retrieval relevance and answer quality. By combining the results of these sub-queries, Typed-RAG produces more informative and contextually aligned responses. Additionally, we construct Wiki-NFQA, a benchmark dataset for NFQA covering a wide range of NFQ types. Experiments show that Typed-RAG consistently outperforms existing QA approaches based on LLMs or RAG methods, validating the effectiveness of type-aware decomposition for improving both retrieval quality and answer generation in NFQA. Our code and dataset are available on https://github.com/TeamNLP/Typed-RAG.
[63] Evaluating Intermediate Reasoning of Code-Assisted Large Language Models for Mathematics
Zena Al-Khalili, Nick Howell, Dietrich Klakow
Main category: cs.CL
TL;DR: This paper conducts an in-depth analysis of code-assisted LLMs’ reasoning processes in mathematical tasks, revealing that while closed-source models use sound mathematical reasoning, open-source models often rely on unsound approaches like memorization and exhaustive search, with all models struggling on complex problems despite appearing accurate.
Details
Motivation: Current evaluation of code-assisted LLMs for mathematical reasoning focuses only on execution correctness, lacking rigorous evaluation of the underlying reasoning processes in generated programs. This creates a gap in understanding whether LLMs actually solve math problems through sound logical reasoning or just produce correct outputs through other means.
Method: The researchers evaluated five LLMs on several math datasets using both manual and automatic assessment methods. They developed a taxonomy to classify generated programs based on their logical soundness and analyzed the reasoning processes implemented by different types of models (closed-source vs open-source) across varying problem difficulties.
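The soundness distinction is easiest to see on a toy task of our own (not taken from the paper): both programs below return the right answer, but only one reasons mathematically.

```python
import math

# Task: smallest positive integer n with n**2 >= 2025.

def grounded():
    # Mathematically grounded: n = ceil(sqrt(2025)).
    return math.ceil(math.sqrt(2025))

def exhaustive():
    # Exhaustive search: correct output with no mathematical insight,
    # and it degrades as instances get harder.
    n = 1
    while n * n < 2025:
        n += 1
    return n

assert grounded() == exhaustive() == 45  # equal accuracy, unequal soundness
```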
Result: Closed-source LLMs demonstrate mathematically grounded reasoning in their programs, while open-source models frequently use unsound reasoning approaches including memorized information and exhaustive searches. All models show decreased sound reasoning capabilities as problem difficulty increases, indicating fundamental limitations in complex mathematical reasoning despite maintaining execution accuracy.
Conclusion: The study emphasizes the necessity for more comprehensive evaluation methods for code-assisted LLMs that go beyond simple execution accuracy metrics. Current accuracy-based evaluations fail to capture the quality of reasoning processes, and more holistic assessments are needed to truly understand LLMs’ capabilities and limitations in mathematical domains.
Abstract: Assisting LLMs with code generation improved their performance on mathematical reasoning tasks. However, the evaluation of code-assisted LLMs is generally restricted to execution correctness, lacking a rigorous evaluation of their generated programs. In this work, we bridge this gap by conducting an in-depth analysis of code-assisted LLMs’ generated programs in response to math reasoning tasks, with a focus on evaluating the soundness of the underlying reasoning processes. For this purpose, we assess the generations of five LLMs, on several math datasets, both manually and automatically, and propose a taxonomy of generated programs based on their logical soundness. Our findings show that the capabilities of models significantly impact the logic implemented to solve the problem. Closed-source LLMs ground their programs in mathematical concepts, whereas open-source models often resort to unsound reasoning, relying on memorized information and exhaustive searches. Furthermore, increasing the difficulty of problems decreases sound generations for all models, revealing a critical shortcoming of LLMs on complex mathematics, contrary to what accuracy metrics suggest. Our work highlights the need for more holistic evaluations of code-assisted LLMs beyond execution accuracy metrics, toward a better understanding of LLMs’ limits in the math domain.
[64] A Method for the Architecture of a Medical Vertical Large Language Model Based on Deepseek R1
Mingda Zhang, Jianglong Qin
Main category: cs.CL
TL;DR: This paper presents a lightweight medical large language model that uses knowledge transfer, model compression, and computational optimization to make advanced AI models practical for resource-constrained medical environments, achieving 92.1% USMLE accuracy while reducing memory usage by 64.7% and inference latency by 12.4%.
Details
Motivation: Foundation models like DeepSeek-R1 and ChatGPT face critical deployment challenges in medical settings due to high computational requirements and professional knowledge barriers, limiting their accessibility in resource-constrained healthcare environments.
Method: The approach uses three-dimensional optimization: (1) knowledge transfer pipeline from DeepSeek-R1-Distill-70B to 7B using Low-Rank Adaptation (LoRA), (2) model compression through 4-bit quantization and mixed-precision strategies, and (3) computational enhancement with Flash Attention acceleration, continuous batching, and specialized medical prompt templates.
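A minimal sketch of this recipe using Hugging Face transformers/peft; the model id and LoRA hyperparameters are illustrative assumptions, not the paper's exact configuration.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

bnb = BitsAndBytesConfig(
    load_in_4bit=True,                      # 4-bit weights for memory savings
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,  # mixed-precision compute
)
model = AutoModelForCausalLM.from_pretrained(
    "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B",  # assumed HF id for the 7B student
    quantization_config=bnb,
    attn_implementation="flash_attention_2",    # requires flash-attn installed
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)
lora = LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"],
                  task_type="CAUSAL_LM")
model = get_peft_model(model, lora)  # small trainable adapters for knowledge transfer
model.print_trainable_parameters()
```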
Result: The lightweight model achieves 92.1% accuracy on USMLE examinations while reducing memory consumption by 64.7% and inference latency by 12.4% compared to baseline models, demonstrating effective preservation of medical reasoning capabilities despite significant compression.
Conclusion: This work provides a practical solution for deploying advanced language models in resource-constrained medical environments, successfully balancing model performance with computational efficiency to enable broader accessibility of AI-assisted healthcare.
Abstract: Despite significant advances in foundation models like DeepSeek-R1 and ChatGPT, their deployment in medical settings faces critical challenges including computational requirements and professional knowledge barriers. This paper presents an efficient lightweight medical large language model architecture that systematically addresses these challenges through three-dimensional optimization: knowledge acquisition, model compression, and computational enhancement. We design a knowledge transfer pipeline from DeepSeek-R1-Distill-70B to DeepSeek-R1-Distill-7B using Low-Rank Adaptation (LoRA) for precise medical knowledge retention. Through 4-bit quantization and mixed-precision strategies, we achieve substantial model compression while preserving medical reasoning capabilities. The inference framework incorporates Flash Attention acceleration and continuous batching, complemented by specialized prompt templates for diverse medical queries. Experimental evaluation on medical benchmarks demonstrates that our approach maintains 92.1% accuracy on USMLE examinations while reducing memory consumption by 64.7% and inference latency by 12.4% compared to baseline models. This work provides a practical solution for deploying advanced language models in resource-constrained medical environments, enabling broader accessibility of AI-assisted healthcare.
[65] Generative Sign-description Prompts with Multi-positive Contrastive Learning for Sign Language Recognition
Siyu Liang, Yunan Li, Wentian Xin, Huizhou Chen, Xujie Liu, Kang Liu, Qiguang Miao
Main category: cs.CL
TL;DR: This paper presents the first integration of generative large language models into sign language recognition, proposing GSP-MC method that uses retrieval-augmented generation and multi-step prompting to create precise sign descriptions, achieving state-of-the-art results on Chinese SLR500 (97.1%) and Turkish AUTSL (97.07%) datasets.
Details
Motivation: Sign language recognition faces fundamental challenges in creating accurate annotations due to the inherent complexity of simultaneous manual and non-manual signals. Traditional methods struggle with the multimodal nature of sign language that combines complex hand movements, facial expressions, and body postures.
Method: The paper proposes the Generative Sign-description Prompts Multi-positive Contrastive learning (GSP-MC) method, which: 1) leverages retrieval-augmented generation (RAG) with domain-specific LLMs, 2) incorporates multi-step prompt engineering and expert-validated sign language corpora to produce precise multipart descriptions, 3) employs a dual-encoder architecture for bidirectional alignment of hierarchical skeleton features with multiple text descriptions at global, synonym, and part levels, and 4) combines global and part-level losses using KL divergence optimization.
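Component 4) can be sketched as a multi-positive contrastive objective; the shapes, temperature, and uniform target over positives below are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def multi_positive_kl_loss(skel_emb, text_embs, positive_mask, tau=0.07):
    """skel_emb: (d,); text_embs: (n, d); positive_mask: (n,) bool."""
    sims = F.normalize(text_embs, dim=-1) @ F.normalize(skel_emb, dim=-1)
    log_pred = F.log_softmax(sims / tau, dim=-1)        # predicted matching dist.
    target = positive_mask.float()
    target = target / target.sum()                      # uniform over positives
    return F.kl_div(log_pred, target, reduction="sum")  # KL(target || predicted)

skel = torch.randn(128)
texts = torch.randn(5, 128)  # e.g. global, synonym, and part-level descriptions
pos = torch.tensor([True, True, True, False, False])
print(multi_positive_kl_loss(skel, texts, pos))
```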
Result: The method achieves state-of-the-art performance on two datasets: 97.1% accuracy on Chinese SLR500 dataset and 97.07% accuracy on Turkish AUTSL dataset. The cross-lingual effectiveness demonstrates the method’s robustness across different sign languages.
Conclusion: This work successfully demonstrates the first integration of generative LLMs into sign language recognition, showing that the proposed GSP-MC method can effectively address annotation challenges and achieve superior performance. The cross-lingual effectiveness highlights the potential for developing inclusive communication technologies that can work across different sign languages.
Abstract: Sign language recognition (SLR) faces fundamental challenges in creating accurate annotations due to the inherent complexity of simultaneous manual and non-manual signals. To the best of our knowledge, this is the first work to integrate generative large language models (LLMs) into SLR tasks. We propose a novel Generative Sign-description Prompts Multi-positive Contrastive learning (GSP-MC) method that leverages retrieval-augmented generation (RAG) with domain-specific LLMs, incorporating multi-step prompt engineering and expert-validated sign language corpora to produce precise multipart descriptions. The GSP-MC method also employs a dual-encoder architecture to bidirectionally align hierarchical skeleton features with multiple text descriptions (global, synonym, and part level) through probabilistic matching. Our approach combines global and part-level losses, optimizing KL divergence to ensure robust alignment across all relevant text-skeleton pairs while capturing both sign-level semantics and detailed part dynamics. Experiments demonstrate state-of-the-art performance against existing methods on the Chinese SLR500 (reaching 97.1%) and Turkish AUTSL datasets (97.07% accuracy). The method’s cross-lingual effectiveness highlights its potential for developing inclusive communication technologies.
[66] HausaNLP: Current Status, Challenges and Future Directions for Hausa Natural Language Processing
Shamsuddeen Hassan Muhammad, Ibrahim Said Ahmad, Idris Abdulmumin, Falalu Ibrahim Lawan, Babangida Sani, Sukairaj Hafiz Imam, Yusuf Aliyu, Sani Abdullahi Sani, Ali Usman Umar, Tajuddeen Gwadabe, Kenneth Church, Vukosi Marivate
Main category: cs.CL
TL;DR: This paper provides a comprehensive overview of Hausa NLP research, introduces the HausaNLP catalog to aggregate resources, and proposes strategic directions to advance processing of Hausa, a language that remains low-resource despite having 200+ million speakers.
Details
Motivation: Despite having over 200 million speakers worldwide, Hausa remains understudied as a low-resource language in NLP research, facing challenges like limited open-source datasets, inadequate model representation, and gaps in fundamental NLP tasks compared to high-resource languages.
Method: The authors systematically examine existing Hausa NLP resources and research across key tasks (text classification, machine translation, named entity recognition, speech recognition, question answering), create the HausaNLP catalog to aggregate datasets and tools, and analyze challenges in integrating Hausa into large language models, including tokenization and dialectal variation issues.
Result: The paper presents a comprehensive overview of current Hausa NLP state, identifies research gaps across fundamental NLP tasks, and successfully introduces HausaNLP catalog (https://catalog.hausanlp.org) as a curated resource aggregation platform to enhance accessibility for researchers and practitioners.
Conclusion: The work establishes a foundation for accelerating Hausa NLP progress by proposing strategic research directions including dataset expansion, improved language modeling approaches, and strengthened community collaboration, while providing valuable insights applicable to broader multilingual NLP research for low-resource languages.
Abstract: Hausa Natural Language Processing (NLP) has gained increasing attention in recent years, yet remains understudied as a low-resource language despite having over 120 million first-language (L1) and 80 million second-language (L2) speakers worldwide. While significant advances have been made in high-resource languages, Hausa NLP faces persistent challenges, including limited open-source datasets and inadequate model representation. This paper presents an overview of the current state of Hausa NLP, systematically examining existing resources, research contributions, and gaps across fundamental NLP tasks: text classification, machine translation, named entity recognition, speech recognition, and question answering. We introduce HausaNLP (https://catalog.hausanlp.org), a curated catalog that aggregates datasets, tools, and research works to enhance accessibility and drive further development. Furthermore, we discuss challenges in integrating Hausa into large language models (LLMs), addressing issues of suboptimal tokenization and dialectal variation. Finally, we propose strategic research directions emphasizing dataset expansion, improved language modeling approaches, and strengthened community collaboration to advance Hausa NLP. Our work provides both a foundation for accelerating Hausa NLP progress and valuable insights for broader multilingual NLP research.
[67] Hierarchical Safety Realignment: Lightweight Restoration of Safety in Pruned Large Vision-Language Models
Yue Li, Xin Yi, Dongsheng Shi, Gerard de Melo, Xiaoling Wang, Linlin Wang
Main category: cs.CL
TL;DR: This paper introduces Hierarchical Safety Realignment (HSR), a lightweight method to restore safety performance in Large Vision-Language Models after pruning by selectively identifying and restoring critical attention heads and neurons that are essential for maintaining safety.
Details
Motivation: Large Vision-Language Models require pruning for deployment in resource-constrained environments, but existing pruning techniques cause degradation in safety performance, creating a need for methods to maintain safety while achieving model compression.
Method: Hierarchical Safety Realignment (HSR) works by: 1) quantifying the contribution of each attention head to safety, 2) identifying the most critical attention heads for safety, 3) selectively restoring neurons within these critical attention heads that are pivotal for safety maintenance, operating hierarchically from attention head level to neuron level.
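The head-then-neuron selection logic can be sketched as below; `safety_loss`, `mask_head`, `restore_neuron`, and the `top_neurons` accessor are hypothetical helpers standing in for model-specific operations.

```python
def hierarchical_safety_realignment(pruned, dense, safety_loss, mask_head,
                                    restore_neuron, heads,
                                    top_k_heads=8, top_k_neurons=32):
    base = safety_loss(pruned)
    # Level 1: a head's contribution = safety-loss increase when it is masked.
    scores = {h: safety_loss(mask_head(pruned, h)) - base for h in heads}
    critical = sorted(scores, key=scores.get, reverse=True)[:top_k_heads]
    # Level 2: within critical heads, restore the pivotal neurons from the
    # dense (unpruned) model.
    for h in critical:
        for n in dense.top_neurons(h, k=top_k_neurons):  # hypothetical accessor
            restore_neuron(pruned, h, n, source=dense)
    return pruned
```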
Result: HSR consistently achieves notable improvements in safety performance across various models and pruning strategies, successfully restoring safety in pruned LVLMs while maintaining the benefits of model compression.
Conclusion: This work presents the first approach explicitly focused on restoring safety in LVLMs post-pruning, demonstrating that hierarchical realignment can effectively address safety degradation issues while preserving the computational efficiency benefits of pruning.
Abstract: With the increasing size of Large Vision-Language Models (LVLMs), network pruning techniques aimed at compressing models for deployment in resource-constrained environments have garnered significant attention. However, we observe that pruning often leads to a degradation in safety performance. To address this issue, we present a novel and lightweight approach, termed Hierarchical Safety Realignment (HSR). HSR operates by first quantifying the contribution of each attention head to safety, identifying the most critical ones, and then selectively restoring neurons directly within these attention heads that play a pivotal role in maintaining safety. This process hierarchically realigns the safety of pruned LVLMs, progressing from the attention head level to the neuron level. We validate HSR across various models and pruning strategies, consistently achieving notable improvements in safety performance. To our knowledge, this is the first work explicitly focused on restoring safety in LVLMs post-pruning.
[68] Multimodal Forecasting of Sparse Intraoperative Hypotension Events Powered by Language Model
Jintao Zhang, Zirui Liu, Mingyue Cheng, Shilong Zhang, Tingyue Pan, Yitong Zhou, Qi Liu, Yanhu Xie
Main category: cs.CL
TL;DR: IOHFuseLM is a multimodal language model that predicts intraoperative hypotension by combining physiological time series and clinical data through a two-stage training approach with diffusion-augmented pretraining and token-level alignment.
Details
Motivation: Intraoperative hypotension (IOH) is strongly linked to adverse outcomes like myocardial injury and increased mortality, but prediction is challenging due to event sparsity and the difficulty of integrating static and dynamic data across diverse patients.
Method: The authors propose IOHFuseLM using: (1) two-stage training with domain adaptive pretraining on diffusion-augmented physiological time series followed by task fine-tuning, (2) token-level alignment of structured clinical descriptions with physiological time series for multimodal fusion, and (3) conversion of static patient attributes into structured text for personalization.
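Step (3) amounts to serializing patient metadata as text; a minimal sketch with hypothetical field names (the token-level alignment itself is learned by the model and is not shown here):

```python
def patient_to_text(attrs: dict) -> str:
    # Render static attributes as a structured clinical description.
    ordered = ["age", "sex", "asa_class", "surgery_type"]
    parts = [f"{k}: {attrs[k]}" for k in ordered if k in attrs]
    return "Patient profile | " + "; ".join(parts)

def build_input(attrs: dict, map_series: list) -> str:
    # Pair the description with serialized mean-arterial-pressure readings.
    series_text = " ".join(f"{v:.1f}" for v in map_series)
    return patient_to_text(attrs) + "\nMAP (mmHg): " + series_text

print(build_input({"age": 63, "sex": "F", "asa_class": 2},
                  [72.4, 71.8, 69.9, 65.2]))
```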
Result: Experimental evaluations on two intraoperative datasets show that IOHFuseLM outperforms established baselines in accurately identifying IOH events, demonstrating its effectiveness for clinical decision support scenarios.
Conclusion: IOHFuseLM successfully addresses the challenges of IOH prediction by effectively integrating multimodal data through language model frameworks, showing superior performance over existing methods and potential for clinical application.
Abstract: Intraoperative hypotension (IOH) frequently occurs under general anesthesia and is strongly linked to adverse outcomes such as myocardial injury and increased mortality. Despite its significance, IOH prediction is hindered by event sparsity and the challenge of integrating static and dynamic data across diverse patients. In this paper, we propose IOHFuseLM, a multimodal language model framework. To accurately identify and differentiate sparse hypotensive events, we leverage a two-stage training strategy. The first stage involves domain adaptive pretraining on IOH physiological time series augmented through diffusion methods, thereby enhancing the model sensitivity to patterns associated with hypotension. Subsequently, task fine-tuning is performed on the original clinical dataset to further enhance the ability to distinguish normotensive from hypotensive states. To enable multimodal fusion for each patient, we align structured clinical descriptions with the corresponding physiological time series at the token level. Such alignment enables the model to capture individualized temporal patterns alongside their corresponding clinical semantics. In addition, we convert static patient attributes into structured text to enrich personalized information. Experimental evaluations on two intraoperative datasets demonstrate that IOHFuseLM outperforms established baselines in accurately identifying IOH events, highlighting its applicability in clinical decision support scenarios. Our code is publicly available to promote reproducibility at https://github.com/zjt-gpu/IOHFuseLM.
[69] Self-Correcting Code Generation Using Small Language Models
Jeonghun Cho, Deokhyung Kang, Hyounghun Kim, Gary Geunbae Lee
Main category: cs.CL
TL;DR: This paper introduces CoCoS, a reinforcement learning approach that enables small language models (1B parameters) to perform effective multi-turn code correction, achieving significant improvements on code generation benchmarks despite smaller models typically struggling with self-correction.
Details
Motivation: Smaller language models struggle to exhibit reflective revision behavior in self-correction for code generation, unlike larger proprietary models. The authors aim to enhance the self-correction capabilities of small models to make them more effective at iteratively improving their code outputs.
Method: CoCoS uses online reinforcement learning with: (1) an accumulated reward function that aggregates rewards across the entire correction trajectory, (2) a fine-grained reward system suited for multi-turn correction scenarios, and (3) training objectives that encourage models to maintain correct outputs while progressively fixing incorrect ones across multiple turns.
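A minimal sketch of an accumulated, trajectory-level reward of this kind; the shaping constants are assumptions, but the structure matches the description: reward progressive fixes and penalize regressions from correct back to incorrect.

```python
def accumulated_reward(pass_flags, gamma=1.0,
                       r_correct=1.0, r_wrong=-0.2, r_regress=-1.0):
    """pass_flags: per-turn booleans, e.g. [False, False, True] when the
    model fixes its code on the third attempt."""
    total, prev = 0.0, None
    for t, ok in enumerate(pass_flags):
        r = r_correct if ok else r_wrong
        if prev is True and not ok:   # broke an already-correct answer
            r += r_regress
        total += (gamma ** t) * r
        prev = ok
    return total

print(accumulated_reward([False, True, True]))   # rewarded for the fix: 1.8
print(accumulated_reward([True, False, False]))  # penalized for regressing: -0.4
```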
Result: CoCoS achieves substantial improvements with 1B-scale models: 35.8% improvement on MBPP benchmark and 27.7% improvement on HumanEval benchmark compared to baseline methods, demonstrating that small models can effectively perform self-correction when properly trained.
Conclusion: The study demonstrates that smaller language models can be successfully trained to perform effective self-correction in code generation through appropriate reinforcement learning techniques, bridging the gap between small and large models in multi-turn code correction capabilities.
Abstract: Self-correction has demonstrated potential in code generation by allowing language models to revise and improve their outputs through successive refinement. Recent studies have explored prompting-based strategies that incorporate verification or feedback loops using proprietary models, as well as training-based methods that leverage their strong reasoning capabilities. However, whether smaller models possess the capacity to effectively guide their outputs through self-reflection remains unexplored. Our findings reveal that smaller models struggle to exhibit reflective revision behavior across both self-correction paradigms. In response, we introduce CoCoS, an approach designed to enhance the ability of small language models for multi-turn code correction. Specifically, we propose an online reinforcement learning objective that trains the model to confidently maintain correct outputs while progressively correcting incorrect outputs as turns proceed. Our approach features an accumulated reward function that aggregates rewards across the entire trajectory and a fine-grained reward better suited to multi-turn correction scenarios. This facilitates the model in enhancing initial response quality while achieving substantial improvements through self-correction. With 1B-scale models, CoCoS achieves improvements of 35.8% on the MBPP and 27.7% on HumanEval compared to the baselines.
[70] SenWiCh: Sense-Annotation of Low-Resource Languages for WiC using Hybrid Methods
Roksana Goworek, Harpal Karlcut, Muhammad Shezad, Nijaguna Darshana, Abhishek Mane, Syam Bondada, Raghav Sikka, Ulvi Mammadov, Rauf Allahverdiyev, Sriram Purighella, Paridhi Gupta, Muhinyia Ndegwa, Haim Dubossarsky
Main category: cs.CL
TL;DR: This paper releases new sense-annotated datasets for polysemous words in ten low-resource languages and presents a semi-automatic annotation method to evaluate cross-lingual transfer for Word-in-Context tasks, aiming to advance multilingual NLP research.
Details
Motivation: Advancing cross-lingual transfer requires high-quality evaluation datasets in low-resource languages, since the effectiveness of transfer depends on suitable benchmarks for understudied and typologically diverse languages.
Method: The authors develop a semi-automatic annotation method for creating sense-annotated datasets of sentences containing polysemous words, followed by Word-in-Context (WiC) formatted experiments to evaluate cross-lingual transfer performance.
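Turning sense annotations into WiC-format examples is mechanical; a minimal sketch with hypothetical field names:

```python
from itertools import combinations

def build_wic_pairs(annotated):
    """annotated: list of dicts with 'sentence', 'target', 'sense_id'."""
    pairs = []
    for a, b in combinations(annotated, 2):
        if a['target'] != b['target']:
            continue  # WiC pairs must share the same polysemous target word
        pairs.append({
            'sentence1': a['sentence'],
            'sentence2': b['sentence'],
            'word': a['target'],
            'label': a['sense_id'] == b['sense_id'],  # True = same sense
        })
    return pairs

data = [
    {'sentence': 'She sat on the river bank.', 'target': 'bank', 'sense_id': 1},
    {'sentence': 'He robbed the bank.',        'target': 'bank', 'sense_id': 2},
]
print(build_wic_pairs(data)[0]['label'])  # False: different senses
```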
Result: Successfully created and released sense-annotated datasets spanning ten low-resource languages across diverse language families and scripts, with demonstrated utility through WiC experiments that show the importance of targeted dataset creation for polysemy disambiguation.
Conclusion: The study highlights the importance of targeted dataset creation and evaluation for effective polysemy disambiguation in low-resource settings, with the released datasets and code supporting further research into fair, robust, and truly multilingual NLP.
Abstract: This paper addresses the critical need for high-quality evaluation datasets in low-resource languages to advance cross-lingual transfer. While cross-lingual transfer offers a key strategy for leveraging multilingual pretraining to expand language technologies to understudied and typologically diverse languages, its effectiveness is dependent on quality and suitable benchmarks. We release new sense-annotated datasets of sentences containing polysemous words, spanning ten low-resource languages across diverse language families and scripts. To facilitate dataset creation, the paper presents a demonstrably beneficial semi-automatic annotation method. The utility of the datasets is demonstrated through Word-in-Context (WiC) formatted experiments that evaluate transfer on these low-resource languages. Results highlight the importance of targeted dataset creation and evaluation for effective polysemy disambiguation in low-resource settings and transfer studies. The released datasets and code aim to support further research into fair, robust, and truly multilingual NLP.
[71] Can LLMs Generate Reliable Test Case Generators? A Study on Competition-Level Programming Problems
Yuhan Cao, Zian Chen, Kun Quan, Ziliang Zhang, Yu Wang, Xiaoning Dong, Yeqi Feng, Guanzhong He, Jingcheng Huang, Jianhao Li, Yixuan Tan, Jiafu Tang, Yilin Tang, Junlei Wu, Qianyu Xiao, Can Zheng, Shouchen Zhou, Yuxiang Zhu, Yiming Huang, Tian Xie, Tianxing He
Main category: cs.CL
TL;DR: This paper introduces TCGBench, a benchmark to evaluate LLMs’ ability to generate test case generators for competitive programming problems, finding that while LLMs can create valid generators, they struggle with targeted test cases that expose bugs in human code.
Details
Motivation: The extent to which LLMs can be utilized for code checking or debugging through test case generation remains largely unexplored, particularly in the context of competition-level programming problems.
Method: The authors propose the TCGBench benchmark with two tasks: (1) generating valid test case generators for CP problems, and (2) generating targeted test case generators that expose bugs in human-written code. They also construct a manually curated dataset of instructions for generating targeted generators and evaluate performance through prompting and fine-tuning.
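The two tasks imply two checks on a generated generator, sketched below on a toy problem of our own: validity (outputs satisfy the input constraints) and targetedness (some output makes a buggy program diverge from a reference solution). All callables are placeholders.

```python
import random

def is_valid(generator, constraint_check, trials=100):
    return all(constraint_check(generator(random.Random(s))) for s in range(trials))

def exposes_bug(generator, reference, buggy, trials=100):
    for s in range(trials):
        case = generator(random.Random(s))
        if reference(case) != buggy(case):
            return True  # found a distinguishing (targeted) test case
    return False

# Toy problem: given n, return n // 2; the buggy version mishandles odd n.
gen = lambda rng: rng.randint(1, 10**9)
check = lambda n: 1 <= n <= 10**9
ref = lambda n: n // 2
bug = lambda n: (n + 1) // 2  # wrong whenever n is odd
print(is_valid(gen, check), exposes_bug(gen, ref, bug))  # True True
```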
Result: State-of-the-art LLMs can generate valid test case generators in most cases, but struggle significantly with generating targeted test cases that reveal code flaws. Even advanced reasoning models like o3-mini fall short of human performance in targeted generator tasks. However, LLM performance can be enhanced using the curated dataset through both prompting and fine-tuning approaches.
Conclusion: While LLMs show promise in generating basic test case generators for competitive programming, there remains a significant gap in their ability to create targeted test cases for debugging purposes compared to human performance, though this can be partially addressed through specialized datasets and training approaches.
Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities in code generation, capable of tackling complex tasks during inference. However, the extent to which LLMs can be utilized for code checking or debugging through test case generation remains largely unexplored. We investigate this problem from the perspective of competition-level programming (CP) programs and propose TCGBench, a Benchmark for (LLM generation of) Test Case Generators. This benchmark comprises two tasks, aimed at studying the capabilities of LLMs in (1) generating valid test case generators for a given CP problem, and further (2) generating targeted test case generators that expose bugs in human-written code. Experimental results indicate that while state-of-the-art LLMs can generate valid test case generators in most cases, most LLMs struggle to generate targeted test cases that reveal flaws in human code effectively. Especially, even advanced reasoning models (e.g., o3-mini) fall significantly short of human performance in the task of generating targeted generators. Furthermore, we construct a high-quality, manually curated dataset of instructions for generating targeted generators. Analysis demonstrates that the performance of LLMs can be enhanced with the aid of this dataset, by both prompting and fine-tuning.
[72] Continuously Updating Digital Twins using Large Language Models
Harry Amad, Nicolás Astorga, Mihaela van der Schaar
Main category: cs.CL
TL;DR: The paper introduces CALM-DT, a Context-Adaptive Language Model-based Digital Twin that uses in-context learning to continuously adapt to changing environments without requiring re-training or re-design, unlike traditional digital twin approaches.
Details
Motivation: Current digital twin approaches struggle with adaptability as they require fixed, well-defined modeling environments and cannot adapt to novel variables without re-designs or incorporate new information without re-training. Real-world systems constantly change their state/action variables and available data, requiring digital twins to continuously update to remain relevant.
Method: The authors frame digital twinning as an in-context learning problem using large language models. They develop CALM-DT (Context-Adaptive Language Model-based Digital Twin) that utilizes fine-tuned encoders for sample retrieval and can simulate across diverse state-action spaces using in-context learning alone, enabling seamless updates at inference time.
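A minimal sketch of the retrieve-then-simulate loop this describes; `encode` and `llm` are placeholders for the fine-tuned encoder and the language model, and the prompt format is an assumption.

```python
import numpy as np

def simulate_step(state, action, memory, encode, llm, k=8):
    """memory: list of (state, action, next_state) transitions, any schema."""
    q = encode(f"{state} | {action}")
    # Retrieve the most similar stored transitions via encoder similarity.
    scored = sorted(memory,
                    key=lambda tr: -float(np.dot(q, encode(f"{tr[0]} | {tr[1]}"))))
    examples = "\n".join(f"state={s} action={a} -> next={ns}"
                         for s, a, ns in scored[:k])
    prompt = (f"Given these observed transitions:\n{examples}\n"
              f"Predict the next state for state={state} action={action}.")
    return llm(prompt)  # new variables only require new context, not retraining
```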
Result: CALM-DT demonstrates competitive performance with existing digital twin approaches while showing unique ability to adapt to changes in modeling environments without parameter updates. The system can accurately simulate across diverse state-action spaces using only in-context learning.
Conclusion: The paper successfully addresses the adaptability limitations of traditional digital twins by leveraging large language models and in-context learning, providing a solution that can continuously update and adapt to changing environments without requiring re-training or re-design.
Abstract: Digital twins are models of real-world systems that can simulate their dynamics in response to potential actions. In complex settings, the state and action variables, and available data and knowledge relevant to a system can constantly change, requiring digital twins to continuously update with these changes to remain relevant. Current approaches struggle in this regard, as they require fixed, well-defined modelling environments, and they cannot adapt to novel variables without re-designs, or incorporate new information without re-training. To address this, we frame digital twinning as an in-context learning problem using large language models, enabling seamless updates to the twin at inference time. We develop CALM-DT, a Context-Adaptive Language Model-based Digital Twin that can accurately simulate across diverse state-action spaces using in-context learning alone by utilising fine-tuned encoders for sample retrieval. We empirically demonstrate CALM-DT’s competitive performance with existing digital twin approaches, and its unique ability to adapt to changes in its modelling environment without parameter updates.
[73] Agent KB: Leveraging Cross-Domain Experience for Agentic Problem Solving
Xiangru Tang, Tianrui Qin, Tianhao Peng, Ziyang Zhou, Daniel Shao, Tingting Du, Xinming Wei, Peng Xia, Fang Wu, He Zhu, Ge Zhang, Jiaheng Liu, Xingyao Wang, Sirui Hong, Chenglin Wu, Hao Cheng, Chi Wang, Wangchunshu Zhou
Main category: cs.CL
TL;DR: Researchers developed Agent KB, a shared knowledge base that enables AI agents to learn from each other’s problem-solving experiences through a dual-phase retrieval mechanism, achieving significant performance improvements on benchmarks like GAIA and SWE-bench.
Details
Motivation: Current AI agents cannot effectively learn from each other's problem-solving experiences or use past successes to guide self-reflection and error correction in new tasks, limiting their ability to improve through knowledge transfer.
Method: Agent KB implements a novel teacher-student dual-phase retrieval mechanism where student agents retrieve workflow-level patterns for strategic guidance while teacher agents identify execution-level patterns for refinement, creating a hierarchical approach for knowledge transfer across agent frameworks.
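The dual-phase retrieval can be sketched as below; `kb.search`, `llm`, and the `level` tags on knowledge-base entries are hypothetical placeholders for the system's actual components.

```python
def solve_with_agent_kb(task, kb, llm, k=5):
    # Student phase: workflow-level patterns provide strategic guidance.
    strategies = kb.search(task, level="workflow", k=k)
    draft = llm(f"Task: {task}\nRelevant strategies: {strategies}\n"
                "Produce a step-by-step solution.")
    # Teacher phase: execution-level patterns refine the draft's weak steps.
    lessons = kb.search(draft, level="execution", k=k)
    return llm(f"Task: {task}\nDraft solution: {draft}\n"
               f"Execution lessons from similar attempts: {lessons}\n"
               "Refine the draft, fixing concrete execution errors.")
```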
Result: On GAIA benchmark, Agent KB improved success rates by up to 6.06 percentage points overall under pass@1. For SWE-bench code repair tasks, o3-mini achieved an 8.67 percentage point gain (from 23% to 31.67%) in pass@1. Ablation studies showed the refinement module is most critical, with its removal causing a 3.85% drop on challenging Level 3 tasks.
Conclusion: The hierarchical approach of combining strategic guidance with execution-level refinement proves essential for effective knowledge transfer between AI agents, demonstrating that both components are necessary for breaking agents out of limited reasoning pathways and achieving substantial performance improvements.
Abstract: Current AI agents cannot effectively learn from each other’s problem-solving experiences or use past successes to guide self-reflection and error correction in new tasks. We introduce Agent KB, a shared knowledge base that captures both high-level problem-solving strategies and detailed execution lessons, enabling knowledge transfer across agent frameworks. Agent KB implements a novel teacher-student dual-phase retrieval mechanism where student agents retrieve workflow-level patterns for strategic guidance while teacher agents identify execution-level patterns for refinement. This hierarchical approach enables agents to break out of limited reasoning pathways by incorporating diverse strategies from external sources. Evaluations on the GAIA benchmark demonstrate substantial performance gains, with Agent KB improving success rates by up to 6.06 percentage points overall under pass@1. For SWE-bench code repair tasks, our system significantly improved resolution rates, with o3-mini achieving an 8.67 percentage point gain (23 percent to 31.67 percent) in pass@1. Our ablation studies demonstrate that the refinement module proves most critical, with its removal causing a 3.85% drop on challenging Level 3 tasks, highlighting that effective knowledge transfer necessitates both strategic guidance and execution-level refinement.
[74] Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities
Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, Luke Marris, Sam Petulla, Colin Gaffney, Asaf Aharoni, Nathan Lintz, Tiago Cardal Pais, Henrik Jacobsson, Idan Szpektor, Nan-Jiang Jiang, Krishna Haridasan, Ahmed Omran, Nikunj Saunshi, Dara Bahri, Gaurav Mishra, Eric Chu, Toby Boyd, Brad Hekman, Aaron Parisi, Chaoyi Zhang, Kornraphop Kawintiranon, Tania Bedrax-Weiss, Oliver Wang, Ya Xu, Ollie Purkiss, Uri Mendlovic, Ilaï Deutel, Nam Nguyen, Adam Langley, Flip Korn, Lucia Rossazza, Alexandre Ramé, Sagar Waghmare, Helen Miller, Nathan Byrd, Ashrith Sheshan, Raia Hadsell Sangnie Bhardwaj, Pawel Janus, Tero Rissa, Dan Horgan, Sharon Silver, Ayzaan Wahid, Sergey Brin, Yves Raimond, Klemen Kloboves, Cindy Wang, Nitesh Bharadwaj Gundavarapu, Ilia Shumailov, Bo Wang, Mantas Pajarskas, Joe Heyward, Martin Nikoltchev, Maciej Kula, Hao Zhou, Zachary Garrett, Sushant Kafle, Sercan Arik, Ankita Goel, Mingyao Yang, Jiho Park, Koji Kojima, Parsa Mahmoudieh, Koray Kavukcuoglu, Grace Chen, Doug Fritz, Anton Bulyenov, Sudeshna Roy, Dimitris Paparas, Hadar Shemtov, Bo-Juen Chen, Robin Strudel, David Reitter, Aurko Roy, Andrey Vlasov, Changwan Ryu, Chas Leichner, Haichuan Yang, Zelda Mariet, Denis Vnukov, Tim Sohn, Amy Stuart, Wei Liang, Minmin Chen, Praynaa Rawlani, Christy Koh, JD Co-Reyes, Guangda Lai, Praseem Banzal, Dimitrios Vytiniotis, Jieru Mei, Mu Cai, Mohammed Badawi, Corey Fry, Ale Hartman, Daniel Zheng, Eric Jia, James Keeling, Annie Louis, Ying Chen, Efren Robles, Wei-Chih Hung, Howard Zhou, Nikita Saxena, Sonam Goenka, Olivia Ma, Zach Fisher, Mor Hazan Taege, Emily Graves, David Steiner, Yujia Li, Sarah Nguyen, Rahul Sukthankar, Joe Stanton, Ali Eslami, Gloria Shen, Berkin Akin, Alexey Guseynov, Yiqian Zhou, Jean-Baptiste Alayrac, Armand Joulin, Efrat Farkash, Ashish Thapliyal, Stephen Roller, Noam Shazeer, Todor Davchev, Terry Koo, Hannah Forbes-Pollard, Kartik Audhkhasi, Greg Farquhar, Adi Mayrav Gilady, Maggie Song, John Aslanides, Piermaria Mendolicchio, Alicia Parrish, John Blitzer, Pramod Gupta, Xiaoen Ju, Xiaochen Yang, Puranjay Datta, Andrea Tacchetti, Sanket Vaibhav Mehta, Gregory Dibb, Shubham Gupta, Federico Piccinini, Raia Hadsell, Sujee Rajayogam, Jiepu Jiang, Patrick Griffin, Patrik Sundberg, Jamie Hayes, Alexey Frolov, Tian Xie, Adam Zhang, Kingshuk Dasgupta, Uday Kalra, Lior Shani, Klaus Macherey, Tzu-Kuo Huang, Liam MacDermed, Karthik Duddu, Paulo Zacchello, Zi Yang, Jessica Lo, Kai Hui, Matej Kastelic, Derek Gasaway, Qijun Tan, Summer Yue, Pablo Barrio, John Wieting, Weel Yang, Andrew Nystrom, Solomon Demmessie, Anselm Levskaya, Fabio Viola, Chetan Tekur, Greg Billock, George Necula, Mandar Joshi, Rylan Schaeffer, Swachhand Lokhande, Christina Sorokin, Pradeep Shenoy, Mia Chen, Mark Collier, Hongji Li, Taylor Bos, Nevan Wichers, Sun Jae Lee, Angéline Pouget, Santhosh Thangaraj, Kyriakos Axiotis, Phil Crone, Rachel Sterneck, Nikolai Chinaev, Victoria Krakovna, Oleksandr Ferludin, Ian Gemp, Stephanie Winkler, Dan Goldberg, Ivan Korotkov, Kefan Xiao, Malika Mehrotra, Sandeep Mariserla, Vihari Piratla, Terry Thurk, Khiem Pham, Hongxu Ma, Alexandre Senges, Ravi Kumar, Clemens Meyer, Ellie Talius, Nuo Wang Pierse, Ballie Sandhu, Horia Toma, Kuo Lin, Swaroop Nath, Tom Stone, Dorsa Sadigh, Nikita Gupta, Arthur Guez, Avi Singh, Matt Thomas, Tom Duerig, Yuan Gong, Richard Tanburn, Lydia Lihui Zhang, Phuong Dao, Mohamed Hammad, Sirui Xie, Shruti Rijhwani, Ben Murdoch, Duhyeon Kim, Will 
Thompson, Heng-Tze Cheng, Daniel Sohn, Pablo Sprechmann, Qiantong Xu, Srinivas Tadepalli, Peter Young, Ye Zhang, Hansa Srinivasan, Miranda Aperghis, Aditya Ayyar, Hen Fitoussi, Ryan Burnell, David Madras, Mike Dusenberry, Xi Xiong, Tayo Oguntebi, Ben Albrecht, Jörg Bornschein, Jovana Mitrović, Mason Dimarco, Bhargav Kanagal Shamanna, Premal Shah, Eren Sezener, Shyam Upadhyay, Dave Lacey, Craig Schiff, Sebastien Baur, Sanjay Ganapathy, Eva Schnider, Mateo Wirth, Connor Schenck, Andrey Simanovsky, Yi-Xuan Tan, Philipp Fränken, Dennis Duan, Bharath Mankalale, Nikhil Dhawan, Kevin Sequeira, Zichuan Wei, Shivanker Goel, Caglar Unlu, Yukun Zhu, Haitian Sun, Ananth Balashankar, Kurt Shuster, Megh Umekar, Mahmoud Alnahlawi, Aäron van den Oord, Kelly Chen, Yuexiang Zhai, Zihang Dai, Kuang-Huei Lee, Eric Doi, Lukas Zilka, Rohith Vallu, Disha Shrivastava, Jason Lee, Hisham Husain, Honglei Zhuang, Vincent Cohen-Addad, Jarred Barber, James Atwood, Adam Sadovsky, Quentin Wellens, Steven Hand, Arunkumar Rajendran, Aybuke Turker, CJ Carey, Yuanzhong Xu, Hagen Soltau, Zefei Li, Xinying Song, Conglong Li, Iurii Kemaev, Sasha Brown, Andrea Burns, Viorica Patraucean, Piotr Stanczyk, Renga Aravamudhan, Mathieu Blondel, Hila Noga, Lorenzo Blanco, Will Song, Michael Isard, Mandar Sharma, Reid Hayes, Dalia El Badawy, Avery Lamp, Itay Laish, Olga Kozlova, Kelvin Chan, Sahil Singla, Srinivas Sunkara, Mayank Upadhyay, Chang Liu, Aijun Bai, Jarek Wilkiewicz, Martin Zlocha, Jeremiah Liu, Zhuowan Li, Haiguang Li, Omer Barak, Ganna Raboshchuk, Jiho Choi, Fangyu Liu, Erik Jue, Mohit Sharma, Andreea Marzoca, Robert Busa-Fekete, Anna Korsun, Andre Elisseeff, Zhe Shen, Sara Mc Carthy, Kay Lamerigts, Anahita Hosseini, Hanzhao Lin, Charlie Chen, Fan Yang, Kushal Chauhan, Mark Omernick, Dawei Jia, Karina Zainullina, Demis Hassabis, Danny Vainstein, Ehsan Amid, Xiang Zhou, Ronny Votel, Eszter Vértes, Xinjian Li, Zongwei Zhou, Angeliki Lazaridou, Brendan McMahan, Arjun Narayanan, Hubert Soyer, Sujoy Basu, Kayi Lee, Bryan Perozzi, Qin Cao, Leonard Berrada, Rahul Arya, Ke Chen, Katrina, Xu, Matthias Lochbrunner, Alex Hofer, Sahand Sharifzadeh, Renjie Wu, Sally Goldman, Pranjal Awasthi, Xuezhi Wang, Yan Wu, Claire Sha, Biao Zhang, Maciej Mikuła, Filippo Graziano, Siobhan Mcloughlin, Irene Giannoumis, Youhei Namiki, Chase Malik, Carey Radebaugh, Jamie Hall, Ramiro Leal-Cavazos, Jianmin Chen, Vikas Sindhwani, David Kao, David Greene, Jordan Griffith, Chris Welty, Ceslee Montgomery, Toshihiro Yoshino, Liangzhe Yuan, Noah Goodman, Assaf Hurwitz Michaely, Kevin Lee, KP Sawhney, Wei Chen, Zheng Zheng, Megan Shum, Nikolay Savinov, Etienne Pot, Alex Pak, Morteza Zadimoghaddam, Sijal Bhatnagar, Yoad Lewenberg, Blair Kutzman, Ji Liu, Lesley Katzen, Jeremy Selier, Josip Djolonga, Dmitry Lepikhin, Kelvin Xu, Jacky Liang, Jiewen Tan, Benoit Schillings, Muge Ersoy, Pete Blois, Bernd Bandemer, Abhimanyu Singh, Sergei Lebedev, Pankaj Joshi, Adam R. 
Brown, Evan Palmer, Shreya Pathak, Komal Jalan, Fedir Zubach, Shuba Lall, Randall Parker, Alok Gunjan, Sergey Rogulenko, Sumit Sanghai, Zhaoqi Leng, Zoltan Egyed, Shixin Li, Maria Ivanova, Kostas Andriopoulos, Jin Xie, Elan Rosenfeld, Auriel Wright, Ankur Sharma, Xinyang Geng, Yicheng Wang, Sam Kwei, Renke Pan, Yujing Zhang, Gabby Wang, Xi Liu, Chak Yeung, Elizabeth Cole, Aviv Rosenberg, Zhen Yang, Phil Chen, George Polovets, Pranav Nair, Rohun Saxena, Josh Smith, Shuo-yiin Chang, Aroma Mahendru, Svetlana Grant, Anand Iyer, Irene Cai, Jed McGiffin, Jiaming Shen, Alanna Walton, Antonious Girgis, Oliver Woodman, Rosemary Ke, Mike Kwong, Louis Rouillard, Jinmeng Rao, Zhihao Li, Yuntao Xu, Flavien Prost, Chi Zou, Ziwei Ji, Alberto Magni, Tyler Liechty, Dan A. Calian, Deepak Ramachandran, Igor Krivokon, Hui Huang, Terry Chen, Anja Hauth, Anastasija Ilić, Weijuan Xi, Hyeontaek Lim, Vlad-Doru Ion, Pooya Moradi, Metin Toksoz-Exley, Kalesha Bullard, Miltos Allamanis, Xiaomeng Yang, Sophie Wang, Zhi Hong, Anita Gergely, Cheng Li, Bhavishya Mittal, Vitaly Kovalev, Victor Ungureanu, Jane Labanowski, Jan Wassenberg, Nicolas Lacasse, Geoffrey Cideron, Petar Dević, Annie Marsden, Lynn Nguyen, Michael Fink, Yin Zhong, Tatsuya Kiyono, Desi Ivanov, Sally Ma, Max Bain, Kiran Yalasangi, Jennifer She, Anastasia Petrushkina, Mayank Lunayach, Carla Bromberg, Sarah Hodkinson, Vilobh Meshram, Daniel Vlasic, Austin Kyker, Steve Xu, Jeff Stanway, Zuguang Yang, Kai Zhao, Matthew Tung, Seth Odoom, Yasuhisa Fujii, Justin Gilmer, Eunyoung Kim, Felix Halim, Quoc Le, Bernd Bohnet, Seliem El-Sayed, Behnam Neyshabur, Malcolm Reynolds, Dean Reich, Yang Xu, Erica Moreira, Anuj Sharma, Zeyu Liu, Mohammad Javad Hosseini, Naina Raisinghani, Yi Su, Ni Lao, Daniel Formoso, Marco Gelmi, Almog Gueta, Tapomay Dey, Elena Gribovskaya, Domagoj Ćevid, Sidharth Mudgal, Garrett Bingham, Jianling Wang, Anurag Kumar, Alex Cullum, Feng Han, Konstantinos Bousmalis, Diego Cedillo, Grace Chu, Vladimir Magay, Paul Michel, Ester Hlavnova, Daniele Calandriello, Setareh Ariafar, Kaisheng Yao, Vikash Sehwag, Arpi Vezer, Agustin Dal Lago, Zhenkai Zhu, Paul Kishan Rubenstein, Allen Porter, Anirudh Baddepudi, Oriana Riva, Mihai Dorin Istin, Chih-Kuan Yeh, Zhi Li, Andrew Howard, Nilpa Jha, Jeremy Chen, Raoul de Liedekerke, Zafarali Ahmed, Mikel Rodriguez, Tanuj Bhatia, Bangju Wang, Ali Elqursh, David Klinghoffer, Peter Chen, Pushmeet Kohli, Te I, Weiyang Zhang, Zack Nado, Jilin Chen, Maxwell Chen, George Zhang, Aayush Singh, Adam Hillier, Federico Lebron, Yiqing Tao, Ting Liu, Gabriel Dulac-Arnold, Jingwei Zhang, Shashi Narayan, Buhuang Liu, Orhan Firat, Abhishek Bhowmick, Bingyuan Liu, Hao Zhang, Zizhao Zhang, Georges Rotival, Nathan Howard, Anu Sinha, Alexander Grushetsky, Benjamin Beyret, Keerthana Gopalakrishnan, James Zhao, Kyle He, Szabolcs Payrits, Zaid Nabulsi, Zhaoyi Zhang, Weijie Chen, Edward Lee, Nova Fallen, Sreenivas Gollapudi, Aurick Zhou, Filip Pavetić, Thomas Köppe, Shiyu Huang, Rama Pasumarthi, Nick Fernando, Felix Fischer, Daria Ćurko, Yang Gao, James Svensson, Austin Stone, Haroon Qureshi, Abhishek Sinha, Apoorv Kulshreshtha, Martin Matysiak, Jieming Mao, Carl Saroufim, Aleksandra Faust, Qingnan Duan, Gil Fidel, Kaan Katircioglu, Raphaël Lopez Kaufman, Dhruv Shah, Weize Kong, Abhishek Bapna, Gellért Weisz, Emma Dunleavy, Praneet Dutta, Tianqi Liu, Rahma Chaabouni, Carolina Parada, Marcus Wu, Alexandra Belias, Alessandro Bissacco, Stanislav Fort, Li Xiao, Fantine Huot, Chris Knutsen, Yochai Blau, Gang Li, Jennifer Prendki, Juliette 
Love, Yinlam Chow, Pichi Charoenpanit, Hidetoshi Shimokawa, Vincent Coriou, Karol Gregor, Tomas Izo, Arjun Akula, Mario Pinto, Chris Hahn, Dominik Paulus, Jiaxian Guo, Neha Sharma, Cho-Jui Hsieh, Adaeze Chukwuka, Kazuma Hashimoto, Nathalie Rauschmayr, Ling Wu, Christof Angermueller, Yulong Wang, Sebastian Gerlach, Michael Pliskin, Daniil Mirylenka, Min Ma, Lexi Baugher, Bryan Gale, Shaan Bijwadia, Nemanja Rakićević, David Wood, Jane Park, Chung-Ching Chang, Babi Seal, Chris Tar, Kacper Krasowiak, Yiwen Song, Georgi Stephanov, Gary Wang, Marcello Maggioni, Stein Xudong Lin, Felix Wu, Shachi Paul, Zixuan Jiang, Shubham Agrawal, Bilal Piot, Alex Feng, Cheolmin Kim, Tulsee Doshi, Jonathan Lai, Chuqiao, Xu, Sharad Vikram, Ciprian Chelba, Sebastian Krause, Vincent Zhuang, Jack Rae, Timo Denk, Adrian Collister, Lotte Weerts, Xianghong Luo, Yifeng Lu, Håvard Garnes, Nitish Gupta, Terry Spitz, Avinatan Hassidim, Lihao Liang, Izhak Shafran, Peter Humphreys, Kenny Vassigh, Phil Wallis, Virat Shejwalkar, Nicolas Perez-Nieves, Rachel Hornung, Melissa Tan, Beka Westberg, Andy Ly, Richard Zhang, Brian Farris, Jongbin Park, Alec Kosik, Zeynep Cankara, Andrii Maksai, Yunhan Xu, Albin Cassirer, Sergi Caelles, Abbas Abdolmaleki, Mencher Chiang, Alex Fabrikant, Shravya Shetty, Luheng He, Mai Giménez, Hadi Hashemi, Sheena Panthaplackel, Yana Kulizhskaya, Salil Deshmukh, Daniele Pighin, Robin Alazard, Disha Jindal, Seb Noury, Pradeep Kumar S, Siyang Qin, Xerxes Dotiwalla, Stephen Spencer, Mohammad Babaeizadeh, Blake JianHang Chen, Vaibhav Mehta, Jennie Lees, Andrew Leach, Penporn Koanantakool, Ilia Akolzin, Ramona Comanescu, Junwhan Ahn, Alexey Svyatkovskiy, Basil Mustafa, David D’Ambrosio, Shiva Mohan Reddy Garlapati, Pascal Lamblin, Alekh Agarwal, Shuang Song, Pier Giuseppe Sessa, Pauline Coquinot, John Maggs, Hussain Masoom, Divya Pitta, Yaqing Wang, Patrick Morris-Suzuki, Billy Porter, Johnson Jia, Jeffrey Dudek, Raghavender R, Cosmin Paduraru, Alan Ansell, Tolga Bolukbasi, Tony Lu, Ramya Ganeshan, Zi Wang, Henry Griffiths, Rodrigo Benenson, Yifan He, James Swirhun, George Papamakarios, Aditya Chawla, Kuntal Sengupta, Yan Wang, Vedrana Milutinovic, Igor Mordatch, Zhipeng Jia, Jamie Smith, Will Ng, Shitij Nigam, Matt Young, Eugen Vušak, Blake Hechtman, Sheela Goenka, Avital Zipori, Kareem Ayoub, Ashok Popat, Trilok Acharya, Luo Yu, Dawn Bloxwich, Hugo Song, Paul Roit, Haiqiong Li, Aviel Boag, Nigamaa Nayakanti, Bilva Chandra, Tianli Ding, Aahil Mehta, Cath Hope, Jiageng Zhang, Idan Heimlich Shtacher, Kartikeya Badola, Ryo Nakashima, Andrei Sozanschi, Iulia Comşa, Ante Žužul, Emily Caveness, Julian Odell, Matthew Watson, Dario de Cesare, Phillip Lippe, Derek Lockhart, Siddharth Verma, Huizhong Chen, Sean Sun, Lin Zhuo, Aditya Shah, Prakhar Gupta, Alex Muzio, Ning Niu, Amir Zait, Abhinav Singh, Meenu Gaba, Fan Ye, Prajit Ramachandran, Mohammad Saleh, Raluca Ada Popa, Ayush Dubey, Frederick Liu, Sara Javanmardi, Mark Epstein, Ross Hemsley, Richard Green, Nishant Ranka, Eden Cohen, Chuyuan Kelly Fu, Sanjay Ghemawat, Jed Borovik, James Martens, Anthony Chen, Pranav Shyam, André Susano Pinto, Ming-Hsuan Yang, Alexandru Ţifrea, David Du, Boqing Gong, Ayushi Agarwal, Seungyeon Kim, Christian Frank, Saloni Shah, Xiaodan Song, Zhiwei Deng, Ales Mikhalap, Kleopatra Chatziprimou, Timothy Chung, Toni Creswell, Susan Zhang, Yennie Jun, Carl Lebsack, Will Truong, Slavica Andačić, Itay Yona, Marco Fornoni, Rong Rong, Serge Toropov, Afzal Shama Soudagar, Andrew Audibert, Salah Zaiem, Zaheer Abbas, Andrei Rusu, Sahitya 
Potluri, Shitao Weng, Anastasios Kementsietsidis, Anton Tsitsulin, Daiyi Peng, Natalie Ha, Sanil Jain, Tejasi Latkar, Simeon Ivanov, Cory McLean, Anirudh GP, Rajesh Venkataraman, Canoee Liu, Dilip Krishnan, Joel D’sa, Roey Yogev, Paul Collins, Benjamin Lee, Lewis Ho, Carl Doersch, Gal Yona, Shawn Gao, Felipe Tiengo Ferreira, Adnan Ozturel, Hannah Muckenhirn, Ce Zheng, Gargi Balasubramaniam, Mudit Bansal, George van den Driessche, Sivan Eiger, Salem Haykal, Vedant Misra, Abhimanyu Goyal, Danilo Martins, Gary Leung, Jonas Valfridsson, Four Flynn, Will Bishop, Chenxi Pang, Yoni Halpern, Honglin Yu, Lawrence Moore, Yuvein, Zhu, Sridhar Thiagarajan, Yoel Drori, Zhisheng Xiao, Lucio Dery, Rolf Jagerman, Jing Lu, Eric Ge, Vaibhav Aggarwal, Arjun Khare, Vinh Tran, Oded Elyada, Ferran Alet, James Rubin, Ian Chou, David Tian, Libin Bai, Lawrence Chan, Lukasz Lew, Karolis Misiunas, Taylan Bilal, Aniket Ray, Sindhu Raghuram, Alex Castro-Ros, Viral Carpenter, CJ Zheng, Michael Kilgore, Josef Broder, Emily Xue, Praveen Kallakuri, Dheeru Dua, Nancy Yuen, Steve Chien, John Schultz, Saurabh Agrawal, Reut Tsarfaty, Jingcao Hu, Ajay Kannan, Dror Marcus, Nisarg Kothari, Baochen Sun, Ben Horn, Matko Bošnjak, Ferjad Naeem, Dean Hirsch, Lewis Chiang, Boya Fang, Jie Han, Qifei Wang, Ben Hora, Antoine He, Mario Lučić, Beer Changpinyo, Anshuman Tripathi, John Youssef, Chester Kwak, Philippe Schlattner, Cat Graves, Rémi Leblond, Wenjun Zeng, Anders Andreassen, Gabriel Rasskin, Yue Song, Eddie Cao, Junhyuk Oh, Matt Hoffman, Wojtek Skut, Yichi Zhang, Jon Stritar, Xingyu Cai, Saarthak Khanna, Kathie Wang, Shriya Sharma, Christian Reisswig, Younghoon Jun, Aman Prasad, Tatiana Sholokhova, Preeti Singh, Adi Gerzi Rosenthal, Anian Ruoss, Françoise Beaufays, Sean Kirmani, Dongkai Chen, Johan Schalkwyk, Jonathan Herzig, Been Kim, Josh Jacob, Damien Vincent, Adrian N Reyes, Ivana Balazevic, Léonard Hussenot, Jon Schneider, Parker Barnes, Luis Castro, Spandana Raj Babbula, Simon Green, Serkan Cabi, Nico Duduta, Danny Driess, Rich Galt, Noam Velan, Junjie Wang, Hongyang Jiao, Matthew Mauger, Du Phan, Miteyan Patel, Vlado Galić, Jerry Chang, Eyal Marcus, Matt Harvey, Julian Salazar, Elahe Dabir, Suraj Satishkumar Sheth, Amol Mandhane, Hanie Sedghi, Jeremiah Willcock, Amir Zandieh, Shruthi Prabhakara, Aida Amini, Antoine Miech, Victor Stone, Massimo Nicosia, Paul Niemczyk, Ying Xiao, Lucy Kim, Sławek Kwasiborski, Vikas Verma, Ada Maksutaj Oflazer, Christoph Hirnschall, Peter Sung, Lu Liu, Richard Everett, Michiel Bakker, Ágoston Weisz, Yufei Wang, Vivek Sampathkumar, Uri Shaham, Bibo Xu, Yasemin Altun, Mingqiu Wang, Takaaki Saeki, Guanjie Chen, Emanuel Taropa, Shanthal Vasanth, Sophia Austin, Lu Huang, Goran Petrovic, Qingyun Dou, Daniel Golovin, Grigory Rozhdestvenskiy, Allie Culp, Will Wu, Motoki Sano, Divya Jain, Julia Proskurnia, Sébastien Cevey, Alejandro Cruzado Ruiz, Piyush Patil, Mahdi Mirzazadeh, Eric Ni, Javier Snaider, Lijie Fan, Alexandre Fréchette, AJ Pierigiovanni, Shariq Iqbal, Kenton Lee, Claudio Fantacci, Jinwei Xing, Lisa Wang, Alex Irpan, David Raposo, Yi Luan, Zhuoyuan Chen, Harish Ganapathy, Kevin Hui, Jiazhong Nie, Isabelle Guyon, Heming Ge, Roopali Vij, Hui Zheng, Dayeong Lee, Alfonso Castaño, Khuslen Baatarsukh, Gabriel Ibagon, Alexandra Chronopoulou, Nicholas FitzGerald, Shashank Viswanadha, Safeen Huda, Rivka Moroshko, Georgi Stoyanov, Prateek Kolhar, Alain Vaucher, Ishaan Watts, Adhi Kuncoro, Henryk Michalewski, Satish Kambala, Bat-Orgil Batsaikhan, Alek Andreev, Irina Jurenka, Maigo Le, Qihang Chen, 
Wael Al Jishi, Sarah Chakera, Zhe Chen, Aditya Kini, Vikas Yadav, Aditya Siddhant, Ilia Labzovsky, Balaji Lakshminarayanan, Carrie Grimes Bostock, Pankil Botadra, Ankesh Anand, Colton Bishop, Sam Conway-Rahman, Mohit Agarwal, Yani Donchev, Achintya Singhal, Félix de Chaumont Quitry, Natalia Ponomareva, Nishant Agrawal, Bin Ni, Kalpesh Krishna, Masha Samsikova, John Karro, Yilun Du, Tamara von Glehn, Caden Lu, Christopher A. Choquette-Choo, Zhen Qin, Tingnan Zhang, Sicheng Li, Divya Tyam, Swaroop Mishra, Wing Lowe, Colin Ji, Weiyi Wang, Manaal Faruqui, Ambrose Slone, Valentin Dalibard, Arunachalam Narayanaswamy, John Lambert, Pierre-Antoine Manzagol, Dan Karliner, Andrew Bolt, Ivan Lobov, Aditya Kusupati, Chang Ye, Xuan Yang, Heiga Zen, Nelson George, Mukul Bhutani, Olivier Lacombe, Robert Riachi, Gagan Bansal, Rachel Soh, Yue Gao, Yang Yu, Adams Yu, Emily Nottage, Tania Rojas-Esponda, James Noraky, Manish Gupta, Ragha Kotikalapudi, Jichuan Chang, Sanja Deur, Dan Graur, Alex Mossin, Erin Farnese, Ricardo Figueira, Alexandre Moufarek, Austin Huang, Patrik Zochbauer, Ben Ingram, Tongzhou Chen, Zelin Wu, Adrià Puigdomènech, Leland Rechis, Da Yu, Sri Gayatri Sundara Padmanabhan, Rui Zhu, Chu-ling Ko, Andrea Banino, Samira Daruki, Aarush Selvan, Dhruva Bhaswar, Daniel Hernandez Diaz, Chen Su, Salvatore Scellato, Jennifer Brennan, Woohyun Han, Grace Chung, Priyanka Agrawal, Urvashi Khandelwal, Khe Chai Sim, Morgane Lustman, Sam Ritter, Kelvin Guu, Jiawei Xia, Prateek Jain, Emma Wang, Tyrone Hill, Mirko Rossini, Marija Kostelac, Tautvydas Misiunas, Amit Sabne, Kyuyeun Kim, Ahmet Iscen, Congchao Wang, José Leal, Ashwin Sreevatsa, Utku Evci, Manfred Warmuth, Saket Joshi, Daniel Suo, James Lottes, Garrett Honke, Brendan Jou, Stefani Karp, Jieru Hu, Himanshu Sahni, Adrien Ali Taïga, William Kong, Samrat Ghosh, Renshen Wang, Jay Pavagadhi, Natalie Axelsson, Nikolai Grigorev, Patrick Siegler, Rebecca Lin, Guohui Wang, Emilio Parisotto, Sharath Maddineni, Krishan Subudhi, Eyal Ben-David, Elena Pochernina, Orgad Keller, Thi Avrahami, Zhe Yuan, Pulkit Mehta, Jialu Liu, Sherry Yang, Wendy Kan, Katherine Lee, Tom Funkhouser, Derek Cheng, Hongzhi Shi, Archit Sharma, Joe Kelley, Matan Eyal, Yury Malkov, Corentin Tallec, Yuval Bahat, Shen Yan, Xintian, Wu, David Lindner, Chengda Wu, Avi Caciularu, Xiyang Luo, Rodolphe Jenatton, Tim Zaman, Yingying Bi, Ilya Kornakov, Ganesh Mallya, Daisuke Ikeda, Itay Karo, Anima Singh, Colin Evans, Praneeth Netrapalli, Vincent Nallatamby, Isaac Tian, Yannis Assael, Vikas Raunak, Victor Carbune, Ioana Bica, Lior Madmoni, Dee Cattle, Snchit Grover, Krishna Somandepalli, Sid Lall, Amelio Vázquez-Reina, Riccardo Patana, Jiaqi Mu, Pranav Talluri, Maggie Tran, Rajeev Aggarwal, RJ Skerry-Ryan, Jun Xu, Mike Burrows, Xiaoyue Pan, Edouard Yvinec, Di Lu, Zhiying Zhang, Duc Dung Nguyen, Hairong Mu, Gabriel Barcik, Helen Ran, Lauren Beltrone, Krzysztof Choromanski, Dia Kharrat, Samuel Albanie, Sean Purser-haskell, David Bieber, Carrie Zhang, Jing Wang, Tom Hudson, Zhiyuan Zhang, Han Fu, Johannes Mauerer, Mohammad Hossein Bateni, AJ Maschinot, Bing Wang, Muye Zhu, Arjun Pillai, Tobias Weyand, Shuang Liu, Oscar Akerlund, Fred Bertsch, Vittal Premachandran, Alicia Jin, Vincent Roulet, Peter de Boursac, Shubham Mittal, Ndaba Ndebele, Georgi Karadzhov, Sahra Ghalebikesabi, Ricky Liang, Allen Wu, Yale Cong, Nimesh Ghelani, Sumeet Singh, Bahar Fatemi, Warren, Chen, Charles Kwong, Alexey Kolganov, Steve Li, Richard Song, Chenkai Kuang, Sobhan Miryoosefi, Dale Webster, James Wendt, Arkadiusz Socala, 
Guolong Su, Artur Mendonça, Abhinav Gupta, Xiaowei Li, Tomy Tsai, Qiong, Hu, Kai Kang, Angie Chen, Sertan Girgin, Yongqin Xian, Andrew Lee, Nolan Ramsden, Leslie Baker, Madeleine Clare Elish, Varvara Krayvanova, Rishabh Joshi, Jiri Simsa, Yao-Yuan Yang, Piotr Ambroszczyk, Dipankar Ghosh, Arjun Kar, Yuan Shangguan, Yumeya Yamamori, Yaroslav Akulov, Andy Brock, Haotian Tang, Siddharth Vashishtha, Rich Munoz, Andreas Steiner, Kalyan Andra, Daniel Eppens, Qixuan Feng, Hayato Kobayashi, Sasha Goldshtein, Mona El Mahdy, Xin Wang, Jilei, Wang, Richard Killam, Tom Kwiatkowski, Kavya Kopparapu, Serena Zhan, Chao Jia, Alexei Bendebury, Sheryl Luo, Adrià Recasens, Timothy Knight, Jing Chen, Mohak Patel, YaGuang Li, Ben Withbroe, Dean Weesner, Kush Bhatia, Jie Ren, Danielle Eisenbud, Ebrahim Songhori, Yanhua Sun, Travis Choma, Tasos Kementsietsidis, Lucas Manning, Brian Roark, Wael Farhan, Jie Feng, Susheel Tatineni, James Cobon-Kerr, Yunjie Li, Lisa Anne Hendricks, Isaac Noble, Chris Breaux, Nate Kushman, Liqian Peng, Fuzhao Xue, Taylor Tobin, Jamie Rogers, Josh Lipschultz, Chris Alberti, Alexey Vlaskin, Mostafa Dehghani, Roshan Sharma, Tris Warkentin, Chen-Yu Lee, Benigno Uria, Da-Cheng Juan, Angad Chandorkar, Hila Sheftel, Ruibo Liu, Elnaz Davoodi, Borja De Balle Pigem, Kedar Dhamdhere, David Ross, Jonathan Hoech, Mahdis Mahdieh, Li Liu, Qiujia Li, Liam McCafferty, Chenxi Liu, Markus Mircea, Yunting Song, Omkar Savant, Alaa Saade, Colin Cherry, Vincent Hellendoorn, Siddharth Goyal, Paul Pucciarelli, David Vilar Torres, Zohar Yahav, Hyo Lee, Lars Lowe Sjoesund, Christo Kirov, Bo Chang, Deepanway Ghoshal, Lu Li, Gilles Baechler, Sébastien Pereira, Tara Sainath, Anudhyan Boral, Dominik Grewe, Afief Halumi, Nguyet Minh Phu, Tianxiao Shen, Marco Tulio Ribeiro, Dhriti Varma, Alex Kaskasoli, Vlad Feinberg, Navneet Potti, Jarrod Kahn, Matheus Wisniewski, Shakir Mohamed, Arnar Mar Hrafnkelsson, Bobak Shahriari, Jean-Baptiste Lespiau, Lisa Patel, Legg Yeung, Tom Paine, Lantao Mei, Alex Ramirez, Rakesh Shivanna, Li Zhong, Josh Woodward, Guilherme Tubone, Samira Khan, Heng Chen, Elizabeth Nielsen, Catalin Ionescu, Utsav Prabhu, Mingcen Gao, Qingze Wang, Sean Augenstein, Neesha Subramaniam, Jason Chang, Fotis Iliopoulos, Jiaming Luo, Myriam Khan, Weicheng Kuo, Denis Teplyashin, Florence Perot, Logan Kilpatrick, Amir Globerson, Hongkun Yu, Anfal Siddiqui, Nick Sukhanov, Arun Kandoor, Umang Gupta, Marco Andreetto, Moran Ambar, Donnie Kim, Paweł Wesołowski, Sarah Perrin, Ben Limonchik, Wei Fan, Jim Stephan, Ian Stewart-Binks, Ryan Kappedal, Tong He, Sarah Cogan, Romina Datta, Tong Zhou, Jiayu Ye, Leandro Kieliger, Ana Ramalho, Kyle Kastner, Fabian Mentzer, Wei-Jen Ko, Arun Suggala, Tianhao Zhou, Shiraz Butt, Hana Strejček, Lior Belenki, Subhashini Venugopalan, Mingyang Ling, Evgenii Eltyshev, Yunxiao Deng, Geza Kovacs, Mukund Raghavachari, Hanjun Dai, Tal Schuster, Steven Schwarcz, Richard Nguyen, Arthur Nguyen, Gavin Buttimore, Shrestha Basu Mallick, Sudeep Gandhe, Seth Benjamin, Michal Jastrzebski, Le Yan, Sugato Basu, Chris Apps, Isabel Edkins, James Allingham, Immanuel Odisho, Tomas Kocisky, Jewel Zhao, Linting Xue, Apoorv Reddy, Chrysovalantis Anastasiou, Aviel Atias, Sam Redmond, Kieran Milan, Nicolas Heess, Herman Schmit, Allan Dafoe, Daniel Andor, Tynan Gangwani, Anca Dragan, Sheng Zhang, Ashyana Kachra, Gang Wu, Siyang Xue, Kevin Aydin, Siqi Liu, Yuxiang Zhou, Mahan Malihi, Austin Wu, Siddharth Gopal, Candice Schumann, Peter Stys, Alek Wang, Mirek Olšák, Dangyi Liu, Christian Schallhart, Yiran Mao, 
Demetra Brady, Hao Xu, Tomas Mery, Chawin Sitawarin, Siva Velusamy, Tom Cobley, Alex Zhai, Christian Walder, Nitzan Katz, Ganesh Jawahar, Chinmay Kulkarni, Antoine Yang, Adam Paszke, Yinan Wang, Bogdan Damoc, Zalán Borsos, Ray Smith, Jinning Li, Mansi Gupta, Andrei Kapishnikov, Sushant Prakash, Florian Luisier, Rishabh Agarwal, Will Grathwohl, Kuangyuan Chen, Kehang Han, Nikhil Mehta, Andrew Over, Shekoofeh Azizi, Lei Meng, Niccolò Dal Santo, Kelvin Zheng, Jane Shapiro, Igor Petrovski, Jeffrey Hui, Amin Ghafouri, Jasper Snoek, James Qin, Mandy Jordan, Caitlin Sikora, Jonathan Malmaud, Yuheng Kuang, Aga Świetlik, Ruoxin Sang, Chongyang Shi, Leon Li, Andrew Rosenberg, Shubin Zhao, Andy Crawford, Jan-Thorsten Peter, Yun Lei, Xavier Garcia, Long Le, Todd Wang, Julien Amelot, Dave Orr, Praneeth Kacham, Dana Alon, Gladys Tyen, Abhinav Arora, James Lyon, Alex Kurakin, Mimi Ly, Theo Guidroz, Zhipeng Yan, Rina Panigrahy, Pingmei Xu, Thais Kagohara, Yong Cheng, Eric Noland, Jinhyuk Lee, Jonathan Lee, Cathy Yip, Maria Wang, Efrat Nehoran, Alexander Bykovsky, Zhihao Shan, Ankit Bhagatwala, Chaochao Yan, Jie Tan, Guillermo Garrido, Dan Ethier, Nate Hurley, Grace Vesom, Xu Chen, Siyuan Qiao, Abhishek Nayyar, Julian Walker, Paramjit Sandhu, Mihaela Rosca, Danny Swisher, Mikhail Dektiarev, Josh Dillon, George-Cristian Muraru, Manuel Tragut, Artiom Myaskovsky, David Reid, Marko Velic, Owen Xiao, Jasmine George, Mark Brand, Jing Li, Wenhao Yu, Shane Gu, Xiang Deng, François-Xavier Aubet, Soheil Hassas Yeganeh, Fred Alcober, Celine Smith, Trevor Cohn, Kay McKinney, Michael Tschannen, Ramesh Sampath, Gowoon Cheon, Liangchen Luo, Luyang Liu, Jordi Orbay, Hui Peng, Gabriela Botea, Xiaofan Zhang, Charles Yoon, Cesar Magalhaes, Paweł Stradomski, Ian Mackinnon, Steven Hemingray, Kumaran Venkatesan, Rhys May, Jaeyoun Kim, Alex Druinsky, Jingchen Ye, Zheng Xu, Terry Huang, Jad Al Abdallah, Adil Dostmohamed, Rachana Fellinger, Tsendsuren Munkhdalai, Akanksha Maurya, Peter Garst, Yin Zhang, Maxim Krikun, Simon Bucher, Aditya Srikanth Veerubhotla, Yaxin Liu, Sheng Li, Nishesh Gupta, Jakub Adamek, Hanwen Chen, Bernett Orlando, Aleksandr Zaks, Joost van Amersfoort, Josh Camp, Hui Wan, HyunJeong Choe, Zhichun Wu, Kate Olszewska, Weiren Yu, Archita Vadali, Martin Scholz, Daniel De Freitas, Jason Lin, Amy Hua, Xin Liu, Frank Ding, Yichao Zhou, Boone Severson, Katerina Tsihlas, Samuel Yang, Tammo Spalink, Varun Yerram, Helena Pankov, Rory Blevins, Ben Vargas, Sarthak Jauhari, Matt Miecnikowski, Ming Zhang, Sandeep Kumar, Clement Farabet, Charline Le Lan, Sebastian Flennerhag, Yonatan Bitton, Ada Ma, Arthur Bražinskas, Eli Collins, Niharika Ahuja, Sneha Kudugunta, Anna Bortsova, Minh Giang, Wanzheng Zhu, Ed Chi, Scott Lundberg, Alexey Stern, Subha Puttagunta, Jing Xiong, Xiao Wu, Yash Pande, Amit Jhindal, Daniel Murphy, Jon Clark, Marc Brockschmidt, Maxine Deines, Kevin R. 
McKee, Dan Bahir, Jiajun Shen, Minh Truong, Daniel McDuff, Andrea Gesmundo, Edouard Rosseel, Bowen Liang, Ken Caluwaerts, Jessica Hamrick, Joseph Kready, Mary Cassin, Rishikesh Ingale, Li Lao, Scott Pollom, Yifan Ding, Wei He, Lizzetth Bellot, Joana Iljazi, Ramya Sree Boppana, Shan Han, Tara Thompson, Amr Khalifa, Anna Bulanova, Blagoj Mitrevski, Bo Pang, Emma Cooney, Tian Shi, Rey Coaguila, Tamar Yakar, Marc’aurelio Ranzato, Nikola Momchev, Chris Rawles, Zachary Charles, Young Maeng, Yuan Zhang, Rishabh Bansal, Xiaokai Zhao, Brian Albert, Yuan Yuan, Sudheendra Vijayanarasimhan, Roy Hirsch, Vinay Ramasesh, Kiran Vodrahalli, Xingyu Wang, Arushi Gupta, DJ Strouse, Jianmo Ni, Roma Patel, Gabe Taubman, Zhouyuan Huo, Dero Gharibian, Marianne Monteiro, Hoi Lam, Shobha Vasudevan, Aditi Chaudhary, Isabela Albuquerque, Kilol Gupta, Sebastian Riedel, Chaitra Hegde, Avraham Ruderman, András György, Marcus Wainwright, Ashwin Chaugule, Burcu Karagol Ayan, Tomer Levinboim, Sam Shleifer, Yogesh Kalley, Vahab Mirrokni, Abhishek Rao, Prabakar Radhakrishnan, Jay Hartford, Jialin Wu, Zhenhai Zhu, Francesco Bertolini, Hao Xiong, Nicolas Serrano, Hamish Tomlinson, Myle Ott, Yifan Chang, Mark Graham, Jian Li, Marco Liang, Xiangzhu Long, Sebastian Borgeaud, Yanif Ahmad, Alex Grills, Diana Mincu, Martin Izzard, Yuan Liu, Jinyu Xie, Louis O’Bryan, Sameera Ponda, Simon Tong, Michelle Liu, Dan Malkin, Khalid Salama, Yuankai Chen, Rohan Anil, Anand Rao, Rigel Swavely, Misha Bilenko, Nina Anderson, Tat Tan, Jing Xie, Xing Wu, Lijun Yu, Oriol Vinyals, Andrey Ryabtsev, Rumen Dangovski, Kate Baumli, Daniel Keysers, Christian Wright, Zoe Ashwood, Betty Chan, Artem Shtefan, Yaohui Guo, Ankur Bapna, Radu Soricut, Steven Pecht, Sabela Ramos, Rui Wang, Jiahao Cai, Trieu Trinh, Paul Barham, Linda Friso, Eli Stickgold, Xiangzhuo Ding, Siamak Shakeri, Diego Ardila, Eleftheria Briakou, Phil Culliton, Adam Raveret, Jingyu Cui, David Saxton, Subhrajit Roy, Javad Azizi, Pengcheng Yin, Lucia Loher, Andrew Bunner, Min Choi, Faruk Ahmed, Eric Li, Yin Li, Shengyang Dai, Michael Elabd, Sriram Ganapathy, Shivani Agrawal, Yiqing Hua, Paige Kunkle, Sujeevan Rajayogam, Arun Ahuja, Arthur Conmy, Alex Vasiloff, Parker Beak, Christopher Yew, Jayaram Mudigonda, Bartek Wydrowski, Jon Blanton, Zhengdong Wang, Yann Dauphin, Zhuo Xu, Martin Polacek, Xi Chen, Hexiang Hu, Pauline Sho, Markus Kunesch, Mehdi Hafezi Manshadi, Eliza Rutherford, Bo Li, Sissie Hsiao, Iain Barr, Alex Tudor, Matija Kecman, Arsha Nagrani, Vladimir Pchelin, Martin Sundermeyer, Aishwarya P S, Abhijit Karmarkar, Yi Gao, Grishma Chole, Olivier Bachem, Isabel Gao, Arturo BC, Matt Dibb, Mauro Verzetti, Felix Hernandez-Campos, Yana Lunts, Matthew Johnson, Julia Di Trapani, Raphael Koster, Idan Brusilovsky, Binbin Xiong, Megha Mohabey, Han Ke, Joe Zou, Tea Sabolić, Víctor Campos, John Palowitch, Alex Morris, Linhai Qiu, Pranavaraj Ponnuramu, Fangtao Li, Vivek Sharma, Kiranbir Sodhia, Kaan Tekelioglu, Aleksandr Chuklin, Madhavi Yenugula, Erika Gemzer, Theofilos Strinopoulos, Sam El-Husseini, Huiyu Wang, Yan Zhong, Edouard Leurent, Paul Natsev, Weijun Wang, Dre Mahaarachchi, Tao Zhu, Songyou Peng, Sami Alabed, Cheng-Chun Lee, Anthony Brohan, Arthur Szlam, GS Oh, Anton Kovsharov, Jenny Lee, Renee Wong, Megan Barnes, Gregory Thornton, Felix Gimeno, Omer Levy, Martin Sevenich, Melvin Johnson, Jonathan Mallinson, Robert Dadashi, Ziyue Wang, Qingchun Ren, Preethi Lahoti, Arka Dhar, Josh Feldman, Dan Zheng, Thatcher Ulrich, Liviu Panait, Michiel Blokzijl, Cip Baetu, Josip Matak, Jitendra 
Harlalka, Maulik Shah, Tal Marian, Daniel von Dincklage, Cosmo Du, Ruy Ley-Wild, Bethanie Brownfield, Max Schumacher, Yury Stuken, Shadi Noghabi, Sonal Gupta, Xiaoqi Ren, Eric Malmi, Felix Weissenberger, Blanca Huergo, Maria Bauza, Thomas Lampe, Arthur Douillard, Mojtaba Seyedhosseini, Roy Frostig, Zoubin Ghahramani, Kelvin Nguyen, Kashyap Krishnakumar, Chengxi Ye, Rahul Gupta, Alireza Nazari, Robert Geirhos, Pete Shaw, Ahmed Eleryan, Dima Damen, Jennimaria Palomaki, Ted Xiao, Qiyin Wu, Quan Yuan, Phoenix Meadowlark, Matthew Bilotti, Raymond Lin, Mukund Sridhar, Yannick Schroecker, Da-Woon Chung, Jincheng Luo, Trevor Strohman, Tianlin Liu, Anne Zheng, Jesse Emond, Wei Wang, Andrew Lampinen, Toshiyuki Fukuzawa, Folawiyo Campbell-Ajala, Monica Roy, James Lee-Thorp, Lily Wang, Iftekhar Naim, Tony, Nguy~ên, Guy Bensky, Aditya Gupta, Dominika Rogozińska, Justin Fu, Thanumalayan Sankaranarayana Pillai, Petar Veličković, Shahar Drath, Philipp Neubeck, Vaibhav Tulsyan, Arseniy Klimovskiy, Don Metzler, Sage Stevens, Angel Yeh, Junwei Yuan, Tianhe Yu, Kelvin Zhang, Alec Go, Vincent Tsang, Ying Xu, Andy Wan, Isaac Galatzer-Levy, Sam Sobell, Abodunrinwa Toki, Elizabeth Salesky, Wenlei Zhou, Diego Antognini, Sholto Douglas, Shimu Wu, Adam Lelkes, Frank Kim, Paul Cavallaro, Ana Salazar, Yuchi Liu, James Besley, Tiziana Refice, Yiling Jia, Zhang Li, Michal Sokolik, Arvind Kannan, Jon Simon, Jo Chick, Avia Aharon, Meet Gandhi, Mayank Daswani, Keyvan Amiri, Vighnesh Birodkar, Abe Ittycheriah, Peter Grabowski, Oscar Chang, Charles Sutton, Zhixin, Lai, Umesh Telang, Susie Sargsyan, Tao Jiang, Raphael Hoffmann, Nicole Brichtova, Matteo Hessel, Jonathan Halcrow, Sammy Jerome, Geoff Brown, Alex Tomala, Elena Buchatskaya, Dian Yu, Sachit Menon, Pol Moreno, Yuguo Liao, Vicky Zayats, Luming Tang, SQ Mah, Ashish Shenoy, Alex Siegman, Majid Hadian, Okwan Kwon, Tao Tu, Nima Khajehnouri, Ryan Foley, Parisa Haghani, Zhongru Wu, Vaishakh Keshava, Khyatti Gupta, Tony Bruguier, Rui Yao, Danny Karmon, Luisa Zintgraf, Zhicheng Wang, Enrique Piqueras, Junehyuk Jung, Jenny Brennan, Diego Machado, Marissa Giustina, MH Tessler, Kamyu Lee, Qiao Zhang, Joss Moore, Kaspar Daugaard, Alexander Frömmgen, Jennifer Beattie, Fred Zhang, Daniel Kasenberg, Ty Geri, Danfeng Qin, Gaurav Singh Tomar, Tom Ouyang, Tianli Yu, Luowei Zhou, Rajiv Mathews, Andy Davis, Yaoyiran Li, Jai Gupta, Damion Yates, Linda Deng, Elizabeth Kemp, Ga-Young Joung, Sergei Vassilvitskii, Mandy Guo, Pallavi LV, Dave Dopson, Sami Lachgar, Lara McConnaughey, Himadri Choudhury, Dragos Dena, Aaron Cohen, Joshua Ainslie, Sergey Levi, Parthasarathy Gopavarapu, Polina Zablotskaia, Hugo Vallet, Sanaz Bahargam, Xiaodan Tang, Nenad Tomasev, Ethan Dyer, Daniel Balle, Hongrae Lee, William Bono, Jorge Gonzalez Mendez, Vadim Zubov, Shentao Yang, Ivor Rendulic, Yanyan Zheng, Andrew Hogue, Golan Pundak, Ralph Leith, Avishkar Bhoopchand, Michael Han, Mislav Žanić, Tom Schaul, Manolis Delakis, Tejas Iyer, Guanyu Wang, Harman Singh, Abdelrahman Abdelhamed, Tara Thomas, Siddhartha Brahma, Hilal Dib, Naveen Kumar, Wenxuan Zhou, Liang Bai, Pushkar Mishra, Jiao Sun, Valentin Anklin, Roykrong Sukkerd, Lauren Agubuzu, Anton Briukhov, Anmol Gulati, Maximilian Sieb, Fabio Pardo, Sara Nasso, Junquan Chen, Kexin Zhu, Tiberiu Sosea, Alex Goldin, Keith Rush, Spurthi Amba Hombaiah, Andreas Noever, Allan Zhou, Sam Haves, Mary Phuong, Jake Ades, Yi-ting Chen, Lin Yang, Joseph Pagadora, Stan Bileschi, Victor Cotruta, Rachel Saputro, Arijit Pramanik, Sean Ammirati, Dan Garrette, Kevin Villela, Tim 
Blyth, Canfer Akbulut, Neha Jha, Alban Rrustemi, Arissa Wongpanich, Chirag Nagpal, Yonghui Wu, Morgane Rivière, Sergey Kishchenko, Pranesh Srinivasan, Alice Chen, Animesh Sinha, Trang Pham, Bill Jia, Tom Hennigan, Anton Bakalov, Nithya Attaluri, Drew Garmon, Daniel Rodriguez, Dawid Wegner, Wenhao Jia, Evan Senter, Noah Fiedel, Denis Petek, Yuchuan Liu, Cassidy Hardin, Harshal Tushar Lehri, Joao Carreira, Sara Smoot, Marcel Prasetya, Nami Akazawa, Anca Stefanoiu, Chia-Hua Ho, Anelia Angelova, Kate Lin, Min Kim, Charles Chen, Marcin Sieniek, Alice Li, Tongfei Guo, Sorin Baltateanu, Pouya Tafti, Michael Wunder, Nadav Olmert, Divyansh Shukla, Jingwei Shen, Neel Kovelamudi, Balaji Venkatraman, Seth Neel, Romal Thoppilan, Jerome Connor, Frederik Benzing, Axel Stjerngren, Golnaz Ghiasi, Alex Polozov, Joshua Howland, Theophane Weber, Justin Chiu, Ganesh Poomal Girirajan, Andreas Terzis, Pidong Wang, Fangda Li, Yoav Ben Shalom, Dinesh Tewari, Matthew Denton, Roee Aharoni, Norbert Kalb, Heri Zhao, Junlin Zhang, Angelos Filos, Matthew Rahtz, Lalit Jain, Connie Fan, Vitor Rodrigues, Ruth Wang, Richard Shin, Jacob Austin, Roman Ring, Mariella Sanchez-Vargas, Mehadi Hassen, Ido Kessler, Uri Alon, Gufeng Zhang, Wenhu Chen, Yenai Ma, Xiance Si, Le Hou, Azalia Mirhoseini, Marc Wilson, Geoff Bacon, Becca Roelofs, Lei Shu, Gautam Vasudevan, Jonas Adler, Artur Dwornik, Tayfun Terzi, Matt Lawlor, Harry Askham, Mike Bernico, Xuanyi Dong, Chris Hidey, Kevin Kilgour, Gaël Liu, Surya Bhupatiraju, Luke Leonhard, Siqi Zuo, Partha Talukdar, Qing Wei, Aliaksei Severyn, Vít Listík, Jong Lee, Aditya Tripathi, SK Park, Yossi Matias, Hao Liu, Alex Ruiz, Rajesh Jayaram, Jackson Tolins, Pierre Marcenac, Yiming Wang, Bryan Seybold, Henry Prior, Deepak Sharma, Jack Weber, Mikhail Sirotenko, Yunhsuan Sung, Dayou Du, Ellie Pavlick, Stefan Zinke, Markus Freitag, Max Dylla, Montse Gonzalez Arenas, Natan Potikha, Omer Goldman, Connie Tao, Rachita Chhaparia, Maria Voitovich, Pawan Dogra, Andrija Ražnatović, Zak Tsai, Chong You, Oleaser Johnson, George Tucker, Chenjie Gu, Jae Yoo, Maryam Majzoubi, Valentin Gabeur, Bahram Raad, Rocky Rhodes, Kashyap Kolipaka, Heidi Howard, Geta Sampemane, Benny Li, Chulayuth Asawaroengchai, Duy Nguyen, Chiyuan Zhang, Timothee Cour, Xinxin Yu, Zhao Fu, Joe Jiang, Po-Sen Huang, Gabriela Surita, Iñaki Iturrate, Yael Karov, Michael Collins, Martin Baeuml, Fabian Fuchs, Shilpa Shetty, Swaroop Ramaswamy, Sayna Ebrahimi, Qiuchen Guo, Jeremy Shar, Gabe Barth-Maron, Sravanti Addepalli, Bryan Richter, Chin-Yi Cheng, Eugénie Rives, Fei Zheng, Johannes Griesser, Nishanth Dikkala, Yoel Zeldes, Ilkin Safarli, Dipanjan Das, Himanshu Srivastava, Sadh MNM Khan, Xin Li, Aditya Pandey, Larisa Markeeva, Dan Belov, Qiqi Yan, Mikołaj Rybiński, Tao Chen, Megha Nawhal, Michael Quinn, Vineetha Govindaraj, Sarah York, Reed Roberts, Roopal Garg, Namrata Godbole, Jake Abernethy, Anil Das, Lam Nguyen Thiet, Jonathan Tompson, John Nham, Neera Vats, Ben Caine, Wesley Helmholz, Francesco Pongetti, Yeongil Ko, James An, Clara Huiyi Hu, Yu-Cheng Ling, Julia Pawar, Robert Leland, Keisuke Kinoshita, Waleed Khawaja, Marco Selvi, Eugene Ie, Danila Sinopalnikov, Lev Proleev, Nilesh Tripuraneni, Michele Bevilacqua, Seungji Lee, Clayton Sanford, Dan Suh, Dustin Tran, Jeff Dean, Simon Baumgartner, Jens Heitkaemper, Sagar Gubbi, Kristina Toutanova, Yichong Xu, Chandu Thekkath, Keran Rong, Palak Jain, Annie Xie, Yan Virin, Yang Li, Lubo Litchev, Richard Powell, Tarun Bharti, Adam Kraft, Nan Hua, Marissa Ikonomidis, Ayal Hitron, Sanjiv Kumar, 
Loic Matthey, Sophie Bridgers, Lauren Lax, Ishaan Malhi, Ondrej Skopek, Ashish Gupta, Jiawei Cao, Mitchelle Rasquinha, Siim Põder, Wojciech Stokowiec, Nicholas Roth, Guowang Li, Michaël Sander, Joshua Kessinger, Vihan Jain, Edward Loper, Wonpyo Park, Michal Yarom, Liqun Cheng, Guru Guruganesh, Kanishka Rao, Yan Li, Catarina Barros, Mikhail Sushkov, Chun-Sung Ferng, Rohin Shah, Ophir Aharoni, Ravin Kumar, Tim McConnell, Peiran Li, Chen Wang, Fernando Pereira, Craig Swanson, Fayaz Jamil, Yan Xiong, Anitha Vijayakumar, Prakash Shroff, Kedar Soparkar, Jindong Gu, Livio Baldini Soares, Eric Wang, Kushal Majmundar, Aurora Wei, Kai Bailey, Nora Kassner, Chizu Kawamoto, Goran Žužić, Victor Gomes, Abhirut Gupta, Michael Guzman, Ishita Dasgupta, Xinyi Bai, Zhufeng Pan, Francesco Piccinno, Hadas Natalie Vogel, Octavio Ponce, Adrian Hutter, Paul Chang, Pan-Pan Jiang, Ionel Gog, Vlad Ionescu, James Manyika, Fabian Pedregosa, Harry Ragan, Zach Behrman, Ryan Mullins, Coline Devin, Aroonalok Pyne, Swapnil Gawde, Martin Chadwick, Yiming Gu, Sasan Tavakkol, Andy Twigg, Naman Goyal, Ndidi Elue, Anna Goldie, Srinivasan Venkatachary, Hongliang Fei, Ziqiang Feng, Marvin Ritter, Isabel Leal, Sudeep Dasari, Pei Sun, Alif Raditya Rochman, Brendan O’Donoghue, Yuchen Liu, Jim Sproch, Kai Chen, Natalie Clay, Slav Petrov, Sailesh Sidhwani, Ioana Mihailescu, Alex Panagopoulos, AJ Piergiovanni, Yunfei Bai, George Powell, Deep Karkhanis, Trevor Yacovone, Petr Mitrichev, Joe Kovac, Dave Uthus, Amir Yazdanbakhsh, David Amos, Steven Zheng, Bing Zhang, Jin Miao, Bhuvana Ramabhadran, Soroush Radpour, Shantanu Thakoor, Josh Newlan, Oran Lang, Orion Jankowski, Shikhar Bharadwaj, Jean-Michel Sarr, Shereen Ashraf, Sneha Mondal, Jun Yan, Ankit Singh Rawat, Sarmishta Velury, Greg Kochanski, Tom Eccles, Franz Och, Abhanshu Sharma, Ethan Mahintorabi, Alex Gurney, Carrie Muir, Vered Cohen, Saksham Thakur, Adam Bloniarz, Asier Mujika, Alexander Pritzel, Paul Caron, Altaf Rahman, Fiona Lang, Yasumasa Onoe, Petar Sirkovic, Jay Hoover, Ying Jian, Pablo Duque, Arun Narayanan, David Soergel, Alex Haig, Loren Maggiore, Shyamal Buch, Josef Dean, Ilya Figotin, Igor Karpov, Shaleen Gupta, Denny Zhou, Muhuan Huang, Ashwin Vaswani, Christopher Semturs, Kaushik Shivakumar, Yu Watanabe, Vinodh Kumar Rajendran, Eva Lu, Yanhan Hou, Wenting Ye, Shikhar Vashishth, Nana Nti, Vytenis Sakenas, Darren Ni, Doug DeCarlo, Michael Bendersky, Sumit Bagri, Nacho Cano, Elijah Peake, Simon Tokumine, Varun Godbole, Carlos Guía, Tanya Lando, Vittorio Selo, Seher Ellis, Danny Tarlow, Daniel Gillick, Alessandro Epasto, Siddhartha Reddy Jonnalagadda, Meng Wei, Meiyan Xie, Ankur Taly, Michela Paganini, Mukund Sundararajan, Daniel Toyama, Ting Yu, Dessie Petrova, Aneesh Pappu, Rohan Agrawal, Senaka Buthpitiya, Justin Frye, Thomas Buschmann, Remi Crocker, Marco Tagliasacchi, Mengchao Wang, Da Huang, Sagi Perel, Brian Wieder, Hideto Kazawa, Weiyue Wang, Jeremy Cole, Himanshu Gupta, Ben Golan, Seojin Bang, Nitish Kulkarni, Ken Franko, Casper Liu, Doug Reid, Sid Dalmia, Jay Whang, Kevin Cen, Prasha Sundaram, Johan Ferret, Berivan Isik, Lucian Ionita, Guan Sun, Anna Shekhawat, Muqthar Mohammad, Philip Pham, Ronny Huang, Karthik Raman, Xingyi Zhou, Ross Mcilroy, Austin Myers, Sheng Peng, Jacob Scott, Paul Covington, Sofia Erell, Pratik Joshi, João Gabriel Oliveira, Natasha Noy, Tajwar Nasir, Jake Walker, Vera Axelrod, Tim Dozat, Pu Han, Chun-Te Chu, Eugene Weinstein, Anand Shukla, Shreyas Chandrakaladharan, Petra Poklukar, Bonnie Li, Ye Jin, Prem Eruvbetine, Steven Hansen, 
Avigail Dabush, Alon Jacovi, Samrat Phatale, Chen Zhu, Steven Baker, Mo Shomrat, Yang Xiao, Jean Pouget-Abadie, Mingyang Zhang, Fanny Wei, Yang Song, Helen King, Yiling Huang, Yun Zhu, Ruoxi Sun, Juliana Vicente Franco, Chu-Cheng Lin, Sho Arora, Hui, Li, Vivian Xia, Luke Vilnis, Mariano Schain, Kaiz Alarakyia, Laurel Prince, Aaron Phillips, Caleb Habtegebriel, Luyao Xu, Huan Gui, Santiago Ontanon, Lora Aroyo, Karan Gill, Peggy Lu, Yash Katariya, Dhruv Madeka, Shankar Krishnan, Shubha Srinivas Raghvendra, James Freedman, Yi Tay, Gaurav Menghani, Peter Choy, Nishita Shetty, Dan Abolafia, Doron Kukliansky, Edward Chou, Jared Lichtarge, Ken Burke, Ben Coleman, Dee Guo, Larry Jin, Indro Bhattacharya, Victoria Langston, Yiming Li, Suyog Kotecha, Alex Yakubovich, Xinyun Chen, Petre Petrov, Tolly Powell, Yanzhang He, Corbin Quick, Kanav Garg, Dawsen Hwang, Yang Lu, Srinadh Bhojanapalli, Kristian Kjems, Ramin Mehran, Aaron Archer, Hado van Hasselt, Ashwin Balakrishna, JK Kearns, Meiqi Guo, Jason Riesa, Mikita Sazanovich, Xu Gao, Chris Sauer, Chengrun Yang, XiangHai Sheng, Thomas Jimma, Wouter Van Gansbeke, Vitaly Nikolaev, Wei Wei, Katie Millican, Ruizhe Zhao, Justin Snyder, Levent Bolelli, Maura O’Brien, Shawn Xu, Fei Xia, Wentao Yuan, Arvind Neelakantan, David Barker, Sachin Yadav, Hannah Kirkwood, Farooq Ahmad, Joel Wee, Jordan Grimstad, Boyu Wang, Matthew Wiethoff, Shane Settle, Miaosen Wang, Charles Blundell, Jingjing Chen, Chris Duvarney, Grace Hu, Olaf Ronneberger, Alex Lee, Yuanzhen Li, Abhishek Chakladar, Alena Butryna, Georgios Evangelopoulos, Guillaume Desjardins, Jonni Kanerva, Henry Wang, Averi Nowak, Nick Li, Alyssa Loo, Art Khurshudov, Laurent El Shafey, Nagabhushan Baddi, Karel Lenc, Yasaman Razeghi, Tom Lieber, Amer Sinha, Xiao Ma, Yao Su, James Huang, Asahi Ushio, Hanna Klimczak-Plucińska, Kareem Mohamed, JD Chen, Simon Osindero, Stav Ginzburg, Lampros Lamprou, Vasilisa Bashlovkina, Duc-Hieu Tran, Ali Khodaei, Ankit Anand, Yixian Di, Ramy Eskander, Manish Reddy Vuyyuru, Jasmine Liu, Aishwarya Kamath, Roman Goldenberg, Mathias Bellaiche, Juliette Pluto, Bill Rosgen, Hassan Mansoor, William Wong, Suhas Ganesh, Eric Bailey, Scott Baird, Dan Deutsch, Jinoo Baek, Xuhui Jia, Chansoo Lee, Abe Friesen, Nathaniel Braun, Kate Lee, Amayika Panda, Steven M. 
Hernandez, Duncan Williams, Jianqiao Liu, Ethan Liang, Arnaud Autef, Emily Pitler, Deepali Jain, Phoebe Kirk, Oskar Bunyan, Jaume Sanchez Elias, Tongxin Yin, Machel Reid, Aedan Pope, Nikita Putikhin, Bidisha Samanta, Sergio Guadarrama, Dahun Kim, Simon Rowe, Marcella Valentine, Geng Yan, Alex Salcianu, David Silver, Gan Song, Richa Singh, Shuai Ye, Hannah DeBalsi, Majd Al Merey, Eran Ofek, Albert Webson, Shibl Mourad, Ashwin Kakarla, Silvio Lattanzi, Nick Roy, Evgeny Sluzhaev, Christina Butterfield, Alessio Tonioni, Nathan Waters, Sudhindra Kopalle, Jason Chase, James Cohan, Girish Ramchandra Rao, Robert Berry, Michael Voznesensky, Shuguang Hu, Kristen Chiafullo, Sharat Chikkerur, George Scrivener, Ivy Zheng, Jeremy Wiesner, Wolfgang Macherey, Timothy Lillicrap, Fei Liu, Brian Walker, David Welling, Elinor Davies, Yangsibo Huang, Lijie Ren, Nir Shabat, Alessandro Agostini, Mariko Iinuma, Dustin Zelle, Rohit Sathyanarayana, Andrea D’olimpio, Morgan Redshaw, Matt Ginsberg, Ashwin Murthy, Mark Geller, Tatiana Matejovicova, Ayan Chakrabarti, Ryan Julian, Christine Chan, Qiong Hu, Daniel Jarrett, Manu Agarwal, Jeshwanth Challagundla, Tao Li, Sandeep Tata, Wen Ding, Maya Meng, Zhuyun Dai, Giulia Vezzani, Shefali Garg, Jannis Bulian, Mary Jasarevic, Honglong Cai, Harish Rajamani, Adam Santoro, Florian Hartmann, Chen Liang, Bartek Perz, Apoorv Jindal, Fan Bu, Sungyong Seo, Ryan Poplin, Adrian Goedeckemeyer, Badih Ghazi, Nikhil Khadke, Leon Liu, Kevin Mather, Mingda Zhang, Ali Shah, Alex Chen, Jinliang Wei, Keshav Shivam, Yuan Cao, Donghyun Cho, Angelo Scorza Scarpati, Michael Moffitt, Clara Barbu, Ivan Jurin, Ming-Wei Chang, Hongbin Liu, Hao Zheng, Shachi Dave, Christine Kaeser-Chen, Xiaobin Yu, Alvin Abdagic, Lucas Gonzalez, Yanping Huang, Peilin Zhong, Cordelia Schmid, Bryce Petrini, Alex Wertheim, Jifan Zhu, Hoang Nguyen, Kaiyang Ji, Yanqi Zhou, Tao Zhou, Fangxiaoyu Feng, Regev Cohen, David Rim, Shubham Milind Phal, Petko Georgiev, Ariel Brand, Yue Ma, Wei Li, Somit Gupta, Chao Wang, Pavel Dubov, Jean Tarbouriech, Kingshuk Majumder, Huijian Li, Norman Rink, Apurv Suman, Yang Guo, Yinghao Sun, Arun Nair, Xiaowei Xu, Mohamed Elhawaty, Rodrigo Cabrera, Guangxing Han, Julian Eisenschlos, Junwen Bai, Yuqi Li, Yamini Bansal, Thibault Sellam, Mina Khan, Hung Nguyen, Justin Mao-Jones, Nikos Parotsidis, Jake Marcus, Cindy Fan, Roland Zimmermann, Yony Kochinski, Laura Graesser, Feryal Behbahani, Alvaro Caceres, Michael Riley, Patrick Kane, Sandra Lefdal, Rob Willoughby, Paul Vicol, Lun Wang, Shujian Zhang, Ashleah Gill, Yu Liang, Gautam Prasad, Soroosh Mariooryad, Mehran Kazemi, Zifeng Wang, Kritika Muralidharan, Paul Voigtlaender, Jeffrey Zhao, Huanjie Zhou, Nina D’Souza, Aditi Mavalankar, Séb Arnold, Nick Young, Obaid Sarvana, Chace Lee, Milad Nasr, Tingting Zou, Seokhwan Kim, Lukas Haas, Kaushal Patel, Neslihan Bulut, David Parkinson, Courtney Biles, Dmitry Kalashnikov, Chi Ming To, Aviral Kumar, Jessica Austin, Alex Greve, Lei Zhang, Megha Goel, Yeqing Li, Sergey Yaroshenko, Max Chang, Abhishek Jindal, Geoff Clark, Hagai Taitelbaum, Dale Johnson, Ofir Roval, Jeongwoo Ko, Anhad Mohananey, Christian Schuler, Shenil Dodhia, Ruichao Li, Kazuki Osawa, Claire Cui, Peng Xu, Rushin Shah, Tao Huang, Ela Gruzewska, Nathan Clement, Mudit Verma, Olcan Sercinoglu, Hai Qian, Viral Shah, Masa Yamaguchi, Abhinit Modi, Takahiro Kosakai, Thomas Strohmann, Junhao Zeng, Beliz Gunel, Jun Qian, Austin Tarango, Krzysztof Jastrzębski, Robert David, Jyn Shan, Parker Schuh, Kunal Lad, Willi Gierke, Mukundan Madhavan, Xinyi 
Chen, Mark Kurzeja, Rebeca Santamaria-Fernandez, Dawn Chen, Alexandra Cordell, Yuri Chervonyi, Frankie Garcia, Nithish Kannen, Vincent Perot, Nan Ding, Shlomi Cohen-Ganor, Victor Lavrenko, Junru Wu, Georgie Evans, Cicero Nogueira dos Santos, Madhavi Sewak, Ashley Brown, Andrew Hard, Joan Puigcerver, Zeyu Zheng, Yizhong Liang, Evgeny Gladchenko, Reeve Ingle, Uri First, Pierre Sermanet, Charlotte Magister, Mihajlo Velimirović, Sashank Reddi, Susanna Ricco, Eirikur Agustsson, Hartwig Adam, Nir Levine, David Gaddy, Dan Holtmann-Rice, Xuanhui Wang, Ashutosh Sathe, Abhijit Guha Roy, Blaž Bratanič, Alen Carin, Harsh Mehta, Silvano Bonacina, Nicola De Cao, Mara Finkelstein, Verena Rieser, Xinyi Wu, Florent Altché, Dylan Scandinaro, Li Li, Nino Vieillard, Nikhil Sethi, Garrett Tanzer, Zhi Xing, Shibo Wang, Parul Bhatia, Gui Citovsky, Thomas Anthony, Sharon Lin, Tianze Shi, Shoshana Jakobovits, Gena Gibson, Raj Apte, Lisa Lee, Mingqing Chen, Arunkumar Byravan, Petros Maniatis, Kellie Webster, Andrew Dai, Pu-Chin Chen, Jiaqi Pan, Asya Fadeeva, Zach Gleicher, Thang Luong, Niket Kumar Bhumihar
Main category: cs.CL
TL;DR: Google introduces the Gemini 2.X model family (2.5 Pro, 2.5 Flash, 2.0 Flash, Flash-Lite) that achieves state-of-the-art performance in coding and reasoning while offering different capability-cost trade-offs across the Pareto frontier.
Details
Motivation: To develop a comprehensive model family that spans the full spectrum of capability versus cost trade-offs, enabling users to choose optimal models for different use cases while pushing the boundaries of agentic problem solving.
Method: Development of multiple model variants within the Gemini 2.X family: Gemini 2.5 Pro (flagship model with maximum capabilities), Gemini 2.5 Flash (balanced reasoning with reduced compute), and Gemini 2.0 Flash/Flash-Lite (optimized for speed and cost efficiency).
Result: Gemini 2.5 Pro achieves state-of-the-art performance on frontier coding and reasoning benchmarks, processes up to 3 hours of video content, and demonstrates advanced multimodal understanding. Together, the model family covers the full capability-cost Pareto frontier.
Conclusion: The Gemini 2.X model family successfully provides a comprehensive range of AI models that balance capability, cost, and latency requirements, enabling new agentic workflows through their combined long context, multimodal, and reasoning capabilities.
Abstract: In this report, we introduce the Gemini 2.X model family: Gemini 2.5 Pro and Gemini 2.5 Flash, as well as our earlier Gemini 2.0 Flash and Flash-Lite models. Gemini 2.5 Pro is our most capable model yet, achieving SoTA performance on frontier coding and reasoning benchmarks. In addition to its incredible coding and reasoning skills, Gemini 2.5 Pro is a thinking model that excels at multimodal understanding and it is now able to process up to 3 hours of video content. Its unique combination of long context, multimodal and reasoning capabilities can be combined to unlock new agentic workflows. Gemini 2.5 Flash provides excellent reasoning abilities at a fraction of the compute and latency requirements and Gemini 2.0 Flash and Flash-Lite provide high performance at low latency and cost. Taken together, the Gemini 2.X model generation spans the full Pareto frontier of model capability vs cost, allowing users to explore the boundaries of what is possible with complex agentic problem solving.
[75] SAGE: A Visual Language Model for Anomaly Detection via Fact Enhancement and Entropy-aware Alignment
Guoxin Zang, Xue Li, Donglin Di, Lanshun Nie, Dechen Zhan, Yang Song, Lei Fan
Main category: cs.CL
TL;DR: SAGE is a Vision-Language Model framework that improves industrial anomaly detection through Self-Guided Fact Enhancement and Entropy-aware Direct Preference Optimization, achieving superior performance in zero-shot and one-shot settings.
Details
Motivation: Existing Vision-Language Models struggle with industrial anomaly detection and reasoning, particularly in providing interpretable explanations and generalizing to unseen categories; the domain-specific nature of anomaly detection limits their applicability in industrial scenarios that require precise, structured, and context-aware analysis.
Method: The paper proposes the SAGE framework with two key components: (1) Self-Guided Fact Enhancement (SFE), which integrates domain-specific knowledge into visual reasoning via fact extraction and fusion, and (2) Entropy-aware Direct Preference Optimization (E-DPO), which aligns model outputs with expert preferences using entropy-aware optimization (an illustrative sketch follows the abstract below). The authors also introduce the AD-PL dataset with 28,415 question-answering instances and a Multiscale Logical Evaluation (MLE) framework for assessment.
Result: SAGE demonstrates superior performance on industrial anomaly datasets under both zero-shot and one-shot settings compared to existing approaches. The framework successfully enhances anomaly reasoning capabilities in industrial contexts.
Conclusion: The SAGE framework effectively addresses the limitations of existing VLMs in industrial anomaly detection by incorporating domain-specific knowledge and expert preferences, providing better interpretable explanations and generalization to unseen categories in industrial anomaly reasoning tasks.
Abstract: While Vision-Language Models (VLMs) have shown promising progress in general multimodal tasks, they often struggle in industrial anomaly detection and reasoning, particularly in delivering interpretable explanations and generalizing to unseen categories. This limitation stems from the inherently domain-specific nature of anomaly detection, which hinders the applicability of existing VLMs in industrial scenarios that require precise, structured, and context-aware analysis. To address these challenges, we propose SAGE, a VLM-based framework that enhances anomaly reasoning through Self-Guided Fact Enhancement (SFE) and Entropy-aware Direct Preference Optimization (E-DPO). SFE integrates domain-specific knowledge into visual reasoning via fact extraction and fusion, while E-DPO aligns model outputs with expert preferences using entropy-aware optimization. Additionally, we introduce AD-PL, a preference-optimized dataset tailored for industrial anomaly reasoning, consisting of 28,415 question-answering instances with expert-ranked responses. To evaluate anomaly reasoning models, we develop Multiscale Logical Evaluation (MLE), a quantitative framework analyzing model logic and consistency. SAGE demonstrates superior performance on industrial anomaly datasets under zero-shot and one-shot settings. The code, model and dataset are available at https://github.com/amoreZgx1n/SAGE.
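The summary names E-DPO but gives no formula. As a rough illustration only, the sketch below shows a standard DPO loss reweighted by the policy's mean token entropy; the exponential weighting, the `alpha` hyperparameter, and the use of summed sequence log-probabilities are assumptions for illustration, not the authors' formulation.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # Standard DPO: prefer the chosen response over the rejected one,
    # measured relative to a frozen reference model.
    logits = beta * ((policy_chosen_logps - ref_chosen_logps)
                     - (policy_rejected_logps - ref_rejected_logps))
    return -F.logsigmoid(logits)

def entropy_aware_dpo_loss(policy_chosen_logps, policy_rejected_logps,
                           ref_chosen_logps, ref_rejected_logps,
                           token_entropies, beta=0.1, alpha=1.0):
    # Illustrative assumption: down-weight preference pairs on which the
    # policy is already very uncertain (high mean token entropy). SAGE's
    # actual E-DPO weighting is not specified in this digest.
    base = dpo_loss(policy_chosen_logps, policy_rejected_logps,
                    ref_chosen_logps, ref_rejected_logps, beta)
    weight = torch.exp(-alpha * token_entropies.mean(dim=-1))
    return (weight * base).mean()
```

All log-probability inputs are per-example summed sequence log-probs of shape (batch,); `token_entropies` has shape (batch, seq_len).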
[76] Banzhida: Advancing Large Language Models for Tibetan with Curated Data and Continual Pre-Training
Leiyu Pan, Bojian Xiong, Lei Yang, Renren Jin, Shaowei Zhang, Yue Chen, Ling Shi, Jiang Zhou, Junru Wu, Zhen Wang, Jianxiang Peng, Juesi Xiao, Tianyu Dong, Zhuowen Han, Zhuo Chen, Sangjee Dondrub, Caizang Tai, Haixing Zhao, Huaque Cairang, Suonan Cairang, Rou Te, Lengben Zhaxi, Gazang Zhaxi, Zhonglin Ye, Yuhui Zheng, Chunyan Peng, Secha Jia, Pema Tashi, Cizhen Jiacuo, Pema Dorjee, Hongkai Liu, Pema Yanggon, Tsehang Dorjee, Jiaxin Han, Qiongying Hu, Jilin Man, Huanke You, Yuqi Ren, Duo La, Deyi Xiong
Main category: cs.CL
TL;DR: Researchers developed Banzhida, a multilingual large language model specifically enhanced for Tibetan language processing, using the largest curated Tibetan pre-training corpus to date and demonstrating superior performance on Tibetan language tasks compared to existing models.
Details
Motivation: Tibetan is severely underrepresented in existing large language models due to the scarcity of high-quality training corpora, creating a significant gap for this low-resource language in generative AI applications.
Method: The researchers curated the largest Tibetan pre-training corpus to date by aggregating data from diverse sources and applying a dedicated data cleaning and processing pipeline tailored for Tibetan (a toy filtering sketch follows the abstract below), then continued pre/post-training a multilingual base model to create Banzhida.
Result: Banzhida consistently and significantly outperforms both open-source models of similar scale and Tibetan-tailored models across a wide range of tasks, as demonstrated through evaluation on newly created high-quality Tibetan benchmarks and existing public benchmarks.
Conclusion: The study successfully addresses the underrepresentation of Tibetan in large language models by creating Banzhida, which advances generative AI capabilities for Tibetan language through careful corpus curation and model training, establishing new performance benchmarks for Tibetan language processing.
Abstract: Large language models have achieved remarkable progress across many languages. However, Tibetan, as a representative low-resource language, is particularly underrepresented in existing models due to the scarcity of high-quality training corpora. To address this gap, we curate the largest Tibetan pre-training corpus to date, aggregating data from diverse sources and applying a dedicated data cleaning and processing pipeline tailored for Tibetan. With the curated data, we continue pre/post-training a multilingual base model into Banzhida, a multilingual large language model that advances generative AI for Tibetan. To evaluate the Tibetan capabilities of the model, we create new high-quality Tibetan benchmarks, and complement them with existing public benchmarks. Experimental results demonstrate that Banzhida consistently and significantly outperforms both open-source models of similar scale and Tibetan-tailored models across a wide range of tasks.
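The summary mentions a Tibetan-specific cleaning pipeline without detailing its rules. Below is a minimal sketch of the kind of filters such a pipeline might apply; the script-ratio, length, and repetition thresholds are illustrative assumptions, and only the Tibetan Unicode block (U+0F00-U+0FFF) check reflects a known fact.

```python
import re

TIBETAN_CHAR = re.compile(r"[\u0F00-\u0FFF]")  # Tibetan Unicode block

def tibetan_ratio(text: str) -> float:
    # Fraction of non-space characters written in Tibetan script.
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum(1 for c in chars if TIBETAN_CHAR.match(c)) / len(chars)

def keep_document(text: str,
                  min_tibetan_ratio: float = 0.6,  # assumed threshold
                  min_chars: int = 200,            # assumed threshold
                  max_repeat_ratio: float = 0.3) -> bool:
    # Toy quality filter: long enough, mostly Tibetan script, and not
    # dominated by one repeated line (a common web-boilerplate symptom).
    if len(text) < min_chars or tibetan_ratio(text) < min_tibetan_ratio:
        return False
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if len(lines) >= 5:
        most_common = max(lines.count(ln) for ln in set(lines))
        if most_common / len(lines) > max_repeat_ratio:
            return False
    return True

docs = ["བོད་" * 80, "english-only boilerplate " * 20]
cleaned = [d for d in docs if keep_document(d)]  # keeps only the first
```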
[77] A Survey of Deep Learning for Geometry Problem Solving
Jianzhe Ma, Wenxuan Wang, Qin Jin
Main category: cs.CL
TL;DR: This paper provides a comprehensive survey of deep learning applications in geometry problem solving, covering tasks, methods, evaluation metrics, and future directions in this rapidly developing field.
Details
Motivation: Geometry problem solving is crucial for education, for assessing the mathematical ability of AI, and for evaluating multimodal capability. The rapid advance of deep learning, particularly multimodal large language models, has created significant research interest that calls for a systematic review of the field.
Method: The authors conduct a comprehensive survey that (1) summarizes the relevant geometry problem solving tasks, (2) reviews the deep learning methods applied to them, (3) analyzes evaluation metrics and methodologies, and (4) discusses current challenges and future research directions.
Result: The paper delivers a systematic overview of deep learning applications in geometry problem solving, providing researchers with a structured understanding of existing approaches, evaluation standards, and identifies key areas for future development. A continuously updated GitHub repository is maintained for ongoing reference.
Conclusion: The survey serves as a comprehensive reference for researchers working on deep learning approaches to geometry problem solving, aiming to accelerate progress in this interdisciplinary field by providing clear insights into current methodologies, challenges, and promising future research directions.
Abstract: Geometry problem solving is a key area of mathematical reasoning, which is widely involved in many important fields such as education, mathematical ability assessment of artificial intelligence, and multimodal ability assessment. In recent years, the rapid development of deep learning technology, especially the rise of multimodal large language models, has triggered a widespread research boom. This paper provides a survey of the applications of deep learning in geometry problem solving, including (i) a comprehensive summary of the relevant tasks in geometry problem solving; (ii) a thorough review of related deep learning methods; (iii) a detailed analysis of evaluation metrics and methods; and (iv) a critical discussion of the current challenges and future directions that can be explored. Our goal is to provide a comprehensive and practical reference of deep learning for geometry problem solving to promote further developments in this field. We create a continuously updated list of papers on GitHub: https://github.com/majianz/dl4gps.
[78] Seed-X: Building Strong Multilingual Translation LLM with 7B Parameters
Shanbo Cheng, Yu Bao, Qian Cao, Luyang Huang, Liyan Kang, Zhicheng Liu, Yu Lu, Wenhao Zhu, Jingwen Chen, Zhichao Huang, Tao Li, Yifu Li, Huiying Lin, Sitong Liu, Ningxin Peng, Shuaijie She, Lu Xu, Nuo Xu, Sen Yang, Runsheng Yu, Yiming Yu, Liehao Zou, Hang Li, Lu Lu, Yuxuan Wang, Yonghui Wu
Main category: cs.CL
TL;DR: Seed-X is a 7B parameter open-source LLM family that achieves state-of-the-art multilingual translation performance across 28 languages by combining diverse pre-training data, Chain-of-Thought reasoning, and reinforcement learning, matching the quality of much larger closed-source models like GPT-4o and Gemini-2.5.
Details
Motivation: Multilingual translation remains challenging for large language models due to intricate language patterns and the stilted output of automated translation. There is a need for open-source models that handle diverse language pairs effectively while maintaining translation quality comparable to closed-source alternatives.
Method: The authors develop Seed-X in three stages: (1) pre-training a base model on diverse, high-quality monolingual and bilingual content across 28 languages, (2) fine-tuning an instruct model with Chain-of-Thought (CoT) reasoning for translation tasks (a sketch of one possible training-instance format follows the abstract below), and (3) reinforcement learning to improve generalization across diverse language pairs.
Result: Seed-X achieves performance comparable to leading closed-source models including Gemini-2.5 and GPT-4o across 28 languages, while significantly outperforming larger open-source models in both automatic metrics and human evaluations, despite having only 7B parameters.
Conclusion: The paper demonstrates that a relatively small 7B parameter model can achieve state-of-the-art multilingual translation performance through strategic training approaches combining diverse multilingual data, reasoning capabilities, and reinforcement learning. The authors make their model parameters publicly available to advance translation research.
Abstract: Multilingual translation is a challenging task for large language models (LLMs), which must handle intricate language patterns and the stilted phrasing that arises in automated translation. In this paper, we introduce Seed-X, a family of open-source LLMs comprising instruct and reasoning models, pushing the limits of translation capability with a 7B parameter size. The base model is pre-trained on a diverse, high-quality dataset encompassing both monolingual and bilingual content across 28 languages, harnessing the full potential of multilingual data. The instruct model is then finetuned to translate by Chain-of-Thought (CoT) reasoning and further enhanced through reinforcement learning (RL) to achieve better generalization across diverse language pairs. Seed-X achieves performance comparable to leading closed-source models, including Gemini-2.5 and GPT-4o, across 28 languages, and significantly outperforms larger open-source models in both automatic metrics and human evaluations. We share the best practices from our optimization process, and make the parameters publicly available to advance translation research and applications.
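The exact CoT prompt format used for the instruct model is not given in this digest. Purely as a sketch, one plausible way to build a CoT translation training instance is shown below; the template wording and field names are assumptions, not Seed-X's actual format.

```python
def build_cot_translation_example(src_text: str, src_lang: str,
                                  tgt_lang: str, reasoning: str,
                                  translation: str) -> dict:
    # Hypothetical template: the model is asked to reason before
    # producing the final translation, mirroring the CoT fine-tuning
    # stage described above.
    prompt = (f"Translate the following {src_lang} text into {tgt_lang}. "
              f"Think step by step, then give the final translation.\n\n"
              f"{src_lang}: {src_text}")
    completion = f"Reasoning: {reasoning}\n{tgt_lang}: {translation}"
    return {"prompt": prompt, "completion": completion}

example = build_cot_translation_example(
    src_text="La vie est belle.",
    src_lang="French", tgt_lang="English",
    reasoning="'La vie' means 'life'; 'belle' is the feminine form of "
              "'beautiful', agreeing with 'vie'.",
    translation="Life is beautiful.")
```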
[79] Lessons from the TREC Plain Language Adaptation of Biomedical Abstracts (PLABA) track
Brian Ondov, William Xia, Kush Attal, Ishita Unde, Jerry He, Dina Demner-Fushman
Main category: cs.CL
TL;DR: The PLABA track at TREC 2023-2024 evaluated language models’ ability to simplify biomedical abstracts for patients, finding that top LLMs matched human-level factual accuracy and completeness but fell short on simplicity and brevity, and that automatic evaluation metrics correlated poorly with expert judgments.
Details
Motivation: Recent language models show potential to adapt professional biomedical literature to plain language for patients and caregivers, but their unpredictability and high potential for harm in healthcare require rigorous evaluation to stimulate research and assess promising systems.
Method: Hosted the Plain Language Adaptation of Biomedical Abstracts (PLABA) track at TREC 2023-2024 with two tasks: complete sentence-level rewriting of abstracts (Task 1) and identifying and replacing difficult terms (Task 2). The organizers developed four-fold professionally written references for automatic evaluation (a toy multi-reference scoring sketch follows the abstract below) and conducted extensive manual evaluation by biomedical experts.
Result: Twelve international teams participated, using models ranging from multilayer perceptrons to large transformers. In Task 1, top models matched human levels of factual accuracy and completeness but fell short on simplicity and brevity, and automatic metrics correlated poorly with manual judgments. In Task 2, systems struggled to identify difficult terms and to classify how to replace them, though LLM-based systems generated replacements judged well on accuracy, completeness, and simplicity (but not brevity).
Conclusion: The PLABA track demonstrated promise for using Large Language Models to adapt biomedical literature for the general public while highlighting their deficiencies in simplicity/brevity and revealing the need for improved automatic benchmarking tools in this critical healthcare communication domain.
Abstract: Objective: Recent advances in language models have shown potential to adapt professional-facing biomedical literature to plain language, making it accessible to patients and caregivers. However, their unpredictability, combined with the high potential for harm in this domain, means rigorous evaluation is necessary. Our goals with this track were to stimulate research and to provide high-quality evaluation of the most promising systems. Methods: We hosted the Plain Language Adaptation of Biomedical Abstracts (PLABA) track at the 2023 and 2024 Text Retrieval Conferences. Tasks included complete, sentence-level, rewriting of abstracts (Task 1) as well as identifying and replacing difficult terms (Task 2). For automatic evaluation of Task 1, we developed a four-fold set of professionally-written references. Submissions for both Tasks 1 and 2 were provided extensive manual evaluation from biomedical experts. Results: Twelve teams spanning twelve countries participated in the track, with models from multilayer perceptrons to large pretrained transformers. In manual judgments of Task 1, top-performing models rivaled human levels of factual accuracy and completeness, but not simplicity or brevity. Automatic, reference-based metrics generally did not correlate well with manual judgments. In Task 2, systems struggled with identifying difficult terms and classifying how to replace them. When generating replacements, however, LLM-based systems did well in manually judged accuracy, completeness, and simplicity, though not in brevity. Conclusion: The PLABA track showed promise for using Large Language Models to adapt biomedical literature for the general public, while also highlighting their deficiencies and the need for improved automatic benchmarking tools.
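The abstract reports that reference-based automatic metrics correlated poorly with manual judgments. To make concrete what such automatic scoring involves, here is a toy example that scores a system output against four references and adds a crude readability proxy; both measures are simplifications for illustration, not the track's official metrics.

```python
import re

def flesch_kincaid_grade(text: str) -> float:
    # Crude readability proxy: syllables approximated by vowel groups.
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    n_words = max(1, len(words))
    syllables = sum(max(1, len(re.findall(r"[aeiouyAEIOUY]+", w)))
                    for w in words)
    return 0.39 * n_words / sentences + 11.8 * syllables / n_words - 15.59

def best_reference_f1(hypothesis: str, references: list[str]) -> float:
    # Toy multi-reference metric: best unigram F1 over all references.
    def f1(a: str, b: str) -> float:
        sa, sb = set(a.lower().split()), set(b.lower().split())
        overlap = len(sa & sb)
        if not overlap:
            return 0.0
        p, r = overlap / len(sa), overlap / len(sb)
        return 2 * p * r / (p + r)
    return max(f1(hypothesis, ref) for ref in references)

refs = ["Doctors checked how well the drug worked."] * 4  # stand-in refs
print(best_reference_f1("They tested the drug's effect.", refs))
print(flesch_kincaid_grade("They tested the drug's effect."))
```

As the track found, lexical overlap with references can stay high while simplicity suffers, which is one reason manual expert judgment remained decisive.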
[80] Promptomatix: An Automatic Prompt Optimization Framework for Large Language Models
Rithesh Murthy, Ming Zhu, Liangwei Yang, Jielin Qiu, Juntao Tan, Shelby Heinecke, Caiming Xiong, Silvio Savarese, Huan Wang
Main category: cs.CL
TL;DR: Promptomatix is an automatic prompt optimization framework that converts natural language task descriptions into high-quality prompts without manual tuning, achieving competitive performance while reducing computational overhead.
Details
Motivation: Prompt engineering for Large Language Models remains manual, inconsistent, and inaccessible to non-experts, creating a barrier to effective LLM utilization despite the critical importance of well-crafted prompts for optimal performance.
Method: The framework uses a modular design with both a lightweight meta-prompt-based optimizer and a DSPy-powered compiler. It analyzes user intent, generates synthetic training data, selects appropriate prompting strategies, and refines prompts using cost-aware objectives.
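As a concrete illustration of such a refinement loop, here is a minimal Python sketch of cost-aware prompt optimization; the `llm` callable, the `score_prompt` helper, and the length penalty are hypothetical stand-ins, not Promptomatix's actual API.
```python
# Minimal sketch of a cost-aware, meta-prompt-based refinement loop.
# All names here (llm, score_prompt, optimize) are hypothetical stand-ins.

def score_prompt(prompt, dev_set, llm, length_penalty=0.01):
    """Task accuracy on a small dev set, discounted by prompt length (cost)."""
    correct = sum(llm(prompt, x).strip() == y for x, y in dev_set)
    return correct / len(dev_set) - length_penalty * len(prompt.split())

def optimize(task_description, dev_set, llm, rounds=5):
    best = task_description                      # seed prompt: raw task description
    best_score = score_prompt(best, dev_set, llm)
    for _ in range(rounds):
        # Meta-prompting: ask the LLM itself to propose a refined prompt.
        candidate = llm("Rewrite this prompt to be clearer and more concise:", best)
        score = score_prompt(candidate, dev_set, llm)
        if score > best_score:                   # keep only improving candidates
            best, best_score = candidate, score
    return best
```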
Result: Evaluated across 5 task categories, Promptomatix achieves competitive or superior performance compared to existing libraries while reducing prompt length and computational overhead, making prompt optimization more scalable and efficient.
Conclusion: Promptomatix successfully automates prompt optimization, making it accessible to non-experts while maintaining high performance and efficiency, with its modular design enabling future extensions to more advanced frameworks.
Abstract: Large Language Models (LLMs) perform best with well-crafted prompts, yet prompt engineering remains manual, inconsistent, and inaccessible to non-experts. We introduce Promptomatix, an automatic prompt optimization framework that transforms natural language task descriptions into high-quality prompts without requiring manual tuning or domain expertise. Promptomatix supports both a lightweight meta-prompt-based optimizer and a DSPy-powered compiler, with modular design enabling future extension to more advanced frameworks. The system analyzes user intent, generates synthetic training data, selects prompting strategies, and refines prompts using cost-aware objectives. Evaluated across 5 task categories, Promptomatix achieves competitive or superior performance compared to existing libraries, while reducing prompt length and computational overhead, making prompt optimization scalable and efficient.
[81] X-Intelligence 3.0: Training and Evaluating Reasoning LLM for Semiconductor Display
Xiaolin Yan, Yangxing Liu, Jiazhang Zheng, Chi Liu, Mingyu Du, Caisheng Chen, Haoyang Liu, Ming Ding, Yuan Li, Qiuping Liao, Linfeng Li, Zhili Mei, Siyu Wan, Li Li, Ruyi Zhong, Jiangling Yu, Xule Liu, Huihui Hu, Jiameng Yue, Ruohui Cheng, Qi Yang, Liangqing Wu, Ke Zhu, Chi Zhang, Chufei Jing, Yifan Zhou, Yan Liang, Dongdong Li, Zhaohui Wang, Bin Zhao, Mingzhou Wu, Mingzhong Zhou, Peng Du, Zuomin Liao, Chao Dai, Pengfei Liang, Xiaoguang Zhu, Yu Zhang, Yu Gu, Kun Pan, Yuan Wu, Yanqing Guan, Shaojing Wu, Zikang Feng, Xianze Ma, Peishan Cheng, Wenjuan Jiang, Jing Ba, Huihao Yu, Zeping Hu, Yuan Xu, Zhiwei Liu, He Wang, Zhenguo Lin, Ming Liu, Yanhong Meng
Main category: cs.CL
TL;DR: X-Intelligence 3.0 is a 32B parameter language model specifically designed for the semiconductor display industry that outperforms much larger models like DeepSeek-R1-671B through domain-specific training, supervised fine-tuning, reinforcement learning, and retrieval-augmented generation.
Details
Motivation: Large language models lack domain-specific training and expertise for the semiconductor display industry, limiting their effectiveness in solving complex industry challenges that require specialized knowledge and reasoning capabilities.
Method: The authors developed X-Intelligence 3.0 using: (1) a carefully curated industry knowledge base, (2) supervised fine-tuning and reinforcement learning to enhance reasoning capabilities, (3) an automated evaluation framework simulating expert assessments, and (4) domain-specific retrieval-augmented generation (RAG) mechanism integration.
Result: X-Intelligence 3.0, despite having only 32 billion parameters, outperforms the much larger SOTA DeepSeek-R1-671B model across multiple evaluations, demonstrating notable performance gains on benchmark datasets and exceptional efficiency.
Conclusion: X-Intelligence 3.0 successfully addresses the longstanding reasoning challenges in the semiconductor display industry by combining domain-specific knowledge with advanced training techniques, proving that smaller, specialized models can outperform larger general-purpose models in specific domains.
Abstract: Large language models (LLMs) have recently achieved significant advances in reasoning and demonstrated their advantages in solving challenging problems. Yet, their effectiveness in the semiconductor display industry remains limited due to a lack of domain-specific training and expertise. To bridge this gap, we present X-Intelligence 3.0, the first high-performance reasoning model specifically developed for the semiconductor display industry. This model is designed to deliver expert-level understanding and reasoning for the industry’s complex challenges. Leveraging a carefully curated industry knowledge base, the model undergoes supervised fine-tuning and reinforcement learning to enhance its reasoning and comprehension capabilities. To further accelerate development, we implemented an automated evaluation framework that simulates expert-level assessments. We also integrated a domain-specific retrieval-augmented generation (RAG) mechanism, resulting in notable performance gains on benchmark datasets. Despite its relatively compact size of 32 billion parameters, X-Intelligence 3.0 outperforms SOTA DeepSeek-R1-671B across multiple evaluations. This demonstrates its exceptional efficiency and establishes it as a powerful solution to the longstanding reasoning challenges faced by the semiconductor display industry.
[82] Mangosteen: An Open Thai Corpus for Language Model Pretraining
Wannaphong Phatthiyaphaibun, Can Udomcharoenchaikit, Pakpoom Singkorapoom, Kunat Pipatanakul, Ekapol Chuangsuwanich, Peerat Limkonchotiwat, Sarana Nutanong
Main category: cs.CL
TL;DR: Researchers created Mangosteen, a 47 billion-token high-quality Thai corpus using a customized cleaning pipeline, demonstrating significant improvements in Thai language model performance and releasing all resources for reproducibility.
Details
Motivation: Existing large-scale corpora use English-centric or language-agnostic cleaning pipelines that fail to capture Thai script nuances and cultural content (like gambling material), while prior Thai-specific efforts lack transparency and reproducibility in their data construction processes.
Method: Developed a Thai-adapted Dolma pipeline featuring custom rule-based language identification, revised C4/Gopher quality filters, Thai-trained content filters, and incorporation of curated non-web sources including Wikipedia, Royal Gazette texts, OCR-extracted books, and CC-licensed YouTube subtitles.
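The staged filtering flow can be sketched as below; the toy predicates stand in for the pipeline's rule-based language ID, revised C4/Gopher quality filters, and Thai-trained content filters, whose real implementations are far more elaborate.
```python
import re

def clean_corpus(docs, is_thai, passes_quality, is_safe):
    """Yield only documents that survive all three filtering stages."""
    for doc in docs:
        if is_thai(doc) and passes_quality(doc) and is_safe(doc):
            yield doc

# Toy predicates: keep documents that are mostly Thai script and long enough.
THAI_CHARS = re.compile(r"[\u0E00-\u0E7F]")
kept = list(clean_corpus(
    ["สวัสดีครับ ยินดีต้อนรับ", "hello world"],
    is_thai=lambda d: len(THAI_CHARS.findall(d)) / max(len(d), 1) > 0.5,
    passes_quality=lambda d: len(d) > 10,   # placeholder for C4/Gopher rules
    is_safe=lambda d: True,                 # placeholder for Thai content filter
))
```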
Result: The pipeline reduced CommonCrawl from 202M to 25M documents while improving SEA-HELM NLG scores from 3 to 11. An 8B-parameter SEA-LION model continually pre-trained on Mangosteen outperformed SEA-LION-v3 and Llama-3.1 by approximately four points on Thai benchmarks.
Conclusion: Mangosteen provides a transparent, high-quality Thai corpus with full reproducibility through released pipeline code, cleaning manifests, corpus snapshot, and model checkpoints, establishing a solid foundation for future Thai and regional language model research.
Abstract: Pre-training data shapes a language model’s quality, but raw web text is noisy and demands careful cleaning. Existing large-scale corpora rely on English-centric or language-agnostic pipelines whose heuristics do not capture Thai script or cultural nuances, leaving risky material such as gambling content untreated. Prior Thai-specific efforts customize pipelines or build new ones, yet seldom release their data or document design choices, hindering reproducibility and raising the question of how to construct a transparent, high-quality Thai corpus. We introduce Mangosteen: a 47 billion-token Thai corpus built through a Thai-adapted Dolma pipeline that includes custom rule-based language ID, revised C4/Gopher quality filters, and Thai-trained content filters, plus curated non-web sources such as Wikipedia, Royal Gazette texts, OCR-extracted books, and CC-licensed YouTube subtitles. Systematic ablations using GPT-2 show the pipeline trims CommonCrawl from 202M to 25M documents while raising SEA-HELM NLG from 3 to 11; an 8B-parameter SEA-LION model continually pre-trained on Mangosteen then surpasses SEA-LION-v3 and Llama-3.1 by about four points on Thai benchmarks. We release the full pipeline code, cleaning manifests, corpus snapshot, and all checkpoints, providing a fully reproducible foundation for future Thai and regional LLM research.
[83] Supernova: Achieving More with Less in Transformer Architectures
Andrei-Valentin Tanase, Elena Pelican
Main category: cs.CL
TL;DR: Supernova is a 650M-parameter transformer that achieves 90% performance of 1B-parameter models through efficient architecture design and advanced tokenization, using 35% fewer parameters and 10x fewer training tokens than competitors.
Details
Motivation: Challenge the prevailing scaling paradigm in language models by demonstrating that architectural efficiency and tokenization quality can compensate for reduced parameter counts, achieving comparable performance with significantly fewer computational resources.
Method: Combines multiple architectural innovations: Rotary Positional Embeddings (RoPE), Grouped Query Attention (GQA) with 3:1 compression ratio, RMSNorm for efficiency, SwiGLU activation functions, and a custom 128,000-vocabulary byte-level BPE tokenizer for state-of-the-art compression performance.
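For reference, two of the listed components, RMSNorm and SwiGLU, look roughly like this in PyTorch; this is a generic rendering of these standard blocks, not the authors' code.
```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Root-mean-square normalization (no mean-centering, unlike LayerNorm)."""
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x):
        inv_rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return x * inv_rms * self.weight

class SwiGLU(nn.Module):
    """SiLU-gated feed-forward block: silu(W_g x) * (W_u x), projected down."""
    def __init__(self, dim, hidden):
        super().__init__()
        self.w_gate = nn.Linear(dim, hidden, bias=False)
        self.w_up = nn.Linear(dim, hidden, bias=False)
        self.w_down = nn.Linear(hidden, dim, bias=False)

    def forward(self, x):
        return self.w_down(torch.nn.functional.silu(self.w_gate(x)) * self.w_up(x))
```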
Result: Supernova achieves 90% of the performance of 1B-parameter models while using 35% fewer parameters and requiring only 100B training tokens (an order of magnitude less than competing models), demonstrating superior computational efficiency.
Conclusion: Architectural efficiency and tokenization quality can effectively compensate for reduced parameter counts, challenging the current scaling paradigm and showing that smaller, well-designed models can achieve competitive performance with significantly lower computational requirements.
Abstract: We present Supernova, a 650M-parameter decoder-only transformer that demonstrates how careful architectural design and tokenization innovation can achieve the performance of larger models while maintaining computational efficiency. Our architecture combines Rotary Positional Embeddings (RoPE), Grouped Query Attention (GQA) with a 3:1 compression ratio, RMSNorm for computational efficiency, and SwiGLU activation functions. A critical innovation is our custom 128,000-vocabulary byte-level BPE tokenizer, which achieves state-of-the-art compression performance. Through detailed analysis, we show that Supernova achieves 90% of the performance of 1B-parameter models while using 35% fewer parameters and requiring only 100B training tokens, an order of magnitude less than competing models. Our findings challenge the prevailing scaling paradigm, demonstrating that architectural efficiency and tokenization quality can compensate for reduced parameter counts.
cs.CV
[84] Salience Adjustment for Context-Based Emotion Recognition
Bin Han, Jonathan Gratch
Main category: cs.CV
TL;DR: This paper proposes a salience-adjusted framework that combines Bayesian Cue Integration and Visual-Language Models to improve emotion recognition by dynamically weighting facial expressions and contextual cues based on facial expressivity, evaluated in prisoner’s dilemma scenarios.
Details
Motivation: Emotion recognition in dynamic social contexts is challenging because it requires understanding the complex interaction between facial expressions and situational cues. Current approaches may not adequately balance the importance of facial versus contextual information based on the expressivity of facial cues.
Method: The authors developed a salience-adjusted framework that integrates Bayesian Cue Integration (BCI) and Visual-Language Models (VLMs) to dynamically weight facial and contextual information based on the expressivity of facial cues. The approach adjusts the relative importance of different information sources depending on how expressive the facial cues are.
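A toy numpy sketch of expressivity-weighted cue fusion; the log-linear weighting below is one plausible reading of the approach, not the paper's exact BCI formulation.
```python
import numpy as np

def fuse(p_face, p_context, expressivity):
    """Combine cue distributions; expressivity in [0, 1] sets the face weight."""
    w = expressivity
    log_p = w * np.log(p_face + 1e-9) + (1 - w) * np.log(p_context + 1e-9)
    p = np.exp(log_p)
    return p / p.sum()

# An inexpressive face lets the situational context dominate the prediction.
p_face = np.array([0.4, 0.3, 0.3])      # weak facial evidence
p_context = np.array([0.1, 0.1, 0.8])   # strong contextual evidence
print(fuse(p_face, p_context, expressivity=0.2))
```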
Result: The salience adjustment approach enhanced emotion recognition performance when evaluated using both human annotations and automatic emotion recognition systems in prisoner’s dilemma scenarios designed to evoke emotional reactions.
Conclusion: Incorporating salience adjustment improves emotion recognition performance and offers promising directions for extending this framework to broader social contexts and multimodal applications in future research.
Abstract: Emotion recognition in dynamic social contexts requires an understanding of the complex interaction between facial expressions and situational cues. This paper presents a salience-adjusted framework for context-aware emotion recognition with Bayesian Cue Integration (BCI) and Visual-Language Models (VLMs) to dynamically weight facial and contextual information based on the expressivity of facial cues. We evaluate this approach using human annotations and automatic emotion recognition systems in prisoner’s dilemma scenarios, which are designed to evoke emotional reactions. Our findings demonstrate that incorporating salience adjustment enhances emotion recognition performance, offering promising directions for future research to extend this framework to broader social contexts and multimodal applications.
[85] Document Haystack: A Long Context Multimodal Image/Document Understanding Vision LLM Benchmark
Goeric Huybrechts, Srikanth Ronanki, Sai Muralidhar Jayanthi, Jack Fitzgerald, Srinivasan Veeravanallur
Main category: cs.CV
TL;DR: The paper introduces Document Haystack, a comprehensive benchmark with 400 document variants (5-200 pages) and 8,250 questions to evaluate Vision Language Models’ performance on long, visually complex documents by strategically inserting text or multimodal “needles” at various depths.
Details
Motivation: The processing of long documents by multimodal Large Language Models remains under-explored due to a lack of suitable benchmarks, despite significant advances in analyzing complex multimodal data inputs.
Method: The authors created the Document Haystack benchmark, featuring documents ranging from 5 to 200 pages with strategically inserted pure text or multimodal text+image “needles” at various depths to challenge VLMs’ retrieval capabilities, supported by an objective automated evaluation framework.
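Conceptually, needle insertion at a controlled depth amounts to the following sketch; the needle wording and depth grid are illustrative assumptions, not the released benchmark specification.
```python
def insert_needle(pages, needle, depth):
    """Append a needle to the page at a relative depth (0.0 first, 1.0 last)."""
    idx = min(int(depth * len(pages)), len(pages) - 1)
    out = list(pages)
    out[idx] = out[idx] + "\n" + needle
    return out

doc = [f"page {i} text" for i in range(200)]
probed = insert_needle(doc, 'The secret code in this document is "42".', depth=0.75)
```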
Result: The benchmark comprises 400 document variants and 8,250 questions total. The paper presents evaluation results from prominent Vision Language Models on this benchmark.
Conclusion: Document Haystack provides a comprehensive evaluation framework for assessing VLMs’ performance on long documents and opens potential research avenues in this area of multimodal document understanding.
Abstract: The proliferation of multimodal Large Language Models has significantly advanced the ability to analyze and understand complex data inputs from different modalities. However, the processing of long documents remains under-explored, largely due to a lack of suitable benchmarks. To address this, we introduce Document Haystack, a comprehensive benchmark designed to evaluate the performance of Vision Language Models (VLMs) on long, visually complex documents. Document Haystack features documents ranging from 5 to 200 pages and strategically inserts pure text or multimodal text+image “needles” at various depths within the documents to challenge VLMs’ retrieval capabilities. Comprising 400 document variants and a total of 8,250 questions, it is supported by an objective, automated evaluation framework. We detail the construction and characteristics of the Document Haystack dataset, present results from prominent VLMs and discuss potential research avenues in this area.
[86] PAT++: a cautionary tale about generative visual augmentation for Object Re-identification
Leonardo Santiago Benitez Pereira, Arathy Jeevan
Main category: cs.CV
TL;DR: This paper investigates generative data augmentation for object re-identification using a novel PAT++ pipeline that incorporates Diffusion Self-Distillation into Part-Aware Transformer, finding that generated images consistently degrade performance due to domain shifts and loss of identity-defining features.
Details
Motivation: While generative data augmentation has shown success in various vision tasks, its effectiveness for object re-identification (which requires preserving fine-grained visual details crucial for identity recognition) has not been thoroughly explored, motivating this investigation.
Method: The authors propose PAT++, a pipeline that incorporates Diffusion Self-Distillation into the established Part-Aware Transformer framework. They conduct extensive experiments using the Urban Elements ReID Challenge dataset, applying generated images for both model training and query expansion to assess identity-preserving image generation effectiveness.
Result: The experiments reveal consistent performance degradation when using generated images for object re-identification tasks. The degradation is attributed to domain shifts between real and generated images and the failure of generative models to retain critical identity-defining features necessary for accurate re-identification.
Conclusion: The findings challenge existing assumptions about the transferability of generative models to fine-grained recognition tasks, particularly object re-identification. The study exposes significant limitations in current visual augmentation approaches for identity-preserving applications, suggesting that generative data augmentation may not be suitable for tasks requiring precise identity preservation.
Abstract: Generative data augmentation has demonstrated gains in several vision tasks, but its impact on object re-identification - where preserving fine-grained visual details is essential - remains largely unexplored. In this work, we assess the effectiveness of identity-preserving image generation for object re-identification. Our novel pipeline, named PAT++, incorporates Diffusion Self-Distillation into the well-established Part-Aware Transformer. Using the Urban Elements ReID Challenge dataset, we conduct extensive experiments with generated images used for both model training and query expansion. Our results show consistent performance degradation, driven by domain shifts and failure to retain identity-defining features. These findings challenge assumptions about the transferability of generative models to fine-grained recognition tasks and expose key limitations in current approaches to visual augmentation for identity-preserving applications.
[87] LMM4Edit: Benchmarking and Evaluating Multimodal Image Editing with LMMs
Zitong Xu, Huiyu Duan, Bingnan Liu, Guangji Ma, Jiarui Wang, Liu Yang, Shiqi Gao, Xiaoyu Wang, Jia Wang, Xiongkuo Min, Guangtao Zhai, Weisi Lin
Main category: cs.CV
TL;DR: The paper introduces EBench-18K, a large-scale benchmark with 18K edited images and human preference annotations for evaluating text-guided image editing models, and proposes LMM4Edit, a language-vision model-based metric that aligns well with human preferences for comprehensive editing evaluation.
Details
Motivation: Current text-guided image editing models struggle to balance image quality, editing alignment, and consistency with original images, while existing evaluation benchmarks have limitations in scale and alignment with human perception, creating a need for better evaluation methods.
Method: The authors create the EBench-18K benchmark containing 1,080 source images with editing prompts across 21 tasks, 18K+ edited images from 17 TIE models, 55K+ human opinion scores, and 18K+ QA pairs. They then develop LMM4Edit, a large multimodal model-based metric that evaluates editing from multiple dimensions including perceptual quality, editing alignment, attribute preservation, and task-specific QA accuracy.
Result: LMM4Edit demonstrates outstanding performance and strong alignment with human preferences in evaluation experiments. Zero-shot validation on other datasets shows good generalization ability of the proposed model.
Conclusion: The paper successfully addresses the evaluation gap in text-guided image editing by providing a comprehensive benchmark and an effective evaluation metric that better aligns with human judgment, offering valuable tools for the research community to assess and improve image editing models.
Abstract: The rapid advancement of Text-guided Image Editing (TIE) enables image modifications through text prompts. However, current TIE models still struggle to balance image quality, editing alignment, and consistency with the original image, limiting their practical applications. Existing TIE evaluation benchmarks and metrics have limitations on scale or alignment with human perception. To this end, we introduce EBench-18K, the first large-scale image Editing Benchmark including 18K edited images with fine-grained human preference annotations for evaluating TIE. Specifically, EBench-18K includes 1,080 source images with corresponding editing prompts across 21 tasks, 18K+ edited images produced by 17 state-of-the-art TIE models, 55K+ mean opinion scores (MOSs) assessed from three evaluation dimensions, and 18K+ question-answering (QA) pairs. Based on EBench-18K, we employ outstanding LMMs to assess edited images, while the evaluation results, in turn, provide insights into assessing the alignment between the LMMs’ understanding ability and human preferences. Then, we propose LMM4Edit, a LMM-based metric for evaluating image Editing models from perceptual quality, editing alignment, attribute preservation, and task-specific QA accuracy in an all-in-one manner. Extensive experiments show that LMM4Edit achieves outstanding performance and aligns well with human preference. Zero-shot validation on the other datasets also shows the generalization ability of our model. The dataset and code are available at https://github.com/IntMeGroup/LMM4Edit.
[88] Local Dense Logit Relations for Enhanced Knowledge Distillation
Liuchi Xu, Kang Liu, Jinshuai Liu, Lu Wang, Lisheng Xu, Jun Cheng
Main category: cs.CV
TL;DR: This paper proposes Local Dense Relational Logit Distillation (LDRLD), a knowledge distillation method that captures fine-grained inter-class relationships through recursive logit decoupling and recombining, combined with an Adaptive Decay Weight strategy to improve student model performance.
Details
Motivation: Existing logit distillation methods have not thoroughly explored fine-grained relationships within logit knowledge, missing opportunities to provide more detailed and clearer insights for student model learning from teacher models.
Method: The authors develop LDRLD, which recursively decouples and recombines logit information to capture inter-class relationships, introduce an Adaptive Decay Weight (ADW) strategy with Inverse Rank Weighting (IRW) and Exponential Rank Decay (ERD) to dynamically adjust weights for critical category pairs, and distill the remaining non-target knowledge to ensure completeness.
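The ADW weighting can be sketched as follows: IRW weights a category pair inversely to its rank gap, and ERD decays the weight with the pair's total rank. The exact formulas below are plausible readings of the summary, not the paper's definitions.
```python
import torch

def adaptive_decay_weights(logits, alpha=0.1):
    """Pairwise weight matrix over categories from a single logit vector."""
    ranks = logits.argsort(descending=True).argsort() + 1  # rank 1 = top class
    i, j = torch.meshgrid(ranks, ranks, indexing="ij")
    irw = 1.0 / (1.0 + (i - j).abs().float())              # inverse rank weighting
    erd = torch.exp(-alpha * (i + j).float())              # exponential rank decay
    return irw * erd

weights = adaptive_decay_weights(torch.tensor([2.0, 0.5, 1.2, -0.3]))
```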
Result: Extensive experiments on CIFAR-100, ImageNet-1K, and Tiny-ImageNet datasets show that LDRLD compares favorably with state-of-the-art logit-based distillation approaches, demonstrating improved student model performance through fine-grained knowledge transfer.
Conclusion: The proposed LDRLD method successfully improves knowledge distillation by capturing fine-grained inter-class relationships and emphasizing critical relationships through adaptive weighting, offering a more effective approach to logit-based knowledge transfer from teacher to student models.
Abstract: State-of-the-art logit distillation methods exhibit versatility, simplicity, and efficiency. Despite the advances, existing studies have yet to delve thoroughly into fine-grained relationships within logit knowledge. In this paper, we propose Local Dense Relational Logit Distillation (LDRLD), a novel method that captures inter-class relationships through recursively decoupling and recombining logit information, thereby providing more detailed and clearer insights for student learning. To further optimize the performance, we introduce an Adaptive Decay Weight (ADW) strategy, which can dynamically adjust the weights for critical category pairs using Inverse Rank Weighting (IRW) and Exponential Rank Decay (ERD). Specifically, IRW assigns weights inversely proportional to the rank differences between pairs, while ERD adaptively controls weight decay based on total ranking scores of category pairs. Furthermore, after the recursive decoupling, we distill the remaining non-target knowledge to ensure knowledge completeness and enhance performance. Ultimately, our method improves the student’s performance by transferring fine-grained knowledge and emphasizing the most critical relationships. Extensive experiments on datasets such as CIFAR-100, ImageNet-1K, and Tiny-ImageNet demonstrate that our method compares favorably with state-of-the-art logit-based distillation approaches. The code will be made publicly available.
[89] An empirical study for the early detection of Mpox from skin lesion images using pretrained CNN models leveraging XAI technique
Mohammad Asifur Rahim, Muhammad Nazmul Arefin, Md. Mizanur Rahman, Md Ali Hossain, Ahmed Moustafa
Main category: cs.CV
TL;DR: This study evaluates pre-trained CNN models (VGG16, VGG19, InceptionV3, MobileNetV2) for early mpox detection using transfer learning and Grad-CAM for interpretability, achieving up to 95% accuracy on binary classification and 93% on multi-class classification.
Details
Motivation: Mpox diagnosis is challenging due to similarities with other skin conditions, and while AI shows promise for medical image analysis, pre-trained CNN models and explainable AI techniques for mpox detection remain underexplored, creating a need for effective automated diagnostic tools.
Method: The study applied transfer learning to four pre-trained CNN models (VGG16, VGG19, InceptionV3, MobileNetV2) using the MSLD and MSLD v2.0 datasets, freezing initial layers and adding custom layers to adapt features for mpox detection while avoiding overfitting, with Grad-CAM used for model interpretability and visualization of critical image regions.
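The transfer-learning recipe follows a standard freeze-and-replace pattern, sketched here with torchvision's MobileNetV2 (one of the four backbones studied); the custom head is an illustrative choice, not the paper's exact architecture.
```python
import torch.nn as nn
from torchvision import models

model = models.mobilenet_v2(weights="IMAGENET1K_V1")
for param in model.features.parameters():
    param.requires_grad = False          # freeze the pretrained feature extractor

model.classifier = nn.Sequential(        # custom layers for the mpox task
    nn.Dropout(0.3),                     # regularization against overfitting
    nn.Linear(model.last_channel, 128),
    nn.ReLU(),
    nn.Linear(128, 2),                   # binary: mpox vs. other skin lesions
)
```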
Result: InceptionV3 achieved the best binary classification performance with 95% accuracy, while MobileNetV2 performed best on multi-class classification with 93% accuracy. Grad-CAM successfully highlighted key image regions, though some models showed overfitting tendencies evidenced by training-validation loss discrepancies.
Conclusion: The study demonstrates the potential of pre-trained CNN models for monkeypox detection and validates the value of XAI techniques, but identifies needs for addressing dataset limitations, incorporating multimodal data, and exploring additional interpretability techniques to improve diagnostic reliability and model transparency.
Abstract: Context: Mpox is a zoonotic disease caused by the Mpox virus, which shares similarities with other skin conditions, making accurate early diagnosis challenging. Artificial intelligence (AI), especially Deep Learning (DL), has become a strong tool for medical image analysis; however, pre-trained CNN models and XAI techniques for mpox detection remain underexplored. Objective: This study aims to evaluate the effectiveness of pre-trained CNN models (VGG16, VGG19, InceptionV3, MobileNetV2) for the early detection of monkeypox using binary and multi-class datasets. It also seeks to enhance model interpretability using Grad-CAM, an XAI technique. Method: Two datasets, MSLD and MSLD v2.0, were used for training and validation. Transfer learning techniques were applied to fine-tune pre-trained CNN models by freezing initial layers and adding custom layers to adapt the final features to the mpox detection task and avoid overfitting. Model performance was evaluated using metrics such as accuracy, precision, recall, F1-score, and ROC. Grad-CAM was utilized for visualizing critical features. Results: InceptionV3 demonstrated the best performance on the binary dataset with an accuracy of 95%, while MobileNetV2 outperformed on the multi-class dataset with an accuracy of 93%. Grad-CAM successfully highlighted key image regions. Despite high accuracy, some models showed overfitting tendencies, as evidenced by discrepancies between training and validation losses. Conclusion: This study underscores the potential of pre-trained CNN models in monkeypox detection and the value of XAI techniques. Future work should address dataset limitations, incorporate multimodal data, and explore additional interpretability techniques to improve diagnostic reliability and model transparency.
[90] A Lightweight Face Quality Assessment Framework to Improve Face Verification Performance in Real-Time Screening Applications
Ahmed Aman Ibrahim, Hamad Mansour Alawar, Abdulnasser Abbas Zehi, Ahmed Mohammad Alkendi, Bilal Shafi Ashfaq Ahmed Mirza, Shan Ullah, Ismail Lujain Jaleel, Hassan Ugail
Main category: cs.CV
TL;DR: A lightweight face quality assessment framework using normalized facial landmarks and Random Forest Regression achieves 96.67% accuracy in pre-filtering low-quality face images, yielding a 99.7% reduction in the false rejection rate when integrated with an ArcFace face verification system.
Details
Motivation: Low-quality face images caused by motion blur, poor lighting, occlusions, and extreme pose variations significantly degrade face recognition performance in real-time screening applications like surveillance and access control, leading to higher false rejection and false acceptance rates that need to be addressed.
Method: The framework utilizes normalized facial landmarks combined with a Random Forest Regression classifier to automatically assess face image quality and pre-filter low-quality images before they enter the verification pipeline, specifically targeting face resolution variations and pose deviations common in surveillance scenarios.
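A compact sketch of the quality gate with scikit-learn; the landmark normalization, the synthetic training data, and the 0.5 acceptance threshold are illustrative assumptions.
```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def normalize(landmarks):
    """Remove translation and scale from (n_points, 2) landmark coordinates."""
    centered = landmarks - landmarks.mean(axis=0)
    return (centered / (np.linalg.norm(centered) + 1e-9)).ravel()

# X: one row of normalized landmarks per face; y: quality scores (synthetic here).
rng = np.random.default_rng(0)
X = np.stack([normalize(rng.normal(size=(68, 2))) for _ in range(100)])
y = rng.uniform(size=100)
quality_model = RandomForestRegressor(n_estimators=100).fit(X, y)

def keep_for_verification(landmarks, threshold=0.5):
    """Gate: only faces predicted above the quality threshold are verified."""
    return quality_model.predict(normalize(landmarks)[None])[0] >= threshold
```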
Result: The quality assessment module achieved 96.67% accuracy and when integrated with ArcFace face verification model, demonstrated a 99.7% reduction in false rejection rate with enhanced cosine similarity scores. Experiments on a real-world dataset of over 600 subjects from Dubai Police CCTV footage showed superior performance compared to existing techniques while maintaining computational efficiency.
Conclusion: The proposed lightweight framework effectively mitigates the impact of poor-quality face images in real-time screening applications, outperforming existing face quality assessment techniques while maintaining computational efficiency and specifically addressing critical challenges of face resolution variations and pose deviations in practical surveillance scenarios.
Abstract: Face image quality plays a critical role in determining the accuracy and reliability of face verification systems, particularly in real-time screening applications such as surveillance, identity verification, and access control. Low-quality face images, often caused by factors such as motion blur, poor lighting conditions, occlusions, and extreme pose variations, significantly degrade the performance of face recognition models, leading to higher false rejection and false acceptance rates. In this work, we propose a lightweight yet effective framework for automatic face quality assessment, which aims to pre-filter low-quality face images before they are passed to the verification pipeline. Our approach utilises normalised facial landmarks in conjunction with a Random Forest Regression classifier to assess image quality, achieving an accuracy of 96.67%. By integrating this quality assessment module into the face verification process, we observe a substantial improvement in performance, including a 99.7% reduction in the false rejection rate and enhanced cosine similarity scores when paired with the ArcFace face verification model. To validate our approach, we conducted experiments on a real-world dataset comprising over 600 subjects captured from Dubai Police CCTV footage in unconstrained environments. Our results demonstrate that the proposed framework effectively mitigates the impact of poor-quality face images, outperforming existing face quality assessment techniques while maintaining computational efficiency. Moreover, the framework specifically addresses two critical challenges in real-time screening: variations in face resolution and pose deviations, both of which are prevalent in practical surveillance scenarios.
[91] FW-VTON: Flattening-and-Warping for Person-to-Person Virtual Try-on
Zheng Wang, Xianbing Sun, Shengyi Wu, Jiahui Zhan, Jianlou Si, Chi Zhang, Liqing Zhang, Jianfu Zhang
Main category: cs.CV
TL;DR: This paper introduces FW-VTON, a novel virtual try-on method that enables person-to-person garment transfer using only two input images, achieving state-of-the-art performance through a three-stage approach of garment extraction, warping, and integration.
Details
Motivation: Traditional virtual try-on methods focus on garment-to-person tasks requiring flat garment representations, but there's a need for person-to-person try-on that can work with garments worn by different individuals using only two input images.
Method: FW-VTON operates in three stages: (1) extracting the flattened garment image from the source image, (2) warping the garment to align with the target pose, and (3) integrating the warped garment seamlessly onto the target person. The authors also created a new dataset specifically for person-to-person try-on scenarios.
Result: FW-VTON achieves state-of-the-art performance with superior results in both qualitative and quantitative assessments, and also excels in garment extraction subtasks compared to existing methods.
Conclusion: The proposed FW-VTON successfully addresses the person-to-person virtual try-on task, demonstrating superior performance over existing methods and establishing a new benchmark with the introduction of a specialized dataset for this task.
Abstract: Traditional virtual try-on methods primarily focus on the garment-to-person try-on task, which requires flat garment representations. In contrast, this paper introduces a novel approach to the person-to-person try-on task. Unlike the garment-to-person try-on task, the person-to-person task only involves two input images: one depicting the target person and the other showing the garment worn by a different individual. The goal is to generate a realistic combination of the target person with the desired garment. To this end, we propose Flattening-and-Warping Virtual Try-On (FW-VTON), a method that operates in three stages: (1) extracting the flattened garment image from the source image; (2) warping the garment to align with the target pose; and (3) integrating the warped garment seamlessly onto the target person. To overcome the challenges posed by the lack of high-quality datasets for this task, we introduce a new dataset specifically designed for person-to-person try-on scenarios. Experimental evaluations demonstrate that FW-VTON achieves state-of-the-art performance, with superior results in both qualitative and quantitative assessments, and also excels in garment extraction subtasks.
[92] Is Tracking really more challenging in First Person Egocentric Vision?
Matteo Dunnhofer, Zaira Manigrasso, Christian Micheloni
Main category: cs.CV
TL;DR: This paper introduces a benchmark study to separate the challenges of first-person viewpoint from human-object activity domain in egocentric visual tracking and segmentation, questioning whether performance drops are due to the egocentric perspective itself or the nature of human-object activities.
Details
Motivation: Previous research attributed poor performance in egocentric vision tracking to the first-person viewpoint, but these evaluations were conducted across different scenarios. The authors question whether the challenges come from the egocentric perspective itself or from the broader domain of human-object activities, which also appear in third-person videos.
Method: The authors introduce a new benchmark study with an evaluation strategy designed to disentangle and precisely separate challenges related to the first-person perspective from those linked to the broader domain of human-object activity understanding.
Result: The benchmark enables more precise separation of first-person viewpoint challenges from human-object activity domain challenges, providing deeper insights into the true sources of difficulty in egocentric tracking and segmentation.
Conclusion: The study facilitates more targeted advancements in egocentric tracking and segmentation by identifying the actual sources of difficulty, distinguishing between first-person perspective challenges and human-object activity domain challenges.
Abstract: Visual object tracking and segmentation are becoming fundamental tasks for understanding human activities in egocentric vision. Recent research has benchmarked state-of-the-art methods and concluded that first person egocentric vision presents challenges compared to previously studied domains. However, these claims are based on evaluations conducted across significantly different scenarios. Many of the challenging characteristics attributed to egocentric vision are also present in third person videos of human-object activities. This raises a critical question: how much of the observed performance drop stems from the unique first person viewpoint inherent to egocentric vision versus the domain of human-object activities? To address this question, we introduce a new benchmark study designed to disentangle such factors. Our evaluation strategy enables a more precise separation of challenges related to the first person perspective from those linked to the broader domain of human-object activity understanding. By doing so, we provide deeper insights into the true sources of difficulty in egocentric tracking and segmentation, facilitating more targeted advancements on this task.
[93] Artifacts and Attention Sinks: Structured Approximations for Efficient Vision Transformers
Andrew Lu, Wentinn Liao, Liuhui Wang, Huzheng Yang, Jianbo Shi
Main category: cs.CV
TL;DR: This paper analyzes massive tokens and artifact tokens in vision transformers that act as attention sinks, and proposes Fast Nyström Attention (FNA) - a training-free method that exploits these token patterns to approximate self-attention in linear time while maintaining competitive performance across multiple vision tasks.
Details
Motivation: Vision transformers’ inner workings remain partially understood, particularly the role of massive tokens (with high activation norms) and artifact tokens that emerge during inference. Understanding how these tokens regulate information flow through mutual suppression could lead to more efficient attention mechanisms.
Method: The authors introduce Fast Nyström Attention (FNA), a training-free method that approximates self-attention in linear time and space by exploiting structured patterns formed by massive and artifact tokens. They also propose a masking strategy to mitigate noise from these tokens.
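FNA builds on the Nyström approximation of softmax attention; a generic landmark-based version looks like this in PyTorch, with the paper's specific trick (choosing landmarks from the massive/artifact-token structure) abstracted into the `landmarks` indices.
```python
import torch

def nystrom_attention(q, k, v, landmarks):
    """Approximate softmax attention in O(n*m) using m landmark tokens."""
    ql, kl = q[landmarks], k[landmarks]              # (m, d) landmark projections
    scale = q.shape[-1] ** -0.5
    f = torch.softmax(q @ kl.T * scale, dim=-1)      # (n, m)
    a = torch.softmax(ql @ kl.T * scale, dim=-1)     # (m, m)
    b = torch.softmax(ql @ k.T * scale, dim=-1)      # (m, n)
    return f @ torch.linalg.pinv(a) @ (b @ v)        # (n, d), linear in n

n, d, m = 1024, 64, 32
q, k, v = (torch.randn(n, d) for _ in range(3))
out = nystrom_attention(q, k, v, torch.randperm(n)[:m])
```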
Result: FNA demonstrates competitive performance on retrieval, classification, segmentation, and visual question answering (VQA) tasks while reducing computational overhead. The masking strategy yields modest performance gains at virtually no cost.
Conclusion: The mutual suppression between massive tokens and artifact tokens plays a critical role in regulating information flow in vision transformers. By exploiting these structured patterns, FNA can approximate self-attention efficiently while maintaining performance across diverse vision tasks.
Abstract: Vision transformers have emerged as a powerful tool across a wide range of applications, yet their inner workings remain only partially understood. In this work, we examine the phenomenon of massive tokens (tokens with exceptionally high activation norms that act as attention sinks) and artifact tokens that emerge as a byproduct during inference. Our analysis reveals that these tokens mutually suppress one another through the attention mechanism, playing a critical role in regulating information flow within the network. Leveraging these insights, we introduce Fast Nyström Attention (FNA), a training-free method that approximates self-attention in linear time and space by exploiting the structured patterns formed by massive and artifact tokens. Additionally, we propose a masking strategy to mitigate noise from these tokens, yielding modest performance gains at virtually no cost. We evaluate our approach on popular pretrained vision backbones and demonstrate competitive performance on retrieval, classification, segmentation, and visual question answering (VQA), all while reducing computational overhead.
[94] Discovering and using Spelke segments
Rahul Venkatesh, Klemen Kotar, Lilian Naing Chen, Seungwoo Kim, Luca Thomas Wheeler, Jared Watrous, Ashley Xu, Gia Ancone, Wanhee Lee, Honglin Chen, Daniel Bear, Stefan Stojanov, Daniel Yamins
Main category: cs.CV
TL;DR: This paper introduces SpelkeNet, a visual world model that segments images based on Spelke objects (groups of things that move together physically) rather than semantic categories, achieving better performance than semantic segmentation methods like SAM on physical manipulation tasks.
Details
Motivation: Traditional computer vision segmentation relies on semantic categories, but developmental psychology suggests humans perceive the world through Spelke objects: physical groupings that move together under forces. This category-agnostic approach based on causal motion relationships could better support manipulation and planning tasks.
Method: The authors develop SpelkeNet, a visual world model that predicts future motion distributions. It estimates motion affordance maps (regions likely to move when poked) and expected-displacement maps (how the rest of the scene will move). These are used for “statistical counterfactual probing”, where virtual pokes are applied to high-affordance regions, and the resulting displacement maps define Spelke segments through correlated motion statistics.
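Statistical counterfactual probing can be sketched as: sample virtual pokes from the affordance map, collect the predicted displacement maps, and group pixels whose motion responses correlate with an anchor pixel. The sampling scheme and correlation threshold here are illustrative.
```python
import numpy as np

def spelke_segment(affordance, displacement_fn, n_pokes=20, thresh=0.8, seed=0):
    """affordance: (h, w) poke-likelihood map; displacement_fn(pos) -> (h, w)."""
    rng = np.random.default_rng(seed)
    h, w = affordance.shape
    pokes = rng.choice(h * w, size=n_pokes, p=affordance.ravel() / affordance.sum())
    resp = np.stack([displacement_fn(np.unravel_index(p, (h, w))).ravel()
                     for p in pokes])                        # (n_pokes, h*w)
    anchor = resp[:, resp.mean(axis=0).argmax()]             # most-responsive pixel
    z = (resp - resp.mean(0)) / (resp.std(0) + 1e-9)         # standardize per pixel
    za = (anchor - anchor.mean()) / (anchor.std() + 1e-9)
    corr = (z * za[:, None]).mean(axis=0)                    # correlation with anchor
    return (corr > thresh).reshape(h, w)                     # one Spelke segment
```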
Result: SpelkeNet outperforms supervised baselines like SegmentAnything (SAM) on the newly introduced SpelkeBench dataset. The Spelke concept also shows superior performance on the 3DEditBench benchmark for physical object manipulation when integrated into various off-the-shelf manipulation models.
Conclusion: Spelke object segmentation based on physical motion relationships provides a more effective approach for manipulation tasks compared to traditional semantic segmentation, demonstrating the practical value of incorporating developmental psychology insights into computer vision systems.
Abstract: Segments in computer vision are often defined by semantic considerations and are highly dependent on category-specific conventions. In contrast, developmental psychology suggests that humans perceive the world in terms of Spelke objects: groupings of physical things that reliably move together when acted on by physical forces. Spelke objects thus operate on category-agnostic causal motion relationships which potentially better support tasks like manipulation and planning. In this paper, we first benchmark the Spelke object concept, introducing the SpelkeBench dataset that contains a wide variety of well-defined Spelke segments in natural images. Next, to extract Spelke segments from images algorithmically, we build SpelkeNet, a class of visual world models trained to predict distributions over future motions. SpelkeNet supports estimation of two key concepts for Spelke object discovery: (1) the motion affordance map, identifying regions likely to move under a poke, and (2) the expected-displacement map, capturing how the rest of the scene will move. These concepts are used for “statistical counterfactual probing”, where diverse “virtual pokes” are applied on regions of high motion-affordance, and the resultant expected displacement maps are used to define Spelke segments as statistical aggregates of correlated motion statistics. We find that SpelkeNet outperforms supervised baselines like SegmentAnything (SAM) on SpelkeBench. Finally, we show that the Spelke concept is practically useful for downstream applications, yielding superior performance on the 3DEditBench benchmark for physical object manipulation when used in a variety of off-the-shelf object manipulation models.
[95] Disrupting Semantic and Abstract Features for Better Adversarial Transferability
Yuyang Luo, Xiaosen Wang, Zhijin Ge, Yingzhe He
Main category: cs.CV
TL;DR: This paper proposes SAFER, a feature-level adversarial attack that improves transferability by disrupting both semantic and abstract (high-frequency) features using BLOCKMIX and SELF-MIX transformations to compute better feature importance weight matrices.
Details
Motivation: Existing feature-level adversarial attacks primarily focus on manipulating semantic information when computing feature importance weights, but CNNs also rely heavily on high-frequency components (abstract features like texture and edges). This presents an opportunity to improve attack transferability by targeting both types of features.
Method: The authors propose SAFER (Semantic and Abstract FEatures disRuption), which uses BLOCKMIX on input images and SELF-MIX on the frequency spectrum when computing the feature importance weight matrix. This balanced approach allows the attack to disrupt both semantic and abstract features simultaneously.
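The frequency-domain side can be illustrated with a simple spectral block swap between two images; the actual BLOCKMIX/SELF-MIX operators are more involved than this sketch.
```python
import numpy as np

def spectrum_mix(img_a, img_b, block):
    """Swap one block of the (shifted) FFT spectrum of img_a with img_b's."""
    fa = np.fft.fftshift(np.fft.fft2(img_a))
    fb = np.fft.fftshift(np.fft.fft2(img_b))
    fa[block, block] = fb[block, block]       # corner blocks = high frequencies
    return np.real(np.fft.ifft2(np.fft.ifftshift(fa)))

a, b = np.random.rand(224, 224), np.random.rand(224, 224)
mixed = spectrum_mix(a, b, block=slice(0, 56))
```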
Result: Extensive experiments on the ImageNet dataset demonstrate that SAFER effectively boosts adversarial transferability compared to existing feature-level attacks that only focus on semantic features.
Conclusion: By incorporating both semantic and abstract feature disruption through frequency domain transformations, SAFER achieves improved transferability in black-box adversarial attacks against deep neural networks, making it more effective for real-world applications.
Abstract: Adversarial examples pose significant threats to deep neural networks (DNNs), and their property of transferability in the black-box setting has led to the emergence of transfer-based attacks, making it feasible to target real-world applications employing DNNs. Among them, feature-level attacks, where intermediate features are perturbed based on feature importance weight matrix computed from transformed images, have gained popularity. In this work, we find that existing feature-level attacks primarily manipulate the semantic information to derive the weight matrix. Inspired by several works that find CNNs tend to focus more on high-frequency components (a.k.a. abstract features, e.g., texture, edge, etc.), we validate that transforming images in the high-frequency space also improves transferability. Based on this finding, we propose a balanced approach called Semantic and Abstract FEatures disRuption (SAFER). Specifically, SAFER conducts BLOCKMIX on the input image and SELF-MIX on the frequency spectrum when computing the weight matrix to highlight crucial features. By using such a weight matrix, we can direct the attacker to disrupt both semantic and abstract features, leading to improved transferability. Extensive experiments on the ImageNet dataset also demonstrate the effectiveness of our method in boosting adversarial transferability.
[96] Optimization of DNN-based HSI Segmentation FPGA-based SoC for ADS: A Practical Approach
Jon Gutiérrez-Zaballa, Koldo Basterretxea, Javier Echanobe
Main category: cs.CV
TL;DR: This paper presents optimization techniques for deploying DNN-based hyperspectral imaging (HSI) segmentation systems on FPGA-based SoCs for autonomous driving systems, achieving significant computational efficiency improvements while maintaining accuracy.
Details
Motivation: Safety-critical autonomous driving systems require HSI-based vision systems that can operate under strict constraints on latency, resource consumption, and security. Current DNN models are over-parameterized and computationally intensive, making real-time edge deployment challenging, especially when combined with the intensive data preprocessing requirements of HSI.
Method: The authors employ a comprehensive software/hardware co-design approach including: (1) functional software/hardware task distribution, (2) hardware-aware preprocessing optimization, (3) ML model compression techniques, and (4) complete pipelined deployment on FPGA-based SoC platforms.
Result: The compression techniques successfully reduced the DNN complexity to 24.34% of original operations and 1.02% of original parameters, achieving a 2.86x speed-up in inference tasks without noticeable degradation in segmentation accuracy.
Conclusion: The proposed optimization framework enables practical deployment of DNN-based HSI segmentation systems on resource-constrained edge platforms for autonomous driving applications, demonstrating that significant computational efficiency gains can be achieved while preserving system performance.
Abstract: The use of HSI for autonomous navigation is a promising research field aimed at improving the accuracy and robustness of detection, tracking, and scene understanding systems based on vision sensors. Combining advanced computer algorithms, such as DNNs, with small-size snapshot HSI cameras enhances the reliability of these systems. HSI overcomes intrinsic limitations of greyscale and RGB imaging in depicting physical properties of targets, particularly regarding spectral reflectance and metamerism. Despite promising results in HSI-based vision developments, safety-critical systems like ADS demand strict constraints on latency, resource consumption, and security, motivating the shift of ML workloads to edge platforms. This involves a thorough software/hardware co-design scheme to distribute and optimize the tasks efficiently among the limited resources of computing platforms. With respect to inference, the over-parameterized nature of DNNs poses significant computational challenges for real-time on-the-edge deployment. In addition, the intensive data preprocessing required by HSI, which is frequently overlooked, must be carefully managed in terms of memory arrangement and inter-task communication to enable an efficient integrated pipeline design on a SoC. This work presents a set of optimization techniques for the practical co-design of a DNN-based HSI segmentation processor deployed on a FPGA-based SoC targeted at ADS, including key optimizations such as functional software/hardware task distribution, hardware-aware preprocessing, ML model compression, and a complete pipelined deployment. Applied compression techniques significantly reduce the complexity of the designed DNN to 24.34% of the original operations and to 1.02% of the original number of parameters, achieving a 2.86x speed-up in the inference task without noticeable degradation of the segmentation accuracy.
[97] Improving Personalized Image Generation through Social Context Feedback
Parul Gupta, Abhinav Dhall, Thanh-Toan Do
Main category: cs.CV
TL;DR: This paper proposes a feedback-based fine-tuning approach for personalized image generation that uses state-of-the-art detectors for pose, human-object-interaction, facial recognition, and gaze estimation to improve the quality of generated images with better human poses, preserved identities, and natural gaze patterns.
Details
Motivation: Existing personalized image generation methods suffer from three major limitations: improper generation of complex activities with incorrect human poses, failure to preserve reference human identities, and unnatural/inconsistent human gaze patterns that don't match scene descriptions.
Method: The authors propose feedback-based fine-tuning of existing personalized generation methods using state-of-the-art detectors for pose, human-object-interaction, facial recognition, and gaze-point estimation. They introduce timestep-based integration of different feedback modules, categorizing signals as low-level (human pose) or high-level (gaze point) to refine the diffusion model.
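The timestep-based integration amounts to routing feedback modules by noise level; which band receives which module is an assumption in this sketch, not the paper's schedule.
```python
def feedback_loss(x0_pred, t, t_max, pose_feedback, gaze_feedback, split=0.5):
    """Route detector feedback by diffusion timestep t (t_max = noisiest step)."""
    if t > split * t_max:                 # early, high-noise steps fix coarse layout
        return pose_feedback(x0_pred)     # low-level signal (e.g. human pose)
    return gaze_feedback(x0_pred)         # late steps refine high-level gaze cues
```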
Result: The generated images demonstrate improved interactions, facial identities, and overall image quality compared to existing methods, as evaluated on three benchmark datasets.
Conclusion: The feedback-based fine-tuning approach successfully addresses the key limitations of personalized image generation by incorporating multiple specialized detectors and timestep-based feedback integration, resulting in more accurate and realistic personalized images with better pose accuracy, identity preservation, and natural gaze patterns.
Abstract: Personalized image generation, where reference images of one or more subjects are used to generate their image according to a scene description, has gathered significant interest in the community. However, such generated images suffer from three major limitations: complex activities, such as <man, pushing, motorcycle>, are not generated properly, with incorrect human poses; reference human identities are not preserved; and generated human gaze patterns are unnatural/inconsistent with the scene description. In this work, we propose to overcome these shortcomings through feedback-based fine-tuning of existing personalized generation methods, wherein state-of-the-art detectors of pose, human-object-interaction, human facial recognition and human gaze-point estimation are used to refine the diffusion model. We also propose timestep-based inculcation of different feedback modules, depending upon whether the signal is low-level (such as human pose), or high-level (such as gaze point). The images generated in this manner show an improvement in the generated interactions, facial identities and image quality over three benchmark datasets.
[98] Stop-band Energy Constraint for Orthogonal Tunable Wavelet Units in Convolutional Neural Networks for Computer Vision problems
An D. Le, Hung Nguyen, Sungbal Seo, You-Suk Bae, Truong Q. Nguyen
Main category: cs.CV
TL;DR: This paper introduces stop-band energy constraints for orthogonal tunable wavelet units in CNNs to improve image classification and anomaly detection, achieving significant accuracy gains on texture-rich datasets like CIFAR-10 and Describable Textures when integrated into ResNet architectures.
Details
Motivation: Improve CNN performance on image classification and anomaly detection tasks, particularly for texture-rich datasets, by leveraging wavelet-based filtering that can capture textural features better than standard convolution operations.
Method: Introduces stop-band energy constraints for filters in orthogonal tunable wavelet units with a lattice structure, integrated into CNN architectures (ResNet-18/34) to enhance convolution, pooling, and downsampling operations.
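The constraint can be rendered as a differentiable penalty on a filter's frequency response; the cutoff and FFT grid below are illustrative choices, not the paper's exact formulation.
```python
import torch

def stopband_energy(taps, cutoff=0.5, n_fft=512):
    """Energy of a 1-D low-pass filter's response above `cutoff` (1.0 = Nyquist)."""
    H = torch.fft.rfft(taps, n=n_fft)                # sampled frequency response
    freqs = torch.linspace(0, 1, H.numel())
    return (H.abs()[freqs > cutoff] ** 2).sum()      # differentiable w.r.t. taps

# Used as a regularizer on the tunable wavelet filters:
# loss = task_loss + lam * stopband_energy(lowpass_taps)
```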
Result: The method achieves accuracy improvements of 2.48% on CIFAR-10 and 13.56% on Describable Textures dataset when integrated into ResNet-18, with similar improvements observed in ResNet-34. On MVTec hazelnut anomaly detection, it achieves competitive results in both segmentation and detection tasks, outperforming existing approaches.
Conclusion: The proposed stop-band energy constraint for orthogonal tunable wavelet units effectively enhances CNN performance on both image classification and anomaly detection tasks, particularly excelling on texture-rich datasets and demonstrating its effectiveness across different ResNet architectures.
Abstract: This work introduces a stop-band energy constraint for filters in orthogonal tunable wavelet units with a lattice structure, aimed at improving image classification and anomaly detection in CNNs, especially on texture-rich datasets. Integrated into ResNet-18, the method enhances convolution, pooling, and downsampling operations, yielding accuracy gains of 2.48% on CIFAR-10 and 13.56% on the Describable Textures dataset. Similar improvements are observed in ResNet-34. On the MVTec hazelnut anomaly detection task, the proposed method achieves competitive results in both segmentation and detection, outperforming existing approaches.
[99] PUSA V1.0: Surpassing Wan-I2V with $500 Training Cost by Vectorized Timestep Adaptation
Yaofang Liu, Yumeng Ren, Aitor Artola, Yuxuan Hu, Xiaodong Cun, Xiaotong Zhao, Alan Zhao, Raymond H. Chan, Suiyun Zhang, Rui Liu, Dandan Tu, Jean-Michel Morel
Main category: cs.CV
TL;DR: Pusa introduces vectorized timestep adaptation (VTA) for video diffusion models, achieving state-of-the-art image-to-video generation performance with dramatically reduced training costs (1/200th) and dataset requirements (1/2500th) while enabling zero-shot multi-task capabilities.
Details
Motivation: Current video diffusion models are limited by rigid temporal modeling through scalar timestep variables, leading to computational inefficiency, catastrophic forgetting, and narrow applicability in existing task-specific adaptations and autoregressive approaches.
Method: The paper presents Pusa, which uses vectorized timestep adaptation (VTA) - a non-destructive adaptation technique that enables fine-grained temporal control within a unified video diffusion framework by finetuning the Wan2.1-T2V-14B model while preserving base model capabilities.
Result: Pusa achieves a VBench-I2V total score of 87.32% (vs. 86.86% for Wan-I2V-14B) with only $500 training cost versus ≥$100,000 and 4K samples versus ≥10M samples. It enables zero-shot capabilities including start-end frames, video extension, and text-to-video generation without task-specific training.
Conclusion: The work establishes a scalable, efficient, and versatile paradigm for video synthesis that preserves foundation model generative priors while surgically injecting temporal dynamics, democratizing high-fidelity video generation for both research and industry applications.
Abstract: The rapid advancement of video diffusion models has been hindered by fundamental limitations in temporal modeling, particularly the rigid synchronization of frame evolution imposed by conventional scalar timestep variables. While task-specific adaptations and autoregressive models have sought to address these challenges, they remain constrained by computational inefficiency, catastrophic forgetting, or narrow applicability. In this work, we present Pusa, a groundbreaking paradigm that leverages vectorized timestep adaptation (VTA) to enable fine-grained temporal control within a unified video diffusion framework. Moreover, VTA is a non-destructive adaptation, which means it fully preserves the capabilities of the base model. By finetuning the SOTA Wan2.1-T2V-14B model with VTA, we achieve unprecedented efficiency – surpassing the performance of Wan-I2V-14B with ≤ 1/200 of the training cost ($500 vs. ≥ $100,000) and ≤ 1/2500 of the dataset size (4K vs. ≥ 10M samples). Pusa not only sets a new standard for image-to-video (I2V) generation, achieving a VBench-I2V total score of 87.32% (vs. 86.86% of Wan-I2V-14B), but also unlocks many zero-shot multi-task capabilities such as start-end frames and video extension – all without task-specific training. Meanwhile, Pusa can still perform text-to-video generation. Mechanistic analyses reveal that our approach preserves the foundation model’s generative priors while surgically injecting temporal dynamics, avoiding the combinatorial explosion inherent to vectorized timesteps. This work establishes a scalable, efficient, and versatile paradigm for next-generation video synthesis, democratizing high-fidelity video generation for research and industry alike. Code is open-sourced at https://github.com/Yaofang-Liu/Pusa-VidGen
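The abstract does not include code, but the central idea as described — replacing the scalar diffusion timestep with one timestep per frame — can be sketched as below; the embedding dimension, frequency schedule, and class name are illustrative stand-ins, not Pusa's implementation.

```python
import torch
import torch.nn as nn

class VectorizedTimestepEmbed(nn.Module):
    """Embed one diffusion timestep per video frame instead of a single scalar."""

    def __init__(self, dim: int = 128):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.SiLU(), nn.Linear(dim, dim))
        self.dim = dim

    def sinusoidal(self, t: torch.Tensor) -> torch.Tensor:
        half = self.dim // 2
        freqs = torch.exp(-torch.arange(half, dtype=torch.float32) * (10.0 / half))
        args = t[..., None].float() * freqs
        return torch.cat([args.sin(), args.cos()], dim=-1)

    def forward(self, t_vec: torch.Tensor) -> torch.Tensor:
        # t_vec: (batch, num_frames) -- an independent noise level per frame,
        # which is what allows e.g. a clean first frame (I2V) with noisy rest.
        return self.mlp(self.sinusoidal(t_vec))  # (batch, num_frames, dim)

# I2V-style conditioning: frame 0 is kept clean (t=0), later frames are noisy.
t_vec = torch.tensor([[0, 700, 700, 700]])
emb = VectorizedTimestepEmbed()(t_vec)  # per-frame conditioning signal
```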
[100] Universal Wavelet Units in 3D Retinal Layer Segmentation
An D. Le, Hung Nguyen, Melanie Tran, Jesse Most, Dirk-Uwe G. Bartsch, William R Freeman, Shyamanga Borooah, Truong Q. Nguyen, Cheolhong An
Main category: cs.CV
TL;DR: First study applying tunable wavelet units (UwUs) for 3D retinal layer segmentation from OCT volumes, integrating three wavelet-based downsampling modules into MGU-Net architecture to overcome conventional max-pooling limitations and achieve improved segmentation accuracy.
Details
Motivation: Conventional max-pooling has limitations in preserving spatial detail and structural consistency in 3D retinal layer segmentation from OCT volumes. There is a need for better downsampling methods that can maintain both low- and high-frequency features for accurate volumetric medical image segmentation.
Method: Integration of three wavelet-based downsampling modules (OrthLattUwU, BiorthLattUwU, and LS-BiorthLattUwU) into a motion-corrected MGU-Net architecture. These modules use learnable lattice filter banks to preserve both low- and high-frequency features during downsampling operations.
Result: Evaluation on the Jacobs Retina Center (JRC) OCT dataset demonstrated significant improvements in accuracy and Dice score compared to conventional approaches. The LS-BiorthLattUwU module showed particularly strong performance among the three wavelet-based modules tested.
Conclusion: Tunable wavelet filters provide substantial benefits for volumetric medical image segmentation, particularly in 3D retinal layer segmentation. The proposed wavelet-based downsampling modules successfully overcome the limitations of conventional max-pooling and enhance both spatial detail preservation and structural consistency in OCT volume analysis.
Abstract: This paper presents the first study to apply tunable wavelet units (UwUs) for 3D retinal layer segmentation from Optical Coherence Tomography (OCT) volumes. To overcome the limitations of conventional max-pooling, we integrate three wavelet-based downsampling modules, OrthLattUwU, BiorthLattUwU, and LS-BiorthLattUwU, into a motion-corrected MGU-Net architecture. These modules use learnable lattice filter banks to preserve both low- and high-frequency features, enhancing spatial detail and structural consistency. Evaluated on the Jacobs Retina Center (JRC) OCT dataset, our framework shows significant improvement in accuracy and Dice score, particularly with LS-BiorthLattUwU, highlighting the benefits of tunable wavelet filters in volumetric medical image segmentation.
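As a rough illustration of wavelet-based downsampling replacing max-pooling, the sketch below uses fixed Haar filters; the paper's modules instead learn lattice filter banks, so treat this only as the structural idea.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HaarDownsample(nn.Module):
    """Replace stride-2 max-pooling with a 2-D Haar analysis step.

    Keeps all four subbands (LL, LH, HL, HH) as channels, so high-frequency
    detail is preserved instead of discarded. Fixed Haar taps are used here
    only for brevity; the paper's units learn their filters.
    """

    def __init__(self):
        super().__init__()
        ll = torch.tensor([[0.5, 0.5], [0.5, 0.5]])
        lh = torch.tensor([[0.5, 0.5], [-0.5, -0.5]])
        hl = torch.tensor([[0.5, -0.5], [0.5, -0.5]])
        hh = torch.tensor([[0.5, -0.5], [-0.5, 0.5]])
        self.register_buffer("k", torch.stack([ll, lh, hl, hh])[:, None])  # (4,1,2,2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        # Apply the 4 analysis filters depthwise, stride 2.
        y = F.conv2d(x.reshape(b * c, 1, h, w), self.k, stride=2)
        return y.reshape(b, 4 * c, h // 2, w // 2)

x = torch.randn(2, 64, 32, 32)
print(HaarDownsample()(x).shape)  # torch.Size([2, 256, 16, 16])
```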
[101] LongSplat: Online Generalizable 3D Gaussian Splatting from Long Sequence Images
Guichen Huang, Ruoyu Wang, Xiangjun Gao, Che Sun, Yuwei Wu, Shenghua Gao, Yunde Jia
Main category: cs.CV
TL;DR: LongSplat is an online real-time 3D Gaussian reconstruction framework that uses a streaming update mechanism and Gaussian-Image Representation (GIR) to efficiently handle long-sequence image input while maintaining high-quality novel view synthesis and reducing computational costs.
Details
Motivation: Existing 3D Gaussian Splatting methods are limited for online long-sequence scenarios because they either require slow per-scene optimization or lack efficient incremental updates, which hinders continuous performance in real-time applications.
Method: The paper introduces LongSplat with two key components: (1) A streaming update mechanism that incrementally integrates current-view observations while selectively compressing redundant historical Gaussians, and (2) Gaussian-Image Representation (GIR) that encodes 3D Gaussian parameters into a structured 2D image-like format to enable efficient fusion and identity-aware redundancy compression. The method also leverages existing image compression techniques to generate more compact and higher-quality 3D Gaussians.
Result: LongSplat achieves state-of-the-art efficiency-quality trade-offs in real-time novel view synthesis, delivering real-time reconstruction performance while reducing Gaussian counts by 44% compared to existing per-pixel Gaussian prediction methods.
Conclusion: The proposed LongSplat framework successfully addresses the limitations of existing 3D Gaussian Splatting methods for online long-sequence scenarios by providing an efficient streaming update mechanism and novel representation that enables real-time reconstruction without overwhelming memory or computational costs.
Abstract: 3D Gaussian Splatting achieves high-fidelity novel view synthesis, but its application to online long-sequence scenarios is still limited. Existing methods either rely on slow per-scene optimization or fail to provide efficient incremental updates, hindering continuous performance. In this paper, we propose LongSplat, an online real-time 3D Gaussian reconstruction framework designed for long-sequence image input. The core idea is a streaming update mechanism that incrementally integrates current-view observations while selectively compressing redundant historical Gaussians. Crucial to this mechanism is our Gaussian-Image Representation (GIR), a representation that encodes 3D Gaussian parameters into a structured, image-like 2D format. GIR simultaneously enables efficient fusion of current-view and historical Gaussians and identity-aware redundancy compression. These functions enable online reconstruction and adapt the model to long sequences without overwhelming memory or computational costs. Furthermore, we leverage an existing image compression method to guide the generation of more compact and higher-quality 3D Gaussians. Extensive evaluations demonstrate that LongSplat achieves state-of-the-art efficiency-quality trade-offs in real-time novel view synthesis, delivering real-time reconstruction while reducing Gaussian counts by 44% compared to existing per-pixel Gaussian prediction methods.
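A minimal sketch of the GIR idea — flattening per-pixel Gaussian parameters into an image-like tensor so that 2-D fusion and compression operations apply directly. The 14-channel layout (mean, rotation quaternion, scale, opacity, color) is a hypothetical assumption; the actual format is defined in the paper.

```python
import torch

def to_gir(gaussians: torch.Tensor, h: int, w: int) -> torch.Tensor:
    """Pack per-pixel Gaussian parameters into an image-like 2-D tensor.

    gaussians: (h*w, C) with an assumed C = 3 (mean) + 4 (rotation quat)
               + 3 (scale) + 1 (opacity) + 3 (color) = 14 channels.
    Returns (C, h, w), so image-space ops (fusion, compression) apply directly.
    """
    return gaussians.T.reshape(-1, h, w)

def from_gir(gir: torch.Tensor) -> torch.Tensor:
    """Invert the packing back to a flat list of Gaussians."""
    c, h, w = gir.shape
    return gir.reshape(c, h * w).T

gir = to_gir(torch.randn(480 * 640, 14), 480, 640)  # (14, 480, 640)
recovered = from_gir(gir)                           # (480*640, 14)
```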
[102] SPACT18: Spiking Human Action Recognition Benchmark Dataset with Complementary RGB and Thermal Modalities
Yasser Ashraf, Ahmed Sharshar, Velibor Bojkovic, Bin Gu
Main category: cs.CV
TL;DR: This paper introduces the first video action recognition dataset using spike cameras, synchronized with RGB and thermal data, to benchmark Spiking Neural Networks for energy-efficient video understanding.
Details
Motivation: Spike cameras offer ultra-high energy efficiency and exceptional temporal resolution compared to traditional cameras, but lack dedicated datasets for video action recognition tasks. There's a need for comprehensive benchmarking platforms that preserve the inherent sparsity and temporal precision of spiking data for multimodal video understanding.
Method: The authors created three synchronized datasets combining spike camera data with RGB and thermal modalities for video action recognition. The datasets preserve the natural sparsity and temporal characteristics of spike-based sensors to enable direct comparison between spiking, thermal, and RGB modalities.
Result: Successfully established the first video action recognition dataset using spike cameras with synchronized multimodal data (RGB and thermal), providing a unique platform for exploring multimodal video understanding and enabling comprehensive benchmarking for Spiking Neural Networks.
Conclusion: This work provides a valuable resource for advancing research in energy-efficient, ultra-low-power video understanding by offering novel spike-based datasets that will drive development in action recognition tasks using bio-inspired vision sensors.
Abstract: Spike cameras, bio-inspired vision sensors, asynchronously fire spikes by accumulating light intensities at each pixel, offering ultra-high energy efficiency and exceptional temporal resolution. Unlike event cameras, which record changes in light intensity to capture motion, spike cameras provide even finer spatiotemporal resolution and a more precise representation of continuous changes. In this paper, we introduce the first video action recognition (VAR) dataset using a spike camera, alongside synchronized RGB and thermal modalities, to enable comprehensive benchmarking for Spiking Neural Networks (SNNs). By preserving the inherent sparsity and temporal precision of spiking data, our three datasets offer a unique platform for exploring multimodal video understanding and serve as a valuable resource for directly comparing spiking, thermal, and RGB modalities. This work contributes a novel dataset that will drive research in energy-efficient, ultra-low-power video understanding, specifically for action recognition tasks using spike-based data.
[103] LSSGen: Leveraging Latent Space Scaling in Flow and Diffusion for Efficient Text to Image Generation
Jyun-Ze Tang, Chih-Fan Hsu, Jeng-Lin Li, Ming-Ching Chang, Wei-Chao Chen
Main category: cs.CV
TL;DR: The paper proposes LSSGen, a framework that performs resolution scaling directly in latent space using a lightweight upsampler, improving efficiency and visual quality in text-to-image generation while avoiding artifacts from traditional pixel-space scaling methods.
Details
Motivation: Traditional methods that downscale and upscale in pixel space for speeding up text-to-image synthesis introduce artifacts and distortions when upscaled images are re-encoded into latent space, leading to degraded final image quality.
Method: Latent Space Scaling Generation (LSSGen) framework that performs resolution scaling directly in the latent space using a lightweight latent upsampler, without altering the Transformer or U-Net architecture.
Result: LSSGen significantly outperforms conventional scaling approaches in text-image alignment and perceptual quality evaluation, achieving up to 246% TOPIQ score improvement when generating 1024² images at similar speeds.
Conclusion: LSSGen successfully addresses the artifacts and quality degradation issues of traditional pixel-space scaling by operating directly in latent space, providing improved efficiency, visual quality, and flexible multi-resolution generation capabilities.
Abstract: Flow matching and diffusion models have shown impressive results in text-to-image generation, producing photorealistic images through an iterative denoising process. A common strategy to speed up synthesis is to perform early denoising at lower resolutions. However, traditional methods that downscale and upscale in pixel space often introduce artifacts and distortions. These issues arise when the upscaled images are re-encoded into the latent space, leading to degraded final image quality. To address this, we propose Latent Space Scaling Generation (LSSGen), a framework that performs resolution scaling directly in the latent space using a lightweight latent upsampler. Without altering the Transformer or U-Net architecture, LSSGen improves both efficiency and visual quality while supporting flexible multi-resolution generation. Our comprehensive evaluation covering text-image alignment and perceptual quality shows that LSSGen significantly outperforms conventional scaling approaches. When generating 1024² images at similar speeds, it achieves up to 246% TOPIQ score improvement.
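A minimal sketch of a lightweight latent upsampler of the kind described, assuming 4-channel SD-style latents and a residual design around bilinear interpolation; LSSGen's actual architecture may differ.

```python
import torch
import torch.nn as nn

class LatentUpsampler(nn.Module):
    """Lightweight 2x upsampler operating directly on diffusion latents.

    Early denoising steps run at low resolution; the partially denoised latent
    is then upsampled here and denoising continues at full resolution, avoiding
    the decode -> pixel-resize -> re-encode round trip that introduces artifacts.
    """

    def __init__(self, channels: int = 4, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(channels, hidden, 3, padding=1),
            nn.SiLU(),
            nn.Upsample(scale_factor=2, mode="nearest"),
            nn.Conv2d(hidden, channels, 3, padding=1),
        )

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        # Residual around plain interpolation: the network only predicts detail.
        base = nn.functional.interpolate(z, scale_factor=2, mode="bilinear")
        return base + self.net(z)

z_lo = torch.randn(1, 4, 64, 64)  # latent for a 512x512 image (8x VAE stride)
z_hi = LatentUpsampler()(z_lo)    # latent for 1024x1024, shape (1, 4, 128, 128)
```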
[104] DisCoPatch: Taming Adversarially-driven Batch Statistics for Improved Out-of-Distribution Detection
Francisco Caetano, Christiaan Viviers, Luis A. Zavala-Mondragón, Peter H. N. de With, Fons van der Sommen
Main category: cs.CV
TL;DR: DisCoPatch is an unsupervised Adversarial VAE framework that exploits batch statistics differences between real and adversarial samples in BN-trained discriminators to achieve state-of-the-art OOD detection, particularly excelling at detecting subtle covariate shifts with high efficiency (25MB model, low latency).
Details
Motivation: While semantic and domain-shift OOD detection are well-studied, covariate shifts (subtle data distribution variations) remain challenging. The authors hypothesize that detecting these subtle shifts can improve understanding of in-distribution boundaries and ultimately enhance OOD detection performance.
Method: DisCoPatch uses an unsupervised Adversarial VAE framework that exploits the property that real and adversarial samples form distinct domains with unique batch statistics in BN-trained discriminators. During inference, batches consist of patches from the same image for consistent distribution. The VAE’s suboptimal outputs (generated and reconstructed samples) serve as negative samples to train the discriminator, tightening the boundary between in-distribution samples and covariate shifts.
Result: Achieves state-of-the-art performance with 95.5% AUROC on ImageNet-1K(-C) for covariate shift detection and 95.0% on Near-OOD benchmarks, outperforming all prior methods. The model is highly efficient with only 25MB size and notably lower latency than existing methods.
Conclusion: DisCoPatch successfully addresses the challenging problem of covariate shift detection in OOD scenarios by leveraging batch statistics differences in adversarial training. The method achieves superior performance while maintaining high efficiency, making it a practical solution for real-world OOD detection applications.
Abstract: Out-of-distribution (OOD) detection holds significant importance across many applications. While semantic and domain-shift OOD problems are well-studied, this work focuses on covariate shifts - subtle variations in the data distribution that can degrade machine learning performance. We hypothesize that detecting these subtle shifts can improve our understanding of in-distribution boundaries, ultimately improving OOD detection. In adversarial discriminators trained with Batch Normalization (BN), real and adversarial samples form distinct domains with unique batch statistics - a property we exploit for OOD detection. We introduce DisCoPatch, an unsupervised Adversarial Variational Autoencoder (VAE) framework that harnesses this mechanism. During inference, batches consist of patches from the same image, ensuring a consistent data distribution that allows the model to rely on batch statistics. DisCoPatch uses the VAE’s suboptimal outputs (generated and reconstructed) as negative samples to train the discriminator, thereby improving its ability to delineate the boundary between in-distribution samples and covariate shifts. By tightening this boundary, DisCoPatch achieves state-of-the-art results in public OOD detection benchmarks. The proposed model not only excels in detecting covariate shifts, achieving 95.5% AUROC on ImageNet-1K(-C), but also outperforms all prior methods on public Near-OOD (95.0%) benchmarks. With a compact model size of 25MB, it achieves high OOD detection performance at notably lower latency than existing methods, making it an efficient and practical solution for real-world OOD detection applications. The code is publicly available.
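The batch-statistics trick hinges on building inference batches from patches of a single image; a sketch follows, with the patch size and stride as arbitrary choices rather than the paper's settings.

```python
import torch

def patch_batch(image: torch.Tensor, patch: int = 64, stride: int = 64) -> torch.Tensor:
    """Build an inference batch of patches from a single image.

    Because every element in the batch comes from the same image, the
    discriminator's BatchNorm statistics reflect one coherent distribution;
    a covariate-shifted input then shows up as anomalous batch statistics.
    """
    c, h, w = image.shape
    patches = image.unfold(1, patch, stride).unfold(2, patch, stride)
    patches = patches.permute(1, 2, 0, 3, 4).reshape(-1, c, patch, patch)
    return patches  # (num_patches, C, patch, patch)

img = torch.randn(3, 256, 256)
batch = patch_batch(img)  # (16, 3, 64, 64) batch fed to the discriminator
```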
[105] AMMNet: An Asymmetric Multi-Modal Network for Remote Sensing Semantic Segmentation
Hui Ye, Haodong Chen, Zeke Zexi Hu, Xiaoming Chen, Yuk Ying Chung
Main category: cs.CV
TL;DR: AMMNet is a novel asymmetric multi-modal network for semantic segmentation in remote sensing that efficiently integrates RGB imagery and Digital Surface Model (DSM) data while addressing computational complexity and modality misalignment issues through three key components: Asymmetric Dual Encoder, Asymmetric Prior Fuser, and Distribution Alignment module.
Details
Motivation: Current multi-modal semantic segmentation methods in remote sensing face two major limitations when integrating RGB and DSM data: (1) increased computational complexity due to architectural redundancy, and (2) degraded segmentation performance caused by modality misalignment. These issues undermine efficiency and robustness, particularly in complex urban environments where precise multi-modal integration is essential.
Method: The proposed AMMNet employs three key designs: (1) Asymmetric Dual Encoder (ADE) that assigns different representational capacities based on modality characteristics - deeper encoder for RGB to capture rich contextual information and lightweight encoder for DSM to extract sparse structural features; (2) Asymmetric Prior Fuser (APF) that integrates a modality-aware prior matrix to generate structure-aware contextual features; (3) Distribution Alignment (DA) module that enhances cross-modal compatibility by aligning feature distributions through divergence minimization.
Result: Extensive experiments on ISPRS Vaihingen and Potsdam datasets demonstrate that AMMNet achieves state-of-the-art segmentation accuracy among multi-modal networks while simultaneously reducing computational and memory requirements compared to existing methods.
Conclusion: AMMNet successfully addresses the key challenges in multi-modal remote sensing semantic segmentation by providing an asymmetric architecture that balances computational efficiency with segmentation performance, making it particularly suitable for complex urban environment analysis where both RGB contextual information and DSM structural data are crucial.
Abstract: Semantic segmentation in remote sensing (RS) has advanced significantly with the incorporation of multi-modal data, particularly the integration of RGB imagery and the Digital Surface Model (DSM), which provides complementary contextual and structural information about the ground object. However, integrating RGB and DSM often faces two major limitations: increased computational complexity due to architectural redundancy, and degraded segmentation performance caused by modality misalignment. These issues undermine the efficiency and robustness of semantic segmentation, particularly in complex urban environments where precise multi-modal integration is essential. To overcome these limitations, we propose Asymmetric Multi-Modal Network (AMMNet), a novel asymmetric architecture that achieves robust and efficient semantic segmentation through three designs tailored for RGB-DSM input pairs. To reduce architectural redundancy, the Asymmetric Dual Encoder (ADE) module assigns representational capacity based on modality-specific characteristics, employing a deeper encoder for RGB imagery to capture rich contextual information and a lightweight encoder for DSM to extract sparse structural features. Besides, to facilitate modality alignment, the Asymmetric Prior Fuser (APF) integrates a modality-aware prior matrix into the fusion process, enabling the generation of structure-aware contextual features. Additionally, the Distribution Alignment (DA) module enhances cross-modal compatibility by aligning feature distributions through divergence minimization. Extensive experiments on the ISPRS Vaihingen and Potsdam datasets demonstrate that AMMNet attains state-of-the-art segmentation accuracy among multi-modal networks while reducing computational and memory requirements.
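A toy sketch of the asymmetric-encoder idea — a deeper RGB branch and a much lighter DSM branch feeding a simple fusion — with depths and widths chosen arbitrarily; AMMNet's actual ADE/APF/DA modules are more elaborate.

```python
import torch
import torch.nn as nn

def conv_block(cin: int, cout: int, n_layers: int) -> nn.Sequential:
    """n_layers conv+ReLU pairs followed by a stride-2 pool."""
    layers = []
    for i in range(n_layers):
        layers += [nn.Conv2d(cin if i == 0 else cout, cout, 3, padding=1), nn.ReLU()]
    layers.append(nn.MaxPool2d(2))
    return nn.Sequential(*layers)

class AsymmetricDualEncoder(nn.Module):
    """Deep branch for RGB context, shallow branch for sparse DSM structure."""

    def __init__(self):
        super().__init__()
        self.rgb = nn.Sequential(conv_block(3, 64, 3), conv_block(64, 128, 3))
        self.dsm = nn.Sequential(conv_block(1, 32, 1), conv_block(32, 64, 1))
        self.fuse = nn.Conv2d(128 + 64, 128, 1)  # simple concat-fusion stand-in

    def forward(self, rgb: torch.Tensor, dsm: torch.Tensor) -> torch.Tensor:
        f = torch.cat([self.rgb(rgb), self.dsm(dsm)], dim=1)
        return self.fuse(f)

out = AsymmetricDualEncoder()(torch.randn(1, 3, 128, 128), torch.randn(1, 1, 128, 128))
print(out.shape)  # torch.Size([1, 128, 32, 32])
```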
[106] One-Shot Affordance Grounding of Deformable Objects in Egocentric Organizing Scenes
Wanjun Jia, Fan Yang, Mengfei Duan, Xianchi Chen, Yinxi Wang, Yiming Jiang, Wenrui Chen, Kailun Yang, Zhiyong Li
Main category: cs.CV
TL;DR: This paper proposes OS-AGDO, a one-shot learning method for robots to recognize and manipulate deformable objects in organizing scenes using minimal samples, achieving significant performance improvements over existing methods.
Details
Motivation: Deformable object manipulation in robotics faces major challenges including uncertainties in object properties, diverse configurations, visual interference, and ambiguous prompts, which complicate both perception and control tasks for robots.
Method: The method consists of three key components: (1) DefoSEM - enhances hierarchical understanding of internal structure and local feature identification, (2) OEKFM - optimizes feature extraction using geometric constraints and improves adaptability, and (3) instance-conditional prompts based on image data and task context to reduce region ambiguity.
Result: The approach achieves significant improvements of 6.2%, 3.2%, and 2.9% in KLD, SIM, and NSS metrics respectively compared to state-of-the-art methods, while demonstrating high generalization performance on the AGDDO15 dataset containing 15 types of deformable objects.
Conclusion: The proposed OS-AGDO method successfully enables robots to recognize previously unseen deformable objects with varying properties using minimal samples, significantly outperforming existing approaches and showing strong generalization capabilities in real-world scenarios.
Abstract: Deformable object manipulation in robotics presents significant challenges due to uncertainties in component properties, diverse configurations, visual interference, and ambiguous prompts. These factors complicate both perception and control tasks. To address these challenges, we propose a novel method for One-Shot Affordance Grounding of Deformable Objects (OS-AGDO) in egocentric organizing scenes, enabling robots to recognize previously unseen deformable objects with varying colors and shapes using minimal samples. Specifically, we first introduce the Deformable Object Semantic Enhancement Module (DefoSEM), which enhances hierarchical understanding of the internal structure and improves the ability to accurately identify local features, even under conditions of weak component information. Next, we propose the ORB-Enhanced Keypoint Fusion Module (OEKFM), which optimizes feature extraction of key components by leveraging geometric constraints and improves adaptability to diversity and visual interference. Additionally, we propose an instance-conditional prompt based on image data and task context, which effectively mitigates the issue of region ambiguity caused by prompt words. To validate these methods, we construct a diverse real-world dataset, AGDDO15, which includes 15 common types of deformable objects and their associated organizational actions. Experimental results demonstrate that our approach significantly outperforms state-of-the-art methods, achieving improvements of 6.2%, 3.2%, and 2.9% in KLD, SIM, and NSS metrics, respectively, while exhibiting high generalization performance. Source code and benchmark dataset are made publicly available at https://github.com/Dikay1/OS-AGDO.
[107] AtrousMamaba: An Atrous-Window Scanning Visual State Space Model for Remote Sensing Change Detection
Tao Wang, Tiecheng Bai, Chao Xu, Bin Liu, Erlei Zhang, Jiyun Huang, Hongming Zhang
Main category: cs.CV
TL;DR: The paper proposes AtrousMamba, a novel visual state space model that combines Mamba’s linear complexity for long sequences with enhanced local feature extraction through an atrous-window selective scan mechanism, achieving superior performance on change detection tasks.
Details
Motivation: Existing Mamba-based methods focus on enhancing global receptive fields but overlook the importance of local information in dense prediction tasks. There's an open question about whether Mamba can effectively extract local features like CNNs do, which is critical for visual tasks requiring fine-grained detail preservation.
Method: The authors develop AtrousMamba with an atrous-window selective scan mechanism that gradually expands scanning range with adjustable rates, shortening distances between adjacent tokens. They design the atrous window scan visual state space (AWVSS) module and create dedicated end-to-end frameworks for binary change detection (AWMambaBCD) and semantic change detection (AWMambaSCD).
Result: Experimental evaluation on six benchmark datasets demonstrates that the proposed framework outperforms existing CNN-based, Transformer-based, and Mamba-based methods in both binary and semantic change detection tasks.
Conclusion: The research proves that Mamba can effectively balance long-range dependency modeling with fine-grained local detail preservation, demonstrating its capability to extract local features comparable to CNNs while maintaining its advantages in processing long sequences with linear complexity.
Abstract: Recently, a novel visual state space (VSS) model, referred to as Mamba, has demonstrated significant progress in modeling long sequences with linear complexity, comparable to Transformer models, thereby enhancing its adaptability for processing visual data. Although most methods aim to enhance the global receptive field by directly modifying Mamba’s scanning mechanism, they tend to overlook the critical importance of local information in dense prediction tasks. Additionally, whether Mamba can effectively extract local features as convolutional neural networks (CNNs) do remains an open question that merits further investigation. In this paper, we propose a novel model, AtrousMamba, which effectively balances the extraction of fine-grained local details with the integration of global contextual information. Specifically, our method incorporates an atrous-window selective scan mechanism, enabling a gradual expansion of the scanning range with adjustable rates. This design shortens the distance between adjacent tokens, enabling the model to effectively capture fine-grained local features and global context. By leveraging the atrous window scan visual state space (AWVSS) module, we design dedicated end-to-end Mamba-based frameworks for binary change detection (BCD) and semantic change detection (SCD), referred to as AWMambaBCD and AWMambaSCD, respectively. Experimental results on six benchmark datasets show that the proposed framework outperforms existing CNN-based, Transformer-based, and Mamba-based methods. These findings clearly demonstrate that Mamba not only captures long-range dependencies in visual data but also effectively preserves fine-grained local details.
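To illustrate the indexing idea behind a dilated (atrous) scan, the sketch below reorders grid tokens phase by phase at a fixed rate; the AWVSS module's actual windowing and rate schedule are defined in the paper, so this shows only the general reordering pattern.

```python
import numpy as np

def atrous_scan_order(h: int, w: int, rate: int = 2) -> np.ndarray:
    """Token scan order that visits the grid in dilated phases.

    Tokens are visited phase by phase: first positions congruent to (0,0)
    mod rate, then (0,1), and so on. Each pass is a dilated sweep of the
    image, so spatially nearby tokens land closer together in the 1-D
    sequence than under plain raster scanning. Illustrative only.
    """
    idx = np.arange(h * w).reshape(h, w)
    order = [idx[dy::rate, dx::rate].ravel()
             for dy in range(rate) for dx in range(rate)]
    return np.concatenate(order)

order = atrous_scan_order(4, 4, rate=2)
# tokens = tokens[order] would reorder an (h*w, dim) token sequence before the SSM
print(order)  # [ 0  2  8 10  1  3  9 11  4  6 12 14  5  7 13 15]
```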
[108] Unveiling the Potential of Segment Anything Model 2 for RGB-Thermal Semantic Segmentation with Language Guidance
Jiayi Zhao, Fei Teng, Kai Luo, Guoqiang Zhao, Zhiyong Li, Xu Zheng, Kailun Yang
Main category: cs.CV
TL;DR: SHIFNet is a novel framework that adapts SAM2 for RGB-Thermal perception by introducing cross-modal fusion and heterogeneous prompting to overcome SAM2’s RGB bias, achieving state-of-the-art segmentation performance with 89.8% on PST900 and 67.8% on FMB datasets.
Details
Motivation: SAM2, despite its strong perception potential from large-scale training, has inherent RGB bias that prevents its effective application to RGB-Thermal (RGB-T) tasks. The challenge is to unlock SAM2's potential for multi-modal perception while addressing the high costs of data collection in robotic systems.
Method: The paper proposes SHIFNet with two key components: (1) Semantic-Aware Cross-modal Fusion (SACF) module that uses text-guided affinity learning to dynamically balance RGB and thermal modality contributions, and (2) Heterogeneous Prompting Decoder (HPD) that enhances global semantic information and combines it with category embeddings to improve cross-modal semantic consistency.
Result: SHIFNet achieves state-of-the-art segmentation performance on public benchmarks with 89.8% accuracy on PST900 dataset and 67.8% on FMB dataset, using only 32.27M trainable parameters. The framework successfully adapts pre-trained large models to RGB-T segmentation tasks.
Conclusion: The framework effectively mitigates high data collection costs while providing robotic systems with comprehensive perception capabilities. SHIFNet successfully addresses SAM2’s RGB bias and enables efficient RGB-Thermal perception through linguistic guidance and hybrid interaction paradigms.
Abstract: The perception capability of robotic systems relies on the richness of the dataset. Although Segment Anything Model 2 (SAM2), trained on large datasets, demonstrates strong potential in perception tasks, its inherent training paradigm prevents it from being suitable for RGB-T tasks. To address these challenges, we propose SHIFNet, a novel SAM2-driven Hybrid Interaction Paradigm that unlocks the potential of SAM2 with linguistic guidance for efficient RGB-Thermal perception. Our framework consists of two key components: (1) Semantic-Aware Cross-modal Fusion (SACF) module that dynamically balances modality contributions through text-guided affinity learning, overcoming SAM2’s inherent RGB bias; (2) Heterogeneous Prompting Decoder (HPD) that enhances global semantic information through a semantic enhancement module and then combined with category embeddings to amplify cross-modal semantic consistency. With 32.27M trainable parameters, SHIFNet achieves state-of-the-art segmentation performance on public benchmarks, reaching 89.8% on PST900 and 67.8% on FMB, respectively. The framework facilitates the adaptation of pre-trained large models to RGB-T segmentation tasks, effectively mitigating the high costs associated with data collection while endowing robotic systems with comprehensive perception capabilities. The source code will be made publicly available at https://github.com/iAsakiT3T/SHIFNet.
[109] Explicit Context Reasoning with Supervision for Visual Tracking
Fansheng Zeng, Bineng Zhong, Haiying Xia, Yufei Tan, Xiantao Hu, Liangtao Shi, Shuxiang Song
Main category: cs.CV
TL;DR: RSTrack proposes a visual tracking algorithm that explicitly models contextual reasoning through three mechanisms: context reasoning, forward supervision, and efficient state modeling to improve temporal consistency and achieve state-of-the-art performance while maintaining real-time speeds.
Details
Motivation: Mainstream tracking algorithms associate context by simply stacking historical information without explicit supervision, making it difficult to effectively model the target's evolving dynamics and maintain temporal consistency in cross-frame modeling.
Method: RSTrack employs three core mechanisms: 1) Context Reasoning Mechanism that constructs a target state reasoning pipeline converting contextual associations into temporal reasoning, 2) Forward Supervision Strategy using true target features as anchors to constrain reasoning and prevent drift, and 3) Efficient State Modeling with compression-reconstruction to extract core features and remove redundant information.
Result: RSTrack achieves state-of-the-art performance on multiple benchmark datasets while maintaining real-time running speeds, effectively alleviating contextual association divergence in traditional temporal modeling.
Conclusion: The three collaborative mechanisms successfully address the problem of contextual association divergence in temporal modeling, demonstrating that explicit supervision of context reasoning can significantly improve visual tracking performance without sacrificing computational efficiency.
Abstract: Contextual reasoning with constraints is crucial for enhancing temporal consistency in cross-frame modeling for visual tracking. However, mainstream tracking algorithms typically associate context by merely stacking historical information without explicitly supervising the association process, making it difficult to effectively model the target’s evolving dynamics. To alleviate this problem, we propose RSTrack, which explicitly models and supervises context reasoning via three core mechanisms. 1) Context Reasoning Mechanism: Constructs a target state reasoning pipeline, converting unconstrained contextual associations into a temporal reasoning process that predicts the current representation based on historical target states, thereby enhancing temporal consistency. 2) Forward Supervision Strategy: Utilizes true target features as anchors to constrain the reasoning pipeline, guiding the predicted output toward the true target distribution and suppressing drift in the context reasoning process. 3) Efficient State Modeling: Employs a compression-reconstruction mechanism to extract the core features of the target, removing redundant information across frames and preventing ineffective contextual associations. These three mechanisms collaborate to effectively alleviate the issue of contextual association divergence in traditional temporal modeling. Experimental results show that RSTrack achieves state-of-the-art performance on multiple benchmark datasets while maintaining real-time running speeds. Our code is available at https://github.com/GXNU-ZhongLab/RSTrack.
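A schematic sketch of the forward-supervision idea: a small recurrent reasoner predicts the current target feature from history and is pulled toward the true current-frame feature. The GRU, dimensions, and MSE loss are stand-ins, not RSTrack's actual components.

```python
import torch
import torch.nn as nn

class ContextReasoner(nn.Module):
    """Predict the current target representation from historical states."""

    def __init__(self, dim: int = 256):
        super().__init__()
        self.rnn = nn.GRU(dim, dim, batch_first=True)
        self.head = nn.Linear(dim, dim)

    def forward(self, history: torch.Tensor) -> torch.Tensor:
        # history: (batch, T, dim) target features from past frames
        _, h_n = self.rnn(history)
        return self.head(h_n[-1])  # predicted current-frame feature

model = ContextReasoner()
history = torch.randn(8, 5, 256)
true_feat = torch.randn(8, 256)  # feature extracted at the current frame
pred = model(history)
# Forward supervision: the reasoning output is pulled toward the true target
# feature, constraining context association instead of just stacking history.
loss = nn.functional.mse_loss(pred, true_feat)
loss.backward()
```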
[110] Unsupervised Joint Learning of Optical Flow and Intensity with Event Cameras
Shuang Guo, Friedhelm Hamann, Guillermo Gallego
Main category: cs.CV
TL;DR: This paper proposes an unsupervised learning framework that jointly estimates optical flow and image intensity from event camera data using a single network, achieving state-of-the-art performance with 20% reduction in EPE and 25% reduction in AE compared to unsupervised approaches.
Details
Motivation: Event cameras inherently link appearance and motion - both are either present together or absent in event data. Previous works treat optical flow (motion) and image intensity (appearance) recovery as separate tasks, which doesn't align with the fundamental nature of event cameras and ignores the inherent relationships between these visual quantities.
Method: The authors develop an unsupervised learning framework using a single network to jointly estimate optical flow and image intensity. They derive a new event-based photometric error as a function of both optical flow and image intensity from the data generation model, and combine this with the contrast maximization framework to create a comprehensive loss function that properly constrains both estimations.
Result: The method achieves state-of-the-art performance in optical flow estimation with 20% reduction in End Point Error (EPE) and 25% reduction in Angular Error (AE) compared to unsupervised approaches. It also delivers competitive intensity estimation results, especially in high dynamic range scenarios, while maintaining shorter inference time than other optical flow methods and many image reconstruction methods.
Conclusion: The joint estimation approach successfully leverages the inherent coupling between motion and appearance in event camera data, demonstrating superior performance in both optical flow and intensity estimation tasks while being computationally efficient. This validates the importance of considering the fundamental characteristics of event cameras when designing estimation algorithms.
Abstract: Event cameras rely on motion to obtain information about scene appearance. This means that appearance and motion are inherently linked: either both are present and recorded in the event data, or neither is captured. Previous works treat the recovery of these two visual quantities as separate tasks, which does not fit with the above-mentioned nature of event cameras and overlooks the inherent relations between them. We propose an unsupervised learning framework that jointly estimates optical flow (motion) and image intensity (appearance) using a single network. From the data generation model, we newly derive the event-based photometric error as a function of optical flow and image intensity. This error is further combined with the contrast maximization framework to form a comprehensive loss function that provides proper constraints for both flow and intensity estimation. Exhaustive experiments show our method’s state-of-the-art performance: in optical flow estimation, it reduces EPE by 20% and AE by 25% compared to unsupervised approaches, while delivering competitive intensity estimation results, particularly in high dynamic range scenarios. Our method also achieves shorter inference time than all other optical flow methods and many of the image reconstruction methods, while they output only one quantity. Project page: https://github.com/tub-rip/E2FAI
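The abstract does not state the derived error itself. Under the standard linear event-generation model (contrast threshold C, signed event count N(x) over a window Δt), one plausible form of a photometric error coupling the flow u and the log-intensity L — an assumed instantiation, not the paper's exact derivation — is:

```latex
% Brightness change predicted by warping L along the flow u should match
% the change recorded by the events in the same window:
E(u, L) = \sum_{x} \Big( L(x, t) - L\big(x - u(x)\,\Delta t,\; t - \Delta t\big) - C\, N(x) \Big)^{2}
```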
[111] A Single-step Accurate Fingerprint Registration Method Based on Local Feature Matching
Yuwei Jia, Zhe Cui, Fei Su
Main category: cs.CV
TL;DR: An end-to-end single-step fingerprint registration algorithm that directly predicts semi-dense matching points to align fingerprints, avoiding the common failure of traditional two-step methods when minutiae detection is poor in low-quality images.
Details
Motivation: Traditional fingerprint registration methods use a two-step process (initial minutiae-based registration followed by dense registration), but this approach frequently fails when fingerprint image quality is low because fewer minutiae are detected, causing the initial registration, and ultimately the entire registration process, to fail.
Method: An end-to-end single-step registration algorithm that directly predicts semi-dense matching point correspondences between two fingerprints using global-local attention mechanisms to achieve pixel-level alignment, bypassing the need for initial minutiae-based registration.
Result: The proposed method achieves state-of-the-art matching performance with only single-step registration and can be combined with dense registration algorithms for further performance improvements, while minimizing the risk of registration failure compared to traditional two-step approaches.
Conclusion: The single-step fingerprint registration approach successfully addresses the limitations of traditional two-step methods by eliminating dependency on minutiae detection quality, providing robust performance even with low-quality fingerprint images while maintaining state-of-the-art accuracy.
Abstract: Distortion of the fingerprint images leads to a decline in fingerprint recognition performance, and fingerprint registration can mitigate this distortion issue by accurately aligning two fingerprint images. Currently, fingerprint registration methods often consist of two steps: an initial registration based on minutiae, and a dense registration based on matching points. However, when the quality of the fingerprint image is low, the number of detected minutiae is reduced, leading to frequent failures in the initial registration, which ultimately causes the entire fingerprint registration process to fail. In this study, we propose an end-to-end single-step fingerprint registration algorithm that aligns two fingerprints by directly predicting the semi-dense matching point correspondences between them. Thus, our method minimizes the risk of minutiae registration failure and also leverages global-local attentions to achieve end-to-end pixel-level alignment between the two fingerprints. Experimental results show that our method achieves state-of-the-art matching performance with only single-step registration, and it can also be used in conjunction with dense registration algorithms for further performance improvements.
[112] Advancing Visual Large Language Model for Multi-granular Versatile Perception
Wentao Xiang, Haoxian Tan, Cong Wei, Yujie Zhong, Dengjie Li, Yujiu Yang
Main category: cs.CV
TL;DR: MVP-LM is a unified visual perception framework that integrates Visual Large Language Models to handle diverse perception tasks (detection, segmentation, grounding) across both word-based and sentence-based instructions using a multi-granularity decoder and CoT-inspired dataset unification.
Details
Motivation: Existing computer vision perception research typically focuses on limited subsets of perception tasks, constraining its applicability and versatility. There's a need for a unified framework that can handle the full spectrum of perception tasks systematically categorized by prediction type and instruction type.
Method: The paper proposes MVP-LM with: (1) a multi-granularity decoder that handles both word-based and sentence-based perception tasks, (2) a CoT-inspired dataset unification strategy for seamless supervised fine-tuning, (3) integration of box and mask predictions in a single architecture, and (4) a query enhancement strategy to leverage VLLM decoding and generative capabilities.
Result: Extensive experiments across various benchmarks demonstrate the efficacy of MVP-LM on both word-based and sentence-based perception tasks, including panoptic segmentation, detection, grounding, and referring expression segmentation.
Conclusion: MVP-LM successfully unifies diverse perception tasks within a single Visual Large Language Model framework, providing improved versatility and applicability across different computer vision contexts through its multi-granular approach and innovative dataset unification strategy.
Abstract: Perception is a fundamental task in the field of computer vision, encompassing a diverse set of subtasks that can be systematically categorized into four distinct groups based on two dimensions: prediction type and instruction type. Notably, existing research often focuses solely on a limited subset of these potential combinations, which constrains applicability and versatility across various contexts. In response to this challenge, we present MVP-LM, a Multi-granular and Versatile Perception framework incorporating Visual Large Language Model. Our framework is designed to integrate both word-based and sentence-based perception tasks alongside box and mask predictions within a single architecture. MVP-LM features an innovative multi-granularity decoder in conjunction with a CoT-inspired dataset unification strategy, enabling seamless supervised fine-tuning across a wide spectrum of tasks, including but not limited to panoptic segmentation, detection, grounding, and referring expression segmentation. Furthermore, we introduce a query enhancement strategy aimed at harnessing the decoding and generative capabilities inherent in VLLMs. Extensive experiments conducted across a range of benchmarks in both word-based and sentence-based perception tasks substantiate the efficacy of our framework. The code will be available at https://github.com/xiangwentao666/MVP-LM.
[113] LDRFusion: A LiDAR-Dominant multimodal refinement framework for 3D object detection
Jijun Wang, Yan Wu, Yujian Mo, Junqiao Zhao, Jun Yan, Yinghao Hu
Main category: cs.CV
TL;DR: LDRFusion proposes a LiDAR-dominant two-stage framework for 3D object detection that first uses LiDAR alone for accurate proposals, then incorporates pseudo point clouds for challenging instances, achieving strong performance on KITTI dataset.
Details
Motivation: Existing LiDAR-Camera fusion methods suffer from noise introduced by pseudo point clouds generated through depth completion, which can lead to inaccurate predictions. The authors recognize that different modalities have varying roles and reliability levels in 3D object detection.
Method: A novel two-stage LiDAR-dominant refinement framework: (1) the first stage uses LiDAR only to generate accurately localized proposals, (2) the second stage incorporates pseudo point clouds to detect challenging instances, and (3) results from both stages are merged. Additionally, it introduces a hierarchical pseudo point residual encoding module that encodes neighborhood sets using both feature and positional residuals to enhance local structure representation.
Result: Experiments on the KITTI dataset show that the framework consistently achieves strong performance across multiple object categories and difficulty levels, demonstrating the effectiveness of the LiDAR-dominant approach.
Conclusion: The LiDAR-dominant two-stage approach effectively addresses the noise issues in pseudo point clouds while leveraging the strengths of both LiDAR and camera modalities, resulting in improved 3D object detection performance across various scenarios.
Abstract: Existing LiDAR-Camera fusion methods have achieved strong results in 3D object detection. To address the sparsity of point clouds, previous approaches typically construct spatial pseudo point clouds via depth completion as auxiliary input and adopt a proposal-refinement framework to generate detection results. However, introducing pseudo points inevitably brings noise, potentially resulting in inaccurate predictions. Considering the differing roles and reliability levels of each modality, we propose LDRFusion, a novel LiDAR-dominant two-stage refinement framework for multi-sensor fusion. The first stage solely relies on LiDAR to produce accurately localized proposals, followed by a second stage where pseudo point clouds are incorporated to detect challenging instances. The instance-level results from both stages are subsequently merged. To further enhance the representation of local structures in pseudo point clouds, we present a hierarchical pseudo point residual encoding module, which encodes neighborhood sets using both feature and positional residuals. Experiments on the KITTI dataset demonstrate that our framework consistently achieves strong performance across multiple categories and difficulty levels.
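A sketch of neighborhood encoding by feature and positional residuals, aggregated with a max over K neighbors; the dimensions and the MLP are arbitrary stand-ins for the paper's hierarchical module.

```python
import torch
import torch.nn as nn

class ResidualNeighborEncoder(nn.Module):
    """Encode a neighborhood of pseudo points by feature and positional
    residuals relative to a center point, rather than by raw values."""

    def __init__(self, feat_dim: int = 32, out_dim: int = 64):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim + 3, out_dim), nn.ReLU(), nn.Linear(out_dim, out_dim)
        )

    def forward(self, center_xyz, center_feat, nbr_xyz, nbr_feat):
        # nbr_xyz: (N, K, 3), nbr_feat: (N, K, F) -- K neighbors per center point
        pos_res = nbr_xyz - center_xyz[:, None, :]    # positional residuals
        feat_res = nbr_feat - center_feat[:, None, :]  # feature residuals
        enc = self.mlp(torch.cat([feat_res, pos_res], dim=-1))
        return enc.max(dim=1).values  # (N, out_dim) aggregated over neighbors

enc = ResidualNeighborEncoder()
out = enc(torch.randn(100, 3), torch.randn(100, 32),
          torch.randn(100, 8, 3), torch.randn(100, 8, 32))
```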
[114] MONITRS: Multimodal Observations of Natural Incidents Through Remote Sensing
Shreelekha Revankar, Utkarsh Mall, Cheng Perng Phoo, Kavita Bala, Bharath Hariharan
Main category: cs.CV
TL;DR: MONITRS introduces a comprehensive multimodal dataset of 10,000+ FEMA disaster events combining temporal satellite imagery with natural language annotations from news articles to improve machine learning-based disaster monitoring and response systems.
Details
Motivation: Current disaster response systems are limited by difficulty accessing affected areas, narrow focus on specific disaster types, reliance on manual expert interpretation, and lack of datasets with sufficient temporal granularity or natural language annotations for tracking disaster progression.
Method: Created MONITRS, a multimodal dataset containing over 10,000 FEMA disaster events with temporal satellite imagery, natural language annotations from news articles, geotagged locations, and question-answer pairs. Fine-tuned existing Multimodal Large Language Models (MLLMs) on this dataset.
Result: Fine-tuning existing MLLMs on the MONITRS dataset yielded significant performance improvements for disaster monitoring tasks, establishing a new benchmark for machine learning-assisted disaster response systems.
Conclusion: The MONITRS dataset successfully addresses key limitations in current disaster monitoring approaches by providing comprehensive multimodal data that enables better automated analysis of disasters through improved machine learning models, potentially enhancing disaster response capabilities.
Abstract: Natural disasters cause devastating damage to communities and infrastructure every year. Effective disaster response is hampered by the difficulty of accessing affected areas during and after events. Remote sensing has allowed us to monitor natural disasters remotely. More recently, there have been advances in computer vision and deep learning that help automate satellite imagery analysis. However, these approaches remain limited by their narrow focus on specific disaster types, reliance on manual expert interpretation, and lack of datasets with sufficient temporal granularity or natural language annotations for tracking disaster progression. We present MONITRS, a novel multimodal dataset of more than 10,000 FEMA disaster events with temporal satellite imagery and natural language annotations from news articles, accompanied by geotagged locations, and question-answer pairs. We demonstrate that fine-tuning existing MLLMs on our dataset yields significant performance improvements for disaster monitoring tasks, establishing a new benchmark for machine learning-assisted disaster response systems. Code can be found at: https://github.com/ShreelekhaR/MONITRS
[115] Positive Style Accumulation: A Style Screening and Continuous Utilization Framework for Federated DG-ReID
Xin Xu, Chaoyue Ren, Wei Liu, Wenke Huang, Bin Yang, Zhixi Yu, Kui Jiang
Main category: cs.CV
TL;DR: This paper proposes SSCU framework for federated domain generalization in person re-identification that screens positive styles beneficial for generalization and continuously utilizes them through dynamic memory and collaborative training strategies.
Details
Motivation: Existing federated domain generalization methods for person re-identification use style transformation to improve diversity, but not all generated styles contribute positively to generalization performance. The paper identifies that some styles are beneficial (positive) while others are harmful (negative) to model generalization, creating the need to effectively screen and continuously utilize only the positive styles.
Method: The authors propose the Style Screening and Continuous Utilization (SSCU) framework with three key components: (1) Generalization Gain-guided Dynamic Style Memory (GGDSM) for each client to screen and accumulate positive styles, (2) style memory recognition loss to leverage memorized positive styles, and (3) Collaborative Style Training (CST) strategy that trains client models on two branches using both newly generated styles and accumulated positive styles from memory.
Result: Extensive experimental results show that the proposed method outperforms existing methods in both source domain and target domain performance for federated domain generalization in person re-identification tasks.
Conclusion: The SSCU framework effectively addresses the challenge of style selection in federated domain generalization by distinguishing between positive and negative styles, and provides a comprehensive solution for screening, memorizing, and continuously utilizing beneficial styles to improve model generalization performance across domains.
Abstract: The Federated Domain Generalization for Person re-identification (FedDG-ReID) aims to learn a global server model that can be effectively generalized to source and target domains through distributed source domain data. Existing methods mainly improve the diversity of samples through style transformation, which to some extent enhances the generalization performance of the model. However, we discover that not all styles contribute to the generalization performance. Therefore, we define styles that are beneficial or harmful to the model’s generalization performance as positive or negative styles. Based on this, new issues arise: How to effectively screen and continuously utilize the positive styles. To solve these problems, we propose a Style Screening and Continuous Utilization (SSCU) framework. Firstly, we design a Generalization Gain-guided Dynamic Style Memory (GGDSM) for each client model to screen and accumulate generated positive styles. Meanwhile, we propose a style memory recognition loss to fully leverage the positive styles memorized by Memory. Furthermore, we propose a Collaborative Style Training (CST) strategy to make full use of positive styles. Unlike traditional learning strategies, our approach leverages both newly generated styles and the accumulated positive styles stored in memory to train client models on two distinct branches. This training strategy is designed to effectively promote the rapid acquisition of new styles by the client models, and guarantees the continuous and thorough utilization of positive styles, which is highly beneficial for the model’s generalization performance. Extensive experimental results demonstrate that our method outperforms existing methods in both the source domain and the target domain.
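A toy sketch of a gain-guided style memory: styles are kept only if applying them improved some generalization proxy, with the weakest entry evicted when full. The capacity, eviction rule, and "gain" signal are assumptions here; GGDSM's actual criterion is defined in the paper.

```python
import random

class StyleMemory:
    """Keep only styles whose use improved a validation proxy (the 'gain')."""

    def __init__(self, capacity: int = 32):
        self.styles, self.gains = [], []
        self.capacity = capacity

    def update(self, style, gain: float) -> None:
        # A style counts as 'positive' if applying it increased the gain.
        if gain <= 0:
            return  # negative style: discard
        if len(self.styles) < self.capacity:
            self.styles.append(style)
            self.gains.append(gain)
        else:
            worst = min(range(len(self.gains)), key=self.gains.__getitem__)
            if gain > self.gains[worst]:  # replace the weakest stored style
                self.styles[worst], self.gains[worst] = style, gain

    def sample(self):
        """Draw an accumulated positive style for the memory-based branch."""
        return random.choice(self.styles) if self.styles else None
```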
[116] Scale Your Instructions: Enhance the Instruction-Following Fidelity of Unified Image Generation Model by Self-Adaptive Attention Scaling
Chao Zhou, Tianyi Wei, Nenghai Yu
Main category: cs.CV
TL;DR: This paper addresses text instruction neglect in unified image generation models like OmniGen by proposing Self-Adaptive Attention Scaling (SaaS), which dynamically scales attention activation for each sub-instruction to improve instruction-following fidelity without additional training.
Details
Motivation: Unified image generation models suffer from text instruction neglect, particularly when handling multiple sub-instructions, despite their advantages in simplifying architecture and standardizing tasks. The authors identified conflicts between neglected sub-instructions and input image activations through perturbation analysis and cross-attention map examination.
Method: The authors propose Self-Adaptive Attention Scaling (SaaS), which leverages the consistency of cross-attention between adjacent timesteps to dynamically scale the attention activation for each sub-instruction. The method includes perturbation analysis to identify critical steps and layers, and examination of cross-attention maps to understand instruction conflicts.
Result: Experimental results on instruction-based image editing and visual conditional image generation demonstrate that SaaS achieves superior instruction-following fidelity compared to existing methods, while requiring no additional training or test-time optimization.
Conclusion: SaaS effectively addresses the text instruction neglect problem in unified image generation models by dynamically scaling attention activations, improving instruction-following performance without computational overhead or model retraining requirements.
Abstract: Recent advancements in unified image generation models, such as OmniGen, have enabled the handling of diverse image generation and editing tasks within a single framework, accepting multimodal, interleaved texts and images in free form. This unified architecture eliminates the need for text encoders, greatly reducing model complexity and standardizing various image generation and editing tasks, making it more user-friendly. However, we found that it suffers from text instruction neglect, especially when the text instruction contains multiple sub-instructions. To explore this issue, we performed a perturbation analysis on the input to identify critical steps and layers. By examining the cross-attention maps of these key steps, we observed significant conflicts between neglected sub-instructions and the activations of the input image. In response, we propose Self-Adaptive Attention Scaling (SaaS), a method that leverages the consistency of cross-attention between adjacent timesteps to dynamically scale the attention activation for each sub-instruction. Our SaaS enhances instruction-following fidelity without requiring additional training or test-time optimization. Experimental results on instruction-based image editing and visual conditional image generation validate the effectiveness of our SaaS, showing superior instruction-following fidelity over existing methods. The code is available https://github.com/zhouchao-ops/SaaS.
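A rough sketch of consistency-based attention rescaling: for each sub-instruction's token span, attention mass at step t is compared with the previous step and boosted if it collapsed. The shapes, the boost-only rule, and the renormalization are illustrative assumptions, not the authors' exact procedure.

```python
import torch

def saas_scale(attn_t, attn_prev, token_spans, eps=1e-6):
    """Rescale cross-attention per sub-instruction using timestep consistency.

    attn_t, attn_prev: (heads, queries, text_tokens) cross-attention maps at
    adjacent denoising steps. token_spans: list of (start, end) text-token
    ranges, one per sub-instruction. Sub-instructions whose attention mass
    collapsed relative to the previous step get scaled back up.
    """
    scaled = attn_t.clone()
    for s, e in token_spans:
        mass_t = attn_t[..., s:e].sum()
        mass_prev = attn_prev[..., s:e].sum()
        scale = (mass_prev / (mass_t + eps)).clamp(min=1.0)  # only boost, never damp
        scaled[..., s:e] = attn_t[..., s:e] * scale
    return scaled / scaled.sum(dim=-1, keepdim=True)  # renormalize per query

a_t, a_prev = torch.rand(8, 4096, 77), torch.rand(8, 4096, 77)
out = saas_scale(a_t, a_prev, [(1, 10), (10, 25)])  # two sub-instruction spans
```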
[117] HoliTracer: Holistic Vectorization of Geographic Objects from Large-Size Remote Sensing Imagery
Yu Wang, Bo Dang, Wanchun Li, Wei Chen, Yansheng Li
Main category: cs.CV
TL;DR: HoliTracer is the first framework designed to extract vectorized geographic objects from large-size remote sensing imagery, addressing the limitations of existing methods that process small patches and lose contextual information.
Details
Motivation: Existing methods for vector mapping from remote sensing imagery are constrained to processing small image patches, leading to loss of contextual information and fragmented vector outputs. With increasing resolution of remote sensing imagery, there's a need for methods that can handle large-size images holistically.
Method: HoliTracer uses three key components: (1) Context Attention Net (CAN) with local-to-global attention mechanism for enhanced segmentation of large-size imagery, (2) Mask Contour Reformer (MCR) for polygon reconstruction, and (3) Polygon Sequence Tracer (PST) for vertex tracing in a robust vectorization pipeline.
Result: Extensive experiments on large-size remote sensing imagery datasets including buildings, water bodies, and roads demonstrate that HoliTracer outperforms state-of-the-art methods in vectorized geographic object extraction.
Conclusion: HoliTracer successfully addresses the challenge of holistic vector mapping from large-size remote sensing imagery by preserving contextual information and producing coherent vector outputs, outperforming existing methods across multiple geographic object types.
Abstract: With the increasing resolution of remote sensing imagery (RSI), large-size RSI has emerged as a vital data source for high-precision vector mapping of geographic objects. Existing methods are typically constrained to processing small image patches, which often leads to the loss of contextual information and produces fragmented vector outputs. To address these issues, this paper introduces HoliTracer, the first framework designed to holistically extract vectorized geographic objects from large-size RSI. In HoliTracer, we enhance segmentation of large-size RSI using the Context Attention Net (CAN), which employs a local-to-global attention mechanism to capture contextual dependencies. Furthermore, we achieve holistic vectorization through a robust pipeline that leverages the Mask Contour Reformer (MCR) to reconstruct polygons and the Polygon Sequence Tracer (PST) to trace vertices. Extensive experiments on large-size RSI datasets, including buildings, water bodies, and roads, demonstrate that HoliTracer outperforms state-of-the-art methods. Our code and data are available at https://github.com/vvangfaye/HoliTracer.
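HoliTracer's learned MCR and PST modules aren't specified in detail here; for orientation, the baseline raster-to-vector step they improve upon can be sketched with OpenCV contour tracing and polygon simplification (a heuristic stand-in, not the paper's method):

```python
import cv2
import numpy as np

def mask_to_polygons(mask: np.ndarray, epsilon_frac: float = 0.01):
    """Baseline raster-to-vector step: trace contours of a binary mask and
    simplify each into a polygon. HoliTracer replaces this heuristic with
    learned contour reformation (MCR) and vertex tracing (PST)."""
    mask = (mask > 0).astype(np.uint8)
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    polygons = []
    for contour in contours:
        eps = epsilon_frac * cv2.arcLength(contour, closed=True)
        poly = cv2.approxPolyDP(contour, eps, closed=True)
        if len(poly) >= 3:
            polygons.append(poly.reshape(-1, 2))
    return polygons
```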
[118] Edge-case Synthesis for Fisheye Object Detection: A Data-centric Perspective
Seunghyeon Kim, Kyeongryeol Go
Main category: cs.CV
TL;DR: A data-centric approach that improves fisheye camera object detection by identifying model blind spots through error analysis and addressing them via synthetic image generation with edge-case synthesis.
Details
Motivation: Fisheye cameras introduce significant distortion that creates unique challenges for object detection models trained on conventional datasets, requiring specialized approaches to handle peripheral distortions and edge cases.
Method: 1) Detailed error analysis to identify critical edge-cases (confusing class pairs, peripheral distortions, underrepresented contexts), 2) Fine-tuning an image generative model with carefully crafted prompts to synthesize images replicating real-world failure modes, 3) Pseudo-labeling synthetic images using high-quality detectors and integrating them into training.
Result: The approach achieves consistent performance gains in fisheye object detection, demonstrating the effectiveness of systematically addressing model weaknesses through targeted data augmentation.
Conclusion: Understanding data deeply and selectively fixing its weaknesses through edge-case synthesis can be highly impactful in specialized domains like fisheye object detection, showing the value of data-centric approaches over model-centric solutions.
Abstract: Fisheye cameras introduce significant distortion and pose unique challenges to object detection models trained on conventional datasets. In this work, we propose a data-centric pipeline that systematically improves detection performance by focusing on the key question of identifying the blind spots of the model. Through detailed error analysis, we identify critical edge-cases such as confusing class pairs, peripheral distortions, and underrepresented contexts. Then we directly address them through edge-case synthesis. We fine-tuned an image generative model and guided it with carefully crafted prompts to produce images that replicate real-world failure modes. These synthetic images are pseudo-labeled using a high-quality detector and integrated into training. Our approach results in consistent performance gains, highlighting how deeply understanding data and selectively fixing its weaknesses can be impactful in specialized domains like fisheye object detection.
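As a rough illustration of the pseudo-labeling step, the sketch below filters a detector's outputs on synthetic images by confidence before adding them to training; the detector interface and threshold are hypothetical:

```python
def pseudo_label(detector, synthetic_images, conf_thresh=0.6):
    """Keep only high-confidence detections on synthetic images as labels.
    `detector` is assumed to return a list of (box, class_id, score) tuples
    per image; both the interface and the threshold are illustrative."""
    labeled = []
    for image in synthetic_images:
        detections = detector(image)
        boxes = [(box, cls) for box, cls, score in detections if score >= conf_thresh]
        if boxes:  # discard images where nothing passes the threshold
            labeled.append((image, boxes))
    return labeled
```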
[119] Quality Text, Robust Vision: The Role of Language in Enhancing Visual Robustness of Vision-Language Models
Futa Waseda, Saku Sugawara, Isao Echizen
Main category: cs.CV
TL;DR: This paper proposes QT-AFT (Quality Text-guided Adversarial Fine-Tuning), a method that uses high-quality captions to improve adversarial robustness of vision-language models like CLIP by guiding adversarial training with rich semantic information rather than just class labels.
Details
Motivation: Existing adversarial training methods for vision-language models have significant limitations: supervised methods overfit to training classes by using short texts like class labels, while unsupervised methods lack semantic guidance and perform poorly against text-guided attacks. There's a need for better adversarial training that leverages language to enhance visual robustness across diverse zero-shot tasks.
Method: QT-AFT (Quality Text-guided Adversarial Fine-Tuning) leverages high-quality captions during adversarial training to guide adversarial examples away from diverse semantics present in images. This approach enables the visual encoder to robustly recognize broader image features under adversarial noise by using rich linguistic supervision instead of simple class labels.
Result: QT-AFT achieves state-of-the-art zero-shot adversarial robustness and clean accuracy across 16 zero-shot datasets. The method successfully overcomes the overfitting issues of supervised adversarial training and the lack of semantic awareness in unsupervised methods. The study reveals that describing object properties in addition to names further enhances robustness.
Conclusion: The paper demonstrates that high-quality linguistic supervision is crucial for robust visual representation learning. QT-AFT effectively bridges the gap between supervised and unsupervised adversarial training by using rich captions, pointing to the importance of centering quality language guidance in future robust vision-language model development.
Abstract: Defending pre-trained vision-language models (VLMs), such as CLIP, against adversarial attacks is crucial, as these models are widely used in diverse zero-shot tasks, including image classification. However, existing adversarial training (AT) methods for robust fine-tuning largely overlook the role of language in enhancing visual robustness. Specifically, (1) supervised AT methods rely on short texts (e.g., class labels) to generate adversarial perturbations, leading to overfitting to object classes in the training data, and (2) unsupervised AT avoids this overfitting but remains suboptimal against practical text-guided adversarial attacks due to its lack of semantic guidance. To address these limitations, we propose Quality Text-guided Adversarial Fine-Tuning (QT-AFT), which leverages high-quality captions during training to guide adversarial examples away from diverse semantics present in images. This enables the visual encoder to robustly recognize a broader range of image features even under adversarial noise, thereby enhancing robustness across diverse downstream tasks. QT-AFT overcomes the key weaknesses of prior methods – overfitting in supervised AT and lack of semantic awareness in unsupervised AT – achieving state-of-the-art zero-shot adversarial robustness and clean accuracy, evaluated across 16 zero-shot datasets. Furthermore, our comprehensive study uncovers several key insights into the role of language in enhancing vision robustness; for example, describing object properties in addition to object names further enhances zero-shot robustness. Our findings point to an urgent direction for future work – centering high-quality linguistic supervision in robust visual representation learning.
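A minimal sketch of the caption-guided inner attack, assuming a CLIP-style model with an `encode_image` method (an assumed interface): PGD pushes the image embedding away from its high-quality caption embedding, producing the adversarial examples that the robust fine-tuning then trains against. This is a generic reconstruction, not the authors' code:

```python
import torch
import torch.nn.functional as F

def caption_guided_pgd(model, images, caption_emb, eps=4/255, alpha=1/255, steps=10):
    """PGD that pushes image embeddings away from their caption embeddings.
    `model.encode_image` is an assumed CLIP-style interface; budget `eps`,
    step size `alpha`, and step count are generic choices, not the paper's."""
    delta = torch.zeros_like(images, requires_grad=True)
    caption_emb = F.normalize(caption_emb, dim=-1)
    for _ in range(steps):
        img_emb = F.normalize(model.encode_image(images + delta), dim=-1)
        # ascend on negative cosine similarity -> drift away from caption semantics
        loss = -(img_emb * caption_emb).sum(dim=-1).mean()
        loss.backward()
        with torch.no_grad():
            delta += alpha * delta.grad.sign()
            delta.clamp_(-eps, eps)  # stay inside the L-inf ball
            delta.grad.zero_()
    # robust fine-tuning would then train the encoder on (images + delta)
    return (images + delta).detach()
```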
[120] ToFe: Lagged Token Freezing and Reusing for Efficient Vision Transformer Inference
Haoyue Zhang, Jie Zhang, Song Guo
Main category: cs.CV
TL;DR: This paper proposes ToFe (Token Freezing and Reusing), a framework that temporarily freezes less important tokens in vision transformers instead of discarding them, allowing their reuse in later blocks to reduce computational cost by 50% while maintaining performance.
Details
Motivation: Existing token reduction methods for vision transformers irreversibly discard unimportant tokens, but these tokens might be useful in later transformer blocks since different blocks focus on different information. There's also a need to balance model performance with computational overhead for deployment on resource-constrained devices.
Method: The ToFe framework identifies important tokens at each stage and temporarily freezes unimportant ones for later reuse. It includes a prediction module for token identification and an approximate module for recovering frozen tokens. The system is jointly optimized with the backbone through computation budget-aware end-to-end training.
Result: ToFe reduces computational cost of LV-ViT model by 50% with less than 2% drop in Top-1 accuracy, achieving better trade-off between performance and complexity compared to state-of-the-art methods.
Conclusion: The ToFe framework successfully addresses the limitations of existing token reduction methods by enabling token reuse, demonstrating that temporarily freezing tokens instead of discarding them can significantly reduce computational costs while maintaining competitive performance in vision transformer models.
Abstract: Although vision transformers (ViT) have shown remarkable success in various vision tasks, their computationally expensive self-attention hinders their deployment on resource-constrained devices. Token reduction, which discards less important tokens during forward propagation, has been proposed to enhance the efficiency of transformer models. However, existing methods handle unimportant tokens irreversibly, preventing their reuse in subsequent blocks. Considering that transformers focus on different information among blocks, tokens reduced in early blocks might be useful later. Furthermore, to adapt transformer models for resource-constrained devices, it is crucial to strike a balance between model performance and computational overhead. To address these challenges, in this paper, we introduce a novel Token Freezing and Reusing (ToFe) framework, where we identify important tokens at each stage and temporarily freeze the unimportant ones, allowing their lagged reuse at a later stage. Specifically, we design a prediction module for token identification and an approximate module for recovery of the frozen tokens. By jointly optimizing with the backbone through computation budget-aware end-to-end training, ToFe can adaptively process the necessary tokens at each block, thereby reducing computational cost while maintaining performance. Extensive experiments demonstrate that ToFe reduces the computational cost of LV-ViT model by 50% with less than 2% drop in Top-1 accuracy, achieving a better trade-off between performance and complexity compared to state-of-the-art methods.
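A minimal PyTorch sketch of the freeze-and-reuse routing, with the paper's prediction and recovery modules abstracted into a precomputed score tensor and an identity pass-through (both simplifying assumptions):

```python
import torch

def freeze_and_reuse(tokens, scores, keep_ratio, block):
    """One ToFe-style step (illustrative): route the top-scoring tokens through
    the transformer block and freeze the rest for later reuse instead of
    discarding them.

    tokens: (B, N, D) token features
    scores: (B, N) importance scores (ToFe predicts these with a module)
    block:  any module mapping (B, k, D) -> (B, k, D)
    """
    B, N, D = tokens.shape
    k = max(1, int(N * keep_ratio))
    keep_idx = scores.topk(k, dim=1).indices  # (B, k)
    idx = keep_idx.unsqueeze(-1).expand(-1, -1, D)
    active = torch.gather(tokens, 1, idx)     # compute only on active tokens
    active = block(active)
    out = tokens.clone()                      # frozen tokens pass through unchanged
    out.scatter_(1, idx, active)              # reinsert processed tokens in place
    return out
```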
[121] MAN++: Scaling Momentum Auxiliary Network for Supervised Local Learning in Vision Tasks
Junhao Su, Feiyu Zhu, Hengyu Shi, Tianyang Han, Yurui Qiu, Junfeng Luo, Xiaoming Wei, Jialin Gao
Main category: cs.CV
TL;DR: The paper proposes MAN++ (Momentum Auxiliary Network++), a supervised local learning method that uses Exponential Moving Average (EMA) and learnable scaling bias to enable communication between network blocks, achieving performance comparable to end-to-end backpropagation while reducing GPU memory usage.
Details
Motivation: Traditional end-to-end backpropagation suffers from update locking, high GPU memory consumption, and lack of biological plausibility. Existing supervised local learning methods partition networks into blocks but suffer performance degradation due to limited inter-block information flow, preventing them from replacing end-to-end training.
Method: MAN++ introduces a dynamic interaction mechanism using Exponential Moving Average (EMA) of parameters from adjacent blocks to enhance inter-block communication. An auxiliary network updated via EMA bridges information gaps between blocks. A learnable scaling bias is added to balance feature discrepancies between local blocks and optimize performance.
Result: Extensive experiments on image classification, object detection, and image segmentation across multiple network architectures show that MAN++ achieves performance comparable to end-to-end training while significantly reducing GPU memory usage.
Conclusion: MAN++ offers a viable alternative to conventional training methods by providing a novel perspective for supervised local learning that maintains competitive performance while addressing the memory and computational limitations of end-to-end backpropagation.
Abstract: Deep learning typically relies on end-to-end backpropagation for training, a method that inherently suffers from issues such as update locking during parameter optimization, high GPU memory consumption, and a lack of biological plausibility. In contrast, supervised local learning seeks to mitigate these challenges by partitioning the network into multiple local blocks and designing independent auxiliary networks to update each block separately. However, because gradients are propagated solely within individual local blocks, performance degradation occurs, preventing supervised local learning from supplanting end-to-end backpropagation. To address these limitations and facilitate inter-block information flow, we propose the Momentum Auxiliary Network++ (MAN++). MAN++ introduces a dynamic interaction mechanism by employing the Exponential Moving Average (EMA) of parameters from adjacent blocks to enhance communication across the network. The auxiliary network, updated via EMA, effectively bridges the information gap between blocks. Notably, we observed that directly applying EMA parameters can be suboptimal due to feature discrepancies between local blocks. To resolve this issue, we introduce a learnable scaling bias that balances feature differences, thereby further improving performance. We validate MAN++ through extensive experiments on tasks that include image classification, object detection, and image segmentation, utilizing multiple network architectures. The experimental results demonstrate that MAN++ achieves performance comparable to end-to-end training while significantly reducing GPU memory usage. Consequently, MAN++ offers a novel perspective for supervised local learning and presents a viable alternative to conventional training methods.
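The two mechanisms the summary names — an EMA bridge from the adjacent block and a learnable scaling bias — can be sketched directly in PyTorch; the momentum value and shape handling below are assumptions:

```python
import torch

@torch.no_grad()
def ema_update(aux_net: torch.nn.Module, next_block: torch.nn.Module,
               momentum: float = 0.999):
    """Momentum (EMA) update of a block's auxiliary network from the adjacent
    block's parameters, the inter-block bridge MAN++ describes. The momentum
    value and one-to-one parameter pairing are assumptions."""
    for p_aux, p_next in zip(aux_net.parameters(), next_block.parameters()):
        p_aux.mul_(momentum).add_(p_next, alpha=1.0 - momentum)

class LearnableScaleBias(torch.nn.Module):
    """Learnable scaling bias applied to EMA features to balance feature
    discrepancies between local blocks (shape handling is illustrative)."""
    def __init__(self, dim: int):
        super().__init__()
        self.scale = torch.nn.Parameter(torch.ones(dim))
        self.bias = torch.nn.Parameter(torch.zeros(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x * self.scale + self.bias
```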
[122] Beyond Label Semantics: Language-Guided Action Anatomy for Few-shot Action Recognition
Zefeng Qian, Xincheng Yao, Yifei Huang, Chongyang Zhang, Jiangyong Ying, Hong Sun
Main category: cs.CV
TL;DR: The paper proposes Language-Guided Action Anatomy (LGA), a framework that uses Large Language Models to break down action labels into atomic components and segments videos accordingly, achieving state-of-the-art performance in few-shot action recognition by capturing fine-grained spatiotemporal features.
Details
Motivation: Few-shot action recognition suffers from limited training data, and existing approaches using additional text modalities cannot fully exploit the subtle variations in human posture, motion dynamics, and object interactions that are critical for understanding actions beyond simple label semantics.
Method: The method uses LLMs to anatomize action labels into atomic action descriptions focusing on subject, motion, and object elements. A Visual Anatomy Module segments videos into atomic phases. Fine-grained fusion integrates textual and visual features at the atomic level, and a Multimodal Matching mechanism performs both video-video and video-text matching for classification.
Result: LGA achieves state-of-the-art performance across multiple few-shot action recognition benchmarks, demonstrating improved ability to capture rich spatiotemporal cues and create more generalizable prototypes in few-shot scenarios.
Conclusion: The Language-Guided Action Anatomy framework successfully leverages LLM knowledge to dissect action representations at an atomic level, enabling better few-shot action recognition by capturing fine-grained spatiotemporal features that go beyond simple label semantics.
Abstract: Few-shot action recognition (FSAR) aims to classify human actions in videos with only a small number of labeled samples per category. The scarcity of training data has driven recent efforts to incorporate additional modalities, particularly text. However, the subtle variations in human posture, motion dynamics, and object interactions that occur during different phases are critical inherent knowledge of actions that cannot be fully exploited by action labels alone. In this work, we propose Language-Guided Action Anatomy (LGA), a novel framework that goes beyond label semantics by leveraging Large Language Models (LLMs) to dissect the essential representational characteristics hidden beneath action labels. Guided by the prior knowledge encoded in LLM, LGA effectively captures rich spatiotemporal cues in few-shot scenarios. Specifically, for text, we prompt an off-the-shelf LLM to anatomize labels into sequences of atomic action descriptions, focusing on the three core elements of action (subject, motion, object). For videos, a Visual Anatomy Module segments actions into atomic video phases to capture the sequential structure of actions. A fine-grained fusion strategy then integrates textual and visual features at the atomic level, resulting in more generalizable prototypes. Finally, we introduce a Multimodal Matching mechanism, comprising both video-video and video-text matching, to ensure robust few-shot classification. Experimental results demonstrate that LGA achieves state-of-the-art performance across multiple FSAR benchmarks.
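The label-anatomization step lends itself to a short sketch; the prompt wording below is hypothetical, since the paper's exact prompt isn't given:

```python
ANATOMY_PROMPT = """Decompose the action label "{label}" into a sequence of
atomic action descriptions. For each atomic step, name the subject, the
motion, and the object involved, one step per line."""

def anatomize_label(llm, label: str) -> list[str]:
    """Prompt an off-the-shelf LLM to split an action label into atomic
    subject/motion/object descriptions. `llm` is any callable str -> str;
    the prompt text is illustrative, not the paper's."""
    response = llm(ANATOMY_PROMPT.format(label=label))
    return [line.strip() for line in response.splitlines() if line.strip()]
```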
[123] Dens3R: A Foundation Model for 3D Geometry Prediction
Xianze Fang, Jingnan Gao, Zhe Wang, Zhuo Chen, Xingyu Ren, Jiangjing Lyu, Qiaomu Ren, Zhonglei Yang, Xiaokang Yang, Yichao Yan, Chengfei Lyu
Main category: cs.CV
TL;DR: Dens3R is a 3D foundation model that jointly predicts multiple geometric properties (depth, surface normals, point maps) from images using a unified framework, achieving better consistency and accuracy than methods that predict these quantities in isolation.
Details
Motivation: Existing 3D reconstruction methods typically predict single geometric quantities in isolation, which fails to ensure consistency between inherently correlated geometric properties like depth, surface normals, and point maps, limiting both accuracy and practical applicability.
Method: A two-stage training framework with a lightweight shared encoder-decoder backbone, position-interpolated rotary positional encoding for high-resolution robustness, integration of image-pair matching features with intrinsic invariance modeling, and a post-processing pipeline for geometrically consistent multi-view inference.
Result: Superior performance across various dense 3D prediction tasks, demonstrating accurate joint regression of multiple geometric quantities and consistent geometry perception from single-view to multi-view inputs.
Conclusion: Dens3R successfully addresses the limitation of existing methods by explicitly modeling structural coupling among different geometric properties, enabling unified and consistent 3D geometric prediction with broad application potential.
Abstract: Recent advances in dense 3D reconstruction have led to significant progress, yet achieving accurate unified geometric prediction remains a major challenge. Most existing methods are limited to predicting a single geometry quantity from input images. However, geometric quantities such as depth, surface normals, and point maps are inherently correlated, and estimating them in isolation often fails to ensure consistency, thereby limiting both accuracy and practical applicability. This motivates us to explore a unified framework that explicitly models the structural coupling among different geometric properties to enable joint regression. In this paper, we present Dens3R, a 3D foundation model designed for joint geometric dense prediction and adaptable to a wide range of downstream tasks. Dens3R adopts a two-stage training framework to progressively build a pointmap representation that is both generalizable and intrinsically invariant. Specifically, we design a lightweight shared encoder-decoder backbone and introduce position-interpolated rotary positional encoding to maintain expressive power while enhancing robustness to high-resolution inputs. By integrating image-pair matching features with intrinsic invariance modeling, Dens3R accurately regresses multiple geometric quantities such as surface normals and depth, achieving consistent geometry perception from single-view to multi-view inputs. Additionally, we propose a post-processing pipeline that supports geometrically consistent multi-view inference. Extensive experiments demonstrate the superior performance of Dens3R across various dense 3D prediction tasks and highlight its potential for broader applications.
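Position-interpolated rotary encoding admits a compact sketch: when the token sequence exceeds the training length, positions are rescaled into the trained range before the rotary rotation. The 1-D parameterization below is a common RoPE formulation, assumed rather than taken from the paper:

```python
import torch

def interpolated_rope(x, base=10000.0, train_len=None):
    """Rotary positional encoding with position interpolation (illustrative).
    x: (B, N, D) token features with even D. When N exceeds `train_len`,
    positions are rescaled into the trained range so high-resolution inputs
    stay within the encoding's trained regime."""
    B, N, D = x.shape
    pos = torch.arange(N, dtype=torch.float32, device=x.device)
    if train_len is not None and N > train_len:
        pos = pos * (train_len / N)  # interpolate positions into [0, train_len)
    freqs = base ** (-torch.arange(0, D, 2, dtype=torch.float32, device=x.device) / D)
    angles = pos[:, None] * freqs[None, :]  # (N, D/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin  # rotate each (even, odd) feature pair
    out[..., 1::2] = x1 * sin + x2 * cos
    return out
```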
[124] MotionShot: Adaptive Motion Transfer across Arbitrary Objects for Text-to-Video Generation
Yanchen Liu, Yanan Sun, Zhening Xing, Junyao Gao, Kai Chen, Wenjie Pei
Main category: cs.CV
TL;DR: MotionShot is a training-free framework that enables high-fidelity motion transfer between objects with significant appearance or structural differences by using semantic feature matching and morphological alignment techniques.
Details
Motivation: Existing text-to-video methods struggle to smoothly transfer motion from a reference object to a target object when there are significant differences in appearance or structure between them, creating a need for better motion transfer capabilities.
Method: MotionShot uses a two-stage approach: (1) semantic feature matching to ensure high-level alignments between reference and target objects, and (2) reference-to-target shape retargeting to establish low-level morphological alignments. Motion is encoded using temporal attention to enable coherent transfer across disparate objects.
Result: Extensive experiments demonstrate that MotionShot can coherently transfer motion across objects even when there are significant appearance and structure disparities between the reference and target objects.
Conclusion: MotionShot successfully addresses the challenge of motion transfer between dissimilar objects by parsing reference-target correspondences in a fine-grained manner, achieving high-fidelity motion transfer while preserving appearance coherence in a training-free framework.
Abstract: Existing text-to-video methods struggle to transfer motion smoothly from a reference object to a target object with significant differences in appearance or structure between them. To address this challenge, we introduce MotionShot, a training-free framework capable of parsing reference-target correspondences in a fine-grained manner, thereby achieving high-fidelity motion transfer while preserving coherence in appearance. To be specific, MotionShot first performs semantic feature matching to ensure high-level alignments between the reference and target objects. It then further establishes low-level morphological alignments through reference-to-target shape retargeting. By encoding motion with temporal attention, our MotionShot can coherently transfer motion across objects, even in the presence of significant appearance and structure disparities, demonstrated by extensive experiments. The project page is available at: https://motionshot.github.io/.
[125] M-SpecGene: Generalized Foundation Model for RGBT Multispectral Vision
Kailai Zhou, Fuqiang Yang, Shixian Wang, Bihan Wen, Chongde Zi, Linsen Chen, Qiu Shen, Xun Cao
Main category: cs.CV
TL;DR: This paper introduces M-SpecGene, a generalized RGBT multispectral foundation model that learns modality-invariant representations through self-supervised learning, addressing limitations of existing case-by-case research paradigms in RGBT vision tasks.
Details
Motivation: Current RGBT vision research follows a case-by-case paradigm with manually customized models that suffer from artificial inductive bias, modality bias, and data bottleneck limitations. There's a need for a unified foundation model that can generalize across multiple RGBT tasks.
Method: The authors develop M-SpecGene using: (1) Cross-Modality Structural Sparsity (CMSS) metric to quantify information density across RGB and thermal modalities, and (2) GMM-CMSS progressive masking strategy for flexible, easy-to-hard, object-centric pre-training in a self-supervised manner on large-scale broad data.
Result: Comprehensive experiments demonstrate M-SpecGene’s generalizability across eleven datasets covering four different RGBT downstream tasks, validating the effectiveness of the proposed foundation model approach.
Conclusion: M-SpecGene successfully addresses the limitations of existing RGBT research paradigms by providing a unified foundation model that learns modality-invariant representations, offering new insights into multispectral fusion and integrating prior case-by-case studies into a generalized framework.
Abstract: RGB-Thermal (RGBT) multispectral vision is essential for robust perception in complex environments. Most RGBT tasks follow a case-by-case research paradigm, relying on manually customized models to learn task-oriented representations. Nevertheless, this paradigm is inherently constrained by artificial inductive bias, modality bias, and data bottleneck. To address these limitations, we make the initial attempt to build a Generalized RGBT MultiSpectral foundation model (M-SpecGene), which aims to learn modality-invariant representations from large-scale broad data in a self-supervised manner. M-SpecGene provides new insights into multispectral fusion and integrates prior case-by-case studies into a unified paradigm. Considering the unique characteristic of information imbalance in RGBT data, we introduce the Cross-Modality Structural Sparsity (CMSS) metric to quantify the information density across two modalities. Then we develop the GMM-CMSS progressive masking strategy to facilitate a flexible, easy-to-hard, and object-centric pre-training process. Comprehensive experiments validate M-SpecGene’s generalizability across eleven datasets for four RGBT downstream tasks. The code will be available at https://github.com/CalayZhou/M-SpecGene.
[126] Scene Text Detection and Recognition “in light of” Challenging Environmental Conditions using Aria Glasses Egocentric Vision Cameras
Joseph De Mathia, Carlos Francisco Moreno-García
Main category: cs.CV
TL;DR: This paper investigates Scene Text Detection and Recognition (STDR) performance using Meta’s Project Aria smart glasses, analyzing how environmental factors affect OCR algorithms and introducing image upscaling as a key preprocessing technique to improve accuracy.
Details
Motivation: With the rise of wearable technology and egocentric vision applications, there is a need to understand how real-world environmental conditions affect text recognition performance in AR/smart glasses scenarios, particularly for assistive and research applications like asset inspection and nutrition analysis.
Method: The researchers created a custom dataset using Meta’s Project Aria smart glasses under controlled conditions and evaluated two OCR pipelines (EAST with CRNN, and EAST with PyTesseract) while systematically varying environmental variables including lighting, distance, and resolution. They also explored image upscaling as preprocessing and integrated eye-gaze tracking for processing optimization.
Result: Resolution and distance significantly impact recognition accuracy, while lighting has less predictable effects. Image upscaling as preprocessing reduced Character Error Rate (CER) from 0.65 to 0.48. Eye-gaze tracking showed potential for optimizing processing efficiency by focusing on user attention zones.
Conclusion: The study successfully benchmarks STDR performance under realistic wearable conditions and establishes groundwork for adaptive, user-aware AR systems. The findings provide insights for developing robust, context-sensitive text recognition systems for future assistive and research applications.
Abstract: In an era where wearable technology is reshaping applications, Scene Text Detection and Recognition (STDR) becomes a straightforward choice through the lens of egocentric vision. Leveraging Meta’s Project Aria smart glasses, this paper investigates how environmental variables, such as lighting, distance, and resolution, affect the performance of state-of-the-art STDR algorithms in real-world scenarios. We introduce a novel, custom-built dataset captured under controlled conditions and evaluate two OCR pipelines: EAST with CRNN, and EAST with PyTesseract. Our findings reveal that resolution and distance significantly influence recognition accuracy, while lighting plays a less predictable role. Notably, image upscaling emerged as a key pre-processing technique, reducing Character Error Rate (CER) from 0.65 to 0.48. We further demonstrate the potential of integrating eye-gaze tracking to optimise processing efficiency by focusing on user attention zones. This work not only benchmarks STDR performance under realistic conditions but also lays the groundwork for adaptive, user-aware AR systems. Our contributions aim to inspire future research in robust, context-sensitive text recognition for assistive and research-oriented applications, such as asset inspection and nutrition analysis. The code is available at https://github.com/josepDe/Project_Aria_STR.
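The reported CER improvement comes from a simple preprocessing step — upscaling before OCR — which can be reproduced in a few lines (the interpolation mode and scale factor below are assumptions, not the paper's exact settings):

```python
import cv2
import pytesseract

def ocr_with_upscaling(image_path: str, scale: int = 2) -> str:
    """Upscale before OCR, the preprocessing step the paper reports cutting
    CER from 0.65 to 0.48 (interpolation choice and scale are assumptions)."""
    image = cv2.imread(image_path)
    upscaled = cv2.resize(image, None, fx=scale, fy=scale,
                          interpolation=cv2.INTER_CUBIC)
    gray = cv2.cvtColor(upscaled, cv2.COLOR_BGR2GRAY)  # grayscale helps Tesseract
    return pytesseract.image_to_string(gray)
```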
[127] One Polyp Identifies All: One-Shot Polyp Segmentation with SAM via Cascaded Priors and Iterative Prompt Evolution
Xinyu Mao, Xiaohan Xing, Fei Meng, Jianbang Liu, Fan Bai, Qiang Nie, Max Meng
Main category: cs.CV
TL;DR: OP-SAM is a one-shot polyp segmentation framework that automatically generates prompts from a single annotated image using SAM, achieving state-of-the-art performance with 76.93% IoU on Kvasir dataset while eliminating the need for manual prompt input or extensive annotations.
Details
Motivation: Traditional polyp segmentation methods struggle with morphological variability and domain shifts, requiring frequent retraining and large-scale annotations. While SAM shows promise for polyp segmentation, its prompt-dependent nature limits automation in medical applications due to the labor-intensive manual prompt input requirement.
Method: The paper proposes OP-SAM with three key components: (1) Correlation-based Prior Generation (CPG) for semantic label transfer from a single annotated image, (2) Scale-cascaded Prior Fusion (SPF) to handle polyp size variations and filter noisy transfers, and (3) Euclidean Prompt Evolution (EPE) for iterative prompt refinement instead of using all prompts simultaneously.
Result: OP-SAM achieves 76.93% IoU on the Kvasir dataset, surpassing the previous state-of-the-art by 11.44%. The method demonstrates effectiveness across five datasets, providing accurate and generalizable polyp segmentation without requiring additional annotation burdens.
Conclusion: OP-SAM successfully addresses the automation challenge of SAM-based polyp segmentation by enabling automatic prompt generation from a single annotated image, achieving superior performance while maintaining generalizability and reducing annotation requirements for medical polyp detection applications.
Abstract: Polyp segmentation is vital for early colorectal cancer detection, yet traditional fully supervised methods struggle with morphological variability and domain shifts, requiring frequent retraining. Additionally, reliance on large-scale annotations is a major bottleneck due to the time-consuming and error-prone nature of polyp boundary labeling. Recently, vision foundation models like Segment Anything Model (SAM) have demonstrated strong generalizability and fine-grained boundary detection with sparse prompts, effectively addressing key polyp segmentation challenges. However, SAM’s prompt-dependent nature limits automation in medical applications, since manually inputting prompts for each image is labor-intensive and time-consuming. We propose OP-SAM, a One-shot Polyp segmentation framework based on SAM that automatically generates prompts from a single annotated image, ensuring accurate and generalizable segmentation without additional annotation burdens. Our method introduces Correlation-based Prior Generation (CPG) for semantic label transfer and Scale-cascaded Prior Fusion (SPF) to adapt to polyp size variations as well as filter out noisy transfers. Instead of dumping all prompts at once, we devise Euclidean Prompt Evolution (EPE) for iterative prompt refinement, progressively enhancing segmentation quality. Extensive evaluations across five datasets validate OP-SAM’s effectiveness. Notably, on Kvasir, it achieves 76.93% IoU, surpassing the state-of-the-art by 11.44%.
[128] Navigating Large-Pose Challenge for High-Fidelity Face Reenactment with Video Diffusion Model
Mingtao Guo, Guanyu Xing, Yanci Zhang, Yanli Liu
Main category: cs.CV
TL;DR: FRVD introduces a novel face reenactment framework that combines implicit facial keypoints extraction, motion alignment through warping, and a Warping Feature Mapper to leverage pretrained image-to-video model priors for high-fidelity talking head generation under large pose variations.
Details
Motivation: Existing face reenactment methods struggle with large pose variations due to warping artifacts when using implicit keypoints or limitations of coarse facial landmarks when using explicit keypoints, creating a need for better handling of extreme pose changes while preserving identity and motion fidelity.
Method: The method employs a motion extractor to extract implicit facial keypoints from source and driving images, performs motion alignment through a warping module, and introduces a Warping Feature Mapper (WFM) that maps the warped source image into the motion-aware latent space of a pretrained image-to-video model to correct warping degradation and enhance temporal coherence.
Result: FRVD achieves superior performance compared to existing methods in pose accuracy, identity preservation, and visual quality, particularly excelling in challenging scenarios with extreme pose variations, as demonstrated through extensive experiments.
Conclusion: The Face Reenactment Video Diffusion model successfully addresses the limitations of existing face reenactment approaches by leveraging pretrained video model priors and warping feature mapping, enabling high-fidelity talking head generation even under large pose changes while maintaining identity preservation and temporal coherence.
Abstract: Face reenactment aims to generate realistic talking head videos by transferring motion from a driving video to a static source image while preserving the source identity. Although existing methods based on either implicit or explicit keypoints have shown promise, they struggle with large pose variations due to warping artifacts or the limitations of coarse facial landmarks. In this paper, we present the Face Reenactment Video Diffusion model (FRVD), a novel framework for high-fidelity face reenactment under large pose changes. Our method first employs a motion extractor to extract implicit facial keypoints from the source and driving images to represent fine-grained motion and to perform motion alignment through a warping module. To address the degradation introduced by warping, we introduce a Warping Feature Mapper (WFM) that maps the warped source image into the motion-aware latent space of a pretrained image-to-video (I2V) model. This latent space encodes rich priors of facial dynamics learned from large-scale video data, enabling effective warping correction and enhancing temporal coherence. Extensive experiments show that FRVD achieves superior performance over existing methods in terms of pose accuracy, identity preservation, and visual quality, especially in challenging scenarios with extreme pose variations.
[129] Mamba-OTR: a Mamba-based Solution for Online Take and Release Detection from Untrimmed Egocentric Video
Alessandro Sebastiano Catinello, Giovanni Maria Farinella, Antonino Furnari
Main category: cs.CV
TL;DR: This paper introduces Mamba-OTR, a Mamba architecture-based model for online detection of object take and release actions in egocentric videos, achieving superior performance (45.48 mp-mAP) compared to transformer and vanilla Mamba baselines while being computationally efficient for real-world deployment.
Details
Motivation: The task of online detection of take and release actions in untrimmed egocentric videos faces significant challenges including severe label imbalance with temporally sparse positive annotations, the need for precise temporal predictions, and computational efficiency requirements for real-world online deployment.
Method: The authors propose Mamba-OTR based on the Mamba architecture, designed to exploit temporal recurrence during inference while being trained on short video clips. The training pipeline incorporates focal loss to address label imbalance and a novel regularization scheme that aligns model predictions with the evaluation metric.
Result: Mamba-OTR achieves 45.48 mp-mAP in sliding-window mode and 43.35 mp-mAP in streaming mode on EPIC-KITCHENS-100 dataset, significantly outperforming vanilla transformer (20.32 mp-mAP) and vanilla Mamba (25.16 mp-mAP) baselines. The model demonstrates superior performance in both accuracy and efficiency, particularly when evaluating full-length videos or high frame-rate sequences.
Conclusion: Mamba-OTR provides a strong baseline for online take and release detection, demonstrating that Mamba architecture can effectively handle temporal modeling in egocentric video understanding while maintaining computational efficiency. The proposed method successfully addresses the challenges of label imbalance and temporal precision in online video analysis.
Abstract: This work tackles the problem of Online detection of Take and Release (OTR) of an object in untrimmed egocentric videos. This task is challenging due to severe label imbalance, with temporally sparse positive annotations, and the need for precise temporal predictions. Furthermore, methods need to be computationally efficient in order to be deployed in real-world online settings. To address these challenges, we propose Mamba-OTR, a model based on the Mamba architecture. Mamba-OTR is designed to exploit temporal recurrence during inference while being trained on short video clips. To address label imbalance, our training pipeline incorporates the focal loss and a novel regularization scheme that aligns model predictions with the evaluation metric. Extensive experiments on EPIC-KITCHENS-100, comparisons with transformer-based approaches, and the evaluation of different training and test schemes demonstrate the superiority of Mamba-OTR in both accuracy and efficiency. These findings are particularly evident when evaluating full-length videos or high frame-rate sequences, even when trained on short video snippets for computational convenience. The proposed Mamba-OTR achieves a noteworthy mp-mAP of 45.48 when operating in a sliding-window fashion, and 43.35 in streaming mode, versus the 20.32 of a vanilla transformer and 25.16 of a vanilla Mamba, thus providing a strong baseline for OTR. We will publicly release the source code of Mamba-OTR to support future research.
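The focal loss the training pipeline incorporates is a standard component; a minimal binary version is sketched below with the usual default hyperparameters, which the paper may tune differently:

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """Binary focal loss, the imbalance-handling term in Mamba-OTR's training
    pipeline (alpha/gamma here are the common defaults, an assumption).
    Down-weights easy negatives so sparse take/release positives dominate."""
    bce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p = torch.sigmoid(logits)
    p_t = p * targets + (1 - p) * (1 - targets)          # prob of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1 - p_t) ** gamma * bce).mean()
```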
[130] LPTR-AFLNet: Lightweight Integrated Chinese License Plate Rectification and Recognition Network
Guangzhu Xu, Pengcheng Zuo, Zhi Ke, Bangjun Lei
Main category: cs.CV
TL;DR: The paper proposes LPTR-AFLNet, a lightweight unified network that simultaneously corrects perspective distortion and recognizes Chinese license plates in real-time (under 10ms) on edge devices, achieving high accuracy across challenging scenarios.
Details
Motivation: Chinese License Plate Recognition faces significant challenges in unconstrained environments due to perspective distortions from various shooting angles and the complexity of handling both single-line and double-line license plates. Edge devices have limited computational resources, requiring efficient solutions for real-time deployment.
Method: The authors develop LPTR-AFLNet, which combines a perspective transformation correction module (PTR) with an optimized recognition network AFLNet. The method uses recognition output as weak supervision to guide correction, incorporates an improved attention module to reduce character confusion, and employs Focal Loss to handle class imbalance during training.
Result: LPTR-AFLNet demonstrates exceptional performance in correcting perspective distortion and recognizing double-line license plates while maintaining high accuracy across various challenging scenarios. The method achieves processing times under 10 milliseconds on lower-mid-range GPU platforms.
Conclusion: The proposed LPTR-AFLNet successfully addresses the dual challenges of perspective correction and license plate recognition in a single lightweight network, making it practical for real-time deployment on edge devices with limited computational resources while maintaining high accuracy.
Abstract: Chinese License Plate Recognition (CLPR) faces numerous challenges in unconstrained and complex environments, particularly due to perspective distortions caused by various shooting angles and the correction of single-line and double-line license plates. Considering the limited computational resources of edge devices, developing a low-complexity, end-to-end integrated network for both correction and recognition is essential for achieving real-time and efficient deployment. In this work, we propose a lightweight, unified network named LPTR-AFLNet for correcting and recognizing Chinese license plates, which combines a perspective transformation correction module (PTR) with an optimized license plate recognition network, AFLNet. The network leverages the recognition output as a weak supervisory signal to effectively guide the correction process, ensuring accurate perspective distortion correction. To enhance recognition accuracy, we introduce several improvements to LPRNet, including an improved attention module to reduce confusion among similar characters and the use of Focal Loss to address class imbalance during training. Experimental results demonstrate the exceptional performance of LPTR-AFLNet in rectifying perspective distortion and recognizing double-line license plate images, maintaining high recognition accuracy across various challenging scenarios. Moreover, on a lower-mid-range GPU platform, the method runs in less than 10 milliseconds, indicating its practical efficiency and broad applicability.
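The PTR module is learned end-to-end with weak supervision from recognition, but the geometric operation it performs — warping a quadrilateral plate region to a canonical rectangle — can be illustrated with OpenCV (the corner ordering and the 94x24 output size, LPRNet's usual input, are assumptions):

```python
import cv2
import numpy as np

def rectify_plate(image, corners, out_w=94, out_h=24):
    """Perspective rectification of a license plate from four detected corners
    (illustrative of what PTR learns to do; not the paper's learned module).
    `corners` is ordered top-left, top-right, bottom-right, bottom-left."""
    src = np.asarray(corners, dtype=np.float32)
    dst = np.float32([[0, 0], [out_w, 0], [out_w, out_h], [0, out_h]])
    M = cv2.getPerspectiveTransform(src, dst)
    return cv2.warpPerspective(image, M, (out_w, out_h))
```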
[131] GPI-Net: Gestalt-Guided Parallel Interaction Network via Orthogonal Geometric Consistency for Robust Point Cloud Registration
Weikang Gu, Mingyue Han, Li Xue, Heng Dong, Changcai Yang, Riqing Chen, Lifang Wei
Main category: cs.CV
TL;DR: This paper proposes GPI-Net, a Gestalt-guided parallel interaction network that uses orthogonal geometric consistency to improve point cloud registration by better fusing local and global features through Gestalt principles.
Details
Motivation: Point cloud registration requires accurate identification of high-quality correspondences, but existing methods struggle with fusing local and global features due to feature redundancy and complex spatial relationships. The authors aim to leverage Gestalt principles to better analyze local-global relationships.
Method: The method introduces GPI-Net with three key components: (1) an orthogonal integration strategy to reduce redundant information and create compact global structures, (2) a Gestalt Feature Attention (GFA) block using hybrid self-attention and cross-attention mechanisms, and (3) a Dual-path Multi-Granularity (DMG) block for parallel interaction aggregation across different granularities.
Result: Extensive experiments on various challenging tasks demonstrate superior performance of GPI-Net compared to existing methods in point cloud registration.
Conclusion: The proposed GPI-Net effectively addresses the challenge of feature fusion in point cloud registration by applying Gestalt principles, achieving better performance through orthogonal geometric consistency and multi-granularity parallel interaction mechanisms.
Abstract: The accurate identification of high-quality correspondences is a prerequisite task in feature-based point cloud registration. However, it is extremely challenging to handle the fusion of local and global features due to feature redundancy and complex spatial relationships. Given that Gestalt principles provide key advantages in analyzing local and global relationships, we propose a novel Gestalt-guided Parallel Interaction Network via orthogonal geometric consistency (GPI-Net) in this paper. It utilizes Gestalt principles to facilitate complementary communication between local and global information. Specifically, we introduce an orthogonal integration strategy to optimally reduce redundant information and generate a more compact global structure for high-quality correspondences. To capture geometric features in correspondences, we leverage a Gestalt Feature Attention (GFA) block through a hybrid utilization of self-attention and cross-attention mechanisms. Furthermore, to facilitate the integration of local detail information into the global structure, we design an innovative Dual-path Multi-Granularity parallel interaction aggregation (DMG) block to promote information exchange across different granularities. Extensive experiments on various challenging tasks demonstrate the superior performance of our proposed GPI-Net in comparison to existing methods. The code will be released at https://github.com/gwk/GPI-Net.
[132] STAR: A Benchmark for Astronomical Star Fields Super-Resolution
Kuo-Cheng Wu, Guohang Zhuang, Jinyang Huang, Xiang Zhang, Wanli Ouyang, Yan Lu
Main category: cs.CV
TL;DR: This paper introduces STAR, a large-scale astronomical super-resolution dataset with 54,738 flux-consistent star field image pairs, and proposes a Flux-Invariant Super Resolution (FISR) model that outperforms existing methods by 24.84% on flux consistency metrics for astronomical imaging applications.
Details
Motivation: Existing astronomical super-resolution datasets have three critical limitations: flux inconsistency, object-crop setting, and insufficient data diversity, which significantly impede the development of astronomical super-resolution methods. There is a need for cost-effective high-resolution astronomical imaging to detect faraway celestial objects and enable precise structural analysis.
Method: The authors create the STAR dataset using Hubble Space Telescope high-resolution observations paired with physically faithful low-resolution counterparts generated through a flux-preserving data generation pipeline. They propose a Flux-Invariant Super Resolution (FISR) model designed to accurately infer flux-consistent high-resolution images from input photometry. They also introduce a novel Flux Error (FE) metric to evaluate SR models from a physical perspective.
Result: The STAR dataset contains 54,738 flux-consistent star field image pairs covering wide celestial regions. The proposed FISR model outperforms several state-of-the-art super-resolution methods by 24.84% on the novel flux consistency metric. Extensive experiments demonstrate the effectiveness of the proposed method and validate the value of the dataset.
Conclusion: The paper successfully addresses the limitations of existing astronomical super-resolution datasets by providing STAR, a comprehensive dataset with flux-consistent image pairs. The proposed FISR model shows superior performance in maintaining flux consistency, which is crucial for astrophysical applications. The work enables systematic development of field-level astronomical super-resolution models and provides valuable resources for the astronomical imaging community.
Abstract: Super-resolution (SR) advances astronomical imaging by enabling cost-effective high-resolution capture, crucial for detecting faraway celestial objects and precise structural analysis. However, existing datasets for astronomical SR (ASR) exhibit three critical limitations: flux inconsistency, object-crop setting, and insufficient data diversity, significantly impeding ASR development. We propose STAR, a large-scale astronomical SR dataset containing 54,738 flux-consistent star field image pairs covering wide celestial regions. These pairs combine Hubble Space Telescope high-resolution observations with physically faithful low-resolution counterparts generated through a flux-preserving data generation pipeline, enabling systematic development of field-level ASR models. To further empower the ASR community, STAR provides a novel Flux Error (FE) metric to evaluate SR models from a physical perspective. Leveraging this benchmark, we propose a Flux-Invariant Super Resolution (FISR) model that can accurately infer flux-consistent high-resolution images from input photometry, surpassing several state-of-the-art SR methods by 24.84% on a newly designed flux consistency metric, showing the suitability of our method for astrophysics. Extensive experiments demonstrate the effectiveness of our proposed method and the value of our dataset. Code and models are available at https://github.com/GuoCheng12/STAR.
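The summary doesn't spell out the Flux Error metric; one plausible reading — relative deviation of total photometric flux between the super-resolved and reference images — is sketched below and should be taken as an assumption, not the paper's definition:

```python
import numpy as np

def flux_error(sr_image: np.ndarray, hr_image: np.ndarray) -> float:
    """One plausible reading of a flux-consistency metric: relative difference
    in total photometric flux between the super-resolved and reference images.
    The paper's exact Flux Error (FE) definition may differ."""
    flux_sr = float(sr_image.sum())
    flux_hr = float(hr_image.sum())
    return abs(flux_sr - flux_hr) / max(abs(flux_hr), 1e-12)
```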
[133] From Flat to Round: Redefining Brain Decoding with Surface-Based fMRI and Cortex Structure
Sijin Yu, Zijiao Chen, Wenxuan Wu, Shengxian Chen, Zhongliang Liu, Jingxin Nie, Xiaofen Xing, Xiangmin Xu, Xin Zhang
Main category: cs.CV
TL;DR: This paper proposes a novel approach for reconstructing visual stimuli from brain fMRI data by using sphere tokenization to preserve spatial cortical structure, integrating structural MRI for personalized anatomical encoding, and employing positive-sample mixup to efficiently use multiple scans, achieving superior reconstruction performance over existing methods.
Details
Motivation: Existing methods for reconstructing visual stimuli from fMRI data have significant limitations: they flatten spatial information, overlook critical brain structure-function relationships, and neglect individual anatomical variations. These issues reduce reconstruction accuracy and biological interpretability.
Method: The paper introduces three key innovations: (1) a sphere tokenizer that models fMRI signals as spatially coherent 2D spherical data on the cortical surface, (2) integration of structural MRI (sMRI) data to enable personalized encoding of individual anatomical variations, and (3) a positive-sample mixup strategy for efficiently leveraging multiple fMRI scans associated with the same visual stimulus.
Result: Experiments demonstrate superior reconstruction performance compared to state-of-the-art (SOTA) methods, showing enhanced reconstruction accuracy, biological interpretability, and generalizability across individuals.
Conclusion: The biologically informed approach successfully addresses the limitations of existing methods by preserving spatial cortical structure and incorporating individual anatomical differences, leading to more accurate and interpretable visual stimulus reconstruction from brain activity.
Abstract: Reconstructing visual stimuli from human brain activity (e.g., fMRI) bridges neuroscience and computer vision by decoding neural representations. However, existing methods often overlook critical brain structure-function relationships, flattening spatial information and neglecting individual anatomical variations. To address these issues, we propose (1) a novel sphere tokenizer that explicitly models fMRI signals as spatially coherent 2D spherical data on the cortical surface; (2) integration of structural MRI (sMRI) data, enabling personalized encoding of individual anatomical variations; and (3) a positive-sample mixup strategy for efficiently leveraging multiple fMRI scans associated with the same visual stimulus. Collectively, these innovations enhance reconstruction accuracy, biological interpretability, and generalizability across individuals. Experiments demonstrate superior reconstruction performance compared to SOTA methods, highlighting the effectiveness and interpretability of our biologically informed approach.
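The positive-sample mixup strategy reduces to convexly combining two scans of the same stimulus; the Beta-distributed mixing coefficient below is the common mixup choice, assumed rather than confirmed by the paper:

```python
import torch

def positive_mixup(scans: torch.Tensor, alpha: float = 0.4) -> torch.Tensor:
    """Mix two fMRI scans of the same visual stimulus (illustrative of the
    positive-sample mixup strategy). `scans` has shape (num_repeats, ...)
    with num_repeats >= 2; the Beta prior is a common choice, an assumption."""
    i, j = torch.randperm(scans.shape[0])[:2]       # pick two distinct repeats
    lam = torch.distributions.Beta(alpha, alpha).sample()
    return lam * scans[i] + (1 - lam) * scans[j]    # both share the same target
```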
[134] Are Foundation Models All You Need for Zero-shot Face Presentation Attack Detection?
Lazaro Janier Gonzalez-Sole, Juan E. Tapia, Christoph Busch
Main category: cs.CV
TL;DR: This paper proposes a zero-shot presentation attack detection (PAD) framework using foundation models to protect face recognition systems against spoofing attacks without requiring extensive training data, achieving superior performance compared to state-of-the-art methods on challenging datasets.
Details
Motivation: Face recognition systems are vulnerable to presentation attacks that are easy to create and execute. Current deep learning PAD approaches require large amounts of training data and lack generalizability to unknown attack types or databases, limiting their real-world effectiveness.
Method: The authors propose a zero-shot PAD framework that leverages foundation models to detect presentation attacks without requiring extensive training on attack data. They assess the effectiveness and generalizability of foundation models in established and challenging experimental scenarios.
Result: Foundation models achieve competitive performance in difficult scenarios with minimal effort compared to advanced PAD mechanisms trained on attack datasets. The top-performing foundation model significantly outperforms state-of-the-art methods using the leaving-one-out protocol on the SiW-Mv2 database containing challenging 2D and 3D attacks.
Conclusion: Foundation models demonstrate strong potential for zero-shot presentation attack detection, offering better generalizability and performance than traditional PAD approaches while requiring minimal training effort, making them promising for protecting face recognition systems against unknown attack types.
Abstract: Although face recognition systems have undergone an impressive evolution in the last decade, these technologies are vulnerable to attack presentations (AP). These attacks are mostly easy to create and, by executing them against the system’s capture device, the malicious actor can impersonate an authorised subject and thus gain access to the latter’s information (e.g., financial transactions). To protect facial recognition schemes against presentation attacks, state-of-the-art deep learning presentation attack detection (PAD) approaches require a large amount of data to produce reliable detection performances and, even then, their performance decreases for unknown presentation attack instruments (PAI) or databases (information not seen during training), i.e. they lack generalisability. To mitigate the above problems, this paper focuses on zero-shot PAD. To do so, we first assess the effectiveness and generalisability of foundation models in established and challenging experimental scenarios and then propose a simple but effective framework for zero-shot PAD. Experimental results show that these models are able to achieve, with minimal effort, performance in difficult scenarios comparable to more advanced PAD mechanisms, whose weights were optimised mainly with training sets that included APs and bona fide presentations. The top-performing foundation model outperforms the best state-of-the-art method by a clear margin under the leave-one-out protocol on the SiW-Mv2 database, which contains challenging unknown 2D and 3D attacks.
[135] ADCD-Net: Robust Document Image Forgery Localization via Adaptive DCT Feature and Hierarchical Content Disentanglement
Kahim Wong, Jicheng Zhou, Haiwei Wu, Yain-Whar Si, Jiantao Zhou
Main category: cs.CV
TL;DR: ADCD-Net is a robust document forgery detection model that adaptively uses RGB/DCT forensic traces and addresses document-specific challenges like text-background disparities and DCT sensitivity to misalignment, achieving 20.79% improvement over state-of-the-art methods across various distortions.
Details
Motivation: Existing forgery detectors for natural images struggle with document images due to seamless blending of tampered regions into uniform backgrounds and structured text. Current document-specific methods lack robustness against various degradations like resizing and cropping, limiting their practical deployment.
Method: ADCD-Net adaptively modulates DCT feature contribution based on predicted alignment scores to handle block misalignment sensitivity. It uses hierarchical content disentanglement to address text-background disparities and constructs pristine prototypes that capture traces of untampered regions to enhance localization accuracy.
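A minimal sketch of the alignment-gated DCT fusion idea, assuming a per-image sigmoid gate (module shapes and names are illustrative, not the authors' implementation):

```python
import torch
import torch.nn as nn

class AlignmentGatedFusion(nn.Module):
    """Sketch: weight DCT-branch features by a predicted block-alignment
    score before fusing with RGB features, so DCT cues are down-weighted
    when resizing/cropping breaks block alignment."""
    def __init__(self, dim):
        super().__init__()
        self.align_head = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(dim, 1), nn.Sigmoid())       # alignment score in [0, 1]
        self.fuse = nn.Conv2d(2 * dim, dim, kernel_size=1)

    def forward(self, rgb_feat, dct_feat):
        # rgb_feat, dct_feat: (B, dim, H, W)
        s = self.align_head(dct_feat).view(-1, 1, 1, 1)  # per-image gate
        dct_feat = s * dct_feat                          # modulate DCT cues
        return self.fuse(torch.cat([rgb_feat, dct_feat], dim=1))
```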
Result: ADCD-Net demonstrates superior forgery localization performance, consistently outperforming state-of-the-art methods by 20.79% averaged over 5 types of distortions, showing improved resilience to various degradations including resizing and cropping.
Conclusion: The proposed ADCD-Net successfully addresses key challenges in document image forgery detection by adaptively leveraging forensic traces and incorporating document-specific characteristics, resulting in significantly improved robustness and localization accuracy compared to existing methods.
Abstract: The advancement of image editing tools has enabled malicious manipulation of sensitive document images, underscoring the need for robust document image forgery detection. Though forgery detectors for natural images have been extensively studied, they struggle with document images, as the tampered regions can be seamlessly blended into the uniform document background (BG) and structured text. On the other hand, existing document-specific methods lack sufficient robustness against various degradations, which limits their practical deployment. This paper presents ADCD-Net, a robust document forgery localization model that adaptively leverages the RGB/DCT forensic traces and integrates key characteristics of document images. Specifically, to address the DCT traces’ sensitivity to block misalignment, we adaptively modulate the DCT feature contribution based on a predicted alignment score, resulting in much improved resilience to various distortions, including resizing and cropping. Also, a hierarchical content disentanglement approach is proposed to boost the localization performance via mitigating the text-BG disparities. Furthermore, noticing the predominantly pristine nature of BG regions, we construct a pristine prototype capturing traces of untampered regions, and eventually enhance both the localization accuracy and robustness. Our proposed ADCD-Net demonstrates superior forgery localization performance, consistently outperforming state-of-the-art methods by 20.79% averaged over 5 types of distortions. The code is available at https://github.com/KAHIMWONG/ACDC-Net.
[136] ReasonVQA: A Multi-hop Reasoning Benchmark with Structural Knowledge for Visual Question Answering
Thuy-Duong Tran, Trung-Kien Tran, Manfred Hauswirth, Danh Le Phuoc
Main category: cs.CV
TL;DR: The paper introduces ReasonVQA, a large-scale Visual Question Answering dataset that automatically integrates encyclopedic knowledge to generate complex, multi-hop questions, demonstrating significant challenges for current VQA models.
Details
Motivation: Current VQA datasets lack the complexity and scale needed to properly evaluate models' reasoning capabilities, particularly for questions requiring external knowledge and multi-hop reasoning.
Method: The authors developed a low-cost framework that automatically integrates structured encyclopedic knowledge to construct complex, multi-hop questions for Visual Question Answering, making the dataset easily scalable with respect to input images.
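A minimal sketch of template-based multi-hop question construction over a structured knowledge base (the `kg` layout, single-edge walk, and question template are illustrative assumptions, not the paper's generation framework):

```python
def multi_hop_question(kg, entity, hops=2):
    """Walk `hops` relations from an image-grounded entity and compose a
    question whose answer is the terminal node. `kg` maps an entity name
    to a list of (relation, object) edges; all names are hypothetical."""
    path, node = [], entity
    for _ in range(hops):
        relation, node = kg[node][0]   # take the first edge for the sketch
        path.append(relation)
    question = f"What is the {' of the '.join(reversed(path))} of this {entity}?"
    return question, node              # the answer is the terminal entity

# Example: kg = {"landmark": [("architect", "I. M. Pei")],
#                "I. M. Pei": [("nationality", "American")]}
# -> ("What is the nationality of the architect of this landmark?", "American")
```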
Result: The dataset surpasses existing knowledge-requiring VQA datasets by more than an order of magnitude in size. Evaluation of state-of-the-art VQA models shows that ReasonVQA poses significant challenges to current approaches.
Conclusion: ReasonVQA successfully addresses the need for a large-scale, challenging VQA dataset with complex reasoning requirements, providing a valuable benchmark for advancing VQA research and highlighting current model limitations.
Abstract: In this paper, we propose a new dataset, ReasonVQA, for the Visual Question Answering (VQA) task. Our dataset is automatically integrated with structured encyclopedic knowledge and constructed using a low-cost framework, which is capable of generating complex, multi-hop questions. We evaluated state-of-the-art VQA models on ReasonVQA, and the empirical results demonstrate that ReasonVQA poses significant challenges to these models, highlighting its potential for benchmarking and advancing the field of VQA. Additionally, our dataset can be easily scaled with respect to input images; the current version surpasses the largest existing datasets requiring external knowledge by more than an order of magnitude.
[137] Sparse-View 3D Reconstruction: Recent Advances and Open Challenges
Tanveer Younis, Zhanglin Cheng
Main category: cs.CV
TL;DR: This survey reviews recent advances in sparse-view 3D reconstruction methods, comparing neural implicit models (NeRF variants), explicit point-cloud approaches (3D Gaussian Splatting), and hybrid frameworks that use diffusion and vision foundation models to overcome limitations of traditional methods when minimal image overlap is available.
Details
Motivation: Traditional 3D reconstruction methods like structure-from-motion (SfM) and multiview stereo (MVS) fail in sparse-view scenarios with minimal image overlap, which are common in robotics, AR/VR, and autonomous systems where dense image acquisition is impractical.
Method: The survey analyzes three main approaches: (1) neural implicit models including NeRF and regularized versions, (2) explicit point-cloud-based methods like 3D Gaussian Splatting, and (3) hybrid frameworks leveraging priors from diffusion and vision foundation models. The analysis focuses on geometric regularization, explicit shape modeling, and generative inference techniques.
Result: Comparative analysis on standard benchmarks reveals key trade-offs between reconstruction accuracy, efficiency, and generalization across different methods. The survey identifies how these approaches mitigate common artifacts like floaters and pose ambiguities in sparse-view settings.
Conclusion: The survey highlights persistent challenges in domain generalization and pose-free reconstruction, outlining future directions toward developing 3D-native generative priors and achieving real-time, unconstrained sparse-view reconstruction. It provides a unified perspective bridging geometry-based, neural implicit, and generative methods.
Abstract: Sparse-view 3D reconstruction is essential for applications in which dense image acquisition is impractical, such as robotics, augmented/virtual reality (AR/VR), and autonomous systems. In these settings, minimal image overlap prevents reliable correspondence matching, causing traditional methods, such as structure-from-motion (SfM) and multiview stereo (MVS), to fail. This survey reviews the latest advances in neural implicit models (e.g., NeRF and its regularized versions), explicit point-cloud-based approaches (e.g., 3D Gaussian Splatting), and hybrid frameworks that leverage priors from diffusion and vision foundation models (VFMs). We analyze how geometric regularization, explicit shape modeling, and generative inference are used to mitigate artifacts such as floaters and pose ambiguities in sparse-view settings. Comparative results on standard benchmarks reveal key trade-offs between the reconstruction accuracy, efficiency, and generalization. Unlike previous reviews, our survey provides a unified perspective on geometry-based, neural implicit, and generative (diffusion-based) methods. We highlight the persistent challenges in domain generalization and pose-free reconstruction and outline future directions for developing 3D-native generative priors and achieving real-time, unconstrained sparse-view reconstruction.
[138] C2-Evo: Co-Evolving Multimodal Data and Model for Self-Improving Reasoning
Xiuwei Chen, Wentao Hu, Hanhui Li, Jun Zhou, Zisheng Chen, Meng Cao, Yihan Zeng, Kui Zhang, Yu-Jie Yuan, Jianhua Han, Hang Xu, Xiaodan Liang
Main category: cs.CV
TL;DR: C2-Evo is a closed-loop self-improving framework that jointly evolves training data and model capabilities for multimodal large language models, addressing the problem of mismatched complexity between visual and textual data through cross-modal data evolution and adaptive model training.
Details
Motivation: Existing MLLMs require high-quality vision-language datasets with carefully curated task complexities that are costly and hard to scale. Current self-improving methods suffer from: (1) separate augmentation of visual/textual data causing complexity mismatches, and (2) separated evolution of data and models leading to difficulty level mismatches during training.
Method: C2-Evo employs two evolution loops: (1) Cross-modal data evolution loop that generates complex multimodal problems combining structured textual sub-problems with iteratively specified geometric diagrams, and (2) Data-model evolution loop that adaptively selects generated problems based on base model performance and conducts alternating supervised fine-tuning and reinforcement learning.
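A minimal sketch of one co-evolution round under these two loops (all helper functions and the difficulty band are illustrative assumptions, not the authors' code):

```python
def c2_evo_round(model, base_problems, evolve, solve_rate, sft, rl,
                 lo=0.2, hi=0.8):
    """One round: evolve harder multimodal problems, keep those whose
    measured solve rate falls in a band matched to the current model,
    then alternate supervised fine-tuning and RL on the kept pool.
    `evolve`, `solve_rate`, `sft`, and `rl` are hypothetical helpers."""
    candidates = [evolve(p) for p in base_problems]      # cross-modal evolution
    selected = [p for p in candidates
                if lo <= solve_rate(model, p) <= hi]     # difficulty matching
    model = sft(model, selected)                         # supervised pass
    model = rl(model, selected)                          # RL pass on same pool
    return model, selected
```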
Result: The method continuously refines both model and training data, achieving considerable performance gains across multiple mathematical reasoning benchmarks through the joint evolution approach.
Conclusion: C2-Evo successfully addresses the challenges in MLLM improvement by automatically evolving both data and model capabilities in a closed-loop manner, demonstrating effectiveness in mathematical reasoning tasks and offering a scalable solution for enhancing multimodal AI systems.
Abstract: Recent advances in multimodal large language models (MLLMs) have shown impressive reasoning capabilities. However, further enhancing existing MLLMs necessitates high-quality vision-language datasets with carefully curated task complexities, which are both costly and challenging to scale. Although recent self-improving models that iteratively refine themselves offer a feasible solution, they still suffer from two core challenges: (i) most existing methods augment visual or textual data separately, resulting in discrepancies in data complexity (e.g., over-simplified diagrams paired with redundant textual descriptions); and (ii) the evolution of data and models is also separated, leading to scenarios where models are exposed to tasks with mismatched difficulty levels. To address these issues, we propose C2-Evo, an automatic, closed-loop self-improving framework that jointly evolves both training data and model capabilities. Specifically, given a base dataset and a base model, C2-Evo enhances them by a cross-modal data evolution loop and a data-model evolution loop. The former loop expands the base dataset by generating complex multimodal problems that combine structured textual sub-problems with iteratively specified geometric diagrams, while the latter loop adaptively selects the generated problems based on the performance of the base model, to conduct supervised fine-tuning and reinforcement learning alternately. Consequently, our method continuously refines its model and training data, and consistently obtains considerable performance gains across multiple mathematical reasoning benchmarks. Our code, models, and datasets will be released.
[139] Towards Railway Domain Adaptation for LiDAR-based 3D Detection: Road-to-Rail and Sim-to-Real via SynDRA-BBox
Xavier Diaz, Gianluca D’Amico, Raul Dominguez-Sanchez, Federico Nesti, Max Ronecker, Giorgio Buttazzo
Main category: cs.CV
TL;DR: Researchers created SynDRA-BBox, the first synthetic dataset for railway object detection, addressing the lack of real-world annotated railway data. They adapted automotive domain adaptation methods for railway contexts and demonstrated promising results in transferring synthetic data knowledge to real-world railway perception tasks.
Details
Motivation: The railway sector lacks publicly available real-world annotated datasets, making it challenging to test and validate new perception solutions for automatic train operations. This gap hinders the development of robust vision-based algorithms essential for advanced railway functionalities.
Method: The authors introduce SynDRA-BBox, a synthetic dataset for 2D and 3D object detection in railway scenarios. They adapt a state-of-the-art semi-supervised domain adaptation method, originally developed for automotive perception, to the railway context to enable transferability of synthetic data to real-world railway object detection tasks.
Result: Experimental results demonstrate promising performance in railway object detection tasks. The study shows the effectiveness of using synthetic datasets combined with domain adaptation techniques for advancing perception capabilities in railway environments, successfully transferring knowledge from synthetic to real-world scenarios.
Conclusion: SynDRA-BBox successfully addresses the data scarcity problem in railway perception research by providing the first synthetic dataset specifically designed for railway object detection. The combination of synthetic data generation and domain adaptation techniques proves effective for developing robust vision-based algorithms for automatic train operations, opening new possibilities for railway perception research.
Abstract: In recent years, interest in automatic train operations has significantly increased. To enable advanced functionalities, robust vision-based algorithms are essential for perceiving and understanding the surrounding environment. However, the railway sector suffers from a lack of publicly available real-world annotated datasets, making it challenging to test and validate new perception solutions in this domain. To address this gap, we introduce SynDRA-BBox, a synthetic dataset designed to support object detection and other vision-based tasks in realistic railway scenarios. To the best of our knowledge, it is the first synthetic dataset specifically tailored for 2D and 3D object detection in the railway domain; the dataset is publicly available at https://syndra.retis.santannapisa.it. In the presented evaluation, a state-of-the-art semi-supervised domain adaptation method, originally developed for automotive perception, is adapted to the railway context, enabling the transferability of synthetic data to 3D object detection. Experimental results demonstrate promising performance, highlighting the effectiveness of synthetic datasets and domain adaptation techniques in advancing perception capabilities for railway environments.
[140] Combined Image Data Augmentations diminish the benefits of Adaptive Label Smoothing
Georg Siedel, Ekagra Gupta, Weijia Shao, Silvia Vock, Andrey Morozov
Main category: cs.CV
TL;DR: This paper extends adaptive label smoothing (soft augmentation) from random-crop to other aggressive augmentations like random erasing and noise injection, finding it effective for limited transformation types but harmful when used with diverse augmentation methods or excessive smoothing.
Details
Motivation: The motivation is to extend the soft augmentation framework beyond random-crop augmentation to other types of aggressive data augmentations, investigating whether adaptive label smoothing can improve regularization for a broader range of transformation techniques.
Method: The method extends adaptive label smoothing by reducing label confidence based on the magnitude of augmentation applied, specifically testing this approach on random erasing and noise injection augmentations rather than just random-crop transformations.
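A minimal sketch of magnitude-aware soft targets for Random Erasing, assuming a linear confidence schedule (the schedule and the scale `k` are illustrative, not the paper's exact rule):

```python
import torch

def soft_targets(labels, erase_fraction, num_classes, k=1.0):
    """Build soft targets whose true-class confidence decays with the
    fraction of the image removed by Random Erasing.
    labels: (B,) int64; erase_fraction: (B,) float in [0, 1]."""
    conf = torch.clamp(1.0 - k * erase_fraction, min=1.0 / num_classes)
    off = (1.0 - conf) / (num_classes - 1)         # mass spread over the rest
    t = off.unsqueeze(1).repeat(1, num_classes)    # (B, C) off-class mass
    t.scatter_(1, labels.unsqueeze(1), conf.unsqueeze(1))
    return t  # use with a soft-target cross-entropy loss
```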
Result: The results show that adaptive label smoothing is effective for random erasing and noise injection when used individually, enabling stronger regularization with higher-intensity Random Erasing. However, benefits disappear when combined with diverse transformations like TrivialAugment, and excessive smoothing reduces robustness to common corruptions.
Conclusion: The conclusion is that adaptive label smoothing should only be applied when training data distribution is dominated by a limited, homogeneous set of image transformation types, as it becomes ineffective or harmful with diverse augmentation strategies.
Abstract: Soft augmentation regularizes the supervised learning process of image classifiers by reducing label confidence of a training sample based on the magnitude of random-crop augmentation applied to it. This paper extends this adaptive label smoothing framework to other types of aggressive augmentations beyond random-crop. Specifically, we demonstrate the effectiveness of the method for random erasing and noise injection data augmentation. Adaptive label smoothing permits stronger regularization via higher-intensity Random Erasing. However, its benefits vanish when applied with a diverse range of image transformations as in the state-of-the-art TrivialAugment method, and excessive label smoothing harms robustness to common corruptions. Our findings suggest that adaptive label smoothing should only be applied when the training data distribution is dominated by a limited, homogeneous set of image transformation types.
[141] Zebra-CoT: A Dataset for Interleaved Vision Language Reasoning
Ang Li, Charles Wang, Kaiyu Yue, Zikui Cai, Ollie Liu, Deqing Fu, Peng Guo, Wang Bill Zhu, Vatsal Sharan, Robin Jia, Willie Neiswanger, Furong Huang, Tom Goldstein, Micah Goldblum
Main category: cs.CV
TL;DR: The paper introduces Zebra-CoT, a large-scale dataset of 182,384 samples for training multimodal models to use visual aids in reasoning, achieving significant performance improvements on visual chain-of-thought tasks.
Details
Motivation: Current multimodal models struggle with Visual Chain of Thought reasoning due to poor off-the-shelf performance that hinders reinforcement learning and lack of high-quality visual CoT training data. Humans naturally use visual aids like diagrams when solving complex problems, so models should learn similar capabilities.
Method: Created Zebra-CoT dataset with 182,384 samples containing logically coherent interleaved text-image reasoning traces across four task categories: scientific questions (geometry, physics, algorithms), 2D visual reasoning (visual search, jigsaw puzzles), 3D reasoning (multi-hop inference, robot planning), and visual logic problems (chess). Fine-tuned Anole-7B and Bagel-7B models on this dataset.
Result: Fine-tuning Anole-7B on Zebra-CoT achieved +12% improvement in test-set accuracy and up to +13% performance gain on standard VLM benchmark evaluations. Fine-tuning Bagel-7B produced a model capable of generating high-quality interleaved visual reasoning chains.
Conclusion: Zebra-CoT effectively enables development of multimodal reasoning abilities in language models. The dataset and trained models are open-sourced to support further research and evaluation of visual chain-of-thought reasoning capabilities.
Abstract: Humans often use visual aids, for example diagrams or sketches, when solving complex problems. Training multimodal models to do the same, known as Visual Chain of Thought (Visual CoT), is challenging due to: (1) poor off-the-shelf visual CoT performance, which hinders reinforcement learning, and (2) the lack of high-quality visual CoT training data. We introduce $\textbf{Zebra-CoT}$, a diverse large-scale dataset with 182,384 samples, containing logically coherent interleaved text-image reasoning traces. We focus on four categories of tasks where sketching or visual reasoning is especially natural, spanning scientific questions such as geometry, physics, and algorithms; 2D visual reasoning tasks like visual search and jigsaw puzzles; 3D reasoning tasks including 3D multi-hop inference, embodied and robot planning; visual logic problems and strategic games like chess. Fine-tuning the Anole-7B model on the Zebra-CoT training corpus results in an improvement of +12% in our test-set accuracy and yields up to +13% performance gain on standard VLM benchmark evaluations. Fine-tuning Bagel-7B yields a model that generates high-quality interleaved visual reasoning chains, underscoring Zebra-CoT’s effectiveness for developing multimodal reasoning abilities. We open-source our dataset and models to support development and evaluation of visual CoT.
[142] Robust Noisy Pseudo-label Learning for Semi-supervised Medical Image Segmentation Using Diffusion Model
Lin Xi, Yingliang Ma, Cheng Wang, Sandra Howell, Aldo Rinaldi, Kawal S. Rhode
Main category: cs.CV
TL;DR: A novel diffusion-based framework for semi-supervised medical image segmentation that uses prototype-based contrastive consistency to improve segmentation accuracy with limited labeled data, achieving state-of-the-art performance on medical imaging datasets.
Details
Motivation: Pixel-level annotations in medical imaging are expensive and time-consuming, requiring expert collaboration. Existing semi-supervised methods struggle with noisy pseudo-labels that disrupt semantic structure in latent space, limiting segmentation accuracy when working with limited annotated data.
Method: A diffusion-based framework that enforces prototype-based contrastive consistency during the denoising diffusion process. The method uses class prototypes as centralized semantic anchors in latent space rather than explicitly delineating semantic boundaries, improving robustness against noisy pseudo-labels in semi-supervised learning.
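A minimal sketch of prototype-anchored consistency written as an InfoNCE-style loss over class prototypes (the exact objective in the paper may differ; the temperature and shapes are assumptions):

```python
import torch
import torch.nn.functional as F

def prototype_consistency_loss(feats, pseudo_labels, prototypes, tau=0.1):
    """Pull each latent feature toward its (pseudo-)class prototype and
    away from the others, so prototypes act as centralized anchors.
    feats: (N, D); pseudo_labels: (N,) int64; prototypes: (C, D)."""
    feats = F.normalize(feats, dim=-1)
    protos = F.normalize(prototypes, dim=-1)
    logits = feats @ protos.T / tau        # (N, C) cosine similarities
    return F.cross_entropy(logits, pseudo_labels)
```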
Result: The method outperforms state-of-the-art medical image segmentation approaches on EndoScapes2023 and the newly introduced MOSXAV benchmark dataset under semi-supervised learning settings. The approach demonstrates enhanced robustness and data efficiency for medical image segmentation tasks.
Conclusion: The proposed diffusion-based framework provides a robust and data-efficient solution for semi-supervised medical image segmentation with enhanced flexibility and strong potential for various clinical applications. The method successfully addresses the challenge of learning from limited labeled data while maintaining high segmentation accuracy.
Abstract: Obtaining pixel-level annotations in the medical domain is both expensive and time-consuming, often requiring close collaboration between clinical experts and developers. Semi-supervised medical image segmentation aims to leverage limited annotated data alongside abundant unlabeled data to achieve accurate segmentation. However, existing semi-supervised methods often struggle to structure semantic distributions in the latent space due to noise introduced by pseudo-labels. In this paper, we propose a novel diffusion-based framework for semi-supervised medical image segmentation. Our method introduces a constraint into the latent structure of semantic labels during the denoising diffusion process by enforcing prototype-based contrastive consistency. Rather than explicitly delineating semantic boundaries, the model leverages class prototypes, i.e., centralized semantic representations in the latent space, as anchors. This strategy improves the robustness of dense predictions, particularly in the presence of noisy pseudo-labels. We also introduce a new publicly available benchmark: Multi-Object Segmentation in X-ray Angiography Videos (MOSXAV), which provides detailed, manually annotated segmentation ground truth for multiple anatomical structures in X-ray angiography videos. Extensive experiments on the EndoScapes2023 and MOSXAV datasets demonstrate that our method outperforms state-of-the-art medical image segmentation approaches under the semi-supervised learning setting. This work presents a robust and data-efficient diffusion model that offers enhanced flexibility and strong potential for a wide range of clinical applications.
[143] VGGT-Long: Chunk it, Loop it, Align it – Pushing VGGT’s Limits on Kilometer-scale Long RGB Sequences
Kai Deng, Zexin Ti, Jiawei Xu, Jian Yang, Jin Xie
Main category: cs.CV
TL;DR: VGGT-Long is a system that enables foundation models to perform kilometer-scale monocular 3D reconstruction in unbounded outdoor environments by using chunk-based processing, overlapping alignment, and lightweight loop closure optimization without requiring camera calibration or retraining.
Details
Motivation: Foundation models for 3D vision show remarkable capabilities but face memory limitations when extending to large-scale RGB stream 3D reconstruction, particularly in unbounded outdoor environments like those needed for autonomous driving applications.
Method: The authors propose a chunk-based processing strategy combined with overlapping alignment and lightweight loop closure optimization to address scalability bottlenecks. The system works without requiring camera calibration, depth supervision, or model retraining.
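A minimal sketch of overlapping-chunk alignment, registering each chunk's point predictions to the already-merged trajectory with a similarity (Umeyama) fit on the shared frames; loop closure is omitted and the data layout is an illustrative assumption:

```python
import numpy as np

def align_chunks(chunk_points, overlap):
    """Merge per-chunk 3D predictions into one frame. Each element of
    `chunk_points` is an (N, 3) array whose first `overlap` rows
    re-predict, row for row, the last `overlap` rows of the previous
    chunk. Illustrative only, not the authors' implementation."""
    def umeyama(src, dst):  # fit s, R, t minimizing ||s*R*src + t - dst||
        mu_s, mu_d = src.mean(0), dst.mean(0)
        cov = (dst - mu_d).T @ (src - mu_s) / len(src)
        U, D, Vt = np.linalg.svd(cov)
        S = np.eye(3)
        S[2, 2] = np.sign(np.linalg.det(U @ Vt))   # avoid reflections
        R = U @ S @ Vt
        s = np.trace(np.diag(D) @ S) / (src - mu_s).var(axis=0).sum()
        return s, R, mu_d - s * R @ mu_s

    merged = [chunk_points[0]]
    for cur in chunk_points[1:]:
        prev = merged[-1]
        s, R, t = umeyama(cur[:overlap], prev[-overlap:])  # shared frames
        merged.append(s * cur @ R.T + t)                   # chain transforms
    return merged
```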
Result: VGGT-Long successfully runs on long RGB sequences where foundation models typically fail and achieves trajectory and reconstruction performance comparable to traditional methods. Evaluations on KITTI, Waymo, and Virtual KITTI datasets show accurate and consistent geometry across various conditions.
Conclusion: The work demonstrates the potential of leveraging foundation models for scalable monocular 3D scene reconstruction in real-world settings, particularly for autonomous driving scenarios, by overcoming previous memory and scalability limitations through efficient processing strategies.
Abstract: Foundation models for 3D vision have recently demonstrated remarkable capabilities in 3D perception. However, extending these models to large-scale RGB stream 3D reconstruction remains challenging due to memory limitations. In this work, we propose VGGT-Long, a simple yet effective system that pushes the limits of monocular 3D reconstruction to kilometer-scale, unbounded outdoor environments. Our approach addresses the scalability bottlenecks of existing models through a chunk-based processing strategy combined with overlapping alignment and lightweight loop closure optimization. Without requiring camera calibration, depth supervision or model retraining, VGGT-Long achieves trajectory and reconstruction performance comparable to traditional methods. We evaluate our method on KITTI, Waymo, and Virtual KITTI datasets. VGGT-Long not only runs successfully on long RGB sequences where foundation models typically fail, but also produces accurate and consistent geometry across various conditions. Our results highlight the potential of leveraging foundation models for scalable monocular 3D scene reconstruction in real-world settings, especially for autonomous driving scenarios. Code is available at https://github.com/DengKaiCQ/VGGT-Long.
[144] DenseSR: Image Shadow Removal as Dense Prediction
Yu-Fan Lin, Chia-Ming Lee, Chih-Chung Hsu
Main category: cs.CV
TL;DR: DenseSR is a dense prediction framework for single-image shadow removal that combines deep scene understanding with geometric-semantic priors and a novel Dense Fusion Block to achieve high-quality restoration while preserving both intra-shadow details and sharp boundaries.
Details
Motivation: Traditional shadow removal methods fail to simultaneously recover intra-shadow details and maintain sharp boundaries due to non-uniform content degradation and inherent ambiguity in challenging indirect illumination scenarios, resulting in inconsistent restoration and blurring that negatively affect downstream applications and viewing experience.
Method: The DenseSR framework employs two key strategies: (1) deep scene understanding guided by geometric-semantic priors to resolve ambiguity and implicitly localize shadows, and (2) high-fidelity restoration via a novel Dense Fusion Block (DFB) containing an Adaptive Content Smoothing Module (ACSM) for consistent appearance and a Texture-Boundary Recuperation Module (TBRM) for fine textures and sharp boundaries.
Result: Extensive experimental results demonstrate that DenseSR outperforms existing shadow removal methods by effectively tackling inconsistent restoration and blurring issues while preserving both consistency and fidelity in the restored images.
Conclusion: The proposed DenseSR framework successfully addresses the limitations of traditional shadow removal methods by leveraging dense prediction perspective with geometric-semantic priors and adaptive component processing, achieving superior restoration quality that maintains both intra-shadow details and sharp boundaries.
Abstract: Shadows are a common factor degrading image quality. Single-image shadow removal (SR), particularly under challenging indirect illumination, is hampered by non-uniform content degradation and inherent ambiguity. Consequently, traditional methods often fail to simultaneously recover intra-shadow details and maintain sharp boundaries, resulting in inconsistent restoration and blurring that negatively affect both downstream applications and the overall viewing experience. To overcome these limitations, we propose DenseSR, which approaches the problem from a dense prediction perspective to emphasize restoration quality. This framework uniquely synergizes two key strategies: (1) deep scene understanding guided by geometric-semantic priors to resolve ambiguity and implicitly localize shadows, and (2) high-fidelity restoration via a novel Dense Fusion Block (DFB) in the decoder. The DFB employs adaptive component processing, using an Adaptive Content Smoothing Module (ACSM) for consistent appearance and a Texture-Boundary Recuperation Module (TBRM) for fine textures and sharp boundaries, thereby directly tackling the inconsistent restoration and blurring issues. These purposefully processed components are effectively fused, yielding an optimized feature representation preserving both consistency and fidelity. Extensive experimental results demonstrate the merits of our approach over existing methods. Our code is available at https://github.com/VanLinLin/DenseSR
[145] Survival Modeling from Whole Slide Images via Patch-Level Graph Clustering and Mixture Density Experts
Ardhendu Sekhar, Vasu Soni, Keshav Aske, Garima Jain, Pranav Jeevan, Amit Sethi
Main category: cs.CV
TL;DR: A modular framework for cancer survival prediction from pathology images that uses dynamic patch selection, graph-guided clustering, attention mechanisms, and mixture density modeling to achieve state-of-the-art performance on renal cancer and lung adenocarcinoma datasets.
Details
Motivation: To improve cancer-specific survival prediction from whole slide pathology images by addressing challenges of large image sizes, tissue heterogeneity, and complex survival distributions that limit current state-of-the-art methods.
Method: A four-component modular framework: (1) dynamic patch selection via quantile-based thresholding to identify prognostically informative regions, (2) graph-guided k-means clustering for capturing phenotype-level heterogeneity, (3) attention mechanisms modeling intra- and inter-cluster relationships for contextualizing features, and (4) expert-guided mixture density modeling using Gaussian mixture models for survival distribution estimation.
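A minimal sketch of step (1), quantile-based patch selection (the score source, e.g. an attention or tissue-content heuristic, and the quantile value are illustrative assumptions):

```python
import numpy as np

def select_patches(patch_scores, q=0.75):
    """Keep only WSI patches whose informativeness score exceeds the
    q-th quantile of all patch scores for that slide.
    patch_scores: (P,) array of per-patch scores."""
    thresh = np.quantile(patch_scores, q)
    return np.nonzero(patch_scores >= thresh)[0]   # indices of kept patches
```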
Result: Achieved concordance index of 0.712±0.028 and Brier score of 0.254±0.018 on TCGA-KIRC (renal cancer), and concordance index of 0.645±0.017 and Brier score of 0.281±0.031 on TCGA-LUAD (lung adenocarcinoma), significantly outperforming state-of-the-art methods.
Conclusion: The proposed modular framework demonstrates superior predictive performance compared to existing methods and shows generalizability across diverse cancer types, highlighting the potential of integrating spatial coherence, attention mechanisms, and expert-guided modeling for cancer survival prediction from pathology images.
Abstract: We introduce a modular framework for predicting cancer-specific survival from whole slide pathology images (WSIs) that significantly improves upon the state-of-the-art accuracy. Our method integrates four key components. Firstly, to tackle the large size of WSIs, we use dynamic patch selection via quantile-based thresholding for isolating prognostically informative tissue regions. Secondly, we use graph-guided k-means clustering to capture phenotype-level heterogeneity through spatial and morphological coherence. Thirdly, we use attention mechanisms that model both intra- and inter-cluster relationships to contextualize local features within global spatial relations between various types of tissue compartments. Finally, we use expert-guided mixture density modeling for estimating complex survival distributions using Gaussian mixture models. The proposed model achieves a concordance index of $0.712 \pm 0.028$ and Brier score of $0.254 \pm 0.018$ on TCGA-KIRC (renal cancer), and a concordance index of $0.645 \pm 0.017$ and Brier score of $0.281 \pm 0.031$ on TCGA-LUAD (lung adenocarcinoma). These results are significantly better than the state of the art and demonstrate the predictive potential of the proposed method across diverse cancer types.
[146] PlantSAM: An Object Detection-Driven Segmentation Pipeline for Herbarium Specimens
Youcef Sklab, Florian Castanet, Hanane Ariouat, Souhila Arib, Jean-Daniel Zucker, Eric Chenin, Edi Prifti
Main category: cs.CV
TL;DR: PlantSAM combines YOLOv10 and SAM2 to automatically segment herbarium images, removing background noise to improve plant classification accuracy by up to 4.36%.
Details
Motivation: Deep learning classification of herbarium images suffers from background heterogeneity that introduces noise and artifacts, misleading models and reducing classification accuracy. Background-related challenges need to be addressed to improve model performance.
Method: PlantSAM pipeline integrates YOLOv10 for plant region detection and Segment Anything Model (SAM2) for segmentation. YOLOv10 generates bounding box prompts to guide SAM2. Both models were fine-tuned on herbarium images and evaluated using IoU and Dice coefficient metrics.
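A minimal sketch of the detect-then-segment pipeline, with `detector` and `segmenter` as hypothetical callables standing in for the fine-tuned YOLOv10 and SAM2 models:

```python
import numpy as np

def plantsam_segment(image, detector, segmenter):
    """Detect plant regions, prompt a SAM-style segmenter with each box,
    and white-out everything outside the union of the resulting masks.
    image: (H, W, 3) uint8; detector returns [(x1, y1, x2, y2), ...];
    segmenter returns an (H, W) boolean mask for a given box prompt."""
    boxes = detector(image)                   # plant bounding-box proposals
    mask = np.zeros(image.shape[:2], dtype=bool)
    for box in boxes:
        mask |= segmenter(image, box=box)     # box-prompted binary mask
    segmented = image.copy()
    segmented[~mask] = 255                    # remove heterogeneous background
    return segmented, mask
```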
Result: PlantSAM achieved state-of-the-art segmentation performance with IoU of 0.94 and Dice coefficient of 0.97. Classification models using segmented images showed consistent improvements across five botanical traits, with accuracy gains up to 4.36% and F1-score improvements of 4.15%.
Conclusion: Background removal through automated segmentation significantly enhances herbarium image classification accuracy by allowing models to focus more effectively on foreground plant structures, demonstrating the critical importance of preprocessing in botanical image analysis.
Abstract: Deep learning-based classification of herbarium images is hampered by background heterogeneity, which introduces noise and artifacts that can potentially mislead models and reduce classification accuracy. Addressing these background-related challenges is critical to improving model performance. We introduce PlantSAM, an automated segmentation pipeline that integrates YOLOv10 for plant region detection and the Segment Anything Model (SAM2) for segmentation. YOLOv10 generates bounding box prompts to guide SAM2, enhancing segmentation accuracy. Both models were fine-tuned on herbarium images and evaluated using Intersection over Union (IoU) and Dice coefficient metrics. PlantSAM achieved state-of-the-art segmentation performance, with an IoU of 0.94 and a Dice coefficient of 0.97. Incorporating segmented images into classification models led to consistent performance improvements across five tested botanical traits, with accuracy gains of up to 4.36% and F1-score improvements of 4.15%. Our findings highlight the importance of background removal in herbarium image analysis, as it significantly enhances classification accuracy by allowing models to focus more effectively on the foreground plant structures.
[147] Spatial 3D-LLM: Exploring Spatial Awareness in 3D Vision-Language Models
Xiaoyan Wang, Zeju Li, Yifan Xu, Jiaxing Qi, Zhifei Yang, Ruifei Ma, Xiangde Liu, Chao Zhang
Main category: cs.CV
TL;DR: This paper proposes Spatial 3D-LLM, a multimodal large language model that enhances spatial awareness for 3D vision-language tasks through progressive spatial awareness schemes and location-enriched embeddings, achieving state-of-the-art performance on various 3D tasks.
Details
Motivation: Existing 3D multimodal LLMs have limited spatial awareness because they rely on compressing holistic 3D scene information or segmenting independent objects, which provides insufficient representation of the richness inherent in 3D scenes for spatial understanding.
Method: The authors develop Spatial 3D-LLM by integrating an LLM backbone with a progressive spatial awareness scheme that progressively captures spatial information as the perception field expands, generating location-enriched 3D scene embeddings to serve as visual prompts. They also introduce two novel tasks (3D object distance measurement and 3D layout editing) and construct a 3D instruction dataset called MODEL.
Result: Spatial 3D-LLM achieves state-of-the-art performance across a wide range of 3D vision-language tasks, with improvements attributed to the progressive spatial awareness scheme that mines more profound spatial information from 3D scenes.
Conclusion: The progressive spatial awareness scheme successfully enhances spatial understanding in 3D multimodal LLMs, enabling better performance on 3D vision-language tasks and demonstrating the importance of enriched spatial embeddings for capturing the complexity of 3D scenes.
Abstract: A new era has unlocked exciting possibilities for extending Large Language Models (LLMs) to tackle 3D vision-language tasks. However, most existing 3D multimodal LLMs (MLLMs) rely on compressing holistic 3D scene information or segmenting independent objects to perform these tasks, which limits their spatial awareness due to insufficient representation of the richness inherent in 3D scenes. To overcome these limitations, we propose Spatial 3D-LLM, a 3D MLLM specifically designed to enhance spatial awareness for 3D vision-language tasks by enriching the spatial embeddings of 3D scenes. Spatial 3D-LLM integrates an LLM backbone with a progressive spatial awareness scheme that progressively captures spatial information as the perception field expands, generating location-enriched 3D scene embeddings to serve as visual prompts. Furthermore, we introduce two novel tasks: 3D object distance measurement and 3D layout editing, and construct a 3D instruction dataset, MODEL, to evaluate the model’s spatial awareness capabilities. Experimental results demonstrate that Spatial 3D-LLM achieves state-of-the-art performance across a wide range of 3D vision-language tasks, revealing that the improvements stem from our progressive spatial awareness scheme, which mines more profound spatial information from 3D scenes. Our code is available at https://github.com/bjshuyuan/Spatial-3D-LLM.
[148] EarthCrafter: Scalable 3D Earth Generation via Dual-Sparse Latent Diffusion
Shang Liu, Chenjie Cao, Chaohui Yu, Wen Qian, Jing Wang, Fan Wang
Main category: cs.CV
TL;DR: This paper introduces EarthCrafter, a framework for large-scale 3D Earth surface generation using sparse-decoupled latent diffusion, trained on Aerial-Earth3D - the largest 3D aerial dataset with 50k scenes covering thousands of square kilometers of Earth’s surface.
Details
Motivation: Existing 3D generation methods cannot scale to geographic extents like modeling thousands of square kilometers of Earth's surface. There is a need for both large-scale 3D datasets and architectures capable of handling vast geographic scales while maintaining quality and detail.
Method: The approach has two main components: 1) Aerial-Earth3D dataset - 50k curated 600m x 600m scenes with 45M multi-view Google Earth frames, including pose annotations, depth maps, normals, and semantic segmentation. 2) EarthCrafter framework using sparse-decoupled latent diffusion that separates structural and textural generation through dual sparse 3D-VAEs and condition-aware flow matching models.
Result: EarthCrafter demonstrates substantially better performance in extremely large-scale 3D generation compared to existing methods. The framework supports versatile applications including semantic-guided urban layout generation and unconditional terrain synthesis while maintaining geographic plausibility.
Conclusion: The combination of the large-scale Aerial-Earth3D dataset and the EarthCrafter framework successfully addresses the challenge of scaling 3D generation to geographic extents, enabling realistic large-scale Earth surface modeling with various applications while preserving geographic authenticity.
Abstract: Despite the remarkable developments achieved by recent 3D generation works, scaling these methods to geographic extents, such as modeling thousands of square kilometers of Earth’s surface, remains an open challenge. We address this through a dual innovation in data infrastructure and model architecture. First, we introduce Aerial-Earth3D, the largest 3D aerial dataset to date, consisting of 50k curated scenes (each measuring 600m x 600m) captured across the U.S. mainland, comprising 45M multi-view Google Earth frames. Each scene provides pose-annotated multi-view images, depth maps, normals, semantic segmentation, and camera poses, with explicit quality control to ensure terrain diversity. Building on this foundation, we propose EarthCrafter, a tailored framework for large-scale 3D Earth generation via sparse-decoupled latent diffusion. Our architecture separates structural and textural generation: 1) Dual sparse 3D-VAEs compress high-resolution geometric voxels and textural 2D Gaussian Splats (2DGS) into compact latent spaces, largely alleviating the costly computation suffering from vast geographic scales while preserving critical information. 2) We propose condition-aware flow matching models trained on mixed inputs (semantics, images, or neither) to flexibly model latent geometry and texture features independently. Extensive experiments demonstrate that EarthCrafter performs substantially better in extremely large-scale generation. The framework further supports versatile applications, from semantic-guided urban layout generation to unconditional terrain synthesis, while maintaining geographic plausibility through our rich data priors from Aerial-Earth3D.
[149] Comparative validation of surgical phase recognition, instrument keypoint estimation, and instrument instance segmentation in endoscopy: Results of the PhaKIR 2024 challenge
Tobias Rueckert, David Rauber, Raphaela Maerkl, Leonard Klausmann, Suemeyye R. Yildiran, Max Gutbrod, Danilo Weber Nunes, Alvaro Fernandez Moreno, Imanol Luengo, Danail Stoyanov, Nicolas Toussaint, Enki Cho, Hyeon Bae Kim, Oh Sung Choo, Ka Young Kim, Seong Tae Kim, Gonçalo Arantes, Kehan Song, Jianjun Zhu, Junchen Xiong, Tingyi Lin, Shunsuke Kikuchi, Hiroki Matsuzaki, Atsushi Kouno, João Renato Ribeiro Manesco, João Paulo Papa, Tae-Min Choi, Tae Kyeong Jeong, Juyoun Park, Oluwatosin Alabi, Meng Wei, Tom Vercauteren, Runzhi Wu, Mengya Xu, An Wang, Long Bai, Hongliang Ren, Amine Yamlahi, Jakob Hennighausen, Lena Maier-Hein, Satoshi Kondo, Satoshi Kasai, Kousuke Hirasawa, Shu Yang, Yihui Wang, Hao Chen, Santiago Rodríguez, Nicolás Aparicio, Leonardo Manrique, Juan Camilo Lyons, Olivia Hosie, Nicolás Ayobi, Pablo Arbeláez, Yiping Li, Yasmina Al Khalil, Sahar Nasirihaghighi, Stefanie Speidel, Daniel Rueckert, Hubertus Feussner, Dirk Wilhelm, Christoph Palm
Main category: cs.CV
TL;DR: The paper introduces the PhaKIR sub-challenge at MICCAI 2024, featuring a novel multi-center dataset of laparoscopic cholecystectomy videos with unified annotations for surgical phase recognition, instrument keypoint estimation, and instrument segmentation to advance temporally aware, context-driven methods in robot-assisted minimally invasive surgery.
Details
Motivation: Reliable recognition and localization of surgical instruments in endoscopic videos is crucial for computer- and robot-assisted minimally invasive surgery applications, but robust performance under real-world conditions remains challenging. Incorporating surgical context like procedural phases can improve robustness and interpretability, but existing datasets lack the ability to jointly investigate instrument localization and procedural context.
Method: The authors organized the PhaKIR sub-challenge and created a novel multi-center dataset comprising thirteen full-length laparoscopic cholecystectomy videos from three medical institutions. The dataset provides unified annotations for three interrelated tasks: surgical phase recognition, instrument keypoint estimation, and instrument instance segmentation, enabling joint investigation and temporal information integration across entire procedures.
Result: The paper reports results and findings following BIAS guidelines for biomedical image analysis challenges. The dataset successfully enables joint investigation of instrument localization and procedural context within the same data while supporting temporal information integration across complete surgical procedures.
Conclusion: The PhaKIR sub-challenge advances the field by providing a unique benchmark for developing temporally aware, context-driven methods in robot-assisted minimally invasive surgery and offers a high-quality resource to support future research in surgical scene understanding.
Abstract: Reliable recognition and localization of surgical instruments in endoscopic video recordings are foundational for a wide range of applications in computer- and robot-assisted minimally invasive surgery (RAMIS), including surgical training, skill assessment, and autonomous assistance. However, robust performance under real-world conditions remains a significant challenge. Incorporating surgical context - such as the current procedural phase - has emerged as a promising strategy to improve robustness and interpretability. To address these challenges, we organized the Surgical Procedure Phase, Keypoint, and Instrument Recognition (PhaKIR) sub-challenge as part of the Endoscopic Vision (EndoVis) challenge at MICCAI 2024. We introduced a novel, multi-center dataset comprising thirteen full-length laparoscopic cholecystectomy videos collected from three distinct medical institutions, with unified annotations for three interrelated tasks: surgical phase recognition, instrument keypoint estimation, and instrument instance segmentation. Unlike existing datasets, ours enables joint investigation of instrument localization and procedural context within the same data while supporting the integration of temporal information across entire procedures. We report results and findings in accordance with the BIAS guidelines for biomedical image analysis challenges. The PhaKIR sub-challenge advances the field by providing a unique benchmark for developing temporally aware, context-driven methods in RAMIS and offers a high-quality resource to support future research in surgical scene understanding.
[150] A Multimodal Deviation Perceiving Framework for Weakly-Supervised Temporal Forgery Localization
Wenbo Xu, Junyan Wu, Wei Lu, Xiangyang Luo, Qian Wang
Main category: cs.CV
TL;DR: A multimodal framework (MDP) for detecting deepfake video segments using only video-level labels, combining visual and audio features with cross-modal attention to identify temporal forgeries without requiring precise timestamp annotations.
Details
Motivation: Current deepfake detection methods are restrictive, time-consuming, and difficult to scale for large datasets when treating detection as classification or requiring fully-supervised temporal localization with precise annotations.
Method: MDP framework with two key components: (1) Multimodal Interaction (MI) mechanism using temporal property preserving cross-modal attention to measure visual-audio relevance in probabilistic embedding space, and (2) extensible deviation perceiving loss that enlarges deviation between adjacent forged segments while reducing deviation in genuine samples.
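A minimal sketch of a deviation-perceiving objective under video-level labels (the hinge form, margin, and tensor shapes are illustrative assumptions, not the paper's exact loss):

```python
import torch

def deviation_perceiving_loss(feats, is_forged, margin=1.0):
    """Push the feature deviation between adjacent segments up for forged
    videos and down for genuine ones, using only video-level labels.
    feats: (B, T, D) per-segment embeddings; is_forged: (B,) in {0, 1}."""
    dev = (feats[:, 1:] - feats[:, :-1]).norm(dim=-1).mean(dim=1)  # (B,)
    forged_term = torch.relu(margin - dev)   # enlarge deviation when forged
    genuine_term = dev                       # suppress deviation when genuine
    return torch.where(is_forged.bool(), forged_term, genuine_term).mean()
```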
Result: The framework achieves comparable performance to fully-supervised approaches across several evaluation metrics while using only weak video-level supervision, demonstrating effectiveness in temporal forgery localization.
Conclusion: The proposed MDP framework successfully addresses scalability issues in deepfake detection by enabling weakly-supervised temporal forgery localization through multimodal deviation perception, providing a more practical solution for large-scale deepfake forensics.
Abstract: Current research on Deepfake forensics often treats detection as a classification task or a temporal forgery localization problem; such formulations are usually restrictive, time-consuming, and challenging to scale for large datasets. To resolve these issues, we present a multimodal deviation perceiving framework for weakly-supervised temporal forgery localization (MDP), which aims to identify temporal partial forged segments using only video-level annotations. The MDP proposes a novel multimodal interaction mechanism (MI) and an extensible deviation perceiving loss to perceive multimodal deviation, which achieves refined localization of the start and end timestamps of forged segments. Specifically, MI introduces a temporal property preserving cross-modal attention to measure the relevance between the visual and audio modalities in the probabilistic embedding space. It can identify the inter-modality deviation and construct comprehensive video features for temporal forgery localization. To explore further temporal deviation for weakly-supervised learning, an extensible deviation perceiving loss has been proposed, aiming at enlarging the deviation of adjacent segments of the forged samples and reducing that of genuine samples. Extensive experiments demonstrate the effectiveness of the proposed framework, which achieves results comparable to fully-supervised approaches in several evaluation metrics.
[151] Dyna3DGR: 4D Cardiac Motion Tracking with Dynamic 3D Gaussian Representation
Xueming Fu, Pei Wu, Yingtai Li, Xin Luo, Zihang Jiang, Junhao Mei, Jian Lu, Gao-Jun Teng, S. Kevin Zhou
Main category: cs.CV
TL;DR: This paper proposes Dyna3DGR, a novel framework combining 3D Gaussian representation with neural motion fields for accurate cardiac motion tracking in 4D CMR imaging, achieving superior performance over existing deep learning registration methods without requiring extensive training data.
Details
Motivation: Accurate cardiac motion analysis is crucial for evaluating cardiac function, but existing methods have significant limitations: image-based methods struggle with topological consistency or require extensive training data, while representation-based methods lose image-level details. The homogeneous nature of myocardial tissue and lack of distinctive features make fine-grained 4D cardiac motion tracking particularly challenging.
Method: The authors propose Dynamic 3D Gaussian Representation (Dyna3DGR), which combines explicit 3D Gaussian representation with implicit neural motion field modeling. The method simultaneously optimizes cardiac structure and motion in a self-supervised manner using differentiable volumetric rendering to bridge continuous motion representation with image-space alignment while preserving topological and temporal consistency.
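A minimal sketch of the explicit-plus-implicit split: explicit Gaussian centers displaced by a time-conditioned MLP motion field, with rendering and the remaining Gaussian parameters elided (all names and sizes are illustrative):

```python
import torch
import torch.nn as nn

class GaussianMotionField(nn.Module):
    """Implicit neural motion field: maps a Gaussian center and a cardiac
    phase t to a displaced center. The displaced Gaussians would then be
    rendered differentiably and compared against the CMR frame."""
    def __init__(self, hidden=128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(4, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3))              # per-Gaussian displacement

    def forward(self, centers, t):
        # centers: (N, 3) explicit Gaussian means; t: phase in [0, 1]
        tt = torch.full_like(centers[:, :1], float(t))
        return centers + self.mlp(torch.cat([centers, tt], dim=-1))
```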
Result: Comprehensive evaluations on the ACDC dataset demonstrate that Dyna3DGR surpasses state-of-the-art deep learning-based diffeomorphic registration methods in tracking accuracy. The method eliminates the need for extensive training data or point-to-point correspondences while maintaining both topological and temporal consistency.
Conclusion: Dyna3DGR successfully addresses the limitations of existing cardiac motion tracking approaches by combining the benefits of both image-based and representation-based methods. The framework achieves superior tracking accuracy without requiring extensive training data, making it a promising solution for accurate 4D cardiac motion analysis in clinical applications.
Abstract: Accurate analysis of cardiac motion is crucial for evaluating cardiac function. While dynamic cardiac magnetic resonance imaging (CMR) can capture detailed tissue motion throughout the cardiac cycle, fine-grained 4D cardiac motion tracking remains challenging due to the homogeneous nature of myocardial tissue and the lack of distinctive features. Existing approaches can be broadly categorized into image-based and representation-based, each with its limitations. Image-based methods, including both traditional and deep learning-based registration approaches, either struggle with topological consistency or rely heavily on extensive training data. Representation-based methods, while promising, often suffer from loss of image-level details. To address these limitations, we propose Dynamic 3D Gaussian Representation (Dyna3DGR), a novel framework that combines explicit 3D Gaussian representation with implicit neural motion field modeling. Our method simultaneously optimizes cardiac structure and motion in a self-supervised manner, eliminating the need for extensive training data or point-to-point correspondences. Through differentiable volumetric rendering, Dyna3DGR efficiently bridges continuous motion representation with image-space alignment while preserving both topological and temporal consistency. Comprehensive evaluations on the ACDC dataset demonstrate that our approach surpasses state-of-the-art deep learning-based diffeomorphic registration methods in tracking accuracy. The code will be available at https://github.com/windrise/Dyna3DGR.
[152] CTSL: Codebook-based Temporal-Spatial Learning for Accurate Non-Contrast Cardiac Risk Prediction Using Cine MRIs
Haoyang Su, Shaohao Rui, Jinyi Xiang, Lianming Wu, Xiaosong Wang
Main category: cs.CV
TL;DR: A self-supervised framework called CTSL learns from raw Cine MRI sequences without segmentation masks to predict Major Adverse Cardiac Events (MACE), eliminating the need for contrast agents and providing a rapid, non-invasive cardiac risk assessment solution.
Details
Motivation: Existing MACE prediction methods require supervised learning with human-refined masks in ventricular myocardium and contrast agents, which becomes impractical in contrast-free scenarios. There is a critical need for accurate, contrast-free cardiac risk assessment that doesn't rely on manual segmentation.
Method: The paper introduces Codebook-based Temporal-Spatial Learning (CTSL), a self-supervised framework that decouples temporal and spatial features through multi-view distillation. The method uses a teacher model processing multiple Cine views and a student model learning from reduced-dimensional Cine-SA sequences, leveraging codebook-based feature representations and dynamic lesion self-detection through motion cues.
Result: CTSL achieves high-confidence MACE risk predictions and outperforms traditional contrast-dependent methods. The framework successfully captures intricate temporal dependencies and motion patterns from raw Cine data without requiring segmentation masks.
Conclusion: The proposed CTSL framework provides a rapid, non-invasive solution for cardiac risk assessment that eliminates the need for contrast agents and manual segmentation, enabling timely and accessible heart disease diagnosis in clinical settings.
Abstract: Accurate and contrast-free Major Adverse Cardiac Events (MACE) prediction from Cine MRI sequences remains a critical challenge. Existing methods typically necessitate supervised learning based on human-refined masks in the ventricular myocardium, which become impractical without contrast agents. We introduce a self-supervised framework, namely Codebook-based Temporal-Spatial Learning (CTSL), that learns dynamic, spatiotemporal representations from raw Cine data without requiring segmentation masks. CTSL decouples temporal and spatial features through a multi-view distillation strategy, where the teacher model processes multiple Cine views, and the student model learns from reduced-dimensional Cine-SA sequences. By leveraging codebook-based feature representations and dynamic lesion self-detection through motion cues, CTSL captures intricate temporal dependencies and motion patterns. High-confidence MACE risk predictions are achieved through our model, providing a rapid, non-invasive solution for cardiac risk assessment that outperforms traditional contrast-dependent methods, thereby enabling timely and accessible heart disease diagnosis in clinical settings.
[153] Automatic Fine-grained Segmentation-assisted Report Generation
Frederic Jonske, Constantin Seibold, Osman Alperen Koras, Fin Bahnsen, Marie Bauer, Amin Dada, Hamza Kalisch, Anton Schily, Jens Kleesiek
Main category: cs.CV
TL;DR: ASaRG enhances LLaVA architecture for clinical report generation by integrating intermediate features and fine-grained segmentation maps from specialist radiological models, achieving significant performance improvements while providing better grounding capabilities for medical professionals.
Details
Motivation: Clinical report generation requires strong general performance and grounding capabilities to convince clinicians and patients of report veracity. Current models lack the reliability and interpretability needed for real-world medical applications, where radiologists' workloads must be reduced without sacrificing trust.
Method: ASaRG extends the LLaVA architecture by fusing intermediate features and fine-grained segmentation maps from specialist radiological models into LLaVA's multi-modal projection layer through simple concatenation. The approach adds minimal parameters while incorporating segmentation information to improve both performance and interpretability.
Result: ASaRG achieves a +0.89% CE F1 score improvement with intermediate features only (p=0.012) and a +2.77% improvement with combined intermediate features and segmentation maps (p<0.001) compared to the LLaVA baseline. It outperforms COMG and ORID by 6.98% and 6.28% in F1 score, respectively. The method enables tracing report elements to corresponding segmentation maps for verification.
Conclusion: ASaRG successfully enhances clinical report generation by integrating segmentation information into LLaVA, providing both performance improvements and better grounding capabilities. The method is compatible with other architectural advances and offers interpretability through segmentation-to-report tracing, making it more suitable for clinical deployment.
Abstract: Reliable end-to-end clinical report generation has been a longstanding goal of medical ML research. The end goal for this process is to alleviate radiologists’ workloads and provide second opinions to clinicians or patients. Thus, a necessary prerequisite for report generation models is a strong general performance and some type of innate grounding capability, to convince clinicians or patients of the veracity of the generated reports. In this paper, we present ASaRG (Automatic Segmentation-assisted Report Generation), an extension of the popular LLaVA architecture that aims to tackle both of these problems. ASaRG proposes to fuse intermediate features and fine-grained segmentation maps created by specialist radiological models into LLaVA’s multi-modal projection layer via simple concatenation. With a small number of added parameters, our approach achieves a +0.89% performance gain (p=0.012) in CE F1 score compared to the LLaVA baseline when using only intermediate features, and a +2.77% performance gain (p<0.001) when adding a combination of intermediate features and fine-grained segmentation maps. Compared with COMG and ORID, two other report generation methods that utilize segmentations, the performance gain amounts to 6.98% and 6.28% in F1 score, respectively. ASaRG is not mutually exclusive with other changes made to the LLaVA architecture, potentially allowing our method to be combined with other advances in the field. Finally, the use of an arbitrary number of segmentations as part of the input demonstrably allows tracing elements of the report to the corresponding segmentation maps and verifying the groundedness of assessments. Our code will be made publicly available at a later date.
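The fusion mechanism is simple enough to sketch: concatenate vision tokens with specialist features and segmentation embeddings per token before projecting into the LLM's embedding space. All dimensions and the two-layer projector below are illustrative assumptions, not ASaRG's exact configuration.

```python
import torch
import torch.nn as nn

class FusedProjector(nn.Module):
    """Concatenation fusion ahead of a LLaVA-style projection: vision
    tokens, specialist intermediate features, and segmentation embeddings
    are joined per token. All dimensions are illustrative assumptions."""
    def __init__(self, d_vis=1024, d_spec=256, d_seg=64, d_llm=4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(d_vis + d_spec + d_seg, d_llm), nn.GELU(),
            nn.Linear(d_llm, d_llm),
        )

    def forward(self, vis_tokens, spec_feats, seg_embeds):
        # all inputs: (B, N_tokens, d_*); fusion is simple concatenation
        fused = torch.cat([vis_tokens, spec_feats, seg_embeds], dim=-1)
        return self.proj(fused)  # (B, N_tokens, d_llm), fed to the LLM

proj = FusedProjector()
tokens = proj(torch.randn(1, 576, 1024), torch.randn(1, 576, 256), torch.randn(1, 576, 64))
```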
[154] A2Mamba: Attention-augmented State Space Models for Visual Recognition
Meng Lou, Yunxiang Fu, Yizhou Yu
Main category: cs.CV
TL;DR: A2Mamba proposes a novel Transformer-Mamba hybrid architecture that deeply integrates attention mechanisms with State Space Models through a Multi-scale Attention-augmented State Space Model (MASS), achieving state-of-the-art performance across visual recognition tasks including ImageNet classification, semantic segmentation, and object detection.
Details
Motivation: Existing methods that combine Transformers and Mamba for visual recognition simply stack layers without interaction mechanisms, limiting their ability to capture both local details and global contexts effectively. There is a need for deeper integration between Transformer and Mamba layers to enhance spatial dependencies and dynamic modeling capabilities.
Method: The paper introduces A2Mamba, featuring a Multi-scale Attention-augmented State Space Model (MASS) token mixer. The key innovation is the Attention-augmented SSM (A2SSM) that performs cross-attention by spatially aggregating SSM hidden states using multi-scale attention maps, enhancing both spatial dependencies in 2D space and dynamic modeling capabilities of SSMs.
Result: A2Mamba achieves superior performance across multiple tasks: A2Mamba-L reaches 86.1% top-1 accuracy on ImageNet-1K, A2Mamba-B exceeds CAFormer-S36 by 2.5% mIoU in semantic segmentation with higher efficiency, and A2Mamba-S surpasses MambaVision-B by 1.2%/0.9% in object detection/instance segmentation while using 40% fewer parameters.
Conclusion: A2Mamba successfully addresses the limitation of simple layer stacking in existing Transformer-Mamba hybrids by proposing a deep integration mechanism that outperforms all previous ConvNet, Transformer, and Mamba-based architectures across various visual recognition tasks, demonstrating the effectiveness of attention-augmented state space models for computer vision.
Abstract: Transformers and Mamba, initially invented for natural language processing, have inspired backbone architectures for visual recognition. Recent studies integrated Local Attention Transformers with Mamba to capture both local details and global contexts. Despite competitive performance, these methods are limited to simple stacking of Transformer and Mamba layers without any interaction mechanism between them. Thus, deep integration between Transformer and Mamba layers remains an open problem. We address this problem by proposing A2Mamba, a powerful Transformer-Mamba hybrid network architecture, featuring a new token mixer termed Multi-scale Attention-augmented State Space Model (MASS), where multi-scale attention maps are integrated into an attention-augmented SSM (A2SSM). A key step of A2SSM performs a variant of cross-attention by spatially aggregating the SSM’s hidden states using the multi-scale attention maps, which enhances spatial dependencies pertaining to a two-dimensional space while improving the dynamic modeling capabilities of SSMs. Our A2Mamba outperforms all previous ConvNet-, Transformer-, and Mamba-based architectures in visual recognition tasks. For instance, A2Mamba-L achieves an impressive 86.1% top-1 accuracy on ImageNet-1K. In semantic segmentation, A2Mamba-B exceeds CAFormer-S36 by 2.5% in mIoU, while exhibiting higher efficiency. In object detection and instance segmentation with Cascade Mask R-CNN, A2Mamba-S surpasses MambaVision-B by 1.2%/0.9% in AP^b/AP^m, while having 40% fewer parameters. Code is publicly available at https://github.com/LMMMEng/A2Mamba.
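At its core, the A2SSM step aggregates SSM hidden states spatially with attention maps, which in the single-scale case reduces to a batched matrix product. A shape-level sketch under that simplification (the real MASS mixer is multi-scale and interleaved with the selective scan):

```python
import torch

def a2ssm_step(hidden_states, attn_maps):
    """Single-scale, shape-level sketch of the A2SSM coupling: SSM hidden
    states are spatially aggregated by precomputed attention maps, a
    cross-attention-like step. The real MASS mixer is multi-scale and
    interleaved with the selective scan."""
    # hidden_states: (B, N, D) from a Mamba-style scan over N tokens
    # attn_maps: (B, N, N), row-normalized attention over the same tokens
    return torch.bmm(attn_maps, hidden_states)  # (B, N, D)

B, N, D = 2, 196, 96
h = torch.randn(B, N, D)
a = torch.softmax(torch.randn(B, N, N), dim=-1)
mixed = a2ssm_step(h, a)
```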
[155] Benchmarking pig detection and tracking under diverse and challenging conditions
Jonathan Henrich, Christian Post, Maximilian Zilke, Parth Shiroya, Emma Chanut, Amir Mollazadeh Yamchi, Ramin Yahyapour, Thomas Kneib, Imke Traulsen
Main category: cs.CV
TL;DR: This paper presents the first systematic benchmarking study for pig detection and tracking in farming environments, introducing two datasets (PigDetect and PigTrack) and comparing various machine learning approaches for automated pig monitoring.
Details
Motivation: Manual monitoring of individual pig behavior is labor-intensive and inefficient. While machine learning advances enable automated monitoring through object detection and multi-object tracking, there has been no systematic benchmarking study to evaluate different approaches for pig farming applications, creating a knowledge gap in the field.
Method: The authors curated two comprehensive datasets: PigDetect for object detection and PigTrack for multi-object tracking, using diverse image and video material from realistic barn conditions with challenging scenarios. They systematically compared state-of-the-art detection models against real-time alternatives, and evaluated SORT-based tracking methods versus end-to-end trainable models.
Result: For object detection, challenging training images significantly improved performance over random sampling, with state-of-the-art models substantially outperforming real-time alternatives. For tracking, SORT-based methods achieved superior detection performance while end-to-end models showed better association performance. Both detection and tracking models demonstrated good generalization to unseen environments.
Conclusion: The study establishes the first systematic benchmark for pig monitoring, demonstrating that high-quality training data is crucial for good generalization. SORT-based tracking currently performs better overall, but end-to-end models show promise for future development. The publicly released datasets and code will facilitate reproducibility and further research in automated pig farming monitoring.
Abstract: To ensure animal welfare and effective management in pig farming, monitoring individual behavior is a crucial prerequisite. While monitoring tasks have traditionally been carried out manually, advances in machine learning have made it possible to collect individualized information in an increasingly automated way. Central to these methods is the localization of animals across space (object detection) and time (multi-object tracking). Despite extensive research on these two tasks in pig farming, a systematic benchmarking study has not yet been conducted. In this work, we address this gap by curating two datasets: PigDetect for object detection and PigTrack for multi-object tracking. The datasets are based on diverse image and video material from realistic barn conditions, and include challenging scenarios such as occlusions or bad visibility. For object detection, we show that challenging training images improve detection performance beyond what is achievable with randomly sampled images alone. Comparing different approaches, we found that state-of-the-art models offer substantial improvements in detection quality over real-time alternatives. For multi-object tracking, we observed that SORT-based methods achieve superior detection performance compared to end-to-end trainable models. However, end-to-end models show better association performance, suggesting they could become strong alternatives in the future. We also investigate characteristic failure cases of end-to-end models, providing guidance for future improvements. The detection and tracking models trained on our datasets perform well in unseen pens, suggesting good generalization capabilities. This highlights the importance of high-quality training data. The datasets and research code are made publicly available to facilitate reproducibility, re-use and further development.
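For readers unfamiliar with the SORT family benchmarked here, its core is Hungarian matching of tracks to detections on an IoU cost matrix. A minimal sketch of that association step (not the paper's code; real SORT additionally runs a Kalman filter to predict track positions before matching):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def iou(a, b):
    """IoU of two axis-aligned boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def associate(tracks, detections, iou_thresh=0.3):
    """SORT-style association: Hungarian matching on a (1 - IoU) cost
    matrix between existing track boxes and new detection boxes."""
    if not tracks or not detections:
        return [], list(range(len(tracks))), list(range(len(detections)))
    cost = np.array([[1.0 - iou(t, d) for d in detections] for t in tracks])
    rows, cols = linear_sum_assignment(cost)
    matches = [(r, c) for r, c in zip(rows, cols) if 1.0 - cost[r, c] >= iou_thresh]
    matched_t = {r for r, _ in matches}
    matched_d = {c for _, c in matches}
    unmatched_t = [i for i in range(len(tracks)) if i not in matched_t]
    unmatched_d = [j for j in range(len(detections)) if j not in matched_d]
    return matches, unmatched_t, unmatched_d
```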
[156] Synthetic Data Matters: Re-training with Geo-typical Synthetic Labels for Building Detection
Shuang Song, Yang Tang, Rongjun Qin
Main category: cs.CV
TL;DR: This paper proposes a novel approach for building segmentation in remote sensing that generates geo-typical synthetic data tailored to target regions and uses adversarial domain adaptation to overcome synthetic-to-real domain gaps, achieving up to 12% performance improvements without requiring extensive real-world annotations.
Details
Motivation: Deep learning models for building segmentation struggle to generalize across diverse geographic regions due to variations in city layouts, building types, sizes, and locations. The time-consuming process of creating annotated data cannot keep pace with the demands of increasingly data-hungry models, creating a need for alternative approaches to improve model generalization without extensive real-world annotations.
Method: The approach involves re-training models at test time using synthetic data that replicates target region characteristics. It leverages geospatial data from OpenStreetMap to generate geo-typical synthetic images through procedural modeling and physics-based rendering, incorporating domain randomization in building shapes, materials, and environmental illumination. The synthetic data is then integrated into an adversarial domain adaptation framework to bridge the synthetic-to-real domain gap.
Result: Experiments demonstrate significant performance enhancements with median improvements of up to 12%, depending on the domain gap. The method successfully generates virtually unlimited training samples while maintaining essential characteristics of target environments and provides a scalable, cost-effective solution for improving building segmentation across different geographic regions.
Conclusion: This work presents a promising solution to the “model collapse” issue in purely synthetic datasets by blending partial geographic knowledge with synthetic imagery. The approach offers a practical pathway to improving generalization in remote sensing building segmentation without requiring extensive real-world annotations, making it both scalable and cost-effective for diverse geographic applications.
Abstract: Deep learning has significantly advanced building segmentation in remote sensing, yet models struggle to generalize across diverse geographic regions due to variations in city layouts and the distribution of building types, sizes and locations. However, the time-consuming production of annotated data capturing worldwide diversity may never catch up with the demands of increasingly data-hungry models. Thus, we propose a novel approach: re-training models at test time using synthetic data tailored to the target region’s city layout. This method generates geo-typical synthetic data that closely replicates the urban structure of a target area by leveraging geospatial data such as street networks from OpenStreetMap. Using procedural modeling and physics-based rendering, very high-resolution synthetic images are created, incorporating domain randomization in building shapes, materials, and environmental illumination. This enables the generation of virtually unlimited training samples that maintain the essential characteristics of the target environment. To overcome synthetic-to-real domain gaps, our approach integrates geo-typical data into an adversarial domain adaptation framework for building segmentation. Experiments demonstrate significant performance enhancements, with median improvements of up to 12%, depending on the domain gap. This scalable and cost-effective method blends partial geographic knowledge with synthetic imagery, providing a promising solution to the “model collapse” issue in purely synthetic datasets. It offers a practical pathway to improving generalization in remote sensing building segmentation without extensive real-world annotations.
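The canonical mechanism behind adversarial domain adaptation is a gradient reversal layer: a domain discriminator learns to separate synthetic from real features, while reversed gradients push the encoder to make the two indistinguishable. A DANN-style sketch under that assumption (the paper's exact adversarial framework may differ):

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Gradient reversal layer, the standard trick behind DANN-style
    adversarial domain adaptation: identity forward, negated gradient
    backward (scaled by lam)."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None

def domain_adversarial_loss(features, domain_labels, discriminator, lam=1.0):
    # features come from synthetic and real batches; the discriminator
    # tells domains apart, the encoder is trained to fool it via reversal
    rev = GradReverse.apply(features, lam)
    logits = discriminator(rev)
    return nn.functional.binary_cross_entropy_with_logits(
        logits.squeeze(-1), domain_labels)

disc = nn.Sequential(nn.Linear(256, 64), nn.ReLU(), nn.Linear(64, 1))
feats = torch.randn(8, 256, requires_grad=True)
labels = torch.cat([torch.zeros(4), torch.ones(4)])  # 0 = synthetic, 1 = real
loss = domain_adversarial_loss(feats, labels, disc)
```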
[157] QRetinex-Net: Quaternion-Valued Retinex Decomposition for Low-Level Computer Vision Applications
Sos Agaian, Vladimir Frants
Main category: cs.CV
TL;DR: This paper introduces Quaternion Retinex, a novel formulation that uses quaternion algebra to decompose images into reflectance and illumination components, addressing key limitations of classical Retinex models and achieving superior performance in low-light computer vision tasks.
Details
Motivation: Classical Retinex models have four major flaws: they process RGB channels independently, lack neuroscientific grounding for color vision, cannot perfectly reconstruct input images, and fail to explain human color constancy. These limitations hurt computer vision accuracy in low-light conditions where images suffer from color shift, low contrast, and noise.
Method: The authors propose the first Quaternion Retinex formulation that represents the scene as a Hamilton product of quaternion-valued reflectance and illumination components. They also introduce the Reflectance Consistency Index to measure how well reflectance remains invariant across different lighting conditions.
Result: Testing on low-light crack inspection, face detection under varied lighting, and infrared-visible fusion demonstrated 2-11% performance gains over leading methods. The approach achieved better color fidelity, lower noise levels, and higher reflectance stability compared to existing techniques.
Conclusion: Quaternion Retinex successfully addresses the fundamental limitations of classical Retinex models by using quaternion algebra for image decomposition. This approach provides a more mathematically sound and biologically plausible method for handling color constancy and illumination invariance in computer vision applications.
Abstract: Images taken in low light often show color shift, low contrast, noise, and other artifacts that hurt computer-vision accuracy. Retinex theory addresses this by viewing an image S as the pixel-wise product of reflectance R and illumination I, mirroring the way people perceive stable object colors under changing light. The decomposition is ill-posed, and classic Retinex models have four key flaws: (i) they treat the red, green, and blue channels independently; (ii) they lack a neuroscientific model of color vision; (iii) they cannot perfectly rebuild the input image; and (iv) they do not explain human color constancy. We introduce the first Quaternion Retinex formulation, in which the scene is written as the Hamilton product of quaternion-valued reflectance and illumination. To gauge how well reflectance stays invariant, we propose the Reflectance Consistency Index. Tests on low-light crack inspection, face detection under varied lighting, and infrared-visible fusion show gains of 2-11 percent over leading methods, with better color fidelity, lower noise, and higher reflectance stability.
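The Hamilton product at the heart of the formulation is easy to state in code. Below, an RGB pixel is encoded as a pure quaternion (0, R, G, B), a standard convention in quaternion image processing that the decomposition S = R ⊗ I builds on; the optimization that actually estimates R and I is not shown.

```python
import numpy as np

def hamilton(q1, q2):
    """Hamilton product of quaternions stored as (..., 4) arrays in
    (w, x, y, z) order; an RGB pixel maps to a pure quaternion
    (0, R, G, B), so the product couples the colour channels instead of
    treating them independently."""
    w1, x1, y1, z1 = np.moveaxis(q1, -1, 0)
    w2, x2, y2, z2 = np.moveaxis(q2, -1, 0)
    return np.stack([
        w1*w2 - x1*x2 - y1*y2 - z1*z2,
        w1*x2 + x1*w2 + y1*z2 - z1*y2,
        w1*y2 - x1*z2 + y1*w2 + z1*x2,
        w1*z2 + x1*y2 - y1*x2 + z1*w2,
    ], axis=-1)

# Sanity check of the composition S = I ⊗ S with identity illumination:
rgb = np.random.rand(4, 4, 3)
S = np.concatenate([np.zeros((4, 4, 1)), rgb], axis=-1)  # pure quaternion image
I = np.concatenate([np.ones((4, 4, 1)), np.zeros((4, 4, 3))], axis=-1)  # identity
recon = hamilton(I, S)
assert np.allclose(recon, S)  # identity illumination reproduces the scene
```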
[158] Enhancing Remote Sensing Vision-Language Models Through MLLM and LLM-Based High-Quality Image-Text Dataset Generation
Yiguo He, Junjie Zhu, Yiying Li, Xiaoyu Zhang, Chunping Qiu, Jun Wang, Qiangjuan Huang, Ke Yang
Main category: cs.CV
TL;DR: This paper introduces MpGI, a two-stage method for generating high-quality captions for remote sensing images, creating the HQRS-IT-210K dataset with 210K images and 1.3M captions, and achieving state-of-the-art performance with HQRS-CLIP using only 4.2% of typical training data.
Details
Motivation: The main challenge in applying Vision-language foundation models to remote sensing imagery is the scarcity of high-quality, large-scale image-text paired training data. Existing datasets have suboptimal quality due to rudimentary caption generation methods, requiring larger training volumes while yielding only modest performance improvements.
Method: The paper proposes MpGI (Multi-Perspective Generation and Integration), a two-stage approach: (1) Generate distinct and detailed descriptions from different perspectives using Rule-MLLM Relay Generation and MLLMs generation methods, (2) Use Large Language Models to integrate these diverse descriptions into comprehensive captions that capture details from multiple perspectives. This creates the HQRS-IT-210K dataset with 210K RS images and 1.3M captions.
Result: The authors fine-tuned CLIP (discriminative) and CoCa (generative) models on their dataset, creating HQRS-CLIP and RS-CoCa. HQRS-CLIP surpassed previous state-of-the-art RS CLIP models in various downstream tasks while using only 4.2% of typical training data. RS-CoCa outperformed other advanced approaches across benchmark datasets and generated captions that rival or exceed manual annotations.
Conclusion: The MpGI method successfully addresses the data quality challenge in remote sensing vision-language models by generating high-quality multi-perspective captions. The resulting models achieve superior performance with significantly less training data, demonstrating the effectiveness of high-quality dataset creation over simply increasing data volume.
Abstract: The application of Vision-language foundation models (VLFMs) to remote sensing (RS) imagery has garnered significant attention due to their superior capability in various downstream tasks. A key challenge lies in the scarcity of high-quality, large-scale, image-text paired training data. Recently, several works introduced extensive image-text datasets for RS and trained their VLFMs. However, due to the rudimentary methods used for generating captions, the quality of datasets is suboptimal, requiring larger volumes of training data, while only yielding modest performance improvements. In this paper, we propose a two-stage method named MpGI (Multi-Perspective Generation and Integration) for generating high-quality text captions for RS images. Firstly, we generate distinct and detailed descriptions from different perspectives using Rule-MLLM (Multimodal Large Language Model) Relay Generation and MLLMs generation methods. Next, we utilize Large Language Models (LLMs) to integrate these diverse descriptions into comprehensive captions, capturing details from multiple perspectives. Finally, we have created the HQRS-IT-210K dataset, including about 210,000 RS images and 1.3 million captions. We fine-tuned two VLFMs using our dataset: CLIP, a discriminative model, and CoCa, an image-to-text generative model. This process resulted in our proposed HQRS-CLIP and RS-CoCa models. Experimental results demonstrate that HQRS-CLIP surpassed the previous SOTA RS CLIP model in various downstream tasks while using only 4.2% of the training data. RS-CoCa outperforms other advanced approaches across benchmark datasets and can generate captions for RS images that rival or even exceed manual annotations. Dataset, pre-trained models, and codes will be released at https://github.com/YiguoHe/HQRS-210K-and-HQRS-CLIP.
[159] Temporally-Constrained Video Reasoning Segmentation and Automated Benchmark Construction
Yiqing Shen, Chenjia Li, Chenxiao Fan, Mathias Unberath
Main category: cs.CV
TL;DR: This paper introduces temporally-constrained video reasoning segmentation, a new task that enables natural language queries to identify objects in videos while considering when those objects become contextually relevant over time, with applications in surgical video analysis.
Details
Motivation: Conventional video segmentation methods are limited to predefined object categories and cannot handle out-of-vocabulary objects or objects referenced implicitly in complex text queries. Existing video reasoning segmentation assumes target objects remain relevant throughout entire videos, which is inadequate for real-world scenarios like surgical procedures where objects appear, disappear, or change relevance dynamically based on temporal context.
Method: The authors propose temporally-constrained video reasoning segmentation that requires models to implicitly infer when target objects become contextually relevant based on text queries incorporating temporal reasoning. They develop an innovative automated benchmark construction method to avoid expensive manual annotation and create the TCVideoRSBenchmark dataset.
Result: The paper presents TCVideoRSBenchmark, a temporally-constrained video reasoning segmentation dataset containing 52 samples using videos from the MVOR dataset. The automated benchmark construction method enables scalable dataset creation without expensive manual annotation.
Conclusion: The work successfully addresses limitations in existing video segmentation by introducing temporal constraints in reasoning segmentation, enabling more flexible and context-aware video analysis suitable for complex real-world scenarios like surgical procedures where object relevance changes dynamically over time.
Abstract: Conventional approaches to video segmentation are confined to predefined object categories and cannot identify out-of-vocabulary objects, let alone objects that are not identified explicitly but only referred to implicitly in complex text queries. This shortcoming limits the utility of video segmentation in complex and variable scenarios, where a closed set of object categories is difficult to define and where users may not know the exact object category that will appear in the video. Such scenarios can arise in operating room video analysis, where different health systems may use different workflows and instrumentation, requiring flexible solutions for video analysis. Reasoning segmentation (RS) now offers promise towards such a solution, enabling natural language text queries as the interaction for identifying objects to segment. However, existing video RS formulations assume that target objects remain contextually relevant throughout entire video sequences. This assumption is inadequate for real-world scenarios in which objects of interest appear, disappear or change relevance dynamically based on temporal context, such as surgical instruments that become relevant only during specific procedural phases or anatomical structures that gain importance at particular moments during surgery. Our first contribution is the introduction of temporally-constrained video reasoning segmentation, a novel task formulation that requires models to implicitly infer when target objects become contextually relevant based on text queries that incorporate temporal reasoning. Since manual annotation of temporally-constrained video RS datasets would be expensive and limit scalability, our second contribution is an innovative automated benchmark construction method. Finally, we present TCVideoRSBenchmark, a temporally-constrained video RS dataset containing 52 samples using the videos from the MVOR dataset.
[160] HarmonPaint: Harmonized Training-Free Diffusion Inpainting
Ying Li, Xinzhe Li, Yong Du, Yangyang Xu, Junyu Dong, Shengfeng He
Main category: cs.CV
TL;DR: HarmonPaint is a training-free inpainting framework that integrates with diffusion model attention mechanisms to achieve high-quality, harmonized image inpainting without requiring retraining or fine-tuning.
Details
Motivation: Existing inpainting methods require extensive retraining or fine-tuning to integrate new content and struggle to maintain coherence in both structure and style between inpainted regions and surrounding background.
Method: The framework leverages masking strategies within self-attention mechanisms of diffusion models to ensure structural fidelity and exploits intrinsic diffusion model properties to transfer style information from unmasked to masked regions.
Result: Extensive experiments demonstrate the effectiveness of HarmonPaint across diverse scenes and styles, validating its versatility and performance in achieving harmonious integration of styles.
Conclusion: HarmonPaint successfully addresses the limitations of existing inpainting methods by providing a training-free solution that maintains both structural and stylistic coherence in image inpainting tasks.
Abstract: Existing inpainting methods often require extensive retraining or fine-tuning to integrate new content seamlessly, yet they struggle to maintain coherence in both structure and style between inpainted regions and the surrounding background. Motivated by these limitations, we introduce HarmonPaint, a training-free inpainting framework that seamlessly integrates with the attention mechanisms of diffusion models to achieve high-quality, harmonized image inpainting without any form of training. By leveraging masking strategies within self-attention, HarmonPaint ensures structural fidelity without model retraining or fine-tuning. Additionally, we exploit intrinsic diffusion model properties to transfer style information from unmasked to masked regions, achieving a harmonious integration of styles. Extensive experiments demonstrate the effectiveness of HarmonPaint across diverse scenes and styles, validating its versatility and performance.
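The masking idea can be sketched as a bias on self-attention logits: queries in the region being inpainted are steered toward keys from the known background. HarmonPaint applies such strategies inside a pretrained diffusion U-Net's attention layers; the snippet below only shows the mechanism in isolation, with illustrative shapes.

```python
import torch
import torch.nn.functional as F

def masked_self_attention(q, k, v, region_mask, scale):
    """Sketch of attention masking for training-free inpainting: keys
    belonging to the masked (unknown) region are suppressed, so every
    query attends to the known background. Shapes are illustrative."""
    # q, k, v: (B, N, D) tokens; region_mask: (B, N) with 1 = background
    logits = q @ k.transpose(-2, -1) * scale          # (B, N, N)
    bias = (1.0 - region_mask).unsqueeze(1) * -1e4    # suppress masked keys
    return F.softmax(logits + bias, dim=-1) @ v

B, N, D = 1, 64, 32
q = k = v = torch.randn(B, N, D)
mask = (torch.rand(B, N) > 0.3).float()              # 1 = known background
out = masked_self_attention(q, k, v, mask, D ** -0.5)
```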
[161] DFR: A Decompose-Fuse-Reconstruct Framework for Multi-Modal Few-Shot Segmentation
Shuai Chen, Fanman Meng, Xiwei Zhang, Haoran Wei, Chenhao Wu, Qingbo Wu, Hongliang Li
Main category: cs.CV
TL;DR: DFR is a novel few-shot segmentation framework that integrates visual, textual, and audio modalities using SAM to overcome limitations of existing single/dual-modal approaches through decomposition, contrastive fusion, and dual-path reconstruction.
Details
Motivation: Existing few-shot segmentation approaches rely on single or dual modalities (visual support samples or textual descriptions), which limits their ability to exploit the rich perceptual information available in real-world multi-modal scenarios.
Method: The DFR framework consists of three key components: 1) Multi-modal Decompose using SAM to extract visual regions, expand textual semantics, and process audio features; 2) Multi-modal Contrastive Fuse employing contrastive learning for cross-modal consistency and dynamic semantic interactions; 3) Dual-path Reconstruct combining tri-modal semantic guidance with multi-modal geometric cues.
Result: Extensive experiments across visual, textual, and audio modalities under both synthetic and real settings show substantial performance improvements over state-of-the-art methods.
Conclusion: The DFR framework successfully addresses the fundamental challenge of multi-modal guidance utilization in few-shot segmentation by systematically integrating three modalities, demonstrating superior performance compared to existing approaches.
Abstract: This paper presents DFR (Decompose, Fuse and Reconstruct), a novel framework that addresses the fundamental challenge of effectively utilizing multi-modal guidance in few-shot segmentation (FSS). While existing approaches primarily rely on visual support samples or textual descriptions, their single or dual-modal paradigms limit exploitation of rich perceptual information available in real-world scenarios. To overcome this limitation, the proposed approach leverages the Segment Anything Model (SAM) to systematically integrate visual, textual, and audio modalities for enhanced semantic understanding. The DFR framework introduces three key innovations: 1) Multi-modal Decompose: a hierarchical decomposition scheme that extracts visual region proposals via SAM, expands textual semantics into fine-grained descriptors, and processes audio features for contextual enrichment; 2) Multi-modal Contrastive Fuse: a fusion strategy employing contrastive learning to maintain consistency across visual, textual, and audio modalities while enabling dynamic semantic interactions between foreground and background features; 3) Dual-path Reconstruct: an adaptive integration mechanism combining semantic guidance from tri-modal fused tokens with geometric cues from multi-modal location priors. Extensive experiments across visual, textual, and audio modalities under both synthetic and real settings demonstrate DFR’s substantial performance improvements over state-of-the-art methods.
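A symmetric InfoNCE loss is the standard objective behind contrastive cross-modal consistency of the kind the Multi-modal Contrastive Fuse component describes. A sketch under that assumption (the paper's exact losses and token design are not specified in this summary):

```python
import torch
import torch.nn.functional as F

def cross_modal_infonce(za, zb, temperature=0.07):
    """Symmetric InfoNCE between two modality embeddings: matching pairs
    along the diagonal are pulled together, mismatched pairs pushed apart.
    A generic contrastive sketch, not DFR's implementation."""
    za, zb = F.normalize(za, dim=-1), F.normalize(zb, dim=-1)
    logits = za @ zb.t() / temperature          # (B, B) similarity matrix
    targets = torch.arange(za.size(0))
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

vis, txt, aud = torch.randn(8, 256), torch.randn(8, 256), torch.randn(8, 256)
loss = cross_modal_infonce(vis, txt) + cross_modal_infonce(vis, aud)
```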
[162] Denoising-While-Completing Network (DWCNet): Robust Point Cloud Completion Under Corruption
Keneni W. Tesema, Lyndon Hill, Mark W. Jones, Gary K. L. Tam
Main category: cs.CV
TL;DR: This paper introduces DWCNet, a robust point cloud completion network that can handle multiple simultaneous degradations (noise and occlusions) in real-world scenarios, along with a new benchmark dataset CPCCD for evaluating robustness.
Details
Motivation: Existing point cloud completion networks trained on synthetic data struggle with real-world degradations like noise and occlusions. There is a need for robust completion methods that can handle multiple simultaneous corruptions in practical applications like autonomous driving and robotics.
Method: The authors propose DWCNet (Denoising-While-Completing Network) with a Noise Management Module (NMM) that uses contrastive learning and self-attention mechanisms to simultaneously suppress noise and model structural relationships for point cloud completion.
Result: DWCNet achieves state-of-the-art performance on both clean and corrupted datasets, as well as synthetic and real-world datasets. The method demonstrates superior robustness compared to existing approaches when tested on the newly introduced CPCCD benchmark.
Conclusion: The paper successfully addresses the robustness gap in point cloud completion by introducing both a comprehensive benchmark (CPCCD) and an effective solution (DWCNet) that can handle multiple degradations simultaneously, making it more suitable for real-world applications.
Abstract: Point cloud completion is crucial for 3D computer vision tasks in autonomous driving, augmented reality, and robotics. However, obtaining clean and complete point clouds from real-world environments is challenging due to noise and occlusions. Consequently, most existing completion networks – trained on synthetic data – struggle with real-world degradations. In this work, we tackle the problem of completing and denoising highly corrupted partial point clouds affected by multiple simultaneous degradations. To benchmark robustness, we introduce the Corrupted Point Cloud Completion Dataset (CPCCD), which highlights the limitations of current methods under diverse corruptions. Building on these insights, we propose DWCNet (Denoising-While-Completing Network), a completion framework enhanced with a Noise Management Module (NMM) that leverages contrastive learning and self-attention to suppress noise and model structural relationships. DWCNet achieves state-of-the-art performance on both clean and corrupted, synthetic and real-world datasets. The dataset and code will be publicly available at https://github.com/keneniwt/DWCNET-Robust-Point-Cloud-Completion-against-Corruptions
[163] CMP: A Composable Meta Prompt for SAM-Based Cross-Domain Few-Shot Segmentation
Shuai Chen, Fanman Meng, Chunjin Yang, Haoran Wei, Chenhao Wu, Qingbo Wu, Hongliang Li
Main category: cs.CV
TL;DR: The paper proposes CMP framework to adapt SAM for Cross-Domain Few-Shot Segmentation by introducing three modules that enable automated prompt generation and cross-domain capability, achieving state-of-the-art performance with 71.8% and 74.5% mIoU in 1-shot and 5-shot scenarios.
Details
Motivation: Cross-Domain Few-Shot Segmentation faces challenges due to limited data and domain shifts. While SAM shows promise for few-shot scenarios with its zero-shot generalization capability, it has critical limitations including reliance on manual prompts and limited cross-domain ability when applied to CD-FSS tasks.
Method: The paper introduces the Composable Meta-Prompt (CMP) framework with three key modules: (1) Reference Complement and Transformation (RCT) module for semantic expansion, (2) Composable Meta-Prompt Generation (CMPG) module for automated meta-prompt synthesis, and (3) Frequency-Aware Interaction (FAI) module for domain discrepancy mitigation.
Result: The CMP framework achieves state-of-the-art performance across four cross-domain datasets, with 71.8% mIoU in 1-shot scenarios and 74.5% mIoU in 5-shot scenarios, demonstrating superior cross-domain few-shot segmentation capabilities.
Conclusion: The proposed CMP framework successfully addresses the key challenges of adapting SAM to Cross-Domain Few-Shot Segmentation by automating prompt generation and enhancing cross-domain transferability, establishing new state-of-the-art performance benchmarks in the field.
Abstract: Cross-Domain Few-Shot Segmentation (CD-FSS) remains challenging due to limited data and domain shifts. Recent foundation models like the Segment Anything Model (SAM) have shown remarkable zero-shot generalization capability in general segmentation tasks, making it a promising solution for few-shot scenarios. However, adapting SAM to CD-FSS faces two critical challenges: reliance on manual prompt and limited cross-domain ability. Therefore, we propose the Composable Meta-Prompt (CMP) framework that introduces three key modules: (i) the Reference Complement and Transformation (RCT) module for semantic expansion, (ii) the Composable Meta-Prompt Generation (CMPG) module for automated meta-prompt synthesis, and (iii) the Frequency-Aware Interaction (FAI) module for domain discrepancy mitigation. Evaluations across four cross-domain datasets demonstrate CMP’s state-of-the-art performance, achieving 71.8% and 74.5% mIoU in 1-shot and 5-shot scenarios respectively.
[164] Faithful, Interpretable Chest X-ray Diagnosis with Anti-Aliased B-cos Networks
Marcel Kleinmann, Shashank Agnihotri, Margret Keuper
Main category: cs.CV
TL;DR: This paper improves B-cos networks for medical imaging by introducing anti-aliasing strategies (FLCPooling and BlurPool) to reduce explanation artifacts and extending the method to multi-label classification, making it suitable for chest X-ray analysis where multiple abnormalities can co-occur.
Details
Motivation: Standard B-cos networks produce inherently interpretable explanations but suffer from severe aliasing artifacts in explanation maps, making them unsuitable for clinical use. Additionally, the original B-cos formulation only supports multi-class settings, while chest X-ray analysis often requires multi-label classification due to co-occurring abnormalities.
Method: The authors introduce two anti-aliasing strategies: FLCPooling (FLC) and BlurPool (BP) to improve explanation quality and reduce artifacts. They also extend the B-cos network architecture to support multi-label classification for medical imaging applications.
Result: The modified B-cos networks (B-cos_FLC and B-cos_BP) preserve strong predictive performance while providing faithful and artifact-free explanations. Experiments on chest X-ray datasets demonstrate the effectiveness of both anti-aliasing approaches in multi-label settings.
Conclusion: The enhanced B-cos networks successfully address the limitations of the original formulation, producing high-quality, interpretable explanations suitable for clinical application in multi-label chest X-ray analysis while maintaining competitive diagnostic performance.
Abstract: Faithfulness and interpretability are essential for deploying deep neural networks (DNNs) in safety-critical domains such as medical imaging. B-cos networks offer a promising solution by replacing standard linear layers with a weight-input alignment mechanism, producing inherently interpretable, class-specific explanations without post-hoc methods. While maintaining diagnostic performance competitive with state-of-the-art DNNs, standard B-cos models suffer from severe aliasing artifacts in their explanation maps, making them unsuitable for clinical use where clarity is essential. Additionally, the original B-cos formulation is limited to multi-class settings, whereas chest X-ray analysis often requires multi-label classification due to co-occurring abnormalities. In this work, we address both limitations: (1) we introduce anti-aliasing strategies using FLCPooling (FLC) and BlurPool (BP) to significantly improve explanation quality, and (2) we extend B-cos networks to support multi-label classification. Our experiments on chest X-ray datasets demonstrate that the modified B-cos_FLC and B-cos_BP preserve strong predictive performance while providing faithful and artifact-free explanations suitable for clinical application in multi-label settings. Code available at: https://github.com/mkleinma/B-cos-medical-paper
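BlurPool (Zhang, 2019) is a published anti-aliasing technique: low-pass filter with a fixed binomial kernel, then stride. A minimal PyTorch version of the BP building block the paper plugs into B-cos networks (the 3x3 kernel and reflect padding are the common choice, an assumption about the paper's exact configuration):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BlurPool2d(nn.Module):
    """Anti-aliased downsampling: depthwise convolution with a fixed,
    normalized binomial kernel applied before striding, so high spatial
    frequencies are attenuated instead of aliased."""
    def __init__(self, channels, stride=2):
        super().__init__()
        k = torch.tensor([1.0, 2.0, 1.0])
        kernel = k[:, None] * k[None, :]
        kernel = kernel / kernel.sum()
        self.register_buffer("kernel", kernel.expand(channels, 1, 3, 3).contiguous())
        self.stride, self.channels = stride, channels

    def forward(self, x):
        x = F.pad(x, (1, 1, 1, 1), mode="reflect")
        return F.conv2d(x, self.kernel, stride=self.stride, groups=self.channels)

pool = BlurPool2d(channels=64)
y = pool(torch.randn(1, 64, 56, 56))   # -> (1, 64, 28, 28), alias-reduced
```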
[165] Task-Specific Zero-shot Quantization-Aware Training for Object Detection
Changhao Li, Xinrui Chen, Ji Wang, Kang Zhao, Jianfei Chen
Main category: cs.CV
TL;DR: This paper proposes a novel zero-shot quantization framework specifically designed for object detection networks that generates task-specific synthetic calibration data and integrates task-specific training with knowledge distillation to achieve state-of-the-art performance without requiring original training data.
Details
Motivation: Traditional quantization methods require access to original training data, which is often restricted due to privacy and security concerns. Existing zero-shot quantization methods for object detection use unlabeled task-agnostic synthetic images that lack specific information needed for object detection, resulting in suboptimal performance.
Method: The framework consists of two main stages: (1) A bounding box and category sampling strategy to synthesize task-specific calibration sets from pre-trained networks, reconstructing object locations, sizes, and category distributions without prior knowledge; (2) Integration of task-specific training into the knowledge distillation process to restore quantized detection network performance.
Result: Extensive experiments on MS-COCO and Pascal VOC datasets demonstrate the efficiency and state-of-the-art performance of the proposed method compared to existing zero-shot quantization approaches for object detection.
Conclusion: The proposed task-specific zero-shot quantization framework successfully addresses the limitations of existing methods by generating appropriate synthetic data and incorporating task-specific training, achieving superior performance in object detection network quantization without requiring original training data.
Abstract: Quantization is a key technique to reduce network size and computational complexity by representing the network parameters with a lower precision. Traditional quantization methods rely on access to original training data, which is often restricted due to privacy concerns or security challenges. Zero-shot Quantization (ZSQ) addresses this by using synthetic data generated from pre-trained models, eliminating the need for real training data. Recently, ZSQ has been extended to object detection. However, existing methods use unlabeled task-agnostic synthetic images that lack the specific information required for object detection, leading to suboptimal performance. In this paper, we propose a novel task-specific ZSQ framework for object detection networks, which consists of two main stages. First, we introduce a bounding box and category sampling strategy to synthesize a task-specific calibration set from the pre-trained network, reconstructing object locations, sizes, and category distributions without any prior knowledge. Second, we integrate task-specific training into the knowledge distillation process to restore the performance of quantized detection networks. Extensive experiments conducted on the MS-COCO and Pascal VOC datasets demonstrate the efficiency and state-of-the-art performance of our method. Our code is publicly available at: https://github.com/DFQ-Dojo/dfq-toolkit.
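The second stage's knowledge distillation can be sketched generically: the quantized student matches the full-precision teacher's softened class logits and box regressions on the synthetic calibration set. The loss below is a standard detector-KD form, an assumption about the shape of the objective rather than the paper's exact formulation:

```python
import torch
import torch.nn.functional as F

def detection_kd_loss(student_cls, teacher_cls, student_box, teacher_box, T=2.0):
    """Generic distillation objective for restoring a quantized detector:
    temperature-softened KL on class logits plus smooth-L1 box regression
    toward the full-precision teacher."""
    kl = F.kl_div(
        F.log_softmax(student_cls / T, dim=-1),
        F.softmax(teacher_cls / T, dim=-1),
        reduction="batchmean",
    ) * T * T
    box = F.smooth_l1_loss(student_box, teacher_box)
    return kl + box

s_cls, t_cls = torch.randn(32, 80), torch.randn(32, 80)  # e.g. 80 COCO classes
s_box, t_box = torch.randn(32, 4), torch.randn(32, 4)
loss = detection_kd_loss(s_cls, t_cls, s_box, t_box)
```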
[166] Enhancing Domain Diversity in Synthetic Data Face Recognition with Dataset Fusion
Anjith George, Sebastien Marcel
Main category: cs.CV
TL;DR: This paper proposes combining two synthetic face datasets from different generator architectures to improve face recognition model performance while addressing privacy concerns associated with web-crawled training data.
Details
Motivation: Face recognition datasets are often collected through web crawling without user consent, raising ethical and privacy concerns. While synthetic data offers a solution, models trained on synthetic data typically underperform compared to real-world data due to generator-specific artifacts and limited diversity from single generator models.
Method: The authors combine two state-of-the-art synthetic face datasets generated using architecturally distinct generator backbones. This fusion approach reduces model-specific artifacts, enhances diversity in pose, lighting, and demographics, and provides implicit regularization by emphasizing identity-relevant features.
Result: Models trained on the combined synthetic dataset achieved superior performance across many standard face recognition benchmarks compared to models trained on individual synthetic datasets.
Conclusion: Combining synthetic face datasets from different generator architectures effectively addresses the limitations of single-generator approaches, providing better performance while maintaining ethical data collection practices for face recognition systems.
Abstract: While the accuracy of face recognition systems has improved significantly in recent years, the datasets used to train these models are often collected through web crawling without the explicit consent of users, raising ethical and privacy concerns. To address this, many recent approaches have explored the use of synthetic data for training face recognition models. However, these models typically underperform compared to those trained on real-world data. A common limitation is that a single generator model is often used to create the entire synthetic dataset, leading to model-specific artifacts that may cause overfitting to the generator’s inherent biases and artifacts. In this work, we propose a solution by combining two state-of-the-art synthetic face datasets generated using architecturally distinct backbones. This fusion reduces model-specific artifacts, enhances diversity in pose, lighting, and demographics, and implicitly regularizes the face recognition model by emphasizing identity-relevant features. We evaluate the performance of models trained on this combined dataset using standard face recognition benchmarks and demonstrate that our approach achieves superior performance across many of these benchmarks.
[167] HOComp: Interaction-Aware Human-Object Composition
Dong Liang, Jinyuan Jia, Yuhao Liu, Rynson W. H. Lau
Main category: cs.CV
TL;DR: HOComp is a novel approach for compositing foreground objects onto human-centric background images while ensuring natural human-object interactions and consistent appearances, using MLLMs-driven pose guidance and detail-consistent appearance preservation techniques.
Details
Motivation: Existing image-guided composition methods struggle to synthesize seamless interaction-aware compositions when involving human-object interactions, often failing to achieve natural blending and harmonious interactions between foreground objects and background humans.
Method: The method includes two key components: (1) MLLMs-driven Region-based Pose Guidance (MRPG) that uses MLLMs to identify interaction regions and types, providing pose constraints with human pose landmarks; and (2) Detail-Consistent Appearance Preservation (DCAP) that combines shape-aware attention modulation, multi-view appearance loss, and background consistency loss to maintain consistent shapes/textures and faithful background reproduction.
Result: HOComp effectively generates harmonious human-object interactions with consistent appearances and outperforms relevant methods both qualitatively and quantitatively on the proposed IHOC dataset. The approach successfully addresses the challenges of interaction-aware composition that existing methods struggle with.
Conclusion: The paper successfully introduces HOComp as an effective solution for human-object interaction composition, demonstrating superior performance through the novel combination of MLLM-driven pose guidance and appearance preservation techniques. The introduction of the IHOC dataset also provides a valuable benchmark for future research in this area.
Abstract: While existing image-guided composition methods may help insert a foreground object onto a user-specified region of a background image, achieving natural blending inside the region with the rest of the image unchanged, we observe that these existing methods often struggle in synthesizing seamless interaction-aware compositions when the task involves human-object interactions. In this paper, we first propose HOComp, a novel approach for compositing a foreground object onto a human-centric background image, while ensuring harmonious interactions between the foreground object and the background person and their consistent appearances. Our approach includes two key designs: (1) MLLMs-driven Region-based Pose Guidance (MRPG), which utilizes MLLMs to identify the interaction region as well as the interaction type (e.g., holding and lifting) to provide coarse-to-fine constraints to the generated pose for the interaction while incorporating human pose landmarks to track action variations and enforcing fine-grained pose constraints; and (2) Detail-Consistent Appearance Preservation (DCAP), which unifies a shape-aware attention modulation mechanism, a multi-view appearance loss, and a background consistency loss to ensure consistent shapes/textures of the foreground and faithful reproduction of the background human. We then propose the first dataset, named Interaction-aware Human-Object Composition (IHOC), for the task. Experimental results on our dataset show that HOComp effectively generates harmonious human-object interactions with consistent appearances, and outperforms relevant methods qualitatively and quantitatively.
[168] ThinkAct: Vision-Language-Action Reasoning via Reinforced Visual Latent Planning
Chi-Pin Huang, Yueh-Hua Wu, Min-Hung Chen, Yu-Chiang Frank Wang, Fu-En Yang
Main category: cs.CV
TL;DR: ThinkAct is a dual-system framework that separates high-level reasoning from low-level action execution in vision-language-action tasks, using a multimodal LLM to generate reasoning plans that are compressed into visual latents to guide action models, enabling better planning and adaptation in embodied AI.
Details
Motivation: Existing VLA models train end-to-end by directly mapping inputs to actions without explicit reasoning, which limits their ability to perform multi-step planning and adapt to complex task variations in dynamic environments.
Method: ThinkAct uses a dual-system approach: (1) trains a multimodal LLM to generate embodied reasoning plans using reinforced visual rewards based on goal completion and trajectory consistency, and (2) compresses these reasoning plans into visual plan latents that condition a downstream action model for execution.
Result: Extensive experiments on embodied reasoning and robot manipulation benchmarks show that ThinkAct achieves few-shot adaptation, long-horizon planning capabilities, and self-correction behaviors in complex embodied AI tasks.
Conclusion: The dual-system framework successfully bridges high-level reasoning with low-level action execution, demonstrating improved performance in complex embodied AI tasks through explicit reasoning and visual latent planning.
Abstract: Vision-language-action (VLA) reasoning tasks require agents to interpret multimodal instructions, perform long-horizon planning, and act adaptively in dynamic environments. Existing approaches typically train VLA models in an end-to-end fashion, directly mapping inputs to actions without explicit reasoning, which hinders their ability to plan over multiple steps or adapt to complex task variations. In this paper, we propose ThinkAct, a dual-system framework that bridges high-level reasoning with low-level action execution via reinforced visual latent planning. ThinkAct trains a multimodal LLM to generate embodied reasoning plans guided by reinforcing action-aligned visual rewards based on goal completion and trajectory consistency. These reasoning plans are compressed into a visual plan latent that conditions a downstream action model for robust action execution on target environments. Extensive experiments on embodied reasoning and robot manipulation benchmarks demonstrate that ThinkAct enables few-shot adaptation, long-horizon planning, and self-correction behaviors in complex embodied AI tasks.
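The dual-system coupling reduces to a simple interface: compress the reasoning tokens into a plan latent, then condition an action head on it alongside the current observation. The sketch below is purely illustrative of that data flow; every module here stands in for far larger components (the reasoning MLLM, the reinforced reward, the action decoder), and all names and sizes are assumptions.

```python
import torch
import torch.nn as nn

class PlanCompressor(nn.Module):
    """Pools the reasoning LLM's output tokens into a compact plan latent
    (illustrative mean-pool + linear; ThinkAct's compressor is not this)."""
    def __init__(self, d_plan=4096, d_latent=256):
        super().__init__()
        self.compress = nn.Linear(d_plan, d_latent)

    def forward(self, reasoning_tokens):                     # (B, T, d_plan)
        return self.compress(reasoning_tokens.mean(dim=1))   # (B, d_latent)

class ActionHead(nn.Module):
    """Action model conditioned on observation features and plan latent."""
    def __init__(self, d_obs=512, d_latent=256, d_act=7):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(d_obs + d_latent, 256), nn.ReLU(),
                                 nn.Linear(256, d_act))

    def forward(self, obs, plan_latent):
        return self.net(torch.cat([obs, plan_latent], dim=-1))

latent = PlanCompressor()(torch.randn(1, 32, 4096))
action = ActionHead()(torch.randn(1, 512), latent)  # e.g. a 7-DoF action
```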
[169] Vision-based Conflict Detection within Crowds based on High-Resolution Human Pose Estimation for Smart and Safe Airport
Karan Kheta, Claire Delgove, Ruolin Liu, Adeola Aderogba, Marc-Olivier Pokam, Muhammed Mehmet Unal, Yang Xing, Weisi Guo
Main category: cs.CV
TL;DR: This paper develops a machine learning model using HRNet for pose segmentation and SVM classification to detect conflicting behavior in airport crowds, achieving 94.37% precision but struggling with ambiguous behaviors like hugging.
Details
Motivation: Future airports are becoming increasingly complex and congested with growing traveler numbers, creating potential conflict hotspots that can cause flight delays and safety issues. There is a need for intelligent algorithms to enhance security surveillance effectiveness in detecting conflicts to improve passenger safety, financial efficiency, and travel experience.
Method: The approach uses HRNet to segment people in each frame, then classifies their poses. Two approaches to pose classification were tested across multiple classifiers, including a support vector machine (SVM), to detect conflicting behavior in crowd scenarios.
Result: The support vector machine (SVM) classifier achieved the best performance with 94.37% precision in detecting conflicting behavior. However, the model showed limitations in handling ambiguous behaviors such as hugging and had difficulty maintaining tracking of subjects within the frame.
Conclusion: The developed model shows potential for airport deployment but requires improvements to handle large numbers of passengers and better training on ambiguous behaviors common in airport settings. With these enhancements, the system could significantly improve security surveillance capabilities and overall airport safety.
Abstract: Future airports are becoming more complex and congested with the increasing number of travellers. Airports are also likely to become hotspots where conflicts break out, which can cause serious flight delays and safety issues. An intelligent algorithm that renders security surveillance more effective in detecting conflicts would bring many benefits to passengers in terms of their safety, finances, and travelling efficiency. This paper details the development of a machine learning model to classify conflicting behaviour in a crowd. HRNet is used to segment the images, and two approaches are then taken to classify the poses of people in the frame via multiple classifiers. Among them, the support vector machine (SVM) was found to be the most performant, achieving a precision of 94.37%. Where the model falls short is against ambiguous behaviour, such as a hug, or when it loses track of a subject in the frame. The resulting model has potential for deployment within an airport if improvements are made to cope with the vast number of potential passengers in view, as well as training against further ambiguous behaviours that will arise in an airport setting. In turn, this will provide the capability to enhance security surveillance and improve airport safety.
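The classification stage is straightforward to reproduce in outline: flatten HRNet keypoints into a feature vector and train an SVM. The feature layout (17 COCO joints with x, y, confidence) and hyperparameters below are illustrative assumptions, not the paper's settings:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Flattened pose features: 17 keypoints x (x, y, confidence) per person,
# with a binary conflict/no-conflict label. Random data stands in for
# real HRNet output here.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 17 * 3))        # 200 pose samples
y = rng.integers(0, 2, size=200)          # 1 = conflicting behaviour

# Standardize features, then fit an RBF-kernel SVM.
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
clf.fit(X, y)
pred = clf.predict(X[:5])                 # conflict predictions for new poses
```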
[170] Make Me Happier: Evoking Emotions Through Image Diffusion Models
Qing Lin, Jingfeng Zhang, Yew-Soon Ong, Mengmi Zhang
Main category: cs.CV
TL;DR: This paper introduces a novel emotional image editing approach using diffusion models that can modify images to evoke target emotions while preserving semantic content and structure, supported by a new 340,000-pair emotion-annotated dataset and psychophysics-based evaluation metrics.
Details
Motivation: Emotional image editing remains under-explored despite rapid progress in image generation. The ability to modify images to evoke specific emotions has valuable applications in psychological treatment, product commercialization, and artistic design, but current methods lack effective emotion-aware editing capabilities.
Method: The authors propose a diffusion model specifically designed for emotion-evoked image generation that can understand and edit source images to convey desired emotions. They create a large-scale dataset of 340,000 image-emotion annotation pairs and develop new evaluation metrics based on human psychophysics experiments to systematically benchmark performance.
Result: Experimental results show that their diffusion model outperforms all competitive baselines. The model successfully identifies emotional cues from original images, edits images to elicit desired emotions, and preserves the semantic structure of the original images throughout the editing process.
Conclusion: The paper successfully addresses the challenge of emotion-evoked image generation by developing an effective diffusion model that balances emotional modification with semantic preservation. The contribution includes both methodological advances and valuable resources (dataset and evaluation metrics) for the research community.
Abstract: Despite the rapid progress in image generation, emotional image editing remains under-explored. The semantics, context, and structure of an image can evoke emotional responses, making emotional image editing techniques valuable for various real-world applications, including treatment of psychological disorders, commercialization of products, and artistic design. First, we present a novel challenge of emotion-evoked image generation, aiming to synthesize images that evoke target emotions while retaining the semantics and structures of the original scenes. To address this challenge, we propose a diffusion model capable of effectively understanding and editing source images to convey desired emotions and sentiments. Moreover, due to the lack of emotion editing datasets, we provide a unique dataset consisting of 340,000 pairs of images and their emotion annotations. Furthermore, we conduct human psychophysics experiments and introduce a new evaluation metric to systematically benchmark all the methods. Experimental results demonstrate that our method surpasses all competitive baselines. Our diffusion model is capable of identifying emotional cues from original images, editing images that elicit desired emotions, and meanwhile, preserving the semantic structure of the original images. All code, models, and the dataset are available on GitHub.
[171] Analysis of the 2024 BraTS Meningioma Radiotherapy Planning Automated Segmentation Challenge
Dominic LaBella, Valeriia Abramova, Mehdi Astaraki, Andre Ferreira, Zhifan Jiang, Mason C. Cleveland, Ramandeep Kang, Uma M. Lal-Trehan Estrada, Cansu Yalcin, Rachika E. Hamadache, Clara Lisazo, Adrià Casamitjana, Joaquim Salvi, Arnau Oliver, Xavier Lladó, Iuliana Toma-Dasu, Tiago Jesus, Behrus Puladi, Jens Kleesiek, Victor Alves, Jan Egger, Daniel Capellán-Martín, Abhijeet Parida, Austin Tapp, Xinyang Liu, Maria J. Ledesma-Carbayo, Jay B. Patel, Thomas N. McNeal, Maya Viera, Owen McCall, Albert E. Kim, Elizabeth R. Gerstner, Christopher P. Bridge, Katherine Schumacher, Michael Mix, Kevin Leu, Shan McBurney-Lin, Pierre Nedelec, Javier Villanueva-Meyer, David R. Raleigh, Jonathan Shapey, Tom Vercauteren, Kazumi Chia, Marina Ivory, Theodore Barfoot, Omar Al-Salihi, Justin Leu, Lia M. Halasz, Yuri S. Velichko, Chunhao Wang, John P. Kirkpatrick, Scott R. Floyd, Zachary J. Reitman, Trey C. Mullikin, Eugene J. Vaios, Christina Huang, Ulas Bagci, Sean Sachdev, Jona A. Hattangadi-Gluth, Tyler M. Seibert, Nikdokht Farid, Connor Puett, Matthew W. Pease, Kevin Shiue, Syed Muhammad Anwar, Shahriar Faghani, Peter Taylor, Pranav Warman, Jake Albrecht, András Jakab, Mana Moassefi, Verena Chung, Rong Chai, Alejandro Aristizabal, Alexandros Karargyris, Hasan Kassem, Sarthak Pati, Micah Sheller, Nazanin Maleki, Rachit Saluja, Florian Kofler, Christopher G. Schwarz, Philipp Lohmann, Phillipp Vollmuth, Louis Gagnon, Maruf Adewole, Hongwei Bran Li, Anahita Fathi Kazerooni, Nourel Hoda Tahon, Udunna Anazodo, Ahmed W. Moawad, Bjoern Menze, Marius George Linguraru, Mariam Aboian, Benedikt Wiestler, Ujjwal Baid, Gian-Marco Conte, Andreas M. Rauschecker, Ayman Nada, Aly H. Abayazeed, Raymond Huang, Maria Correia de Verdier, Jeffrey D. Rudie, Spyridon Bakas, Evan Calabrese
Main category: cs.CV
TL;DR: The BraTS-MEN-RT 2024 challenge used the largest multi-institutional dataset of 750 brain MRIs to advance automated segmentation algorithms for meningioma radiotherapy planning, with six teams competing and achieving best results of 0.815 DSC and 26.92 mm Hausdorff Distance.
Details
Motivation: To advance automated segmentation algorithms for meningioma radiotherapy planning using a comprehensive multi-institutional dataset, aiming to enable precise tumor segmentation and facilitate tailored treatment to ultimately improve patient outcomes.
Method: The challenge used the largest known multi-institutional dataset of 750 radiotherapy planning brain MRIs with expert-annotated target labels for meningioma patients. Each case included defaced 3D post-contrast T1-weighted MRIs with single-label target volume annotations representing gross tumor volume and at-risk post-operative sites. Six teams developed containerized automated segmentation models evaluated using modified lesion-wise Dice Similarity Coefficient and 95% Hausdorff Distance.
Result: The best performing team achieved an average lesion-wise Dice Similarity Coefficient of 0.815 and 95% Hausdorff Distance of 26.92 mm. Six participating teams successfully developed and evaluated automated segmentation models using the comprehensive dataset.
Conclusion: BraTS-MEN-RT is expected to significantly advance automated radiotherapy planning by enabling precise tumor segmentation and facilitating tailored treatment for meningioma patients, ultimately improving patient outcomes through better automated segmentation algorithms.
Abstract: The 2024 Brain Tumor Segmentation Meningioma Radiotherapy (BraTS-MEN-RT) challenge aimed to advance automated segmentation algorithms using the largest known multi-institutional dataset of 750 radiotherapy planning brain MRIs with expert-annotated target labels for patients with intact or postoperative meningioma that underwent either conventional external beam radiotherapy or stereotactic radiosurgery. Each case included a defaced 3D post-contrast T1-weighted radiotherapy planning MRI in its native acquisition space, accompanied by a single-label “target volume” representing the gross tumor volume (GTV) and any at-risk post-operative site. Target volume annotations adhered to established radiotherapy planning protocols, ensuring consistency across cases and institutions, and were approved by expert neuroradiologists and radiation oncologists. Six participating teams developed, containerized, and evaluated automated segmentation models using this comprehensive dataset. Team rankings were assessed using a modified lesion-wise Dice Similarity Coefficient (DSC) and 95% Hausdorff Distance (95HD). The best reported average lesion-wise DSC and 95HD were 0.815 and 26.92 mm, respectively. BraTS-MEN-RT is expected to significantly advance automated radiotherapy planning by enabling precise tumor segmentation and facilitating tailored treatment, ultimately improving patient outcomes. We describe the design and results from the BraTS-MEN-RT challenge.
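Both ranking metrics are standard. The sketch below computes a plain Dice Similarity Coefficient and one common 95% Hausdorff Distance variant for a single pair of binary masks using NumPy and SciPy; the challenge's modified lesion-wise versions add lesion matching and aggregation logic not reproduced here.

```python
import numpy as np
from scipy.spatial.distance import cdist

def dice(pred: np.ndarray, gt: np.ndarray) -> float:
    """Dice Similarity Coefficient between two binary masks."""
    inter = np.logical_and(pred, gt).sum()
    return 2.0 * inter / (pred.sum() + gt.sum())

def hd95(pred: np.ndarray, gt: np.ndarray) -> float:
    """95th-percentile symmetric Hausdorff distance (voxel units), one common variant."""
    p = np.argwhere(pred)  # coordinates of foreground voxels
    g = np.argwhere(gt)
    d = cdist(p, g)        # pairwise distances (fine for small masks)
    # Directed distances: each point to its nearest point in the other set.
    d_pg = d.min(axis=1)
    d_gp = d.min(axis=0)
    return np.percentile(np.concatenate([d_pg, d_gp]), 95)

pred = np.zeros((32, 32, 32), bool); pred[10:20, 10:20, 10:20] = True
gt   = np.zeros((32, 32, 32), bool); gt[12:22, 10:20, 10:20]  = True
print(dice(pred, gt), hd95(pred, gt))
```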
[172] Rethinking Data Input for Point Cloud Upsampling
Tongxu Zhang
Main category: cs.CV
TL;DR: This paper investigates the differences between patch-based and whole model inputs for point cloud upsampling, proposing an Average Segment input approach and finding that patch-based methods consistently outperform whole model inputs on PU1K and ABC datasets.
Details
Motivation: Existing point cloud upsampling methods rely on patch-based inputs, but there is no research discussing the differences and principles between point cloud model full input and patch-based input approaches, creating a knowledge gap in understanding optimal input strategies.
Method: The authors propose a novel approach using whole model inputs called “Average Segment input” and conduct comparative experiments to analyze the differences between patch-based and whole model input strategies for point cloud upsampling.
Result: Experiments on PU1K and ABC datasets demonstrate that patch-based inputs consistently outperform whole model inputs in point cloud upsampling tasks, revealing the superiority of localized processing approaches.
Conclusion: The study concludes that patch-based inputs are more effective than whole model inputs for point cloud upsampling, and the authors analyze factors in feature extraction and network architecture that influence upsampling performance to understand this phenomenon.
Abstract: Point cloud upsampling is crucial for tasks like 3D reconstruction. Existing methods rely on patch-based inputs, yet no research has discussed the differences and principles between full-model input and patch-based input for point clouds. We therefore propose a novel approach using whole-model inputs, i.e., Average Segment input. Our experiments on the PU1K and ABC datasets reveal that patch-based inputs consistently outperform whole-model inputs. To understand this, we delve into the factors in feature extraction and network architecture that influence upsampling results.
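Patch-based input, the strategy the paper finds superior, is commonly implemented by seeding patch centers with farthest point sampling (FPS) and gathering the k nearest neighbours around each seed. The NumPy sketch below shows that common recipe, not the paper's exact procedure.

```python
import numpy as np

def farthest_point_sample(pts: np.ndarray, m: int) -> np.ndarray:
    """Greedy FPS: pick m well-spread seed indices from an (N, 3) cloud."""
    n = pts.shape[0]
    idx = np.zeros(m, dtype=int)
    dist = np.full(n, np.inf)
    idx[0] = 0
    for i in range(1, m):
        dist = np.minimum(dist, np.linalg.norm(pts - pts[idx[i - 1]], axis=1))
        idx[i] = int(dist.argmax())
    return idx

def extract_patches(pts: np.ndarray, n_patches: int = 8, k: int = 256):
    """Split a cloud into overlapping kNN patches around FPS seeds."""
    seeds = farthest_point_sample(pts, n_patches)
    patches = []
    for s in seeds:
        d = np.linalg.norm(pts - pts[s], axis=1)
        patches.append(pts[np.argsort(d)[:k]])  # k nearest points to the seed
    return patches

cloud = np.random.rand(2048, 3)
print([p.shape for p in extract_patches(cloud)])  # 8 patches of shape (256, 3)
```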
[173] FBSDiff: Plug-and-Play Frequency Band Substitution of Diffusion Features for Highly Controllable Text-Driven Image Translation
Xiang Gao, Jiaying Liu
Main category: cs.CV
TL;DR: This paper presents FBSDiff, a plug-and-play method that adapts pre-trained text-to-image diffusion models for image-to-image translation by using frequency band substitution in DCT spectral space, enabling controllable image editing without additional training.
Details
Motivation: Large-scale text-to-image diffusion models lack controllability for practical content creation applications. While these models can generate high-quality images from text prompts, users need better control over the generation process using reference images for real-world image editing and manipulation tasks.
Method: The authors propose a frequency band substitution approach that decomposes guiding factors using different frequency bands of diffusion features in the DCT spectral space. They introduce a novel frequency band substitution layer that enables dynamic control of reference image influence on text-to-image generation in a plug-and-play manner, without requiring model training, fine-tuning, or online optimization.
Result: The method demonstrates flexible control over both guiding factors and guiding intensity by adjusting the type and bandwidth of substituted frequency bands. Extensive qualitative and quantitative experiments show superior performance compared to related methods in terms of image-to-image translation visual quality, versatility, and controllability.
Conclusion: FBSDiff successfully enables high-quality, versatile text-driven image-to-image translation by leveraging frequency domain decomposition for controllable image generation. The plug-and-play nature makes it practical for real-world applications without computational overhead of additional training or optimization processes.
Abstract: Large-scale text-to-image diffusion models have been a revolutionary milestone in the evolution of generative AI and multimodal technology, allowing wonderful image generation with natural-language text prompt. However, the issue of lacking controllability of such models restricts their practical applicability for real-life content creation. Thus, attention has been focused on leveraging a reference image to control text-to-image synthesis, which is also regarded as manipulating (or editing) a reference image as per a text prompt, namely, text-driven image-to-image translation. This paper contributes a novel, concise, and efficient approach that adapts pre-trained large-scale text-to-image (T2I) diffusion model to the image-to-image (I2I) paradigm in a plug-and-play manner, realizing high-quality and versatile text-driven I2I translation without any model training, model fine-tuning, or online optimization process. To guide T2I generation with a reference image, we propose to decompose diverse guiding factors with different frequency bands of diffusion features in the DCT spectral space, and accordingly devise a novel frequency band substitution layer which realizes dynamic control of the reference image to the T2I generation result in a plug-and-play manner. We demonstrate that our method allows flexible control over both guiding factor and guiding intensity of the reference image simply by tuning the type and bandwidth of the substituted frequency band, respectively. Extensive qualitative and quantitative experiments verify superiority of our approach over related methods in I2I translation visual quality, versatility, and controllability. The code is publicly available at: https://github.com/XiangGao1102/FBSDiff.
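The core operation, swapping a frequency band of one map for the corresponding band of another in DCT space, is compact to express. The sketch below applies a 2D DCT, substitutes the low-frequency square of a reference map into the target, and inverts the transform. FBSDiff performs this on diffusion features along the sampling trajectory with a tunable band type and bandwidth, so treat this as a schematic of the band-substitution idea only.

```python
import numpy as np
from scipy.fft import dctn, idctn

def band_substitute(target: np.ndarray, reference: np.ndarray, band: int) -> np.ndarray:
    """Replace the low-frequency DCT band of `target` with that of `reference`.

    `band` is the side length of the substituted low-frequency square; varying
    it controls how much of the reference's coarse layout is transplanted.
    """
    T = dctn(target, norm="ortho")
    R = dctn(reference, norm="ortho")
    T[:band, :band] = R[:band, :band]      # substitute the low-frequency band
    return idctn(T, norm="ortho")

tgt = np.random.rand(64, 64)   # stand-in for a diffusion feature map
ref = np.random.rand(64, 64)   # stand-in for the reference image's features
out = band_substitute(tgt, ref, band=8)
print(out.shape)
```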
[174] Sparrow: Data-Efficient Video-LLM with Text-to-Image Augmentation
Shukang Yin, Chaoyou Fu, Sirui Zhao, Chunjiang Ge, Yan Yang, Yuhan Dai, Yongdong Luo, Tong Xu, Caifeng Shan, Enhong Chen
Main category: cs.CV
TL;DR: This paper proposes Sparrow, a data augmentation method that synthesizes video-like samples from text instructions to improve the training efficiency of video-LLMs, achieving comparable or better performance with fewer training samples.
Details
Motivation: The authors observed low learning efficiency when scaling up video data for training video-LLMs, which they attributed to lack of instruction diversity in existing automatic data pipelines. They aim to develop more efficient training methods for video-LLMs from a data-centric perspective.
Method: The authors propose Sparrow, a data augmentation method that synthesizes video-like samples from pure text instruction data. They fine-tune pre-trained image-LLMs with a mixture of real video data and these synthetic text-based samples to create a more efficient training scheme.
Result: Sparrow achieves performance comparable to or superior to baselines trained with significantly more video samples. Additionally, incorporating synthetic samples enhances long video understanding performance without requiring training on actual long video data.
Conclusion: The study demonstrates that synthetic text-based data augmentation can significantly improve training efficiency for video-LLMs by addressing instruction diversity issues, offering a more efficient alternative to simply scaling up video data volumes.
Abstract: Recent years have seen the success of Multimodal Large Language Models (MLLMs) in the domain of vision understanding. The success of these models can largely be attributed to the dominant scaling law, which states that larger parameter sizes and data volumes contribute to better performance. Notably, data scaling has been primarily driven by automatic data pipelines, which focus on the self-instruction of LLMs. The paradigm has been taken for granted for quite some time, but the effectiveness of scaling with such data has long gone unexamined. In this context, this work revisits scaling with synthetic data and focuses on developing video-LLMs from a data-centric perspective. Our primary study approach involves fine-tuning pre-trained image-LLMs with video data and examining learning efficiency through data scaling. Results from our preliminary experiments reveal a low learning efficiency phenomenon when simply scaling up video data samples, which, through our probing, can be ascribed to a lack of instruction diversity. Aiming at this issue, we propose a data augmentation method called Sparrow, which synthesizes video-like samples from pure text instruction data. Mixing these synthetic samples with the video data enables a more efficient training scheme. Through comprehensive experiments, we demonstrate that our proposed method achieves performance comparable to or even superior to that of baselines trained with significantly more samples. Meanwhile, we find that incorporating these synthetic samples can enhance the performance of long video understanding without requiring training on long video data. The code and data examples are available at https://github.com/VITA-MLLM/Sparrow.
[175] V-RoAst: Visual Road Assessment. Can VLM be a Road Safety Assessor Using the iRAP Standard?
Natchapon Jongwiriyanurak, Zichao Zeng, June Moh Goo, Xinglei Wang, Ilya Ilyankou, Kerkritt Sriroongvikrai, Nicola Christie, Meihui Wang, Huanfa Chen, James Haworth
Main category: cs.CV
TL;DR: This paper introduces V-RoAst, a zero-shot Visual Question Answering framework using Vision-Language Models to assess road safety attributes without requiring training data, addressing the costly road safety assessment problem in Low- and Middle-Income Countries.
Details
Motivation: Road safety assessments are critical but expensive, particularly in LMICs where most roads remain unrated. Traditional methods require expert annotation and training data, while supervised learning approaches have poor cross-regional generalization. There's a need for automated, low-cost road safety assessment tools that can work without extensive training data.
Method: The authors develop V-RoAst, a zero-shot Visual Question Answering framework that leverages Vision-Language Models (VLMs) to classify road safety attributes according to iRAP standards. They create the first open-source ThaiRAP dataset with over 2,000 annotated street-level images from Thailand and evaluate Gemini-1.5-flash and GPT-4o-mini against VGGNet and ResNet baselines.
Result: VLMs showed limitations in spatial awareness but demonstrated good generalization to unseen classes and provided flexible prompt-based reasoning without requiring retraining. The results indicate that VLMs can function as automatic road assessment tools when combined with complementary data sources.
Conclusion: VLMs can serve as effective tools for automatic road safety assessment when integrated with additional data. This represents the first exploration of VLMs for zero-shot infrastructure risk assessment and establishes new directions for automated, cost-effective road safety mapping in resource-constrained environments.
Abstract: Road safety assessments are critical yet costly, especially in Low- and Middle-Income Countries (LMICs), where most roads remain unrated. Traditional methods require expert annotation and training data, while supervised learning-based approaches struggle to generalise across regions. In this paper, we introduce \textit{V-RoAst}, a zero-shot Visual Question Answering (VQA) framework using Vision-Language Models (VLMs) to classify road safety attributes defined by the iRAP standard. We introduce the first open-source dataset from ThaiRAP, consisting of over 2,000 curated street-level images from Thailand annotated for this task. We evaluate Gemini-1.5-flash and GPT-4o-mini on this dataset and benchmark their performance against VGGNet and ResNet baselines. While VLMs underperform on spatial awareness, they generalise well to unseen classes and offer flexible prompt-based reasoning without retraining. Our results show that VLMs can serve as automatic road assessment tools when integrated with complementary data. This work is the first to explore VLMs for zero-shot infrastructure risk assessment and opens new directions for automatic, low-cost road safety mapping. Code and dataset: https://github.com/PongNJ/V-RoAst.
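Zero-shot road-attribute assessment with a VLM is largely prompt design. The sketch below builds an iRAP-style single-attribute question; the attribute names, answer options, and the `ask_vlm` placeholder are illustrative assumptions, not the authors' prompts or API.

```python
# Hypothetical prompt construction for zero-shot road-attribute VQA.
# Attribute names and options below are illustrative, not the iRAP coding manual.
ATTRIBUTES = {
    "speed_limit": ["<50 km/h", "50-79 km/h", ">=80 km/h"],
    "sidewalk": ["none", "one side", "both sides"],
}

def build_prompt(attribute: str) -> str:
    options = ", ".join(ATTRIBUTES[attribute])
    return (
        "You are a road safety assessor following the iRAP standard. "
        f"Looking at this street-level image, classify the attribute "
        f"'{attribute}'. Answer with exactly one of: {options}."
    )

def ask_vlm(image_path: str, prompt: str) -> str:
    """Placeholder: wire in Gemini-1.5-flash, GPT-4o-mini, or any VLM client."""
    raise NotImplementedError

print(build_prompt("sidewalk"))  # sent alongside the image to the chosen VLM
```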
[176] VitaGlyph: Vitalizing Artistic Typography with Flexible Dual-branch Diffusion Models
Kailai Feng, Yabo Zhang, Haodong Yu, Zhilong Ji, Jinfeng Bai, Hongzhi Zhang, Wangmeng Zuo
Main category: cs.CV
TL;DR: VitaGlyph is a training-free method for artistic typography that treats characters as scenes with Subject and Surrounding components, using a three-phase framework to generate creative yet readable typographic art while maintaining controllable geometry changes.
Details
Motivation: Existing artistic typography methods using text-to-image diffusion models struggle to balance creativity and legibility when designing character geometry and texture, making it challenging to create both imaginative and readable typographic art.
Method: A dual-branch, three-phase framework: (1) Knowledge Acquisition uses large language models to create text descriptions for subject and surrounding, (2) Regional Interpretation detects matching parts and refines structure via Semantic Typography, and (3) Attentional Compositional Generation separately renders textures and blends them using attention mechanisms.
Result: VitaGlyph achieves better artistry and readability compared to existing methods, successfully depicts multiple customized concepts, and enables more creative and pleasing artistic typography generation while maintaining character legibility.
Conclusion: The paper presents VitaGlyph as an effective solution for artistic typography that successfully balances creativity and readability by treating characters as compositional scenes, demonstrating superior performance in generating flexible, controllable, and visually appealing typographic art.
Abstract: Artistic typography is a technique to visualize the meaning of an input character in an imaginable and readable manner. With powerful text-to-image diffusion models, existing methods directly design the overall geometry and texture of the input character, making it challenging to ensure both creativity and legibility. In this paper, we introduce a dual-branch, training-free method called VitaGlyph, enabling flexible artistic typography with controllable geometry changes while maintaining readability. The key insight of VitaGlyph is to treat the input character as a scene composed of a Subject and its Surrounding, which are rendered with varying degrees of geometric transformation. To enhance the visual appeal and creativity of the generated artistic typography, the subject flexibly expresses the essential concept of the input character, while the surrounding enriches relevant background without altering the shape, thus maintaining overall readability. Specifically, we implement VitaGlyph through a three-phase framework: (i) Knowledge Acquisition leverages large language models to design text descriptions for the subject and surrounding. (ii) Regional Interpretation detects the part that most closely matches the subject description and refines the structure via Semantic Typography. (iii) Attentional Compositional Generation separately renders the textures of the Subject and Surrounding regions and blends them in an attention-based manner. Experimental results demonstrate that VitaGlyph not only achieves better artistry and readability but also manages to depict multiple customized concepts, facilitating more creative and pleasing artistic typography generation. Our code will be made publicly available.
[177] Aligning AI with Public Values: Deliberation and Decision-Making for Governing Multimodal LLMs in Political Video Analysis
Tanusree Sharma, Yujin Potter, Zachary Kilhoffer, Yun Huang, Dawn Song, Yang Wang
Main category: cs.CV
TL;DR: This paper explores how to govern AI models on political topics by comparing expert journalists’ interpretations with public deliberation through a democratic platform, finding that experts focus on emotion/narrative while the public prioritizes factual clarity, and that different voting mechanisms significantly influence AI governance outcomes.
Details
Motivation: AI models handling political content present governance challenges that require better frameworks. There's a need to understand how different stakeholders interpret politically sensitive content and how various democratic mechanisms can shape AI behavior decisions to ensure broader public engagement in AI development.
Method: Two-step study: (1) interviews with 10 journalists to establish expert baseline for video interpretation, and (2) deliberation study with 114 individuals using InclusiveAI platform with DAO mechanisms, comparing different governance approaches including quadratic vs. weighted voting and equal vs. 20/80 voting power distributions.
Result: Experts (journalists) emphasized emotion and narrative in video interpretation, while the general public prioritized factual clarity, objectivity, and emotional neutrality. Different voting mechanisms significantly influenced decision-making outcomes, with quadratic voting reinforcing perceptions of liberal democracy and political equality compared to other voting methods.
Conclusion: Appropriate governance mechanism selection is crucial for capturing diverse user perspectives in AI development. Decentralized AI governance through democratic deliberation platforms can facilitate broader public engagement and ensure that varied stakeholder perspectives meaningfully inform AI design decisions for handling political content.
Abstract: How AI models should deal with political topics has been discussed, but it remains challenging and requires better governance. This paper examines the governance of large language models through individual and collective deliberation, focusing on politically sensitive videos. We conducted a two-step study: interviews with 10 journalists established a baseline understanding of expert video interpretation, and 114 individuals then deliberated using InclusiveAI, a platform that facilitates democratic decision-making through decentralized autonomous organization (DAO) mechanisms. Our findings reveal distinct differences in interpretative priorities: while experts emphasized emotion and narrative, the general public prioritized factual clarity, objectivity, and emotional neutrality. Furthermore, we examined how different governance mechanisms - quadratic vs. weighted voting and equal vs. 20/80 voting power - shape users’ decision-making regarding AI behavior. Results indicate that voting methods significantly influence outcomes, with quadratic voting reinforcing perceptions of liberal democracy and political equality. Our study underscores the necessity of selecting appropriate governance mechanisms to better capture user perspectives and suggests decentralized AI governance as a potential way to facilitate broader public engagement in AI development, ensuring that varied perspectives meaningfully inform design decisions.
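Quadratic voting, one of the mechanisms compared, gives each participant a credit budget and charges the square of the votes cast on an option, making strong preferences progressively more expensive. A minimal sketch of that cost rule, with budgets and ballots invented for illustration:

```python
# Minimal quadratic-voting tally: casting v votes on an option costs v**2 credits.
def qv_tally(ballots: list[dict[str, int]], budget: int = 100) -> dict[str, int]:
    totals: dict[str, int] = {}
    for ballot in ballots:
        cost = sum(v * v for v in ballot.values())
        if cost > budget:
            raise ValueError(f"ballot spends {cost} credits, budget is {budget}")
        for option, votes in ballot.items():
            totals[option] = totals.get(option, 0) + votes
    return totals

# Two illustrative voters: intensity is expensive (9 votes would cost 81 credits).
ballots = [
    {"flag_video": 6, "allow_video": 0},   # 36 credits
    {"flag_video": 2, "allow_video": 7},   # 4 + 49 = 53 credits
]
print(qv_tally(ballots))  # {'flag_video': 8, 'allow_video': 7}
```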
[178] Prompt4Trust: A Reinforcement Learning Prompt Augmentation Framework for Clinically-Aligned Confidence Calibration in Multimodal Large Language Models
Anita Kriz, Elizabeth Laura Janes, Xing Shen, Tal Arbel
Main category: cs.CV
TL;DR: This paper introduces Prompt4Trust, a reinforcement learning framework that uses a lightweight LLM to generate context-aware prompts for improving confidence calibration in multimodal large language models (MLLMs) for healthcare applications, achieving better alignment between model confidence and actual accuracy while also improving task performance.
Details
Motivation: Multimodal large language models show promise in healthcare but face critical limitations: sensitivity to prompt design and tendency to generate incorrect responses with high confidence. Since clinicians rely on model confidence to assess prediction reliability, it's crucial that high confidence correlates with high accuracy for safe clinical decision-making.
Method: The authors develop Prompt4Trust, a reinforcement learning framework that trains a lightweight LLM to produce context-aware auxiliary prompts. These prompts guide downstream task MLLMs to generate responses where expressed confidence better reflects predictive accuracy, specifically prioritizing calibration aspects critical for clinical safety.
Result: The method achieves state-of-the-art performance on the PMC-VQA medical visual question answering benchmark across diverse medical imaging modalities. Additionally, the framework shows promising zero-shot generalization from smaller to larger MLLMs, suggesting scalable calibration without proportional computational costs.
Conclusion: Prompt4Trust demonstrates the potential of automated, human-aligned prompt engineering for improving MLLM trustworthiness in safety-critical healthcare settings. The framework successfully addresses both confidence calibration and task accuracy, offering a promising approach for deploying MLLMs more safely in clinical environments.
Abstract: Multimodal large language models (MLLMs) hold considerable promise for applications in healthcare. However, their deployment in safety-critical settings is hindered by two key limitations: (i) sensitivity to prompt design, and (ii) a tendency to generate incorrect responses with high confidence. As clinicians may rely on a model’s stated confidence to gauge the reliability of its predictions, it is especially important that when a model expresses high confidence, it is also highly accurate. We introduce Prompt4Trust, the first reinforcement learning (RL) framework for prompt augmentation targeting confidence calibration in MLLMs. A lightweight LLM is trained to produce context-aware auxiliary prompts that guide a downstream task MLLM to generate responses in which the expressed confidence more accurately reflects predictive accuracy. Unlike conventional calibration techniques, Prompt4Trust specifically prioritizes aspects of calibration most critical for safe and trustworthy clinical decision-making. Beyond improvements driven by this clinically motivated calibration objective, our proposed method also improves task accuracy, achieving state-of-the-art medical visual question answering (VQA) performance on the PMC-VQA benchmark, which is composed of multiple-choice questions spanning diverse medical imaging modalities. Moreover, our framework trained with a small downstream task MLLM showed promising zero-shot generalization to larger MLLMs in our experiments, suggesting the potential for scalable calibration without the associated computational costs. This work demonstrates the potential of automated yet human-aligned prompt engineering for improving the trustworthiness of MLLMs in safety-critical settings. Our codebase can be found at https://github.com/xingbpshen/prompt4trust.
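Confidence calibration of this kind is commonly quantified with the Expected Calibration Error (ECE): predictions are binned by stated confidence and each bin's average confidence is compared with its accuracy. The sketch below computes standard ECE; it is not the paper's clinically weighted objective.

```python
import numpy as np

def expected_calibration_error(conf: np.ndarray, correct: np.ndarray, n_bins: int = 10) -> float:
    """ECE: sample-weighted mean gap between confidence and accuracy per bin."""
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (conf > lo) & (conf <= hi)
        if mask.any():
            gap = abs(conf[mask].mean() - correct[mask].mean())
            ece += mask.mean() * gap   # bin weight = fraction of samples in bin
    return ece

conf = np.array([0.9, 0.8, 0.95, 0.6, 0.7])       # model-stated confidences
correct = np.array([1, 1, 0, 1, 0], dtype=float)  # whether each answer was right
print(expected_calibration_error(conf, correct))
```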
[179] MC-Bench: A Benchmark for Multi-Context Visual Grounding in the Era of MLLMs
Yunqiu Xu, Linchao Zhu, Yi Yang
Main category: cs.CV
TL;DR: This paper introduces multi-context visual grounding, a new task for MLLMs to localize instances across multiple images using open-ended text prompts, and presents MC-Bench dataset with 2K annotated samples to evaluate this capability.
Details
Motivation: Current MLLMs have shown strong vision-language understanding but their abilities to solve instance-level visual-language problems beyond single images remain underexplored, warranting investigation into multi-image context understanding capabilities.
Method: The authors propose the multi-context visual grounding task and construct the MC-Bench dataset with 2K manually annotated image pairs and open-ended text prompts covering 20 practical skills in three distinct styles. They benchmark 20+ state-of-the-art MLLMs and develop agentic and fine-tuned baselines.
Result: Evaluation reveals a significant performance gap between existing MLLMs and humans on multi-context visual grounding tasks, with the benchmarking providing insightful observations about current model limitations and capabilities.
Conclusion: The study identifies substantial room for improvement in MLLMs’ multi-image instance-level understanding capabilities and provides MC-Bench as a valuable resource to encourage further research in advancing MLLMs’ untapped potential for multi-context visual tasks.
Abstract: While multimodal large language models (MLLMs) have demonstrated extraordinary vision-language understanding capabilities, their abilities to solve instance-level visual-language problems beyond a single image warrant further exploration. To assess these unproven abilities of MLLMs, this paper proposes a new visual grounding task called multi-context visual grounding, which aims to localize instances of interest across multiple images based on open-ended text prompts. In order to facilitate this research, we construct a new dataset MC-Bench that features 2K high-quality and manually annotated samples. Each sample consists of an instance-level labeled image pair and a corresponding text prompt that indicates the target instances in the images. These text prompts are highly open-ended and follow three distinct styles, covering 20 practical skills. We benchmark over 20 state-of-the-art MLLMs and foundation models with potential multi-context visual grounding capabilities, along with our developed simple yet effective agentic baseline and a finetuned baseline by multi-context instruction tuning. Our evaluation reveals a non-trivial performance gap between existing MLLMs and humans, along with some insightful observations that suggest potential future directions. We hope that MC-Bench and our empirical findings encourage the research community to further advance the untapped potentials of MLLMs in instance-level tasks, particularly in multi-image contexts. Project page: https://xuyunqiu.github.io/MC-Bench.
[180] MEGA: Memory-Efficient 4D Gaussian Splatting for Dynamic Scenes
Xinjie Zhang, Zhening Liu, Yifan Zhang, Xingtong Ge, Dailan He, Tongda Xu, Yan Wang, Zehong Lin, Shuicheng Yan, Jun Zhang
Main category: cs.CV
TL;DR: This paper presents MEGA, a memory-efficient framework for 4D Gaussian Splatting that reduces storage requirements by up to 190× while maintaining rendering quality and speed through streamlined color representation and entropy-constrained Gaussian deformation.
Details
Motivation: 4D Gaussian Splatting faces significant memory and storage challenges due to requiring millions of 4D Gaussians with extensive attributes (up to 144 parameters for spherical harmonics coefficients), making it impractical for real-world applications despite its high-fidelity dynamic 3D scene capture capabilities.
Method: The framework introduces two key innovations: (1) streamlined color attribute decomposition into a 3-parameter per-Gaussian direct color component and a shared lightweight alternating current color predictor, eliminating spherical harmonics coefficients; (2) entropy-constrained Gaussian deformation technique using a deformation field to expand Gaussian action range and opacity-based entropy loss to minimize the number of required Gaussians.
Result: Achieved approximately 190× storage reduction on Technicolor dataset and 125× on Neural 3D Video dataset compared to original 4DGS, while maintaining comparable rendering speeds and scene representation quality. The method uses half-precision storage and zip compression for additional efficiency gains.
Conclusion: The proposed MEGA framework successfully addresses the memory bottleneck of 4DGS by creating a memory-efficient 4D Gaussian representation that dramatically reduces storage requirements without compromising rendering performance, setting a new standard for practical dynamic 3D scene representation.
Abstract: 4D Gaussian Splatting (4DGS) has recently emerged as a promising technique for capturing complex dynamic 3D scenes with high fidelity. It utilizes a 4D Gaussian representation and a GPU-friendly rasterizer, enabling rapid rendering speeds. Despite its advantages, 4DGS faces significant challenges, notably the requirement of millions of 4D Gaussians, each with extensive associated attributes, leading to substantial memory and storage cost. This paper introduces a memory-efficient framework for 4DGS. We streamline the color attribute by decomposing it into a per-Gaussian direct color component with only 3 parameters and a shared lightweight alternating current color predictor. This approach eliminates the need for spherical harmonics coefficients, which typically involve up to 144 parameters in classic 4DGS, thereby creating a memory-efficient 4D Gaussian representation. Furthermore, we introduce an entropy-constrained Gaussian deformation technique that uses a deformation field to expand the action range of each Gaussian and integrates an opacity-based entropy loss to limit the number of Gaussians, thus forcing our model to use as few Gaussians as possible to fit a dynamic scene well. With simple half-precision storage and zip compression, our framework achieves a storage reduction by approximately 190$\times$ and 125$\times$ on the Technicolor and Neural 3D Video datasets, respectively, compared to the original 4DGS. Meanwhile, it maintains comparable rendering speeds and scene representation quality, setting a new standard in the field. Code is available at https://github.com/Xinjie-Q/MEGA.
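The entropy loss on opacities pushes each Gaussian toward being fully opaque or fully transparent, so near-transparent Gaussians can be pruned. A plausible PyTorch rendering of such a binary-entropy penalty follows; the exact weighting and scheduling in MEGA may differ.

```python
import torch

def opacity_entropy_loss(opacity: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Binary-entropy penalty on per-Gaussian opacities in (0, 1).

    Minimizing it drives opacities toward 0 or 1; Gaussians stuck near 0
    can then be pruned, shrinking the model.
    """
    o = opacity.clamp(eps, 1.0 - eps)
    return -(o * o.log() + (1.0 - o) * (1.0 - o).log()).mean()

# Toy example: raw parameters squashed to (0, 1), as 3DGS/4DGS pipelines do.
opacity = torch.sigmoid(torch.randn(10_000, requires_grad=True))
loss = opacity_entropy_loss(opacity)
loss.backward()  # in training this term is added to the rendering losses
print(float(loss))
```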
[181] GraspDiffusion: Synthesizing Realistic Whole-body Hand-Object Interaction
Patrick Kwon, Chen Chen, Hanbyul Joo
Main category: cs.CV
TL;DR: GraspDiffusion is a novel generative method that creates realistic scenes of human-object interaction by first constructing life-like whole-body poses with controlled object placement, then using these poses to guide image synthesis for diverse and realistic human-object interaction scenes.
Details
Motivation: Current generative models can synthesize high-quality images but often fail to generate humans interacting with objects using their hands due to the model's misunderstanding of such interactions and the difficulty of synthesizing intricate body regions.
Method: GraspDiffusion takes a 3D object mesh as input and constructs life-like whole-body poses by separately leveraging generative priors for 3D body and hand poses, then optimizing them into a joint grasping pose. The resulting pose guides image synthesis to correctly reflect the intended interaction.
Result: GraspDiffusion successfully tackles the problem of generating full-bodied human-object interactions and outperforms previous methods in creating realistic and diverse human-object interaction scenes.
Conclusion: The paper presents a successful approach to address the challenging problem of generating realistic human-object interactions by combining pose generation with image synthesis, demonstrating superior performance compared to existing methods.
Abstract: Recent generative models can synthesize high-quality images but often fail to generate humans interacting with objects using their hands. This arises mostly from the model’s misunderstanding of such interactions, and the hardships of synthesizing intricate regions of the body. In this paper, we propose GraspDiffusion, a novel generative method that creates realistic scenes of human-object interaction. Given a 3D object mesh, GraspDiffusion first constructs life-like whole-body poses with control over the object’s location relative to the human body. This is achieved by separately leveraging the generative priors for 3D body and hand poses, optimizing them into a joint grasping pose. The resulting pose guides the image synthesis to correctly reflect the intended interaction, allowing the creation of realistic and diverse human-object interaction scenes. We demonstrate that GraspDiffusion can successfully tackle the relatively uninvestigated problem of generating full-bodied human-object interactions while outperforming previous methods. Code and models will be available at https://webtoon.github.io/GraspDiffusion
[182] Towards a Universal 3D Medical Multi-modality Generalization via Learning Personalized Invariant Representation
Zhaorui Tan, Xi Yang, Tan Pan, Tianyi Liu, Chen Jiang, Xin Guo, Qiufeng Wang, Anh Nguyen, Yuan Qi, Kaizhu Huang, Yuan Cheng
Main category: cs.CV
TL;DR: This paper proposes a personalized approach for multi-modal medical imaging that learns individual-specific invariant representations to improve generalization across different imaging modalities and patient populations, addressing the limitations of existing methods that only focus on common anatomical features.
Details
Motivation: Medical imaging modalities have distinct underlying principles creating modality gaps, and individual variations (organ size, metabolic rate) further impede generalization. Existing multi-modal approaches neglect individual differences and focus only on common anatomical features, resulting in weakened generalization across medical tasks.
Method: The authors propose learning personalized invariant representations (X_h) across various modalities by leveraging individual-level constraints and a learnable biological prior. This approach aims to capture individual-specific characteristics while maintaining generalizability across modalities.
Result: Extensive experimental results demonstrate that incorporating personalization significantly improves performance and generalization across diverse multi-modal medical scenarios. The learned personalized representations show high generalizability and transferability across various multi-modal medical tasks.
Conclusion: Personalization is critical for multi-modal generalization in medical imaging. The proposed approach of learning personalized invariant representations effectively addresses both modality gaps and individual variations, consistently improving performance across different medical tasks and confirming the effectiveness of personalized multi-modal learning.
Abstract: The differences among medical imaging modalities, driven by distinct underlying principles, pose significant challenges for generalization in multi-modal medical tasks. Beyond modality gaps, individual variations, such as differences in organ size and metabolic rate, further impede a model’s ability to generalize effectively across both modalities and diverse populations. Despite the importance of personalization, existing approaches to multi-modal generalization often neglect individual differences, focusing solely on common anatomical features. This limitation may result in weakened generalization in various medical tasks. In this paper, we unveil that personalization is critical for multi-modal generalization. Specifically, we propose an approach to achieve personalized generalization through approximating the underlying personalized invariant representation ${X}_h$ across various modalities by leveraging individual-level constraints and a learnable biological prior. We validate the feasibility and benefits of learning a personalized ${X}_h$, showing that this representation is highly generalizable and transferable across various multi-modal medical tasks. Extensive experimental results consistently show that the additionally incorporated personalization significantly improves performance and generalization across diverse scenarios, confirming its effectiveness.
[183] Watermark Anything with Localized Messages
Tom Sander, Pierre Fernandez, Alain Durmus, Teddy Furon, Matthijs Douze
Main category: cs.CV
TL;DR: The paper introduces WAM (Watermark Anything Model), a deep learning approach for localized image watermarking that can embed and extract watermarks from small regions of images, enabling detection of watermarked areas in spliced images and recovery of multiple distinct messages.
Details
Motivation: Existing image watermarking methods are not designed to handle small watermarked areas, which limits their use in real-world scenarios where images may come from different sources or have been partially edited. There's a need for watermarking that can work on localized regions rather than entire images.
Method: WAM consists of an embedder that imperceptibly modifies input images and an extractor that segments images into watermarked/non-watermarked areas and recovers hidden messages. The models are jointly trained at low resolution without perceptual constraints, then post-trained for imperceptibility and multiple watermarks handling.
Result: WAM achieves competitive performance with state-of-the-art methods in imperceptibility and robustness, particularly against inpainting and splicing attacks on high-resolution images. It can locate watermarked areas in spliced images and extract distinct 32-bit messages with less than 1 bit error from multiple small regions (≤10% of image surface) even on 256x256 images.
Conclusion: WAM successfully addresses the limitation of existing watermarking methods by enabling localized watermarking capabilities. It maintains competitive performance while offering new functionalities for detecting and extracting watermarks from small, specific regions of images, making it more suitable for real-world applications involving image editing and composition.
Abstract: Image watermarking methods are not tailored to handle small watermarked areas. This restricts applications in real-world scenarios where parts of the image may come from different sources or have been edited. We introduce a deep-learning model for localized image watermarking, dubbed the Watermark Anything Model (WAM). The WAM embedder imperceptibly modifies the input image, while the extractor segments the received image into watermarked and non-watermarked areas and recovers one or several hidden messages from the areas found to be watermarked. The models are jointly trained at low resolution and without perceptual constraints, then post-trained for imperceptibility and multiple watermarks. Experiments show that WAM is competitive with state-of-the-art methods in terms of imperceptibility and robustness, especially against inpainting and splicing, even on high-resolution images. Moreover, it offers new capabilities: WAM can locate watermarked areas in spliced images and extract distinct 32-bit messages with less than 1 bit error from multiple small regions – no larger than 10% of the image surface – even for small 256x256 images. Training and inference code and model weights are available at https://github.com/facebookresearch/watermark-anything.
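The "less than 1 bit error" criterion is plain Hamming distance between embedded and extracted 32-bit messages, averaged over detected regions. A small bookkeeping sketch, assuming the extractor returns one predicted message per region:

```python
import numpy as np

def bit_errors(pred_msgs: np.ndarray, true_msgs: np.ndarray) -> np.ndarray:
    """Per-message Hamming distance between predicted and embedded 32-bit messages."""
    return (pred_msgs != true_msgs).sum(axis=1)

rng = np.random.default_rng(0)
true_msgs = rng.integers(0, 2, size=(3, 32))   # three embedded messages
pred_msgs = true_msgs.copy()
pred_msgs[1, 5] ^= 1                            # flip one bit in message 1
print(bit_errors(pred_msgs, true_msgs))         # -> [0 1 0]
print(bit_errors(pred_msgs, true_msgs).mean(), "avg bit errors per message")
```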
[184] Physically Consistent Image Augmentation for Deep Learning in Mueller Matrix Polarimetry
Christopher Hahne, Omar Rodriguez-Nunez, Éléa Gros, Théotim Lucas, Ekkehard Hewer, Tatiana Novikova, Theoni Maragkou, Philippe Schucht, Richard McKinley
Main category: cs.CV
TL;DR: This paper introduces a physics-based data augmentation framework for Mueller matrix polarimetry that preserves polarization properties, addressing the problem that standard augmentations can falsify results in polarimetric imaging for deep learning applications.
Details
Motivation: Standard data augmentation techniques like rotations and flips do not preserve polarization properties in Mueller matrix images, leading to falsified results when applied to polarimetric data. There is a need for physics-informed augmentation methods that maintain polarization fidelity while enhancing dataset diversity for deep learning models working with limited polarimetric datasets.
Method: The authors developed a versatile simulation framework that applies physically consistent rotations and flips to Mueller matrices while maintaining polarization fidelity. The method is tailored specifically for polarimetric data to preserve the essential polarization properties during augmentation transformations.
Result: Experimental validation showed that conventional augmentations can lead to falsified results in polarimetric data, while the proposed physics-based augmentations maintained consistency with real-world captures. When applied to semantic segmentation tasks, the method achieved substantial improvements in model generalization and performance, particularly benefiting datasets with limited sample sizes.
Conclusion: The study demonstrates the critical necessity of physics-informed data augmentation for polarimetric imaging in deep learning, providing a robust framework that enables broader adoption of DL models for polarimetric applications. The approach successfully unlocks the potential of deep learning for polarimetric datasets with limited samples and paves the way for more robust applications across diverse polarimetric research fields.
Abstract: Mueller matrix polarimetry captures essential information about polarized light interactions with a sample, presenting unique challenges for data augmentation in deep learning due to its distinct structure. While augmentations are an effective and affordable way to enhance dataset diversity and reduce overfitting, standard transformations like rotations and flips do not preserve the polarization properties in Mueller matrix images. To this end, we introduce a versatile simulation framework that applies physically consistent rotations and flips to Mueller matrices, tailored to maintain polarization fidelity. Our experimental results across multiple datasets reveal that conventional augmentations can lead to falsified results when applied to polarimetric data, underscoring the necessity of our physics-based approach. In our experiments, we first compare our polarization-specific augmentations against real-world captures to validate their physical consistency. We then apply these augmentations in a semantic segmentation task, achieving substantial improvements in model generalization and performance. This study underscores the necessity of physics-informed data augmentation for polarimetric imaging in deep learning (DL), paving the way for broader adoption and more robust applications across diverse research in the field. In particular, our framework unlocks the potential of DL models for polarimetric datasets with limited sample sizes. Our code implementation is available at github.com/hahnec/polar_augment.
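A physically consistent rotation has two parts: rotating the pixel grid, and rotating the polarimetric reference frame of every per-pixel Mueller matrix as M' = R(θ) M R(-θ), where R(θ) is the standard 4x4 rotator acting on the linear-polarization components through angle 2θ. The NumPy sketch below shows the per-pixel frame rotation; the accompanying spatial image rotation, handled by any standard routine, is omitted.

```python
import numpy as np

def mueller_rotator(theta: float) -> np.ndarray:
    """4x4 rotation of the polarization reference frame by angle theta (radians)."""
    c, s = np.cos(2 * theta), np.sin(2 * theta)
    return np.array([[1, 0, 0, 0],
                     [0, c, s, 0],
                     [0, -s, c, 0],
                     [0, 0, 0, 1]], dtype=float)

def rotate_mueller_image(M: np.ndarray, theta: float) -> np.ndarray:
    """Apply M' = R(theta) @ M @ R(-theta) to every pixel of an (H, W, 4, 4) image.

    Note: a full augmentation must also rotate the pixel grid itself by theta;
    only the polarimetric part is shown here.
    """
    R, Rinv = mueller_rotator(theta), mueller_rotator(-theta)
    return np.einsum("ij,hwjk,kl->hwil", R, M, Rinv)

M = np.tile(np.eye(4), (8, 8, 1, 1))       # toy 8x8 Mueller-matrix image
M_rot = rotate_mueller_image(M, np.deg2rad(90))
print(np.allclose(M_rot, M))                # identity is rotation-invariant -> True
```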
[185] Online Episodic Memory Visual Query Localization with Egocentric Streaming Object Memory
Zaira Manigrasso, Matteo Dunnhofer, Antonino Furnari, Moritz Nottebaum, Antonio Finocchiaro, Davide Marana, Rosario Forte, Giovanni Maria Farinella, Christian Micheloni
Main category: cs.CV
TL;DR: This paper introduces OVQ2D, a challenging task for real-time episodic memory retrieval in wearable cameras that processes video streams online with limited memory, and proposes ESOM framework that achieves ~4% success rate but shows potential for significant improvement with better object tracking and discovery components.
Details
Motivation: Existing episodic memory systems assume offline settings with full video access, which is impractical for power and storage-constrained wearable devices in real-world scenarios. There's a need for online systems that can process video streams in real-time while maintaining compact memory for object retrieval.
Method: The paper proposes ESOM (Egocentric Streaming Object Memory), a framework with three key modules: (1) object discovery module to find objects, (2) object tracking module to track objects across frames, and (3) memory module to store spatio-temporal object information efficiently for querying without requiring full video history access.
Result: ESOM outperforms other online approaches on Ego4D dataset but achieves only ~4% success rate, indicating the task’s difficulty. Performance analysis shows dramatic improvements with perfect components: 31.91% with perfect tracking, 40.55% with perfect discovery, and 81.92% with both perfect tracking and discovery.
Conclusion: OVQ2D represents a challenging but important step toward practical episodic memory systems for wearable devices. While ESOM shows promise and superiority over existing online methods, the low success rates highlight the critical need for improved object tracking and discovery components to make such systems viable for real-world applications.
Abstract: Episodic memory retrieval enables wearable cameras to recall objects or events previously observed in video. However, existing formulations assume an “offline” setting with full video access at query time, limiting their applicability in real-world scenarios with power and storage-constrained wearable devices. Towards more application-ready episodic memory systems, we introduce Online Visual Query 2D (OVQ2D), a task where models process video streams online, observing each frame only once, and retrieve object localizations using a compact memory instead of full video history. We address OVQ2D with ESOM (Egocentric Streaming Object Memory), a novel framework integrating an object discovery module, an object tracking module, and a memory module that find, track, and store spatio-temporal object information for efficient querying. Experiments on Ego4D demonstrate ESOM’s superiority over other online approaches, though OVQ2D remains challenging, with top performance at only ~4% success. ESOM’s accuracy increases markedly with perfect object tracking (31.91%), discovery (40.55%), or both (81.92%), underscoring the need for applied research on these components.
[186] RadAlign: Advancing Radiology Report Generation with Vision-Language Concept Alignment
Difei Gu, Yunhe Gao, Yang Zhou, Mu Zhou, Dimitris Metaxas
Main category: cs.CV
TL;DR: RadAlign is a novel framework that combines vision-language models (VLMs) with large language models (LLMs) to achieve both accurate chest X-ray disease classification (AUC 0.885) and high-quality radiology report generation (GREEN score 0.678), mimicking radiologist workflow while maintaining interpretability and reducing hallucinations.
Details
Motivation: Current automated chest radiograph interpretation approaches face a trade-off between classification accuracy and interpretability - they either focus on accurate disease classification at the expense of interpretability or generate detailed but potentially unreliable reports through image captioning. There's a need for a solution that combines both accurate disease classification and reliable, detailed report generation to support clinical workflow.
Method: RadAlign employs a two-stage approach inspired by radiologist workflow: (1) A specialized vision-language model (VLM) aligns visual features with key medical concepts for disease classification, and (2) The recognized medical conditions (as text-based concepts in aligned visual-language space) prompt an LLM-based report generation system. The framework is enhanced with a retrieval-augmented generation mechanism that grounds outputs in similar historical cases to improve report quality and reduce hallucinations.
Result: RadAlign achieved superior disease classification performance with an average AUC of 0.885 across multiple diseases. For report generation, it delivered a GREEN score of 0.678, outperforming state-of-the-art methods that achieved 0.634. The framework maintained strong clinical interpretability while reducing hallucinations in generated reports.
Conclusion: RadAlign successfully advances automated medical imaging and report analysis by integrating predictive and generative AI in a framework that mimics radiologist workflow. It demonstrates that combining VLMs for accurate disease classification with LLMs for report generation, enhanced by retrieval-augmented generation, can achieve both high accuracy and interpretability while reducing unreliable outputs in clinical applications.
Abstract: Automated chest radiograph interpretation requires both accurate disease classification and detailed radiology report generation, presenting a significant challenge in the clinical workflow. Current approaches either focus on classification accuracy at the expense of interpretability or generate detailed but potentially unreliable reports through image captioning techniques. In this study, we present RadAlign, a novel framework that combines the predictive accuracy of vision-language models (VLMs) with the reasoning capabilities of large language models (LLMs). Inspired by the radiologist’s workflow, RadAlign first employs a specialized VLM to align visual features with key medical concepts, achieving superior disease classification with an average AUC of 0.885 across multiple diseases. These recognized medical conditions, represented as text-based concepts in the aligned visual-language space, are then used to prompt LLM-based report generation. Enhanced by a retrieval-augmented generation mechanism that grounds outputs in similar historical cases, RadAlign delivers superior report quality with a GREEN score of 0.678, outperforming state-of-the-art methods’ 0.634. Our framework maintains strong clinical interpretability while reducing hallucinations, advancing automated medical imaging and report analysis through integrated predictive and generative AI. Code is available at https://github.com/difeigu/RadAlign.
[187] MCP-MedSAM: A Powerful Lightweight Medical Segment Anything Model Trained with a Single GPU in Just One Day
Donghang Lyu, Ruochen Gao, Marius Staring
Main category: cs.CV
TL;DR: The paper proposes MCP-MedSAM, a lightweight medical image segmentation model that can be trained on a single A100 GPU in one day while achieving superior performance compared to existing methods by introducing modality and content prompts.
Details
Motivation: The existing Segment Anything Model (SAM) has a large model size and high GPU requirements that hinder its scalability in medical domain applications, creating a need for a more efficient yet powerful medical segmentation solution.
Method: The authors develop MCP-MedSAM with two key innovations: (1) introduction of modality prompts and content prompts that are processed through a prompt encoder to improve segmentation performance, and (2) implementation of a modality-based data sampling strategy to address data imbalance between different medical imaging modalities.
Result: MCP-MedSAM achieved superior performance compared to top-ranking methods on challenge leaderboards while requiring only one day of training on a single A100 GPU with 40GB memory, demonstrating both efficiency and effectiveness.
Conclusion: The proposed MCP-MedSAM successfully addresses the computational limitations of existing medical segmentation models while maintaining superior performance, making it more accessible and practical for medical domain applications with limited computational resources.
Abstract: Medical image segmentation involves partitioning medical images into meaningful regions, with a focus on identifying anatomical structures and lesions. It has broad applications in healthcare, and deep learning methods have enabled significant advancements in automating this process. Recently, the introduction of the Segment Anything Model (SAM), the first foundation model for segmentation tasks, has prompted researchers to adapt it for the medical domain to improve performance across various tasks. However, SAM’s large model size and high GPU requirements hinder its scalability and development in the medical domain. In this work, we propose MCP-MedSAM, a powerful and lightweight medical SAM model designed to be trainable on a single A100 GPU with 40GB of memory within one day while delivering superior segmentation performance. Recognizing the significant internal differences between modalities and the need for direct segmentation target information within bounding boxes, we introduce two kinds of prompts: the modality prompt and the content prompt. After passing through the prompt encoder, their embedding representations can further improve the segmentation performance by incorporating more relevant information without adding significant training overhead. Additionally, we adopt an effective modality-based data sampling strategy to address data imbalance between modalities, ensuring more balanced performance across all modalities. Our method was trained and evaluated using a large-scale challenge dataset; compared to top-ranking methods on the challenge leaderboard, MCP-MedSAM achieved superior performance while requiring only one day of training on a single GPU. The code is publicly available at https://github.com/dong845/MCP-MedSAM.
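A modality-based sampling strategy of the kind described can be as simple as drawing the modality uniformly first and then a case within it, so rare modalities are seen as often as common ones. A hypothetical sketch of that balancing recipe (MCP-MedSAM's exact scheme may differ):

```python
import random
from collections import defaultdict

def modality_balanced_sampler(cases: list[dict], n_draws: int) -> list[dict]:
    """Sample cases so each modality is drawn with equal probability,
    regardless of how many cases it contributes (a common balancing recipe)."""
    by_modality: dict[str, list[dict]] = defaultdict(list)
    for c in cases:
        by_modality[c["modality"]].append(c)
    modalities = list(by_modality)
    return [random.choice(by_modality[random.choice(modalities)])
            for _ in range(n_draws)]

# Imbalanced toy dataset: 900 CT cases vs. 100 ultrasound cases.
cases = ([{"modality": "CT", "id": i} for i in range(900)] +
         [{"modality": "US", "id": i} for i in range(100)])
batch = modality_balanced_sampler(cases, 10)
print([c["modality"] for c in batch])   # roughly half CT, half US in expectation
```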
[188] FlowEdit: Inversion-Free Text-Based Editing Using Pre-Trained Flow Models
Vladimir Kulikov, Matan Kleiner, Inbar Huberman-Spiegelglas, Tomer Michaeli
Main category: cs.CV
TL;DR: FlowEdit is an inversion-free, optimization-free text-based image editing method for pre-trained T2I flow models that directly maps between source and target distributions using ODEs, achieving state-of-the-art results on Stable Diffusion 3 and FLUX.
Details
Motivation: Existing image editing methods using pre-trained T2I diffusion models rely on image inversion into noise maps, which is insufficient alone and requires additional sampling process interventions. These methods are not seamlessly transferable between different model architectures, creating a need for a more universal approach.Method: FlowEdit constructs an Ordinary Differential Equation (ODE) that directly maps between source and target distributions corresponding to source and target text prompts. This approach bypasses the need for inversion or optimization while being model agnostic across different T2I flow model architectures.
Result: The method achieves state-of-the-art results on Stable Diffusion 3 and FLUX models. FlowEdit demonstrates lower transport cost compared to inversion-based approaches while maintaining model transferability across different architectures.
Conclusion: FlowEdit successfully addresses the limitations of existing text-based image editing methods by providing an inversion-free, optimization-free solution that works across different T2I flow model architectures while achieving superior performance through direct distribution mapping via ODEs.
Abstract: Editing real images using a pre-trained text-to-image (T2I) diffusion/flow model often involves inverting the image into its corresponding noise map. However, inversion by itself is typically insufficient for obtaining satisfactory results, and therefore many methods additionally intervene in the sampling process. Such methods achieve improved results but are not seamlessly transferable between model architectures. Here, we introduce FlowEdit, a text-based editing method for pre-trained T2I flow models, which is inversion-free, optimization-free and model agnostic. Our method constructs an ODE that directly maps between the source and target distributions (corresponding to the source and target text prompts) and achieves a lower transport cost than the inversion approach. This leads to state-of-the-art results, as we illustrate with Stable Diffusion 3 and FLUX. Code and examples are available on the project’s webpage.
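For intuition, here is a heavily hedged sketch of an inversion-free editing loop in the style the abstract describes: the edited latent is advanced by the difference between velocity predictions under the target and source prompts. The rectified-flow convention, the `velocity` interface, and the step count are assumptions, not the paper's exact algorithm.
```python
# Sketch only: assumes a rectified-flow model with x_t = (1 - t) * x_0 + t * noise
# and a pretrained predictor velocity(x, t, prompt); not the authors' code.
import torch

def flow_edit(x_src, velocity, src_prompt, tgt_prompt, steps=50):
    x_edit = x_src.clone()                       # start the edit from the source
    ts = torch.linspace(1.0, 0.0, steps + 1)     # integrate noise-end -> data-end
    for i in range(steps):
        t, dt = ts[i], ts[i + 1] - ts[i]         # dt < 0 in this convention
        noise = torch.randn_like(x_src)
        # Noise both latents with the same sample so the velocity difference
        # isolates the effect of swapping the prompt.
        z_src = (1 - t) * x_src + t * noise
        z_edit = z_src + (x_edit - x_src)
        dv = velocity(z_edit, t, tgt_prompt) - velocity(z_src, t, src_prompt)
        x_edit = x_edit + dt * dv                # Euler step on the direct ODE
    return x_edit
```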
[189] Conformal Predictions for Human Action Recognition with Vision-Language Models
Bary Tim, Fuchs Clément, Macq Benoît
Main category: cs.CV
TL;DR: This paper applies Conformal Prediction techniques to Vision-Language Models for human action recognition in Human-in-the-Loop systems, proposing temperature tuning to reduce candidate classes while maintaining reliability without additional calibration data.
Details
Motivation: High-stakes real-world applications require reliable AI systems that can effectively collaborate with human decision-makers. Existing human action recognition systems using Vision-Language Models lack rigorous uncertainty quantification and coverage guarantees needed for safe human-AI interaction.Method: The authors apply Conformal Prediction techniques to provide rigorous coverage guarantees for Vision-Language Models in human action recognition tasks. They propose temperature tuning of softmax predictions to address the long-tail distribution problem that arises when reducing candidate classes, without requiring additional calibration data.
Result: Conformal Prediction successfully reduces the average number of candidate classes in human action recognition without modifying the underlying Vision-Language Model. However, these reductions create long-tail distributions that can limit practical utility, which is mitigated through the proposed temperature tuning approach.
Conclusion: The integration of Conformal Prediction with temperature tuning enhances the reliability of Vision-Language Models for human action recognition in Human-in-the-Loop systems, contributing to more effective multi-modal human-AI interaction in dynamic real-world environments without requiring additional calibration data.
Abstract: Human-in-the-Loop (HITL) systems are essential in high-stakes, real-world applications where AI must collaborate with human decision-makers. This work investigates how Conformal Prediction (CP) techniques, which provide rigorous coverage guarantees, can enhance the reliability of state-of-the-art human action recognition (HAR) systems built upon Vision-Language Models (VLMs). We demonstrate that CP can significantly reduce the average number of candidate classes without modifying the underlying VLM. However, these reductions often result in distributions with long tails which can hinder their practical utility. To mitigate this, we propose tuning the temperature of the softmax prediction, without using additional calibration data. This work contributes to ongoing efforts for multi-modal human-AI interaction in dynamic real-world environments.
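As a concrete illustration, a standard split-conformal construction over temperature-scaled softmax scores looks like the sketch below; the nonconformity score and the way temperature enters are assumptions consistent with common CP practice, not the paper's exact recipe.
```python
# Split conformal prediction sets with a softmax temperature knob (sketch).
import numpy as np

def conformal_sets(cal_logits, cal_labels, test_logits, alpha=0.1, T=1.0):
    def softmax(z):
        z = z / T                                # temperature tuning of the softmax
        e = np.exp(z - z.max(axis=1, keepdims=True))
        return e / e.sum(axis=1, keepdims=True)

    p_cal = softmax(cal_logits)
    # Nonconformity score: 1 - probability assigned to the true class.
    scores = 1.0 - p_cal[np.arange(len(cal_labels)), cal_labels]
    n = len(scores)
    qhat = np.quantile(scores, np.ceil((n + 1) * (1 - alpha)) / n, method="higher")
    # Prediction set: every class whose score clears the calibrated threshold.
    return [np.where(1.0 - p <= qhat)[0] for p in softmax(test_logits)]
```
Note that coverage holds for any fixed T because calibration and test use the same scores; tuning T only reshapes the score distribution, and hence the set sizes, which is consistent with needing no extra calibration data.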
[190] Do large language vision models understand 3D shapes?
Sagi Eppel
Main category: cs.CV
TL;DR: This paper evaluates large vision language models’ (LVLMs) ability to understand 3D shapes by testing their capacity to match objects with identical 3D geometry but different orientations and materials, finding that while models show some 3D understanding, they significantly underperform humans especially when both orientation and material change simultaneously.
Details
Motivation: To investigate whether current LVLMs truly understand 3D shapes, which are fundamental building blocks of visual perception, by systematically testing their ability to recognize shape invariance across different viewing conditions and surface properties.Method: Created a large dataset of CGI-rendered test images featuring diverse 3D objects with varying orientations, materials, and textures. Tested LVLMs (GPT, Claude, Gemini, LLama) on shape matching tasks where models had to identify objects with identical 3D geometry despite changes in viewpoint and surface appearance.
Result: LVLMs performed significantly below human level but above random chance in 3D shape matching. Models could handle orientation changes or material changes individually, but performance dropped substantially when both orientation and material were changed simultaneously.
Conclusion: Current LVLMs have developed some abstract understanding of 3D shapes but still fall far short of human-level 3D perception, particularly struggling with the combined challenges of orientation and material variation, indicating limitations in their geometric understanding capabilities.
Abstract: Large vision language models (LVLMs) are the leading AI approach for achieving a general visual understanding of the world. Models such as GPT, Claude, Gemini, and LLama can use images to understand and analyze complex visual scenes. 3D objects and shapes are the basic building blocks of the world, and recognizing them is a fundamental part of human perception. The goal of this work is to test whether LVLMs truly understand 3D shapes by testing the models’ ability to identify and match objects of the exact same 3D shape but with different orientations and materials/textures. A large number of test images were created using CGI with a huge number of highly diverse objects, materials, and scenes. The results of this test show that the ability of such models to match 3D shapes is significantly below humans but much higher than random guesses, suggesting that the models have gained some abstract understanding of 3D shapes but still trail far behind humans in this task. Mainly, it seems that the models can easily identify the same object with a different orientation, as well as match identical 3D shapes of the same orientation but with different materials and textures. However, when both the object material and orientation are changed, all models perform poorly relative to humans. Code and benchmark are available.
[191] Revealing Bias Formation in Deep Neural Networks Through the Geometric Mechanisms of Human Visual Decoupling
Yanbiao Ma, Bowei Liu, Boyuan Gao, Wei Dai, Jiayi Chen, Shuo Li, Andi Zhang
Main category: cs.CV
TL;DR: This paper proposes a geometric analysis framework to understand why deep neural networks exhibit biases toward certain object categories, linking these biases to differences in geometric complexity of class-specific perceptual manifolds, and introduces a library for calculating manifold geometric properties.
Details
Motivation: Deep neural networks show unexplained biases toward certain object categories even with balanced training data. The underlying mechanisms causing these biases are unclear, motivating the need for a theoretical framework to understand this phenomenon through geometric analysis of perceptual manifolds.Method: The authors develop a geometric analysis framework inspired by the human visual system that examines the geometric complexity of class-specific perceptual manifolds in DNNs. They create the Perceptual-Manifold-Geometry library to calculate geometric properties of these manifolds and analyze how geometric complexity relates to model bias.
Result: The study reveals that differences in geometric complexity of perceptual manifolds can lead to varying recognition capabilities across different object categories, which introduces biases in deep neural networks. The research provides empirical evidence linking manifold geometry to recognition performance disparities.
Conclusion: The geometric complexity of class-specific perceptual manifolds in DNNs is a key factor underlying category biases in object recognition. This geometric perspective provides new insights into DNN bias mechanisms and offers a computational tool for analyzing manifold properties to better understand and potentially mitigate these biases.
Abstract: Deep neural networks (DNNs) often exhibit biases toward certain categories during object recognition, even under balanced training data conditions. The intrinsic mechanisms underlying these biases remain unclear. Inspired by the human visual system, which decouples object manifolds through hierarchical processing to achieve object recognition, we propose a geometric analysis framework linking the geometric complexity of class-specific perceptual manifolds in DNNs to model bias. Our findings reveal that differences in geometric complexity can lead to varying recognition capabilities across categories, introducing biases. To support this analysis, we present the Perceptual-Manifold-Geometry library, designed for calculating the geometric properties of perceptual manifolds.
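For a flavor of what such a geometric analysis computes, here is one hedged example of a manifold statistic (the library's actual API and property set are not specified here): the participation-ratio "effective dimension" of a class's feature cloud.
```python
# Illustrative manifold statistic; not necessarily one the library exposes.
import numpy as np

def effective_dimension(features):
    """features: (n_samples, d) embeddings of one class from a DNN layer."""
    x = features - features.mean(axis=0, keepdims=True)
    # Eigenvalues of the class covariance describe the manifold's spread.
    eig = np.linalg.svd(x, compute_uv=False) ** 2 / (len(x) - 1)
    return eig.sum() ** 2 / (eig ** 2).sum()    # participation ratio
```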
[192] Predicting the Reliability of an Image Classifier under Image Distortion
Dang Nguyen, Sunil Gupta, Kien Do, Svetha Venkatesh
Main category: cs.CV
TL;DR: This paper proposes a Gaussian process-based method to predict whether image classifiers will be reliable or unreliable under different distortion levels, addressing the class imbalance problem in training data.
Details
Motivation: Deep learning models for image classification are vulnerable to image distortions, causing significant accuracy drops. For quality control purposes, it's important to predict whether an image classifier will remain reliable under specific distortion levels, but this creates a highly imbalanced training dataset problem.Method: The authors construct a training dataset with distortion levels labeled as “reliable” or “non-reliable” based on whether classifier accuracy exceeds a user-specified threshold. They then train a machine learning model (distortion-classifier) to predict reliability for unseen distortion levels. To handle the highly imbalanced training data, they propose a Gaussian process-based method to rebalance the training set.
Result: Extensive experiments on six popular image datasets demonstrate that their Gaussian process-based rebalancing method significantly outperforms several baseline approaches for predicting classifier reliability under distortions.
Conclusion: The proposed Gaussian process-based rebalancing approach effectively addresses the class imbalance problem in predicting image classifier reliability under distortions, providing a valuable tool for quality control in deep learning applications.
Abstract: In image classification tasks, deep learning models are vulnerable to image distortions, i.e., their accuracy significantly drops if the input images are distorted. An image-classifier is considered “reliable” if its accuracy on distorted images is above a user-specified threshold. For a quality control purpose, it is important to predict if the image-classifier is unreliable/reliable under a distortion level. In other words, we want to predict whether a distortion level makes the image-classifier “non-reliable” or “reliable”. Our solution is to construct a training set consisting of distortion levels along with their “non-reliable” or “reliable” labels, and train a machine learning predictive model (called distortion-classifier) to classify unseen distortion levels. However, learning an effective distortion-classifier is a challenging problem as the training set is highly imbalanced. To address this problem, we propose a Gaussian process based method to rebalance the training set. We conduct extensive experiments to show that our method significantly outperforms several baselines on six popular image datasets.
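A minimal sketch of the distortion-classifier setup, assuming scikit-learn and omitting the paper's Gaussian-process rebalancing step (only the labeling and fitting stages are shown):
```python
# Label distortion levels by whether accuracy clears a threshold, then fit a
# probabilistic classifier over distortion space. Details are assumptions.
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF

def fit_distortion_classifier(levels, accuracies, threshold=0.8):
    """levels: (n, k) distortion parameters; accuracies: (n,) measured accuracy."""
    y = (accuracies >= threshold).astype(int)    # 1 = "reliable"
    gp = GaussianProcessClassifier(kernel=1.0 * RBF(length_scale=1.0))
    gp.fit(levels, y)
    return gp  # gp.predict_proba(new_levels) scores unseen distortion levels
```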
[193] PRISM: High-Resolution & Precise Counterfactual Medical Image Generation using Language-guided Stable Diffusion
Amar Kumar, Anita Kriz, Mohammad Havaei, Tal Arbel
Main category: cs.CV
TL;DR: PRISM is a framework that uses Stable Diffusion to generate high-resolution, language-guided medical image counterfactuals, enabling precise modification of spurious correlations and disease features to develop more robust medical imaging classifiers.
Details
Motivation: Medical imaging deep learning systems face significant challenges including spurious correlations, data imbalances, and limited text annotations. There is a need to adapt vision-language foundation models from natural images to medical imaging tasks to address these unique complexities and improve system reliability and generalizability.Method: The authors present PRISM, a framework that leverages foundation models and Stable Diffusion to generate high-resolution, language-guided medical image counterfactuals. The approach can selectively modify spurious correlations (like medical devices) and disease features while preserving other image characteristics.
Result: PRISM demonstrates unprecedented precision in counterfactual generation, successfully removing and adding specific attributes in medical images. The framework advances counterfactual generation capabilities and enables the development of more robust downstream classifiers suitable for clinical deployment.
Conclusion: PRISM successfully addresses key challenges in medical imaging by providing a robust framework for generating precise counterfactuals that can improve classifier robustness. The publicly available code facilitates broader adoption and research in this area, potentially leading to more clinically deployable medical imaging solutions.
Abstract: Developing reliable and generalizable deep learning systems for medical imaging faces significant obstacles due to spurious correlations, data imbalances, and limited text annotations in datasets. Addressing these challenges requires architectures that are robust to the unique complexities posed by medical imaging data. Rapid advancements in vision-language foundation models within the natural image domain prompt the question of how they can be adapted for medical imaging tasks. In this work, we present PRISM, a framework that leverages foundation models to generate high-resolution, language-guided medical image counterfactuals using Stable Diffusion. Our approach demonstrates unprecedented precision in selectively modifying spurious correlations (the medical devices) and disease features, enabling the removal and addition of specific attributes while preserving other image characteristics. Through extensive evaluation, we show how PRISM advances counterfactual generation and enables the development of more robust downstream classifiers for clinically deployable solutions. To facilitate broader adoption and research, we make our code publicly available at https://github.com/Amarkr1/PRISM.
[194] EmotiCrafter: Text-to-Emotional-Image Generation based on Valence-Arousal Model
Shengqi Dang, Yi He, Long Ling, Ziqing Qian, Nanxuan Zhao, Nan Cao
Main category: cs.CV
TL;DR: The paper introduces EmotiCrafter, a novel model for continuous emotional image content generation (C-EICG) that generates images based on text prompts and Valence-Arousal values, overcoming limitations of discrete emotion categories in existing methods.
Details
Motivation: Existing emotional image generation methods rely on discrete emotion categories, making it difficult to capture complex and subtle emotional nuances accurately. Additionally, these methods struggle to control specific image content based on text prompts, limiting their effectiveness for emotionally rich content generation.Method: The authors propose EmotiCrafter with a novel emotion-embedding mapping network that embeds Valence-Arousal values into textual features, enabling alignment between specific emotions and input prompts. They also introduce a specialized loss function to enhance emotion expression in generated images.
Result: Experimental results demonstrate that EmotiCrafter effectively generates images representing specific emotions with desired content and outperforms existing techniques in emotional image generation tasks.
Conclusion: The paper successfully addresses the limitations of discrete emotion-based image generation by introducing continuous emotional control through Valence-Arousal values, enabling more nuanced and controllable emotional image content generation.
Abstract: Recent research shows that emotions can enhance users’ cognition and influence information communication. While research on visual emotion analysis is extensive, limited work has been done on helping users generate emotionally rich image content. Existing work on emotional image generation relies on discrete emotion categories, making it challenging to capture complex and subtle emotional nuances accurately. Additionally, these methods struggle to control the specific content of generated images based on text prompts. In this work, we introduce the new task of continuous emotional image content generation (C-EICG) and present EmotiCrafter, an emotional image generation model that generates images based on text prompts and Valence-Arousal values. Specifically, we propose a novel emotion-embedding mapping network that embeds Valence-Arousal values into textual features, enabling the capture of specific emotions in alignment with intended input prompts. Additionally, we introduce a loss function to enhance emotion expression. The experimental results show that our method effectively generates images representing specific emotions with the desired content and outperforms existing techniques.
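A hedged PyTorch sketch of the general idea of an emotion-embedding mapping network; the fusion-by-addition design, hidden size, and token broadcasting here are assumptions, not the paper's architecture:
```python
# Sketch: map (valence, arousal) into the prompt-feature space and fuse.
import torch
import torch.nn as nn

class VAEmbedMapper(nn.Module):
    def __init__(self, text_dim=768, hidden=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2, hidden), nn.GELU(), nn.Linear(hidden, text_dim)
        )

    def forward(self, text_emb, va):
        """text_emb: (B, L, D) prompt features; va: (B, 2) valence-arousal."""
        # Broadcast the emotion embedding over all prompt tokens.
        return text_emb + self.mlp(va).unsqueeze(1)
```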
[195] MotionLab: Unified Human Motion Generation and Editing via the Motion-Condition-Motion Paradigm
Ziyan Guo, Zeyu Hu, De Wen Soh, Na Zhao
Main category: cs.CV
TL;DR: MotionLab introduces a unified Motion-Condition-Motion paradigm for human motion generation and editing, using rectified flows and a MotionFlow Transformer to handle diverse motion tasks without task-specific modules, demonstrating strong generalization and efficiency across benchmarks.
Details
Motivation: Current human motion generation approaches offer isolated, task-specific solutions that lack editing capabilities, fine-grained control, and knowledge sharing across tasks. Existing unified methods only use different modalities as conditions but still cannot perform editing or facilitate cross-task learning effectively.Method: The paper proposes the Motion-Condition-Motion paradigm with three concepts (source motion, condition, target motion) and MotionLab framework featuring: (1) MotionFlow Transformer for conditional generation/editing, (2) Aligned Rotational Position Encoding for time synchronization, (3) Task Specified Instruction Modulation, and (4) Motion Curriculum Learning for multi-task knowledge sharing.
Result: MotionLab demonstrates promising generalization capabilities and inference efficiency across multiple human motion benchmarks, successfully unifying both generation and editing tasks without requiring task-specific modules.
Conclusion: The Motion-Condition-Motion paradigm successfully addresses limitations of existing approaches by providing a versatile, unified framework that enables both human motion generation and editing with enhanced control, knowledge sharing, and generalization across diverse motion-related tasks.
Abstract: Human motion generation and editing are key components of computer vision. However, current approaches in this field tend to offer isolated solutions tailored to specific tasks, which can be inefficient and impractical for real-world applications. While some efforts have aimed to unify motion-related tasks, these methods simply use different modalities as conditions to guide motion generation. Consequently, they lack editing capabilities, fine-grained control, and fail to facilitate knowledge sharing across tasks. To address these limitations and provide a versatile, unified framework capable of handling both human motion generation and editing, we introduce a novel paradigm: Motion-Condition-Motion, which enables the unified formulation of diverse tasks with three concepts: source motion, condition, and target motion. Based on this paradigm, we propose a unified framework, MotionLab, which incorporates rectified flows to learn the mapping from source motion to target motion, guided by the specified conditions. In MotionLab, we introduce the 1) MotionFlow Transformer to enhance conditional generation and editing without task-specific modules; 2) Aligned Rotational Position Encoding to guarantee the time synchronization between source motion and target motion; 3) Task Specified Instruction Modulation; and 4) Motion Curriculum Learning for effective multi-task learning and knowledge sharing across tasks. Notably, our MotionLab demonstrates promising generalization capabilities and inference efficiency across multiple benchmarks for human motion. Our code and additional video results are available at: https://diouo.github.io/motionlab.github.io/.
[196] Pulling Back the Curtain: Unsupervised Adversarial Detection via Contrastive Auxiliary Networks
Eylon Mizrahi, Raz Lapid, Moshe Sipper
Main category: cs.CV
TL;DR: The paper proposes U-CAN, an unsupervised adversarial detection method that uses contrastive auxiliary networks embedded in intermediate layers to distinguish between benign and adversarial inputs without requiring adversarial examples during training.
Details
Motivation: Deep learning models in safety-critical applications are vulnerable to adversarial attacks (imperceptible perturbations that degrade performance), and existing defense mechanisms focus on either enhancing robustness or detecting adversarial inputs separately, creating a need for better detection methods.Method: U-CAN embeds auxiliary networks (comprising projection layers and ArcFace-based linear layers) within selected intermediate layers of target models to refine feature representations and enable unsupervised detection of adversarial behavior in auxiliary feature space.
Result: Comprehensive experiments across CIFAR-10, Mammals, and ImageNet subset using ResNet-50, VGG-16, and ViT architectures show U-CAN achieves superior F1 scores compared to existing unsupervised adversarial detection techniques against four distinct attack methods.
Conclusion: U-CAN provides a scalable and effective framework for enhancing security and reliability of deep learning systems through unsupervised adversarial detection, outperforming existing methods across multiple datasets and architectures.
Abstract: Deep learning models are widely employed in safety-critical applications yet remain susceptible to adversarial attacks – imperceptible perturbations that can significantly degrade model performance. Conventional defense mechanisms predominantly focus on either enhancing model robustness or detecting adversarial inputs independently. In this work, we propose an Unsupervised adversarial detection via Contrastive Auxiliary Networks (U-CAN) to uncover adversarial behavior within auxiliary feature representations, without the need for adversarial examples. U-CAN is embedded within selected intermediate layers of the target model. These auxiliary networks, comprising projection layers and ArcFace-based linear layers, refine feature representations to more effectively distinguish between benign and adversarial inputs. Comprehensive experiments across multiple datasets (CIFAR-10, Mammals, and a subset of ImageNet) and architectures (ResNet-50, VGG-16, and ViT) demonstrate that our method surpasses existing unsupervised adversarial detection techniques, achieving superior F1 scores against four distinct attack methods. The proposed framework provides a scalable and effective solution for enhancing the security and reliability of deep learning systems.
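For intuition, a sketch of one auxiliary head of the kind described (a projection layer followed by an ArcFace-style normalized linear layer); the margin m and scale s values are assumptions, not the paper's settings:
```python
# Standard ArcFace-style head used here to illustrate U-CAN's auxiliary networks.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ArcFaceHead(nn.Module):
    def __init__(self, feat_dim, proj_dim, n_classes, s=30.0, m=0.5):
        super().__init__()
        self.proj = nn.Linear(feat_dim, proj_dim)
        self.weight = nn.Parameter(torch.randn(n_classes, proj_dim))
        self.s, self.m = s, m

    def forward(self, x, labels=None):
        z = F.normalize(self.proj(x), dim=-1)
        w = F.normalize(self.weight, dim=-1)
        cos = z @ w.t()                          # cosine-similarity logits
        if labels is None:
            return self.s * cos                  # inference: plain scaled cosines
        theta = torch.acos(cos.clamp(-1 + 1e-7, 1 - 1e-7))
        target = torch.cos(theta + self.m)       # angular margin on the true class
        onehot = F.one_hot(labels, cos.size(1)).float()
        return self.s * (onehot * target + (1 - onehot) * cos)
```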
[197] ETCH: Generalizing Body Fitting to Clothed Humans via Equivariant Tightness
Boqian Li, Haiwen Feng, Zeyu Cai, Michael J. Black, Yuliang Xiu
Main category: cs.CV
TL;DR: ETCH proposes a novel pipeline for fitting body models to 3D clothed human point clouds using equivariant tightness mapping that encodes cloth-to-body displacement vectors, achieving superior performance over existing methods especially for loose clothing scenarios.
Details
Motivation: Traditional optimization-based approaches for clothed human fitting are sensitive to pose initialization and use complex multi-stage pipelines, while recent learning-based methods struggle with generalization across diverse poses and garment types, creating a need for a more robust and generalizable solution.Method: The method introduces Equivariant Tightness Fitting for Clothed Humans (ETCH), which estimates cloth-to-body surface mapping through locally approximate SE(3) equivariance, encoding tightness as displacement vectors from cloth surface to underlying body. It then uses pose-invariant body features to regress sparse body markers, converting the complex clothed human fitting into a simpler inner-body marker fitting task.
Result: ETCH significantly outperforms state-of-the-art methods on CAPE and 4D-Dress datasets, showing 16.7%-69.5% improvement in body fitting accuracy for loose clothing and average 49.9% improvement in shape accuracy. The equivariant tightness design reduces directional errors by 67.2%-89.8% in one-shot settings with only ~1% training data.
Conclusion: ETCH demonstrates strong generalization capabilities across challenging poses, unseen shapes, loose clothing, and non-rigid dynamics, establishing a new state-of-the-art for clothed human fitting through its novel equivariant tightness encoding approach that simplifies the fitting problem while maintaining high accuracy.
Abstract: Fitting a body to a 3D clothed human point cloud is a common yet challenging task. Traditional optimization-based approaches use multi-stage pipelines that are sensitive to pose initialization, while recent learning-based methods often struggle with generalization across diverse poses and garment types. We propose Equivariant Tightness Fitting for Clothed Humans, or ETCH, a novel pipeline that estimates cloth-to-body surface mapping through locally approximate SE(3) equivariance, encoding tightness as displacement vectors from the cloth surface to the underlying body. Following this mapping, pose-invariant body features regress sparse body markers, simplifying clothed human fitting into an inner-body marker fitting task. Extensive experiments on CAPE and 4D-Dress show that ETCH significantly outperforms state-of-the-art methods – both tightness-agnostic and tightness-aware – in body fitting accuracy on loose clothing (16.7% ~ 69.5%) and shape accuracy (average 49.9%). Our equivariant tightness design can even reduce directional errors by (67.2% ~ 89.8%) in one-shot (or out-of-distribution) settings (~ 1% data). Qualitative results demonstrate strong generalization of ETCH, regardless of challenging poses, unseen shapes, loose clothing, and non-rigid dynamics. We will release the code and models soon for research purposes at https://boqian-li.github.io/ETCH/.
[198] GS-TransUNet: Integrated 2D Gaussian Splatting and Transformer UNet for Accurate Skin Lesion Analysis
Anand Kumar, Kavinder Roghit Kanthen, Josna John
Main category: cs.CV
TL;DR: The paper introduces GS-TransUNet, a unified deep learning model that combines 2D Gaussian splatting with Transformer UNet architecture to simultaneously perform skin lesion segmentation and classification for automated skin cancer diagnosis, achieving superior performance on ISIC-2017 and PH2 datasets.
Details
Motivation: Existing skin lesion segmentation and classification models operate independently, missing potential efficiencies from integrated execution. There is a need to unify skin lesion analysis to achieve faster and more consistent early skin cancer detection while leveraging recent advances in computer vision and deep learning.Method: The authors propose GS-TransUNet (Gaussian Splatting - Transformer UNet), which synergistically combines 2D Gaussian splatting with the Transformer UNet architecture. This creates a unified deep learning model that can efficiently perform dual-function skin lesion classification and segmentation simultaneously.
Result: The model demonstrates superior performance compared to existing state-of-the-art models across multiple metrics when evaluated on ISIC-2017 and PH2 datasets using 5-fold cross-validation. The approach shows significant advancements in both segmentation and classification precision.
Conclusion: The integration of segmentation and classification tasks in a unified model sets new benchmarks in the field and highlights the potential for further research into multi-task medical image analysis methodologies, promising enhancements in automated diagnostic systems for skin cancer detection.
Abstract: We can achieve fast and consistent early skin cancer detection with recent developments in computer vision and deep learning techniques. However, the existing skin lesion segmentation and classification prediction models run independently, thus missing potential efficiencies from their integrated execution. To unify skin lesion analysis, our paper presents the Gaussian Splatting - Transformer UNet (GS-TransUNet), a novel approach that synergistically combines 2D Gaussian splatting with the Transformer UNet architecture for automated skin cancer diagnosis. Our unified deep learning model efficiently delivers dual-function skin lesion classification and segmentation for clinical diagnosis. Evaluated on ISIC-2017 and PH2 datasets, our network demonstrates superior performance compared to existing state-of-the-art models across multiple metrics through 5-fold cross-validation. Our findings illustrate significant advancements in the precision of segmentation and classification. This integration sets new benchmarks in the field and highlights the potential for further research into multi-task medical image analysis methodologies, promising enhancements in automated diagnostic systems.
[199] One-for-More: Continual Diffusion Model for Anomaly Detection
Xiaofan Li, Xin Tan, Zhuo Chen, Zhizhong Zhang, Ruixin Zhang, Rizen Guo, Guannan Jiang, Yulong Chen, Yanyun Qu, Lizhuang Ma, Yuan Xie
Main category: cs.CV
TL;DR: This paper proposes a continual diffusion model for anomaly detection that addresses “faithfulness hallucination” and “catastrophic forgetting” issues through gradient projection and iterative singular value decomposition, achieving state-of-the-art performance on continual anomaly detection benchmarks.
Details
Motivation: The paper is motivated by two key problems in applying diffusion models to anomaly detection: (1) "faithfulness hallucination" and "catastrophic forgetting" which prevent the model from handling unpredictable pattern increments, and (2) the need for a unified generative framework that can continuously learn new anomaly patterns without forgetting previous knowledge.Method: The authors propose a continual diffusion model with three key components: (1) gradient projection for stable continual learning that regularizes model updates to protect learned knowledge, (2) an iterative singular value decomposition method based on transitive property of linear representation to reduce memory costs from the Markov process, and (3) an anomaly-masked network to enhance the condition mechanism and prevent overfitting to normal images.
Result: The proposed method achieves first place in 17 out of 18 settings on MVTec and VisA datasets for continual anomaly detection, demonstrating superior performance compared to existing approaches while maintaining low memory consumption and minimal performance loss.
Conclusion: The paper successfully addresses the major limitations of diffusion models in continual anomaly detection by introducing gradient projection with memory-efficient SVD decomposition and anomaly-masked conditioning, enabling effective continual learning without catastrophic forgetting while achieving state-of-the-art performance on benchmark datasets.
Abstract: With the rise of generative models, there is a growing interest in unifying all tasks within a generative framework. Anomaly detection methods also fall into this scope and utilize diffusion models to generate or reconstruct normal samples when given arbitrary anomaly images. However, our study found that the diffusion model suffers from severe “faithfulness hallucination” and “catastrophic forgetting”, which can’t meet the unpredictable pattern increments. To mitigate the above problems, we propose a continual diffusion model that uses gradient projection to achieve stable continual learning. Gradient projection deploys a regularization on the model updating by modifying the gradient towards the direction protecting the learned knowledge. But as a double-edged sword, it also requires huge memory costs brought by the Markov process. Hence, we propose an iterative singular value decomposition method based on the transitive property of linear representation, which consumes tiny memory and incurs almost no performance loss. Finally, considering the risk of “over-fitting” to normal images of the diffusion model, we propose an anomaly-masked network to enhance the condition mechanism of the diffusion model. For continual anomaly detection, ours achieves first place in 17/18 settings on MVTec and VisA. Code is available at https://github.com/FuNz-0/One-for-More
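The gradient-projection mechanism named in the abstract can be illustrated generically: each update is projected to be orthogonal to a basis of directions that encode previously learned knowledge. The basis construction (the paper's iterative SVD) is not reproduced here; `basis` is an assumed input.
```python
# Generic gradient projection for continual learning (sketch).
import torch

def project_gradient(grad, basis):
    """grad: (d,) flattened gradient; basis: (d, k) orthonormal columns
    spanning directions that encode old-task knowledge."""
    # Remove the component of the gradient lying in the protected subspace.
    return grad - basis @ (basis.t() @ grad)
```
In practice such a basis would be distilled per layer from past activations, which is where the memory savings of an iterative SVD matter.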
[200] MGSR: 2D/3D Mutual-boosted Gaussian Splatting for High-fidelity Surface Reconstruction under Various Light Conditions
Qingyuan Zhou, Yuehu Gong, Weidong Yang, Jiaze Li, Yeqi Luo, Baixin Xu, Shuhao Li, Ben Fei, Ying He
Main category: cs.CV
TL;DR: MGSR proposes a dual-branch approach combining 2D-GS and 3D-GS with mutual supervision to simultaneously achieve high-quality novel view synthesis and accurate surface reconstruction in 3D Gaussian Splatting, breaking the traditional trade-off between rendering quality and reconstruction accuracy.
Details
Motivation: Existing 3D Gaussian Splatting methods face a fundamental trade-off where GS-based rendering methods struggle with diverse lighting conditions and fail to produce accurate surfaces, while GS-based reconstruction methods compromise rendering quality. The paper seeks to address whether rendering and reconstruction must always involve this trade-off.Method: MGSR introduces a dual-branch architecture with 2D-GS and 3D-GS branches that provide mutual supervision. The 2D-GS branch specializes in surface reconstruction and provides geometry information to the 3D-GS branch. The 3D-GS branch uses a geometry-guided illumination decomposition module to capture reflected and transmitted components for realistic rendering. Both branches undergo alternating optimization with independent warm-up phases and early stopping strategies.
Result: MGSR demonstrates strong performance in both rendering and surface reconstruction tasks across diverse synthetic and real-world datasets at object and scene levels, successfully achieving high-quality results in both domains without the traditional trade-off.
Conclusion: The paper successfully demonstrates that the trade-off between rendering quality and reconstruction accuracy in 3D Gaussian Splatting can be overcome through mutual supervision between specialized 2D-GS and 3D-GS branches, enabling simultaneous high-fidelity novel view synthesis and accurate surface reconstruction.
Abstract: Novel view synthesis (NVS) and surface reconstruction (SR) are essential tasks in 3D Gaussian Splatting (3D-GS). Despite recent progress, these tasks are often addressed independently, with GS-based rendering methods struggling under diverse light conditions and failing to produce accurate surfaces, while GS-based reconstruction methods frequently compromise rendering quality. This raises a central question: must rendering and reconstruction always involve a trade-off? To address this, we propose MGSR, a 2D/3D Mutual-boosted Gaussian splatting for Surface Reconstruction that enhances both rendering quality and 3D reconstruction accuracy. MGSR introduces two branches–one based on 2D-GS and the other on 3D-GS. The 2D-GS branch excels in surface reconstruction, providing precise geometry information to the 3D-GS branch. Leveraging this geometry, the 3D-GS branch employs a geometry-guided illumination decomposition module that captures reflected and transmitted components, enabling realistic rendering under varied light conditions. Using the transmitted component as supervision, the 2D-GS branch also achieves high-fidelity surface reconstruction. Throughout the optimization process, the 2D-GS and 3D-GS branches undergo alternating optimization, providing mutual supervision. Prior to this, each branch completes an independent warm-up phase, with an early stopping strategy implemented to reduce computational costs. We evaluate MGSR on a diverse set of synthetic and real-world datasets, at both object and scene levels, demonstrating strong performance in rendering and surface reconstruction. Code is available at https://github.com/TsingyuanChou/MGSR.
[201] ViP²-CLIP: Visual-Perception Prompting with Unified Alignment for Zero-Shot Anomaly Detection
Ziteng Yang, Jingzehua Xu, Yanshu Li, Zepeng Li, Yeqiang Wang, Xinghui Li
Main category: cs.CV
TL;DR: ViP²-CLIP introduces a Visual-Perception Prompting mechanism for zero-shot anomaly detection that adaptively generates fine-grained textual prompts from visual context, eliminating the need for manual templates and achieving state-of-the-art performance across 15 benchmarks.
Details
Motivation: Existing CLIP-based zero-shot anomaly detection methods rely on handcrafted prompts (high engineering costs, limited coverage) or static learnable prompts (fail to adapt to diverse anomaly types). CLIP's sensitivity to exact class name wording further constrains prompting strategies, especially when category labels are ambiguous or privacy-constrained.Method: The paper proposes ViP²-CLIP with a Visual-Perception Prompting (ViP-Prompt) mechanism that fuses global and multi-scale local visual context to adaptively generate fine-grained textual prompts, removing dependence on manual templates and class-name priors.
Result: ViP²-CLIP achieves state-of-the-art performance on 15 industrial and medical benchmarks, demonstrating robust cross-domain generalization capabilities for zero-shot anomaly detection tasks.
Conclusion: The Visual-Perception Prompting mechanism successfully addresses the limitations of existing CLIP-based anomaly detection methods by enabling adaptive, fine-grained prompt generation from visual context, making it particularly valuable for scenarios with ambiguous labels or privacy constraints.
Abstract: Zero-shot anomaly detection (ZSAD) aims to detect anomalies without any target domain training samples, relying solely on external auxiliary data. Existing CLIP-based methods attempt to activate the model’s ZSAD potential via handcrafted or static learnable prompts. The former incur high engineering costs and limited semantic coverage, whereas the latter apply identical descriptions across diverse anomaly types and thus fail to adapt to complex variations. Furthermore, since CLIP is originally pretrained on large-scale classification tasks, its anomaly segmentation quality is highly sensitive to the exact wording of class names, severely constraining prompting strategies that depend on class labels. To address these challenges, we introduce ViP²-CLIP. The key insight of ViP²-CLIP is a Visual-Perception Prompting (ViP-Prompt) mechanism, which fuses global and multi-scale local visual context to adaptively generate fine-grained textual prompts, eliminating manual templates and class-name priors. This design enables our model to focus on precise abnormal regions, making it particularly valuable when category labels are ambiguous or privacy-constrained. Extensive experiments on 15 industrial and medical benchmarks demonstrate that ViP²-CLIP achieves state-of-the-art performance and robust cross-domain generalization.
[202] USP: Unified Self-Supervised Pretraining for Image Generation and Understanding
Xiangxiang Chu, Renda Li, Yong Wang
Main category: cs.CV
TL;DR: The paper proposes Unified Self-supervised Pretraining (USP), a framework that initializes diffusion models using masked latent modeling in VAE latent space to improve both convergence speed and generation quality while maintaining performance on understanding tasks.
Details
Motivation: Recent studies show the interplay between diffusion models and representation learning, but transferring pretrained weights from vision models to diffusion models is challenging due to input mismatches and latent space usage. There's a need for a unified approach that can leverage self-supervised pretraining for diffusion models effectively.Method: The authors propose Unified Self-supervised Pretraining (USP), which initializes diffusion models through masked latent modeling in a Variational Autoencoder (VAE) latent space. This approach addresses the input mismatch problem and enables effective transfer of pretrained representations.
Result: USP achieves comparable performance on understanding tasks while significantly improving the convergence speed and generation quality of diffusion models compared to standard initialization methods.
Conclusion: The USP framework successfully bridges the gap between self-supervised vision models and diffusion models, providing a unified pretraining approach that enhances both convergence efficiency and generation quality without sacrificing understanding task performance.
Abstract: Recent studies have highlighted the interplay between diffusion models and representation learning. Intermediate representations from diffusion models can be leveraged for downstream visual tasks, while self-supervised vision models can enhance the convergence and generation quality of diffusion models. However, transferring pretrained weights from vision models to diffusion models is challenging due to input mismatches and the use of latent spaces. To address these challenges, we propose Unified Self-supervised Pretraining (USP), a framework that initializes diffusion models via masked latent modeling in a Variational Autoencoder (VAE) latent space. USP achieves comparable performance in understanding tasks while significantly improving the convergence speed and generation quality of diffusion models. Our code will be publicly available at https://github.com/AMAP-ML/USP.
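A minimal sketch of a masked-latent-modeling pretraining step in a VAE latent space; the tokenized `vae.encode` interface, masking ratio, and zero-fill masking are assumptions rather than USP's exact design:
```python
# Sketch: mask VAE latent tokens and train a predictor to reconstruct them.
import torch

def masked_latent_step(vae, predictor, images, mask_ratio=0.75):
    with torch.no_grad():
        z = vae.encode(images)                   # assumed: (B, N, D) latent tokens
    mask = torch.rand(z.shape[:2], device=z.device) < mask_ratio
    z_in = z.masked_fill(mask.unsqueeze(-1), 0.0)    # zero out masked tokens
    z_pred = predictor(z_in)
    # Reconstruction loss only on the masked positions.
    return ((z_pred - z) ** 2)[mask].mean()
```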
[203] DOFA-CLIP: Multimodal Vision-Language Foundation Models for Earth Observation
Zhitong Xiong, Yi Wang, Weikang Yu, Adam J Stewart, Jie Zhao, Nils Lehmann, Thomas Dujardin, Zhenghang Yuan, Pedram Ghamisi, Xiao Xiang Zhu
Main category: cs.CV
TL;DR: DOFA-CLIP is a unified vision-language foundation model that dynamically adapts to multiple Earth observation modalities (optical, radar, multispectral, hyperspectral) through a single Transformer backbone, achieving state-of-the-art zero-shot performance across diverse EO benchmarks.
Details
Motivation: Current vision-language models in Earth observation are limited to individual modalities, restricting their generalization and scalability across diverse tasks. There's a need for a unified model that can handle multiple EO modalities with flexible spectral configurations.Method: The paper introduces three key components: 1) GeoLangBind-2M - a large-scale EO image-text dataset covering six heterogeneous modalities, 2) VECT (Vision-models Enhanced Contrastive Text-image pretraining) - a training strategy that enhances spatial awareness using multiple vision foundation models, and 3) MaKA (Modality-aware Knowledge Agglomeration) module for modality-specific feature distillation refinement.
Result: DOFA-CLIP achieves state-of-the-art zero-shot performance across a wide range of EO benchmarks, demonstrating effectiveness on unseen modalities and diverse input spectral bands. The model successfully generalizes across multiple Earth observation tasks.
Conclusion: The work establishes a scalable foundation for multimodal Earth observation understanding and opens new avenues for integrating heterogeneous EO data with large language models, providing a unified approach to handle diverse Earth observation modalities.
Abstract: Earth observation (EO) spans a broad spectrum of modalities, including optical, radar, multispectral, and hyperspectral data, each capturing distinct environmental signals. However, current vision-language models in EO, particularly CLIP-based variants, remain confined to individual modalities, limiting generalization and scalability across diverse tasks. We present DOFA-CLIP (Dynamic-One-For-All CLIP), a unified vision-language foundation model that dynamically adapts to EO modalities with flexible spectral configurations through a single Transformer backbone. Our approach introduces three key contributions: 1) the construction of GeoLangBind-2M, a large-scale EO image-text dataset covering six heterogeneous modalities with rich natural language descriptions; 2) a novel training strategy called VECT (Vision-models Enhanced Contrastive Text-image pretraining), which enhances the spatial awareness of CLIP features with multiple vision foundation models; and 3) a Modality-aware Knowledge Agglomeration (MaKA) module that refines feature distillation with modality-specific awareness. DOFA-CLIP achieves state-of-the-art zero-shot performance across a wide range of EO benchmarks, including unseen modalities and a diverse number of input spectral bands. Together, these contributions establish a scalable foundation for multimodal EO understanding and open new avenues for integrating heterogeneous EO data with large language models. Code and datasets are publicly available.
[204] Illuminating Darkness: Learning to Enhance Low-light Images In-the-Wild
S M A Sharif, Abdur Rehman, Zain Ul Abidin, Fayaz Ali Dharejo, Radu Timofte, Rizwan Ali Naqvi
Main category: cs.CV
TL;DR: This paper introduces the Low-Light Smartphone Dataset (LSD), a large-scale 4K+ dataset with 6,425 precisely aligned low/normal-light image pairs, and proposes TFFormer, a hybrid model that separately processes luminance and chrominance components with cross-attention fusion for enhanced low-light image enhancement, achieving state-of-the-art results.
Details
Motivation: Single-shot low-light image enhancement faces challenges due to limited availability of diverse, real-world paired datasets. Existing datasets lack the scale and quality needed for robust model training, particularly for real-world scenarios with varying lighting conditions.Method: The authors create LSD dataset with 6,425 aligned low/normal-light pairs from 8,000+ scenes (0.1-200 lux) and propose TFFormer, a hybrid model that: (1) encodes luminance and chrominance separately to reduce color-structure entanglement, (2) uses cross-attention-driven joint decoder for context-aware LC fusion, (3) employs LC refinement and LC-guided supervision for improved perceptual fidelity.
Result: TFFormer achieves state-of-the-art performance on LSD with +2.45 dB PSNR improvement. The method also significantly enhances downstream vision tasks, showing +6.80 mAP improvement on ExDark dataset for low-light object detection tasks.
Conclusion: The combination of the large-scale, high-quality LSD dataset and the TFFormer architecture successfully addresses key challenges in low-light image enhancement, demonstrating superior performance in both image quality metrics and practical downstream applications like object detection.
Abstract: Single-shot low-light image enhancement (SLLIE) remains challenging due to the limited availability of diverse, real-world paired datasets. To bridge this gap, we introduce the Low-Light Smartphone Dataset (LSD), a large-scale, high-resolution (4K+) dataset collected in the wild across a wide range of challenging lighting conditions (0.1 to 200 lux). LSD contains 6,425 precisely aligned low and normal-light image pairs, selected from over 8,000 dynamic indoor and outdoor scenes through multi-frame acquisition and expert evaluation. To evaluate generalization and aesthetic quality, we collect 2,117 unpaired low-light images from previously unseen devices. To fully exploit LSD, we propose TFFormer, a hybrid model that encodes luminance and chrominance (LC) separately to reduce color-structure entanglement. We further propose a cross-attention-driven joint decoder for context-aware fusion of LC representations, along with LC refinement and LC-guided supervision to significantly enhance perceptual fidelity and structural consistency. TFFormer achieves state-of-the-art results on LSD (+2.45 dB PSNR) and substantially improves downstream vision tasks, such as low-light object detection (+6.80 mAP on ExDark).
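As a concrete reference point for the luminance/chrominance separation, here is a standard BT.601 YCbCr split; the paper's exact LC decomposition may differ:
```python
# Standard BT.601 luminance/chrominance split (sketch, not TFFormer's code).
import torch

def split_lc(rgb):
    """rgb: (B, 3, H, W) in [0, 1]; returns luminance (B,1,H,W), chroma (B,2,H,W)."""
    r, g, b = rgb[:, 0:1], rgb[:, 1:2], rgb[:, 2:3]
    y = 0.299 * r + 0.587 * g + 0.114 * b       # luminance carries structure
    cb = 0.5 + (b - y) * 0.564                  # chroma carries color
    cr = 0.5 + (r - y) * 0.713
    return y, torch.cat([cb, cr], dim=1)
```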
[205] CogStream: Context-guided Streaming Video Question Answering
Zicheng Zhao, Kangyu Wang, Shijie Li, Rui Qian, Weiyao Lin, Huabin Liu
Main category: cs.CV
TL;DR: This paper introduces CogStream, a challenging task for streaming video reasoning that requires models to identify relevant historical context, and proposes CogReasoner as a baseline model that uses visual stream compression and dialogue retrieval to efficiently process streaming videos.
Details
Motivation: Existing Video Large Language Models face computational burden and distraction from irrelevant context when processing streaming videos, as they feed all available historical contextual information into the models, making real-world streaming video reasoning inefficient and less accurate.Method: The authors propose CogReasoner, a baseline model that tackles streaming video reasoning through two key components: visual stream compression to reduce computational load and historical dialogue retrieval to identify the most relevant contextual information from previous interactions.
Result: Extensive experiments demonstrate the effectiveness of the proposed method. The authors also created a densely annotated dataset with extensive and hierarchical question-answer pairs generated through a semi-automatic pipeline to support the CogStream task.
Conclusion: The paper successfully addresses the challenge of streaming video reasoning by introducing the CogStream task and CogReasoner model, which efficiently processes streaming videos by focusing on relevant historical context rather than all available information, proving effective through experimental validation.
Abstract: Despite advancements in Video Large Language Models (Vid-LLMs) improving multimodal understanding, challenges persist in streaming video reasoning due to its reliance on contextual information. Existing paradigms feed all available historical contextual information into Vid-LLMs, resulting in a significant computational burden for visual data processing. Furthermore, the inclusion of irrelevant context distracts models from key details. This paper introduces a challenging task called Context-guided Streaming Video Reasoning (CogStream), which simulates real-world streaming video scenarios, requiring models to identify the most relevant historical contextual information to deduce answers for questions about the current stream. To support CogStream, we present a densely annotated dataset featuring extensive and hierarchical question-answer pairs, generated by a semi-automatic pipeline. Additionally, we present CogReasoner as a baseline model. It efficiently tackles this task by leveraging visual stream compression and historical dialogue retrieval. Extensive experiments prove the effectiveness of this method. The project is released on https://github.com/LiamZhao326/CogStream.
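A hedged sketch of the historical-dialogue-retrieval idea: embed past question-answer turns, score them against the current question by cosine similarity, and keep the top-k. The embedding model and the value of k are assumptions:
```python
# Sketch of relevant-history retrieval for streaming video QA.
import numpy as np

def retrieve_history(query_emb, history_embs, k=3):
    """query_emb: (d,); history_embs: (n, d) embeddings of past Q-A turns."""
    sims = history_embs @ query_emb / (
        np.linalg.norm(history_embs, axis=1) * np.linalg.norm(query_emb) + 1e-8
    )
    return np.argsort(-sims)[:k]   # indices of the most relevant past turns
```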
[206] Balanced Image Stylization with Style Matching Score
Yuxin Jiang, Liming Jiang, Shuai Yang, Jia-Wei Liu, Ivor Tsang, Mike Zheng Shou
Main category: cs.CV
TL;DR: The paper introduces Style Matching Score (SMS), a novel optimization method for image stylization using diffusion models that reframes stylization as a style distribution matching problem, achieving better balance between style transfer and content preservation through progressive frequency-domain regularization and semantic-aware gradient refinement.
Details
Motivation: Balancing effective style transfer with content preservation remains a long-standing challenge in image stylization. Existing methods struggle to achieve optimal trade-offs between applying artistic styles while maintaining the essential content structure of the original image.Method: The method reframes image stylization as a style distribution matching problem using off-the-shelf style-dependent LoRAs with carefully designed score functions. It employs Progressive Spectrum Regularization in the frequency domain to guide stylization from low-frequency layouts to high-frequency details, and uses Semantic-Aware Gradient Refinement with relevance maps from diffusion semantic priors to selectively stylize semantically important regions. The optimization extends from pixel space to parameter space for efficient feedforward generators.
Result: SMS effectively balances style alignment and content preservation, outperforming state-of-the-art approaches as verified by extensive experiments. The method is readily applicable to lightweight feedforward generators for efficient one-step stylization.
Conclusion: The proposed Style Matching Score method successfully addresses the fundamental challenge in image stylization by introducing a distribution matching framework combined with progressive frequency-domain regularization and semantic-aware refinement, achieving superior performance compared to existing state-of-the-art methods while enabling efficient one-step stylization.
Abstract: We present Style Matching Score (SMS), a novel optimization method for image stylization with diffusion models. Balancing effective style transfer with content preservation is a long-standing challenge. Unlike existing efforts, our method reframes image stylization as a style distribution matching problem. The target style distribution is estimated from off-the-shelf style-dependent LoRAs via carefully designed score functions. To preserve content information adaptively, we propose Progressive Spectrum Regularization, which operates in the frequency domain to guide stylization progressively from low-frequency layouts to high-frequency details. In addition, we devise a Semantic-Aware Gradient Refinement technique that leverages relevance maps derived from diffusion semantic priors to selectively stylize semantically important regions. The proposed optimization formulation extends stylization from pixel space to parameter space, readily applicable to lightweight feedforward generators for efficient one-step stylization. SMS effectively balances style alignment and content preservation, outperforming state-of-the-art approaches, verified by extensive experiments.
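A speculative sketch of a progressive frequency-domain regularizer in the spirit of Progressive Spectrum Regularization: a low-pass mask over the FFTs of the stylized and reference images that widens as optimization proceeds. The schedule and loss form are assumptions:
```python
# Sketch: match only low frequencies early, widening the band over time.
import torch

def spectrum_loss(img, ref, progress):
    """img, ref: (B, C, H, W); progress in [0, 1] over the optimization."""
    f_img = torch.fft.fftshift(torch.fft.fft2(img), dim=(-2, -1))
    f_ref = torch.fft.fftshift(torch.fft.fft2(ref), dim=(-2, -1))
    h, w = img.shape[-2:]
    yy, xx = torch.meshgrid(
        torch.linspace(-1, 1, h, device=img.device),
        torch.linspace(-1, 1, w, device=img.device),
        indexing="ij",
    )
    cutoff = 1.5 * progress                     # assumed linear low-pass schedule
    mask = ((xx ** 2 + yy ** 2).sqrt() <= cutoff).to(img.dtype)
    return ((f_img - f_ref).abs() * mask).mean()
```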
[207] PEBench: A Fictitious Dataset to Benchmark Machine Unlearning for Multimodal Large Language Models
Zhaopan Xu, Pengfei Zhou, Weidong Tang, Jiaxin Ai, Wangbo Zhao, Kai Wang, Xiaojiang Peng, Wenqi Shao, Hongxun Yao, Kaipeng Zhang
Main category: cs.CV
TL;DR: This paper introduces PEBench, a novel benchmark for evaluating machine unlearning in multimodal large language models (MLLMs), addressing privacy concerns by enabling selective removal of personal entities and event concepts while revealing challenges like cross-concept interference.
Details
Motivation: Existing MLLMs rely on vast internet data raising privacy and security concerns, and current machine unlearning evaluation benchmarks are inadequate, lacking comprehensive scope and focusing narrowly on entities while overlooking broader visual concepts and their semantic coupling.
Method: The authors develop PEBench, a benchmark featuring a fictitious dataset of personal entities and corresponding event scenes to evaluate machine unlearning across distinct yet entangled concepts. They evaluate five MU methods using this benchmark to assess their strengths and weaknesses.
Result: The evaluation reveals that unlearning one concept can unintentionally degrade performance on related concepts within the same image (cross-concept interference). The study demonstrates the difficulty of simultaneously unlearning person and event concepts and shows varying performance across different MU methods.
Conclusion: The paper successfully identifies cross-concept interference as a key challenge in MLLM unlearning and proposes an effective method to mitigate conflicting objectives when unlearning multiple concept types simultaneously. PEBench provides a comprehensive evaluation framework for future machine unlearning research in MLLMs.
Abstract: Multimodal large language models (MLLMs) have achieved remarkable success in vision-language tasks, but their reliance on vast, internet-sourced data raises significant privacy and security concerns. Machine unlearning (MU) has emerged as a critical technique to address these issues, enabling the selective removal of targeted information from pre-trained models without costly retraining. However, the evaluation of MU for MLLMs remains inadequate. Existing benchmarks often lack a comprehensive scope, focusing narrowly on entities while overlooking the unlearning of broader visual concepts and the inherent semantic coupling between them. To bridge this gap, we introduce PEBench, a novel benchmark designed to facilitate a thorough assessment of MU in MLLMs. PEBench features a fictitious dataset of personal entities and corresponding event scenes to evaluate unlearning across these distinct yet entangled concepts. We leverage this benchmark to evaluate five MU methods, revealing their unique strengths and weaknesses. Our findings show that unlearning one concept can unintentionally degrade performance on related concepts within the same image, a challenge we term cross-concept interference. Furthermore, we demonstrate the difficulty of unlearning person and event concepts simultaneously and propose an effective method to mitigate these conflicting objectives. The source code and benchmark are publicly available at https://pebench.github.io.
[208] Towards Accurate and Efficient 3D Object Detection for Autonomous Driving: A Mixture of Experts Computing System on Edge
Linshen Liu, Boyan Su, Junyue Jiang, Guanlin Wu, Cong Guo, Ceyu Xu, Hao Frank Yang
Main category: cs.CV
TL;DR: EMC2 is an edge-based Mixture of Experts system for autonomous vehicles that achieves both low-latency and high-accuracy 3D object detection by fusing LiDAR and camera data with scenario-aware routing and hardware-software optimizations.
Details
Motivation: Autonomous vehicles require 3D object detection systems that can simultaneously achieve low latency and high accuracy on resource-constrained edge devices, which conventional approaches fail to deliver effectively.
Method: The system uses a scenario-aware MoE architecture with an adaptive multimodal data bridge for multi-scale preprocessing of LiDAR and camera inputs, followed by dynamic feature routing to specialized expert models based on object visibility and distance, combined with joint hardware-software optimizations including resource utilization optimization and computational graph simplification.
Result: On KITTI dataset, EMC2 achieved 3.58% average accuracy improvement and 159.06% inference speedup compared to 15 baseline methods on Jetson platforms, with similar performance gains demonstrated on nuScenes dataset.
Conclusion: EMC2 successfully demonstrates that edge-based MoE collaborative computing can advance reliable, real-time 3D object detection for autonomous vehicles by effectively balancing accuracy and latency requirements through intelligent multimodal fusion and optimization strategies.
Abstract: This paper presents Edge-based Mixture of Experts (MoE) Collaborative Computing (EMC2), an optimal computing system designed for autonomous vehicles (AVs) that simultaneously achieves low-latency and high-accuracy 3D object detection. Unlike conventional approaches, EMC2 incorporates a scenario-aware MoE architecture specifically optimized for edge platforms. By effectively fusing LiDAR and camera data, the system leverages the complementary strengths of sparse 3D point clouds and dense 2D images to generate robust multimodal representations. To enable this, EMC2 employs an adaptive multimodal data bridge that performs multi-scale preprocessing on sensor inputs, followed by a scenario-aware routing mechanism that dynamically dispatches features to dedicated expert models based on object visibility and distance. In addition, EMC2 integrates joint hardware-software optimizations, including hardware resource utilization optimization and computational graph simplification, to ensure efficient and real-time inference on resource-constrained edge devices. Experiments on open-source benchmarks clearly show the EMC2 advancements as an end-to-end system. On the KITTI dataset, it achieves an average accuracy improvement of 3.58% and a 159.06% inference speedup compared to 15 baseline methods on Jetson platforms, with similar performance gains on the nuScenes dataset, highlighting its capability to advance reliable, real-time 3D object detection tasks for AVs. The official implementation is available at https://github.com/LinshenLiu622/EMC2.
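The routing rule, dispatching features to dedicated experts by object visibility and distance, is easy to picture in code. Below is a deliberately simplified Python sketch; the thresholds, expert names, and feature interface are invented for illustration and do not come from the paper:

```python
def route_to_expert(objects, experts):
    """Toy scenario-aware MoE router: send each detected object to an
    expert chosen by its visibility and distance. `experts` maps the
    hypothetical keys 'occluded', 'long_range', 'near_field' to callables."""
    results = []
    for obj in objects:
        if obj["visibility"] < 0.3:          # heavily occluded object
            expert = experts["occluded"]
        elif obj["distance_m"] > 40.0:       # far-away object
            expert = experts["long_range"]
        else:                                # close, well-visible object
            expert = experts["near_field"]
        results.append(expert(obj["features"]))
    return results
```

In a real deployment each expert would be a specialized detection head and the router would operate on fused LiDAR-camera features, but the dispatch logic has this general shape.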
[209] FiVE: A Fine-grained Video Editing Benchmark for Evaluating Emerging Diffusion and Rectified Flow Models
Minghan Li, Chenxi Xie, Yichen Wu, Lei Zhang, Mengyu Wang
Main category: cs.CV
TL;DR: This paper introduces FiVE, a comprehensive benchmark for fine-grained text-to-video editing that includes 100 videos with 420 object-level editing prompts, and adapts rectified flow models with FlowEdit for training-free video editing, showing superior performance over diffusion-based approaches.
Details
Motivation: The lack of standardized benchmarks for text-to-video editing evaluation has led to inconsistent performance claims and inability to assess model sensitivity to hyperparameters. There is a critical need for fine-grained video editing capabilities that enable precise, object-level modifications while maintaining temporal consistency and context.
Method: The authors create the FiVE benchmark with 74 real-world and 26 generated videos featuring 6 fine-grained editing types and 420 object-level editing prompt pairs with masks. They adapt rectified flow T2V models (Pyramid-Flow and Wan2.1) using FlowEdit to create training-free and inversion-free editing models (Pyramid-Edit and Wan-Edit). They also introduce FiVE-Acc, a novel metric using Vision-Language Models to assess fine-grained editing success.
Result: Evaluation of 5 diffusion-based and 2 rectified flow-based editing methods using 15 metrics shows that RF-based editing significantly outperforms diffusion-based methods. Wan-Edit achieves the best overall performance and exhibits the least sensitivity to hyperparameters across metrics covering background preservation, text-video similarity, temporal consistency, video quality, and runtime.
Conclusion: The FiVE benchmark provides a standardized evaluation framework for fine-grained video editing, and rectified flow-based methods demonstrate superior performance over diffusion-based approaches. The proposed FlowEdit adaptation enables effective training-free video editing, with Wan-Edit showing the most robust and consistent performance across various evaluation metrics.
Abstract: Numerous text-to-video (T2V) editing methods have emerged recently, but the lack of a standardized benchmark for fair evaluation has led to inconsistent claims and an inability to assess model sensitivity to hyperparameters. Fine-grained video editing is crucial for enabling precise, object-level modifications while maintaining context and temporal consistency. To address this, we introduce FiVE, a Fine-grained Video Editing Benchmark for evaluating emerging diffusion and rectified flow models. Our benchmark includes 74 real-world videos and 26 generated videos, featuring 6 fine-grained editing types, 420 object-level editing prompt pairs, and their corresponding masks. Additionally, we adapt the latest rectified flow (RF) T2V generation models, Pyramid-Flow and Wan2.1, by introducing FlowEdit, resulting in training-free and inversion-free video editing models Pyramid-Edit and Wan-Edit. We evaluate five diffusion-based and two RF-based editing methods on our FiVE benchmark using 15 metrics, covering background preservation, text-video similarity, temporal consistency, video quality, and runtime. To further enhance object-level evaluation, we introduce FiVE-Acc, a novel metric leveraging Vision-Language Models (VLMs) to assess the success of fine-grained video editing. Experimental results demonstrate that RF-based editing significantly outperforms diffusion-based methods, with Wan-Edit achieving the best overall performance and exhibiting the least sensitivity to hyperparameters. More video demo available on the anonymous website: https://sites.google.com/view/five-benchmark
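FiVE-Acc uses Vision-Language Models to judge editing success. A minimal sketch of how such a metric could be computed; the prompt wording and the `ask_vlm` callable are placeholders, not the benchmark's actual protocol:

```python
def five_acc(edited_frames, target_prompt, ask_vlm):
    """Sketch of a VLM-judged editing-success rate in the spirit of FiVE-Acc.

    edited_frames: sampled frames from the edited video.
    ask_vlm: hypothetical callable (image, question) -> 'yes'/'no' answer.
    Returns the fraction of frames the VLM judges as showing the edit.
    """
    question = f"Does this image show: {target_prompt}? Answer yes or no."
    answers = [ask_vlm(frame, question) for frame in edited_frames]
    hits = sum(a.strip().lower().startswith("yes") for a in answers)
    return hits / len(answers)
```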
[210] Learning Streaming Video Representation via Multitask Training
Yibin Yan, Jilan Xu, Shangzhe Di, Yikun Liu, Yudi Shi, Qirui Chen, Zeqian Li, Yifei Huang, Weidi Xie
Main category: cs.CV
TL;DR: The paper presents StreamFormer, a novel streaming video backbone that incorporates causal temporal attention into pre-trained vision transformers to enable efficient real-time video understanding for applications like embodied AI and autonomous driving.
Details
Motivation: Real-time applications like embodied AI and autonomous driving require streaming video understanding that can process video streams frame by frame, preserve historical information, and make low-latency decisions, which differs from offline video understanding approaches.
Method: The authors develop StreamFormer by incorporating causal temporal attention into a pre-trained vision transformer and propose a multitask visual-language alignment framework to unify diverse spatial-temporal video understanding tasks for training.
Result: StreamFormer achieves competitive results on online action detection, online video instance segmentation, and video question answering while maintaining efficiency, demonstrating its effectiveness for real-time applications.
Conclusion: StreamFormer successfully enables efficient streaming video processing while maintaining image representation capability, learning global semantics, temporal dynamics, and fine-grained spatial relationships simultaneously, making it suitable for real-time video understanding applications.
Abstract: Understanding continuous video streams plays a fundamental role in real-time applications including embodied AI and autonomous driving. Unlike offline video understanding, streaming video understanding requires the ability to process video streams frame by frame, preserve historical information, and make low-latency decisions. To address these challenges, our main contributions are three-fold. (i) We develop a novel streaming video backbone, termed as StreamFormer, by incorporating causal temporal attention into a pre-trained vision transformer. This enables efficient streaming video processing while maintaining image representation capability. (ii) To train StreamFormer, we propose to unify diverse spatial-temporal video understanding tasks within a multitask visual-language alignment framework. Hence, StreamFormer learns global semantics, temporal dynamics, and fine-grained spatial relationships simultaneously. (iii) We conduct extensive experiments on online action detection, online video instance segmentation, and video question answering. StreamFormer achieves competitive results while maintaining efficiency, demonstrating its potential for real-time applications.
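The core architectural change, causal temporal attention over frames, is standard enough to sketch. A minimal PyTorch version assuming (batch, frames, dim) inputs; StreamFormer's actual integration into a pre-trained vision transformer is more involved:

```python
import torch

def causal_temporal_attention(q, k, v):
    """Single-head causal attention over the frame axis: each frame attends
    only to itself and earlier frames, matching streaming constraints.
    q, k, v: (batch, frames, dim)."""
    t, d = q.shape[1], q.shape[-1]
    scores = q @ k.transpose(-2, -1) / (d ** 0.5)          # (batch, t, t)
    # Upper-triangular mask blocks attention to future frames.
    mask = torch.triu(torch.ones(t, t, dtype=torch.bool, device=q.device), diagonal=1)
    scores = scores.masked_fill(mask, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v
```

The causal mask is what lets the backbone reuse past computation frame by frame at inference time instead of re-encoding the whole clip.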
[211] INTER: Mitigating Hallucination in Large Vision-Language Models by Interaction Guidance Sampling
Xin Dong, Shichao Dong, Jin Wang, Jing Huang, Li Zhou, Zenghui Sun, Lihua Jing, Jingsong Lan, Xiaoyong Zhu, Bo Zheng
Main category: cs.CV
TL;DR: This paper proposes INTER, a training-free algorithm that reduces hallucinations in large vision-language models by guiding them to better leverage multimodal interaction information, inspired by human cognitive processes.
Details
Motivation: Hallucinations in large vision-language models (LVLMs) pose significant challenges for real-world applications, as models generate plausible but visually inconsistent responses. The authors argue this occurs because LVLMs don't effectively leverage multimodal interaction information the way humans do: humans first gather multimodal information, analyze cross-modal interactions, and then express their understanding through language.
Method: The authors propose INTER (Interaction Guidance Sampling), a training-free algorithm that explicitly guides LVLMs to reapply their understanding of multimodal interaction information when generating responses. The method is inspired by human cognitive behavior and doesn’t require additional training data.
Result: INTER achieves an average improvement of up to 3.4% on five LVLMs compared to state-of-the-art decoding strategies, evaluated across six benchmarks including VQA and image captioning tasks. Experiments also revealed that LVLMs exhibit human-like but less pronounced cognitive behavior on multimodal samples.
Conclusion: The paper demonstrates that by mimicking human cognitive processes for multimodal understanding, LVLMs can significantly reduce hallucinations without additional training. INTER provides a practical, training-free solution to improve the reliability of vision-language models in real-world applications.
Abstract: Hallucinations in large vision-language models (LVLMs) pose significant challenges for real-world applications, as LVLMs may generate responses that appear plausible yet remain inconsistent with the associated visual content. This issue rarely occurs in human cognition. We argue that this discrepancy arises from humans’ ability to effectively leverage multimodal interaction information in data samples. Specifically, humans typically first gather multimodal information, analyze the interactions across modalities for understanding, and then express their understanding through language. Motivated by this observation, we conduct extensive experiments on popular LVLMs and obtain insights that surprisingly reveal human-like, though less pronounced, cognitive behavior of LVLMs on multimodal samples. Building on these findings, we further propose INTER: Interaction Guidance Sampling, a novel training-free algorithm that mitigates hallucinations without requiring additional data. Specifically, INTER explicitly guides LVLMs to effectively reapply their understanding of multimodal interaction information when generating responses, thereby reducing potential hallucinations. On six benchmarks including VQA and image captioning tasks, INTER achieves an average improvement of up to 3.4% on five LVLMs compared to the state-of-the-art decoding strategy. The code will be released when the paper is accepted.
[212] RDD: Robust Feature Detector and Descriptor using Deformable Transformer
Gonglin Chen, Tianwen Fu, Haiwei Chen, Wenbin Teng, Hanyuan Xiao, Yajie Zhao
Main category: cs.CV
TL;DR: The paper presents Robust Deformable Detector (RDD), a novel keypoint detector/descriptor using deformable transformers to handle challenging scenarios like significant viewpoint changes in structure-from-motion and SLAM applications.
Details
Motivation: Existing feature detection and description methods fail under challenging scenarios with significant viewpoint changes, and current approaches cannot effectively learn visual cues from long-range relationships despite recognizing the importance of local features in geometric transformations.
Method: The authors develop RDD using a deformable transformer architecture that captures global context and geometric invariance through deformable self-attention mechanisms. They also collect an Air-to-Ground dataset to supplement the standard MegaDepth dataset for training.
Result: RDD outperforms all state-of-the-art keypoint detection/description methods in sparse matching tasks and demonstrates capability for semi-dense matching. The method is evaluated on two challenging benchmarks: one focusing on large viewpoint and scale variations, and an Air-to-Ground benchmark for 3D reconstruction across different altitudes.
Conclusion: The deformable transformer approach effectively reduces search space complexity while modeling geometric invariance, making it superior to existing methods for robust feature detection and description in challenging viewpoint scenarios.
Abstract: As a core step in structure-from-motion and SLAM, robust feature detection and description under challenging scenarios such as significant viewpoint changes remain unresolved despite their ubiquity. While recent works have identified the importance of local features in modeling geometric transformations, these methods fail to learn the visual cues present in long-range relationships. We present Robust Deformable Detector (RDD), a novel and robust keypoint detector/descriptor leveraging the deformable transformer, which captures global context and geometric invariance through deformable self-attention mechanisms. Specifically, we observed that deformable attention focuses on key locations, effectively reducing the search space complexity and modeling the geometric invariance. Furthermore, we collected an Air-to-Ground dataset for training in addition to the standard MegaDepth dataset. Our proposed method outperforms all state-of-the-art keypoint detection/description methods in sparse matching tasks and is also capable of semi-dense matching. To ensure comprehensive evaluation, we introduce two challenging benchmarks: one emphasizing large viewpoint and scale variations, and the other being an Air-to-Ground benchmark – an evaluation setting that has recently been gaining popularity for 3D reconstruction across different altitudes.
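Deformable attention, the mechanism RDD builds on, restricts each query to a small set of learned sampling locations instead of attending over the full feature map. A toy single-head sketch; the offset scaling, point count, and tensor shapes are our choices, not RDD's:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DeformableSampling(nn.Module):
    """Toy single-head deformable attention: each query predicts K sampling
    offsets around its reference point and aggregates bilinearly sampled
    features with learned weights (a sketch, not the RDD architecture)."""
    def __init__(self, dim, n_points=4):
        super().__init__()
        self.offsets = nn.Linear(dim, n_points * 2)   # (dx, dy) per point
        self.weights = nn.Linear(dim, n_points)       # attention weight per point
        self.n_points = n_points

    def forward(self, query, feat, ref):
        # query: (B, N, C); feat: (B, C, H, W); ref: (B, N, 2) in [-1, 1]
        B, N, C = query.shape
        off = self.offsets(query).view(B, N, self.n_points, 2).tanh() * 0.1
        w = self.weights(query).softmax(-1)                      # (B, N, K)
        loc = (ref.unsqueeze(2) + off).clamp(-1, 1)              # (B, N, K, 2)
        sampled = F.grid_sample(feat, loc, align_corners=False)  # (B, C, N, K)
        return (sampled * w.unsqueeze(1)).sum(-1).transpose(1, 2)  # (B, N, C)
```

Because each query looks at only K locations rather than all H×W positions, the search space shrinks dramatically while the learned offsets can follow geometric distortions.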
[213] AV-Reasoner: Improving and Benchmarking Clue-Grounded Audio-Visual Counting for MLLMs
Lidong Lu, Guo Chen, Zhiqi Li, Yicheng Liu, Tong Lu
Main category: cs.CV
TL;DR: This paper introduces CG-AV-Counting, a comprehensive benchmark for video counting tasks with 1,027 multimodal questions over 497 long videos, and proposes AV-Reasoner, a model using reinforcement learning that achieves state-of-the-art performance on counting benchmarks.
Details
Motivation: Current multimodal large language models (MLLMs) struggle with counting tasks in videos, and existing benchmarks are limited by short videos, close-set queries, lack of clue annotations, and weak multimodal coverage, necessitating a more comprehensive evaluation framework.
Method: The authors develop the CG-AV-Counting benchmark with manually-annotated clues and propose AV-Reasoner, a model trained using GRPO (Group Relative Policy Optimization) and curriculum learning to improve counting capabilities by generalizing from related tasks.
Result: AV-Reasoner achieves state-of-the-art results across multiple benchmarks, demonstrating the effectiveness of reinforcement learning for counting tasks. However, reasoning in language space fails to improve performance on out-of-domain benchmarks.
Conclusion: The paper successfully addresses limitations in video counting evaluation through a comprehensive benchmark and shows that reinforcement learning can improve counting capabilities, though language-based reasoning has limited generalization to out-of-domain scenarios.
Abstract: Despite progress in video understanding, current MLLMs struggle with counting tasks. Existing benchmarks are limited by short videos, close-set queries, lack of clue annotations, and weak multimodal coverage. In this paper, we introduce CG-AV-Counting, a manually-annotated clue-grounded counting benchmark with 1,027 multimodal questions and 5,845 annotated clues over 497 long videos. It supports both black-box and white-box evaluation, serving as a comprehensive testbed for both end-to-end and reasoning-based counting. To explore ways to improve model’s counting capability, we propose AV-Reasoner, a model trained with GRPO and curriculum learning to generalize counting ability from related tasks. AV-Reasoner achieves state-of-the-art results across multiple benchmarks, demonstrating the effectiveness of reinforcement learning. However, experiments show that on out-of-domain benchmarks, reasoning in the language space fails to bring performance gains. The code and benchmark have been released on https://av-reasoner.github.io.
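GRPO, the RL algorithm used to train AV-Reasoner, replaces a learned value critic with group-relative reward normalization: several responses are sampled per prompt and each is scored against its own group. A minimal sketch of the advantage computation:

```python
import torch

def grpo_advantages(rewards, eps=1e-6):
    """Group-relative advantages as in GRPO.

    rewards: (num_prompts, samples_per_prompt) tensor of scalar rewards.
    Each sampled response is normalized against the mean/std of its own
    group, so no value critic is needed."""
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)
```

These advantages then weight the per-token policy-gradient term, typically combined with a clipped PPO-style ratio and a KL penalty against the reference model.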
[214] GroundFlow: A Plug-in Module for Temporal Reasoning on 3D Point Cloud Sequential Grounding
Zijun Lin, Shuting He, Cheston Tan, Bihan Wen
Main category: cs.CV
TL;DR: The paper proposes GroundFlow, a plug-in module that adds temporal reasoning capabilities to existing 3D visual grounding methods for sequential object localization in point clouds following multi-step text instructions.
Details
Motivation: Current 3D visual grounding methods treat multi-step text instructions as a whole without extracting temporal information from each step. Sequential grounding in 3D point clouds (SG3D) requires understanding contextual pronouns and retrieving relevant information from previous steps, which existing methods struggle with due to lack of effective historical information collection modules.
Method: The authors develop GroundFlow, a plug-in temporal reasoning module that can be integrated into existing 3D visual grounding baseline methods. The module selectively extracts both short-term and long-term step information based on relevance to current instructions, enabling comprehensive historical information processing while maintaining temporal understanding as step counts increase.
Result: GroundFlow significantly improves task accuracy of 3DVG baseline methods by +7.5% and +10.2% on the SG3D benchmark, outperforming even 3D large language models pre-trained on various datasets. The method achieves state-of-the-art performance across five datasets in the SG3D benchmark.
Conclusion: The work successfully introduces temporal reasoning capabilities to existing 3DVG models through the GroundFlow module, demonstrating that incorporating historical context and step-wise information significantly improves sequential grounding performance in 3D point clouds for multi-step instruction following tasks.
Abstract: Sequential grounding in 3D point clouds (SG3D) refers to locating sequences of objects by following text instructions for a daily activity with detailed steps. Current 3D visual grounding (3DVG) methods treat text instructions with multiple steps as a whole, without extracting useful temporal information from each step. However, the instructions in SG3D often contain pronouns such as “it”, “here” and “the same” to make language expressions concise. This requires grounding methods to understand the context and retrieve relevant information from previous steps to correctly locate object sequences. Due to the lack of an effective module for collecting related historical information, state-of-the-art 3DVG methods face significant challenges in adapting to the SG3D task. To fill this gap, we propose GroundFlow – a plug-in module for temporal reasoning on 3D point cloud sequential grounding. Firstly, we demonstrate that integrating GroundFlow improves the task accuracy of 3DVG baseline methods by a large margin (+7.5% and +10.2%) in the SG3D benchmark, even outperforming a 3D large language model pre-trained on various datasets. Furthermore, we selectively extract both short-term and long-term step information based on its relevance to the current instruction, enabling GroundFlow to take a comprehensive view of historical information and maintain its temporal understanding advantage as step counts increase. Overall, our work introduces temporal reasoning capabilities to existing 3DVG models and achieves state-of-the-art performance in the SG3D benchmark across five datasets.
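The module's key behavior, selectively extracting short-term and long-term step information by relevance to the current instruction, can be sketched as recency plus similarity-based retrieval. The split between recent steps and retrieved earlier steps below is our guess at the idea, not the paper's design:

```python
import torch
import torch.nn.functional as F

def retrieve_step_context(current_emb, history_embs, k_short=2, top_long=1):
    """Sketch of mixing short- and long-term history for sequential grounding.

    current_emb: (D,) embedding of the current step's instruction.
    history_embs: list of (D,) embeddings of previous steps.
    Always keeps the most recent k_short steps, plus earlier steps retrieved
    by cosine relevance to the current instruction."""
    short = history_embs[-k_short:]
    long_pool = history_embs[:-k_short]
    if long_pool and top_long > 0:
        sims = F.cosine_similarity(torch.stack(long_pool),
                                   current_emb.unsqueeze(0), dim=-1)
        idx = sims.topk(min(top_long, len(long_pool))).indices
        long = [long_pool[i] for i in idx.tolist()]
    else:
        long = []
    return long + list(short)
```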
[215] Curve-Aware Gaussian Splatting for 3D Parametric Curve Reconstruction
Zhirui Gao, Renjiao Yi, Yaqiao Dai, Xuening Zhu, Wei Chen, Chenyang Zhu, Kai Xu
Main category: cs.CV
TL;DR: This paper presents CurveGaussian, an end-to-end framework that reconstructs 3D parametric curves directly from multi-view edge maps using a novel bi-directional coupling between parametric curves and edge-oriented Gaussian components, eliminating the error accumulation of traditional two-stage methods.
Details
Motivation: Existing two-stage methods for 3D curve reconstruction follow a sequential pipeline of edge point cloud reconstruction followed by parametric curve fitting, which causes error accumulation due to optimization gaps between disconnected stages. Additionally, parametric curves lack suitability for rendering-based multi-view optimization, creating a need for a representation that preserves geometric properties while enabling differentiable rendering.
Method: The authors propose CurveGaussian, a curve-aware Gaussian representation that creates a bi-directional coupling mechanism between parametric curves and edge-oriented Gaussian components. This enables differentiable rendering of 3D curves for direct optimization from 2D edge maps. They also introduce a dynamically adaptive topology optimization framework that refines curve structures through linearization, merging, splitting, and pruning operations during training.
Result: Comprehensive evaluations on the ABC dataset and real-world benchmarks show the one-stage method outperforms two-stage alternatives, producing cleaner and more robust reconstructions. The method achieves higher efficiency with significantly reduced parameter count during training while maintaining superior performance compared to existing approaches.
Conclusion: The proposed end-to-end CurveGaussian framework successfully eliminates error accumulation from traditional two-stage approaches by directly optimizing 3D parametric curves from multi-view edge maps. The novel coupling mechanism enables effective differentiable rendering while the adaptive topology optimization ensures robust curve reconstruction with improved efficiency.
Abstract: This paper presents an end-to-end framework for reconstructing 3D parametric curves directly from multi-view edge maps. Contrasting with existing two-stage methods that follow a sequential "edge point cloud reconstruction and parametric curve fitting" pipeline, our one-stage approach optimizes 3D parametric curves directly from 2D edge maps, eliminating error accumulation caused by the inherent optimization gap between disconnected stages. However, parametric curves inherently lack suitability for rendering-based multi-view optimization, necessitating a complementary representation that preserves their geometric properties while enabling differentiable rendering. We propose a novel bi-directional coupling mechanism between parametric curves and edge-oriented Gaussian components. This tight correspondence formulates a curve-aware Gaussian representation, CurveGaussian, that enables differentiable rendering of 3D curves, allowing direct optimization guided by multi-view evidence. Furthermore, we introduce a dynamically adaptive topology optimization framework during training to refine curve structures through linearization, merging, splitting, and pruning operations. Comprehensive evaluations on the ABC dataset and real-world benchmarks demonstrate our one-stage method’s superiority over two-stage alternatives, particularly in producing cleaner and more robust reconstructions. Additionally, by directly optimizing parametric curves, our method significantly reduces the parameter count during training, achieving both higher efficiency and superior performance compared to existing approaches.
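The bi-directional coupling hinges on Gaussian components whose positions are functions of the curve parameters, so rendering gradients reach the control points. A minimal sketch for a single cubic Bezier segment; the paper's full representation also handles orientation, scale, and the topology operations described above:

```python
import torch

def gaussians_on_bezier(ctrl, n=32):
    """Place Gaussian centers along a cubic Bezier curve so that gradients
    from rendered Gaussians flow back to the four control points.

    ctrl: (4, 3) tensor of control points (requires_grad for optimization).
    Returns (n, 3) Gaussian centers sampled uniformly in parameter t."""
    t = torch.linspace(0.0, 1.0, n).unsqueeze(1)              # (n, 1)
    basis = torch.cat([(1 - t) ** 3,                          # Bernstein basis
                       3 * (1 - t) ** 2 * t,
                       3 * (1 - t) * t ** 2,
                       t ** 3], dim=1)                        # (n, 4)
    return basis @ ctrl                                       # (n, 3)
```

Since the centers are a differentiable function of `ctrl`, any image-space loss on the rendered Gaussians optimizes the parametric curve directly, which is the point of the coupling.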
[216] R1-Track: Direct Application of MLLMs to Visual Object Tracking via Reinforcement Learning
Biao Wang, Wenwen Li, Jiawei Ge
Main category: cs.CV
TL;DR: This paper presents R1-Track, a fine-tuned version of Qwen2.5-VL using reinforcement learning for visual object tracking, which addresses the limitations of traditional template matching approaches by leveraging multi-modal large language models with flexible initialization options.
Details
Motivation: Traditional visual tracking methods require explicit classification/regression modeling, depend on large-scale supervised training, and lack flexibility for multiple tasks. While multi-modal large language models (MLLMs) like Qwen2.5-VL show strong grounding capabilities, they struggle with template matching tasks essential for tracking.
Method: The authors fine-tuned Qwen2.5-VL using Group Relative Policy Optimization (GRPO) reinforcement learning on a small-scale dataset with rule-based reward functions, inspired by the deepseek-R1 approach, to create R1-Track.
Result: R1-Track achieved notable performance on the GOT-10k benchmark while supporting flexible initialization through both bounding boxes and text descriptions, and retained most of the original model’s general capabilities.
Conclusion: The study demonstrates that reinforcement learning fine-tuning can successfully adapt MLLMs for visual tracking tasks, offering a more flexible alternative to traditional tracking methods while maintaining general model capabilities. Future improvements are discussed for further enhancement.
Abstract: Visual single object tracking aims to continuously localize and estimate the scale of a target in subsequent video frames, given only its initial state in the first frame. This task has traditionally been framed as a template matching problem, evolving through major phases including correlation filters, two-stream networks, and one-stream networks with significant progress achieved. However, these methods typically require explicit classification and regression modeling, depend on supervised training with large-scale datasets, and are limited to the single task of tracking, lacking flexibility. In recent years, multi-modal large language models (MLLMs) have advanced rapidly. Open-source models like Qwen2.5-VL, a flagship MLLM with strong foundational capabilities, demonstrate excellent performance in grounding tasks. This has spurred interest in applying such models directly to visual tracking. However, experiments reveal that Qwen2.5-VL struggles with template matching between image pairs (i.e., tracking tasks). Inspired by deepseek-R1, we fine-tuned Qwen2.5-VL using the group relative policy optimization (GRPO) reinforcement learning method on a small-scale dataset with a rule-based reward function. The resulting model, R1-Track, achieved notable performance on the GOT-10k benchmark. R1-Track supports flexible initialization via bounding boxes or text descriptions while retaining most of the original model’s general capabilities. We further discuss potential improvements for R1-Track. This rough technical report summarizes our findings as of May 2025.
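The abstract mentions a rule-based reward for GRPO fine-tuning without spelling it out; for a tracker that outputs boxes, IoU against the ground-truth box is the natural candidate. A sketch under that assumption:

```python
def iou_reward(pred_box, gt_box):
    """Rule-based reward for RL fine-tuning on tracking: IoU between the
    predicted and ground-truth boxes in (x1, y1, x2, y2) format. The paper's
    exact reward shaping is not specified; plain IoU is one plausible choice."""
    ix1, iy1 = max(pred_box[0], gt_box[0]), max(pred_box[1], gt_box[1])
    ix2, iy2 = min(pred_box[2], gt_box[2]), min(pred_box[3], gt_box[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_p = (pred_box[2] - pred_box[0]) * (pred_box[3] - pred_box[1])
    area_g = (gt_box[2] - gt_box[0]) * (gt_box[3] - gt_box[1])
    union = area_p + area_g - inter
    return inter / union if union > 0 else 0.0
```

Rule-based rewards like this are what make the deepseek-R1 recipe cheap: no learned reward model is needed, only a verifiable score per sampled response.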
[217] AI-Enhanced Precision in Sport Taekwondo: Increasing Fairness, Speed, and Trust in Competition (FST.ai)
Keivan Shariatmadar, Ahmad Osman
Main category: cs.CV
TL;DR: This paper introduces FST.ai, an AI-powered framework that automates real-time head kick detection and scoring in Sport Taekwondo, reducing decision time from minutes to seconds while improving consistency and transparency in officiating.
Details
Motivation: Traditional sports officiating systems suffer from latency, subjectivity, and inconsistent enforcement even with Instant Video Replay, which undermines fairness and athlete trust. There is a need for automated, objective, and fast decision-making systems in competitive sports environments.
Method: The FST.ai framework leverages computer vision, deep learning, and edge inference technologies. The methodology is based on pose estimation, motion classification, and impact analysis to automatically identify and classify key actions in real-time.
Result: The system successfully automates head kick detection and scoring in Taekwondo, significantly reducing decision time from minutes to seconds while improving consistency and transparency in officiating decisions.
Conclusion: The FST.ai framework demonstrates robustness, scalability, and sport-agnostic potential that can be adapted to various sports requiring action detection, including judo, karate, fencing, football, and basketball, thus having the potential to transform officiating standards across multiple disciplines.
Abstract: The integration of Artificial Intelligence (AI) into sports officiating represents a paradigm shift in how decisions are made in competitive environments. Traditional manual systems, even when supported by Instant Video Replay (IVR), often suffer from latency, subjectivity, and inconsistent enforcement, undermining fairness and athlete trust. This paper introduces ‘FST.ai’ – a novel AI-powered framework developed under the ‘R3AL.ai’ project (r3al.ai) – designed to enhance officiating in Sport Taekwondo, particularly focusing on the complex task of real-time head kick detection and scoring. Leveraging computer vision, deep learning, and edge inference, the system automates the identification and classification of key actions, significantly reducing decision time from minutes to seconds while improving consistency and transparency. Importantly, the methodology is not limited to Taekwondo. The underlying framework – based on pose estimation, motion classification, and impact analysis – can be adapted to a wide range of sports requiring action detection, such as judo, karate, fencing, or even team sports like football and basketball, where foul recognition or performance tracking is critical. By addressing one of Taekwondo’s most challenging scenarios – head kick scoring – we demonstrate the robustness, scalability, and sport-agnostic potential of ‘FST.ai’ to transform officiating standards across multiple disciplines.
[218] Q-Frame: Query-aware Frame Selection and Multi-Resolution Adaptation for Video-LLMs
Shaojie Zhang, Jiahui Yang, Jianqin Yin, Zhenbo Luo, Jian Luan
Main category: cs.CV
TL;DR: Q-Frame is a training-free, plug-and-play approach that adaptively selects video frames and scales resolution based on query content, enabling Video-LLMs to process more frames while preserving critical temporal and spatial information without exceeding computational limits.
Details
Motivation: Existing Video-LLMs using uniform frame sampling struggle to capture crucial query-related spatiotemporal clues effectively due to large data volumes and temporal complexity in video comprehension tasks.
Method: Q-Frame uses adaptive frame selection and multi-resolution scaling tailored to video content and specific queries, employing a text-image matching network like CLIP with the Gumbel-Max trick for efficient frame selection in a training-free, plug-and-play manner.
Result: Extensive experiments on benchmark datasets (MLVU, LongVideoBench, and Video-MME) demonstrate Q-Frame’s superiority over existing methods and its broad applicability across various video understanding tasks.
Conclusion: Q-Frame successfully addresses the limitations of uniform frame sampling in Video-LLMs by enabling adaptive, query-aware frame selection that preserves critical spatiotemporal information while maintaining computational efficiency.
Abstract: Multimodal Large Language Models (MLLMs) have demonstrated significant success in visual understanding tasks. However, challenges persist in adapting these models for video comprehension due to the large volume of data and temporal complexity. Existing Video-LLMs using uniform frame sampling often struggle to capture the query-related crucial spatiotemporal clues of videos effectively. In this paper, we introduce Q-Frame, a novel approach for adaptive frame selection and multi-resolution scaling tailored to the video’s content and the specific query. Q-Frame employs a training-free, plug-and-play strategy generated by a text-image matching network like CLIP, utilizing the Gumbel-Max trick for efficient frame selection. Q-Frame allows Video-LLMs to process more frames without exceeding computational limits, thereby preserving critical temporal and spatial information. We demonstrate Q-Frame’s effectiveness through extensive experiments on benchmark datasets, including MLVU, LongVideoBench, and Video-MME, illustrating its superiority over existing methods and its applicability across various video understanding tasks.
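Frame selection via the Gumbel-Max trick amounts to perturbing query-frame similarity scores with Gumbel noise and keeping the top-k, which samples frames without replacement in proportion to their (tempered) scores. A minimal sketch, assuming CLIP similarities are precomputed; the temperature and interface are our choices:

```python
import torch

def gumbel_topk_frames(clip_scores, k, tau=1.0):
    """Sample k frames without replacement via the Gumbel-top-k trick.

    clip_scores: (num_frames,) query-frame similarities, e.g. from CLIP.
    Adding Gumbel noise to tempered scores and taking the top-k draws
    frames proportionally to softmax(clip_scores / tau)."""
    u = torch.rand_like(clip_scores).clamp_min(1e-9)
    gumbel = -torch.log(-torch.log(u))            # standard Gumbel noise
    picked = torch.topk(clip_scores / tau + gumbel, k).indices
    return picked.sort().values                   # keep temporal order
```

The stochasticity keeps selection from collapsing onto near-duplicate high-scoring frames while still favoring query-relevant content.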
[219] CP-uniGuard: A Unified, Probability-Agnostic, and Adaptive Framework for Malicious Agent Detection and Defense in Multi-Agent Embodied Perception Systems
Senkang Hu, Yihang Tao, Guowen Xu, Xinyuan Qian, Yiqin Deng, Xianhao Chen, Sam Tak Wu Kwong, Yuguang Fang
Main category: cs.CV
TL;DR: CP-uniGuard is a unified defense framework for collaborative perception systems that detects and eliminates malicious agents by using consensus-based verification without requiring prior knowledge of attack probabilities.
Details
Motivation: Collaborative perception systems in multi-agent autonomous driving are vulnerable to attacks from malicious agents that can compromise the shared perception information, creating a critical security issue that needs to be addressed to ensure system reliability and safety.
Method: The framework uses three key components: (1) probability-agnostic sample consensus (PASAC) to sample collaborators and verify consensus without prior attack probabilities, (2) collaborative consistency loss (CCLoss) to measure discrepancy between ego agent and collaborators for object detection and BEV segmentation, and (3) online adaptive threshold via dual sliding windows to dynamically adjust consensus verification thresholds.
Result: Extensive experiments demonstrate the effectiveness of the CP-uniGuard framework in accurately detecting and eliminating malicious agents while maintaining collaborative perception performance in dynamic environments.
Conclusion: CP-uniGuard provides a practical and adaptive solution for securing collaborative perception systems by enabling consensus-based malicious agent detection without requiring prior knowledge of attack probabilities, making it suitable for real-world deployment in multi-agent autonomous systems.
Abstract: Collaborative Perception (CP) has been shown to be a promising technique for multi-agent autonomous driving and multi-agent robotic systems, where multiple agents share their perception information to enhance the overall perception performance and expand the perception range. However, in CP, an ego agent needs to receive messages from its collaborators, which makes it vulnerable to attacks from malicious agents. To address this critical issue, we propose a unified, probability-agnostic, and adaptive framework, namely, CP-uniGuard, which is a tailored defense mechanism for CP deployed by each agent to accurately detect and eliminate malicious agents in its collaboration network. Our key idea is to enable CP to reach a consensus rather than a conflict against an ego agent’s perception results. Based on this idea, we first develop a probability-agnostic sample consensus (PASAC) method to effectively sample a subset of the collaborators and verify the consensus without prior probabilities of malicious agents. Furthermore, we define collaborative consistency loss (CCLoss) for object detection task and bird’s eye view (BEV) segmentation task to capture the discrepancy between an ego agent and its collaborators, which is used as a verification criterion for consensus. In addition, we propose online adaptive threshold via dual sliding windows to dynamically adjust the threshold for consensus verification and ensure the reliability of the systems in dynamic environments. Finally, we conduct extensive experiments and demonstrate the effectiveness of our framework. Code will be released at https://github.com/CP-Security/CP-uniGuard.
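The online adaptive threshold via dual sliding windows can be pictured as a long window supplying a stable baseline and a short window tracking recent variability. The paper does not specify the exact statistics; the following is one plausible reading:

```python
from collections import deque

class DualWindowThreshold:
    """Sketch of an online adaptive consensus threshold with two sliding
    windows: the long window tracks the stable baseline of consistency-loss
    values, the short window tracks recent variability. Window lengths and
    the margin factor are illustrative, not the paper's values."""
    def __init__(self, short_len=10, long_len=100, margin=2.0):
        self.short = deque(maxlen=short_len)
        self.long = deque(maxlen=long_len)
        self.margin = margin

    def update(self, loss_value):
        """Feed one new consistency-loss observation; return the current
        threshold above which a collaborator would be flagged."""
        self.short.append(loss_value)
        self.long.append(loss_value)
        base = sum(self.long) / len(self.long)
        mu_s = sum(self.short) / len(self.short)
        var = sum((x - mu_s) ** 2 for x in self.short) / len(self.short)
        return base + self.margin * var ** 0.5
```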
[220] NavMorph: A Self-Evolving World Model for Vision-and-Language Navigation in Continuous Environments
Xuan Yao, Junyu Gao, Changsheng Xu
Main category: cs.CV
TL;DR: NavMorph is a self-evolving world model framework for Vision-and-Language Navigation in Continuous Environments (VLN-CE) that uses compact latent representations and Contextual Evolution Memory to improve environmental understanding and adaptive planning, achieving notable performance improvements on VLN-CE benchmarks.
Details
Motivation: Current VLN-CE approaches struggle with generalizing to novel environments and adapting to ongoing changes during navigation. The paper is inspired by human cognition to develop a framework that can better understand environments and make adaptive decisions during navigation tasks.
Method: NavMorph employs a self-evolving world model framework with compact latent representations to model environmental dynamics. It integrates a novel Contextual Evolution Memory that leverages scene-contextual information to support effective navigation while maintaining online adaptability. The framework provides agents with foresight for adaptive planning and policy refinement.
Result: Extensive experiments demonstrate that NavMorph achieves notable performance improvements on popular VLN-CE benchmarks compared to existing approaches.
Conclusion: NavMorph successfully addresses the challenges of generalization and adaptation in VLN-CE by incorporating human cognition-inspired self-evolving world models with contextual memory, leading to improved navigation performance in complex environments.
Abstract: Vision-and-Language Navigation in Continuous Environments (VLN-CE) requires agents to execute sequential navigation actions in complex environments guided by natural language instructions. Current approaches often struggle with generalizing to novel environments and adapting to ongoing changes during navigation. Inspired by human cognition, we present NavMorph, a self-evolving world model framework that enhances environmental understanding and decision-making in VLN-CE tasks. NavMorph employs compact latent representations to model environmental dynamics, equipping agents with foresight for adaptive planning and policy refinement. By integrating a novel Contextual Evolution Memory, NavMorph leverages scene-contextual information to support effective navigation while maintaining online adaptability. Extensive experiments demonstrate that our method achieves notable performance improvements on popular VLN-CE benchmarks. Code is available at https://github.com/Feliciaxyao/NavMorph.
[221] SeC: Advancing Complex Video Object Segmentation via Progressive Concept Construction
Zhixiong Zhang, Shuangrui Ding, Xiaoyi Dong, Songxin He, Jianfan Lin, Junsong Tang, Yuhang Zang, Yuhang Cao, Dahua Lin, Jiaqi Wang
Main category: cs.CV
TL;DR: This paper proposes Segment Concept (SeC), a concept-driven video object segmentation framework that uses Large Vision-Language Models to build high-level object representations instead of relying solely on appearance matching, achieving 11.8-point improvement over SAM 2.1 on a new challenging benchmark.
Details
Motivation: Current video object segmentation techniques lag behind human capabilities in handling drastic visual variations, occlusions, and complex scene changes because they rely on appearance matching and neglect human-like conceptual understanding of objects that enables robust identification across temporal dynamics.
Method: SeC employs Large Vision-Language Models (LVLMs) to integrate visual cues across diverse frames and construct robust conceptual priors. During inference, it forms comprehensive semantic representations of targets and adaptively balances LVLM-based semantic reasoning with enhanced feature matching based on scene complexity.
Result: SeC achieves an 11.8-point improvement over SAM 2.1 on the newly introduced SeCVOS benchmark, establishing a new state-of-the-art in concept-aware video object segmentation. The SeCVOS benchmark comprises 160 manually annotated multi-scenario videos designed to challenge models with substantial appearance variations.
Conclusion: The paper demonstrates that shifting from conventional feature matching to progressive construction of high-level, object-centric representations significantly improves video object segmentation performance, particularly in scenarios demanding conceptual reasoning and semantic understanding.
Abstract: Video Object Segmentation (VOS) is a core task in computer vision, requiring models to track and segment target objects across video frames. Despite notable advances with recent efforts, current techniques still lag behind human capabilities in handling drastic visual variations, occlusions, and complex scene changes. This limitation arises from their reliance on appearance matching, neglecting the human-like conceptual understanding of objects that enables robust identification across temporal dynamics. Motivated by this gap, we propose Segment Concept (SeC), a concept-driven segmentation framework that shifts from conventional feature matching to the progressive construction and utilization of high-level, object-centric representations. SeC employs Large Vision-Language Models (LVLMs) to integrate visual cues across diverse frames, constructing robust conceptual priors. During inference, SeC forms a comprehensive semantic representation of the target based on processed frames, realizing robust segmentation of follow-up frames. Furthermore, SeC adaptively balances LVLM-based semantic reasoning with enhanced feature matching, dynamically adjusting computational efforts based on scene complexity. To rigorously assess VOS methods in scenarios demanding high-level conceptual reasoning and robust semantic understanding, we introduce the Semantic Complex Scenarios Video Object Segmentation benchmark (SeCVOS). SeCVOS comprises 160 manually annotated multi-scenario videos designed to challenge models with substantial appearance variations and dynamic scene transformations. In particular, SeC achieves an 11.8-point improvement over SAM 2.1 on SeCVOS, establishing a new state-of-the-art in concept-aware video object segmentation.
[222] Rethinking Discrete Tokens: Treating Them as Conditions for Continuous Autoregressive Image Synthesis
Peng Zheng, Junke Wang, Yi Chang, Yizhou Yu, Rui Ma, Zuxuan Wu
Main category: cs.CV
TL;DR: DisCon introduces a novel autoregressive framework that uses discrete tokens as conditional signals to generate continuous representations, achieving superior image generation quality with a gFID score of 1.38 on ImageNet 256×256.
Details
Motivation: Existing autoregressive visual generation models suffer from information loss due to quantization when using discrete tokens, while continuous token prediction faces challenges with unbounded high-dimensional spaces and out-of-distribution artifacts.
Method: DisCon reframes discrete tokens as conditional signals rather than generation targets, modeling the conditional probability of continuous representations given discrete tokens to avoid both quantization information loss and optimization challenges of continuous modeling.
Result: DisCon achieves a gFID score of 1.38 on ImageNet 256×256 generation, outperforming state-of-the-art autoregressive approaches by a clear margin.
Conclusion: The proposed DisCon framework successfully addresses the limitations of both discrete and continuous token approaches in autoregressive visual generation, demonstrating significant improvements in image generation quality.
Abstract: Recent advances in large language models (LLMs) have spurred interests in encoding images as discrete tokens and leveraging autoregressive (AR) frameworks for visual generation. However, the quantization process in AR-based visual generation models inherently introduces information loss that degrades image fidelity. To mitigate this limitation, recent studies have explored to autoregressively predict continuous tokens. Unlike discrete tokens that reside in a structured and bounded space, continuous representations exist in an unbounded, high-dimensional space, making density estimation more challenging and increasing the risk of generating out-of-distribution artifacts. Based on the above findings, this work introduces DisCon (Discrete-Conditioned Continuous Autoregressive Model), a novel framework that reinterprets discrete tokens as conditional signals rather than generation targets. By modeling the conditional probability of continuous representations conditioned on discrete tokens, DisCon circumvents the optimization challenges of continuous token modeling while avoiding the information loss caused by quantization. DisCon achieves a gFID score of 1.38 on ImageNet 256×256 generation, outperforming state-of-the-art autoregressive approaches by a clear margin. Project page: https://pengzheng0707.github.io/DisCon.
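DisCon's central move, treating the discrete token as a condition rather than the generation target, can be sketched as a head that fuses the autoregressive hidden state with a codebook embedding to regress the continuous latent. The dimensions and fusion MLP below are illustrative, not the paper's architecture:

```python
import torch
import torch.nn as nn

class DiscreteConditionedHead(nn.Module):
    """Minimal sketch of the discrete-conditioned continuous prediction idea:
    the discrete token no longer *is* the output; its codebook embedding
    conditions a regression over the continuous representation."""
    def __init__(self, vocab_size, hidden_dim, latent_dim):
        super().__init__()
        self.code_emb = nn.Embedding(vocab_size, hidden_dim)
        self.mlp = nn.Sequential(
            nn.Linear(2 * hidden_dim, hidden_dim), nn.GELU(),
            nn.Linear(hidden_dim, latent_dim),
        )

    def forward(self, hidden, discrete_tokens):
        # hidden: (B, T, H) AR hidden states; discrete_tokens: (B, T) long
        cond = self.code_emb(discrete_tokens)
        return self.mlp(torch.cat([hidden, cond], dim=-1))  # continuous prediction
```

The discrete token bounds the region of latent space the head must cover, which is how the framework sidesteps unconstrained high-dimensional density estimation.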
[223] VICI: VLM-Instructed Cross-view Image-localisation
Xiaohan Zhang, Tavis Shore, Chen Chen, Oscar Mendez, Simon Hadfield, Safwan Wshah
Main category: cs.CV
TL;DR: This paper presents a two-stage solution for the UAVM 2025 Challenge that matches narrow field-of-view street-level images to satellite imagery, achieving competitive retrieval performance through optimized retrieval and re-ranking strategies.
Details
Motivation: Real-world geo-localization scenarios typically involve limited field-of-view images with unknown camera parameters rather than panoramic views, making it important to explore more practical problem formulations as panoramic cross-view geo-localization approaches peak performance.
Method: A two-stage approach consisting of: (1) retrieving candidate satellite image embeddings for a given street-level query, followed by (2) a re-ranking stage that selectively enhances retrieval accuracy within the top candidates to handle significant viewpoint and scale variations.
Result: The method achieves competitive results on the University-1652 dataset; the R@1 and R@10 retrieval rates appear in the abstract only as the unresolved placeholders \topone% and \topten%, so exact values are not available.
Conclusion: The study demonstrates that optimized retrieval and re-ranking strategies have significant potential for advancing practical geo-localization performance, particularly for narrow FOV street-level to satellite image matching tasks.
Abstract: In this paper, we present a high-performing solution to the UAVM 2025 Challenge, which focuses on matching narrow FOV street-level images to corresponding satellite imagery using the University-1652 dataset. As panoramic Cross-View Geo-Localisation nears peak performance, it becomes increasingly important to explore more practical problem formulations. Real-world scenarios rarely offer panoramic street-level queries; instead, queries typically consist of limited-FOV images captured with unknown camera parameters. Our work prioritises discovering the highest achievable performance under these constraints, pushing the limits of existing architectures. Our method begins by retrieving candidate satellite image embeddings for a given query, followed by a re-ranking stage that selectively enhances retrieval accuracy within the top candidates. This two-stage approach enables more precise matching, even under the significant viewpoint and scale variations inherent in the task. Through experimentation, we demonstrate that our approach achieves competitive results – specifically attaining R@1 and R@10 retrieval rates of \topone% and \topten% respectively. This underscores the potential of optimised retrieval and re-ranking strategies in advancing practical geo-localisation performance. Code is available at https://github.com/tavisshore/VICI.
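The pipeline is a standard retrieve-then-rerank pattern: cheap embedding similarity over the whole satellite gallery, expensive scoring only on the shortlist. A NumPy sketch with a placeholder `rerank_fn`; the paper's actual matcher is not specified here:

```python
import numpy as np

def retrieve_then_rerank(query_emb, sat_embs, rerank_fn, top_k=10):
    """Two-stage cross-view matching sketch.

    query_emb: (D,) street-level query embedding.
    sat_embs: (N, D) satellite gallery embeddings.
    rerank_fn: placeholder callable (query_emb, cand_emb) -> score for a
    stronger (and slower) matcher applied only to the shortlist."""
    # Stage 1: fast cosine-similarity retrieval over the full gallery.
    sims = sat_embs @ query_emb / (
        np.linalg.norm(sat_embs, axis=1) * np.linalg.norm(query_emb) + 1e-8)
    shortlist = np.argsort(-sims)[:top_k]
    # Stage 2: expensive re-ranking restricted to the top-k candidates.
    reranked = sorted(shortlist, key=lambda i: -rerank_fn(query_emb, sat_embs[i]))
    return reranked  # final ranking of satellite gallery indices
```

Restricting the heavy matcher to k candidates keeps total cost near that of pure retrieval while letting a stronger model decide the final order.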
[224] CL-Polyp: A Contrastive Learning-Enhanced Network for Accurate Polyp Segmentation
Desheng Li, Chaoliang Liu, Zhiyong Xiao
Main category: cs.CV
TL;DR: CL-Polyp is a contrastive learning-enhanced network for polyp segmentation that uses self-supervised learning to improve feature extraction without additional annotations, incorporating MASPP and CA modules for better multi-scale fusion and boundary reconstruction, achieving state-of-the-art results on five benchmark datasets.
Details
Motivation: Existing deep learning polyp segmentation methods using Encoder-Decoder architectures and multi-task frameworks often require more labeled data and depend on task similarity, which limits their generalizability. There is a need for methods that can improve segmentation performance without requiring additional annotations.
Method: The paper proposes CL-Polyp, which uses contrastive learning to enhance the encoder’s discriminative feature extraction by contrasting positive and negative sample pairs from polyp images. The method includes two lightweight modules: Modified Atrous Spatial Pyramid Pooling (MASPP) for multi-scale feature fusion, and Channel Concatenate and Element Add (CA) module for merging low-level and upsampled features to enhance boundary reconstruction.
Result: CL-Polyp consistently outperforms state-of-the-art methods on five benchmark datasets (Kvasir-SEG, CVC-ClinicDB, CVC-ColonDB, CVC-300, and ETIS). Specifically, it improves the IoU metric by 0.011 on Kvasir-SEG and 0.020 on CVC-ClinicDB datasets compared to existing methods.
Conclusion: CL-Polyp demonstrates effectiveness in clinical polyp segmentation by leveraging contrastive learning for self-supervised feature enhancement without requiring additional annotations, while incorporating efficient modules for improved multi-scale feature processing and boundary reconstruction, achieving superior performance across multiple benchmark datasets.
Abstract: Accurate segmentation of polyps from colonoscopy images is crucial for the early diagnosis and treatment of colorectal cancer. Most existing deep learning-based polyp segmentation methods adopt an Encoder-Decoder architecture, and some utilize multi-task frameworks that incorporate auxiliary tasks like classification to improve segmentation. However, these methods often need more labeled data and depend on task similarity, potentially limiting generalizability. To address these challenges, we propose CL-Polyp, a contrastive learning-enhanced polyp segmentation network. Our method uses contrastive learning to enhance the encoder’s extraction of discriminative features by contrasting positive and negative sample pairs from polyp images. This self-supervised strategy improves visual representation without needing additional annotations. We also introduce two efficient, lightweight modules: the Modified Atrous Spatial Pyramid Pooling (MASPP) module for improved multi-scale feature fusion, and the Channel Concatenate and Element Add (CA) module to merge low-level and upsampled features for enhanced boundary reconstruction. Extensive experiments on five benchmark datasets-Kvasir-SEG, CVC-ClinicDB, CVC-ColonDB, CVC-300, and ETIS-show that CL-Polyp consistently surpasses state-of-the-art methods. Specifically, it enhances the IoU metric by 0.011 and 0.020 on the Kvasir-SEG and CVC-ClinicDB datasets, respectively, demonstrating its effectiveness in clinical polyp segmentation.
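The contrastive pretext task pulls together embeddings of positive pairs built from polyp images and pushes apart negatives, which is exactly what the standard InfoNCE (NT-Xent) loss does. A minimal sketch; CL-Polyp's exact pairing and projection scheme may differ:

```python
import torch
import torch.nn.functional as F

def info_nce(z1, z2, temperature=0.1):
    """Standard InfoNCE loss between two batches of embeddings (B, D).

    The i-th rows of z1 and z2 form a positive pair (e.g. two augmented
    views of the same polyp image); all other rows act as negatives."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / temperature          # (B, B) similarity matrix
    labels = torch.arange(z1.size(0), device=z1.device)  # positives on diagonal
    return F.cross_entropy(logits, labels)
```

Because the loss needs only augmented views of unlabeled images, the encoder improves without any extra segmentation annotations, which is the appeal of this pretext task.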
[225] Unreal is all you need: Multimodal ISAC Data Simulation with Only One Engine
Kongwu Huang, Shiyi Mu, Jun Jiang, Yuan Gao, Shugong Xu
Main category: cs.CV
TL;DR: The paper introduces Great-X, a multimodal data twin platform that integrates ray-tracing computation with Unreal Engine for ISAC research, and creates Great-MSD, a large-scale UAV multimodal dataset with a baseline CSI-based 3D localization algorithm.
Details
Motivation: To explore the potential of scaling laws (successful in LLMs and foundation models) in ISAC (Integrated Sensing and Communications) research, a comprehensive simulation platform is needed that can efficiently generate synchronized multimodal data for UAV applications.
Method: The authors developed Great-X, a single-engine multimodal data twin platform that reconstructs Sionna's ray-tracing computation within Unreal Engine and integrates with autonomous driving tools. This enables synchronized simulation of multiple data modalities (CSI, RGB, Radar, LiDAR) and creation of the Great-MSD dataset with a baseline CSI-based UAV 3D localization algorithm.
Result: Successfully created an open-source, large-scale, low-altitude UAV multimodal synaesthesia dataset (Great-MSD) and demonstrated a baseline CSI-based UAV 3D localization algorithm that shows feasibility and generalizability across different CSI simulation engines.
Conclusion: The Great-X platform successfully enables efficient multimodal data simulation for ISAC research, and the resulting Great-MSD dataset with baseline algorithms demonstrates the feasibility of applying scaling law principles to UAV localization tasks across different simulation environments.
Abstract: Scaling laws have achieved success in LLM and foundation models. To explore their potential in ISAC research, we propose Great-X. This single-engine multimodal data twin platform reconstructs the ray-tracing computation of Sionna within Unreal Engine and is deeply integrated with autonomous driving tools. This enables efficient and synchronized simulation of multimodal data, including CSI, RGB, Radar, and LiDAR. Based on this platform, we construct an open-source, large-scale, low-altitude UAV multimodal synaesthesia dataset named Great-MSD, and propose a baseline CSI-based UAV 3D localization algorithm, demonstrating its feasibility and generalizability across different CSI simulation engines. The related code and dataset will be made available at: https://github.com/hkw-xg/Great-MCD.
[226] Memory-Augmented SAM2 for Training-Free Surgical Video Segmentation
Ming Yin, Fu Wang, Xujiong Ye, Yanda Meng, Zeyu Fu
Main category: cs.CV
TL;DR: MA-SAM2 is a training-free enhancement to SAM2 that improves surgical video segmentation by introducing context-aware and occlusion-resilient memory models, achieving 4.36% and 6.1% performance improvements on surgical datasets without additional parameters or training.
Details
Motivation: SAM2's greedy selection memory design struggles with surgical videos due to rapid instrument movement, frequent occlusion, and complex instrument-tissue interactions, leading to diminished performance in segmenting complex, long surgical videos.
Method: The authors developed Memory Augmented (MA)-SAM2, a training-free video object segmentation strategy that features novel context-aware and occlusion-resilient memory models. The approach uses multi-target, single-loop, one-prompt inference to enhance tracking efficiency in multi-instrument videos.
Result: MA-SAM2 achieved performance improvements of 4.36% on EndoVis2017 and 6.1% on EndoVis2018 datasets compared to SAM2, while maintaining robustness against occlusions and interactions from complex instrument movements without requiring additional parameters or training.
Conclusion: MA-SAM2 successfully addresses SAM2’s limitations in surgical video segmentation through enhanced memory models, demonstrating significant performance gains and practical potential for computer-assisted surgery applications while maintaining the training-free advantage.
Abstract: Surgical video segmentation is a critical task in computer-assisted surgery, essential for enhancing surgical quality and patient outcomes. Recently, the Segment Anything Model 2 (SAM2) framework has demonstrated remarkable advancements in both image and video segmentation. However, the inherent limitations of SAM2’s greedy selection memory design are amplified by the unique properties of surgical videos (rapid instrument movement, frequent occlusion, and complex instrument-tissue interaction), resulting in diminished performance in the segmentation of complex, long videos. To address these challenges, we introduce Memory Augmented (MA)-SAM2, a training-free video object segmentation strategy, featuring novel context-aware and occlusion-resilient memory models. MA-SAM2 exhibits strong robustness against occlusions and interactions arising from complex instrument movements while maintaining accuracy in segmenting objects throughout videos. Employing a multi-target, single-loop, one-prompt inference further enhances the efficiency of the tracking process in multi-instrument videos. Without introducing any additional parameters or requiring further training, MA-SAM2 achieved performance improvements of 4.36% and 6.1% over SAM2 on the EndoVis2017 and EndoVis2018 datasets, respectively, demonstrating its potential for practical surgical applications.
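The summary does not spell out the memory rule, but the gap it targets is clear: SAM2 keeps the most recent frames greedily, which fails under occlusion. Below is a hypothetical sketch of a context-aware alternative that skips occluded frames and keeps a mutually diverse memory bank; every name and criterion here is an assumption for illustration, not MA-SAM2's actual mechanism.

```python
import numpy as np

def select_memory(frames, sims, occluded, k=6):
    """Pick k memory frames that are (a) not occluded and (b) mutually
    diverse, instead of greedily keeping the most recent ones.
    sims: (n, n) feature-similarity matrix; occluded: per-frame flags."""
    candidates = [i for i in frames if not occluded[i]]
    if not candidates:                 # fully occluded clip: fall back to recency
        return frames[-k:]
    bank = [candidates[0]]
    while len(bank) < min(k, len(candidates)):
        # add the candidate least similar to everything already kept
        nxt = min((i for i in candidates if i not in bank),
                  key=lambda i: max(sims[i][j] for j in bank))
        bank.append(nxt)
    return bank

print(select_memory(list(range(10)), np.random.rand(10, 10), [False] * 10))
```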
[227] VRU-Accident: A Vision-Language Benchmark for Video Question Answering and Dense Captioning for Accident Scene Understanding
Younggun Kim, Ahmed S. Abdelrahman, Mohamed Abdel-Aty
Main category: cs.CV
TL;DR: The paper introduces VRU-Accident, a large-scale vision-language benchmark with 1K real-world dashcam accident videos to evaluate multimodal large language models (MLLMs) in safety-critical scenarios involving vulnerable road users like pedestrians and cyclists, revealing that current MLLMs struggle with reasoning about accident causes and prevention.
Details
Motivation: Ensuring safety of vulnerable road users (VRUs) like pedestrians and cyclists is critical for autonomous driving systems since VRU crashes often result in severe consequences. While MLLMs show promise for autonomous vehicles, there is no standardized benchmark to quantitatively evaluate their reasoning abilities in complex, safety-critical VRU scenarios.
Method: The authors created the VRU-Accident benchmark comprising 1K real-world dashcam accident videos with 6K multiple-choice question-answer pairs across six safety-critical categories (24K candidate options, 3.4K unique answers) and 1K dense scene descriptions. The benchmark focuses explicitly on VRU-vehicle accidents with rich annotations capturing spatial-temporal dynamics and causal semantics. They evaluated 17 state-of-the-art MLLMs on multiple-choice VQA and dense captioning tasks.
Result: Comprehensive evaluation of 17 state-of-the-art MLLMs revealed that while models perform reasonably well on visually grounded attributes, they face significant challenges in reasoning and describing accident causes, types, and preventability in VRU scenarios.
Conclusion: Current MLLMs have substantial limitations in safety-critical reasoning for VRU scenarios, particularly in understanding accident causality and prevention, highlighting the need for improved models and benchmarks for autonomous driving safety applications.
Abstract: Ensuring the safety of vulnerable road users (VRUs), such as pedestrians and cyclists, is a critical challenge for autonomous driving systems, as crashes involving VRUs often result in severe or fatal consequences. While multimodal large language models (MLLMs) have shown promise in enhancing scene understanding and decision making in autonomous vehicles, there is currently no standardized benchmark to quantitatively evaluate their reasoning abilities in complex, safety-critical scenarios involving VRUs. To address this gap, we present VRU-Accident, a large-scale vision-language benchmark designed to evaluate MLLMs in high-risk traffic scenarios involving VRUs. VRU-Accident comprises 1K real-world dashcam accident videos, annotated with 6K multiple-choice question-answer pairs across six safety-critical categories (with 24K candidate options and 3.4K unique answer choices), as well as 1K dense scene descriptions. Unlike prior works, our benchmark focuses explicitly on VRU-vehicle accidents, providing rich, fine-grained annotations that capture both spatial-temporal dynamics and causal semantics of accidents. To assess the current landscape of MLLMs, we conduct a comprehensive evaluation of 17 state-of-the-art models on the multiple-choice VQA task and on the dense captioning task. Our findings reveal that while MLLMs perform reasonably well on visually grounded attributes, they face significant challenges in reasoning and describing accident causes, types, and preventability.
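Benchmarks of this shape reduce to a scoring loop over multiple-choice items. The sketch below computes per-category accuracy; the item fields and model interface are hypothetical stand-ins, not the benchmark's released API.

```python
def evaluate_mcqa(model, benchmark):
    """Per-category accuracy on multiple-choice VQA items, each a dict
    with video, question, options, answer, and category fields."""
    per_category = {}
    for item in benchmark:
        pred = model(item["video"], item["question"], item["options"])
        hits_total = per_category.setdefault(item["category"], [0, 0])
        hits_total[0] += int(pred == item["answer"])
        hits_total[1] += 1
    return {cat: hits / total for cat, (hits, total) in per_category.items()}

toy = [{"video": None, "question": "Who was involved?", "category": "type",
        "options": ["pedestrian", "cyclist"], "answer": "pedestrian"}]
print(evaluate_mcqa(lambda v, q, o: o[0], toy))  # {'type': 1.0}
```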
[228] Mind the Gap: Bridging Occlusion in Gait Recognition via Residual Gap Correction
Ayush Gupta, Siyuan Huang, Rama Chellappa
Main category: cs.CV
TL;DR: RG-Gait proposes a residual learning approach for gait recognition that handles occluded sequences while maintaining performance on complete gait data, addressing the practical challenge of occlusions in person re-identification.
Details
Motivation: Current gait recognition methods fail to address practical occlusion problems, and existing occlusion-handling approaches either require impractical paired data or sacrifice performance on holistic inputs, creating a need for a solution that handles both occluded and complete gait sequences effectively.
Method: RG-Gait models occluded gait recognition as a residual learning task, treating occluded gait signatures as residual deviations from holistic gait representations. The network adaptively integrates learned residuals to correct for occlusions without requiring paired training data.
Result: The method demonstrates significant performance improvements on occluded gait sequences while maintaining accuracy on holistic recognition when evaluated on challenging datasets including Gait3D, GREW, and BRIAR.
Conclusion: Residual learning is an effective technique for tackling occluded gait recognition while retaining holistic performance, providing a practical solution for real-world gait-based person re-identification systems.
Abstract: Gait is becoming popular as a method of person re-identification because of its ability to identify people at a distance. However, most current works in gait recognition do not address the practical problem of occlusions. Among those which do, some require paired tuples of occluded and holistic sequences, which are impractical to collect in the real world. Further, these approaches work on occlusions but fail to retain performance on holistic inputs. To address these challenges, we propose RG-Gait, a method for residual correction for occluded gait recognition with holistic retention. We model the problem as a residual learning task, conceptualizing the occluded gait signature as a residual deviation from the holistic gait representation. Our proposed network adaptively integrates the learned residual, significantly improving performance on occluded gait sequences without compromising the holistic recognition accuracy. We evaluate our approach on the challenging Gait3D, GREW and BRIAR datasets and show that learning the residual can be an effective technique to tackle occluded gait recognition with holistic retention. We release our code publicly at https://github.com/Ayush-00/rg-gait.
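The residual idea translates naturally into a small gated module: a correction that is applied only when the input looks occluded, leaving holistic inputs untouched. A minimal sketch under assumed shapes follows; layer names and sizes are illustrative, not RG-Gait's architecture.

```python
import torch
import torch.nn as nn

class ResidualGait(nn.Module):
    """Occluded signature = holistic representation + gated residual."""

    def __init__(self, dim=256):
        super().__init__()
        self.residual = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(),
                                      nn.Linear(dim, dim))
        self.gate = nn.Sequential(nn.Linear(dim, 1), nn.Sigmoid())

    def forward(self, feat):
        # gate near 0 on holistic inputs (feature passes through unchanged),
        # gate near 1 on occluded inputs (residual correction applied)
        g = self.gate(feat)
        return feat + g * self.residual(feat)

emb = ResidualGait()(torch.randn(4, 256))  # 4 gait features, corrected
```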
[229] GEMINUS: Dual-aware Global and Scene-Adaptive Mixture-of-Experts for End-to-End Autonomous Driving
Chi Wan, Yixin Cui, Jiatong Du, Shuo Yang, Yulong Bai, Yanjun Huang
Main category: cs.CV
TL;DR: GEMINUS is a Mixture-of-Experts framework for end-to-end autonomous driving that combines a Global Expert trained on all data with Scene-Adaptive Experts trained on specific scenarios, using a Dual-aware Router to dynamically select experts based on driving conditions, achieving state-of-the-art performance on the Bench2Drive benchmark.
Details
Motivation: Single-mode planning methods in autonomous driving struggle to learn diversified driving skills needed to handle complex and diverse traffic environments, as they attempt to learn an overall policy that fails to adapt to specific scenarios effectively.
Method: The paper proposes GEMINUS, a Mixture-of-Experts framework consisting of: (1) a Global Expert trained on the entire dataset for robust performance, (2) Scene-Adaptive Experts trained on specific scene subsets for adaptive performance, and (3) a Dual-aware Router that considers both scenario-level features and routing uncertainty to dynamically activate appropriate expert modules.
Result: GEMINUS achieves state-of-the-art performance on the Bench2Drive closed-loop benchmark in terms of Driving Score and Success Rate using only monocular vision input. Ablation studies show significant improvements over single-expert baseline: 7.67% improvement in Driving Score, 22.06% in Success Rate, and 19.41% in MultiAbility-Mean.
Conclusion: The effective coupling of Global Expert and Scene-Adaptive Experts through the Dual-aware Router enables GEMINUS to achieve both adaptive and robust performance across diverse driving scenarios, demonstrating the effectiveness of the Mixture-of-Experts approach for end-to-end autonomous driving.
Abstract: End-to-end autonomous driving requires adaptive and robust handling of complex and diverse traffic environments. However, prevalent single-mode planning methods attempt to learn an overall policy while struggling to acquire diversified driving skills to handle diverse scenarios. Therefore, this paper proposes GEMINUS, a Mixture-of-Experts end-to-end autonomous driving framework featuring a Global Expert, a Scene-Adaptive Experts Group, and equipped with a Dual-aware Router. Specifically, the Global Expert is trained on the overall dataset, possessing robust performance. The Scene-Adaptive Experts are trained on corresponding scene subsets, achieving adaptive performance. The Dual-aware Router simultaneously considers scenario-level features and routing uncertainty to dynamically activate expert modules. Through the effective coupling of the Global Expert and the Scene-Adaptive Experts Group via the Dual-aware Router, GEMINUS achieves adaptive and robust performance in diverse scenarios. GEMINUS outperforms existing methods in the Bench2Drive closed-loop benchmark and achieves state-of-the-art performance in Driving Score and Success Rate, even with only monocular vision input. Furthermore, ablation studies demonstrate significant improvements over the original single-expert baseline: 7.67% in Driving Score, 22.06% in Success Rate, and 19.41% in MultiAbility-Mean. The code will be available at https://github.com/newbrains1/GEMINUS.
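One plausible reading of the Dual-aware Router is a soft mixture whose routing entropy decides how much to trust the scene experts versus the global fallback. The sketch below implements that reading; the shapes, names, and exact blending rule are assumptions rather than the paper's specification.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualAwareRouter(nn.Module):
    """Route to scene experts from scene features, falling back to the
    global expert when routing is uncertain (high entropy)."""

    def __init__(self, feat_dim, n_experts):
        super().__init__()
        self.router = nn.Linear(feat_dim, n_experts)
        self.n_experts = n_experts

    def forward(self, scene_feat, global_out, expert_outs):
        # scene_feat: (B, feat_dim); global_out: (B, D); expert_outs: (B, n, D)
        probs = F.softmax(self.router(scene_feat), dim=-1)
        entropy = -(probs * probs.clamp_min(1e-9).log()).sum(-1, keepdim=True)
        u = entropy / torch.log(torch.tensor(float(self.n_experts)))  # 0..1
        scene_out = torch.einsum("bn,bnd->bd", probs, expert_outs)
        return u * global_out + (1 - u) * scene_out

router = DualAwareRouter(feat_dim=64, n_experts=4)
out = router(torch.randn(2, 64), torch.randn(2, 32), torch.randn(2, 4, 32))
```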
[230] Visual-Language Model Knowledge Distillation Method for Image Quality Assessment
Yongkang Hou, Jiarun Song
Main category: cs.CV
TL;DR: This paper proposes a knowledge distillation method that transfers CLIP’s image quality assessment capabilities to smaller, more efficient models while maintaining superior performance and reducing computational complexity.
Details
Motivation: CLIP-based multimodal methods show excellent generalization in IQA tasks but suffer from excessive parameter burden and insufficient ability to identify local distorted features, limiting their practical deployment.
Method: The authors design quality-graded prompt templates to guide CLIP for quality scoring, fine-tune CLIP for enhanced IQA capabilities, and propose a modality-adaptive knowledge distillation strategy to transfer knowledge from the CLIP teacher model to architecturally advantaged student models.
Result: Experiments on multiple IQA datasets demonstrate that the proposed method significantly reduces model complexity while outperforming existing IQA methods, showing strong potential for practical deployment.
Conclusion: The knowledge distillation approach successfully addresses CLIP’s limitations in IQA by creating more efficient models that maintain superior performance, making them more suitable for real-world applications.
Abstract: Image Quality Assessment (IQA) is a core task in computer vision. Multimodal methods based on vision-language models, such as CLIP, have demonstrated exceptional generalization capabilities in IQA tasks. To address the issues of excessive parameter burden and insufficient ability to identify local distorted features in CLIP for IQA, this study proposes a visual-language model knowledge distillation method aimed at guiding the training of models with architectural advantages using CLIP’s IQA knowledge. First, quality-graded prompt templates were designed to guide CLIP to output quality scores. Then, CLIP is fine-tuned to enhance its capabilities in IQA tasks. Finally, a modality-adaptive knowledge distillation strategy is proposed to achieve guidance from the CLIP teacher model to the student model. Our experiments were conducted on multiple IQA datasets, and the results show that the proposed method significantly reduces model complexity while outperforming existing IQA methods, demonstrating strong potential for practical deployment.
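Teacher-student distillation for a regression task like IQA often reduces to a weighted sum of a ground-truth loss and a teacher-agreement loss; the sketch below shows that shape. The MSE choices and the fixed weighting are generic assumptions, not the paper's modality-adaptive strategy.

```python
import torch
import torch.nn.functional as F

def distill_loss(student_scores, teacher_scores, labels, alpha=0.5):
    """Blend of fitting human quality labels and mimicking the CLIP
    teacher's soft quality predictions (weighting is illustrative)."""
    task = F.mse_loss(student_scores, labels)        # fit ground-truth MOS
    kd = F.mse_loss(student_scores, teacher_scores)  # agree with teacher
    return alpha * task + (1 - alpha) * kd

loss = distill_loss(torch.rand(16), torch.rand(16), torch.rand(16))
```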
cs.AI
[231] From Reasoning to Super-Intelligence: A Search-Theoretic Perspective
Shai Shalev-Shwartz, Amnon Shashua
Main category: cs.AI
TL;DR: This paper introduces the Diligent Learner, a new learning paradigm that models reasoning as depth-first search with validation and backtracking to overcome limitations of existing Chain-of-Thought learning methods for large language models.
Details
Motivation: Existing Chain-of-Thought learning approaches like Supervised Fine-Tuning, Reinforcement Learning, Tree-of-Thoughts, and Monte Carlo Tree Search often fail on complex reasoning tasks due to core obstacles including distribution drift, lack of embedded search, and exponential inference costs. The theoretical foundations of learning from CoT data remain underdeveloped.
Method: The paper proposes the Diligent Learner paradigm that explicitly models reasoning as a depth-first search process guided by a validator with support for backtracking upon failure. This approach is designed to work with naturally occurring, incomplete data.
Result: Under two mild and realistic assumptions, the authors prove that the Diligent Learner can efficiently learn from Chain-of-Thought data while existing methods fail to achieve this. The framework provides a scalable approach for building reliable reasoning systems.
Conclusion: The Diligent Learner framework offers a promising path toward developing Large Reasoning Models (LRMs) with robust and interpretable problem-solving abilities, addressing key limitations of current CoT learning methods and enabling scalable reasoning system development.
Abstract: Chain-of-Thought (CoT) reasoning has emerged as a powerful tool for enhancing the problem-solving capabilities of large language models (LLMs). However, the theoretical foundations of learning from CoT data remain underdeveloped, and existing approaches – such as Supervised Fine-Tuning (SFT), Reinforcement Learning (RL), Tree-of-Thoughts (ToT), and Monte Carlo Tree Search (MCTS) – often fail on complex reasoning tasks. In this work, we identify core obstacles that hinder effective CoT learning, including distribution drift, lack of embedded search, and exponential inference costs. We introduce the Diligent Learner, a new learning paradigm that explicitly models reasoning as a depth-first search guided by a validator and supports backtracking upon failure. Under two mild and realistic assumptions, we prove that the Diligent Learner can efficiently learn from CoT data while existing methods fail to do so. This framework offers a path toward building scalable and reliable reasoning systems trained on naturally occurring, incomplete data – paving the way for the development of Large Reasoning Models (LRMs) with robust, interpretable problem-solving abilities.
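The paper defines the learner abstractly, but its control flow (depth-first expansion, validation, backtracking) is concrete enough to sketch. In the snippet below, propose, validate, and is_goal are illustrative placeholders for the model's step generator, the validator, and the task check.

```python
def diligent_search(state, propose, validate, is_goal, depth=0, max_depth=8):
    """Depth-first reasoning with a validator and backtracking: expand
    candidate steps, prune those the validator rejects, backtrack on
    dead ends. Returns a reasoning path or None."""
    if is_goal(state):
        return [state]
    if depth == max_depth:
        return None                       # dead end: trigger backtracking
    for step in propose(state):           # candidate next reasoning steps
        if not validate(state, step):     # validator prunes bad steps early
            continue
        path = diligent_search(step, propose, validate, is_goal,
                               depth + 1, max_depth)
        if path is not None:
            return [state] + path
    return None                           # all children failed: backtrack

# toy usage: reach 5 from 0 via +1/+2 steps, validator caps values at 5
print(diligent_search(0, lambda s: [s + 1, s + 2],
                      lambda s, n: n <= 5, lambda s: s == 5))
```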
[232] Purchase and Production Optimization in a Meat Processing Plant
Marek Vlk, Premysl Sucha, Jaroslaw Rudy, Radoslaw Idzikowski
Main category: cs.AI
TL;DR: This paper presents an optimization approach for meat processing companies to efficiently purchase and process materials, addressing real-world constraints like minimum order quantities and expiration dates using an iterative integer linear programming method.
Details
Motivation: The meat production industry faces escalating challenges due to the European energy crisis, making efficient use of input materials essential for profitability. Current literature neglects important real-world constraints in production optimization problems.
Method: The authors develop a simple iterative approach based on integer linear programming (ILP) that handles alternative material processing methods, stock with different expiration dates, minimum order quantities, and minimum percentage requirements in alternatives. They prove the NP-hardness of the problem with these constraints.
Result: The algorithm successfully finds optimal solutions within seconds for all real-world test cases using data from an actual meat processing company. The approach works effectively with open-source ILP solvers and mitigates numerical issues that occurred with commercial solvers.
Conclusion: The proposed iterative ILP approach effectively solves real-life meat processing optimization problems by addressing previously neglected constraints, proving both theoretically sound and practically efficient for industry implementation.
Abstract: The food production industry, especially the meat production sector, faces many challenges that have even escalated due to the recent outbreak of the energy crisis in the European Union. Therefore, efficient use of input materials is an essential aspect affecting the profit of such companies. This paper addresses an optimization problem concerning the purchase and subsequent material processing we solved for a meat processing company. Unlike the majority of existing papers, we do not concentrate on how this problem concerns supply chain management, but we focus purely on the production stage. The problem involves the concept of alternative ways of material processing, stock of material with different expiration dates, and extra constraints widely neglected in the current literature, namely, the minimum order quantity and the minimum percentage in alternatives. We prove that each of these two constraints makes the problem NP-hard, and hence we design a simple iterative approach based on integer linear programming that allows us to solve real-life instances even using an open-source integer linear programming solver. Another advantage of this approach is that it mitigates numerical issues, caused by the extensive range of data values, we experienced with a commercial solver. The results obtained using real data from the meat processing company showed that our algorithm can find the optimum solution in a few seconds for all considered use cases.
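To make the minimum-order-quantity constraint concrete: it is one of the two constraints that make the problem NP-hard, and it is typically linearized with a binary purchase indicator plus a big-M bound. A minimal PuLP sketch follows; the materials, prices, demand figure, and cost objective are made-up illustrations standing in for the paper's full model.

```python
import pulp

# Either buy nothing of a material, or buy at least its MOQ (toy data).
materials = {"shoulder": (2.5, 100), "belly": (3.1, 80)}  # price/kg, MOQ
demand = 150  # kg of input required

prob = pulp.LpProblem("purchase", pulp.LpMinimize)
qty = {m: pulp.LpVariable(f"qty_{m}", lowBound=0, cat="Integer")
       for m in materials}
buy = {m: pulp.LpVariable(f"buy_{m}", cat="Binary") for m in materials}

prob += pulp.lpSum(price * qty[m] for m, (price, _) in materials.items())
prob += pulp.lpSum(qty.values()) >= demand
for m, (_, moq) in materials.items():
    prob += qty[m] >= moq * buy[m]       # buying implies at least the MOQ
    prob += qty[m] <= 10_000 * buy[m]    # big-M: zero quantity unless bought

prob.solve(pulp.PULP_CBC_CMD(msg=False))
print({m: qty[m].value() for m in materials})
```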
[233] Differential Multimodal Transformers
Jerry Li, Timothy Oh, Joseph Hoang, Vardhit Veeramachaneni
Main category: cs.AI
TL;DR: This paper extends Differential Attention mechanism from text-only models to the text-vision model PaliGemma to reduce noise and hallucinations in multimodal contexts with limited context windows.
Details
Motivation: Small language models face challenges with limited context windows when incorporating vision modalities, as Transformer attention mechanisms often focus on irrelevant contexts, leading to noisy information retrieval and hallucinations in multimodal settings.
Method: The authors extended the Differential Attention mechanism (originally designed for text-only models) to PaliGemma, a text-vision model. They fine-tuned the PaliGemma 3B model using LoRA (Low-Rank Adaptation) with integrated Differential Attention, experimenting with various parameter settings and configurations.
Result: The experiments demonstrated that Differential Attention can be successfully adapted and integrated into existing multimodal models during fine-tuning, showing improvements in noisy information retrieval and question-answering capabilities.
Conclusion: Differential Attention mechanism can be effectively extended from text-only to text-vision models to enhance performance by mitigating noisy information retrieval and reducing hallucinations, proving its adaptability for multimodal fine-tuning applications.
Abstract: Small language models have gained significant popularity due to their efficiency and growing capabilities. However, incorporating additional modalities, such as vision, can exacerbate the challenge of limited context windows by introducing noise. Recent studies have highlighted that Transformer attention mechanisms often disproportionately focus on irrelevant contexts. In this work, we extend the Differential Attention mechanism, originally designed for text-only models, to the text-vision model PaliGemma. Our aim is to evaluate its ability to mitigate noisy information retrieval and reduce hallucinations. To this end, we fine-tuned the PaliGemma 3B model using LoRA, incorporating Differential Attention, and experimented with various parameter settings and configurations. We demonstrate that Differential Attention can be adapted and integrated into the fine-tuning of existing models to enhance noisy information retrieval and question-answering capabilities.
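Differential Attention replaces a single softmax attention map with the difference of two maps, so noise that both maps assign to irrelevant context cancels out. A single-head sketch follows; in the original Differential Transformer formulation the subtraction weight lambda is learned, and the shapes here are toy values.

```python
import torch
import torch.nn.functional as F

def differential_attention(q1, k1, q2, k2, v, lam=0.8):
    """(softmax(Q1 K1^T / sqrt(d)) - lam * softmax(Q2 K2^T / sqrt(d))) V:
    common-mode attention noise cancels in the subtraction."""
    d = q1.size(-1)
    a1 = F.softmax(q1 @ k1.transpose(-2, -1) / d**0.5, dim=-1)
    a2 = F.softmax(q2 @ k2.transpose(-2, -1) / d**0.5, dim=-1)
    return (a1 - lam * a2) @ v

# toy shapes: batch 2, 5 tokens, head dim 16
x = lambda: torch.randn(2, 5, 16)
out = differential_attention(x(), x(), x(), x(), x())
```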
[234] Advancing Responsible Innovation in Agentic AI: A study of Ethical Frameworks for Household Automation
Joydeep Chandra, Satyam Kumar Navneet
Main category: cs.AI
TL;DR: This paper surveys ethical challenges and design principles for proactive AI agents in smart homes, with special focus on protecting vulnerable users like elderly, children, and neurodivergent individuals through responsible innovation frameworks and human-centered design.
Details
Motivation: The shift from reactive to proactive autonomous AI agents in household environments creates new ethical challenges around privacy, fairness, user control, and surveillance, particularly for vulnerable populations who face higher risks of bias and privacy violations.
Method: The authors conduct a comprehensive review of responsible innovation frameworks, human-centered design principles, and governance practices, combined with analysis of social media data using Natural Language Processing to understand user needs and ethical concerns for different vulnerable groups.
Result: The study identifies key design imperatives including tailored explainability mechanisms, granular consent systems, robust override controls, and participatory design methodologies. It provides practical guidance for developing ethical smart home systems that address specific vulnerabilities of elderly, children, and neurodivergent users.
Conclusion: The research establishes both conceptual foundations and practical recommendations for creating transparent, inclusive, and trustworthy agentic AI in household automation, emphasizing the need for specialized protections and design considerations for vulnerable user populations.
Abstract: The implementation of Artificial Intelligence (AI) in household environments, especially in the form of proactive autonomous agents, brings possibilities of comfort and attentive support, but it also raises ethical challenges both within and beyond the home. This article analyzes agentic AI and its applications, focusing on its move from reactive to proactive autonomy, privacy, fairness and user control. We review responsible innovation frameworks, human-centered design principles, and governance practices to distill practical guidance for ethical smart home systems. Vulnerable user groups such as elderly individuals, children, and neurodivergent individuals, who face higher risks of surveillance, bias, and privacy violations, were studied in detail in the context of agentic AI. Design imperatives such as tailored explainability, granular consent mechanisms, and robust override controls are highlighted, supported by participatory and inclusive methodologies. The article also explores how data-driven insights, including social media analysis via Natural Language Processing (NLP), can inform specific user needs and ethical concerns. This survey aims to provide both a conceptual foundation and suggestions for developing transparent, inclusive, and trustworthy agentic AI in household automation.
[235] Why Braking? Scenario Extraction and Reasoning Utilizing LLM
Yin Wu, Daniel Slieter, Vivek Subramanian, Ahmed Abouelazm, Robin Bohn, J. Marius Zöllner
Main category: cs.AI
TL;DR: This paper proposes a Large Language Model-based framework to identify and understand safety-critical braking scenarios in driving data, moving beyond rule-based methods to enable better generalization in complex urban environments.
Details
Motivation: The increasing volume of ADAS driving data mostly captures routine behavior, making it challenging to identify safety-critical corner cases. Existing rule-based heuristics work well on highways but fail to generalize in complex urban settings, creating a need for better scenario understanding methods.
Method: The authors develop a novel framework using Large Language Models to bridge low-level numerical driving signals with natural language descriptions. They implement a dual-path scenario retrieval system that supports both category-based search for known scenarios and embedding-based retrieval for unknown out-of-distribution scenarios.
Result: Experimental evaluation on the curated Argoverse 2 Sensor Dataset annotations shows that the proposed LLM-based method outperforms rule-based baselines and demonstrates good generalization capabilities for out-of-distribution scenarios.
Conclusion: The LLM-based framework successfully addresses the limitations of rule-based approaches by providing better scenario understanding and reasoning capabilities, particularly excelling in complex urban environments and generalizing well to unknown scenarios.
Abstract: The growing number of ADAS-equipped vehicles has led to a dramatic increase in driving data, yet most of it captures routine driving behavior. Identifying and understanding safety-critical corner cases within this vast dataset remains a significant challenge. Braking events are particularly indicative of potentially hazardous situations, motivating the central question of our research: Why does a vehicle brake? Existing approaches primarily rely on rule-based heuristics to retrieve target scenarios using predefined condition filters. While effective in simple environments such as highways, these methods lack generalization in complex urban settings. In this paper, we propose a novel framework that leverages a Large Language Model (LLM) for scenario understanding and reasoning. Our method bridges the gap between low-level numerical signals and natural language descriptions, enabling the LLM to interpret and classify driving scenarios. We propose a dual-path scenario retrieval that supports both category-based search for known scenarios and embedding-based retrieval for unknown Out-of-Distribution (OOD) scenarios. To facilitate evaluation, we curate scenario annotations on the Argoverse 2 Sensor Dataset. Experimental results show that our method outperforms rule-based baselines and generalizes well to OOD scenarios.
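The dual-path retrieval is easy to picture as a small dispatcher: exact category match when the scenario type is known, embedding similarity otherwise. The sketch below assumes a flat in-memory index of (vector, category, description) triples; the real system's index, embedding model, and thresholds are not described at this level of detail.

```python
import numpy as np

def retrieve(query_vec, query_category, index, top_k=5):
    """Dual-path lookup: path 1 returns exact category matches (known
    scenario types); path 2 falls back to cosine similarity for
    unknown / out-of-distribution queries."""
    known = [desc for vec, cat, desc in index if cat == query_category]
    if known:
        return known
    scored = sorted(((float(vec @ query_vec) /
                      (np.linalg.norm(vec) * np.linalg.norm(query_vec)), desc)
                     for vec, cat, desc in index),
                    key=lambda t: t[0], reverse=True)
    return [desc for _, desc in scored[:top_k]]

index = [(np.ones(4), "cut-in", "vehicle cuts in ahead"),
         (np.array([1.0, -1.0, 1.0, -1.0]), "pedestrian", "pedestrian steps out")]
print(retrieve(np.ones(4), "unknown", index, top_k=1))
```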
[236] Re-evaluating Short- and Long-Term Trend Factors in CTA Replication: A Bayesian Graphical Approach
Eric Benhamou, Jean-Jacques Ohana, Alban Etienne, Béatrice Guez, Ethan Setrouk, Thomas Jacquot
Main category: cs.AI
TL;DR: This paper analyzes CTA trading strategies by decomposing returns into short-term trend, long-term trend, and market beta factors using Bayesian modeling to understand how different time horizons affect risk-adjusted performance.
Details
Motivation: The relative merits and interactions between short-term and long-term trend-following systems in CTA strategies remain controversial and poorly understood, despite extensive research on trend following in general.
Method: Dynamic decomposition of CTA returns using a Bayesian graphical model to separate contributions from short-term trend factors, long-term trend factors, and market beta factors.
Result: The study demonstrates how different blends of short-term and long-term trend horizons impact the overall risk-adjusted performance of CTA strategies.
Conclusion: The blend of different time horizons in trend-following strategies significantly shapes the risk-adjusted performance of CTA strategies, providing insights into optimizing the combination of short-term and long-term trend components.
Abstract: Commodity Trading Advisors (CTAs) have historically relied on trend-following rules that operate on vastly different horizons, from long-term breakouts that capture major directional moves to short-term momentum signals that thrive in fast-moving markets. Despite a large body of work on trend following, the relative merits and interactions of short- versus long-term trend systems remain controversial. This paper adds to the debate by (i) dynamically decomposing CTA returns into short-term trend, long-term trend and market beta factors using a Bayesian graphical model, and (ii) showing how the blend of horizons shapes the strategy’s risk-adjusted performance.
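The decomposition itself is a linear factor model, r_t = b_s f_short,t + b_l f_long,t + b_m f_mkt,t + eps_t. The paper infers the loadings with a Bayesian graphical model; a plain least-squares fit on synthetic data, shown below as a simplification, conveys the structure but not the paper's inference method.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 500
factors = rng.normal(size=(T, 3))       # short trend, long trend, market beta
true_b = np.array([0.4, 0.8, 0.3])      # made-up loadings
returns = factors @ true_b + 0.1 * rng.normal(size=T)

# OLS recovery of the loadings (the paper uses Bayesian inference instead)
loadings, *_ = np.linalg.lstsq(factors, returns, rcond=None)
print(dict(zip(["short_trend", "long_trend", "market_beta"],
               loadings.round(2))))
```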
[237] Out-of-Distribution Generalization in the ARC-AGI Domain: Comparing Execution-Guided Neural Program Synthesis and Test-Time Fine-Tuning
Simon Ouellette
Main category: cs.AI
TL;DR: This paper compares neural program synthesis and test-time fine-tuning (TTFT) approaches on ARC-AGI compositional generalization tasks, finding that execution-guided neural program synthesis outperforms other methods in composing novel solutions, while TTFT mainly helps by eliciting existing in-distribution knowledge from LLMs.
Details
Motivation: To evaluate and compare different approaches for compositional generalization in open-world problem domains where out-of-distribution generalization is essential, specifically using the ARC-AGI domain as a controlled experimental setting.
Method: The researchers conducted a controlled compositional generalization experiment in the ARC-AGI domain, comparing neural program synthesis (specifically execution-guided neural program synthesis) against test-time fine-tuning approaches and other reference algorithms.
Result: Execution-guided neural program synthesis demonstrated superior performance compared to all reference algorithms in its ability to compose novel solutions for out-of-distribution generalization tasks in the ARC-AGI domain.
Conclusion: Neural program synthesis, particularly when execution-guided, is more effective than test-time fine-tuning for compositional generalization in open-world domains. The study reveals that TTFT’s effectiveness on ARC-AGI is primarily due to its ability to elicit in-distribution knowledge that LLMs possess but fail to utilize directly, rather than enabling true compositional generalization.
Abstract: We run a controlled compositional generalization experiment in the ARC-AGI domain: an open-world problem domain in which the ability to generalize out-of-distribution is, by design, an essential characteristic for success. We compare neural program synthesis and test-time fine-tuning approaches on this experiment. We find that execution-guided neural program synthesis outperforms all reference algorithms in its ability to compose novel solutions. Our empirical findings also suggest that the success of TTFT on ARC-AGI lies mainly in eliciting in-distribution knowledge that the LLM otherwise fails to rely on directly.
[238] The Recursive Coherence Principle: A Formal Constraint on Scalable Intelligence, Alignment, and Reasoning Architecture
Andy E. Williams
Main category: cs.AI
TL;DR: This paper introduces the Recursive Coherence Principle (RCP), which states that intelligent systems require structural coherence across recursive reasoning processes to scale effectively, and proposes the Functional Model of Intelligence (FMI) as the only architecture capable of maintaining this coherence at any scale.
Details
Motivation: Current AI systems suffer from misalignment, hallucination, and instability issues that worsen as they scale. The authors argue these problems stem from a lack of structural coherence in recursive reasoning processes, and existing approaches focus on behavioral constraints rather than addressing the fundamental structural requirements for coherent intelligence.
Method: The paper formally defines the Recursive Coherence Principle (RCP) as a foundational constraint for reasoning systems, and introduces the Functional Model of Intelligence (FMI) as a minimal, composable architecture with specific internal functions (evaluation, modeling, adaptation, stability, decomposition, bridging) and external functions (storage, recall, System 1 and System 2 reasoning) designed to preserve semantic structure across inference layers.
Result: The authors prove that systems lacking the FMI architecture will experience recursive coherence breakdown as they scale, and demonstrate that the FMI is the only known operator capable of satisfying the RCP at any scale. They show that common AI problems like misalignment and hallucination are symptoms of structural coherence loss rather than isolated issues.
Conclusion: The work advocates for a paradigm shift in AI development from behavioral constraints to structural coherence approaches. The RCP and FMI provide a theoretical foundation for building safely generalizable and robustly coherent AI systems at scale, offering a new pathway for AI alignment that addresses the root structural causes of current AI limitations.
Abstract: Intelligence, whether biological, artificial, or collective, requires structural coherence across recursive reasoning processes to scale effectively. As complex systems grow, coherence becomes fragile unless a higher-order structure ensures semantic consistency. This paper introduces the Recursive Coherence Principle (RCP): a foundational constraint stating that for any reasoning system of order N, composed of systems operating over conceptual spaces of order N-1, semantic coherence is preserved only by a recursively evaluable generalization operator that spans and aligns those lower-order conceptual spaces. Crucially, this coherence enables structural alignment. Without recursive coherence, no system can reliably preserve goals, meanings, or reasoning consistency at scale. We formally define the Functional Model of Intelligence (FMI) as the only known operator capable of satisfying the RCP at any scale. The FMI is a minimal, composable architecture with internal functions (evaluation, modeling, adaptation, stability, decomposition, bridging) and external functions (storage, recall, System 1 and System 2 reasoning) vital for preserving semantic structure across inference and coordination layers. We prove that any system lacking the FMI will experience recursive coherence breakdown as it scales, arguing that common AI issues like misalignment, hallucination, and instability are symptoms of this structural coherence loss. Unlike other foundational principles, RCP uniquely captures the internal, recursive dynamics needed for coherent, alignable intelligence, modeling semantic coherence under recursion. This work significantly impacts AI alignment, advocating a shift from behavioral constraints to structural coherence, and offers a pathway for safely generalizable, robustly coherent AI at scale.
[239] ADEPTS: A Capability Framework for Human-Centered Agent Design
Pierluca D’Oro, Caley Drooff, Joy Chen, Joseph Tighe
Main category: cs.AI
TL;DR: This paper introduces ADEPTS, a capability framework that defines six core user-facing principles for developing human-centered AI agents, bridging the gap between technical development and user experience design.
Details
Motivation: Current guidance for AI agent development is fragmented across UX heuristics, engineering taxonomies, and ethics checklists, lacking a unified, user-facing vocabulary that defines what agents should fundamentally be able to do for human-centered design.
Method: The authors developed ADEPTS, a capability framework based on six principles for human-centered agent design that express minimal user-facing capabilities needed for agents to be understandable, controllable, and trustworthy in everyday use.
Result: ADEPTS provides a compact, actionable framework that sits at the interface between technical and experience development, offering unified guidance for AI researchers, designers, engineers, and policy reviewers.
Conclusion: ADEPTS has the potential to accelerate improvement of user-relevant agent capabilities, ease the design of user experiences, and provide a shared language for tracking and discussing AI agent development progress.
Abstract: Large language models have paved the way to powerful and flexible AI agents, assisting humans by increasingly integrating into their daily life. This flexibility, potential, and growing adoption demands a holistic and cross-disciplinary approach to developing, monitoring and discussing the capabilities required for agent-driven user experiences. However, current guidance on human-centered AI agent development is scattered: UX heuristics focus on interface behaviors, engineering taxonomies describe internal pipelines, and ethics checklists address high-level governance. There is no concise, user-facing vocabulary that tells teams what an agent should fundamentally be able to do. We introduce ADEPTS, a capability framework defining a set of core user-facing capabilities to provide unified guidance around the development of AI agents. ADEPTS is based on six principles for human-centered agent design, that express the minimal, user-facing capabilities an AI agent should demonstrate to be understandable, controllable and trustworthy in everyday use. ADEPTS complements existing frameworks and taxonomies; differently from them, it sits at the interface between technical and experience development. By presenting ADEPTS, we aim to condense complex AI-UX requirements into a compact framework that is actionable guidance for AI researchers, designers, engineers, and policy reviewers alike. We believe ADEPTS has the potential of accelerating the improvement of user-relevant agent capabilities, of easing the design of experiences that take advantage of those capabilities, and of providing a shared language to track and discuss progress around the development of AI agents.
[240] Integrating Reason-Based Moral Decision-Making in the Reinforcement Learning Architecture
Lisa Dargasz
Main category: cs.AI
TL;DR: This paper introduces Reason-Based Artificial Moral Agents (RBAMAs), which extend reinforcement learning architectures to enable AI agents to make ethical decisions through normative reasoning and case-based learning, addressing the growing need for ethical AI as autonomous systems approach real-world deployment.
Details
Motivation: As AI agents become increasingly capable and approach market readiness for real-world deployment (humanoid robots, autonomous cars), there is an urgent need to ensure these systems behave ethically. Current reinforcement learning agents lack the ability to make moral decisions, creating a gap between technical capability and ethical requirements for autonomous operation.
Method: The paper proposes RBAMAs, an extension of the reinforcement learning architecture that incorporates moral decision-making capabilities. The method involves equipping agents with the capacity to learn a “reason-theory” through case-based feedback, enabling them to process morally relevant propositions and derive moral obligations. The agents adapt their behavior to conform to these moral obligations while pursuing their designated tasks.
Result: The study presents a first implementation of an RBAMA and demonstrates its potential through initial experiments. The results show that RBAMAs can contribute to moral justifiability of actions, moral robustness, and moral trustworthiness in AI agents.
Conclusion: RBAMAs represent a concrete and deployable framework for developing Artificial Moral Agents (AMAs) that fulfills key ethical requirements. The extended architecture successfully addresses challenges at the intersection of computer science and philosophy, providing a practical solution for ethical AI development as autonomous systems transition from laboratory prototypes to real-world applications.
Abstract: Reinforcement Learning is a machine learning methodology that has demonstrated strong performance across a variety of tasks. In particular, it plays a central role in the development of artificial autonomous agents. As these agents become increasingly capable, market readiness is rapidly approaching, which means those agents, for example taking the form of humanoid robots or autonomous cars, are poised to transition from laboratory prototypes to autonomous operation in real-world environments. This transition raises concerns leading to specific requirements for these systems - among them, the requirement that they are designed to behave ethically. Crucially, research directed toward building agents that fulfill the requirement to behave ethically - referred to as artificial moral agents (AMAs) - has to address a range of challenges at the intersection of computer science and philosophy. This study explores the development of reason-based artificial moral agents (RBAMAs). RBAMAs are built on an extension of the reinforcement learning architecture to enable moral decision-making based on sound normative reasoning, which is achieved by equipping the agent with the capacity to learn a reason-theory - a theory which enables it to process morally relevant propositions to derive moral obligations - through case-based feedback. They are designed such that they adapt their behavior to ensure conformance to these obligations while they pursue their designated tasks. These features contribute to the moral justifiability of their actions, their moral robustness, and their moral trustworthiness, which recommends the extended architecture as a concrete and deployable framework for the development of AMAs that fulfills key ethical desiderata. This study presents a first implementation of an RBAMA and demonstrates the potential of RBAMAs in initial experiments.
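One hypothetical way to picture the "conform to obligations while pursuing the task" behavior is an action filter sitting between the reason-theory and the task policy; the sketch below illustrates that reading only, with all callables as placeholders rather than the paper's implementation.

```python
def moral_action(state, actions, q_value, obligations):
    """Keep only actions that satisfy every derived obligation, then pick
    the best task action among them (placeholder interfaces)."""
    permitted = [a for a in actions
                 if all(ob(state, a) for ob in obligations)]
    pool = permitted or actions        # fallback if nothing conforms (assumed)
    return max(pool, key=lambda a: q_value(state, a))

best = moral_action("s0", ["wait", "push"],
                    q_value=lambda s, a: {"wait": 0.2, "push": 0.9}[a],
                    obligations=[lambda s, a: a != "push"])
print(best)  # 'wait': the obligation vetoes the higher-value action
```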
[241] Does More Inference-Time Compute Really Help Robustness?
Tong Wu, Chong Xiang, Jiachen T. Wang, Weichen Yu, Chawin Sitawarin, Vikash Sehwag, Prateek Mittal
Main category: cs.AI
TL;DR: This paper reveals that while inference-time scaling improves robustness in reasoning LLMs when intermediate steps are hidden, it actually reduces robustness when these steps are exposed to adversaries, creating an inverse scaling law that challenges assumptions in prior work.
Details
Motivation: Prior work by Zaremba et al. showed that increasing inference-time computation improves robustness in large proprietary reasoning LLMs, but this relied on an implicit assumption that intermediate reasoning steps remain hidden from adversaries. The authors aim to examine this assumption and investigate potential security risks when it's violated.
Method: The researchers tested inference-time scaling on smaller-scale, open-source models (DeepSeek R1, Qwen3, Phi-reasoning) using a budget forcing strategy. They then systematically examined scenarios where intermediate reasoning steps are accessible to adversaries, analyzing the impact on model robustness across different adversarial settings.
Result: The study confirms that smaller open-source models benefit from inference-time scaling when reasoning is hidden. However, when intermediate reasoning steps become accessible to adversaries, increased inference-time computation consistently reduces model robustness, following an inverse scaling law. The authors also identified practical vulnerabilities in models with tool-integrated reasoning and advanced reasoning extraction attacks.
Conclusion: The robustness benefits of inference-time scaling are heavily dependent on the adversarial setting and deployment context. The assumption that intermediate reasoning remains hidden is critical but often unrealistic in practice. Practitioners should carefully consider these trade-offs before implementing inference-time scaling in security-sensitive real-world applications.
Abstract: Recently, Zaremba et al. demonstrated that increasing inference-time computation improves robustness in large proprietary reasoning LLMs. In this paper, we first show that smaller-scale, open-source models (e.g., DeepSeek R1, Qwen3, Phi-reasoning) can also benefit from inference-time scaling using a simple budget forcing strategy. More importantly, we reveal and critically examine an implicit assumption in prior work: intermediate reasoning steps are hidden from adversaries. By relaxing this assumption, we identify an important security risk, intuitively motivated and empirically verified as an inverse scaling law: if intermediate reasoning steps become explicitly accessible, increased inference-time computation consistently reduces model robustness. Finally, we discuss practical scenarios where models with hidden reasoning chains are still vulnerable to attacks, such as models with tool-integrated reasoning and advanced reasoning extraction attacks. Our findings collectively demonstrate that the robustness benefits of inference-time scaling depend heavily on the adversarial setting and deployment context. We urge practitioners to carefully weigh these subtle trade-offs before applying inference-time scaling in security-sensitive, real-world applications.
[242] Micromobility Flow Prediction: A Bike Sharing Station-level Study via Multi-level Spatial-Temporal Attention Neural Network
Xi Yang, Jiachen Wang, Song Han, Suining He
Main category: cs.AI
TL;DR: BikeMAN is a multi-level spatio-temporal attention neural network that predicts bike station-level traffic for entire bike sharing systems, achieving high accuracy on NYC bike sharing data with over 10 million trips across 700+ stations.
Details
Motivation: Urban bike sharing systems face challenges with unbalanced station-level demand and supply, making maintenance difficult. Existing prediction methods struggle with the spatial-temporal complexity and large scale of bike sharing systems with numerous stations, creating a need for better station-level traffic prediction across entire systems.
Method: The authors propose BikeMAN, a multi-level spatio-temporal attention neural network with an encoder-decoder architecture. The network incorporates two attention mechanisms: one for capturing spatial correlations between bike station features across the system, and another for modeling temporal characteristics of bike station traffic patterns.
Result: The network demonstrated high accuracy in predicting bike station traffic for all stations in New York City through experimental validation on over 10 million bike sharing trips across more than 700 stations.
Conclusion: BikeMAN successfully addresses the challenge of station-level bike traffic prediction for entire bike sharing systems by effectively modeling both spatial and temporal dependencies, showing promising results for improving bike sharing system efficiency and maintenance.
Abstract: Efficient use of urban micromobility resources such as bike sharing is challenging due to the unbalanced station-level demand and supply, which makes the maintenance of bike sharing systems painstaking. Prior efforts have been made on accurate prediction of bike traffic, i.e., demand/pick-up and return/drop-off, to achieve system efficiency. However, bike station-level traffic prediction is difficult because of the spatial-temporal complexity of bike sharing systems. Moreover, such a level of prediction over entire bike sharing systems is also challenging due to the large number of bike stations. To fill this gap, we propose BikeMAN, a multi-level spatio-temporal attention neural network to predict station-level bike traffic for entire bike sharing systems. The proposed network consists of an encoder and a decoder with an attention mechanism representing the spatial correlation between features of bike stations in the system and another attention mechanism describing the temporal characteristic of bike station traffic. Through experimental study on over 10 million trips of bike sharing systems (> 700 stations) of New York City, our network showed high accuracy in predicting the bike station traffic of all stations in the city.
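The two attention levels can be sketched as standard multi-head attention applied first across stations (spatial) and then across time steps (temporal). Dimensions, head counts, and the output head below are illustrative assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

class BikeAttention(nn.Module):
    """Spatial attention over stations, then temporal attention over
    time steps, ending in a per-station traffic prediction."""

    def __init__(self, dim=32, heads=4):
        super().__init__()
        self.spatial = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.head = nn.Linear(dim, 2)   # predict pick-ups and drop-offs

    def forward(self, x):               # x: (batch, time, stations, dim)
        b, t, s, d = x.shape
        xs = x.reshape(b * t, s, d)
        xs, _ = self.spatial(xs, xs, xs)            # attend across stations
        xt = xs.reshape(b, t, s, d).permute(0, 2, 1, 3).reshape(b * s, t, d)
        xt, _ = self.temporal(xt, xt, xt)           # attend across time
        return self.head(xt[:, -1]).reshape(b, s, 2)

pred = BikeAttention()(torch.randn(2, 12, 10, 32))  # next-step traffic
```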
[243] From Logic to Language: A Trust Index for Problem Solving with LLMs
Tehseen Rug, Felix Böhmer, Tessa Pfattheicher
Main category: cs.AI
TL;DR: This paper introduces a unified framework to distinguish between classical formal computation and LLM-based natural language problem-solving, proposing new metrics to evaluate the quality of solutions in ambiguous, subjective domains that LLMs can address but classical computation cannot.
Details
Motivation: Classical computation excels at problems with unambiguous rules but cannot address the vast domain of human problems characterized by ambiguity, dynamic environments, and subjective context. LLMs represent a paradigm shift by enabling computational engagement with previously inaccessible problem domains through natural language.
Method: The authors develop a unified framework that defines and delineates problem spaces addressable by formal versus natural language. They introduce a vector-valued trust index Q to distinguish binary correctness of formal solutions from continuous adequacy of natural language solutions. Two statistical quality dimensions are proposed: normalized bi-semantic entropy (measuring robustness and conceptual diversity) and emotional valence (mapping subjective valuation to quantifiable metrics).
Result: The framework successfully contrasts formal and natural language problem-solving paradigms, establishing that formal language solutions use binary quality measures while natural language solutions require nuanced evaluation accounting for vagueness, subjectivity, and ambiguity. The proposed metrics provide quantifiable ways to assess LLM solution quality in subjective domains.
Conclusion: The introduced framework and metrics provide a more rigorous understanding of LLM capabilities, limitations, and the fundamental nature of problem-solving in the age of large language models, enabling better evaluation of solutions in ambiguous and subjective problem domains.
Abstract: Classical computation, grounded in formal, logical systems, has been the engine of technological progress for decades, excelling at problems that can be described with unambiguous rules. This paradigm, however, leaves a vast ocean of human problems – those characterized by ambiguity, dynamic environments, and subjective context – largely untouched. The advent of Large Language Models (LLMs) represents a fundamental shift, enabling computational systems to engage with this previously inaccessible domain using natural language. This paper introduces a unified framework to understand and contrast these problem-solving paradigms. We define and delineate the problem spaces addressable by formal languages versus natural language. While solutions to the former problem class can be evaluated using binary quality measures, the latter requires a much more nuanced definition of approximate solution space taking into account the vagueness, subjectivity and ambiguity inherent to natural language. We therefore introduce a vector-valued trust index Q, which reflects solution quality and distinguishes the binary correctness of formal solutions from the continuous adequacy spectrum characteristic of natural language solutions. Within this framework, we propose two statistical quality dimensions. Normalized bi-semantic entropy measures robustness and conceptual diversity of LLM answers given semantic variation in problem formulations. Emotional valence maps subjective valuation of a solution to a quantifiable metric that can be maximized by invoking statistical measures. The concepts introduced in this work will provide a more rigorous understanding of the capabilities, limitations, and inherent nature of problem-solving in the age of LLMs.
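The paper's exact definition of normalized bi-semantic entropy is not reproduced in this summary; a plausible backbone is a normalized entropy over semantic clusters of answers sampled under paraphrased problem formulations, sketched below. The clustering function and normalization choice are assumptions.

```python
import math
from collections import Counter

def normalized_semantic_entropy(answers, meaning_of):
    """Entropy over answer meanings, normalized to [0, 1]: low when
    sampled answers agree semantically, high when they scatter.
    `meaning_of` (a semantic equivalence function) is a placeholder."""
    counts = Counter(meaning_of(a) for a in answers)
    n = sum(counts.values())
    h = -sum((c / n) * math.log(c / n) for c in counts.values())
    max_h = math.log(len(counts)) if len(counts) > 1 else 1.0
    return h / max_h

# toy usage: three samples collapse into two meanings
print(normalized_semantic_entropy(
    ["yes", "yep", "no"], meaning_of=lambda a: a.startswith("y")))
```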
[244] A Unifying Framework for Semiring-Based Constraint Logic Programming With Negation (full version)
Jeroen Spaans, Jesse Heyninck
Main category: cs.AI
TL;DR: This paper presents a unifying extension of Constraint Logic Programming (CLP) that allows negation in clause bodies while using semiring abstractions to generalize various existing CLP extensions like fuzzy constraints and uncertainty handling.
Details
Motivation: Existing CLP extensions (fuzzy constraint satisfaction, uncertainty, negation) use different semiring abstractions but none allow negation in clause bodies. There's a need for a unifying framework that captures these approaches while supporting more expressive language constructs.
Method: The authors extend CLP using semiring abstractions as a unifying mechanism and employ approximation fixpoint theory to provide semantics for programs with negation in clause bodies. They analyze how different semiring properties impact the resulting semantics.
Result: The paper successfully develops a unifying framework that encompasses existing CLP extensions while enabling negation in clause bodies. They provide detailed analysis of how semiring properties affect the semantics of such programs.
Conclusion: The work provides a comprehensive unifying framework for CLP extensions that allows for more expressive programming constructs (negation in bodies) while maintaining the theoretical foundations through semiring abstractions and approximation fixpoint theory.
Abstract: Constraint Logic Programming (CLP) is a logic programming formalism used to solve problems requiring the consideration of constraints, like resource allocation and automated planning and scheduling. It has previously been extended in various directions, for example to support fuzzy constraint satisfaction, uncertainty, or negation, with different notions of semiring being used as a unifying abstraction for these generalizations. None of these extensions have studied clauses with negation allowed in the body. We investigate an extension of CLP which unifies many of these extensions and allows negation in the body. We provide semantics for such programs, using the framework of approximation fixpoint theory, and give a detailed overview of the impacts of properties of the semirings on the resulting semantics. As such, we provide a unifying framework that captures existing approaches and allows extending them with a more expressive language.
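The semiring abstraction is what lets one framework cover boolean, fuzzy, and probabilistic CLP at once: 'plus' combines alternative derivations of a goal and 'times' combines conjoined subgoals. A minimal interface with standard textbook instances follows; these instances illustrate the abstraction and are not ones defined in the paper.

```python
from dataclasses import dataclass
from typing import Any, Callable

@dataclass(frozen=True)
class Semiring:
    """Minimal semiring interface for unifying CLP extensions."""
    plus: Callable[[Any, Any], Any]   # combine alternative derivations
    times: Callable[[Any, Any], Any]  # combine conjoined subgoals
    zero: Any                         # identity of plus
    one: Any                          # identity of times

boolean = Semiring(lambda a, b: a or b, lambda a, b: a and b, False, True)
fuzzy = Semiring(max, min, 0.0, 1.0)            # fuzzy constraint satisfaction
prob = Semiring(lambda a, b: a + b, lambda a, b: a * b, 0.0, 1.0)

# two conjoined subgoal scores, then an alternative derivation:
print(fuzzy.plus(fuzzy.times(0.7, 0.9), 0.4))   # -> 0.7
```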
[245] Expert-Guided LLM Reasoning for Battery Discovery: From AI-Driven Hypothesis to Synthesis and Characterization
Shengchao Liu, Hannan Xu, Yan Ai, Huanxin Li, Yoshua Bengio, Harry Guo
Main category: cs.AI
TL;DR: Researchers developed ChatBattery, an AI framework that uses large language models with chain-of-thought reasoning and domain knowledge to discover new battery materials, successfully identifying three novel lithium-ion cathode materials with 18-29% capacity improvements over existing materials.
Details
Motivation: While LLMs have shown strong reasoning capabilities in math and coding, their potential for domain-specific applications like materials discovery remains largely unexplored. The researchers were inspired by the idea that reasoning mirrors guided search and wanted to integrate domain knowledge to improve LLM reasoning for battery materials design.
Method: The researchers introduced ChatBattery, an agentic framework that integrates domain knowledge to guide LLMs toward more effective reasoning in materials design. The framework uses chain-of-thought techniques combined with battery-specific domain expertise to steer the reasoning process for materials discovery.
Result: ChatBattery successfully identified, synthesized, and characterized three novel lithium-ion battery cathode materials that achieved practical capacity improvements of 28.8%, 25.2%, and 18.5% respectively over the widely used NMC811 cathode material. The framework demonstrated a complete AI-driven cycle from design to synthesis to characterization.
Conclusion: ChatBattery demonstrates the transformative potential of AI-driven reasoning in revolutionizing materials discovery by establishing a successful LLM-driven and reasoning-based platform for battery materials invention. This work paves a new path for applying advanced AI reasoning capabilities to domain-specific scientific applications beyond traditional math and coding problems.
Abstract: Large language models (LLMs) leverage chain-of-thought (CoT) techniques to tackle complex problems, representing a transformative breakthrough in artificial intelligence (AI). However, their reasoning capabilities have primarily been demonstrated in solving math and coding problems, leaving their potential for domain-specific applications-such as battery discovery-largely unexplored. Inspired by the idea that reasoning mirrors a form of guided search, we introduce ChatBattery, a novel agentic framework that integrates domain knowledge to steer LLMs toward more effective reasoning in materials design. Using ChatBattery, we successfully identify, synthesize, and characterize three novel lithium-ion battery cathode materials, which achieve practical capacity improvements of 28.8%, 25.2%, and 18.5%, respectively, over the widely used cathode material, LiNi0.8Mn0.1Co0.1O2 (NMC811). Beyond this discovery, ChatBattery paves a new path by showing a successful LLM-driven and reasoning-based platform for battery materials invention. This complete AI-driven cycle-from design to synthesis to characterization-demonstrates the transformative potential of AI-driven reasoning in revolutionizing materials discovery.
[246] TaxCalcBench: Evaluating Frontier Models on the Tax Calculation Task
Michael R. Bock, Kara Molisee, Zachary Ozer, Sumit Shah
Main category: cs.AI
TL;DR: This paper introduces TaxCalcBench, a benchmark to evaluate AI models’ ability to calculate US personal income taxes, finding that state-of-the-art models succeed in less than one-third of cases due to errors in tax table usage, calculations, and eligibility determination.
Details
Motivation: To assess whether AI can handle the complex task of calculating US personal income taxes, which requires understanding vast amounts of English text and performing careful computations - a practical question of "Can AI file your taxes?"
Method: The authors developed TaxCalcBench, a benchmark dataset for evaluating models’ abilities to calculate personal income tax returns when provided with all necessary information, and tested state-of-the-art AI models on this simplified sample set.
Result: State-of-the-art models succeeded in calculating less than one-third of federal income tax returns correctly. The analysis revealed that models consistently make three types of errors: misusing tax tables, making calculation errors, and incorrectly determining eligibility.
Conclusion: The findings demonstrate that current AI models are inadequate for personal income tax calculation tasks and highlight the need for additional infrastructure to effectively apply Large Language Models (LLMs) to tax computation.
Abstract: Can AI file your taxes? Not yet. Calculating US personal income taxes is a task that requires building an understanding of vast amounts of English text and using that knowledge to carefully compute results. We propose TaxCalcBench, a benchmark for determining models’ abilities to calculate personal income tax returns given all of the necessary information. Our experiment shows that state-of-the-art models succeed in calculating less than a third of federal income tax returns even on this simplified sample set. Our analysis concludes that models consistently misuse tax tables, make errors in tax calculation, and incorrectly determine eligibility. Our findings point to the need for additional infrastructure to apply LLMs to the personal income tax calculation task.
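The benchmark's exact schema is not given in the abstract; a minimal all-or-nothing scoring harness in the spirit of "calculating a return correctly" might look like the following, where the line identifiers and dollar amounts are hypothetical placeholders:

```python
def score_return(expected: dict, computed: dict, tol_dollars: float = 1.0) -> bool:
    """A return counts as correct only if every line matches within tolerance,
    mirroring the all-or-nothing nature of a filed tax form."""
    return all(
        abs(computed.get(line, float("nan")) - amount) <= tol_dollars
        for line, amount in expected.items()
    )

# Hypothetical example: line names and amounts are placeholders.
expected = {"agi": 72_450.0, "taxable_income": 58_600.0, "total_tax": 8_253.0}
computed = {"agi": 72_450.0, "taxable_income": 58_600.0, "total_tax": 8_249.0}
print(score_return(expected, computed))  # False: a 4-dollar error fails the return
```

This all-or-nothing criterion explains why small tax-table lookups and eligibility mistakes, the error modes the paper identifies, translate into such low overall success rates.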
[247] SpiroLLM: Finetuning Pretrained LLMs to Understand Spirogram Time Series with Clinical Validation in COPD Reporting
Shuhao Mei, Yongchao Long, Shan Cao, Xiaobo Han, Shijia Geng, Jinbo Sun, Yuxi Zhou, Shenda Hong
Main category: cs.AI
TL;DR: SpiroLLM is the first multimodal large language model that can understand spirogram data for COPD diagnosis, combining respiratory curve analysis with text generation to provide interpretable diagnostic reports with 0.8980 AUROC performance.
Details
Motivation: Current AI models for COPD diagnosis lack interpretability and only provide classification results without rationale, while existing LLMs cannot understand spirogram data, limiting clinical trust and adoption for respiratory disease diagnosis.
Method: The authors developed SpiroLLM using 234,028 individuals from UK Biobank, incorporating a SpiroEncoder to extract morphological features from respiratory curves, a SpiroProjector to align spirogram features with PFT numerical values in a unified latent space, and integrating these with a large language model to generate comprehensive diagnostic reports.
Result: SpiroLLM achieved diagnostic AUROC of 0.8980 (95% CI: 0.8820-0.9132) and demonstrated superior robustness with 100% valid response rate when core data was missing, compared to only 13.4% for text-only models, showcasing the advantage of multimodal design.
Conclusion: This work establishes a new paradigm for interpretable clinical decision support by demonstrating the potential of integrating physiological signals with large language models, creating the first LLM capable of understanding spirogram data for reliable COPD diagnosis.
Abstract: Chronic Obstructive Pulmonary Disease (COPD), a major chronic respiratory disease with persistent airflow limitation, is a leading global cause of disability and mortality. Respiratory spirogram time series, routinely collected during pulmonary function tests (PFTs), play a critical role in the early detection of respiratory diseases and in monitoring lung function over time. However, most current AI models for COPD diagnosis are limited to outputting classification results without providing a rationale for their diagnostic process, while current Large Language Models (LLMs) cannot understand spirograms yet, which severely limits their clinical trust and adoption. To tackle this challenge, we leverage a cohort of 234,028 individuals from the UK Biobank (UKB) to propose SpiroLLM, the first multimodal large language model that can understand spirograms. The model extracts morphological features from respiratory curves via a SpiroEncoder and aligns them with PFT numerical values in a unified latent space using a SpiroProjector, ultimately empowering a large language model to generate a comprehensive diagnostic report. Experimental results confirm that SpiroLLM achieved a diagnostic AUROC of 0.8980 (95% CI: 0.8820-0.9132). In a robustness test with missing core data, it maintained a 100% valid response rate, far surpassing the 13.4% of a text-only model and showcasing the superiority of its multimodal design. This work demonstrates the substantial potential of deeply fusing physiological signals with large language models, establishing a new paradigm for the next generation of interpretable and reliable clinical decision support tools.
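A minimal sketch of how a SpiroEncoder and SpiroProjector could be wired together, assuming a 1-D convolutional encoder and a linear projector into the LLM embedding space; the layer choices and dimensions are guesses for illustration, not the paper's architecture:

```python
import torch
import torch.nn as nn

class SpiroEncoder(nn.Module):
    """Extracts morphological features from a 1-D spirogram curve."""
    def __init__(self, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(1, 32, kernel_size=7, stride=2, padding=3), nn.GELU(),
            nn.Conv1d(32, 64, kernel_size=5, stride=2, padding=2), nn.GELU(),
            nn.AdaptiveAvgPool1d(1), nn.Flatten(), nn.Linear(64, hidden),
        )
    def forward(self, curve):          # curve: (batch, 1, time)
        return self.net(curve)         # (batch, hidden)

class SpiroProjector(nn.Module):
    """Aligns curve features and PFT numbers in the LLM's embedding space."""
    def __init__(self, hidden: int = 128, n_pft: int = 4, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Linear(hidden + n_pft, llm_dim)
    def forward(self, feats, pft_values):
        return self.proj(torch.cat([feats, pft_values], dim=-1))  # (batch, llm_dim)

enc, proj = SpiroEncoder(), SpiroProjector()
tokens = proj(enc(torch.randn(2, 1, 600)), torch.randn(2, 4))
print(tokens.shape)  # torch.Size([2, 4096]): one soft token per spirogram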
[248] Emergent Cognitive Convergence via Implementation: A Structured Loop Reflecting Four Theories of Mind (A Position Paper)
Myung Ho Kim
Main category: cs.AI
TL;DR: Researchers discovered that their practical AI agent architecture called Agentic Flow unintentionally converged with four major theories of mind (Kahneman’s dual-system, Friston’s predictive processing, Minsky’s society of mind, and Clark’s extended mind), achieving 95.8% task success compared to 62.3% for baseline LLMs, and introduced PEACE as a meta-architecture to describe these emergent design patterns.
Details
Motivation: The paper explores how practical AI system design can naturally converge with established cognitive theories without intentional top-down theoretical implementation, addressing limitations in current large language models while examining the emergence of theoretical structures through real-world implementation demands.
Method: The researchers designed Agentic Flow, a five-module AI architecture (Retrieval, Cognition, Control, Memory, Action) arranged in a recurrent cognitive loop, conducted comparative experiments with baseline LLM agents on multi-step reasoning tasks, and performed retrospective analysis to identify convergences with cognitive theories, culminating in the introduction of PEACE as a descriptive meta-architecture.
Result: Agentic Flow achieved 95.8% task success rate with strong constraint adherence compared to 62.3% success rate for baseline systems, and demonstrated structural alignment with computational motifs from four influential theories of mind including predictive modeling, associative recall, and error-sensitive control mechanisms.
Conclusion: The study suggests that theoretical structures in cognitive science may emerge naturally through practical design choices rather than requiring explicit theoretical implementation, proposing that real-world implementation demands can surface latent structural echoes of cognitive theory without necessitating theoretical unification.
Abstract: We report the discovery of a structural convergence across four influential theories of mind: Kahneman’s dual-system theory, Friston’s predictive processing, Minsky’s society of mind, and Clark’s extended mind, emerging unintentionally within a practical AI agent architecture called Agentic Flow. Designed to address limitations in large language models (LLMs), Agentic Flow comprises five interdependent modules (Retrieval, Cognition, Control, Memory, and Action) arranged in a recurrent cognitive loop. Although originally inspired only by Minsky and Clark, the system’s structure retrospectively aligns with computational motifs found in all four theories, including predictive modeling, associative recall, and error-sensitive control. To assess this convergence, we conducted comparative experiments with baseline LLM agents on multi-step reasoning tasks. The structured agent achieved 95.8% task success and exhibited strong constraint adherence, while the baseline system succeeded 62.3% of the time. These results were not aimed at proving superiority, but at illustrating how theoretical structures may emerge through practical design choices rather than top-down theory. We introduce PEACE as a descriptive meta-architecture that captures design-level regularities observed in Agentic Flow. Not intended as a new theory, PEACE provides a shared vocabulary for understanding architectures shaped by real-world implementation demands. This paper should be read as a position paper - an exploratory reflection on how implementation can surface latent structural echoes of cognitive theory, without asserting theoretical unification.
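Since the paper specifies the five modules and their recurrent ordering but not their internals, a skeletal version of the loop might look like this, with the module bodies as placeholders:

```python
# A minimal sketch of the five-module recurrent loop; each module is a
# state -> state callable, since the paper only names the modules and order.
def agentic_flow(task, modules, max_cycles: int = 8):
    state = {"task": task, "memory": [], "done": False, "answer": None}
    for _ in range(max_cycles):
        state = modules["retrieval"](state)   # fetch relevant context
        state = modules["cognition"](state)   # predict / reason over context
        state = modules["control"](state)     # check constraints, detect errors
        state = modules["memory"](state)      # store intermediate results
        state = modules["action"](state)      # act, possibly terminating
        if state["done"]:
            break
    return state["answer"]

noop = lambda s: s
modules = {k: noop for k in ["retrieval", "cognition", "control", "memory"]}
modules["action"] = lambda s: {**s, "done": True, "answer": "stub"}
print(agentic_flow("navigate maze", modules))  # "stub"
```

The retrospective mapping the paper draws is at this structural level: the control step echoes error-sensitive prediction, the memory step associative recall, and so on.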
[249] CHIMERA: Compressed Hybrid Intelligence for Twin-Model Enhanced Multi-Agent Deep Reinforcement Learning for Multi-Functional RIS-Assisted Space-Air-Ground Integrated Networks
Li-Hsiang Shen, Jyun-Jhe Huang
Main category: cs.AI
TL;DR: This paper proposes a space-air-ground integrated network (SAGIN) architecture enhanced with multi-functional reconfigurable intelligent surfaces (MF-RIS) that can reflect, amplify, and harvest energy simultaneously. A novel CHIMERA deep reinforcement learning framework is developed to optimize the system parameters for maximum energy efficiency, addressing energy shortages in LEO satellites operating in shadowed regions.
Details
Motivation: To address energy shortages of low-Earth orbit (LEO) satellites operating in shadowed regions while maximizing long-term energy efficiency in integrated space-air-ground networks. Traditional approaches lack the capability to simultaneously handle communication, computing energy consumption, and energy harvesting across multiple network layers.
Method: The authors propose a SAGIN architecture empowered by multi-functional reconfigurable intelligent surfaces (MF-RIS) with simultaneous reflection, amplification, and energy harvesting capabilities. They develop a compressed hybrid intelligence for twin-model enhanced multi-agent deep reinforcement learning (CHIMERA) framework that integrates semantic state-action compression and parametrized sharing under hybrid reinforcement learning to jointly optimize MF-RIS parameters (signal amplification, phase-shifts, energy harvesting ratio, active element selection) and SAGIN parameters (beamforming vectors, HAPS deployment, user association, computing capability).
Result: Simulation results demonstrate that the proposed CHIMERA scheme substantially outperforms conventional benchmarks including fixed-configuration or non-harvesting MF-RIS, traditional RIS, no-RIS cases, and centralized/multi-agent deep reinforcement learning baselines in terms of energy efficiency. The SAGIN-MF-RIS architecture achieves superior energy efficiency performance through complementary coverage, offering notable advantages over standalone satellite, aerial, or ground-only deployments.
Conclusion: The proposed SAGIN architecture with MF-RIS and CHIMERA optimization framework successfully addresses energy efficiency challenges in integrated networks, particularly for LEO satellites in shadowed regions. The multi-functional capabilities of MF-RIS combined with advanced deep reinforcement learning optimization significantly outperform existing approaches, demonstrating the effectiveness of integrated space-air-ground networks with intelligent surface technologies.
Abstract: A space-air-ground integrated network (SAGIN) architecture is proposed, empowered by multi-functional reconfigurable intelligent surfaces (MF-RIS) capable of simultaneously reflecting, amplifying, and harvesting wireless energy. The MF-RIS plays a pivotal role in addressing the energy shortages of low-Earth orbit (LEO) satellites operating in shadowed regions, while explicitly accounting for both communication and computing energy consumption across the SAGIN nodes. To maximize the long-term energy efficiency (EE), we formulate a joint optimization problem over the MF-RIS parameters, including signal amplification, phase-shifts, energy harvesting ratio, and active element selection as well as the SAGIN parameters of beamforming vectors, high-altitude platform station (HAPS) deployment, user association, and computing capability. The formulated problem is highly non-convex and non-linear and contains mixed discrete-continuous parameters. To tackle this, we conceive a compressed hybrid intelligence for twin-model enhanced multi-agent deep reinforcement learning (CHIMERA) framework, which integrates semantic state-action compression and parametrized sharing under hybrid reinforcement learning to efficiently explore suitable complex actions. The simulation results have demonstrated that the proposed CHIMERA scheme substantially outperforms the conventional benchmarks, including fixed-configuration or non-harvesting MF-RIS, traditional RIS, and no-RIS cases, as well as centralized and multi-agent deep reinforcement learning baselines in terms of the highest EE. Moreover, the proposed SAGIN-MF-RIS architecture achieves superior EE performance due to its complementary coverage, offering notable advantages over either standalone satellite, aerial, or ground-only deployments.
[250] Distilled Large Language Model in Confidential Computing Environment for System-on-Chip Design
Dong Ben, Hui Feng, Qian Wang
Main category: cs.AI
TL;DR: This paper evaluates the performance of Large Language Models (LLMs) in Trusted Execution Environments (TEEs) for circuit design tasks, finding that distilled and quantized models can run efficiently in secure environments like Intel TDX, with up to 3x performance gains from quantization.
Details
Motivation: LLMs used in circuit design contain confidential intellectual property that needs protection, but existing TEE implementations are not optimized for resource-intensive LLM workloads. There's a need to evaluate how well LLMs can perform within secure computing environments while maintaining acceptable performance.
Method: Comprehensive evaluation of LLMs in three different environments: TEE-based (Intel TDX), CPU-only, and CPU-GPU hybrid implementations. Performance measured in tokens per second across different model types including distilled models (DeepSeek) and quantized models (4-bit Q4 and 8-bit Q8). Validation conducted using a testbench designed for SoC design tasks.
Result: Distilled models like DeepSeek outperform other models due to smaller parameter counts. Quantized models (Q4 and Q8) achieve up to 3x performance improvement compared to FP16 models. For models with fewer parameters, such as DeepSeek-r1-1.5B, the TDX implementation outperforms CPU-only versions in secure execution environments.
Conclusion: The study demonstrates the feasibility of efficiently deploying lightweight LLMs on resource-constrained systems within secure environments for semiconductor CAD applications. The combination of model distillation and quantization techniques enables practical deployment of confidential LLMs in TEE environments.
Abstract: Large Language Models (LLMs) are increasingly used in circuit design tasks and have typically undergone multiple rounds of training. Both the trained models and their associated training data are considered confidential intellectual property (IP) and must be protected from exposure. Confidential Computing offers a promising solution to protect data and models through Trusted Execution Environments (TEEs). However, existing TEE implementations are not designed to support the resource-intensive nature of LLMs efficiently. In this work, we first present a comprehensive evaluation of the LLMs within a TEE-enabled confidential computing environment, specifically utilizing Intel Trust Domain Extensions (TDX). We constructed experiments on three environments: TEE-based, CPU-only, and CPU-GPU hybrid implementations, and evaluated their performance in terms of tokens per second. Our first observation is that distilled models, i.e., DeepSeek, surpass other models in performance due to their smaller parameters, making them suitable for resource-constrained devices. Also, in the quantized models such as 4-bit quantization (Q4) and 8-bit quantization (Q8), we observed a performance gain of up to 3x compared to FP16 models. Our findings indicate that for fewer parameter sets, such as DeepSeek-r1-1.5B, the TDX implementation outperforms the CPU version in executing computations within a secure environment. We further validate the results using a testbench designed for SoC design tasks. These validations demonstrate the potential of efficiently deploying lightweight LLMs on resource-constrained systems for semiconductor CAD applications.
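Tokens per second is the paper's headline metric; a runtime-agnostic timing harness, with `generate` standing in for whichever deployment (TDX enclave, CPU-only, or CPU-GPU hybrid) is under test, could look like the following sketch:

```python
import time

def tokens_per_second(generate, prompt: str, n_tokens: int = 128) -> float:
    """Times one generation call and returns throughput.

    `generate` is a placeholder for the runtime under test, e.g. the same
    quantized model served inside a TDX enclave vs. on the bare host.
    """
    start = time.perf_counter()
    produced = generate(prompt, max_new_tokens=n_tokens)  # returns token count
    return produced / (time.perf_counter() - start)

# Dummy runtime so the harness runs end to end; swap in a real model call.
def fake_generate(prompt, max_new_tokens):
    time.sleep(0.05)  # stand-in for enclave inference latency
    return max_new_tokens

print(f"{tokens_per_second(fake_generate, 'design a FIFO in Verilog'):.1f} tok/s")
```

Running the same harness against FP16, Q8, and Q4 builds of one model is enough to reproduce the shape of the paper's comparison, if not its absolute numbers.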
[251] Voice-based AI Agents: Filling the Economic Gaps in Digital Health Delivery
Bo Wen, Chen Wang, Qiwei Han, Raquel Norel, Julia Liu, Thaddeus Stappenbeck, Jeffrey L. Rogers
Main category: cs.AI
TL;DR: This paper presents Agent PULSE, an LLM-powered voice assistant for healthcare that demonstrated 70% patient acceptance in a pilot study with IBD patients, showing potential for cost-effective preventive care and monitoring in underserved populations while addressing technical and policy challenges.
Details
Motivation: To bridge economic and accessibility gaps in digital health delivery by leveraging voice-based AI agents for preventive care and continuous patient monitoring, particularly targeting underserved populations where human intervention is economically unfeasible.
Method: Development and pilot study of Agent PULSE (Patient Understanding and Liaison Support Engine) through collaboration between IBM Research, Cleveland Clinic Foundation, and Morehouse School of Medicine, including economic modeling and testing with 33 inflammatory bowel disease patients to assess acceptance and preferences.
Result: 70% of patients expressed acceptance of AI-driven monitoring, with 37% preferring it over traditional modalities. The study identified technical challenges in real-time conversational AI processing, healthcare system integration, and privacy compliance, while demonstrating potential for significant cost savings in routine monitoring tasks.
Conclusion: Voice-based AI agents can enhance healthcare scalability, efficiency, patient engagement, and accessibility. When properly aligned with ethical and regulatory frameworks and addressing current limitations, these agents can serve as critical entry points for equitable and sustainable digital healthcare solutions.
Abstract: The integration of voice-based AI agents in healthcare presents a transformative opportunity to bridge economic and accessibility gaps in digital health delivery. This paper explores the role of large language model (LLM)-powered voice assistants in enhancing preventive care and continuous patient monitoring, particularly in underserved populations. Drawing insights from the development and pilot study of Agent PULSE (Patient Understanding and Liaison Support Engine) – a collaborative initiative between IBM Research, Cleveland Clinic Foundation, and Morehouse School of Medicine – we present an economic model demonstrating how AI agents can provide cost-effective healthcare services where human intervention is economically unfeasible. Our pilot study with 33 inflammatory bowel disease patients revealed that 70% expressed acceptance of AI-driven monitoring, with 37% preferring it over traditional modalities. Technical challenges, including real-time conversational AI processing, integration with healthcare systems, and privacy compliance, are analyzed alongside policy considerations surrounding regulation, bias mitigation, and patient autonomy. Our findings suggest that AI-driven voice agents not only enhance healthcare scalability and efficiency but also improve patient engagement and accessibility. For healthcare executives, our cost-utility analysis demonstrates huge potential savings for routine monitoring tasks, while technologists can leverage our framework to prioritize improvements yielding the highest patient impact. By addressing current limitations and aligning AI development with ethical and regulatory frameworks, voice-based AI agents can serve as a critical entry point for equitable, sustainable digital healthcare solutions.
[252] ResearcherBench: Evaluating Deep AI Research Systems on the Frontiers of Scientific Inquiry
Tianze Xu, Pengrui Lu, Lyumanshan Ye, Xiangkun Hu, Pengfei Liu
Main category: cs.AI
TL;DR: This paper introduces ResearcherBench, the first benchmark designed to evaluate Deep AI Research Systems (DARS) on frontier AI scientific questions, using 65 expert-selected research questions across 35 AI subjects with a dual evaluation framework combining rubric and factual assessments.
Details
Motivation: Existing benchmarks only evaluate AI systems as agents for web retrieval and report generation, overlooking their potential to discover novel insights on the frontiers of scientific research. There's a need for specialized evaluation of advanced AI research systems on actual scientific research tasks.
Method: The researchers compiled a dataset of 65 research questions from real-world scientific scenarios (laboratory discussions and interviews) spanning 35 AI subjects, categorized into technical details, literature review, and open consulting. They developed a dual evaluation framework: rubric assessment for insight quality using expert-designed criteria, and factual assessment measuring citation accuracy (faithfulness) and coverage (groundedness).
Result: OpenAI Deep Research and Gemini Deep Research significantly outperformed other commercial DARS and baseline systems, showing particular strength in open-ended consulting questions. The evaluation demonstrated meaningful capabilities toward AI self-improvement.
Conclusion: ResearcherBench provides a standardized platform for evaluating next-generation AI research assistants and represents a step toward AI self-improvement, aligning with ASI vision. The benchmark is open-sourced to foster development of advanced AI research systems and promote a new pattern of scientific collaboration between humans and AI.
Abstract: The emergence of deep research systems presents significant capabilities in problem-solving, extending from basic queries to sophisticated research tasks. However, existing benchmarks primarily evaluate these systems as agents for web retrieval and report generation, overlooking their potential to discover novel insights on the frontiers of scientific research. To address this gap, we introduce ResearcherBench, the first benchmark focused on evaluating the capabilities of these advanced, agentic systems - which we refer to as Deep AI Research Systems (DARS) - on frontier AI scientific questions. We compiled a dataset of 65 research questions expertly selected from real-world scientific scenarios such as laboratory discussions and interviews, spanning 35 different AI subjects and categorized into three types: technical details, literature review, and open consulting. Our dual evaluation framework combines rubric assessment, which uses expert-designed criteria to evaluate insight quality, with factual assessment, which measures citation accuracy (faithfulness) and coverage (groundedness). We evaluated several leading commercial DARS and baseline systems. Results show that OpenAI Deep Research and Gemini Deep Research significantly outperform other systems, with particular strength in open-ended consulting questions. Such capabilities represent a meaningful step toward AI self-improvement, aligning with the vision of ASI for AI. We open-source ResearcherBench to provide a standardized platform for promoting the development of next-generation AI research assistants, hoping to foster a new perspective in AI research evaluation for a novel pattern of scientific collaboration: https://github.com/GAIR-NLP/ResearcherBench.
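A toy version of the factual assessment, with faithfulness and groundedness reduced to a crude lexical-overlap check; the benchmark presumably uses far stronger graders, and the names and threshold here are assumptions:

```python
def support(claim: str, source: str) -> bool:
    """Crude lexical-overlap check; a real grader would use an NLI model."""
    c, s = set(claim.lower().split()), set(source.lower().split())
    return len(c & s) / max(len(c), 1) > 0.5

def factual_scores(claims, citations, corpus):
    """claims[i] is cited to corpus[citations[i]] (or None if uncited).

    faithfulness: fraction of cited claims actually supported by their source.
    groundedness: fraction of all claims carrying a supported citation.
    """
    cited = [(c, corpus[d]) for c, d in zip(claims, citations) if d is not None]
    supported = sum(support(c, src) for c, src in cited)
    return supported / max(len(cited), 1), supported / max(len(claims), 1)

corpus = {"doc1": "scaling laws predict loss from model size and data size"}
claims = ["loss is predicted from model size and data size", "GPUs are cheap now"]
print(factual_scores(claims, ["doc1", None], corpus))  # (1.0, 0.5)
```

The split matters: a report can be perfectly faithful (every citation checks out) while still poorly grounded (most claims cite nothing), and the benchmark scores the two separately.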
[253] Cross-Modal Distillation For Widely Differing Modalities
Cairong Zhao, Yufeng Jin, Zifan Song, Haonan Chen, Duoqian Miao, Guosheng Hu
Main category: cs.AI
TL;DR: This paper proposes a cross-modal distillation framework that uses soft constraints and adaptive weighting to transfer knowledge between different modalities (image, text, speech) while avoiding overfitting issues common in traditional hard-constrained distillation methods.
Details
Motivation: Deep learning performance improvement through model scaling is becoming inefficient, and multi-modal learning can provide richer discriminative information. However, multi-modal data access is often limited during deployment, and traditional cross-modal knowledge distillation suffers from overfitting due to large domain gaps between different modalities.
Method: The authors develop a cross-modal distillation framework with: (1) soft constrained knowledge distillation strategies at both feature and classifier levels to replace hard constraints like L2 loss, and (2) a quality-based adaptive weights module that assigns weights to input samples based on quantified data quality for robust training.
Result: Experiments on speaker recognition and image classification tasks demonstrate effective knowledge transfer between image, text, and speech modalities. The proposed soft constraints and adaptive weighting successfully mitigate overfitting issues while maintaining effective cross-modal knowledge transfer.
Conclusion: The cross-modal distillation framework with soft constraints and quality-based adaptive weighting effectively addresses overfitting in cross-modal knowledge transfer, enabling robust learning across widely differing modalities like image, text, and speech.
Abstract: Deep learning has achieved great progress recently; however, it is not easy or efficient to further improve its performance by increasing the size of the model. Multi-modal learning can mitigate this challenge by introducing richer and more discriminative information as input. To solve the problem of limited access to multi-modal data at the time of use, we conduct multi-modal learning by introducing a teacher model to transfer discriminative knowledge to a student model during training. However, this knowledge transfer via distillation is not trivial because the big domain gap between the widely differing modalities can easily lead to overfitting. In this work, we introduce a cross-modal distillation framework. Specifically, we find that hard constrained losses, e.g. an l2 loss forcing the student to be exactly the same as the teacher, can easily lead to overfitting in cross-modality distillation. To address this, we propose two soft constrained knowledge distillation strategies at the feature level and classifier level respectively. In addition, we propose a quality-based adaptive weights module to weigh input samples via quantified data quality, leading to robust model training. We conducted experiments on speaker recognition and image classification tasks, and the results show that our approach is able to effectively achieve knowledge transfer between the commonly used and widely differing modalities of image, text, and speech.
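A sketch of what soft-constrained, quality-weighted cross-modal distillation could look like, assuming cosine similarity as the soft feature-level constraint and temperature-softened KL at the classifier level; the paper's exact losses and weighting scheme may differ:

```python
import torch
import torch.nn.functional as F

def soft_crossmodal_kd(f_s, f_t, logits_s, logits_t, quality, tau=4.0):
    """Soft-constrained cross-modal distillation with per-sample quality weights.

    f_s, f_t: student/teacher features (batch, d); logits_*: (batch, classes);
    quality: (batch,) in [0, 1], a quantified input-quality score.
    Cosine similarity replaces a hard L2 match at the feature level, and a
    temperature-softened KL replaces hard logit matching at the classifier level.
    """
    feat_loss = 1.0 - F.cosine_similarity(f_s, f_t, dim=-1)          # (batch,)
    cls_loss = F.kl_div(
        F.log_softmax(logits_s / tau, dim=-1),
        F.softmax(logits_t / tau, dim=-1),
        reduction="none",
    ).sum(-1) * tau * tau                                            # (batch,)
    return (quality * (feat_loss + cls_loss)).mean()

f_s, f_t = torch.randn(8, 256), torch.randn(8, 256)
logits_s, logits_t = torch.randn(8, 10), torch.randn(8, 10)
print(soft_crossmodal_kd(f_s, f_t, logits_s, logits_t, torch.rand(8)))
```

The design intuition: cosine matching constrains direction but not magnitude, leaving the student slack to bridge the modality gap, while low-quality samples are down-weighted rather than allowed to dominate the loss.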
[254] Mind the Gap: Evaluating the Representativeness of Quantitative Medical Language Reasoning LLM Benchmarks for African Disease Burdens
Fred Mutisya, Shikoh Gitau, Christine Syovata, Diana Oigara, Ibrahim Matende, Muna Aden, Munira Ali, Ryan Nyotu, Diana Marion, Job Nyangena, Nasubo Ongoma, Keith Mbae, Elizabeth Wamicha, Eric Mibuari, Jean Philbert Nsengemana, Talkmore Chidede
Main category: cs.AI
TL;DR: This study reveals that existing medical LLM benchmarks are biased toward high-income countries and poorly represent African disease burdens like malaria, HIV, and TB. The authors developed Alama Health QA, a new benchmark based on Kenyan clinical guidelines that better captures neglected tropical diseases (>40% NTD representation) compared to global benchmarks.
Details
Motivation: Existing medical LLM benchmarks reflect examination syllabi and disease profiles from high-income settings, raising validity concerns for African deployment where malaria, HIV, TB, sickle cell disease, and neglected tropical diseases dominate the disease burden and where national guidelines drive clinical care.
Method: The researchers systematically reviewed 31 quantitative LLM evaluation papers (Jan 2019-May 2025) and identified 19 English medical QA benchmarks. They developed Alama Health QA using a retrieval augmented generation framework anchored on Kenyan Clinical Practice Guidelines. Six benchmark sets underwent harmonized semantic profiling (NTD proportion, recency, readability, lexical diversity) and blinded expert rating across clinical relevance, guideline alignment, clarity, distractor plausibility, and language/cultural fit.
Result: Alama Health QA captured >40% of all NTD mentions across corpora with the highest frequencies for malaria (7.7%), HIV (4.1%), and TB (5.2%). AfriMedQA ranked second but lacked formal guideline linkage. Global benchmarks showed minimal representation of African diseases (e.g., sickle cell disease absent in three sets). Qualitatively, Alama scored highest for clinical relevance and guideline alignment, while PubMedQA scored lowest for clinical utility.
Conclusion: Quantitative medical LLM benchmarks widely used in literature underrepresent African disease burdens and regulatory contexts, risking misleading performance claims. Guideline-anchored, regionally curated resources like Alama Health QA and expanded disease-specific derivatives are essential for safe, equitable model evaluation and deployment across African health systems.
Abstract: Introduction: Existing medical LLM benchmarks largely reflect examination syllabi and disease profiles from high income settings, raising questions about their validity for African deployment where malaria, HIV, TB, sickle cell disease and other neglected tropical diseases (NTDs) dominate burden and national guidelines drive care. Methodology: We systematically reviewed 31 quantitative LLM evaluation papers (Jan 2019 May 2025) identifying 19 English medical QA benchmarks. Alama Health QA was developed using a retrieval augmented generation framework anchored on the Kenyan Clinical Practice Guidelines. Six widely used sets (AfriMedQA, MMLUMedical, PubMedQA, MedMCQA, MedQAUSMLE, and guideline grounded Alama Health QA) underwent harmonized semantic profiling (NTD proportion, recency, readability, lexical diversity metrics) and blinded expert rating across five dimensions: clinical relevance, guideline alignment, clarity, distractor plausibility, and language/cultural fit. Results: Alama Health QA captured >40% of all NTD mentions across corpora and the highest within set frequencies for malaria (7.7%), HIV (4.1%), and TB (5.2%); AfriMedQA ranked second but lacked formal guideline linkage. Global benchmarks showed minimal representation (e.g., sickle cell disease absent in three sets) despite large scale. Qualitatively, Alama scored highest for relevance and guideline alignment; PubMedQA lowest for clinical utility. Discussion: Quantitative medical LLM benchmarks widely used in the literature underrepresent African disease burdens and regulatory contexts, risking misleading performance claims. Guideline anchored, regionally curated resources such as Alama Health QA and expanded disease specific derivatives are essential for safe, equitable model evaluation and deployment across African health systems.
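Profiling a benchmark for disease-burden representation reduces, in its simplest form, to counting condition mentions across items; the keyword lists below are illustrative stand-ins, not the study's ontology:

```python
import re

# Condition keywords are illustrative, not the paper's full ontology.
CONDITIONS = {
    "malaria": ["malaria", "plasmodium"],
    "hiv": ["hiv", "antiretroviral"],
    "tb": ["tb", "tuberculosis"],
    "sickle_cell": ["sickle cell"],
}

def burden_profile(questions):
    """Share of benchmark items mentioning each condition of interest."""
    n = max(len(questions), 1)
    return {
        name: sum(
            any(re.search(rf"\b{re.escape(k)}\b", q, re.IGNORECASE) for k in keys)
            for q in questions
        ) / n
        for name, keys in CONDITIONS.items()
    }

sample = [
    "First-line treatment for uncomplicated malaria per national guidelines?",
    "When should antiretroviral therapy start after a positive HIV test?",
    "Management of vaso-occlusive crisis in sickle cell disease?",
]
print(burden_profile(sample))  # malaria/hiv/sickle_cell ~0.33 each, tb 0.0
```

Running the same profile over a global benchmark and a guideline-anchored one makes the representation gap the study reports directly visible as a difference in these proportions.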
[255] Higher Gauge Flow Models
Alexander Strunk, Roland Assam
Main category: cs.AI
TL;DR: This paper introduces Higher Gauge Flow Models that extend ordinary Gauge Flow Models using L∞-algebra to incorporate higher geometry and higher symmetries, demonstrating improved performance on Gaussian Mixture Model datasets compared to traditional Flow Models.
Details
Motivation: To extend the capabilities of Generative Flow Models by incorporating higher geometry and higher symmetries through L∞-algebra, building upon the limitations of ordinary Gauge Flow Models that only use Lie Algebra structures.
Method: The paper develops Higher Gauge Flow Models by leveraging L∞-algebra as an extension of Lie Algebra, which enables the integration of higher geometry and higher symmetries associated with higher groups into the Generative Flow Models framework.
Result: Experimental evaluation on a Gaussian Mixture Model dataset showed substantial performance improvements of Higher Gauge Flow Models compared to traditional Flow Models.
Conclusion: Higher Gauge Flow Models successfully extend ordinary Gauge Flow Models through L∞-algebra integration, providing a more powerful framework for generative modeling that demonstrates superior performance on benchmark datasets.
Abstract: This paper introduces Higher Gauge Flow Models, a novel class of Generative Flow Models. Building upon ordinary Gauge Flow Models (arXiv:2507.13414), these Higher Gauge Flow Models leverage an L∞-algebra, effectively extending the Lie Algebra. This expansion allows for the integration of the higher geometry and higher symmetries associated with higher groups into the framework of Generative Flow Models. Experimental evaluation on a Gaussian Mixture Model dataset revealed substantial performance improvements compared to traditional Flow Models.
[256] Learning to Call: A Field Trial of a Collaborative Bandit Algorithm for Improved Message Delivery in Mobile Maternal Health
Arpan Dasgupta, Mizhaan Maniyar, Awadhesh Srivastava, Sanat Kumar, Amrita Mahale, Aparna Hedge, Arun Suggala, Karthikeyan Shanmugam, Aparna Taneja, Milind Tambe
Main category: cs.AI
TL;DR: Researchers developed a collaborative bandit algorithm to optimize call timing for India’s Kilkari maternal health program, achieving significantly higher call pick-up rates compared to random scheduling in a field trial with 6,500 participants.
Details
Motivation: Current random call scheduling in India's Kilkari program (which delivers maternal health information via voice calls to millions of mothers) often results in missed calls and reduced message delivery, limiting the program's effectiveness in reaching underserved communities.
Method: Deployed a collaborative bandit algorithm in a field trial with approximately 6,500 Kilkari participants to learn individual mothers’ preferred call times and optimize scheduling, comparing performance against the baseline random calling approach.
Result: The bandit algorithm achieved statistically significant improvement in call pick-up rates compared to the random scheduling baseline, demonstrating enhanced message delivery effectiveness.
Conclusion: Personalized scheduling using machine learning algorithms can significantly improve mobile health intervention outcomes, with potential to enhance maternal health outreach for millions of mothers across India at scale.
Abstract: Mobile health (mHealth) programs utilize automated voice messages to deliver health information, particularly targeting underserved communities, demonstrating the effectiveness of using mobile technology to disseminate crucial health information to these populations, improving health outcomes through increased awareness and behavioral change. India’s Kilkari program delivers vital maternal health information via weekly voice calls to millions of mothers. However, the current random call scheduling often results in missed calls and reduced message delivery. This study presents a field trial of a collaborative bandit algorithm designed to optimize call timing by learning individual mothers’ preferred call times. We deployed the algorithm with around 6,500 Kilkari participants as a pilot study, comparing its performance to the baseline random calling approach. Our results demonstrate a statistically significant improvement in call pick-up rates with the bandit algorithm, indicating its potential to enhance message delivery and impact millions of mothers across India. This research highlights the efficacy of personalized scheduling in mobile health interventions and underscores the potential of machine learning to improve maternal health outreach at scale.
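The deployed algorithm is collaborative, pooling information across mothers; the independent per-user Thompson-sampling loop below only illustrates the core explore/exploit mechanics over call-time slots, and the slot times and pick-up rates are made up:

```python
import random

SLOTS = ["09:00", "12:00", "15:00", "18:00"]

class SlotBandit:
    """Per-user Thompson sampling over call-time slots (Beta-Bernoulli)."""
    def __init__(self, n_slots):
        self.wins = [1] * n_slots    # Beta(1, 1) priors
        self.losses = [1] * n_slots

    def pick_slot(self):
        samples = [random.betavariate(w, l)
                   for w, l in zip(self.wins, self.losses)]
        return max(range(len(samples)), key=samples.__getitem__)

    def update(self, slot, picked_up: bool):
        if picked_up:
            self.wins[slot] += 1
        else:
            self.losses[slot] += 1

bandit = SlotBandit(len(SLOTS))
true_pickup = [0.2, 0.3, 0.6, 0.4]   # hypothetical per-slot pick-up rates
for _ in range(500):
    s = bandit.pick_slot()
    bandit.update(s, random.random() < true_pickup[s])
print(SLOTS[bandit.pick_slot()])      # usually "15:00" after learning
```

The collaborative variant matters at Kilkari's scale: a new mother gets few calls per week, so sharing statistical strength across users with similar pick-up patterns learns far faster than this per-user version.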
[257] Canonical Representations of Markovian Structural Causal Models: A Framework for Counterfactual Reasoning
Lucas de Lara
Main category: cs.AI
TL;DR: This paper proposes counterfactual models as an alternative to structural causal models for representing counterfactual reasoning in Pearl’s causal framework, allowing analysts to choose different counterfactual conceptions without altering observational and interventional constraints.
Details
Motivation: Many counterfactual statements cannot be falsified even by randomized experiments, yet they underpin fundamental concepts like individual-wise fairness. There is a need for better models to formalize and implement counterfactual beliefs in causal reasoning beyond traditional structural causal models.
Method: The authors introduce counterfactual models (canonical representations of structural causal models) that use random-process probability distributions with preassigned marginals to characterize counterfactual equivalence classes. They present a normalization procedure to describe and implement various counterfactual conceptions within the Markovian setting of Pearl’s causal framework.
Result: The proposed counterfactual models enable specification of many counterfactual conceptions without altering observational and interventional constraints. The content of the counterfactual layer does not need to be estimated, only chosen. The approach successfully demonstrates benefits on both theoretical and numerical examples.
Conclusion: Counterfactual models provide a flexible alternative to structural causal models for counterfactual reasoning, allowing analysts to choose appropriate counterfactual conceptions while maintaining compatibility with causal graphical models and preserving observational/interventional constraints.
Abstract: Counterfactual reasoning aims at answering contrary-to-fact questions like “Would Alice have recovered had she taken aspirin?” and corresponds to the most fine-grained layer of causation. Critically, while many counterfactual statements cannot be falsified – even by randomized experiments – they underpin fundamental concepts like individual-wise fairness. Therefore, providing models to formalize and implement counterfactual beliefs remains a fundamental scientific problem. In the Markovian setting of Pearl’s causal framework, we propose an alternative approach to structural causal models to represent counterfactuals compatible with a given causal graphical model. More precisely, we introduce counterfactual models, also called canonical representations of structural causal models. They enable analysts to choose a counterfactual conception via random-process probability distributions with preassigned marginals and characterize the counterfactual equivalence class of structural causal models. Then, we present a normalization procedure to describe and implement various counterfactual conceptions. Compared to structural causal models, it allows analysts to specify many counterfactual conceptions without altering the observational and interventional constraints. Moreover, the content of the model corresponding to the counterfactual layer does not need to be estimated; it only needs to be chosen. Finally, we illustrate the specific role of counterfactuals in causality and the benefits of our approach on theoretical and numerical examples.
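For reference, the standard abduction-action-prediction recipe that counterfactual models re-parameterize, on a toy invertible mechanism (the mechanism and numbers are invented; when mechanisms are not identifiable, the counterfactual layer becomes exactly the kind of modeling choice the paper makes explicit):

```python
def mechanism(u: float, a: int) -> float:
    """Structural equation: recovery score as a function of treatment and noise."""
    return u + 0.4 * a

def abduct(y_obs: float, a_obs: int) -> float:
    """Step 1 (abduction): invert the mechanism to recover the exogenous noise."""
    return y_obs - 0.4 * a_obs

def counterfactual(y_obs: float, a_obs: int, a_cf: int) -> float:
    u = abduct(y_obs, a_obs)     # 1. abduction
    return mechanism(u, a_cf)    # 2-3. action (do A := a_cf) and prediction

# "Would Alice's recovery score have been higher had she taken aspirin?"
print(counterfactual(y_obs=0.3, a_obs=0, a_cf=1))  # 0.7
```

With a non-invertible mechanism, abduction no longer pins down a unique noise value, and different couplings of the exogenous distribution yield different counterfactuals with identical observational and interventional behaviour; that underdetermined coupling is what the paper's canonical representations let the analyst choose explicitly.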
[258] LLM-Driven Collaborative Model for Untangling Commits via Explicit and Implicit Dependency Reasoning
Bo Hou, Xin Tan, Kai Zheng, Fang Liu, Yinghao Zhu, Li Zhang
Main category: cs.AI
TL;DR: ColaUntangle is a collaborative multi-agent framework using LLMs to untangle mixed commits by modeling both explicit and implicit dependencies, achieving 44-100% improvement over existing methods on C# and Java datasets.
Details
Motivation: Developers frequently create tangled commits mixing unrelated changes due to practical constraints, which negatively impacts code review and maintenance. Existing untangling approaches rely on shallow signals and fail to distinguish between explicit dependencies (control/data flow) and implicit ones (semantic/conceptual relationships).
Method: ColaUntangle uses a multi-agent LLM architecture with three agents: one specializing in explicit dependencies, another in implicit dependencies, and a reviewer agent that synthesizes their perspectives through iterative consultation. The framework constructs multi-version Program Dependency Graphs (delta-PDG) to enable agents to reason over code relationships with both symbolic and semantic depth.
Result: ColaUntangle outperforms the best-performing baseline by 44% on the C# dataset (1,612 commits) and 100% on the Java dataset (14k commits), demonstrating significant improvements in commit untangling performance across two widely-used datasets.
Conclusion: The findings highlight the potential of LLM-based collaborative frameworks for advancing automated commit untangling tasks, showing that modeling both explicit and implicit dependencies through multi-agent consultation can significantly improve commit untangling performance.
Abstract: Atomic commits, each of which addresses a single development concern, are a best practice in software development. However, developers frequently produce tangled commits that mix unrelated changes due to practical constraints or unclear boundaries, negatively impacting code review and maintenance. Although prior commit untangling approaches (rule-based, feature-based, or graph-based) have made progress, they often rely on shallow signals and fail to distinguish between explicit dependencies (e.g., control/data flow) and implicit ones (e.g., semantic or conceptual relationships). In this paper, we propose ColaUntangle, a new collaborative consultation framework for commit untangling that models both explicit and implicit dependencies among code changes. ColaUntangle integrates Large Language Model (LLM)-driven agents in a multi-agent architecture: one agent specializes in explicit dependencies, another in implicit ones, and a reviewer agent synthesizes their perspectives through iterative consultation. To capture explicit and implicit contextual information, we construct multi-version Program Dependency Graphs (delta-PDG), enabling agents to reason over code relationships with both symbolic and semantic depth. We evaluate ColaUntangle on two widely-used datasets (1,612 C# and 14k Java tangled commits). Experimental results show that ColaUntangle outperforms the best-performing baseline, achieving an improvement of 44% on the C# dataset and 100% on the Java dataset. These findings highlight the potential of LLM-based collaborative frameworks for advancing automated commit untangling tasks.
[259] Self-Supervised Inductive Logic Programming
Stassa Patsantzis
Main category: cs.AI
TL;DR: This paper introduces Poker, a new Meta-Interpretive Learning (MIL) system that learns recursive logic programs in a self-supervised setting without requiring carefully selected background theories or negative examples, automatically generating training data during learning.
Details
Motivation: Traditional ILP approaches like MIL require expert-curated background theories and negative examples for effective learning. However, such problem-specific knowledge may not always be available, limiting the applicability of these methods in real-world scenarios where domain expertise is scarce.
Method: The authors formalize Self-Supervised ILP as a new learning setting and develop Poker, a MIL algorithm that learns from positive labeled examples and unlabeled data without negative examples. The system automatically generates and labels new positive and negative examples during learning. They also introduce Second Order Definite Normal Form (SONF) for principled selection of second-order background theories that are general enough to learn all programs in a class.
Result: Poker outperforms the state-of-the-art MIL system Louise on grammar learning tasks for Context-Free and L-System languages. Poker’s performance improves with increasing numbers of automatically generated examples, while Louise over-generalizes due to the lack of negative examples. The experiments use only positive example strings, terminal vocabulary, and first-order background theory.
Conclusion: The paper successfully demonstrates that self-supervised ILP is feasible and effective. Poker can learn recursive logic programs without expert-provided negative examples or problem-specific background theories, making ILP more accessible for domains lacking expert knowledge while maintaining good generalization performance.
Abstract: Inductive Logic Programming (ILP) approaches like Meta-Interpretive Learning (MIL) can learn, from few examples, recursive logic programs with invented predicates that generalise well to unseen instances. This ability relies on a background theory and negative examples, both carefully selected with expert knowledge of a learning problem and its solutions. But what if such a problem-specific background theory or negative examples are not available? We formalise this question as a new setting for Self-Supervised ILP and present a new MIL algorithm that learns in the new setting from some positive, labelled examples and zero or more unlabelled examples, and automatically generates, and labels, new positive and negative examples during learning. We implement this algorithm in Prolog in a new MIL system, called Poker. We compare Poker to state-of-the-art MIL system Louise on experiments learning grammars for Context-Free and L-System languages from labelled, positive example strings, no negative examples, and just the terminal vocabulary of a language, seen in examples, as a first-order background theory. We introduce a new approach for the principled selection of a second-order background theory as a Second Order Definite Normal Form (SONF), sufficiently general to learn all programs in a class, thus removing the need for a background theory tailored to a learning task. We find that Poker’s performance improves with increasing numbers of automatically generated examples while Louise, bereft of negative examples, over-generalises.
[260] Identifying Pre-training Data in LLMs: A Neuron Activation-Based Detection Framework
Hongyi Tang, Zhihao Zhu, Yi Yang
Main category: cs.AI
TL;DR: This paper introduces NA-PDD, a novel algorithm for detecting whether specific data was included in a large language model’s pre-training corpus by analyzing differential neuron activation patterns, achieving superior performance compared to existing methods that rely on superficial features.
Details
Motivation: Existing Pre-Training Data Detection (PDD) methods for identifying copyrighted or private information in LLM training data show mediocre performance because they rely on superficial features like prediction confidence and loss. There are growing legal and ethical concerns about LLMs being trained on copyrighted material, private information, and biased datasets, necessitating better detection methods.
Method: The paper proposes NA-PDD (Neuron Activation-based Pre-Training Data Detection), which analyzes differential neuron activation patterns between training and non-training data during LLM inference. The method is based on the observation that training and non-training data activate different neurons in LLMs. They also introduce CCNewsPDD, a temporally unbiased benchmark with rigorous data transformations to ensure consistent time distributions between training and non-training data.
Result: NA-PDD significantly outperforms existing PDD methods across three benchmarks and multiple LLMs. The experiments demonstrate the effectiveness of analyzing neuron activation patterns compared to traditional approaches that use superficial features like prediction confidence and loss.
Conclusion: The proposed NA-PDD algorithm successfully addresses the limitations of existing pre-training data detection methods by leveraging neuron activation patterns, providing a more reliable approach for identifying whether specific data was included in an LLM’s training corpus. This advancement helps address legal, ethical, and bias concerns related to LLM training data.
Abstract: The performance of large language models (LLMs) is closely tied to their training data, which can include copyrighted material or private information, raising legal and ethical concerns. Additionally, LLMs face criticism for dataset contamination and internalizing biases. To address these issues, the Pre-Training Data Detection (PDD) task was proposed to identify if specific data was included in an LLM’s pre-training corpus. However, existing PDD methods often rely on superficial features like prediction confidence and loss, resulting in mediocre performance. To improve this, we introduce NA-PDD, a novel algorithm analyzing differential neuron activation patterns between training and non-training data in LLMs. This is based on the observation that these data types activate different neurons during LLM inference. We also introduce CCNewsPDD, a temporally unbiased benchmark employing rigorous data transformations to ensure consistent time distributions between training and non-training data. Our experiments demonstrate that NA-PDD significantly outperforms existing methods across three benchmarks and multiple LLMs.
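The abstract does not spell out NA-PDD's activation statistic; the snippet below shows only the hook mechanics for capturing per-neuron activations and comparing aggregate signatures, with a toy network standing in for an LLM block:

```python
import torch
import torch.nn as nn

# Toy stand-in for an LLM block; with a real model, register the same hooks
# on its MLP layers instead.
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))

activations = {}
def capture(name):
    def hook(module, inputs, output):
        activations[name] = output.detach()
    return hook

model[1].register_forward_hook(capture("mlp_act"))

def activation_signature(x):
    """Mean per-neuron activation over a batch of (pseudo-)token features."""
    model(x)
    return activations["mlp_act"].mean(dim=0)

member = activation_signature(torch.randn(64, 16) + 0.5)   # "seen in training"
non_member = activation_signature(torch.randn(64, 16))     # "unseen"
# A membership score could threshold a distance between such signatures.
print(torch.linalg.norm(member - non_member).item())
```

The actual detector presumably learns which neurons are discriminative and how to score a single candidate document; this sketch shows only where such signals come from.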
[261] From model-based learning to model-free behaviour with Meta-Interpretive Learning
Stassa Patsantzis
Main category: cs.AI
TL;DR: This paper presents a method to combine model-based and model-free reinforcement learning by using Meta-Interpretive Learning to train a model-based Solver that then trains a model-free Controller, achieving equivalent problem-solving capabilities on grid navigation tasks.
Details
Motivation: Autonomous agents need to act independently in novel environments, which requires combining model-based planning capabilities (for prediction and planning) with model-free capabilities (for acting without complete environment observation). Existing approaches typically use either model-based or model-free methods separately, missing the benefits of both approaches.
Method: The authors use Meta-Interpretive Learning to learn a model-based Solver, which is then used to train a model-free Controller. The approach transfers the planning knowledge from the model-based agent to the model-free agent, enabling the Controller to solve the same planning problems as the Solver without requiring a model or complete environment observation.
Result: The experiments on grid navigation problems in randomly generated mazes and lake maps with wide open areas show that all navigation problems solved by the model-based Solver are also solved by the model-free Controller, demonstrating equivalent problem-solving capabilities between the two agents.
Conclusion: The paper successfully demonstrates that it is possible to create autonomous agents that combine both model-based and model-free capabilities through knowledge transfer. The model-free Controller achieves equivalent performance to the model-based Solver while maintaining the advantages of not requiring a model or complete environment observation.
Abstract: A “model” is a theory that describes the state of an environment and the effects of an agent’s decisions on the environment. A model-based agent can use its model to predict the effects of its future actions and so plan ahead, but must know the state of the environment. A model-free agent cannot plan, but can act without a model and without completely observing the environment. An autonomous agent capable of acting independently in novel environments must combine both sets of capabilities. We show how to create such an agent by using Meta-Interpretive Learning to learn a model-based Solver, which is then used to train a model-free Controller that can solve the same planning problems as the Solver. We demonstrate the equivalence in problem-solving ability of the two agents on grid navigation problems in two kinds of environment: randomly generated mazes, and lake maps with wide open areas. We find that all navigation problems solved by the Solver are also solved by the Controller, indicating the two are equivalent.
[262] Improving ASP-based ORS Schedules through Machine Learning Predictions
Pierangela Bruno, Carmine Dodaro, Giuseppe Galatà, Marco Maratea, Marco Mochi
Main category: cs.AI
TL;DR: This paper integrates machine learning with Answer Set Programming (ASP) to solve Operating Room Scheduling problems by predicting surgery durations and generating more robust provisional schedules using confidence measures from predictions.
Details
Motivation: Existing ASP-based solutions for operating room scheduling can only verify schedules against actual data and suggest alternatives, but cannot generate provisional schedules or produce robust schedules for real-world application.
Method: The authors combine inductive and deductive techniques: first using machine learning algorithms to predict surgery duration from historical data, then incorporating prediction confidence as additional input to update the ASP encoding for computing more robust schedules.
Result: Testing on historical data from ASL1 Liguria in Italy confirmed the viability of integrating machine learning predictions with ASP for operating room scheduling, enabling provisional schedule generation with improved robustness.
Conclusion: The integration of machine learning prediction capabilities with ASP-based scheduling successfully addresses the limitations of pure ASP approaches, enabling generation of provisional and more robust operating room schedules for practical deployment.
Abstract: The Operating Room Scheduling (ORS) problem deals with the optimization of daily operating room surgery schedules. It is a challenging problem subject to many constraints, such as determining the starting times of different surgeries and allocating the required resources, including the availability of beds in different department units. Recently, solutions to this problem based on Answer Set Programming (ASP) have been delivered. Such solutions are overall satisfying but, when applied to real data, they can currently only verify whether the encoding aligns with the actual data and, at most, suggest alternative schedules that could have been computed. As a consequence, it is not currently possible to generate provisional schedules. Furthermore, the resulting schedules are not always robust. In this paper, we integrate inductive and deductive techniques for solving these issues. We first employ machine learning algorithms to predict the surgery duration, from historical data, to compute provisional schedules. Then, we consider the confidence of such predictions as an additional input to our problem and update the encoding correspondingly in order to compute more robust schedules. Results on historical data from the ASL1 Liguria in Italy confirm the viability of our integration. Under consideration in Theory and Practice of Logic Programming (TPLP).
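A minimal sketch of the inductive half of this pipeline, under illustrative assumptions: a toy mean/spread model stands in for the learned duration predictor, and its outputs are emitted as ASP facts carrying a confidence value. The duration/3 predicate is hypothetical, not the paper's encoding.

```python
from statistics import mean, pstdev

# Historical durations (minutes) per surgery type -- a toy stand-in for
# the paper's ML predictor (an assumption made for brevity).
history = {"hernia": [45, 50, 55, 48], "knee": [90, 110, 95, 100]}

def predict(surgery_type):
    xs = history[surgery_type]
    mu, sigma = mean(xs), pstdev(xs)
    confidence = 1.0 / (1.0 + sigma / mu)  # crude: tighter spread => higher confidence
    return round(mu), round(confidence, 2)

# Emit ASP facts consumable by a scheduling encoding; the predicate name
# duration/3 is invented for illustration.
for reg_id, stype in enumerate(["hernia", "knee"], start=1):
    dur, conf = predict(stype)
    print(f"duration({reg_id},{dur},{int(conf * 100)}).")
```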
[263] Learning Temporal Abstractions via Variational Homomorphisms in Option-Induced Abstract MDPs
Chang Li, Yaren Zhang, Haoran Lv, Qiong Cao, Chao Xue, Xiaodong He
Main category: cs.AI
TL;DR: This paper proposes a framework for efficient implicit reasoning in LLMs by modeling latent thoughts as hierarchical reinforcement learning options, avoiding the computational cost of explicit Chain-of-Thought text generation while maintaining reasoning performance.
Details
Motivation: Large Language Models show strong reasoning through Chain-of-Thought prompting, but generating step-by-step textual explanations is computationally expensive and slow. The authors aim to develop efficient implicit reasoning where models "think" in latent space without explicit text generation.
Method: The approach introduces Variational Markovian Option Critic (VMOC), an off-policy algorithm using variational inference within hierarchical reinforcement learning. They extend continuous MDP homomorphism theory to prove optimality preservation in abstract latent spaces, and propose a cold-start procedure using supervised fine-tuning data to initialize the latent option space with human reasoning demonstrations.
Result: Extensive experiments show strong performance on complex logical reasoning benchmarks and challenging locomotion tasks, demonstrating the framework’s effectiveness for learning abstract skills in both language and control domains.
Conclusion: The framework provides a principled method for learning abstract reasoning skills that maintains performance while significantly reducing computational costs compared to explicit Chain-of-Thought approaches, with theoretical guarantees for optimality preservation in the latent reasoning space.
Abstract: Large Language Models (LLMs) have shown remarkable reasoning ability through explicit Chain-of-Thought (CoT) prompting, but generating these step-by-step textual explanations is computationally expensive and slow. To overcome this, we aim to develop a framework for efficient, implicit reasoning, where the model “thinks” in a latent space without generating explicit text for every step. We propose that these latent thoughts can be modeled as temporally-extended abstract actions, or options, within a hierarchical reinforcement learning framework. To effectively learn a diverse library of options as latent embeddings, we first introduce the Variational Markovian Option Critic (VMOC), an off-policy algorithm that uses variational inference within the HiT-MDP framework. To provide a rigorous foundation for using these options as an abstract reasoning space, we extend the theory of continuous MDP homomorphisms. This proves that learning a policy in the simplified, abstract latent space, for which VMOC is suited, preserves the optimality of the solution to the original, complex problem. Finally, we propose a cold-start procedure that leverages supervised fine-tuning (SFT) data to distill human reasoning demonstrations into this latent option space, providing a rich initialization for the model’s reasoning capabilities. Extensive experiments demonstrate that our approach achieves strong performance on complex logical reasoning benchmarks and challenging locomotion tasks, validating our framework as a principled method for learning abstract skills for both language and control.
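A minimal PyTorch sketch of the variational-option idea, with hypothetical names and sizes: a state encoder parameterizes a Gaussian over latent option embeddings, sampled via the reparameterization trick, and a KL term regularizes the option space. VMOC would combine such a term with actor-critic losses; this is only the variational core.

```python
import torch
import torch.nn as nn

# Hypothetical sketch: encoder maps a state to a distribution over latent
# option embeddings; the KL term keeps the option space well-behaved.
class OptionEncoder(nn.Module):
    def __init__(self, state_dim=8, option_dim=4):
        super().__init__()
        self.net = nn.Linear(state_dim, 2 * option_dim)

    def forward(self, state):
        mu, logvar = self.net(state).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterize
        kl = 0.5 * (mu.pow(2) + logvar.exp() - 1 - logvar).sum(-1)
        return z, kl

enc = OptionEncoder()
state = torch.randn(2, 8)
z, kl = enc(state)
loss = kl.mean()  # in VMOC this would be combined with actor/critic losses
print(z.shape, float(loss))
```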
[264] ACT: Bridging the Gap in Code Translation through Synthetic Data Generation & Adaptive Training
Shreya Saxena, Siva Prasad, Zishan Ahmad, Vishal Vaddina
Main category: cs.AI
TL;DR: ACT is an automated framework that improves code translation by fine-tuning open-source LLMs using synthetic data generation and dynamic pipeline control, providing a secure alternative to proprietary solutions while significantly enhancing translation performance.
Details
Motivation: Traditional code translation methods using handcrafted rules lack flexibility and scalability, while advanced language models are limited by proprietary implementations that raise data security concerns and create dependency issues. There's a need for accessible, high-performing open-source solutions for code translation.
Method: ACT framework includes: (1) synthetic data generation module that creates extensive, high-quality datasets from initial code samples with unit tests, (2) execution-level evaluation framework for comprehensive translation quality assessment, and (3) controller module that dynamically adjusts hyperparameters and orchestrates iterative data generation and fine-tuning based on real-time evaluations.
Result: ACT consistently enhances the effectiveness of open-source models, narrowing the performance gap with closed-source solutions. Application to industry-scale migration projects led to notable increases in developer acceleration, demonstrating practical value in real-world scenarios.
Conclusion: ACT provides businesses and developers with a secure, reliable alternative to proprietary code translation solutions by enabling effective fine-tuning of open-source LLMs. The framework successfully bridges the gap between open-source accessibility and high performance while offering practical benefits in software migration projects.
Abstract: Code translation is a crucial process in software development and migration projects, enabling interoperability between different programming languages and enhancing software adaptability and thus longevity. Traditional automated translation methods rely heavily on handcrafted transformation rules, which often lack flexibility and scalability. Meanwhile, advanced language models present promising alternatives but are often limited by proprietary, API-based implementations that raise concerns over data security and reliance. In this paper, we present Auto-Train for Code Translation (ACT), an innovative framework that aims to improve code translation capabilities by enabling in-house finetuning of open-source Large Language Models (LLMs). ACT’s automated pipeline significantly boosts the performance of these models, narrowing the gap between open-source accessibility and the high performance of closed-source solutions. Central to ACT is its synthetic data generation module, which builds extensive, high-quality datasets from initial code samples, incorporating unit tests to ensure functional accuracy and diversity. ACT’s evaluation framework incorporates execution-level checks, offering a comprehensive assessment of translation quality. A key feature in ACT is its controller module, which manages the entire pipeline by dynamically adjusting hyperparameters, orchestrating iterative data generation, and finetuning based on real-time evaluations. This enables ACT to intelligently optimize when to continue training, generate additional targeted training data, or stop the process. Our results demonstrate that ACT consistently enhances the effectiveness of open-source models, offering businesses and developers a secure and reliable alternative. Additionally, applying our data generation pipeline to industry-scale migration projects has led to a notable increase in developer acceleration.
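A skeleton of the controller-driven loop described above. Every function here is a hypothetical stub rather than ACT's actual API: generate synthetic pairs, fine-tune, evaluate at execution level, and stop when gains plateau.

```python
import random

# All names below are invented stubs illustrating the control flow only.
def generate_synthetic_pairs(seed_samples, n):
    return [f"{s}#variant{i}" for s in seed_samples for i in range(n)]

def fine_tune(model, data):            # stand-in for an LLM fine-tuning call
    return model + len(data) * 0.001   # "model" is just a pass-rate score here

def evaluate(model):                   # stand-in for execution-level checks
    return min(1.0, model + random.uniform(-0.02, 0.02))

model, seeds, best = 0.50, ["src_a", "src_b"], 0.0
for round_ in range(5):
    data = generate_synthetic_pairs(seeds, n=10)
    model = fine_tune(model, data)
    score = evaluate(model)
    if score <= best + 0.005:          # controller: stop when gains plateau
        break
    best = score
print(f"stopped after round {round_}, pass rate ~{best:.2f}")
```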
[265] Agentic RAG with Knowledge Graphs for Complex Multi-Hop Reasoning in Real-World Applications
Jean Lelong, Adnane Errazine, Annabelle Blangero
Main category: cs.AI
TL;DR: INRAExplorer is an advanced agentic RAG system that overcomes conventional RAG limitations by using LLM-based agents with multi-tool architecture to explore INRAE’s scientific knowledge base through comprehensive knowledge graphs, enabling complex query handling and multi-hop reasoning.
Details
Motivation: Conventional RAG systems fail on complex queries, providing only limited extractive answers and struggling with multiple targeted retrievals or navigating intricate entity relationships, creating a critical gap in knowledge-intensive domains that requires more sophisticated approaches.
Method: The paper introduces INRAExplorer, which employs an LLM-based agent with multi-tool architecture that dynamically engages a rich knowledge base through a comprehensive knowledge graph derived from open access INRAE publications, enabling iterative and targeted querying capabilities.
Result: INRAExplorer successfully conducts iterative, targeted queries, retrieves exhaustive datasets (such as all publications by an author), performs multi-hop reasoning, and delivers structured, comprehensive answers for exploring INRAE’s scientific data.
Conclusion: INRAExplorer demonstrates how agentic RAG systems can enhance knowledge interaction in specialized fields by overcoming the limitations of conventional RAG through sophisticated multi-tool architectures and knowledge graph integration.
Abstract: Conventional Retrieval-Augmented Generation (RAG) systems enhance Large Language Models (LLMs) but often fall short on complex queries, delivering limited, extractive answers and struggling with multiple targeted retrievals or navigating intricate entity relationships. This is a critical gap in knowledge-intensive domains. We introduce INRAExplorer, an agentic RAG system for exploring the scientific data of INRAE (France’s National Research Institute for Agriculture, Food and Environment). INRAExplorer employs an LLM-based agent with a multi-tool architecture to dynamically engage a rich knowledge base, through a comprehensive knowledge graph derived from open access INRAE publications. This design empowers INRAExplorer to conduct iterative, targeted queries, retrieve exhaustive datasets (e.g., all publications by an author), perform multi-hop reasoning, and deliver structured, comprehensive answers. INRAExplorer serves as a concrete illustration of enhancing knowledge interaction in specialized fields.
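A toy illustration of the multi-hop, tool-driven pattern, with an invented two-fact knowledge graph and tool names that are not INRAExplorer's: the agent chains two lookups to answer a question no single retrieval could.

```python
# Hypothetical knowledge graph: (subject, relation) -> objects.
KG = {
    ("alice", "authored"): ["paper1", "paper2"],
    ("paper1", "cites"): ["paper3"],
}

def kg_lookup(subject, relation):
    """One agent tool: a targeted graph query."""
    return KG.get((subject, relation), [])

def agent(question):
    # Multi-hop plan: everything cited by anything Alice authored.
    hop1 = kg_lookup("alice", "authored")
    hop2 = [c for p in hop1 for c in kg_lookup(p, "cites")]
    return {"authored": hop1, "cited_by_those": hop2}

print(agent("What do Alice's papers cite?"))
```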
[266] Frontier AI Risk Management Framework in Practice: A Risk Analysis Technical Report
Shanghai AI Lab: Xiaoyang Chen, Yunhao Chen, Zeren Chen, Zhiyun Chen, Hanyun Cui, Yawen Duan, Jiaxuan Guo, Qi Guo, Xuhao Hu, Hong Huang, Lige Huang, Chunxiao Li, Juncheng Li, Qihao Lin, Dongrui Liu, Xinmin Liu, Zicheng Liu, Chaochao Lu, Xiaoya Lu, Jingjing Qu, Qibing Ren, Jing Shao, Jingwei Shi, Jingwei Sun, Peng Wang, Weibing Wang, Jia Xu, Lewen Yan, Xiao Yu, Yi Yu, Boxuan Zhang, Jie Zhang, Weichen Zhang, Zhijie Zheng, Tianyi Zhou, Bowen Zhou
Main category: cs.AI
TL;DR: This paper presents a comprehensive risk assessment framework for frontier AI models, evaluating seven critical risk areas using color-coded zones (green/yellow/red) based on risk thresholds, finding that current AI models are in manageable to moderate risk zones without crossing critical red lines.
Details
Motivation: The rapid advancement of AI models poses unprecedented risks that need systematic identification and assessment. There is an urgent need to understand and categorize these frontier risks to enable appropriate risk management and deployment decisions for increasingly capable AI systems.
Method: The authors use the E-T-C analysis framework (deployment environment, threat source, enabling capability) from the Frontier AI Risk Management Framework to assess seven risk areas: cyber offense, biological/chemical risks, persuasion/manipulation, uncontrolled autonomous AI R&D, strategic deception/scheming, self-replication, and collusion. They apply the “AI-45° Law” with red lines (intolerable thresholds) and yellow lines (early warning indicators) to create green/yellow/red risk zones for evaluation.
Result: All recent frontier AI models were found to reside in green and yellow zones without crossing red lines. No models crossed yellow lines for cyber offense or uncontrolled AI R&D. Most models remained in green zones for self-replication and strategic deception (except some reasoning models in yellow). Most models were in yellow zones for persuasion/manipulation due to effective human influence. For biological/chemical risks, most models potentially reside in yellow zones but require further detailed assessment.
Conclusion: Current frontier AI models present manageable to moderate risks without reaching critical thresholds, but continuous monitoring and strengthened mitigations are needed, especially for persuasion/manipulation and biological/chemical risks. The work emphasizes the need for collective action to address these evolving AI frontier risks as models continue to advance.
Abstract: To understand and identify the unprecedented risks posed by rapidly advancing artificial intelligence (AI) models, this report presents a comprehensive assessment of their frontier risks. Drawing on the E-T-C analysis (deployment environment, threat source, enabling capability) from the Frontier AI Risk Management Framework (v1.0) (SafeWork-F1-Framework), we identify critical risks in seven areas: cyber offense, biological and chemical risks, persuasion and manipulation, uncontrolled autonomous AI R&D, strategic deception and scheming, self-replication, and collusion. Guided by the “AI-45° Law,” we evaluate these risks using “red lines” (intolerable thresholds) and “yellow lines” (early warning indicators) to define risk zones: green (manageable risk for routine deployment and continuous monitoring), yellow (requiring strengthened mitigations and controlled deployment), and red (necessitating suspension of development and/or deployment). Experimental results show that all recent frontier AI models reside in green and yellow zones, without crossing red lines. Specifically, no evaluated models cross the yellow line for cyber offense or uncontrolled AI R&D risks. For self-replication, and strategic deception and scheming, most models remain in the green zone, except for certain reasoning models in the yellow zone. In persuasion and manipulation, most models are in the yellow zone due to their effective influence on humans. For biological and chemical risks, we are unable to rule out the possibility of most models residing in the yellow zone, although detailed threat modeling and in-depth assessment are required to make further claims. This work reflects our current understanding of AI frontier risks and urges collective action to mitigate these challenges.
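The green/yellow/red zoning rule can be stated as a small classification function; the thresholds below are illustrative placeholders, not the framework's calibrated lines.

```python
# Toy encoding of the zoning rule described above; the numeric thresholds
# are invented for illustration.
def risk_zone(score, yellow_line, red_line):
    if score >= red_line:
        return "red"     # suspend development and/or deployment
    if score >= yellow_line:
        return "yellow"  # strengthened mitigations, controlled deployment
    return "green"       # routine deployment, continuous monitoring

for model_score in (0.2, 0.55, 0.9):
    print(model_score, risk_zone(model_score, yellow_line=0.5, red_line=0.8))
```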
[267] Novel Multi-Agent Action Masked Deep Reinforcement Learning for General Industrial Assembly Lines Balancing Problems
Ali Mohamed Ali, Luca Tirel, Hashim A. Hashim
Main category: cs.AI
TL;DR: This paper presents a Deep Reinforcement Learning approach for optimizing industrial assembly line scheduling by modeling it as a Markov Decision Process, using action-masking and multi-agent techniques to improve training efficiency and achieve faster convergence than traditional methods.
Details
Motivation: Traditional Integer Programming solutions become computationally infeasible for large-scale assembly line scheduling, while heuristic methods like Genetic Algorithms often produce suboptimal solutions. There's a need for efficient planning methods that can handle modern industrial assembly lines without making restrictive assumptions about assembly line types.
Method: The paper introduces a novel mathematical model formulating industrial assembly lines as a Markov Decision Process (MDP) without type-specific assumptions. Two key innovations are proposed: (1) an action-masking technique to ensure agents select only feasible actions, and (2) a multi-agent approach where each workstation has an individual agent to reduce state/action spaces. A centralized training with decentralized execution framework is adopted for scalable learning.
Result: Numerical simulations demonstrate that the proposed Deep Reinforcement Learning scheme achieves significantly faster convergence to optimal solutions compared to comparable model-based approaches. The framework successfully enables offline learning with real-time operational solutions through neural network mapping.
Conclusion: The proposed DRL-based approach with action-masking and multi-agent techniques provides an effective and scalable solution for industrial assembly line optimization, offering faster convergence and real-time decision-making capabilities while maintaining optimality without restrictive assembly line type assumptions.
Abstract: Efficient planning of activities is essential for modern industrial assembly lines to uphold manufacturing standards, prevent project constraint violations, and achieve cost-effective operations. While exact solutions to such challenges can be obtained through Integer Programming (IP), the dependence of the search space on input parameters often makes IP computationally infeasible for large-scale scenarios. Heuristic methods, such as Genetic Algorithms, can also be applied, but they frequently produce suboptimal solutions in extensive cases. This paper introduces a novel mathematical model of a generic industrial assembly line formulated as a Markov Decision Process (MDP), without imposing assumptions on the type of assembly line, a notable distinction from most existing models. The proposed model is employed to create a virtual environment for training Deep Reinforcement Learning (DRL) agents to optimize task and resource scheduling. To enhance the efficiency of agent training, the paper proposes two innovative tools. The first is an action-masking technique, which ensures the agent selects only feasible actions, thereby reducing training time. The second is a multi-agent approach, where each workstation is managed by an individual agent; as a result, the state and action spaces are reduced. A centralized training framework with decentralized execution is adopted, offering a scalable learning architecture for optimizing industrial assembly lines. This framework allows the agents to learn offline and subsequently provide real-time solutions during operations by leveraging a neural network that maps the current factory state to the optimal action. The effectiveness of the proposed scheme is validated through numerical simulations, demonstrating significantly faster convergence to the optimal solution compared to a comparable model-based approach.
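A minimal sketch of the action-masking technique in PyTorch: infeasible actions receive -inf logits, so the softmax assigns them exactly zero probability. The logits and feasibility mask are illustrative.

```python
import torch

# Infeasible actions (e.g., tasks whose precedence constraints are unmet)
# get -inf logits so they can never be sampled; values are illustrative.
logits = torch.tensor([1.2, 0.3, -0.5, 2.0])
feasible = torch.tensor([True, False, True, True])  # from scheduling constraints
masked = logits.masked_fill(~feasible, float("-inf"))
probs = torch.softmax(masked, dim=-1)
print(probs)  # probability of the infeasible action is exactly 0
```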
[268] Adaptive Inventory Strategies using Deep Reinforcement Learning for Dynamic Agri-Food Supply Chains
Amandeep Kaur, Gyan Prakash
Main category: cs.AI
TL;DR: This paper proposes a novel Deep Reinforcement Learning algorithm that combines value-based and policy-based approaches to optimize inventory management for perishable agricultural products under demand and lead time uncertainties, aiming to maximize supply chain profitability while promoting stakeholder collaboration.
Details
Motivation: Agricultural products face seasonal fluctuations in production and demand, leading to challenging inventory management with risks of excess inventory or stockouts. Existing literature lacks consideration of coordination among supply chain stakeholders, and traditional approaches struggle with the complexity introduced by uncertainties and product perishability.
Method: The study develops a novel Deep Reinforcement Learning (DRL) algorithm that combines both value-based and policy-based DRL approaches. The algorithm uses continuous action space to select optimal order quantities and incorporates perishability constraints while addressing demand and lead time uncertainties. It aims to align stakeholder interests through shared profit maximization goals.
Result: Experimental results using empirical data from fresh agricultural products supply chain demonstrate improved performance of the proposed inventory replenishment policy under stochastic demand patterns and lead time scenarios compared to existing approaches.
Conclusion: The proposed DRL algorithm effectively addresses inventory optimization challenges in agri-food supply chains by maximizing profitability while considering perishability and uncertainties simultaneously. The research provides managerial implications for policymakers to better manage agricultural product inventory under uncertainty.
Abstract: Agricultural products are often subject to seasonal fluctuations in production and demand. Predicting and managing inventory levels in response to these variations can be challenging, leading to either excess inventory or stockouts. Additionally, coordination among stakeholders at the various levels of the food supply chain is not considered in the existing body of literature. To bridge these research gaps, this study focuses on inventory management of agri-food products under demand and lead time uncertainties. Implementing an effective inventory replenishment policy maximizes the overall profit throughout the supply chain. However, these uncertainties and the limited shelf life of the product increase the complexity of the problem, making it challenging for traditional approaches to generate an optimal set of solutions. Thus, the current study proposes a novel Deep Reinforcement Learning (DRL) algorithm that combines the benefits of both value- and policy-based DRL approaches for inventory optimization under uncertainties. The proposed algorithm can incentivize collaboration among stakeholders by aligning their interests and objectives through the shared optimization goal of maximizing profitability along the agri-food supply chain while considering perishability and uncertainty simultaneously. By selecting optimal order quantities with a continuous action space, the proposed algorithm effectively addresses the inventory optimization challenges. To rigorously evaluate this algorithm, empirical data from a fresh agricultural products supply chain inventory is considered. Experimental results corroborate the improved performance of the proposed inventory replenishment policy under stochastic demand patterns and lead time scenarios. The research findings hold managerial implications for policymakers to manage the inventory of agricultural products more effectively under uncertainty.
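A toy single-period reward illustrating how a continuous order-quantity action trades off sales revenue, holding cost, and spoilage. All prices and the spoilage rule are invented for illustration, not the paper's model.

```python
# Hypothetical one-step reward for perishable inventory; parameter values
# and the fractional-spoilage rule are illustrative assumptions.
def step_reward(order_qty, on_hand, demand, price=5.0, cost=2.0,
                holding=0.1, shelf_life_frac=0.8):
    stock = on_hand + order_qty
    sold = min(stock, demand)
    leftover = stock - sold
    spoiled = leftover * (1 - shelf_life_frac)   # perishability penalty
    carried = leftover - spoiled
    return price * sold - cost * order_qty - holding * carried - cost * spoiled

print(step_reward(order_qty=40.0, on_hand=10.0, demand=35.0))
```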
[269] Deliberative Searcher: Improving LLM Reliability via Reinforcement Learning with constraints
Zhenyun Yin, Shujie Wang, Xuhong Wang, Xingjun Ma, Yinchun Wang
Main category: cs.AI
TL;DR: This paper introduces Deliberative Searcher, a framework that combines certainty calibration with retrieval-based search for open-domain QA, using reinforcement learning to improve LLM reliability by better aligning model confidence with correctness.
Details
Motivation: Large language models lack reliability for real-world deployment, particularly in their ability to accurately assess their own confidence levels, making it difficult to trust their outputs in critical applications.
Method: The authors propose Deliberative Searcher, which integrates certainty calibration with retrieval-based search over Wikipedia data. The agent performs multi-step reflection and verification processes and is trained using reinforcement learning with a soft reliability constraint to optimize accuracy.
Result: Empirical results demonstrate improved alignment between model confidence and correctness in open-domain question answering tasks, leading to more trustworthy model outputs.
Conclusion: The Deliberative Searcher framework successfully enhances LLM reliability by better calibrating confidence with actual performance, making the models more suitable for real-world deployment where trustworthy outputs are essential.
Abstract: Improving the reliability of large language models (LLMs) is critical for deploying them in real-world scenarios. In this paper, we propose Deliberative Searcher, the first framework to integrate certainty calibration with retrieval-based search for open-domain question answering. The agent performs multi-step reflection and verification over Wikipedia data and is trained with a reinforcement learning algorithm that optimizes for accuracy under a soft reliability constraint. Empirical results show that the proposed method improves alignment between model confidence and correctness, leading to more trustworthy outputs. This paper will be continuously updated.
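One plausible reading of "accuracy under a soft reliability constraint" is a penalty-based reward that docks the agent for miscalibration. The sketch below is an assumption for illustration; the summary does not specify the exact constraint form.

```python
# Hypothetical Lagrangian-style shaping: reward accuracy, penalize the gap
# between stated confidence and actual correctness. lam is illustrative.
def shaped_reward(correct, confidence, lam=0.5):
    accuracy_term = 1.0 if correct else 0.0
    calibration_gap = abs(accuracy_term - confidence)  # confident-and-wrong hurts
    return accuracy_term - lam * calibration_gap

print(shaped_reward(correct=True, confidence=0.9))   # well calibrated: 0.95
print(shaped_reward(correct=False, confidence=0.9))  # overconfident: -0.45
```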
[270] WGRAMMAR: Leverage Prior Knowledge to Accelerate Structured Decoding
Ran Wang, Xiaoxuan Liu, Hao Ren, Gang Chen, Fanchao Qi, Maosong Sun
Main category: cs.AI
TL;DR: The paper introduces wgrammar, a lightweight decoding engine that speeds up structured output generation in LLMs by up to 250x through decomposing constraints into static and dynamic components, using compositional operators instead of pushdown automata, and implementing domain-aware optimizations.
Details
Motivation: Existing structured decoding methods for LLMs suffer from efficiency bottlenecks in grammar compilation, state tracking, and mask creation when generating formatted outputs like HTML or JSON, creating a need for more efficient approaches that can leverage real-world task structure.
Method: The approach decomposes constraints into static and dynamic components, precompiling static structures offline while instantiating dynamic arguments at runtime using grammar snippets. It replaces pushdown automata with compositional operators for modeling regular formats and integrates domain-aware simplification, constraint decomposition, and mask caching.
Result: The wgrammar system achieves up to 250x speedup over existing structured decoding systems while maintaining the ability to generate properly formatted outputs required by downstream systems.
Conclusion: By leveraging prior knowledge about output structure and decomposing constraints strategically, wgrammar demonstrates that structured decoding can be made significantly more efficient without sacrificing functionality, with the system being made publicly available for broader adoption.
Abstract: Structured decoding enables large language models (LLMs) to generate outputs in formats required by downstream systems, such as HTML or JSON. However, existing methods suffer from efficiency bottlenecks due to grammar compilation, state tracking, and mask creation. We observe that many real-world tasks embed strong prior knowledge about output structure. Leveraging this, we propose a decomposition of constraints into static and dynamic components – precompiling static structures offline and instantiating dynamic arguments at runtime using grammar snippets. Instead of relying on pushdown automata, we employ a compositional set of operators to model regular formats, achieving lower transition latency. We introduce wgrammar, a lightweight decoding engine that integrates domain-aware simplification, constraint decomposition, and mask caching, achieving up to 250x speedup over existing systems. wgrammar’s source code is publicly available at https://github.com/wrran/wgrammar.
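An illustrative static/dynamic decomposition in miniature, mimicking the idea rather than wgrammar's API: a JSON skeleton is fixed offline, while small grammar snippets (here, regexes) constrain field values at runtime, with mask results cached.

```python
import re

# The fixed skeleton is "compiled" once; field-level snippets are checked
# at runtime. Template, snippets, and cache key are illustrative.
STATIC_TEMPLATE = '{{"name": "{name}", "age": {age}}}'          # precompiled offline
SNIPPETS = {"name": re.compile(r"[A-Za-z ]+$"), "age": re.compile(r"\d{1,3}$")}

MASK_CACHE = {}
def allowed(field, text):
    key = (field, text)
    if key not in MASK_CACHE:                     # mask caching
        MASK_CACHE[key] = bool(SNIPPETS[field].match(text))
    return MASK_CACHE[key]

if allowed("name", "Ada Lovelace") and allowed("age", "36"):
    print(STATIC_TEMPLATE.format(name="Ada Lovelace", age=36))
```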
[271] ChatChecker: A Framework for Dialogue System Testing and Evaluation Through Non-cooperative User Simulation
Roman Mayr, Michel Schimpf, Thomas Bohné
Main category: cs.AI
TL;DR: ChatChecker is an automated framework that uses LLMs to simulate user interactions and evaluate complex dialogue systems at the dialogue level, improving breakdown detection and uncovering system weaknesses through non-cooperative user simulation.
Details
Motivation: Modern dialogue systems integrate multiple LLMs, external tools, and databases, making isolated LLM evaluation insufficient. Previous work focused on turn-level analysis rather than integrated dialogue-level quality assurance, creating a need for comprehensive testing of dialogue systems as complete units.
Method: ChatChecker framework uses LLMs to simulate diverse user interactions, incorporates an error taxonomy in prompts for improved breakdown detection, and employs a novel non-cooperative user simulator with challenging personas to test dialogue systems without requiring reference dialogues or coupling to system implementation.
Result: The framework demonstrates improved breakdown detection performance over prior LLM-based approaches by including error taxonomy in prompts. The non-cooperative user simulator effectively uncovers weaknesses in target dialogue systems, providing more thorough testing capabilities.
Conclusion: ChatChecker enables scalable and thorough testing of dialogue systems with reduced setup effort and high generalizability, contributing to accelerated development of robust dialogue systems for both researchers and practitioners.
Abstract: While modern dialogue systems heavily rely on large language models (LLMs), their implementation often goes beyond pure LLM interaction. Developers integrate multiple LLMs, external tools, and databases. Therefore, assessment of the underlying LLM alone does not suffice, and the dialogue systems must be tested and evaluated as a whole. However, this remains a major challenge. With most previous work focusing on turn-level analysis, less attention has been paid to integrated dialogue-level quality assurance. To address this, we present ChatChecker, a framework for automated evaluation and testing of complex dialogue systems. ChatChecker uses LLMs to simulate diverse user interactions, identify dialogue breakdowns, and evaluate quality. Compared to previous approaches, our design reduces setup effort and is generalizable, as it does not require reference dialogues and is decoupled from the implementation of the target dialogue system. We improve breakdown detection performance over a prior LLM-based approach by including an error taxonomy in the prompt. Additionally, we propose a novel non-cooperative user simulator based on challenging personas that uncovers weaknesses in target dialogue systems more effectively. Through this, ChatChecker contributes to thorough and scalable testing. This enables both researchers and practitioners to accelerate the development of robust dialogue systems.
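A hypothetical sketch of how a non-cooperative persona and an error taxonomy could be assembled into a simulator prompt; the persona fields and taxonomy labels are invented for illustration, not ChatChecker's.

```python
# Invented taxonomy labels; a real taxonomy would be grounded in
# dialogue-breakdown literature.
ERROR_TAXONOMY = ["ignore_question", "wrong_entity", "contradiction",
                  "repetition", "topic_drift"]

def simulator_prompt(persona, system_reply):
    return (
        f"You simulate a user who is {persona['trait']} and wants "
        f"{persona['goal']}, but answers evasively and changes their mind.\n"
        f"Dialogue system said: {system_reply!r}\n"
        "Reply as this user in one short message.\n"
        "Then label any breakdown you observed with one of: "
        + ", ".join(ERROR_TAXONOMY)
    )

persona = {"trait": "impatient", "goal": "to book a table without giving a time"}
print(simulator_prompt(persona, "What time would you like to book?"))
```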
[272] Uncertainty-Aware Knowledge Transformers for Peer-to-Peer Energy Trading with Multi-Agent Reinforcement Learning
Mian Ibad Ali Shah, Enda Barrett, Karl Mason
Main category: cs.AI
TL;DR: This paper proposes a novel P2P energy trading framework that combines uncertainty-aware prediction using a Knowledge Transformer with Uncertainty (KTU) model and multi-agent reinforcement learning (MARL) to optimize trading strategies, achieving significant cost reductions and revenue increases compared to deterministic forecasting approaches.
Details
Motivation: Current P2P energy trading systems rely on deterministic forecasts, which fail to address the inherent uncertainty and stochastic nature of energy markets. This creates a critical gap in robust decision-making for energy trading, as agents cannot properly assess risk and variability when making trading decisions.
Method: The framework integrates a heteroscedastic probabilistic transformer-based prediction model (KTU) that quantifies prediction uncertainty with multi-agent reinforcement learning. The KTU model uses domain-specific features and a custom loss function to provide reliable probabilistic forecasts and confidence intervals. These uncertainty-aware predictions are then fed into a MARL framework using Deep Q-Network (DQN) to enable agents to optimize trading strategies while understanding risk.
Result: The uncertainty-aware DQN achieved: 5.7% reduction in energy purchase costs without P2P trading and 3.2% with P2P trading; 6.4% and 44.7% increase in electricity sales revenue respectively; 38.8% reduction in peak hour grid demand without P2P and 45.6% with P2P trading. The improvements were more pronounced when P2P trading was enabled.
Conclusion: The integration of uncertainty-aware forecasting with MARL creates synergistic effects that significantly improve economic efficiency and grid resilience in P2P energy trading. The approach demonstrates that explicitly modeling prediction uncertainty is crucial for robust energy trading decisions, with enhanced benefits when combined with P2P market mechanisms.
Abstract: This paper presents a novel framework for Peer-to-Peer (P2P) energy trading that integrates uncertainty-aware prediction with multi-agent reinforcement learning (MARL), addressing a critical gap in current literature. In contrast to previous works relying on deterministic forecasts, the proposed approach employs a heteroscedastic probabilistic transformer-based prediction model called Knowledge Transformer with Uncertainty (KTU) to explicitly quantify prediction uncertainty, which is essential for robust decision-making in the stochastic environment of P2P energy trading. The KTU model leverages domain-specific features and is trained with a custom loss function that ensures reliable probabilistic forecasts and confidence intervals for each prediction. Integrating these uncertainty-aware forecasts into the MARL framework enables agents to optimize trading strategies with a clear understanding of risk and variability. Experimental results show that the uncertainty-aware Deep Q-Network (DQN) reduces energy purchase costs by up to 5.7% without P2P trading and 3.2% with P2P trading, while increasing electricity sales revenue by 6.4% and 44.7%, respectively. Additionally, peak hour grid demand is reduced by 38.8% without P2P and 45.6% with P2P. These improvements are even more pronounced when P2P trading is enabled, highlighting the synergy between advanced forecasting and market mechanisms for resilient, economically efficient energy communities.
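The standard heteroscedastic Gaussian negative log-likelihood shows the general form such a custom loss takes: the model predicts both a mean and an input-dependent variance. The paper's exact loss is not given in this summary; this is the generic version.

```python
import torch

# Generic heteroscedastic NLL: penalizes error scaled by predicted variance,
# plus a term that discourages inflating the variance. Values illustrative.
def hetero_nll(mu, log_var, target):
    return 0.5 * (log_var + (target - mu) ** 2 / log_var.exp()).mean()

mu = torch.tensor([10.0, 12.0])       # predicted energy demand
log_var = torch.tensor([0.1, 2.0])    # per-prediction uncertainty
target = torch.tensor([10.5, 11.0])
print(hetero_nll(mu, log_var, target))
# A downstream agent could use mu +/- k * exp(0.5 * log_var) as confidence bands.
```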
[273] Toward A Causal Framework for Modeling Perception
Jose M. Alvarez, Salvatore Ruggieri
Main category: cs.AI
TL;DR: This paper introduces a causal framework for modeling perception in machine learning systems, defining how different experts may interpret the same ML outputs differently due to their individual experiences, with implications for fairness and bias in human-ML decision-making processes.
Details
Motivation: Perception - where individuals interpret the same information differently - is a known cognitive phenomenon that affects human decision-making bias, but remains understudied in ML. This is problematic since modern decision flows involve human experts who may interpret ML model outputs (like deferred instances or explanations) differently, creating a gap in understanding human-ML interaction.
Method: The authors formalize perception using structural causal models (SCMs) under causal reasoning. They model individual experience as additional causal knowledge that experts bring to decision-making in the form of SCMs. They define two types of probabilistic causal perception: structural perception and parametrical perception, integrating these concepts into ML-enabled decision flows.
Result: The framework is demonstrated through examples of modern decision flows, showing how the causal perception model can be applied to real-world scenarios. The work establishes a formal foundation for understanding how individual expert experiences influence interpretation of ML outputs.
Conclusion: The paper establishes the first causal approach to modeling perception in ML contexts, emphasizing its importance for fair ML applications. The framework provides a foundation for addressing bias and fairness implications that arise when human experts with different perceptual backgrounds interact with ML systems in decision-making processes.
Abstract: Perception occurs when individuals interpret the same information differently. It is a known cognitive phenomenon with implications for bias in human decision-making. Perception, however, remains understudied in machine learning (ML). This is problematic as modern decision flows, whether partially or fully automated by ML applications, always involve human experts. How might we account for cases in which two experts interpret the same deferred instance or explanation from an ML model differently? Addressing this and similar questions requires a formulation of perception, particularly, in a manner that integrates with ML-enabled decision flows. In this work, we present a first approach to modeling perception causally. We define perception under causal reasoning using structural causal models (SCM). Our approach formalizes individual experience as additional causal knowledge that comes with and is used by the expert decision-maker in the form of a SCM. We define two kinds of probabilistic causal perception: structural perception and parametrical perception. We showcase our framework through a series of examples of modern decision flows. We also emphasize the importance of addressing perception in fair ML, discussing relevant fairness implications and possible applications.
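A tiny sketch of the "parametrical perception" idea: two experts share the same causal structure (ML score feeds a decision) but weigh the score differently, so the same model output yields different decisions. The variables and weights are invented for illustration.

```python
# Hypothetical structural equation: perceived = weight * ml_score, then a
# threshold decision. Two experts differ only in the parameter (weight).
def expert_decision(ml_score, weight, threshold=0.5):
    perceived = weight * ml_score          # structural equation f(score)
    return "approve" if perceived >= threshold else "defer"

ml_score = 0.6
print("expert A:", expert_decision(ml_score, weight=1.0))   # approve
print("expert B:", expert_decision(ml_score, weight=0.7))   # defer
```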
[274] Alto: Orchestrating Distributed Compound AI Systems with Nested Ancestry
Deepti Raghavan, Keshav Santhanam, Muhammad Shahir Rahman, Nayani Modugula, Luis Gaspar Schroeder, Maximilien Cura, Houjun Liu, Pratiksha Thaker, Philip Levis, Matei Zaharia
Main category: cs.AI
TL;DR: Alto is a framework that automatically optimizes compound AI applications through streaming and parallelism using a novel “nested ancestry” abstraction, achieving 10-30% latency improvements over existing frameworks like LangGraph.
Details
Motivation: Compound AI applications that chain together multiple components (language models, retrievers, embedding models) face optimization challenges because each component has different data constraints and granularities. Existing systems fail to fully exploit parallelism and pipelining opportunities due to the complexity of managing intermediate data generation and text fragmentation/aggregation.
Method: The paper introduces Alto framework with a "nested ancestry" abstraction - a metadata hierarchy that tracks partial outputs and aggregates data across heterogeneous component constraints. This metadata is automatically inferred from the programming model, enabling developers to express complex dataflow patterns without manual routing and aggregation reasoning.
Result: Alto implementations of four applications outperform or match LangGraph implementations, achieving latency improvements of 10-30% while maintaining equivalent performance quality.
Conclusion: Alto successfully addresses the optimization challenges in compound AI systems by providing automatic streaming and parallelism optimization through the nested ancestry abstraction, demonstrating significant performance improvements over existing frameworks while simplifying development complexity.
Abstract: Compound AI applications chain together subcomponents such as generative language models, document retrievers, and embedding models. Applying traditional systems optimizations such as parallelism and pipelining in compound AI systems is difficult because each component has different constraints in terms of the granularity and type of data that it ingests. New data is often generated during intermediate computations, and text streams may be split into smaller, independent fragments (such as documents to sentences) which may then be re-aggregated at later parts of the computation. Due to this complexity, existing systems to serve compound AI queries do not fully take advantage of parallelism and pipelining opportunities. We present Alto, a framework that automatically optimizes execution of compound AI queries through streaming and parallelism. Alto introduces a new abstraction called nested ancestry, a metadata hierarchy that allows the system to correctly track partial outputs and aggregate data across the heterogeneous constraints of the components of compound AI applications. This metadata is automatically inferred from the programming model, allowing developers to express complex dataflow patterns without needing to reason manually about the details of routing and aggregation. Implementations of four applications in Alto outperform or match implementations in LangGraph, a popular existing AI programming framework. Alto implementations match or improve latency by 10-30%.
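An illustrative take on nested ancestry as a data structure: each fragment carries the (parent, index) path that produced it, so fragments can be re-aggregated at any ancestor level. The field names are assumptions, not Alto's actual schema.

```python
from dataclasses import dataclass

# Hypothetical ancestry record: a tuple of (level, index) pairs tracing how
# a fragment was produced, enabling re-grouping by any ancestor.
@dataclass(frozen=True)
class Fragment:
    text: str
    ancestry: tuple  # e.g. (("doc", 0), ("sentence", 2))

doc_id = ("doc", 0)
sentences = ["RAG is neat.", "Pipelines help."]
frags = [Fragment(s, (doc_id, ("sentence", i))) for i, s in enumerate(sentences)]

# Re-aggregate by the document ancestor once downstream components finish.
by_doc = {}
for f in frags:
    by_doc.setdefault(f.ancestry[0], []).append(f.text)
print(by_doc)  # {('doc', 0): ['RAG is neat.', 'Pipelines help.']}
```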
[275] Efficient Strategy Learning by Decoupling Searching and Pathfinding for Object Navigation
Yanwei Zheng, Shaopu Feng, Bowen Huang, Chuanlin Lan, Xiao Zhang, Dongxiao Yu
Main category: cs.AI
TL;DR: This paper proposes a two-stage navigation approach with separate reward mechanisms for searching and pathfinding stages, enhanced by depth-aware visual encoding, achieving superior performance in object navigation tasks.
Details
Motivation: Existing navigation models use parallel submodules for searching and pathfinding but ignore the differences in reward signals between stages, leading to incomplete training or overfitting. Additionally, generic visual encoders lack depth information crucial for spatial perception in navigation scenes.
Method: The authors propose Two-Stage Reward Mechanism (TSRM) that decouples searching and pathfinding behaviors with stage-specific rewards, and Depth Enhanced Masked Autoencoders (DE-MAE) for better spatial perception using depth information. They also introduce a new evaluation metric SSSPL (Searching Success weighted by Searching Path Length).
Result: Extensive evaluation on AI2-Thor and RoboTHOR datasets shows the proposed method outperforms state-of-the-art approaches in both success rate and navigation efficiency, demonstrating improved searching ability and exploring efficiency.
Conclusion: The two-stage reward mechanism successfully enables agents to explore larger areas during searching and find optimal paths during pathfinding, while depth-enhanced visual encoding significantly improves spatial perception for navigation tasks.
Abstract: Human-like navigation involves first searching to explore unknown areas before discovering the target, and then pathfinding to move towards the discovered target. Inspired by this, recent studies design parallel submodules to achieve different functions in the searching and pathfinding stages, while ignoring the differences in reward signals between the two stages. As a result, these models often cannot be fully trained or overfit on training scenes. Another bottleneck that restricts agents from learning two-stage strategies is spatial perception ability, since these studies used generic visual encoders without considering the depth information of navigation scenes. To release the potential of the model on strategy learning, we propose the Two-Stage Reward Mechanism (TSRM) for object navigation that decouples the searching and pathfinding behaviours in an episode, enabling the agent to explore a larger area in the searching stage and seek the optimal path in the pathfinding stage. We also propose a pretraining method, Depth Enhanced Masked Autoencoders (DE-MAE), that enables the agent to determine explored and unexplored areas during the searching stage, and to locate the target object and plan paths during the pathfinding stage more accurately. In addition, we propose a new metric, Searching Success weighted by Searching Path Length (SSSPL), that assesses the agent's searching ability and exploring efficiency. Finally, we evaluated our method extensively on AI2-Thor and RoboTHOR and demonstrated that it can outperform the state-of-the-art (SOTA) methods in both success rate and navigation efficiency.
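A minimal sketch of a two-stage reward in this spirit: before the target is first observed, reward exploration coverage; afterwards, reward progress toward the target. The coefficients are illustrative, not the paper's values.

```python
# Hypothetical stage-switched reward; coefficients 0.1 and 1.0 are invented.
def tsrm_reward(target_seen, newly_explored_cells, dist_before, dist_after):
    if not target_seen:                       # searching stage: cover more area
        return 0.1 * newly_explored_cells
    return 1.0 * (dist_before - dist_after)   # pathfinding stage: close distance

print(tsrm_reward(False, newly_explored_cells=3, dist_before=9, dist_after=9))  # 0.3
print(tsrm_reward(True,  newly_explored_cells=0, dist_before=9, dist_after=8))  # 1.0
```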
[276] R2D2: Remembering, Replaying and Dynamic Decision Making with a Reflective Agentic Memory
Tenghao Huang, Kinjal Basu, Ibrahim Abdelaziz, Pavan Kapanipathi, Jonathan May, Muhao Chen
Main category: cs.AI
TL;DR: The paper introduces R2D2, a web agent framework that combines memory-based navigation (Remember) and error-learning mechanisms (Reflect) to improve web interaction performance, achieving 50% fewer navigation errors and 3x higher task completion rates on WebArena benchmark.
Details
Motivation: Current web agents struggle with efficient navigation and action execution due to limited visibility and understanding of complex web structures, leading to navigational errors and poor decision-making during web interactions.
Method: R2D2 framework integrates two paradigms: (1) Remember - uses a replay buffer to help agents dynamically reconstruct web environments and create detailed maps of visited pages, and (2) Reflect - enables agents to learn from past mistakes through error analysis and strategy refinement mechanisms.
Result: Evaluation on WebArena benchmark shows substantial improvements: 50% reduction in navigation errors and threefold increase in task completion rates compared to existing methods.
Conclusion: The combination of memory-enhanced navigation and reflective learning significantly advances web agent capabilities, with potential applications in automated customer service and personal digital assistants.
Abstract: The proliferation of web agents necessitates advanced navigation and interaction strategies within complex web environments. Current models often struggle with efficient navigation and action execution due to limited visibility and understanding of web structures. Our proposed R2D2 framework addresses these challenges by integrating two paradigms: Remember and Reflect. The Remember paradigm uses a replay buffer that aids agents in reconstructing the web environment dynamically, thus enabling the formulation of a detailed “map” of previously visited pages. This helps in reducing navigational errors and optimizing the decision-making process during web interactions. Conversely, the Reflect paradigm allows agents to learn from past mistakes by providing a mechanism for error analysis and strategy refinement, enhancing overall task performance. We evaluate R2D2 using the WebArena benchmark, demonstrating substantial improvements over existing methods, including a 50% reduction in navigation errors and a threefold increase in task completion rates. Our findings suggest that a combination of memory-enhanced navigation and reflective learning promisingly advances the capabilities of web agents, potentially benefiting various applications such as automated customer service and personal digital assistants.
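A minimal sketch of the Remember/Reflect pair: a replay buffer of page transitions folded into an adjacency map the agent can consult before acting, plus a keyed store of post-failure reflection notes. The pages and the note are invented for illustration.

```python
# "Remember": fold (page, action, next_page) transitions into a map so the
# agent avoids blind re-exploration. Contents are hypothetical.
replay_buffer = [
    ("home", "click:orders", "orders"),
    ("orders", "click:order#7", "order_detail"),
    ("home", "click:orders", "orders"),          # duplicate visit, deduplicated
]

web_map = {}
for page, action, nxt in replay_buffer:
    web_map.setdefault(page, {})[action] = nxt

print(web_map["home"])  # known ways out of 'home'

# "Reflect": after a failed episode, store an error note keyed by state.
reflections = {"order_detail": "refund button is under 'More actions', not top bar"}
print(reflections["order_detail"])
```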
[277] InternAgent: When Agent Becomes the Scientist – Building Closed-Loop System from Hypothesis to Verification
InternAgent Team, Bo Zhang, Shiyang Feng, Xiangchao Yan, Jiakang Yuan, Runmin Ma, Yusong Hu, Zhiyin Yu, Xiaohan He, Songtao Huang, Shaowei Hou, Zheng Nie, Zhilong Wang, Jinyao Liu, Tianshuo Peng, Peng Ye, Dongzhan Zhou, Shufei Zhang, Xiaosong Wang, Yilan Zhang, Meng Li, Zhongying Tu, Xiangyu Yue, Wangli Ouyang, Bowen Zhou, Lei Bai
Main category: cs.AI
TL;DR: InternAgent is a unified multi-agent AI framework that automates scientific research across multiple domains, demonstrating scalability across 12 research tasks, human-AI interactivity, and significant efficiency gains with performance improvements achieved in hours rather than traditional lengthy research cycles.
Details
Motivation: AI is transforming scientific research by enhancing efficiency and driving innovation, but there's a need for a unified framework that can autonomously conduct scientific research across various fields while allowing human expert integration and achieving faster results than traditional research methods.
Method: InternAgent employs a unified closed-loop multi-agent framework designed for Autonomous Scientific Research (ASR). The system integrates human expert feedback through an interactive interface and enables multi-agent collaboration in automated end-to-end processes, allowing seamless incorporation of domain expert knowledge while maintaining autonomous operation.
Result: InternAgent demonstrated effectiveness across 12 scientific research tasks with significant performance improvements: reaction yield prediction increased from 27.6% to 35.4% in 12 hours, enhancer activity prediction accuracy rose from 0.65 to 0.79 in 4 hours, and 2D semantic segmentation precision improved from 78.8% to 81.0% in 30 hours, all with substantially reduced time costs compared to human efforts.
Conclusion: InternAgent successfully establishes a scalable, interactive, and efficient framework for autonomous scientific research that can generate innovative ideas, integrate human expertise, and achieve promising performance gains across multiple scientific domains with dramatically reduced time requirements compared to traditional research approaches.
Abstract: Artificial Intelligence (AI) is accelerating the transformation of scientific research paradigms, not only enhancing research efficiency but also driving innovation. We introduce InternAgent, a unified closed-loop multi-agent framework to conduct Autonomous Scientific Research (ASR) across various scientific research fields, enabling researchers to tackle complicated problems in these fields with unprecedented speed and precision. InternAgent highlights three key advantages: 1) Scalability: InternAgent has demonstrated its versatility across 12 scientific research tasks, capable of generating innovative ideas to enhance the performance of baseline code. 2) Interactivity: InternAgent provides an interface for human expert feedback and multi-agent interaction in automated end-to-end processes, allowing for the seamless integration of domain expert knowledge. 3) Efficiency: InternAgent has achieved promising performance gains in several scientific fields with significantly less time cost compared to human efforts. For instance, in reaction yield prediction, it increased from 27.6% to 35.4% in just 12 hours; in enhancer activity prediction, accuracy rose from 0.65 to 0.79 with only 4 hours of processing; and in 2D semantic segmentation, precision advanced from 78.8% to 81.0% in a mere 30 hours.
[278] DCG-SQL: Enhancing In-Context Learning for Text-to-SQL with Deep Contextual Schema Link Graph
Jihyung Lee, Jin-Seop Lee, Jaehoon Lee, YunSeok Choi, Jee-Hyong Lee
Main category: cs.AI
TL;DR: This paper proposes a Deep Contextual Schema Link Graph approach to improve Text-to-SQL performance by better retrieving demonstrations for in-context learning, showing consistent improvements across both large and small language models on the Spider benchmark.
Details
Motivation: Existing Text-to-SQL methods with in-context learning show little improvement over random demonstrations and significant performance drops on smaller LLMs, indicating heavy reliance on hyper-scaled LLMs' intrinsic capabilities rather than effective demonstration retrieval.
Method: The paper constructs a Deep Contextual Schema Link Graph that captures key information and semantic relationships between natural language questions and database schema items, enabling graph-based representation of Text-to-SQL samples and effective retrieval of useful demonstrations for in-context learning.
Result: Experimental results on the Spider benchmark demonstrate consistent improvements in SQL generation performance and efficiency across both hyper-scaled LLMs and small LLMs, validating the effectiveness of the proposed graph-based approach.
Conclusion: The Deep Contextual Schema Link Graph approach successfully addresses the limitations of existing Text-to-SQL methods by enabling more effective demonstration retrieval, leading to improved performance across different model sizes and reducing dependence on hyper-scaled LLMs.
Abstract: Text-to-SQL, which translates a natural language question into an SQL query, has advanced with in-context learning of Large Language Models (LLMs). However, existing methods show little improvement in performance compared to randomly chosen demonstrations, and significant performance drops when smaller LLMs (e.g., Llama 3.1-8B) are used. This indicates that these methods heavily rely on the intrinsic capabilities of hyper-scaled LLMs, rather than effectively retrieving useful demonstrations. In this paper, we propose a novel approach for effectively retrieving demonstrations and generating SQL queries. We construct a Deep Contextual Schema Link Graph, which contains key information and semantic relationship between a question and its database schema items. This graph-based structure enables effective representation of Text-to-SQL samples and retrieval of useful demonstrations for in-context learning. Experimental results on the Spider benchmark demonstrate the effectiveness of our approach, showing consistent improvements in SQL generation performance and efficiency across both hyper-scaled LLMs and small LLMs. The code is available at https://github.com/jjklle/DCG-SQL.
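A toy version of demonstration retrieval over question-schema links: each sample is reduced to a set of (token, schema-item) links, and candidate demonstrations are ranked by link overlap. Jaccard similarity is an illustrative stand-in for the paper's richer graph encoding.

```python
# Hypothetical link extraction: pair question tokens with schema items they
# appear in; a real schema linker would be far more sophisticated.
def links(question_tokens, schema_items):
    return {(t, s) for t in question_tokens for s in schema_items if t in s}

new = links({"singer", "age"}, {"singer.name", "singer.age", "concert.year"})
demos = {
    "d1": links({"singer", "name"}, {"singer.name", "singer.age"}),
    "d2": links({"concert", "year"}, {"concert.year", "stadium.name"}),
}

def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0

best = max(demos, key=lambda d: jaccard(new, demos[d]))
print(best)  # d1 shares singer-related links with the new question
```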
[279] The Unified Cognitive Consciousness Theory for Language Models: Anchoring Semantics, Thresholds of Activation, and Emergent Reasoning
Edward Y. Chang, Zeyneb N. Kaya, Ethan Chang
Main category: cs.AI
TL;DR: The paper proposes Unified Cognitive Consciousness Theory (UCCT), which views LLMs as unconscious pattern repositories that require external mechanisms (few-shot prompting, RAG, fine-tuning, multi-agent reasoning) to achieve intelligent behavior through semantic anchoring, with a Bayesian formulation predicting 1/sqrt(n) scaling for capability transitions.
Details
Motivation: Large language models lack explicit reasoning, semantic grounding, and goal-directed intelligence despite containing vast latent patterns. There's a need for a unified theoretical framework to explain how various techniques transform LLMs from unconscious substrates into intelligent systems.
Method: The authors develop UCCT using a Bayesian formulation to model the semantic anchoring process. They unify disparate techniques (few-shot prompting, RAG, fine-tuning, multi-agent reasoning) as special cases of a general anchoring architecture. The theory predicts threshold-crossing dynamics with 1/sqrt(n) scaling for capability transitions.
Result: UCCT successfully explains sudden capability transitions observed across diverse tasks through its threshold-crossing dynamic. Case studies in simple math, visual recognition, and structured debate tasks confirm the theory’s predictive power. Arithmetic experiments across three numeral systems validate UCCT’s theoretical predictions.
Conclusion: LLMs are fundamentally unconscious pattern repositories without inherent intelligence. Intelligence emerges only when external anchoring mechanisms assign target semantics to latent patterns, transforming unconscious representations into conscious, goal-directed capabilities. UCCT provides a unified framework for understanding how various techniques enable this transformation.
Abstract: Large language models (LLMs) are vast repositories of latent patterns, but without structured guidance, they lack explicit reasoning, semantic grounding, and goal-directed intelligence. We propose Unified Cognitive Consciousness Theory (UCCT), a unified model that reinterprets LLMs as unconscious substrates requiring external mechanisms (few-shot prompting, RAG, fine-tuning, and multi-agent reasoning) to semantically anchor latent representations. UCCT formalizes this anchoring process through a Bayesian formulation, revealing a threshold-crossing dynamic characterized by 1/sqrt(n) scaling that explains the sudden capability transitions observed across diverse tasks. The theory unifies these previously disparate techniques as special cases of a general anchoring architecture. Through case studies in simple math, visual recognition, and structured debate tasks, we confirm the predictive power of UCCT. Furthermore, our arithmetic experiments across three numeral systems validate UCCT's theoretical predictions. Rather than treating intelligence as an intrinsic property of LLMs, UCCT demonstrates that LLMs are merely unconscious pattern repositories with no inherent intelligence. Intelligence emerges only when external anchoring mechanisms assign target semantics to these latent patterns, transforming unconscious representations into conscious, goal-directed capabilities.
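A worked illustration of the 1/sqrt(n) claim: if anchoring uncertainty shrinks like c/sqrt(n) with n demonstrations, capability "switches on" once it drops below a task threshold. The constant c and the threshold below are arbitrary illustrative values, not UCCT's calibrated parameters.

```python
import math

# Illustrative threshold-crossing: uncertainty = c / sqrt(n).
c, threshold = 1.0, 0.25
for n in (1, 4, 16, 64):
    uncertainty = c / math.sqrt(n)
    print(n, round(uncertainty, 3), "anchored" if uncertainty < threshold else "not yet")
# n=16 gives exactly 0.25 (not strictly below); the crossing appears at n=64 here.
```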
[280] Hierarchical Reasoning Model
Guan Wang, Jin Li, Yuhao Sun, Xing Chen, Changling Liu, Yue Wu, Meng Lu, Sen Song, Yasin Abbasi Yadkori
Main category: cs.AI
TL;DR: The paper introduces Hierarchical Reasoning Model (HRM), a 27M parameter recurrent architecture inspired by human brain processing that achieves exceptional reasoning performance using only 1000 training samples, without requiring pre-training or Chain-of-Thought data.
Details
Motivation: Current large language models using Chain-of-Thought techniques suffer from brittle task decomposition, extensive data requirements, and high latency. There's a need for more efficient reasoning systems that can handle complex goal-oriented action sequences with better computational efficiency and training stability.Method: HRM uses a novel recurrent architecture with two interdependent modules: a high-level module for slow, abstract planning and a low-level module for rapid, detailed computations. The model executes sequential reasoning in a single forward pass without explicit supervision of intermediate processes, inspired by hierarchical and multi-timescale processing in the human brain.
Result: With only 27 million parameters and 1000 training samples, HRM achieves nearly perfect performance on complex Sudoku puzzles and optimal path finding in large mazes. It outperforms much larger models with longer context windows on the Abstraction and Reasoning Corpus (ARC) benchmark, demonstrating superior efficiency and effectiveness.
Conclusion: HRM represents a transformative advancement toward universal computation and general-purpose reasoning systems, showing that brain-inspired hierarchical architectures can achieve exceptional reasoning capabilities with minimal computational resources and training data compared to current large language models.
Abstract: Reasoning, the process of devising and executing complex goal-oriented action sequences, remains a critical challenge in AI. Current large language models (LLMs) primarily employ Chain-of-Thought (CoT) techniques, which suffer from brittle task decomposition, extensive data requirements, and high latency. Inspired by the hierarchical and multi-timescale processing in the human brain, we propose the Hierarchical Reasoning Model (HRM), a novel recurrent architecture that attains significant computational depth while maintaining both training stability and efficiency. HRM executes sequential reasoning tasks in a single forward pass without explicit supervision of the intermediate process, through two interdependent recurrent modules: a high-level module responsible for slow, abstract planning, and a low-level module handling rapid, detailed computations. With only 27 million parameters, HRM achieves exceptional performance on complex reasoning tasks using only 1000 training samples. The model operates without pre-training or CoT data, yet achieves nearly perfect performance on challenging tasks including complex Sudoku puzzles and optimal path finding in large mazes. Furthermore, HRM outperforms much larger models with significantly longer context windows on the Abstraction and Reasoning Corpus (ARC), a key benchmark for measuring artificial general intelligence capabilities. These results underscore HRM’s potential as a transformative advancement toward universal computation and general-purpose reasoning systems.
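A minimal sketch of the two-timescale idea follows: a fast low-level cell runs several micro-steps per single slow high-level update, all inside one forward pass. The module sizes, the GRUCell choice, and the one-slow-per-k-fast schedule are assumptions for illustration, not HRM's actual architecture.

```python
# Sketch of a two-timescale recurrent core in the spirit of HRM (assumed design).
import torch
import torch.nn as nn

class TwoTimescaleCore(nn.Module):
    def __init__(self, d_in=64, d_low=128, d_high=128, fast_steps=4):
        super().__init__()
        self.fast_steps = fast_steps
        self.low = nn.GRUCell(d_in + d_high, d_low)    # rapid, detailed computation
        self.high = nn.GRUCell(d_low, d_high)          # slow, abstract planning
        self.readout = nn.Linear(d_low, d_in)

    def forward(self, x, outer_steps=8):
        B = x.size(0)
        h_low = x.new_zeros(B, self.low.hidden_size)
        h_high = x.new_zeros(B, self.high.hidden_size)
        for _ in range(outer_steps):                   # one slow update per cycle
            for _ in range(self.fast_steps):           # several fast micro-steps
                h_low = self.low(torch.cat([x, h_high], dim=-1), h_low)
            h_high = self.high(h_low, h_high)
        return self.readout(h_low)

y = TwoTimescaleCore()(torch.randn(2, 64))  # one forward pass, no CoT supervision
```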
[281] An Integrated Framework of Prompt Engineering and Multidimensional Knowledge Graphs for Legal Dispute Analysis
Mingda Zhang, Na Zhao, Jianglong Qing, Qing Xu, Kaiwen Pan, Ting Luo

Main category: cs.AI
TL;DR: A framework combining hierarchical prompt engineering with multidimensional knowledge graphs to enhance LLMs’ legal dispute analysis, achieving significant improvements in sensitivity (9.9%-13.8%), specificity (4.8%-6.7%), and citation accuracy (22.4%-39.7%).
Details
Motivation: Current LLMs lack sufficient capability in legal dispute analysis and understanding of judicial logic, requiring better integration of legal knowledge and reasoning structures to improve performance in intelligent legal assistance systems.Method: A framework with two main components: (1) three-stage hierarchical prompt structure including task definition, knowledge background, and reasoning guidance; (2) three-layer knowledge graph with legal ontology, representation, and instance layers; supported by four legal concept retrieval methods: direct code matching, semantic vector similarity, ontology path reasoning, and lexical segmentation.
Result: Extensive testing demonstrated major performance improvements: sensitivity increased by 9.9%-13.8%, specificity increased by 4.8%-6.7%, and citation accuracy improved by 22.4%-39.7% compared to baseline approaches.
Conclusion: The proposed framework successfully enhances LLMs’ legal analysis capabilities and judicial logic understanding, providing a new technical approach for developing intelligent legal assistance systems with improved accuracy and reliability.
Abstract: This research presents a framework combining prompt engineering with multidimensional knowledge graphs to improve LLMs’ legal dispute analysis. Specifically, the framework includes a three-stage hierarchical prompt structure (task definition, knowledge background, reasoning guidance) along with a three-layer knowledge graph (legal ontology, representation, instance layers). Additionally, four supporting methods enable precise legal concept retrieval: direct code matching, semantic vector similarity, ontology path reasoning, and lexical segmentation. Through extensive testing, results show major improvements: sensitivity increased by 9.9%-13.8%, specificity by 4.8%-6.7%, and citation accuracy by 22.4%-39.7%. As a result, the framework provides better legal analysis and understanding of judicial logic, thus offering a new technical method for intelligent legal assistance systems.
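As a concrete illustration of the three-stage prompt structure, here is a minimal assembly sketch; the section wording and the retrieved-fact format are placeholders, not the paper's actual templates.

```python
# Illustrative assembly of the three-stage hierarchical prompt (placeholder text).
def build_legal_prompt(dispute: str, kg_facts: list[str]) -> str:
    task_definition = (
        "You are a legal analyst. Classify the dispute type and identify "
        "the applicable statutes."
    )
    knowledge_background = "Relevant legal knowledge:\n" + "\n".join(
        f"- {fact}" for fact in kg_facts
    )
    reasoning_guidance = (
        "Reason step by step: (1) extract the disputed facts, (2) map them to "
        "legal concepts, (3) cite the governing provisions before concluding."
    )
    return "\n\n".join([task_definition, knowledge_background,
                        reasoning_guidance, f"Dispute:\n{dispute}"])

prompt = build_legal_prompt(
    "Tenant withheld rent after landlord failed to repair heating.",
    ["Lease obligations: landlord must maintain habitability "
     "(ontology path: contract -> lease -> habitability)"],
)
```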
[282] A Multi-granularity Concept Sparse Activation and Hierarchical Knowledge Graph Fusion Framework for Rare Disease Diagnosis
Mingda Zhang, Na Zhao, Jianglong Qin, Guoyu Ye, Ruixiang Tang
Main category: cs.AI
TL;DR: A framework combining multi-granularity sparse activation with hierarchical knowledge graphs improves rare disease diagnosis in medical large language models, achieving 0.92 diagnostic accuracy and surpassing clinical thresholds.
Details
Motivation: Rare disease diagnosis remains challenging for medical large language models due to insufficient knowledge representation, limited concept understanding, and constrained clinical reasoning capabilities.Method: The framework combines multi-granularity sparse activation with hierarchical knowledge graphs, employing four complementary matching algorithms with diversity control, a five-level fallback strategy for concept activation, and a three-layer knowledge graph structure (taxonomy, clinical features, instances) for structured context.
Result: Significant improvements on BioASQ rare disease dataset: BLEU scores increased by up to 0.13, ROUGE by up to 0.10, diagnostic accuracy by up to 0.25, with the best model achieving 0.92 accuracy (surpassing the 0.90 clinical threshold). Expert evaluation confirmed enhancements in information quality, reasoning, and professional expression.
Conclusion: The framework shows promise in reducing the diagnostic odyssey for rare disease patients by significantly improving diagnostic accuracy and clinical reasoning capabilities of medical large language models.
Abstract: Rare disease diagnosis remains challenging for medical large language models due to insufficient knowledge representation, limited concept understanding, and constrained clinical reasoning. We propose a framework combining multi-granularity sparse activation with hierarchical knowledge graphs. Our approach employs four complementary matching algorithms with diversity control and a five-level fallback strategy for precise concept activation. A three-layer knowledge graph (taxonomy, clinical features, instances) provides structured, up-to-date context. Experiments on the BioASQ rare disease dataset demonstrate significant improvements: BLEU scores increased by up to 0.13, ROUGE by up to 0.10, and diagnostic accuracy by up to 0.25, with the best model achieving 0.92 accuracy, surpassing the 0.90 clinical threshold. Expert evaluation confirms enhancements in information quality, reasoning, and professional expression. Our framework shows promise in reducing the diagnostic odyssey for rare disease patients.
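The fallback strategy can be pictured as a priority chain: matchers are tried in order and the first non-empty result wins. The abstract does not enumerate the five levels, so the levels below are placeholders for illustration only.

```python
# Sketch of a multi-level fallback for concept activation (placeholder levels).
from typing import Callable, List

Matcher = Callable[[str], List[str]]

def activate_concepts(query: str, levels: List[Matcher]) -> List[str]:
    for match in levels:                 # level 1 .. level 5, strictest first
        concepts = match(query)
        if concepts:
            return concepts
    return []                            # nothing activated at any level

levels = [
    lambda q: [q] if q in {"marfan syndrome"} else [],  # 1: exact code/name match
    lambda q: [],                                       # 2: synonym table (stub)
    lambda q: [],                                       # 3: embedding similarity (stub)
    lambda q: [],                                       # 4: ontology-path expansion (stub)
    lambda q: ["unknown rare disease"],                 # 5: default fallback
]
print(activate_concepts("marfan syndrome", levels))
```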
[283] Assessing Adaptive World Models in Machines with Novel Games
Lance Ying, Katherine M. Collins, Prafull Sharma, Cedric Colas, Kaiya Ivy Zhao, Adrian Weller, Zenna Tavares, Phillip Isola, Samuel J. Gershman, Jacob D. Andreas, Thomas L. Griffiths, Francois Chollet, Kelsey R. Allen, Joshua B. Tenenbaum
Main category: cs.AI
TL;DR: This paper proposes a new evaluation framework for AI world models based on “novel games” that test rapid adaptation and world model induction, arguing that current AI evaluation focuses too narrowly on static representations rather than efficient learning through interaction.
Details
Motivation: Current AI world model evaluation is limited, focusing on static representations learned from massive datasets rather than the efficiency and efficacy of learning representations through interaction and exploration in novel environments. Human intelligence shows remarkable rapid adaptation through efficient world model construction, but AI lacks proper evaluation frameworks for this capability.Method: The authors propose a new benchmarking paradigm based on “novel games” - carefully designed game suites with genuine, deep, and continually refreshing novelty in underlying structures. They draw insights from cognitive science research on human learning and adaptation, and define key desiderata for constructing these games along with appropriate metrics to evaluate rapid world model induction.
Result: The paper presents a theoretical framework and evaluation methodology rather than empirical results. It provides a perspective on world model induction inspired by cognitive science and outlines the design principles for novel games that can properly assess adaptive world models in AI systems.
Conclusion: The proposed evaluation framework using novel games represents a crucial step toward developing AI systems with human-like rapid adaptation and robust generalization capabilities. This new approach to evaluating world models could inspire future research and contribute to progress toward artificial general intelligence by focusing on dynamic learning rather than static representation capabilities.
Abstract: Human intelligence exhibits a remarkable capacity for rapid adaptation and effective problem-solving in novel and unfamiliar contexts. We argue that this profound adaptability is fundamentally linked to the efficient construction and refinement of internal representations of the environment, commonly referred to as world models, and we refer to this adaptation mechanism as world model induction. However, current understanding and evaluation of world models in artificial intelligence (AI) remains narrow, often focusing on static representations learned from training on massive corpora of data, instead of the efficiency and efficacy in learning these representations through interaction and exploration within a novel environment. In this Perspective, we provide a view of world model induction drawing on decades of research in cognitive science on how humans learn and adapt so efficiently; we then call for a new evaluation framework for assessing adaptive world models in AI. Concretely, we propose a new benchmarking paradigm based on suites of carefully designed games with genuine, deep and continually refreshing novelty in the underlying game structures – we refer to this class of games as novel games. We detail key desiderata for constructing these games and propose appropriate metrics to explicitly challenge and evaluate the agent’s ability for rapid world model induction. We hope that this new evaluation framework will inspire future evaluation efforts on world models in AI and provide a crucial step towards developing AI systems capable of human-like rapid adaptation and robust generalization – a critical component of artificial general intelligence.
[284] CUDA-L1: Improving CUDA Optimization via Contrastive Reinforcement Learning
Xiaoya Li, Xiaofei Sun, Albert Wang, Jiwei Li, Chris Shum
Main category: cs.AI
TL;DR: CUDA-L1 is a reinforcement learning framework that automatically optimizes CUDA code, achieving an average 17.7x speedup across 250 kernels on A100 GPUs and demonstrating excellent portability across different GPU architectures without requiring human expertise.
Details
Motivation: The exponential growth in GPU computing demand driven by Large Language Models has created an urgent need for automated CUDA optimization strategies, as current state-of-the-art models achieve low success rates in improving CUDA performance.Method: The paper introduces CUDA-L1, an automated reinforcement learning framework that uses speedup-based reward signals to train LLMs for CUDA optimization without requiring human expertise or domain knowledge.
Result: CUDA-L1 achieves an average speedup of 17.7x across all 250 CUDA kernels of KernelBench on A100, with peak speedups reaching 449x. It demonstrates excellent cross-architecture portability with speedups of 17.8x on H100, 19.0x on RTX 3090, 16.5x on L40, 14.7x on H800, and 13.9x on H20.
Conclusion: CUDA-L1 demonstrates that reinforcement learning can transform poor-performing LLMs into effective CUDA optimizers through speedup-based rewards alone, opening possibilities for automated CUDA optimization and substantially improving GPU efficiency to alleviate computing resource pressure.
Abstract: The exponential growth in demand for GPU computing resources, driven by the rapid advancement of Large Language Models, has created an urgent need for automated CUDA optimization strategies. While recent advances in LLMs show promise for code generation, current SOTA models (e.g. R1, o1) achieve low success rates in improving CUDA speed. In this paper, we introduce CUDA-L1, an automated reinforcement learning framework for CUDA optimization. CUDA-L1 achieves performance improvements on the CUDA optimization task: trained on NVIDIA A100, it delivers an average speedup of x17.7 across all 250 CUDA kernels of KernelBench, with peak speedups reaching x449. Furthermore, the model also demonstrates excellent portability across GPU architectures, achieving average speedups of x17.8 on H100, x19.0 on RTX 3090, x16.5 on L40, x14.7 on H800, and x13.9 on H20 despite being optimized specifically for A100. Beyond these benchmark results, CUDA-L1 demonstrates several remarkable properties: 1) Discovers a variety of CUDA optimization techniques and learns to combine them strategically to achieve optimal performance; 2) Uncovers fundamental principles of CUDA optimization; 3) Identifies non-obvious performance bottlenecks and rejects seemingly beneficial optimizations that harm performance. The capabilities of CUDA-L1 demonstrate that reinforcement learning can transform an initially poor-performing LLM into an effective CUDA optimizer through speedup-based reward signals alone, without human expertise or domain knowledge. More importantly, the trained RL model extends its acquired reasoning abilities to new kernels. This paradigm opens possibilities for automated optimization of CUDA operations, and holds promise for substantially improving GPU efficiency and alleviating the rising pressure on GPU computing resources.
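A speedup-based reward of the kind the abstract describes can be sketched as follows: time a reference and a candidate kernel, and reward the ratio, with incorrect or failing candidates earning nothing. The `run` harness and `outputs_match` check are assumed stand-ins, not CUDA-L1's actual training infrastructure.

```python
# Sketch of a speedup-based RL reward: measured speedup is the only signal.
import time

def timed(fn, *args, repeats=10) -> float:
    """Best-of-N wall-clock time for one call."""
    best = float("inf")
    for _ in range(repeats):
        t0 = time.perf_counter()
        fn(*args)
        best = min(best, time.perf_counter() - t0)
    return best

def speedup_reward(reference_fn, candidate_fn, inputs, outputs_match) -> float:
    try:
        if not outputs_match(reference_fn, candidate_fn, inputs):
            return 0.0                    # incorrect kernels earn no reward
        ref = timed(reference_fn, *inputs)
        cand = timed(candidate_fn, *inputs)
        return ref / cand                 # >1 means the candidate is faster
    except Exception:
        return 0.0                        # compilation/runtime failure
```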
[285] Routine: A Structural Planning Framework for LLM Agent System in Enterprise
Guancheng Zeng, Xueyi Chen, Jiawang Hu, Shaohua Qi, Yaxuan Mao, Zhantao Wang, Yifan Nie, Shuang Li, Qiuyang Feng, Pengxu Qiu, Yujia Wang, Wenqiang Han, Linyan Huang, Gang Li, Jingjing Mo, Haowen Hu
Main category: cs.AI
TL;DR: This paper introduces Routine, a multi-step agent planning framework that significantly improves tool-calling accuracy in enterprise environments by providing structured execution guidance, achieving 96.3% accuracy with GPT-4o and 83.3% with Qwen3-14B.
Details
Motivation: Enterprise agent systems face challenges including lack of domain-specific process knowledge, disorganized plans, missing key tools, and poor execution stability, which hinder their deployment in real-world business environments.Method: The authors developed Routine, a multi-step agent planning framework with clear structure, explicit instructions, and seamless parameter passing. They also created a Routine-following training dataset for fine-tuning and employed Routine-based distillation to generate scenario-specific tool-calling datasets.
Result: Routine dramatically improved execution accuracy: GPT-4o increased from 41.1% to 96.3%, and Qwen3-14B from 32.6% to 83.3%. Fine-tuning Qwen3-14B on the Routine dataset achieved 88.2% accuracy, while distillation-based fine-tuning reached 95.5% accuracy, approaching GPT-4o performance.
Conclusion: Routine provides an effective approach for building stable agent workflows in enterprise environments, successfully distilling domain-specific tool-usage patterns and enhancing model adaptability, thereby accelerating agent system deployment and advancing AI for Process automation.
Abstract: The deployment of agent systems in an enterprise environment is often hindered by several challenges: common models lack domain-specific process knowledge, leading to disorganized plans, missing key tools, and poor execution stability. To address this, this paper introduces Routine, a multi-step agent planning framework designed with a clear structure, explicit instructions, and seamless parameter passing to guide the agent’s execution module in performing multi-step tool-calling tasks with high stability. In evaluations conducted within a real-world enterprise scenario, Routine significantly increases the execution accuracy in model tool calls, increasing the performance of GPT-4o from 41.1% to 96.3%, and Qwen3-14B from 32.6% to 83.3%. We further constructed a Routine-following training dataset and fine-tuned Qwen3-14B, resulting in an accuracy increase to 88.2% on scenario-specific evaluations, indicating improved adherence to execution plans. In addition, we employed Routine-based distillation to create a scenario-specific, multi-step tool-calling dataset. Fine-tuning on this distilled dataset raised the model’s accuracy to 95.5%, approaching GPT-4o’s performance. These results highlight Routine’s effectiveness in distilling domain-specific tool-usage patterns and enhancing model adaptability to new scenarios. Our experimental results demonstrate that Routine provides a practical and accessible approach to building stable agent workflows, accelerating the deployment and adoption of agent systems in enterprise environments, and advancing the technical vision of AI for Process.
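The structured-plan idea can be sketched as a step executor with explicit parameter passing: each step names a tool and its arguments, and later steps reference earlier outputs. The step schema and the `$step.field` reference syntax below are assumptions for illustration, not Routine's actual format.

```python
# Sketch of executing a structured multi-step plan with parameter passing.
def run_routine(steps, tools):
    """steps: [{'tool': name, 'args': {...}}]; '$k.field' pulls from step k's output."""
    outputs = []
    for step in steps:
        args = {
            key: (outputs[int(val[1:].split(".")[0])][val.split(".", 1)[1]]
                  if isinstance(val, str) and val.startswith("$") else val)
            for key, val in step["args"].items()
        }
        outputs.append(tools[step["tool"]](**args))
    return outputs[-1]

tools = {
    "lookup_order": lambda order_id: {"status": "shipped", "carrier": "DHL"},
    "draft_reply": lambda status, carrier: {
        "text": f"Your order is {status} via {carrier}."},
}
plan = [
    {"tool": "lookup_order", "args": {"order_id": "A-1042"}},
    {"tool": "draft_reply", "args": {"status": "$0.status", "carrier": "$0.carrier"}},
]
print(run_routine(plan, tools)["text"])
```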
[286] BioGraphFusion: Graph Knowledge Embedding for Biological Completion and Reasoning
Yitong Lin, Jiaying He, Jiahe Chen, Xinnan Zhu, Jianwei Zheng, Tao Bo
Main category: cs.AI
TL;DR: BioGraphFusion is a novel framework that combines semantic understanding and structural learning for biomedical knowledge graphs through tensor decomposition and LSTM-driven mechanisms, outperforming existing methods in drug discovery and disease understanding tasks.
Details
Motivation: Existing biomedical knowledge graph methods face limitations: Knowledge Embedding methods capture global semantics but struggle with dynamic structural integration, while Graph Neural Networks excel locally but lack semantic understanding. Even ensemble approaches fail to achieve deep, adaptive co-evolution between semantic comprehension and structural learning in complex biomedical KGs.Method: BioGraphFusion establishes a global semantic foundation via tensor decomposition and uses an LSTM-driven mechanism to dynamically refine relation embeddings during graph propagation. The framework incorporates query-guided subgraph construction and a hybrid scoring mechanism to foster adaptive interplay between semantic understanding and structural learning.
Result: BioGraphFusion demonstrates superior performance over state-of-the-art Knowledge Embedding, Graph Neural Network, and ensemble models across three key biomedical tasks. A case study on Cutaneous Malignant Melanoma 1 (CMM1) shows the framework’s ability to unveil biologically meaningful pathways.
Conclusion: BioGraphFusion successfully addresses the critical gap in biomedical knowledge graph completion by achieving deep synergistic integration of semantic and structural learning, providing a promising solution for drug discovery and disease understanding applications.
Abstract: Motivation: Biomedical knowledge graphs (KGs) are crucial for drug discovery and disease understanding, yet their completion and reasoning are challenging. Knowledge Embedding (KE) methods capture global semantics but struggle with dynamic structural integration, while Graph Neural Networks (GNNs) excel locally but often lack semantic understanding. Even ensemble approaches, including those leveraging language models, often fail to achieve a deep, adaptive, and synergistic co-evolution between semantic comprehension and structural learning. Addressing this critical gap in fostering continuous, reciprocal refinement between these two aspects in complex biomedical KGs is paramount. Results: We introduce BioGraphFusion, a novel framework for deeply synergistic semantic and structural learning. BioGraphFusion establishes a global semantic foundation via tensor decomposition, guiding an LSTM-driven mechanism to dynamically refine relation embeddings during graph propagation. This fosters adaptive interplay between semantic understanding and structural learning, further enhanced by query-guided subgraph construction and a hybrid scoring mechanism. Experiments across three key biomedical tasks demonstrate BioGraphFusion’s superior performance over state-of-the-art KE, GNN, and ensemble models. A case study on Cutaneous Malignant Melanoma 1 (CMM1) highlights its ability to unveil biologically meaningful pathways. Availability and Implementation: Source code and all training data are freely available for download at https://github.com/Y-TARL/BioGraphFusion. Supplementary information: Supplementary data are available at Bioinformatics online.
[287] Hierarchical Budget Policy Optimization for Adaptive Reasoning
Shangke Lyu, Linjuan Wu, Yuchen Yan, Xingyu Wu, Hao Li, Yongliang Shen, Peisheng Jiang, Weiming Lu, Jun Xiao, Yueting Zhuang
Main category: cs.AI
TL;DR: The paper presents HBPO, a reinforcement learning framework that teaches large reasoning models to adaptively adjust their reasoning depth based on problem complexity, achieving 60.6% reduction in token usage while improving accuracy by 3.14%.
Details
Motivation: Large reasoning models apply uniform reasoning strategies regardless of problem complexity, leading to significant computational inefficiency. The challenge is that efficiency-oriented training causes exploration space collapse, where penalties on long outputs bias models away from necessary long reasoning paths.Method: Hierarchical Budget Policy Optimization (HBPO) uses reinforcement learning with hierarchical budget exploration that partitions rollout samples into subgroups with distinct token budgets. It employs differentiated reward mechanisms that create budget-aware incentives aligned with problem complexity, enabling models to learn problem-specific reasoning depths.
Result: HBPO reduces average token usage by up to 60.6% while improving accuracy by 3.14% across four reasoning benchmarks. The method demonstrates emergent adaptive behavior where models automatically adjust reasoning depth based on problem complexity without external constraints or discrete mode selection.
Conclusion: Reasoning efficiency and capability are not inherently conflicting and can be simultaneously optimized through appropriately structured hierarchical training that preserves exploration diversity. The approach enables natural correspondences between task requirements and computational effort.
Abstract: Large reasoning models achieve remarkable performance through extensive chain-of-thought generation, yet exhibit significant computational inefficiency by applying uniform reasoning strategies regardless of problem complexity. We present Hierarchical Budget Policy Optimization (HBPO), a reinforcement learning framework that enables models to learn problem-specific reasoning depths without sacrificing capability. HBPO addresses the fundamental challenge of exploration space collapse in efficiency-oriented training, where penalties on long output length systematically bias models away from necessary long reasoning paths. Through hierarchical budget exploration, our approach partitions rollout samples into multiple subgroups with distinct token budgets, aiming to enable efficient resource allocation while preventing degradation of capability. We introduce differentiated reward mechanisms that create budget-aware incentives aligned with the complexity of the problem, allowing models to discover natural correspondences between task requirements and computational effort. Extensive experiments demonstrate that HBPO reduces average token usage by up to 60.6% while improving accuracy by 3.14% across four reasoning benchmarks. Unlike existing methods that impose external constraints or rely on discrete mode selection, HBPO exhibits emergent adaptive behavior where models automatically adjust reasoning depth based on problem complexity. Our results suggest that reasoning efficiency and capability are not inherently conflicting, and can be simultaneously optimized through appropriately structured hierarchical training that preserves exploration diversity.
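The budget-partitioning idea can be sketched as follows: rollouts are split round-robin into subgroups with distinct token budgets, and each subgroup's reward penalizes only tokens beyond its own budget, so long reasoning paths survive in the large-budget subgroups. The linear penalty below is an assumed form, not HBPO's exact reward.

```python
# Sketch of hierarchical budget exploration with a budget-aware reward.
BUDGETS = [512, 1024, 2048, 4096]   # one token budget per subgroup

def budget_reward(correct: bool, tokens_used: int, budget: int,
                  penalty: float = 1e-4) -> float:
    base = 1.0 if correct else 0.0
    overflow = max(0, tokens_used - budget)
    return base - penalty * overflow   # only over-budget tokens are penalized

def assign_subgroups(rollouts):
    """Round-robin rollouts into budget subgroups so every budget stays explored."""
    return [(r, BUDGETS[i % len(BUDGETS)]) for i, r in enumerate(rollouts)]

# A correct 1,500-token solution scores higher in the 2,048 subgroup than in
# the 512 subgroup, preserving long reasoning paths during training.
print(budget_reward(True, 1500, 2048), budget_reward(True, 1500, 512))
```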
[288] Gemini 2.5 Pro Capable of Winning Gold at IMO 2025
Yichen Huang, Lin F. Yang
Main category: cs.AI
TL;DR: This paper demonstrates that Google’s Gemini 2.5 Pro can solve 5 out of 6 problems from the newly released IMO 2025 using a self-verification pipeline with careful prompt design, showing significant progress in LLM mathematical reasoning capabilities.
Details
Motivation: Large Language Models struggle with International Mathematical Olympiad (IMO) problems despite performing well on other mathematical benchmarks like AIME. The motivation is to explore whether advanced LLMs can tackle these uniquely challenging problems that require deep insight, creativity, and formal reasoning.Method: The researchers used Google’s Gemini 2.5 Pro with a self-verification pipeline and careful prompt design on newly released IMO 2025 problems to avoid data contamination issues.
Result: The model successfully solved 5 out of 6 IMO 2025 problems correctly (with some caveats mentioned), demonstrating substantial improvement in handling Olympiad-level mathematical reasoning tasks.
Conclusion: The results highlight the importance of developing optimal strategies and methodologies to fully harness the potential of powerful LLMs for complex mathematical reasoning tasks, showing promising progress in AI mathematical problem-solving capabilities.
Abstract: The International Mathematical Olympiad (IMO) poses uniquely challenging problems requiring deep insight, creativity, and formal reasoning. While Large Language Models (LLMs) perform well on mathematical benchmarks like AIME, they struggle with Olympiad-level tasks. We use Google’s Gemini 2.5 Pro on the newly released IMO 2025 problems, avoiding data contamination. Using a self-verification pipeline with careful prompt design, 5 (out of 6) problems are solved correctly (up to a caveat discussed below). This result underscores the importance of developing optimal strategies to harness the full potential of powerful LLMs for complex reasoning tasks.
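The self-verification pipeline reduces to a draft-verify-retry loop; here is a minimal sketch in which `generate` and `verify` are stand-ins for the model calls, not the authors' actual prompt designs.

```python
# Sketch of a self-verification loop: retry until the verifier passes.
def solve_with_verification(problem: str, generate, verify, max_attempts: int = 5):
    feedback = ""
    for _ in range(max_attempts):
        solution = generate(problem, feedback)
        ok, critique = verify(problem, solution)   # (passed?, issues found)
        if ok:
            return solution
        feedback = f"Previous attempt failed verification: {critique}"
    return None   # no solution survived verification
```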
cs.SD
[289] Nonlinear Framework for Speech Bandwidth Extension
Tarikul Islam Tamiti, Nursad Mamun, Anomadarshi Barua
Main category: cs.SD
TL;DR: NDSI-BWE introduces a novel adversarial bandwidth extension framework with seven specialized discriminators inspired by nonlinear dynamical systems and a complex-valued ConformerNeXt generator, achieving state-of-the-art performance with 8x parameter reduction through depth-wise convolution.
Details
Motivation: High-frequency components are often lost due to bandwidth constraints in applications like telecommunications and high-fidelity audio systems with limited resources, creating a need for effective bandwidth extension methods to recover these crucial frequency components.Method: The paper proposes NDSI-BWE, an adversarial framework featuring seven discriminators inspired by nonlinear dynamical systems (MRLD for chaos detection, MS-RD for recurrence dynamics, MSDFA for scale-invariant relationships, MR-PPD for latent space relationships, MPD for cyclical patterns, MRAD and MRPD for amplitude-phase statistics) combined with a complex-valued ConformerNeXt generator using dual-stream Lattice-Net architecture and depth-wise convolution for parameter efficiency.
Result: NDSI-BWE achieves state-of-the-art performance in bandwidth extension across six objective evaluation metrics and subjective tests with five human judges, while reducing parameters by eight times compared to existing methods through the use of depth-wise convolution.
Conclusion: The integration of multiple discriminators inspired by nonlinear dynamical systems with an efficient ConformerNeXt-based generator successfully establishes a new state-of-the-art in bandwidth extension, demonstrating that combining diverse temporal behavior modeling with parameter-efficient architectures can significantly improve high-frequency component recovery.
Abstract: Recovering high-frequency components lost to bandwidth constraints is crucial for applications ranging from telecommunications to high-fidelity audio on limited resources. We introduce NDSI-BWE, a new adversarial Band Width Extension (BWE) framework that leverages seven discriminators, including four new ones inspired by nonlinear dynamical systems, to capture diverse temporal behaviors: a Multi-Resolution Lyapunov Discriminator (MRLD) for determining sensitivity to initial conditions by capturing deterministic chaos, a Multi-Scale Recurrence Discriminator (MS-RD) for self-similar recurrence dynamics, a Multi-Scale Detrended Fractal Analysis Discriminator (MSDFA) for long-range, slowly varying scale-invariant relationships, a Multi-Resolution Poincaré Plot Discriminator (MR-PPD) for capturing hidden latent-space relationships, a Multi-Period Discriminator (MPD) for cyclical patterns, and a Multi-Resolution Amplitude Discriminator (MRAD) and Multi-Resolution Phase Discriminator (MRPD) for capturing intricate amplitude-phase transition statistics. By using depth-wise convolution at the core of the convolutional block within each discriminator, NDSI-BWE attains an eight-times parameter reduction. These seven discriminators guide a complex-valued ConformerNeXt-based generator with a dual-stream Lattice-Net-based architecture for simultaneous refinement of magnitude and phase. The generator leverages the Transformer-based Conformer’s global dependency modeling and the ConvNeXt block’s local temporal modeling capability. Across six objective evaluation metrics and subjective tests comprising five human judges, NDSI-BWE establishes a new SoTA in BWE.

[290] A new XML conversion process for mensural music encoding : CMME_to_MEI (via Verovio)
David Fiala, Laurent Pugin, Marnix van Berchum, Martha Thomae, Kévin Roger
Main category: cs.SD
TL;DR: The Ricercar Lab made available 3500 XML files of 15th-century music encoded in CMME format and developed conversion tools to transform them into modern MEI standards, enabling better accessibility and use in contemporary music software.
Details
Motivation: The existing CMME-encoded corpus of 15th-century music manuscripts needed to be converted to more up-to-date MEI standards to improve accessibility and compatibility with modern music encoding tools, as the original CMME format and tools had not been updated since the 2000s.Method: A workshop was organized in Paris with experts in mensural music notation, XML formats, and programming. A converter was developed directly in the open-source Verovio library to convert CMME files to MEI mensural format, followed by implementation of conversion to MEI CMN for compatibility with common music software like MuseScore.
Result: Successfully developed conversion tools that allow CMME-XML files to be imported directly into Verovio and converted to MEI formats with minimal information loss. The converter enables loading of these historical music files in modern engraving software and provides a new pipeline for encoding and editing mensural music.
Conclusion: The conversion tools give new life to the existing CMME corpus by making it compatible with modern standards, while also providing a valuable pipeline for future mensural music encoding work, bridging the gap between historical music encoding formats and contemporary tools.
Abstract: The Ricercar Lab - the musicological research team at the Center for Advanced Studies in the Renaissance at the University of Tours - has decided to make available in open access, thanks to the support of the French digital infrastructure Biblissima, a large corpus of about 3500 XML files of 15th-c. music. This corpus was produced by the German musicologist Clemens Goldberg, who has encoded since 2010 the musical content of 34 major 15th-c. music manuscripts and other complementary files, in order to offer on his foundation’s website PDF files of complete collections of works by Du Fay, Binchois, Okeghem, Busnoys and most of their major contemporaries, focusing on their secular output. This corpus was encoded in an XML format named CMME (Computerized Mensural Music Editing), specifically conceived for mensural music by Theodor Dumitrescu in the 2000s, together with editorial and publication tools which have not been updated since then. This article focuses on the development of a set of conversion tools for these CMME files to meet more up-to-date standards of music encoding, namely MEI. A workshop was organised in September 2024 at the Campus Condorcet in Paris, gathering experts with a wide range of knowledge on mensural music notation, XML formats and programming. A converter was developed directly in the open-source rendering library Verovio, allowing the conversion from CMME to MEI mensural. A conversion to MEI CMN was implemented afterwards, enabling these files to be loaded in common engraving software such as MuseScore with minimal loss of information. With the availability of a direct import of CMME-XML into Verovio, the corpus of existing CMME files gets a new life. Furthermore, since the stand-alone CMME editor still works fine and no alternative is available yet for native MEI, the converter offers a new pipeline for encoding and editing mensural music.
[291] SDBench: A Comprehensive Benchmark Suite for Speaker Diarization
Eduardo Pacheco, Atila Orhon, Berkin Durmus, Blaise Munyampirwa, Andrey Leonov
Main category: cs.SD
TL;DR: This paper introduces SDBench, an open-source benchmark suite for speaker diarization that integrates 13 diverse datasets and enables consistent evaluation of different systems. The authors also present SpeakerKit, which achieves 9.6x speedup over Pyannote v3 with comparable accuracy.
Details
Motivation: Current speaker diarization systems show high variance in error rates across different datasets and domains, and comparing systems requires careful application of best practices for fair evaluation. There's a need for standardized benchmarking tools that enable reproducible and consistent evaluation of speaker diarization performance.Method: The authors developed SDBench, an open-source benchmark suite that integrates 13 diverse datasets with built-in tooling for consistent analysis. They also created SpeakerKit, an inference efficiency-focused system built on Pyannote v3, and conducted ablation studies to optimize performance. The benchmark evaluates 6 state-of-the-art systems including commercial APIs.
Result: SpeakerKit achieved 9.6x faster inference speed compared to Pyannote v3 while maintaining comparable error rates. The benchmark revealed important trade-offs between accuracy and speed across different state-of-the-art systems including Deepgram, AWS Transcribe, and Pyannote AI API.
Conclusion: SDBench provides a standardized platform for reproducible speaker diarization evaluation that can accommodate new systems over time. The benchmark successfully enabled rapid optimization of SpeakerKit and revealed performance trade-offs in existing systems, demonstrating its value for the speaker diarization research community.
Abstract: Even state-of-the-art speaker diarization systems exhibit high variance in error rates across different datasets, representing numerous use cases and domains. Furthermore, comparing across systems requires careful application of best practices such as dataset splits and metric definitions to allow for apples-to-apples comparison. We propose SDBench (Speaker Diarization Benchmark), an open-source benchmark suite that integrates 13 diverse datasets with built-in tooling for consistent and fine-grained analysis of speaker diarization performance for various on-device and server-side systems. SDBench enables reproducible evaluation and easy integration of new systems over time. To demonstrate the efficacy of SDBench, we built SpeakerKit, an inference efficiency-focused system built on top of Pyannote v3. SDBench enabled rapid execution of ablation studies that led to SpeakerKit being 9.6x faster than Pyannote v3 while achieving comparable error rates. We benchmark 6 state-of-the-art systems including Deepgram, AWS Transcribe, and Pyannote AI API, revealing important trade-offs between accuracy and speed.
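The kind of consistent, aggregated scoring SDBench provides can be sketched with pyannote.metrics' standard DER implementation; the dataset loop below is a placeholder and is not SDBench's actual tooling.

```python
# Sketch of consistent cross-file DER scoring using pyannote.metrics.
from pyannote.core import Annotation, Segment
from pyannote.metrics.diarization import DiarizationErrorRate

def score_system(pairs):
    """pairs: iterable of (reference, hypothesis) Annotations; returns overall DER."""
    metric = DiarizationErrorRate()
    for reference, hypothesis in pairs:
        metric(reference, hypothesis)     # accumulates error components per file
    return abs(metric)                    # aggregate DER over the whole set

ref, hyp = Annotation(), Annotation()
ref[Segment(0.0, 5.0)] = "spk_a"; ref[Segment(5.0, 9.0)] = "spk_b"
hyp[Segment(0.0, 4.5)] = "s1";    hyp[Segment(4.5, 9.0)] = "s2"
print(f"DER = {score_system([(ref, hyp)]):.3f}")
```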
[292] LABNet: A Lightweight Attentive Beamforming Network for Ad-hoc Multichannel Microphone Invariant Real-Time Speech Enhancement
Haoyin Yan, Jie Zhang, Chengqian Jiang, Shuang Zhang
Main category: cs.SD
TL;DR: The paper proposes LABNet, a lightweight attentive beamforming network for multichannel speech enhancement that maintains microphone invariance while achieving real-time performance with minimal computational overhead for edge devices.
Details
Motivation: The need for multichannel speech enhancement systems that can handle varying microphone numbers and array geometries (microphone invariance) while being computationally efficient enough for real-time edge-device applications, as traditional multichannel approaches increase computational burden.Method: A three-stage framework featuring efficient intra-channel modeling and inter-channel interaction, with a cross-channel attention module that selectively aggregates features from each channel to achieve lightweight attentive beamforming.
Result: LABNet achieves impressive speech enhancement performance while maintaining microphone invariance and requiring ultra-light resource overhead, demonstrating suitability for real-time applications.
Conclusion: The proposed LABNet successfully integrates microphone invariance into a low-complexity real-time speech enhancement system, showing great potential for ad-hoc array processing applications on edge devices.
Abstract: Multichannel speech enhancement (SE) aims to restore clean speech from noisy measurements by leveraging spatiotemporal signal features. In ad-hoc array conditions, microphone invariance (MI) requires systems to handle different microphone numbers and array geometries. From a practical perspective, multichannel recordings inevitably increase the computational burden for edge-device applications, highlighting the necessity of lightweight and efficient deployments. In this work, we propose a lightweight attentive beamforming network (LABNet) to integrate MI in a low-complexity real-time SE system. We design a three-stage framework for efficient intra-channel modeling and inter-channel interaction. A cross-channel attention module is developed to aggregate features from each channel selectively. Experimental results demonstrate our LABNet achieves impressive performance with ultra-light resource overhead while maintaining the MI, indicating great potential for ad-hoc array processing.
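The cross-channel attention idea can be sketched as a reference-channel query attending over per-channel features, which makes the module indifferent to the number of microphones. The single-head design and dimensions below are assumptions, not LABNet's exact module.

```python
# Minimal cross-channel attention sketch: one module handles any mic count.
import torch
import torch.nn.functional as F

def cross_channel_attend(feats: torch.Tensor) -> torch.Tensor:
    """feats: (batch, n_channels, time, dim) -> fused (batch, time, dim)."""
    B, C, T, D = feats.shape
    q = feats[:, 0]                                  # reference channel as query
    scores = torch.einsum("btd,bctd->btc", q, feats) / D ** 0.5
    weights = F.softmax(scores, dim=-1)              # soft selection over channels
    return torch.einsum("btc,bctd->btd", weights, feats)

fused4 = cross_channel_attend(torch.randn(2, 4, 100, 32))  # 4-mic array
fused7 = cross_channel_attend(torch.randn(2, 7, 100, 32))  # 7-mic array, same module
```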
[293] TTMBA: Towards Text To Multiple Sources Binaural Audio Generation
Yuxuan He, Xiaoran Yang, Ningning Pan, Gongping Huang
Main category: cs.SD
TL;DR: A cascaded method for generating multisource binaural audio from text that uses LLMs for temporal-spatial structuring, mono audio generation, and binaural rendering to create immersive spatial audio experiences.
Details
Motivation: Most existing text-to-audio generation methods only produce mono outputs, lacking essential spatial information needed for immersive auditory experiences with proper spatial positioning of sound sources.Method: A cascaded approach involving: (1) using a pretrained LLM to segment text into structured format with temporal and spatial details for each sound event, (2) generating multiple mono audios with varying durations using a pretrained mono audio generation network, (3) transforming mono audios to binaural using a neural binaural rendering network based on LLM spatial data, and (4) arranging binaural audios by start times to create final multisource binaural output.
Result: Experimental results show the proposed method achieves superior performance in both audio generation quality and spatial perceptual accuracy compared to existing approaches.
Conclusion: The cascaded TTMBA method successfully addresses the limitation of mono-only text-to-audio generation by incorporating spatial information and temporal control, demonstrating improved audio quality and spatial accuracy for immersive audio experiences.
Abstract: Most existing text-to-audio (TTA) generation methods produce mono outputs, neglecting essential spatial information for immersive auditory experiences. To address this issue, we propose a cascaded method for text-to-multisource binaural audio generation (TTMBA) with both temporal and spatial control. First, a pretrained large language model (LLM) segments the text into a structured format with time and spatial details for each sound event. Next, a pretrained mono audio generation network creates multiple mono audios with varying durations for each event. These mono audios are transformed into binaural audios using a binaural rendering neural network based on spatial data from the LLM. Finally, the binaural audios are arranged by their start times, resulting in multisource binaural audio. Experimental results demonstrate the superiority of the proposed method in terms of both audio generation quality and spatial perceptual accuracy.
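The final arrangement step amounts to overlap-adding each rendered binaural event into a stereo timeline at its LLM-assigned start time; here is a minimal numpy sketch with an assumed event format.

```python
# Sketch of the arrangement step: mix binaural events at their start times.
import numpy as np

def arrange_binaural(events, sr=16000):
    """events: list of (start_sec, stereo array of shape (n, 2)) -> mixed (N, 2)."""
    end = max(int(start * sr) + clip.shape[0] for start, clip in events)
    mix = np.zeros((end, 2), dtype=np.float32)
    for start, clip in events:
        s = int(start * sr)
        mix[s:s + clip.shape[0]] += clip            # overlapping events just sum
    peak = np.abs(mix).max()
    return mix / peak if peak > 1.0 else mix        # avoid clipping

mix = arrange_binaural([(0.0, np.random.randn(16000, 2) * 0.1),
                        (1.5, np.random.randn(8000, 2) * 0.1)])
```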
[294] LENS-DF: Deepfake Detection and Temporal Localization for Long-Form Noisy Speech
Xuechen Liu, Wanying Ge, Xin Wang, Junichi Yamagishi
Main category: cs.SD
TL;DR: LENS-DF is a comprehensive training and evaluation recipe for audio deepfake detection that generates realistic audio conditions (longer duration, noise, multiple speakers) and demonstrates superior performance compared to conventional training methods.
Details
Motivation: Existing audio deepfake detection methods lack robustness under realistic and complicated audio conditions such as longer duration, noisy environments, and multiple speakers, necessitating a more comprehensive training and evaluation approach.Method: LENS-DF recipe that generates controllable audio datasets with critical characteristics (longer duration, noisy conditions, multiple speakers) and uses self-supervised learning front-end with simple back-end models for detection and temporal localization.
Result: Models trained with LENS-DF consistently outperform those trained via conventional recipes, demonstrating improved robustness for audio deepfake detection and localization under realistic conditions.
Conclusion: LENS-DF provides an effective and useful framework for robust audio deepfake detection and localization, with ablation studies confirming the impact and relevance of the introduced variations to realistic field challenges.
Abstract: This study introduces LENS-DF, a novel and comprehensive recipe for training and evaluating audio deepfake detection and temporal localization under complicated and realistic audio conditions. The generation part of the recipe outputs audios from the input dataset with several critical characteristics, such as longer duration, noisy conditions, and containing multiple speakers, in a controllable fashion. We conduct experiments under the corresponding detection and localization protocol, using a self-supervised learning front-end and a simple back-end. The results indicate that models trained using data generated with LENS-DF consistently outperform those trained via conventional recipes, demonstrating the effectiveness and usefulness of LENS-DF for robust audio deepfake detection and localization. We also conduct ablation studies on the variations introduced, investigating their impact on and relevance to realistic challenges in the field.
[295] Robust Bioacoustic Detection via Richly Labelled Synthetic Soundscape Augmentation
Kaspar Soltero, Tadeu Siqueira, Stefanie Gutschmidt
Main category: cs.SD
TL;DR: This study presents a synthetic data framework that generates large volumes of labeled training data for bioacoustic detection models from minimal source material, significantly reducing manual labeling effort in Passive Acoustic Monitoring (PAM) analysis.
Details
Motivation: Passive Acoustic Monitoring (PAM) analysis is severely limited by the intensive manual effort required to create labeled training data, which creates a major bottleneck in computational bioacoustics and ecological assessment.Method: The framework synthesizes realistic soundscapes by combining clean background noise with isolated target vocalizations (little owl calls) and automatically generates dynamic labels like bounding boxes during the synthesis process. Models are then fine-tuned on this synthetic data.
Result: The model fine-tuned on synthetic data generalized well to real-world soundscapes and maintained high performance even when the diversity of source vocalizations was drastically reduced, indicating successful learning of generalized features without overfitting.
Conclusion: Synthetic data generation is a highly effective strategy for training robust bioacoustic detectors from small source datasets, significantly reducing manual labeling effort and overcoming key bottlenecks in computational bioacoustics while enhancing ecological assessment capabilities.
Abstract: Passive Acoustic Monitoring (PAM) analysis is often hindered by the intensive manual effort needed to create labelled training data. This study introduces a synthetic data framework to generate large volumes of richly labelled training data from very limited source material, improving the robustness of bioacoustic detection models. Our framework synthesises realistic soundscapes by combining clean background noise with isolated target vocalisations (little owl), automatically generating dynamic labels like bounding boxes during synthesis. A model fine-tuned on this data generalised well to real-world soundscapes, with performance remaining high even when the diversity of source vocalisations was drastically reduced, indicating the model learned generalised features without overfitting. This demonstrates that synthetic data generation is a highly effective strategy for training robust bioacoustic detectors from small source datasets. The approach significantly reduces manual labelling effort, overcoming a key bottleneck in computational bioacoustics and enhancing ecological assessment capabilities.
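The core synthesis trick is that labels come for free: isolated calls are placed into clean background noise at known offsets, so each placement directly yields a temporal annotation. The sketch below simplifies file handling and SNR control, which are assumptions.

```python
# Sketch of soundscape synthesis that emits (onset, offset, label) for free.
import numpy as np

def synthesize_soundscape(background, calls, sr=32000, rng=None):
    """background: 1-D array; calls: list of 1-D arrays. Returns (audio, labels)."""
    rng = rng or np.random.default_rng()
    scene = background.copy()
    labels = []
    for call in calls:
        start = int(rng.integers(0, len(scene) - len(call)))
        scene[start:start + len(call)] += call          # mix call into the scene
        labels.append((start / sr, (start + len(call)) / sr, "little_owl"))
    return scene, labels                                # labels cost no manual effort

audio, labels = synthesize_soundscape(
    np.random.randn(32000 * 60).astype(np.float32) * 0.01,   # 60 s of background
    [np.random.randn(32000).astype(np.float32) * 0.1 for _ in range(5)],
)
```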
[296] SALM: Spatial Audio Language Model with Structured Embeddings for Understanding and Editing
Jinbo Hu, Yin Cao, Ming Wu, Feiran Yang, Jun Yang
Main category: cs.SD
TL;DR: SALM is a novel framework that bridges spatial audio and language through multi-modal contrastive learning, enabling better understanding and editing of spatial acoustic environments using text-based controls.
Details
Motivation: Existing audio-language models struggle with processing spatial audio and perceiving spatial acoustic scenes, creating a need for better spatial audio understanding in AI systems.Method: SALM uses a dual-branch audio encoder with text encoder, decomposing spatial sound into semantic and spatial components through structured audio embeddings via multi-modal contrastive learning for seamless alignment of spatial and text representations.
Result: SALM effectively captures and aligns cross-modal representations, supports zero-shot direction classification, and enables advanced editing capabilities like altering directional audio using text-based embeddings.
Conclusion: SALM successfully bridges the gap between spatial audio and language understanding, providing robust support for spatial audio editing and demonstrating effective cross-modal representation alignment.
Abstract: Spatial audio understanding is essential for accurately perceiving and interpreting acoustic environments. However, existing audio-language models struggle with processing spatial audio and perceiving spatial acoustic scenes. We introduce the Spatial Audio Language Model (SALM), a novel framework that bridges spatial audio and language via multi-modal contrastive learning. SALM consists of a text encoder and a dual-branch audio encoder, decomposing spatial sound into semantic and spatial components through structured audio embeddings. Key features of SALM include seamless alignment of spatial and text representations, separate and joint extraction of spatial and semantic information, zero-shot direction classification and robust support for spatial audio editing. Experimental results demonstrate that SALM effectively captures and aligns cross-modal representations. Furthermore, it supports advanced editing capabilities, such as altering directional audio using text-based embeddings.
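The cross-modal alignment the abstract describes is typically trained with a symmetric contrastive objective; here is a CLIP-style InfoNCE sketch over paired spatial-audio and text embeddings, with the encoders themselves out of scope and the temperature value assumed.

```python
# Sketch of a symmetric audio<->text contrastive loss (CLIP-style).
import torch
import torch.nn.functional as F

def contrastive_loss(audio_emb, text_emb, temperature=0.07):
    """audio_emb, text_emb: (batch, dim); row i of each modality forms a pair."""
    a = F.normalize(audio_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = a @ t.T / temperature                     # cosine similarity matrix
    targets = torch.arange(a.size(0))
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.T, targets)) / 2    # symmetric in both directions

loss = contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
```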
[297] Detect Any Sound: Open-Vocabulary Sound Event Detection with Multi-Modal Queries
Pengfei Cai, Yan Song, Qing Gu, Nan Jiang, Haoyu Song, Ian McLoughlin
Main category: cs.SD
TL;DR: DASM is a query-based framework for open-vocabulary sound event detection that uses multi-modal queries (text/audio) and a dual-stream decoder to achieve better generalization to novel sound classes while maintaining localization accuracy.
Details
Motivation: Existing sound event detection algorithms are limited by closed-set assumptions and can only detect predefined classes. Recent zero-shot SED methods using audio-language models show poor performance due to lack of fine-grained alignment and cross-modal feature fusion.Method: DASM formulates SED as a frame-level retrieval task with query vectors from text/audio prompts. It uses a dual-stream decoder that separates event recognition (cross-modality event decoder) and temporal localization (context network), plus an inference-time attention masking strategy to leverage semantic relations between base and novel classes.
Result: On AudioSet Strong: outperforms CLAP-based methods by +7.8 PSDS in open-vocabulary setting and baseline by +6.9 PSDS in closed-set. On DESED cross-dataset evaluation: achieves 42.2 PSDS1, exceeding supervised CRNN baseline.
Conclusion: DASM successfully balances localization accuracy with generalization to novel classes, demonstrating superior performance in both open-vocabulary and closed-set sound event detection tasks compared to existing methods.
Abstract: Most existing sound event detection (SED) algorithms operate under a closed-set assumption, restricting their detection capabilities to predefined classes. While recent efforts have explored language-driven zero-shot SED by exploiting audio-language models, their performance is still far from satisfactory due to the lack of fine-grained alignment and cross-modal feature fusion. In this work, we propose the Detect Any Sound Model (DASM), a query-based framework for open-vocabulary SED guided by multi-modal queries. DASM formulates SED as a frame-level retrieval task, where audio features are matched against query vectors derived from text or audio prompts. To support this formulation, DASM introduces a dual-stream decoder that explicitly decouples event recognition and temporal localization: a cross-modality event decoder performs query-feature fusion and determines the presence of sound events at the clip level, while a context network models temporal dependencies for frame-level localization. Additionally, an inference-time attention masking strategy is proposed to leverage semantic relations between base and novel classes, substantially enhancing generalization to novel classes. Experiments on the AudioSet Strong dataset demonstrate that DASM effectively balances localization accuracy with generalization to novel classes, outperforming CLAP-based methods in the open-vocabulary setting (+7.8 PSDS) and the baseline in the closed-set setting (+6.9 PSDS). Furthermore, in cross-dataset zero-shot evaluation on DESED, DASM achieves a PSDS1 score of 42.2, even exceeding the supervised CRNN baseline. The project page is available at https://cai525.github.io/Transformer4SED/demo_page/DASM/.
[298] ReMi: A Random Recurrent Neural Network Approach to Music Production
Hugo Chateau-Laurent, Tara Vanhatalo, Wei-Tung Pan, Xavier Hinaut
Main category: cs.SD
TL;DR: This paper presents a data-free approach using randomly initialized recurrent neural networks to generate musical arpeggios and oscillations, offering a computationally efficient alternative to generative AI that enhances rather than replaces musician creativity.
Details
Motivation: The paper is motivated by concerns about generative AI including high energy consumption, copyright infringement issues, and the potential for creative atrophy among musicians. The authors seek an alternative approach that supports rather than replaces human creativity.Method: The method uses randomly initialized recurrent neural networks (RNNs) to produce musical content, specifically arpeggios and low-frequency oscillations. This approach requires no training data and is computationally lightweight compared to traditional generative AI models.
Result: The randomly initialized RNNs successfully generate rich and configurable arpeggios and low-frequency oscillations, demonstrating that meaningful musical content can be produced without data training or extensive computational resources.
Conclusion: The paper concludes that randomly initialized neural networks offer a viable alternative to data-hungry generative AI for music creation, providing a tool that expands musicians’ creativity rather than replacing them, while addressing concerns about energy consumption and copyright issues.
Abstract: Generative artificial intelligence raises concerns related to energy consumption, copyright infringement and creative atrophy. We show that randomly initialized recurrent neural networks can produce arpeggios and low-frequency oscillations that are rich and configurable. In contrast to end-to-end music generation that aims to replace musicians, our approach expands their creativity while requiring no data and much less computational power. More information can be found at: https://allendia.com/
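A toy version of the idea fits in a few lines: an untrained random recurrent network runs freely, and its state, quantized to a musical scale, yields a repeating-but-rich arpeggio. The reservoir size, spectral radius, and note mapping are assumed parameters, not ReMi's actual configuration.

```python
# Toy random-RNN arpeggiator: no training data, negligible compute.
import numpy as np

rng = np.random.default_rng(7)
N = 100                                          # reservoir size
W = rng.normal(0, 1, (N, N))
W *= 0.9 / np.max(np.abs(np.linalg.eigvals(W)))  # keep spectral radius below 1

scale = [60, 62, 64, 67, 69, 72]                 # MIDI pitches to quantize onto
x = rng.normal(0, 1, N)
notes = []
for _ in range(16):
    x = np.tanh(W @ x)                           # free-running recurrent dynamics
    notes.append(scale[int((x[0] + 1) / 2 * len(scale)) % len(scale)])
print(notes)                                     # a configurable arpeggio pattern
```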
[299] Audio Geolocation: A Natural Sounds Benchmark
Mustafa Chasmai, Wuao Liu, Subhransu Maji, Grant Van Horn
Main category: cs.SD
TL;DR: This paper explores global-scale audio geolocation by converting wildlife audio to spectrograms and using species vocalizations as geographic cues, demonstrating that acoustic signals can determine location and showing improved results when combining audio with visual content.
Details
Motivation: The challenge of determining someone's geographic location purely from acoustic signals they hear, investigating whether audio alone can provide sufficient information for localization at country, state, or city levels.Method: Vision-inspired approach converting audio recordings to spectrograms, benchmarking existing image geolocation techniques on wildlife audio from iNatSounds dataset, integrating species range prediction with retrieval-based geolocation, and exploring multimodal approaches combining audio and visual content from movies.
Result: Species vocalizations provide strong geolocation cues due to their defined geographic ranges, geolocation accuracy improves when analyzing species-rich recordings and when aggregating across spatiotemporal neighborhoods, and multimodal approaches combining audio and visual cues show advantages over audio-only methods.
Conclusion: Audio signals can be effectively used for geographic localization, with species vocalizations serving as reliable location indicators. The integration of audio and visual cues enhances geolocation performance, establishing a foundation for future research in audio-based geolocation systems.
Abstract: Can we determine someone’s geographic location purely from the sounds they hear? Are acoustic signals enough to localize within a country, state, or even city? We tackle the challenge of global-scale audio geolocation, formalize the problem, and conduct an in-depth analysis with wildlife audio from the iNatSounds dataset. Adopting a vision-inspired approach, we convert audio recordings to spectrograms and benchmark existing image geolocation techniques. We hypothesize that species vocalizations offer strong geolocation cues due to their defined geographic ranges and propose an approach that integrates species range prediction with retrieval-based geolocation. We further evaluate whether geolocation improves when analyzing species-rich recordings or when aggregating across spatiotemporal neighborhoods. Finally, we introduce case studies from movies to explore multimodal geolocation using both audio and visual content. Our work highlights the advantages of integrating audio and visual cues, and sets the stage for future research in audio geolocation.
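A hedged sketch of the retrieval-based geolocation idea: a pooled log-spectrogram stands in for a trained audio encoder, and the query location is a naive similarity-weighted average of its nearest geotagged references (the paper's species-range integration is omitted, and the coordinate average breaks near the dateline):

```python
import numpy as np
from scipy.signal import spectrogram

def embed(audio, sr=22050):
    # Crude stand-in for a trained encoder: time-pooled log spectrogram.
    _, _, S = spectrogram(audio, fs=sr, nperseg=512)
    v = np.log1p(S).mean(axis=1)
    return v / (np.linalg.norm(v) + 1e-8)

def geolocate(query_audio, ref_embs, ref_latlon, k=5):
    # ref_embs: (n_refs, d) unit embeddings; ref_latlon: (n_refs, 2)
    sims = ref_embs @ embed(query_audio)   # cosine similarity to references
    top = np.argsort(sims)[-k:]
    w = np.maximum(sims[top], 0)
    w = w / (w.sum() + 1e-8)
    return (w[:, None] * ref_latlon[top]).sum(axis=0)  # weighted (lat, lon)
```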
cs.LG
[300] Quantifying Holistic Review: A Multi-Modal Approach to College Admissions Prediction
Jun-Wei Zeng, Jerry Shen
Main category: cs.LG
TL;DR: This paper introduces CAPS, a multi-modal framework that quantitatively models college admissions by decomposing applicant profiles into three interpretable components (academic performance, essay quality, and extracurricular engagement) using transformer embeddings, LLM scoring, and XGBoost regression to provide transparent and explainable evaluations.
Details
Motivation: Traditional holistic college admissions reviews suffer from opacity and inconsistency, and create anxiety for applicants. There is a need for more transparent, explainable, and data-informed admissions practices that can quantitatively model holistic evaluations while maintaining interpretability.
Method: The paper develops the CAPS (Comprehensive Applicant Profile Score) framework, which decomposes applicant profiles into three components: Standardized Academic Score (SAS), Essay Quality Index (EQI), and Extracurricular Impact Score (EIS). The method uses transformer-based semantic embeddings, LLM scoring, and XGBoost regression to create transparent and explainable evaluations.
Result: The experiments on a synthetic but realistic dataset show strong performance with EQI prediction R² of 0.80, classification accuracy over 75%, macro F1 score of 0.69, and weighted F1 score of 0.74. The results demonstrate that CAPS can effectively model holistic admissions evaluations while providing interpretability.
Conclusion: CAPS successfully addresses key limitations in traditional holistic review by providing opacity reduction, consistency improvement, and anxiety mitigation for applicants. The framework paves the way for more equitable and data-informed admissions practices while maintaining the holistic nature of college admissions evaluations.
Abstract: This paper introduces the Comprehensive Applicant Profile Score (CAPS), a novel multi-modal framework designed to quantitatively model and interpret holistic college admissions evaluations. CAPS decomposes applicant profiles into three interpretable components: academic performance (Standardized Academic Score, SAS), essay quality (Essay Quality Index, EQI), and extracurricular engagement (Extracurricular Impact Score, EIS). Leveraging transformer-based semantic embeddings, LLM scoring, and XGBoost regression, CAPS provides transparent and explainable evaluations aligned with human judgment. Experiments on a synthetic but realistic dataset demonstrate strong performance, achieving an EQI prediction R^2 of 0.80, classification accuracy over 75%, a macro F1 score of 0.69, and a weighted F1 score of 0.74. CAPS addresses key limitations in traditional holistic review – particularly the opacity, inconsistency, and anxiety faced by applicants – thus paving the way for more equitable and data-informed admissions practices.
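A minimal sketch of the CAPS-style pipeline under our own assumptions: three interpretable component scores feed a gradient-boosted regressor. The features and target below are synthetic placeholders, with a random vector standing in for the transformer essay embeddings:

```python
import numpy as np
from xgboost import XGBRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 1000
sas = rng.uniform(0, 1, n)               # standardized academic score
essay_emb = rng.normal(0, 1, (n, 16))    # stand-in for essay embeddings
eis = rng.uniform(0, 1, n)               # extracurricular impact score
X = np.column_stack([sas, essay_emb, eis])
# Synthetic target: a weighted blend of the components plus noise.
y = 0.5 * sas + 0.3 * essay_emb[:, 0] + 0.2 * eis + rng.normal(0, 0.05, n)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = XGBRegressor(n_estimators=200, max_depth=4, learning_rate=0.1)
model.fit(X_tr, y_tr)
print("R^2:", model.score(X_te, y_te))   # coefficient of determination
```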
[301] RDMA: Cost Effective Agent-Driven Rare Disease Discovery within Electronic Health Record Systems
John Wu, Adam Cross, Jimeng Sun
Main category: cs.LG
TL;DR: RDMA is a framework that identifies rare diseases in electronic health records by mimicking medical expert reasoning, handling clinical abbreviations and implicit patterns while running locally to preserve privacy and improve diagnostic accuracy.
Details
Motivation: Rare diseases affect 1 in 10 Americans but are poorly captured by standard ICD coding systems in EHRs, with crucial information buried in clinical notes. Current approaches struggle with medical abbreviations, miss implicit disease mentions, have privacy concerns with cloud processing, and lack clinical reasoning abilities.
Method: The RDMA framework mirrors how medical experts identify rare disease patterns by connecting scattered clinical observations that together suggest specific rare conditions. It handles clinical abbreviations, recognizes implicit disease patterns, and applies contextual reasoning locally on standard hardware.
Result: RDMA improves F1 performance by upwards of 30% and decreases inference costs 10-fold compared to existing approaches. It reduces privacy risks by processing data locally while effectively extracting rare disease information from EHR systems.
Conclusion: RDMA enables clinicians to access key rare disease information from EHR systems while avoiding privacy risks of cloud services, supporting earlier diagnosis for rare disease patients through improved local processing capabilities.
Abstract: Rare diseases affect 1 in 10 Americans, yet standard ICD coding systems fail to capture these conditions in electronic health records (EHR), leaving crucial information buried in clinical notes. Current approaches struggle with medical abbreviations, miss implicit disease mentions, raise privacy concerns with cloud processing, and lack clinical reasoning abilities. We present Rare Disease Mining Agents (RDMA), a framework that mirrors how medical experts identify rare disease patterns in EHR. RDMA connects scattered clinical observations that together suggest specific rare conditions. By handling clinical abbreviations, recognizing implicit disease patterns, and applying contextual reasoning locally on standard hardware, RDMA reduces privacy risks while improving F1 performance by upwards of 30% and decreasing inference costs 10-fold. This approach helps clinicians avoid the privacy risk of using cloud services while accessing key rare disease information from EHR systems, supporting earlier diagnosis for rare disease patients. Available at https://github.com/jhnwu3/RDMA.
[302] An open dataset of neural networks for hypernetwork research
David Kurtenbach, Lior Shamir
Main category: cs.LG
TL;DR: Researchers created a dataset of 10,000 trained LeNet-5 neural networks organized into 10 classes for binary image classification, designed specifically to enable hypernetwork research where AI systems can generate other neural networks.
Details
Motivation: The field of hypernetworks (neural networks that generate other neural networks) has been understudied due to a lack of available research resources and datasets specifically designed for this purpose.
Method: Generated 10,000 LeNet-5 neural networks using a computing cluster of over 10,000 cores, organizing them into 10 classes with 1,000 networks each for binary classification of ImageNette V2 classes, then tested basic classification accuracy on the network weights themselves.
Result: Achieved 72.0% classification accuracy when using supervised machine learning to classify the neural networks themselves, demonstrating that meaningful differences between network weights can be identified algorithmically.
Conclusion: The dataset successfully enables hypernetwork research by providing a large collection of diverse but structured neural networks, with both the dataset and generation code made publicly available to advance research in this understudied area.
Abstract: Despite the transformative potential of AI, the concept of neural networks that can produce other neural networks by generating model weights (hypernetworks) has been largely understudied. One of the possible reasons is the lack of available research resources that can be used for the purpose of hypernetwork research. Here we describe a dataset of neural networks, designed for the purpose of hypernetworks research. The dataset includes $10^4$ LeNet-5 neural networks trained for binary image classification separated into 10 classes, such that each class contains 1,000 different neural networks that can identify a certain ImageNette V2 class from all other classes. A computing cluster of over $10^4$ cores was used to generate the dataset. Basic classification results show that the neural networks can be classified with accuracy of 72.0%, indicating that the differences between the neural networks can be identified by supervised machine learning algorithms. The ultimate purpose of the dataset is to enable hypernetworks research. The dataset and the code that generates it are open and accessible to the public.
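A hedged sketch of the dataset's baseline experiment: treat each trained network's (flattened or summarized) weights as a feature vector and predict which class it was trained to detect. Random arrays stand in for the real weight dumps, and the classifier choice is ours, not necessarily the paper's:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
n_nets, feat_dim, n_classes = 500, 1024, 10
weights = rng.normal(0, 1, (n_nets, feat_dim))   # placeholder weight vectors
labels = rng.integers(0, n_classes, n_nets)      # which class each net detects

# If the weights carry class-dependent structure, accuracy rises well
# above chance (the paper reports 72.0% on the real dataset).
clf = RandomForestClassifier(n_estimators=100, random_state=0)
print("accuracy:", cross_val_score(clf, weights, labels, cv=5).mean())
```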
[303] Prompt Smart, Pay Less: Cost-Aware APO for Real-World Applications
Jayesh Choudhari, Piyush Kumar Singh, Douglas McIlwraith, Snehal Nair
Main category: cs.LG
TL;DR: This paper presents the first comprehensive evaluation of Automatic Prompt Optimization (APO) methods for real-world commercial multiclass classification, introducing APE-OPRO, a hybrid framework that achieves 18% better cost-efficiency than OPRO while maintaining performance.
Details
Motivation: Prompt design for Large Language Models remains largely heuristic, manual, and difficult to scale. Most existing APO frameworks have only been validated on benchmark classification tasks of limited complexity, creating a critical gap for real-world, high-stakes commercial applications.
Method: The authors introduce APE-OPRO, a novel hybrid framework combining the complementary strengths of APE and OPRO methods. They benchmark this against both gradient-free (APE, OPRO) and gradient-based (ProTeGi) methods on a dataset of ~2,500 labeled products, conducting ablation studies on depth and breadth hyperparameters.
Result: APE-OPRO achieved approximately 18% improvement in cost-efficiency over OPRO without sacrificing performance. ProTeGi offered the strongest absolute performance at lower API cost but higher computational time. The study revealed notable sensitivity to label formatting and identified key trade-offs between performance, API efficiency, and scalability.
Conclusion: The findings provide actionable insights for implementing APO in commercial applications and establish a foundation for future research in multi-label, vision, and multimodal prompt optimization scenarios. APE-OPRO strikes a compelling balance between performance, API efficiency, and scalability for real-world deployment.
Abstract: Prompt design is a critical factor in the effectiveness of Large Language Models (LLMs), yet remains largely heuristic, manual, and difficult to scale. This paper presents the first comprehensive evaluation of Automatic Prompt Optimization (APO) methods for real-world, high-stakes multiclass classification in a commercial setting, addressing a critical gap in the existing literature where most of the APO frameworks have been validated only on benchmark classification tasks of limited complexity. We introduce APE-OPRO, a novel hybrid framework that combines the complementary strengths of APE and OPRO, achieving notably better cost-efficiency, around 18% improvement over OPRO, without sacrificing performance. We benchmark APE-OPRO alongside both gradient-free (APE, OPRO) and gradient-based (ProTeGi) methods on a dataset of 2,500 labeled products. Our results highlight key trade-offs: ProTeGi offers the strongest absolute performance at lower API cost but higher computational time as noted in \cite{protegi}, while APE-OPRO strikes a compelling balance between performance, API efficiency, and scalability. We further conduct ablation studies on depth and breadth hyperparameters, and reveal notable sensitivity to label formatting, indicating implicit sensitivity in LLM behavior. These findings provide actionable insights for implementing APO in commercial applications and establish a foundation for future research in multi-label, vision, and multimodal prompt optimization scenarios.
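For readers unfamiliar with this family of methods, here is a generic OPRO-style optimization loop, sketched under our own assumptions (it is not the APE-OPRO code; `call_llm` and `evaluate_prompt` are hypothetical stand-ins for an optimizer LLM call and a labeled-data scorer):

```python
# An optimizer LLM sees past (prompt, score) pairs and proposes new
# candidates; the best-scoring prompt is kept at the end.
def optimize_prompt(seed_prompt, call_llm, evaluate_prompt, steps=10):
    history = [(seed_prompt, evaluate_prompt(seed_prompt))]
    for _ in range(steps):
        # Present the trajectory worst-to-best, as OPRO-style methods do.
        trajectory = "\n".join(f"score={s:.3f}: {p}"
                               for p, s in sorted(history, key=lambda t: t[1]))
        meta = ("Below are prompts with their classification scores.\n"
                f"{trajectory}\n"
                "Write a new prompt likely to score higher:")
        candidate = call_llm(meta)
        history.append((candidate, evaluate_prompt(candidate)))
    return max(history, key=lambda t: t[1])   # (best_prompt, best_score)
```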
[304] Bipartite Patient-Modality Graph Learning with Event-Conditional Modelling of Censoring for Cancer Survival Prediction
Hailin Yue, Hulin Kuang, Jin Liu, Junjian Li, Lanlan Wang, Mengshen He, Jianxin Wang
Main category: cs.LG
TL;DR: CenSurv is a novel cancer survival prediction method that uses bipartite patient-modality graph learning and event-conditional modeling of censoring to better utilize censored data and handle missing modalities, achieving 3.1% improvement over state-of-the-art methods.
Details
Motivation: Existing cancer survival prediction studies focus only on samples with known survival risks without fully leveraging censored samples, suffer performance degradation in modality-missing scenarios, and struggle during inference. There's a need for better utilization of censored data and robust handling of missing modalities.
Method: The paper proposes CenSurv with three key components: (1) bipartite graph structure to model multimodal data and obtain representations, (2) complete-incomplete alignment strategy to explore modality-agnostic features for handling missing modalities, and (3) plug-and-play event-conditional modeling of censoring (ECMC) that uses dynamic momentum accumulation confidences to select reliable censored data and assign more accurate survival times.
Result: CenSurv outperforms the best state-of-the-art methods by 3.1% in mean C-index across 5 public cancer datasets, shows excellent robustness under various modality-missing scenarios, and the ECMC module alone improves 8 baseline methods by 1.3% mean C-index improvement across 5 datasets.
Conclusion: The proposed CenSurv method successfully addresses the limitations of existing cancer survival prediction approaches by effectively utilizing censored data and handling missing modalities through bipartite graph learning and event-conditional censoring modeling, demonstrating superior performance and robustness across multiple datasets.
Abstract: Accurately predicting the survival of cancer patients is crucial for personalized treatment. However, existing studies focus solely on the relationships between samples with known survival risks, without fully leveraging the value of censored samples. Furthermore, these studies may suffer performance degradation in modality-missing scenarios and even struggle during the inference process. In this study, we propose a bipartite patient-modality graph learning with event-conditional modelling of censoring for cancer survival prediction (CenSurv). Specifically, we first use graph structure to model multimodal data and obtain representation. Then, to alleviate performance degradation in modality-missing scenarios, we design a bipartite graph to simulate the patient-modality relationship in various modality-missing scenarios and leverage a complete-incomplete alignment strategy to explore modality-agnostic features. Finally, we design a plug-and-play event-conditional modeling of censoring (ECMC) that selects reliable censored data using dynamic momentum accumulation confidences, assigns more accurate survival times to these censored data, and incorporates them as uncensored data into training. Comprehensive evaluations on 5 public cancer datasets showcase the superiority of CenSurv over the best state-of-the-art by 3.1% in terms of the mean C-index, while also exhibiting excellent robustness under various modality-missing scenarios. In addition, using the plug-and-play ECMC module, the mean C-index of 8 baselines increased by 1.3% across 5 datasets. Code of CenSurv is available at https://github.com/yuehailin/CenSurv.
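A hedged sketch of the ECMC idea as described in the abstract: keep a momentum-smoothed confidence per censored patient and promote stable, high-confidence patients to pseudo-uncensored training data. All names, the momentum value, and the promotion rule below are illustrative assumptions, not the paper's exact mechanism:

```python
def update_confidences(conf, new_scores, momentum=0.9):
    # conf, new_scores: dicts mapping patient id -> confidence in [0, 1].
    # Momentum accumulation smooths per-epoch fluctuations in confidence.
    for pid, s in new_scores.items():
        conf[pid] = momentum * conf.get(pid, s) + (1 - momentum) * s
    return conf

def select_pseudo_uncensored(conf, predicted_times, threshold=0.8):
    # Censored patients whose accumulated confidence clears the threshold
    # receive the model's predicted survival time and re-enter training
    # as uncensored samples.
    return {pid: predicted_times[pid]
            for pid, c in conf.items() if c >= threshold}
```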
[305] ReDi: Rectified Discrete Flow
Jaehoon Yoo, Wonjung Kim, Seunghoon Hong
Main category: cs.LG
TL;DR: This paper introduces Rectified Discrete Flow (ReDi), a method that reduces the approximation error in Discrete Flow-based Models through iterative coupling rectification, enabling efficient few-step generation while maintaining high-quality discrete data synthesis.
Details
Motivation: Discrete Flow-based Models suffer from slow sampling speeds due to their reliance on iterative decoding processes, which stems from the factorization approximation necessary for handling high-dimensional data. The authors aim to address this computational bottleneck while preserving generation quality.
Method: The authors propose Rectified Discrete Flow (ReDi), an iterative method that reduces factorization error by rectifying the coupling between source and target distributions. They use Conditional Total Correlation (TC) to characterize the approximation error and design ReDi to monotonically decrease this measure at each step.
Result: ReDi significantly reduces Conditional TC and enables few-step generation. The rectified couplings are also well-suited for training efficient one-step models on image generation tasks, demonstrating improved efficiency without sacrificing quality.
Conclusion: ReDi provides a theoretically grounded and simple approach for tackling the few-step generation challenge in discrete flow models, offering a new perspective on efficient discrete data synthesis with guaranteed convergence properties.
Abstract: Discrete Flow-based Models (DFMs) are powerful generative models for high-quality discrete data but typically suffer from slow sampling speeds due to their reliance on iterative decoding processes. This reliance on a multi-step process originates from the factorization approximation of DFMs, which is necessary for handling high-dimensional data. In this paper, we rigorously characterize the approximation error from factorization using Conditional Total Correlation (TC), which depends on the coupling. To reduce the Conditional TC and enable efficient few-step generation, we propose Rectified Discrete Flow (ReDi), a novel iterative method that reduces factorization error by rectifying the coupling between source and target distributions. We theoretically prove that each ReDi step guarantees a monotonic decreasing Conditional TC, ensuring its convergence. Empirically, ReDi significantly reduces Conditional TC and enables few-step generation. Moreover, we demonstrate that the rectified couplings are well-suited for training efficient one-step models on image generation. ReDi offers a simple and theoretically grounded approach for tackling the few-step challenge, providing a new perspective on efficient discrete data synthesis. Code is available at https://github.com/Ugness/ReDi_discrete
[306] FISHER: A Foundation Model for Multi-Modal Industrial Signal Comprehensive Representation
Pingyi Fan, Anbai Jiang, Shuwei Zhang, Zhiqiang Lv, Bing Han, Xinhu Zheng, Wenrui Liang, Junjie Li, Wei-Qiang Zhang, Yanmin Qian, Xie Chen, Cheng Lu, Jia Liu
Main category: cs.LG
TL;DR: FISHER is a foundation model that unifies analysis of heterogeneous industrial signals (M5 problem) using STFT sub-bands and teacher-student SSL framework, achieving 5.03% performance gain over existing SSL models on industrial health management tasks.
Details
Motivation: Industrial SCADA systems generate highly heterogeneous signals (the M5 problem) that existing specialized models handle separately, missing opportunities to leverage synergies between modalities and scaling laws for unified signal analysis and anomaly detection.
Method: FISHER uses STFT sub-bands as modeling units to handle arbitrary sampling rates, treats sampling rate increments as sub-band information concatenation, and employs a teacher-student self-supervised learning (SSL) framework for pre-training on multi-modal industrial signals.
Result: FISHER demonstrates up to 5.03% general performance improvement over top SSL models on the RMIS benchmark across multiple health management tasks, with more efficient scaling curves and versatile capabilities for industrial signal representation.
Conclusion: The intrinsic similarity of M5 industrial signals enables unified modeling through foundation models, with FISHER proving that a single model can effectively handle heterogeneous industrial signals while providing superior performance and scaling efficiency compared to specialized approaches.
Abstract: With the rapid deployment of SCADA systems, how to effectively analyze industrial signals and detect abnormal states is an urgent need for the industry. Due to the significant heterogeneity of these signals, which we summarize as the M5 problem, previous works only focus on small sub-problems and employ specialized models, failing to utilize the synergies between modalities and the powerful scaling law. However, we argue that the M5 signals can be modeled in a unified manner due to the intrinsic similarity. As a result, we propose FISHER, a Foundation model for multi-modal Industrial Signal compreHEnsive Representation. To support arbitrary sampling rates, FISHER considers the increment of sampling rate as the concatenation of sub-band information. Specifically, FISHER takes the STFT sub-band as the modeling unit and adopts a teacher student SSL framework for pre-training. We also develop the RMIS benchmark, which evaluates the representations of M5 industrial signals on multiple health management tasks. Compared with top SSL models, FISHER showcases versatile and outstanding capabilities with a general performance gain up to 5.03%, along with much more efficient scaling curves. We also investigate the scaling law on downstream tasks and derive potential avenues for future works. FISHER is now open-sourced on https://github.com/jianganbai/FISHER
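A minimal sketch of the sub-band view described above (our illustration, with assumed band width and STFT settings): compute an STFT and slice its frequency axis into fixed-width sub-bands, so a signal sampled at a higher rate simply contributes more sub-band units:

```python
import torch

def stft_subbands(signal, n_fft=512, hop=256, band_bins=32):
    # Magnitude STFT -> (freq_bins, frames).
    spec = torch.stft(signal, n_fft=n_fft, hop_length=hop,
                      window=torch.hann_window(n_fft),
                      return_complex=True).abs()
    # Slice the frequency axis into equal-width sub-bands; each sub-band
    # is one modeling unit, so higher sampling rates add more units.
    n_bands = spec.shape[0] // band_bins
    return spec[:n_bands * band_bins].reshape(n_bands, band_bins, -1)

x = torch.randn(16000)          # 1 s of a placeholder signal at 16 kHz
print(stft_subbands(x).shape)   # e.g. torch.Size([8, 32, 63])
```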
[307] Improving the Generation of VAEs with High Dimensional Latent Spaces by the use of Hyperspherical Coordinates
Alejandro Ascarate, Leo Lebrat, Rodrigo Santa Cruz, Clinton Fookes, Olivier Salvado
Main category: cs.LG
TL;DR: This paper addresses the issue of poor generation quality in VAEs when sampling from random latent vectors by proposing hyperspherical coordinates to compress latent vectors towards a hypersphere island, reducing latent sparsity and improving generation ability.
Details
Motivation: Standard VAEs suffer from poor generation quality when decoding random latent vectors from the prior, especially in high-dimensional latent spaces (>12 dimensions). The authors identify that latent vectors are distributed uniformly on a hypersphere, leading to sparsity issues that affect meaningful data generation.
Method: The paper proposes reformulating VAE latent variables using hyperspherical coordinates instead of standard Cartesian coordinates. This approach compresses latent vectors towards an “island” on the hypersphere, creating a new parameterization of the latent space with limited computational overhead.
Result: The hyperspherical coordinate formulation reduces latent sparsity and demonstrates improved generation ability compared to standard VAEs. The method shows better performance in generating meaningful data when sampling from the prior distribution.
Conclusion: By leveraging insights from high-dimensional statistics and reformulating the latent space using hyperspherical coordinates, VAEs can achieve better generation quality with reduced latent sparsity, offering a computationally efficient solution to improve VAE performance in high-dimensional settings.
Abstract: Variational autoencoders (VAE) encode data into lower-dimensional latent vectors before decoding those vectors back to data. Once trained, decoding a random latent vector from the prior usually does not produce meaningful data, at least when the latent space has more than a dozen dimensions. In this paper, we investigate this issue by drawing insight from high dimensional statistics: in these regimes, the latent vectors of a standard VAE are by construction distributed uniformly on a hypersphere. We propose to formulate the latent variables of a VAE using hyperspherical coordinates, which allows compressing the latent vectors towards an island on the hypersphere, thereby reducing the latent sparsity and we show that this improves the generation ability of the VAE. We propose a new parameterization of the latent space with limited computational overhead.
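For concreteness, here is the underlying coordinate change (standard n-sphere formulas, not the paper's VAE-specific parameterization): a latent vector in R^n becomes a radius plus n-1 angles, and the mapping is exactly invertible:

```python
import torch

def to_hyperspherical(z):
    # Standard n-sphere angles: phi_k = atan2(||z[k+1:]||, z[k]) in [0, pi]
    # for all but the last angle, which keeps the sign of the final axis.
    r = z.norm()
    angles = [torch.atan2(z[k + 1:].norm(), z[k]) for k in range(len(z) - 2)]
    angles.append(torch.atan2(z[-1], z[-2]))
    return r, torch.stack(angles)

def to_cartesian(r, angles):
    n = len(angles) + 1
    z = torch.empty(n)
    sin_prod = torch.tensor(1.0)
    for k in range(n - 1):
        z[k] = r * sin_prod * torch.cos(angles[k])
        sin_prod = sin_prod * torch.sin(angles[k])
    z[n - 1] = r * sin_prod
    return z

z = torch.randn(8)
r, ang = to_hyperspherical(z)
print(torch.allclose(to_cartesian(r, ang), z, atol=1e-5))  # True
```

Working in (r, angles) makes "compressing towards an island" a simple matter of constraining the angles to a narrow range, which is the intuition behind the paper's reparameterization.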
[308] Multi-Agent Reinforcement Learning for Sample-Efficient Deep Neural Network Mapping
Srivatsan Krishnan, Jason Jabbour, Dan Zhang, Natasha Jaques, Aleksandra Faust, Shayegan Omidshafiei, Vijay Janapa Reddi
Main category: cs.LG
TL;DR: A decentralized multi-agent reinforcement learning framework for DNN-to-hardware mapping that uses agent clustering to improve sample efficiency by 30-300x over single-agent RL while achieving significant latency and energy reductions.
Details
Motivation: Mapping deep neural networks to hardware is critical for optimizing performance, but the vast and complex mapping space makes traditional reinforcement learning approaches suffer from sample inefficiency, limiting their effectiveness in accelerator design.
Method: A decentralized multi-agent reinforcement learning (MARL) framework with an agent clustering algorithm that assigns similar mapping parameters to the same agents based on correlation analysis, enabling parallelized learning while avoiding training inefficiencies.
Result: The MARL approach achieves 30-300x improvement in sample efficiency compared to standard single-agent RL, with up to 32.61x latency reduction and 16.45x energy-delay product (EDP) reduction under iso-sample conditions.
Conclusion: The decentralized MARL framework with agent clustering successfully addresses the sample inefficiency problem in DNN-to-hardware mapping, significantly accelerating exploration and achieving substantial improvements in both performance metrics and learning efficiency.
Abstract: Mapping deep neural networks (DNNs) to hardware is critical for optimizing latency, energy consumption, and resource utilization, making it a cornerstone of high-performance accelerator design. Due to the vast and complex mapping space, reinforcement learning (RL) has emerged as a promising approach-but its effectiveness is often limited by sample inefficiency. We present a decentralized multi-agent reinforcement learning (MARL) framework designed to overcome this challenge. By distributing the search across multiple agents, our framework accelerates exploration. To avoid inefficiencies from training multiple agents in parallel, we introduce an agent clustering algorithm that assigns similar mapping parameters to the same agents based on correlation analysis. This enables a decentralized, parallelized learning process that significantly improves sample efficiency. Experimental results show our MARL approach improves sample efficiency by 30-300x over standard single-agent RL, achieving up to 32.61x latency reduction and 16.45x energy-delay product (EDP) reduction under iso-sample conditions.
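A hedged sketch of the clustering step described above (the greedy rule and the per-parameter signal below are our illustrative stand-ins, not the paper's exact criterion): mapping parameters whose reward impacts are strongly correlated are grouped so a single agent can own each group:

```python
import numpy as np

def cluster_parameters(samples, rewards, threshold=0.7):
    # samples: (n_trials, n_params) mapping decisions; rewards: (n_trials,)
    impact = samples * rewards[:, None]        # crude per-parameter signal
    corr = np.corrcoef(impact, rowvar=False)   # (n_params, n_params)
    clusters = []
    for p in range(corr.shape[0]):
        # Join the first cluster whose members are all highly correlated
        # with parameter p; otherwise start a new cluster.
        for members in clusters:
            if all(abs(corr[p, m]) >= threshold for m in members):
                members.append(p)
                break
        else:
            clusters.append([p])
    return clusters  # each cluster of parameters is assigned to one agent
```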
[309] Towards Mitigation of Hallucination for LLM-empowered Agents: Progressive Generalization Bound Exploration and Watchdog Monitor
Siyuan Liu, Wenjing Liu, Zhiwei Xu, Xin Wang, Bo Chen, Tao Li
Main category: cs.LG
TL;DR: HalMit is a black-box watchdog framework that detects hallucinations in LLM-powered intelligent agents by modeling their generalization bounds using probabilistic fractal sampling, without requiring internal access to the LLM architecture.
Details
Motivation: LLM-powered intelligent agents suffer from hallucinations that undermine their credibility and pose catastrophic risks in real-world deployment. Existing hallucination detection methods either require white-box access to LLMs or fail to accurately identify hallucinations, creating a need for effective black-box detection approaches.
Method: The paper proposes HalMit, a black-box watchdog framework that models the generalization bound of LLM-empowered agents. It uses a probabilistic fractal sampling technique to generate sufficient queries in parallel that trigger incredible responses, efficiently identifying the generalization bound of the target agent without requiring internal LLM knowledge.
Result: Experimental evaluations show that HalMit significantly outperforms existing approaches in hallucination monitoring. The framework demonstrates superior performance while maintaining its black-box nature, making it practical for real-world applications.
Conclusion: HalMit presents a promising solution for enhancing the dependability of LLM-powered systems by providing effective hallucination detection without requiring white-box access. Its black-box nature and superior performance make it suitable for ensuring the reliability of intelligent agents in real-world deployments.
Abstract: Empowered by large language models (LLMs), intelligent agents have become a popular paradigm for interacting with open environments to facilitate AI deployment. However, hallucinations generated by LLMs, where outputs are inconsistent with facts, pose a significant challenge, undermining the credibility of intelligent agents. Only if hallucinations are mitigated can intelligent agents be used in the real world without catastrophic risk. Therefore, effective detection and mitigation of hallucinations are crucial to ensure the dependability of agents. Unfortunately, related approaches either depend on white-box access to LLMs or fail to accurately identify hallucinations. To address the challenge posed by hallucinations of intelligent agents, we present HalMit, a novel black-box watchdog framework that models the generalization bound of LLM-empowered agents and thus detects hallucinations without requiring internal knowledge of the LLM’s architecture. Specifically, a probabilistic fractal sampling technique is proposed to generate a sufficient number of queries to trigger incredible responses in parallel, efficiently identifying the generalization bound of the target agent. Experimental evaluations demonstrate that HalMit significantly outperforms existing approaches in hallucination monitoring. Its black-box nature and superior performance make HalMit a promising solution for enhancing the dependability of LLM-powered systems.
[310] Fast-VAT: Accelerating Cluster Tendency Visualization using Cython and Numba
MSR Avinash, Ismael Lachheb
Main category: cs.LG
TL;DR: Fast-VAT is a high-performance Python reimplementation of the Visual Assessment of Cluster Tendency (VAT) algorithm that achieves up to 50x speedup while maintaining output fidelity through Numba JIT compilation and Cython optimizations.
Details
Motivation: The standard VAT algorithm implementation suffers from significant performance limitations due to O(n^2) time complexity and inefficient memory usage, making it impractical for larger datasets despite being a widely used unsupervised technique for assessing cluster structure.
Method: The authors developed Fast-VAT by reimplementing the VAT algorithm in Python with performance enhancements including Numba’s Just-In-Time (JIT) compilation and Cython’s static typing and low-level memory optimizations to address the computational bottlenecks.
Result: Fast-VAT achieves up to 50x speedup over the baseline implementation while preserving the output fidelity of the original VAT method. The approach was validated on real and synthetic datasets (Iris, Mall Customers, Spotify subsets) using Hopkins statistics, PCA, and t-SNE, with cluster tendency results confirmed against DBSCAN and K-Means clustering.
Conclusion: Fast-VAT successfully addresses the performance limitations of the original VAT algorithm, providing a practical and efficient solution for visual assessment of cluster tendency that maintains accuracy while dramatically improving computational speed, making it viable for larger datasets.
Abstract: Visual Assessment of Cluster Tendency (VAT) is a widely used unsupervised technique to assess the presence of cluster structure in unlabeled datasets. However, its standard implementation suffers from significant performance limitations due to its O(n^2) time complexity and inefficient memory usage. In this work, we present Fast-VAT, a high-performance reimplementation of the VAT algorithm in Python, augmented with Numba’s Just-In-Time (JIT) compilation and Cython’s static typing and low-level memory optimizations. Our approach achieves up to 50x speedup over the baseline implementation, while preserving the output fidelity of the original method. We validate Fast-VAT on a suite of real and synthetic datasets – including Iris, Mall Customers, and Spotify subsets – and verify cluster tendency using Hopkins statistics, PCA, and t-SNE. Additionally, we compare VAT’s structural insights with clustering results from DBSCAN and K-Means to confirm its reliability.
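To show what is being accelerated, here is a hedged reimplementation sketch of the classic VAT reordering (a Prim-like minimum-spanning-tree traversal of the dissimilarity matrix), JIT-compiled with Numba in the spirit of Fast-VAT; it is not the authors' exact code:

```python
import numpy as np
from numba import njit

@njit(cache=True)
def vat_order(R):
    # R: (n, n) symmetric dissimilarity matrix. Returns the VAT ordering.
    n = R.shape[0]
    order = np.empty(n, np.int64)
    visited = np.zeros(n, np.bool_)
    start = np.argmax(R) // n          # row containing the max dissimilarity
    order[0] = start
    visited[start] = True
    d = R[start].copy()                # distance from the visited set
    for t in range(1, n):
        j, best = -1, np.inf
        for k in range(n):             # pick the closest unvisited point
            if not visited[k] and d[k] < best:
                best, j = d[k], k
        order[t] = j
        visited[j] = True
        for k in range(n):             # relax distances through j
            if R[j, k] < d[k]:
                d[k] = R[j, k]
    return order

X = np.random.rand(200, 2)
R = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
P = vat_order(R)
R_star = R[P][:, P]  # dark diagonal blocks in R_star indicate clusters
```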
[311] Foundation Models and Transformers for Anomaly Detection: A Survey
Mouïn Ben Ammar, Arturo Mendoza, Nacim Belkhir, Antoine Manzanera, Gianni Franchi
Main category: cs.LG
TL;DR: This survey examines how Transformers and foundation models are revolutionizing visual anomaly detection (VAD) by addressing key challenges like long-range dependencies and data scarcity through their global receptive fields and pre-training capabilities.
Details
Motivation: To explore how advanced deep learning architectures, specifically Transformers and foundation models, can overcome traditional limitations in visual anomaly detection such as long-range dependency modeling, contextual understanding, and data scarcity issues that plague conventional approaches.
Method: The paper conducts a comprehensive survey categorizing VAD methods into three main approaches: reconstruction-based, feature-based, and zero/few-shot methods. It examines how Transformers and foundation models integrate attention mechanisms and leverage large-scale pre-training to enhance anomaly detection capabilities.
Result: The survey reveals that Transformers and foundation models enable more robust, interpretable, and scalable anomaly detection solutions compared to traditional methods. These architectures demonstrate superior performance in handling global context and adapting to various anomaly detection scenarios through their attention mechanisms and pre-training advantages.
Conclusion: Transformers and foundation models represent a paradigm shift in visual anomaly detection, offering significant improvements in robustness, interpretability, and scalability. The integration of attention mechanisms and large-scale pre-training provides effective solutions to longstanding challenges in VAD, establishing new state-of-the-art techniques in the field.
Abstract: In line with the development of deep learning, this survey examines the transformative role of Transformers and foundation models in advancing visual anomaly detection (VAD). We explore how these architectures, with their global receptive fields and adaptability, address challenges such as long-range dependency modeling, contextual modeling and data scarcity. The survey categorizes VAD methods into reconstruction-based, feature-based and zero/few-shot approaches, highlighting the paradigm shift brought about by foundation models. By integrating attention mechanisms and leveraging large-scale pre-training, Transformers and foundation models enable more robust, interpretable, and scalable anomaly detection solutions. This work provides a comprehensive review of state-of-the-art techniques, their strengths, limitations, and emerging trends in leveraging these architectures for VAD.
[312] Towards Reliable, Uncertainty-Aware Alignment
Debangshu Banerjee, Kintan Saha, Aditya Gopalan
Main category: cs.LG
TL;DR: This paper addresses the instability in LLM alignment by proposing a variance-aware policy optimization framework that accounts for disagreement between reward models, leading to more robust alignment than standard methods.
Details
Motivation: Current LLM alignment methods rely on single reward model estimates, making them vulnerable to reward model inaccuracies. The authors observed substantial disagreement between independently trained reward models on the same preference dataset, highlighting instability in existing alignment strategies and the risk of performance degradation due to overfitting.
Method: The authors propose a variance-aware policy optimization framework that incorporates reward model variance estimates through a new policy regularizer. This framework accounts for the uncertainty and disagreement between different reward models during the alignment process.
Result: Experiments across diverse LLM and reward model configurations demonstrate that the variance-aware approach yields more stable and robust alignment compared to standard variance-unaware pipelines. The method provably reduces the risk of outputting worse policies than the default baseline.
Conclusion: Incorporating reward model variance into policy optimization significantly improves the stability and robustness of LLM alignment. The proposed framework provides a principled way to handle reward model uncertainty and reduces the risk of performance degradation in preference-based alignment.
Abstract: Alignment of large language models (LLMs) typically involves training a reward model on preference data, followed by policy optimization with respect to the reward model. However, optimizing policies with respect to a single reward model estimate can render it vulnerable to inaccuracies in the reward model. We empirically study the variability of reward model training on open-source benchmarks. We observe that independently trained reward models on the same preference dataset can exhibit substantial disagreement, highlighting the instability of current alignment strategies. Employing a theoretical model, we demonstrate that variability in reward model estimation can cause overfitting, leading to the risk of performance degradation. To mitigate this risk, we propose a variance-aware policy optimization framework for preference-based alignment. The key ingredient of the framework is a new policy regularizer that incorporates reward model variance estimates. We show that variance-aware policy optimization provably reduces the risk of outputting a worse policy than the default. Experiments across diverse LLM and reward model configurations confirm that our approach yields more stable and robust alignment than the standard (variance-unaware) pipeline.
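A hedged sketch of the variance-aware idea (the paper's regularizer enters the policy objective; here we fold it into a per-sample reward for simplicity): score each response with an ensemble of reward models and penalize their disagreement, so the policy avoids regions where the reward estimate is uncertain:

```python
import torch

def variance_aware_reward(reward_models, prompt, response, lam=0.5):
    # Mean ensemble score minus a disagreement penalty.
    scores = torch.stack([rm(prompt, response) for rm in reward_models])
    return scores.mean() - lam * scores.std()

# Toy reward models (placeholders for independently trained RMs):
rms = [lambda p, r, b=b: torch.tensor(len(r) * 0.01 + b)
       for b in (0.0, 0.1, -0.05)]
print(variance_aware_reward(rms, "prompt", "a response"))
```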
[313] Dual Turing Test: A Framework for Detecting and Mitigating Undetectable AI
Alberto Messina
Main category: cs.LG
TL;DR: This paper proposes a “dual Turing test” framework where human judges try to identify AI rather than being deceived by it, formalized as an adversarial game with quality constraints and integrated into reinforcement learning alignment pipelines.
Details
Motivation: To bridge three areas: the flipped Turing test perspective, formal adversarial classification with quality guarantees, and RL alignment pipelines. This addresses the need for better AI detection and alignment methods by inverting the traditional Turing test paradigm.
Method: The method formalizes a dual Turing test as a two-player zero-sum game where judges identify AI over N rounds with fresh prompts from space Q. It incorporates a quality function Q with parameters tau and delta, maps the minimax game onto RL-HF alignment loops using an undetectability detector D for negative rewards balanced by quality proxies, and employs phased difficulty levels with iterative adversarial training.
Result: The framework successfully combines quality thresholds, phased difficulty levels, and minimax bounds in a unified approach. It provides worst-case guarantees for adversarial classification and integrates undetectability detection with quality preservation in RL alignment pipelines.
Conclusion: The paper concludes by establishing a novel unified framework that connects inverted Turing tests, formal adversarial games, and RL alignment. The approach offers a structured way to balance AI undetectability with quality constraints and suggests immediate actions for implementation.
Abstract: In this short note, we propose a unified framework that bridges three areas: (1) a flipped perspective on the Turing Test, the “dual Turing test”, in which a human judge’s goal is to identify an AI rather than reward a machine for deception; (2) a formal adversarial classification game with explicit quality constraints and worst-case guarantees; and (3) a reinforcement learning (RL) alignment pipeline that uses an undetectability detector and a set of quality related components in its reward model. We review historical precedents, from inverted and meta-Turing variants to modern supervised reverse-Turing classifiers, and highlight the novelty of combining quality thresholds, phased difficulty levels, and minimax bounds. We then formalize the dual test: define the judge’s task over N independent rounds with fresh prompts drawn from a prompt space Q, introduce a quality function Q and parameters tau and delta, and cast the interaction as a two-player zero-sum game over the adversary’s feasible strategy set M. Next, we map this minimax game onto an RL-HF style alignment loop, in which an undetectability detector D provides negative reward for stealthy outputs, balanced by a quality proxy that preserves fluency. Throughout, we include detailed explanations of each component notation, the meaning of inner minimization over sequences, phased tests, and iterative adversarial training and conclude with a suggestion for a couple of immediate actions.
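For intuition, a schematic rendering of the game in the abstract (our notation, not the paper's exact display; we write the quality function as a calligraphic Q to distinguish it from the prompt space Q):

```latex
% m ranges over the adversary's feasible strategy set M, D is the judge's
% detector over N rounds, and tau, delta bound the allowed quality loss.
\[
  \min_{m \in M} \ \max_{D}\
  \frac{1}{N}\sum_{i=1}^{N}
  \Pr_{q_i \sim Q}\big[\, D(m(q_i)) = \mathrm{AI} \,\big]
  \qquad \text{s.t.} \qquad
  \Pr\big[\, \mathcal{Q}(m(q_i)) \ge \tau \,\big] \ge 1 - \delta .
\]
```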
[314] HyDRA: A Hybrid-Driven Reasoning Architecture for Verifiable Knowledge Graphs
Adrian Kaiser, Claudiu Leoveanu-Condrei, Ryan Gold, Marius-Constantin Dinu, Markus Hofmarcher
Main category: cs.LG
TL;DR: HyDRA is a hybrid neurosymbolic architecture that automates reliable Knowledge Graph construction by first building ontologies through collaborative agents using competency questions, then extracting triplets guided by verifiable contracts to address structural inconsistencies and improve output reliability.
Details
Motivation: Current automated Knowledge Graph construction faces significant challenges including output reliability, consistency, and verifiability issues that manifest as structural inconsistencies like isolated data islands and incorrect conflation of abstract classes with specific instances, creating a bottleneck for advancing neurosymbolic AI.
Method: HyDRA uses a two-stage approach: (1) collaborative neurosymbolic agents construct ontologies by agreeing on competency questions that define scope and requirements, (2) the resulting ontology graph guides automated triplet extraction from documents using design-by-contract principles with verifiable contracts to control Large Language Model generation processes.
Result: The approach produces Knowledge Graphs with improved structural consistency and reliability, with functional correctness verified through symbolic verification methods as described by the SymbolicAI framework, extending beyond standard benchmarks for evaluation.
Conclusion: HyDRA successfully addresses key challenges in automated KG construction by combining collaborative agent-based ontology construction with contract-driven triplet extraction, contributing both a hybrid architecture for improved reliability and novel evaluation methods for measuring functional integrity of Knowledge Graph outputs.
Abstract: The synergy between symbolic knowledge, often represented by Knowledge Graphs (KGs), and the generative capabilities of neural networks is central to advancing neurosymbolic AI. A primary bottleneck in realizing this potential is the difficulty of automating KG construction, which faces challenges related to output reliability, consistency, and verifiability. These issues can manifest as structural inconsistencies within the generated graphs, such as the formation of disconnected $\textit{isolated islands}$ of data or the inaccurate conflation of abstract classes with specific instances. To address these challenges, we propose HyDRA, a $\textbf{Hy}$brid-$\textbf{D}$riven $\textbf{R}$easoning $\textbf{A}$rchitecture designed for verifiable KG automation. Given a domain or an initial set of documents, HyDRA first constructs an ontology via a panel of collaborative neurosymbolic agents. These agents collaboratively agree on a set of competency questions (CQs) that define the scope and requirements the ontology must be able to answer. Given these CQs, we build an ontology graph that subsequently guides the automated extraction of triplets for KG generation from arbitrary documents. Inspired by design-by-contracts (DbC) principles, our method leverages verifiable contracts as the primary control mechanism to steer the generative process of Large Language Models (LLMs). To verify the output of our approach, we extend beyond standard benchmarks and propose an evaluation framework that assesses the functional correctness of the resulting KG by leveraging symbolic verifications as described by the neurosymbolic AI framework, $\textit{SymbolicAI}$. This work contributes a hybrid-driven architecture for improving the reliability of automated KG construction and the exploration of evaluation methods for measuring the functional integrity of its output. The code is publicly available.
[315] On the transferability of Sparse Autoencoders for interpreting compressed models
Suchit Gupte, Vishnu Kabir Chhabra, Mohammad Mahdi Khalili
Main category: cs.LG
TL;DR: This paper investigates how model compression affects interpretability in LLMs, finding that Sparse Autoencoders (SAEs) trained on original models can effectively interpret compressed models, and that pruning SAEs directly is as effective as retraining them, reducing computational costs.
Details
Motivation: Modern LLMs face inference efficiency challenges due to their large scale, leading to various compression methods like pruning and quantization. However, the impact of these compression techniques on model interpretability remains unclear, particularly for SAEs which are effective tools for understanding model activations.
Method: The researchers compare SAEs trained on original models versus compressed models, examining their ability to interpret model activations. They also explore directly pruning the original SAE as an alternative to retraining SAEs on compressed models.
Result: SAEs trained on original models can interpret compressed models with only slight performance degradation compared to SAEs trained specifically on compressed models. Additionally, pruning the original SAE directly achieves comparable performance to training a new SAE on the pruned model.
Conclusion: The study demonstrates that model compression has minimal impact on interpretability when using SAEs, and that direct SAE pruning can replace costly retraining, significantly reducing the computational overhead associated with SAE training for compressed models.
Abstract: Modern LLMs face inference efficiency challenges due to their scale. To address this, many compression methods have been proposed, such as pruning and quantization. However, the effect of compression on a model’s interpretability remains elusive. While several model interpretation approaches exist, such as circuit discovery, Sparse Autoencoders (SAEs) have proven particularly effective in decomposing a model’s activation space into its feature basis. In this work, we explore the differences in SAEs for the original and compressed models. We find that SAEs trained on the original model can interpret the compressed model albeit with slight performance degradation compared to the trained SAE on the compressed model. Furthermore, simply pruning the original SAE itself achieves performance comparable to training a new SAE on the pruned model. This finding enables us to mitigate the extensive training costs of SAEs.
[316] Semantic-Aware Gaussian Process Calibration with Structured Layerwise Kernels for Deep Neural Networks
Kyung-hwan Lee, Kyung-tae Kim
Main category: cs.LG
TL;DR: The paper proposes SAL-GP, a multi-layer Gaussian Process framework that mirrors neural network architecture to improve confidence calibration by capturing both local semantic dependencies and global calibration coherence through layerwise corrections.
Details
Motivation: Conventional Gaussian Process calibration methods fail to capture the internal hierarchical structure of deep neural networks, which limits both interpretability and effectiveness for assessing predictive reliability in neural network classifiers.
Method: The authors develop a Semantic-Aware Layer-wise Gaussian Process (SAL-GP) framework that employs a multi-layer GP model where each layer’s feature representation is mapped to a local calibration correction. These layerwise GPs are coupled through a structured multi-layer kernel, enabling joint marginalization across all layers.
Result: The SAL-GP framework successfully captures both local semantic dependencies and global calibration coherence while consistently propagating predictive uncertainty through the network, leading to enhanced interpretability aligned with the network architecture.
Conclusion: SAL-GP provides a principled approach for evaluating confidence consistency and uncertainty quantification in deep models by mirroring the layered architecture of neural networks, offering improved calibration performance compared to conventional single global GP correction methods.
Abstract: Calibrating the confidence of neural network classifiers is essential for quantifying the reliability of their predictions during inference. However, conventional Gaussian Process (GP) calibration methods often fail to capture the internal hierarchical structure of deep neural networks, limiting both interpretability and effectiveness for assessing predictive reliability. We propose a Semantic-Aware Layer-wise Gaussian Process (SAL-GP) framework that mirrors the layered architecture of the target neural network. Instead of applying a single global GP correction, SAL-GP employs a multi-layer GP model, where each layer’s feature representation is mapped to a local calibration correction. These layerwise GPs are coupled through a structured multi-layer kernel, enabling joint marginalization across all layers. This design allows SAL-GP to capture both local semantic dependencies and global calibration coherence, while consistently propagating predictive uncertainty through the network. The resulting framework enhances interpretability aligned with the network architecture and enables principled evaluation of confidence consistency and uncertainty quantification in deep models.
[317] Enhancing Stability of Physics-Informed Neural Network Training Through Saddle-Point Reformulation
Dmitry Bylinkin, Mikhail Aleksandrov, Savelii Chezhegov, Aleksandr Beznosikov
Main category: cs.LG
TL;DR: This paper reformulates Physics-Informed Neural Networks (PINNs) training as a saddle-point optimization problem to address performance instability issues, demonstrating superior results compared to existing methods.
Details
Motivation: PINNs have become popular for various applications but suffer from unstable performance due to the complex landscape of their loss function, which hinders reliable training and convergence.
Method: The authors reformulate PINN training as a nonconvex-strongly concave saddle-point problem, providing theoretical foundations for this new optimization approach to handle the complex loss landscape more effectively.
Result: Extensive experimental evaluation across various tasks and architectures shows that the proposed saddle-point formulation outperforms current state-of-the-art PINN training techniques.
Conclusion: The saddle-point reformulation of PINN training successfully addresses the stability issues inherent in traditional PINN optimization, offering a more robust and effective approach for physics-informed neural network training.
Abstract: Physics-informed neural networks (PINNs) have gained prominence in recent years and are now effectively used in a number of applications. However, their performance remains unstable due to the complex landscape of the loss function. To address this issue, we reformulate PINN training as a nonconvex-strongly concave saddle-point problem. After establishing the theoretical foundation for this approach, we conduct an extensive experimental study, evaluating its effectiveness across various tasks and architectures. Our results demonstrate that the proposed method outperforms the current state-of-the-art techniques.
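A schematic saddle-point form consistent with the abstract (our notation, not necessarily the paper's exact formulation): PDE and boundary residuals are weighted by dual variables, and a quadratic term in the duals makes the inner maximization strongly concave while the problem stays nonconvex in the network parameters:

```latex
% r_i(theta) are residuals of the network u_theta at collocation points;
% the -mu/2 ||lambda||^2 term gives strong concavity in lambda.
\[
  \min_{\theta}\ \max_{\lambda}\
  \mathcal{L}(\theta,\lambda)
  \;=\; \sum_{i} \lambda_i\, r_i(\theta)^2 \;-\; \frac{\mu}{2}\,\|\lambda\|^2 .
\]
```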
[318] Neural Probabilistic Shaping: Joint Distribution Learning for Optical Fiber Communications
Mohammad Taha Askari, Lutz Lampe, Amirhossein Ghazisaeidi
Main category: cs.LG
TL;DR: The paper proposes an autoregressive end-to-end learning method for probabilistic shaping in nonlinear fiber optic communications that learns joint symbol distributions and achieves 0.3-bits/2D information rate improvement over optimized marginal distributions for 64-QAM transmission.
Details
Motivation: Traditional probabilistic shaping methods for fiber optic communications typically optimize marginal symbol distributions, but this approach may not fully exploit the potential of joint symbol distribution optimization for nonlinear fiber channels, leaving room for performance improvements.
Method: An autoregressive end-to-end learning approach that learns the joint symbol distribution for probabilistic shaping on nonlinear fiber channels, rather than relying solely on optimized marginal distributions.
Result: The proposed method achieves a 0.3-bits/2D achievable information rate gain compared to an optimized marginal distribution approach when tested on dual-polarized 64-QAM transmission over a single-span 205 km fiber link.
Conclusion: Autoregressive end-to-end learning for joint symbol distribution optimization provides superior performance over traditional marginal distribution optimization in probabilistic shaping for nonlinear fiber optic communications, demonstrating the value of considering symbol dependencies.
Abstract: We present an autoregressive end-to-end learning approach for probabilistic shaping on nonlinear fiber channels. Our proposed scheme learns the joint symbol distribution and provides a 0.3-bits/2D achievable information rate gain over an optimized marginal distribution for dual-polarized 64-QAM transmission over a single-span 205 km link.
[319] Reactivation: Empirical NTK Dynamics Under Task Shifts
Yuzhi Liu, Zixuan Chen, Zirui Zhang, Yufei Liu, Giulia Lanzillotta
Main category: cs.LG
TL;DR: This paper empirically analyzes Neural Tangent Kernel (NTK) dynamics in continual learning settings where data distribution shifts over time, challenging static-kernel approximations used in theoretical continual learning analysis.
Details
Motivation: Previous NTK research focused only on single-task settings with constant data distributions, but continual learning with shifting data distributions remains unexplored despite being crucial for understanding neural network dynamics and feature learning in realistic scenarios.
Method: Comprehensive empirical analysis of NTK dynamics during continual learning where the data distribution changes over time, examining how the kernel evolves as networks encounter sequential tasks with different data distributions.
Result: The findings reveal that continual learning provides a rich testbed for studying neural training dynamics, and demonstrate that NTK evolution is necessary for feature learning in continual settings, contradicting static-kernel assumptions.
Conclusion: Static-kernel approximations commonly used in theoretical continual learning analysis are invalid even at large scale, and continual learning scenarios offer valuable insights into neural network training dynamics that cannot be captured by single-task studies.
Abstract: The Neural Tangent Kernel (NTK) offers a powerful tool to study the functional dynamics of neural networks. In the so-called lazy, or kernel regime, the NTK remains static during training and the network function is linear in the static neural tangents feature space. The evolution of the NTK during training is necessary for feature learning, a key driver of deep learning success. The study of the NTK dynamics has led to several critical discoveries in recent years, in generalization and scaling behaviours. However, this body of work has been limited to the single task setting, where the data distribution is assumed constant over time. In this work, we present a comprehensive empirical analysis of NTK dynamics in continual learning, where the data distribution shifts over time. Our findings highlight continual learning as a rich and underutilized testbed for probing the dynamics of neural training. At the same time, they challenge the validity of static-kernel approximations in theoretical treatments of continual learning, even at large scale.
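For reference, this is how an empirical NTK is typically measured (the standard definition, not the paper's code): for a scalar-output network f_theta, NTK(x, x') is the inner product of parameter gradients at x and x'. Tracking this matrix before and after a task switch exposes the kernel dynamics the paper studies:

```python
import torch

def empirical_ntk(model, x1, x2):
    # NTK(i, j) = <grad_theta f(x1_i), grad_theta f(x2_j)>.
    def grad_vec(x):
        out = model(x.unsqueeze(0)).squeeze()
        grads = torch.autograd.grad(out, list(model.parameters()))
        return torch.cat([g.reshape(-1) for g in grads])
    G1 = torch.stack([grad_vec(x) for x in x1])   # (n1, n_params)
    G2 = torch.stack([grad_vec(x) for x in x2])   # (n2, n_params)
    return G1 @ G2.T                              # (n1, n2) kernel matrix

net = torch.nn.Sequential(torch.nn.Linear(4, 32), torch.nn.Tanh(),
                          torch.nn.Linear(32, 1))
x = torch.randn(5, 4)
print(empirical_ntk(net, x, x).shape)  # torch.Size([5, 5])
```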
[320] A Lower Bound for the Number of Linear Regions of Ternary ReLU Regression Neural Networks
Yuta Nakahara, Manabu Kobayashi, Toshiyasu Matsushima
Main category: cs.LG
TL;DR: This paper theoretically analyzes ternary neural networks (parameters restricted to {-1, 0, +1}) by studying their expressivity through the number of linear regions, proving they can achieve comparable expressivity to standard neural networks with appropriate scaling of width or depth.
Details
Motivation: While ternary neural networks show excellent practical performance in applications like image recognition and NLP with reduced computational complexity and memory consumption, their theoretical understanding remains insufficient, particularly regarding their expressivity capabilities.
Method: The authors theoretically analyze ternary neural networks’ expressivity by evaluating the number of linear regions in ternary regression networks with ReLU activation functions, using mathematical proofs to establish relationships between network architecture and expressivity.
Result: The number of linear regions in ternary neural networks increases polynomially with network width and exponentially with depth, similar to standard neural networks. Squaring the width or doubling the depth of a ternary network suffices to achieve a lower bound on the maximum number of linear regions comparable to that of general ReLU regression networks.
Conclusion: The theoretical analysis provides mathematical justification for the practical success of ternary neural networks, showing that despite parameter restrictions to {-1, 0, +1}, they can achieve comparable expressivity to standard networks through appropriate architectural scaling.
Abstract: With the advancement of deep learning, reducing computational complexity and memory consumption has become a critical challenge, and ternary neural networks (NNs) that restrict parameters to $\{-1, 0, +1\}$ have attracted attention as a promising approach. While ternary NNs demonstrate excellent performance in practical applications such as image recognition and natural language processing, their theoretical understanding remains insufficient. In this paper, we theoretically analyze the expressivity of ternary NNs from the perspective of the number of linear regions. Specifically, we evaluate the number of linear regions of ternary regression NNs with Rectified Linear Unit (ReLU) for activation functions and prove that the number of linear regions increases polynomially with respect to network width and exponentially with respect to depth, similar to standard NNs. Moreover, we show that it suffices to either square the width or double the depth of ternary NNs to achieve a lower bound on the maximum number of linear regions comparable to that of general ReLU regression NNs. This provides a theoretical explanation, in some sense, for the practical success of ternary NNs.
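To get an empirical feel for the quantity being bounded, the sketch below counts activation regions of a random ternary ReLU net along a 1-D input slice (an upper proxy for linear regions on that slice, since distinct activation patterns can share the same linear map). The ternary sampling and sizes are illustrative, not the paper's construction.

```python
import numpy as np

rng = np.random.default_rng(0)
t = lambda *shape: rng.choice([-1.0, 0.0, 1.0], size=shape)  # ternary draws

def count_activation_regions(width, depth, xs):
    """Count activation-pattern changes of a random ternary ReLU net
    along a dense 1-D grid of inputs."""
    h = xs[:, None]
    pats = []
    for _ in range(depth):
        pre = h @ t(h.shape[1], width) + t(width)   # ternary weights + biases
        pats.append(pre > 0)
        h = np.maximum(pre, 0.0)
    pat = np.concatenate(pats, axis=1)
    return 1 + int(np.any(pat[1:] != pat[:-1], axis=1).sum())

xs = np.linspace(-4.0, 4.0, 100_001)
for depth in (1, 2, 4):
    print(depth, count_activation_regions(width=8, depth=depth, xs=xs))
```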
[321] TorchAO: PyTorch-Native Training-to-Serving Model Optimization
Andrew Or, Apurva Jain, Daniel Vega-Myhre, Jesse Cai, Charles David Hernandez, Zhenrui Zheng, Driss Guessous, Vasiliy Kuznetsov, Christian Puhrsch, Mark Saroufim, Supriya Rao, Thien Tran, Aleksandar Samardžić
Main category: cs.LG
TL;DR: TorchAO is a PyTorch-native model optimization framework that provides end-to-end quantization and sparsity techniques for AI models, supporting various data types and integrating with the broader ML ecosystem from training to serving.
Details
Motivation: The need for a unified, end-to-end model optimization workflow that connects the fragmented space of model optimization techniques, from training to serving, while providing backend-agnostic support for various low precision data types.
Method: TorchAO implements a novel tensor subclass abstraction to represent various low precision data types (INT4, INT8, FP8, MXFP4, MXFP6, MXFP8) and supports multiple optimization techniques including FP8 quantized training, quantization-aware training (QAT), post-training quantization (PTQ), and 2:4 sparsity, with integration across the ML pipeline from pre-training to serving.
Result: TorchAO successfully enabled the launch of quantized Llama 3.2 1B/3B and LlamaGuard3-8B models, demonstrating practical application in real-world model deployments while providing seamless integration with popular frameworks like TorchTitan, TorchTune, HuggingFace, vLLM, and others.
Conclusion: TorchAO provides a comprehensive, PyTorch-native solution for model optimization that unifies previously fragmented optimization techniques into a single workflow, successfully bridging the gap between training and serving while supporting diverse quantization and sparsity methods.
Abstract: We present TorchAO, a PyTorch-native model optimization framework leveraging quantization and sparsity to provide an end-to-end, training-to-serving workflow for AI models. TorchAO supports a variety of popular model optimization techniques, including FP8 quantized training, quantization-aware training (QAT), post-training quantization (PTQ), and 2:4 sparsity, and leverages a novel tensor subclass abstraction to represent a variety of widely-used, backend agnostic low precision data types, including INT4, INT8, FP8, MXFP4, MXFP6, and MXFP8. TorchAO integrates closely with the broader ecosystem at each step of the model optimization pipeline, from pre-training (TorchTitan) to fine-tuning (TorchTune, Axolotl) to serving (HuggingFace, vLLM, SGLang, ExecuTorch), connecting an otherwise fragmented space in a single, unified workflow. TorchAO has enabled recent launches of the quantized Llama 3.2 1B/3B and LlamaGuard3-8B models and is open-source at https://github.com/pytorch/ao/.
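A minimal usage sketch of weight-only INT8 post-training quantization, assuming the `quantize_` / `int8_weight_only` entry points documented in recent torchao releases; names have shifted across versions, so check the repository for the current API before relying on this.

```python
import torch
from torchao.quantization import quantize_, int8_weight_only  # assumed entry points

model = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096), torch.nn.ReLU(), torch.nn.Linear(4096, 1024)
).eval()

# Swap Linear weights for INT8 tensor-subclass representations in place;
# the module interface is unchanged, so downstream serving code is untouched.
quantize_(model, int8_weight_only())

with torch.no_grad():
    print(model(torch.randn(2, 1024)).shape)
```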
[322] Learning Patient-Specific Spatial Biomarker Dynamics via Operator Learning for Alzheimer’s Disease Progression
Jindong Wang, Yutong Mao, Xiao Liu, Wenrui Hao
Main category: cs.LG
TL;DR: This paper presents a machine learning framework using neural operators to create personalized models of Alzheimer’s disease progression, achieving over 90% prediction accuracy by learning patient-specific disease dynamics from multimodal data.
Details
Motivation: Alzheimer's disease shows substantial heterogeneity in progression and treatment response, but current predictive models are limited in accurately forecasting individualized disease trajectories despite recent therapeutic advances.
Method: The authors developed a machine learning-based operator learning framework that directly learns patient-specific disease operators governing spatiotemporal evolution of biomarkers, using Laplacian eigenfunction bases to construct geometry-aware neural operators within a digital twin paradigm.
Result: The method achieved high prediction accuracy exceeding 90% across multiple biomarkers when applied to AD clinical data, substantially outperforming existing approaches, and enables individualized predictions and simulation of therapeutic interventions.
Conclusion: This work provides a scalable, interpretable platform for precision modeling and personalized therapeutic optimization in neurodegenerative diseases, offering significant improvements over conventional approaches with prespecified dynamics.
Abstract: Alzheimer’s disease (AD) is a complex, multifactorial neurodegenerative disorder with substantial heterogeneity in progression and treatment response. Despite recent therapeutic advances, predictive models capable of accurately forecasting individualized disease trajectories remain limited. Here, we present a machine learning-based operator learning framework for personalized modeling of AD progression, integrating longitudinal multimodal imaging, biomarker, and clinical data. Unlike conventional models with prespecified dynamics, our approach directly learns patient-specific disease operators governing the spatiotemporal evolution of amyloid, tau, and neurodegeneration biomarkers. Using Laplacian eigenfunction bases, we construct geometry-aware neural operators capable of capturing complex brain dynamics. Embedded within a digital twin paradigm, the framework enables individualized predictions, simulation of therapeutic interventions, and in silico clinical trials. Applied to AD clinical data, our method achieves high prediction accuracy exceeding 90% across multiple biomarkers, substantially outperforming existing approaches. This work offers a scalable, interpretable platform for precision modeling and personalized therapeutic optimization in neurodegenerative diseases.
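A minimal sketch of the geometry-aware encoding step, under the assumption that "Laplacian eigenfunction bases" means projecting a spatial biomarker field onto leading eigenvectors of a domain Laplacian (a 1-D chain stands in here for the brain geometry; the toy field and sizes are illustrative).

```python
import numpy as np

n, k = 100, 16                      # grid points, retained eigenfunctions
L = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)   # discrete Laplacian
eigvals, eigvecs = np.linalg.eigh(L)
basis = eigvecs[:, :k]              # smooth, geometry-aware modes

field = np.exp(-((np.linspace(0, 1, n) - 0.3) ** 2) / 0.01)  # toy amyloid map
coeffs = basis.T @ field            # low-dimensional spectral coefficients
recon = basis @ coeffs              # reconstruction from k modes
print(np.linalg.norm(field - recon) / np.linalg.norm(field))
# a neural operator would then evolve `coeffs` over time per patient
```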
[323] LLM Data Selection and Utilization via Dynamic Bi-level Optimization
Yang Yu, Kai Han, Hang Zhou, Yehui Tang, Kaiqi Huang, Yunhe Wang, Dacheng Tao
Main category: cs.LG
TL;DR: This paper proposes a Data Weighting Model (DWM) that dynamically adjusts training data weights during LLM training using bi-level optimization, improving upon static data selection methods by adapting to the model’s evolving data preferences throughout training.
Details
Motivation: Current data selection methods for LLM training use static, training-agnostic criteria that don't account for dynamic interactions between model training and data. There's a need for adaptive data selection that responds to how the model's data preferences evolve during training to improve efficiency and reduce computational costs.
Method: The authors develop a Data Weighting Model (DWM) that adjusts the weight of selected data within each training batch. They implement a bi-level optimization framework to update the weighting model, enabling it to capture the dynamic data preferences of the trained model as training progresses.
Result: DWM enhances the performance of models trained with randomly-selected data. The learned weighting model demonstrates transferability: it can be applied to improve other data selection methods and works across models of different sizes. The method provides insights into how model data preferences evolve during training.
Conclusion: The proposed DWM successfully addresses limitations of static data selection by providing dynamic, adaptive data weighting during LLM training. The bi-level optimization approach effectively captures evolving model preferences, leading to improved training efficiency and transferable improvements across different models and data selection methods.
Abstract: While large-scale training data is fundamental for developing capable large language models (LLMs), strategically selecting high-quality data has emerged as a critical approach to enhance training efficiency and reduce computational costs. Current data selection methodologies predominantly rely on static, training-agnostic criteria, failing to account for the dynamic model training and data interactions. In this paper, we propose a new Data Weighting Model (DWM) to adjust the weight of selected data within each batch to achieve a dynamic data utilization during LLM training. Specifically, to better capture the dynamic data preference of the trained model, a bi-level optimization framework is implemented to update the weighting model. Our experiments demonstrate that DWM enhances the performance of models trained with randomly-selected data, and the learned weighting model can be transferred to enhance other data selection methods and models of different sizes. Moreover, we further analyze how a model’s data preferences evolve throughout training, providing new insights into the data preference of the model during training.
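The bi-level structure can be sketched as a one-step differentiable unroll: the weighter scores a training batch, the model takes one weighted SGD step, and the held-out loss through the updated parameters trains the weighter. Everything below (a linear "model", batch sizes, learning rates) is a toy stand-in for the paper's setup.

```python
import torch
import torch.nn.functional as F
from torch.func import functional_call

model = torch.nn.Linear(10, 2)                     # stand-in for the LLM
weighter = torch.nn.Sequential(torch.nn.Linear(10, 16), torch.nn.ReLU(),
                               torch.nn.Linear(16, 1))
w_opt = torch.optim.Adam(weighter.parameters(), lr=1e-3)
lr_inner = 0.1

xb, yb = torch.randn(32, 10), torch.randint(0, 2, (32,))   # training batch
xv, yv = torch.randn(32, 10), torch.randint(0, 2, (32,))   # held-out batch

# inner step: one differentiable SGD step on the weighted batch loss
weights = torch.softmax(weighter(xb).squeeze(-1), dim=0)
losses = F.cross_entropy(model(xb), yb, reduction="none")
grads = torch.autograd.grad((weights * losses).sum(),
                            list(model.parameters()), create_graph=True)
new_params = {n: p - lr_inner * g
              for (n, p), g in zip(model.named_parameters(), grads)}

# outer step: held-out loss through the updated parameters trains the weighter
val_loss = F.cross_entropy(functional_call(model, new_params, (xv,)), yv)
w_opt.zero_grad(); val_loss.backward(); w_opt.step()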
[324] EBaReT: Expert-guided Bag Reward Transformer for Auto Bidding
Kaiyuan Li, Pengyu Wang, Yunshan Peng, Pengjia Yuan, Yanxiang Zeng, Rui Xiang, Yanhua Cheng, Xialong Liu, Peng Jiang
Main category: cs.LG
TL;DR: This paper proposes EBaReT, a novel Expert-guided Bag Reward Transformer that improves automated bidding by addressing data quality issues through expert trajectory generation and PU learning-based discriminators, while using a bag reward mechanism to handle uncertain rewards in bidding environments.
Details
Motivation: Traditional reinforcement learning approaches for automated bidding suffer from low data quality due to sub-optimal bids and uncertain rewards caused by low click and conversion rates. Existing generative RL methods rely on supervised learning which is vulnerable to these data quality issues, and few studies have addressed these fundamental challenges in bidding environments.
Method: The authors propose Expert-guided Bag Reward Transformer (EBaReT) which: (1) generates expert trajectories as supplementary training data, (2) uses a Positive-Unlabeled learning-based discriminator to identify expert transitions, (3) implements an expert-guided inference strategy to ensure expert-level decisions, and (4) designs a bag reward function that groups transitions within certain periods to achieve smoother reward acquisition.
Result: Extensive experiments demonstrate that EBaReT achieves superior performance compared to state-of-the-art bidding methods, effectively addressing both data quality issues and reward uncertainty in automated bidding scenarios.
Conclusion: The proposed EBaReT framework successfully addresses key challenges in automated bidding by incorporating expert guidance and bag reward mechanisms, providing a more robust solution for sequence decision-making in bidding environments with improved performance over existing methods.
Abstract: Reinforcement learning has been widely applied in automated bidding. Traditional approaches model bidding as a Markov Decision Process (MDP). Recently, some studies have explored using generative reinforcement learning methods to address long-term dependency issues in bidding environments. Although effective, these methods typically rely on supervised learning approaches, which are vulnerable to low data quality due to the amount of sub-optimal bids and low probability rewards resulting from the low click and conversion rates. Unfortunately, few studies have addressed these challenges. In this paper, we formalize the automated bidding as a sequence decision-making problem and propose a novel Expert-guided Bag Reward Transformer (EBaReT) to address concerns related to data quality and uncertainty rewards. Specifically, to tackle data quality issues, we generate a set of expert trajectories to serve as supplementary data in the training process and employ a Positive-Unlabeled (PU) learning-based discriminator to identify expert transitions. To ensure the decision also meets the expert level, we further design a novel expert-guided inference strategy. Moreover, to mitigate the uncertainty of rewards, we consider the transitions within a certain period as a “bag” and carefully design a reward function that leads to a smoother acquisition of rewards. Extensive experiments demonstrate that our model achieves superior performance compared to state-of-the-art bidding methods.
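The bag-reward idea can be illustrated with a few lines of code: transitions in a window share the window's mean reward, smoothing the sparse click/conversion signal. The exact bag reward function in the paper is not specified here, so this is only the smoothing idea, with illustrative numbers.

```python
def bag_rewards(rewards, bag_size):
    """Replace each per-step reward with its bag mean, smoothing the sparse,
    noisy conversion signal across a window of transitions."""
    smoothed = []
    for i in range(0, len(rewards), bag_size):
        bag = rewards[i:i + bag_size]
        smoothed.extend([sum(bag) / len(bag)] * len(bag))
    return smoothed

# sparse conversion reward over 12 bid steps, grouped into bags of 4
print(bag_rewards([0, 0, 1, 0, 0, 0, 0, 0, 0, 5, 0, 0], bag_size=4))
```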
[325] Balancing Robustness and Efficiency in Embedded DNNs Through Activation Function Selection
Jon Gutiérrez-Zaballa, Koldo Basterretxea, Javier Echanobe
Main category: cs.LG
TL;DR: This paper investigates how bounded activation functions can improve the robustness of compressed deep neural networks against soft errors in safety-critical applications, while maintaining accuracy and compressibility for semantic segmentation tasks.
Details
Motivation: Modern electronic devices are increasingly susceptible to soft errors due to shrinking transistor geometries and decreasing voltages, especially in safety-critical applications like aerospace and autonomous driving. Compression techniques alter model robustness, and activation function choice affects not only accuracy but also compressibility and error resilience, which is often overlooked in current research.
Method: The authors explore bounded activation functions to enhance robustness against parameter perturbations using a technology-agnostic approach. They evaluate encoder-decoder convolutional models for semantic segmentation of hyperspectral images and conduct experiments on AMD-Xilinx’s KV260 System-on-Module, assessing effects on model accuracy, compressibility, and computational load.
Result: The study demonstrates that bounded activation functions can enhance robustness against soft errors while maintaining acceptable levels of model accuracy, compressibility, and computational efficiency in encoder-decoder networks for hyperspectral image semantic segmentation.
Conclusion: Bounded activation functions offer a promising approach to improve soft error resilience in compressed deep neural networks for safety-critical embedded systems, providing a balance between robustness, accuracy, and computational efficiency that is essential for autonomous driving applications.
Abstract: Machine learning-based embedded systems for safety-critical applications, such as aerospace and autonomous driving, must be robust to perturbations caused by soft errors. As transistor geometries shrink and voltages decrease, modern electronic devices become more susceptible to background radiation, increasing the concern about failures produced by soft errors. The resilience of deep neural networks (DNNs) to these errors depends not only on target device technology but also on model structure and the numerical representation and arithmetic precision of their parameters. Compression techniques like pruning and quantization, used to reduce memory footprint and computational complexity, alter both model structure and representation, affecting soft error robustness. In this regard, although often overlooked, the choice of activation functions (AFs) impacts not only accuracy and trainability but also compressibility and error resilience. This paper explores the use of bounded AFs to enhance robustness against parameter perturbations, while evaluating their effects on model accuracy, compressibility, and computational load with a technology-agnostic approach. We focus on encoder-decoder convolutional models developed for semantic segmentation of hyperspectral images with application to autonomous driving systems. Experiments are conducted on an AMD-Xilinx’s KV260 SoM.
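A toy demonstration of why bounded activations help (our illustration, not the paper's experiment): a bit flip that inflates a weight sails through an unbounded ReLU but is clamped by a bounded variant such as ReLU6.

```python
import torch

x = torch.tensor([3.0])
w_clean = torch.tensor(2.0)
w_corrupt = torch.tensor(2.0 + 2.0**7)   # toy stand-in for a high-bit flip

for act in (torch.nn.ReLU(), torch.nn.ReLU6()):   # ReLU6 clamps to [0, 6]
    name = act.__class__.__name__
    print(name, act(w_clean * x).item(), act(w_corrupt * x).item())
# ReLU passes the corrupted pre-activation (390.0) through unchecked, while
# the bounded ReLU6 caps it at 6.0, limiting how far the error propagates.
```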
[326] RealBench: Benchmarking Verilog Generation Models with Real-World IP Designs
Pengwei Jin, Di Huang, Chongxiao Li, Shuyao Cheng, Yang Zhao, Xinyao Zheng, Jiaguo Zhu, Shuyi Xing, Bohan Dou, Rui Zhang, Zidong Du, Qi Guo, Xing Hu
Main category: cs.LG
TL;DR: RealBench is the first benchmark for evaluating Large Language Models (LLMs) on real-world IP-level Verilog code generation, featuring complex designs, multi-modal specifications, and rigorous verification. Even the best-performing LLM (o1-preview) achieves only 13.3% success on module-level tasks and 0% on system-level tasks.
Details
Motivation: Existing benchmarks for evaluating LLMs in Verilog generation are inadequate for real-world applications due to their simplicity, poor design specifications, and insufficient verification environments. There is a need for a more realistic and comprehensive benchmark that mirrors actual hardware design workflows.
Method: The authors developed RealBench, which includes: (1) complex, structured, real-world open-source IP designs, (2) multi-modal and formatted design specifications, (3) rigorous verification environments with 100% line coverage testbenches and formal checkers, and (4) support for both module-level and system-level tasks to enable comprehensive LLM assessment.
Result: Evaluation results show that current LLMs perform poorly on real-world Verilog generation tasks. The best-performing model (o1-preview) achieved only 13.3% pass@1 success rate on module-level tasks and 0% success rate on system-level tasks, demonstrating significant room for improvement.
Conclusion: The study reveals a substantial gap between current LLM capabilities and real-world hardware design requirements. The poor performance across all tested models highlights the need for developing stronger and more specialized Verilog generation models to meet practical industry demands.
Abstract: The automatic generation of Verilog code using Large Language Models (LLMs) has garnered significant interest in hardware design automation. However, existing benchmarks for evaluating LLMs in Verilog generation fall short in replicating real-world design workflows due to their designs’ simplicity, inadequate design specifications, and less rigorous verification environments. To address these limitations, we present RealBench, the first benchmark aiming at real-world IP-level Verilog generation tasks. RealBench features complex, structured, real-world open-source IP designs, multi-modal and formatted design specifications, and rigorous verification environments, including 100% line coverage testbenches and a formal checker. It supports both module-level and system-level tasks, enabling comprehensive assessments of LLM capabilities. Evaluations on various LLMs and agents reveal that even one of the best-performing LLMs, o1-preview, achieves only a 13.3% pass@1 on module-level tasks and 0% on system-level tasks, highlighting the need for stronger Verilog generation models in the future. The benchmark is open-sourced at https://github.com/IPRC-DIP/RealBench.
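For reference, the pass@1 figures quoted above are instances of the standard unbiased pass@k estimator (Chen et al., 2021); whether RealBench uses this exact estimator or a single-sample rate is not stated here, so treat the snippet as background.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n sampled generations, c of them correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(n=10, c=2, k=1))   # 0.2
```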
[327] METER: Multi-modal Evidence-based Thinking and Explainable Reasoning – Algorithm and Benchmark
Xu Yang, Qi Zhang, Shuming Jiang, Yaowen Xu, Zhaofan Zou, Hao Sun, Xuelong Li
Main category: cs.LG
TL;DR: METER is a unified multi-modal benchmark for interpretable forgery detection across images, videos, audio, and audio-visual content that provides not only binary classification but also detailed explanations including spatio-temporal localization and forgery type tracing, accompanied by a three-stage Chain-of-Thought training strategy.
Details
Motivation: Existing forgery detection methods focus only on binary classification without interpretable explanations, treat modalities separately without unified benchmarks, and lack applicability in safety-critical scenarios due to limited explainability, while the rapid advancement of generative AI has made synthetic content increasingly realistic and amplified misinformation risks.
Method: The authors introduce METER, a unified multi-modal benchmark with four tracks requiring real-vs-fake classification plus evidence-chain-based explanations (spatio-temporal localization, textual rationales, forgery type tracing), and propose a human-aligned three-stage Chain-of-Thought (CoT) training strategy combining Supervised Fine-Tuning (SFT), Direct Preference Optimization (DPO), and a novel GRPO stage that integrates a human-aligned evaluator with CoT reasoning.
Result: METER provides broader modality coverage compared to prior benchmarks and offers richer interpretability metrics including spatial/temporal IoU, multi-class tracing, and evidence consistency, establishing a comprehensive evaluation framework for cross-modal forgery detection and interpretation.
Conclusion: METER serves as a standardized foundation for advancing generalizable and interpretable forgery detection in the era of generative media, addressing the critical need for explainable AI systems in safety-critical scenarios involving synthetic content detection across multiple modalities.
Abstract: With the rapid advancement of generative AI, synthetic content across images, videos, and audio has become increasingly realistic, amplifying the risk of misinformation. Existing detection approaches predominantly focus on binary classification while lacking detailed and interpretable explanations of forgeries, which limits their applicability in safety-critical scenarios. Moreover, current methods often treat each modality separately, without a unified benchmark for cross-modal forgery detection and interpretation. To address these challenges, we introduce METER, a unified, multi-modal benchmark for interpretable forgery detection spanning images, videos, audio, and audio-visual content. Our dataset comprises four tracks, each requiring not only real-vs-fake classification but also evidence-chain-based explanations, including spatio-temporal localization, textual rationales, and forgery type tracing. Compared to prior benchmarks, METER offers broader modality coverage and richer interpretability metrics such as spatial/temporal IoU, multi-class tracing, and evidence consistency. We further propose a human-aligned, three-stage Chain-of-Thought (CoT) training strategy combining SFT, DPO, and a novel GRPO stage that integrates a human-aligned evaluator with CoT reasoning. We hope METER will serve as a standardized foundation for advancing generalizable and interpretable forgery detection in the era of generative media.
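The temporal IoU metric mentioned above reduces to interval overlap; the spatial version is analogous with box areas in place of interval lengths. A minimal sketch (values illustrative):

```python
def temporal_iou(pred, gt):
    """IoU of two [start, end] intervals in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

print(temporal_iou([2.0, 6.0], [4.0, 8.0]))   # 0.333...
```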
[328] Aligned Manifold Property and Topology Point Clouds for Learning Molecular Properties
Alexander Mihalcea
Main category: cs.LG
TL;DR: This paper introduces AMPTCR, a novel molecular surface representation that combines quantum-derived scalar fields and topological descriptors in an aligned point cloud format to better capture surface-local phenomena for molecular property prediction, achieving strong performance on molecular weight prediction (R²=0.87) and bacterial growth inhibition tasks (ROC AUC=0.912).
Details
Motivation: Existing molecular property prediction models rely on representations like SMILES strings and molecular graphs that overlook surface-local phenomena driving intermolecular behavior. 3D-based approaches either reduce surface detail or require computationally expensive SE(3)-equivariant architectures to handle spatial variance, creating a need for better molecular surface representations.
Method: The authors develop AMPTCR (Aligned Manifold Property and Topology Cloud Representation), which combines local quantum-derived scalar fields and custom topological descriptors within an aligned point cloud format. Each surface point includes chemically meaningful scalars, geodesically derived topology vectors, and coordinates transformed into a canonical reference frame. This representation is evaluated using a DGCNN framework on molecular weight and bacterial growth inhibition prediction tasks.
Result: AMPTCR demonstrates strong performance across evaluation tasks: molecular weight prediction achieved validation R² of 0.87, confirming the representation encodes physically meaningful data. For bacterial growth inhibition, the method achieved ROC AUC of 0.912 for classification and R² of 0.54 for regression tasks when using Dual Fukui functions as electronic descriptors and Morgan Fingerprints as auxiliary data.
Conclusion: AMPTCR provides a compact, expressive, and architecture-agnostic representation for modeling surface-mediated molecular properties. The results demonstrate that this approach successfully captures surface-local phenomena important for intermolecular behavior while enabling efficient learning with conventional SE(3)-sensitive architectures, offering advantages over existing molecular representation methods.
Abstract: Machine learning models for molecular property prediction generally rely on representations – such as SMILES strings and molecular graphs – that overlook the surface-local phenomena driving intermolecular behavior. 3D-based approaches often reduce surface detail or require computationally expensive SE(3)-equivariant architectures to manage spatial variance. To overcome these limitations, this work introduces AMPTCR (Aligned Manifold Property and Topology Cloud Representation), a molecular surface representation that combines local quantum-derived scalar fields and custom topological descriptors within an aligned point cloud format. Each surface point includes a chemically meaningful scalar, geodesically derived topology vectors, and coordinates transformed into a canonical reference frame, enabling efficient learning with conventional SE(3)-sensitive architectures. AMPTCR is evaluated using a DGCNN framework on two tasks: molecular weight and bacterial growth inhibition. For molecular weight, results confirm that AMPTCR encodes physically meaningful data, with a validation R^2 of 0.87. In the bacterial inhibition task, AMPTCR enables both classification and direct regression of E. coli inhibition values using Dual Fukui functions as the electronic descriptor and Morgan Fingerprints as auxiliary data, achieving an ROC AUC of 0.912 on the classification task, and an R^2 of 0.54 on the regression task. These results help demonstrate that AMPTCR offers a compact, expressive, and architecture-agnostic representation for modeling surface-mediated molecular properties.
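One common way to build a canonical reference frame, sketched below, is to center the point cloud and rotate it onto its principal axes; the paper's exact construction may differ, and real pipelines also need a convention for axis signs.

```python
import numpy as np

def to_canonical_frame(points):
    """Center a point cloud and rotate it onto its principal axes."""
    centered = points - points.mean(axis=0)
    _, vecs = np.linalg.eigh(np.cov(centered.T))
    return centered @ vecs[:, ::-1]          # axes ordered by decreasing variance

cloud = np.random.default_rng(0).normal(size=(500, 3)) * [3.0, 1.0, 0.3]
print(to_canonical_frame(cloud).std(axis=0))  # per-axis spread, now descending
```

Aligning every molecule this way is what lets conventional SE(3)-sensitive architectures (like the DGCNN used here) consume the clouds without equivariant machinery.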
[329] Reducing GPU Memory Fragmentation via Spatio-Temporal Planning for Efficient Large-Scale Model Training
Zixiao Huang, Junhao Hu, Hao Lin, Chunyang Zhu, Yueran Tang, Quanlu Zhang, Zhen Guo, Zhenhua Li, Shengen Yan, Zhenhua Zhu, Guohao Dai, Yu Wang
Main category: cs.LG
TL;DR: STWeaver is a GPU memory allocator that reduces memory fragmentation in large language model training by up to 100% through combined offline planning and online allocation, enabling more efficient training with up to 32.5% performance improvement.
Details
Motivation: Large language models face severe GPU memory pressure due to scaling, which is worsened by training optimization techniques like virtual pipeline and recomputation that cause memory fragmentation. Default allocators in frameworks like PyTorch waste up to 43% of memory and cause out-of-memory errors, making optimization techniques ineffective.
Method: STWeaver combines offline planning with online allocation by exploiting spatial and temporal regularity in memory allocation patterns. The offline component generates near-optimal allocation plans using spatio-temporal regularities, while the online component handles dynamic models like Mixture-of-Experts. It’s implemented as a pluggable PyTorch allocator.
Result: STWeaver reduces memory fragmentation ratio by 79.2% on average (up to 100%) across both dense and sparse models with negligible overhead. This enables more efficient, high-throughput training configurations and improves performance by up to 32.5%.
Conclusion: STWeaver successfully addresses GPU memory fragmentation in LLM training through its novel hybrid allocation approach, significantly improving memory efficiency and training performance while maintaining compatibility with existing frameworks.
Abstract: The rapid scaling of large language models (LLMs) has significantly increased GPU memory pressure, which is further aggravated by training optimization techniques such as virtual pipeline and recomputation that disrupt tensor lifespans and introduce considerable memory fragmentation. Default GPU memory allocators of popular deep learning frameworks like PyTorch use online strategies without knowledge of tensor lifespans, which can waste up to 43% of memory and cause out-of-memory errors, rendering optimization techniques ineffective or even unusable. To address this, we introduce STWeaver, a GPU memory allocator for deep learning frameworks that reduces fragmentation by exploiting the spatial and temporal regularity in memory allocation behaviors of training workloads. STWeaver introduces a novel paradigm that combines offline planning with online allocation. The offline planning leverages spatio-temporal regularities to generate a near-optimal allocation plan, while the online allocation handles complex and dynamic models such as Mixture-of-Experts (MoE). Built as a pluggable PyTorch allocator, STWeaver reduces fragmentation ratio on average by 79.2% (up to 100%) across both dense and sparse models, with negligible overhead. This enables more efficient, high-throughput training configurations and improves performance by up to 32.5%.
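As a simple observable proxy for the fragmentation being discussed (not the paper's exact metric), one can compare PyTorch's reserved pool against live allocations:

```python
import torch

def fragmentation_ratio(device=0):
    """Fraction of reserved CUDA memory not backing live tensors."""
    reserved = torch.cuda.memory_reserved(device)
    allocated = torch.cuda.memory_allocated(device)
    return (reserved - allocated) / reserved if reserved else 0.0

if torch.cuda.is_available():
    xs = [torch.empty(1024, 1024, device="cuda") for _ in range(64)]
    del xs[::2]   # free every other tensor, leaving holes in the caching pool
    print(f"fragmentation: {fragmentation_ratio():.1%}")
```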
[330] Understanding Generalization, Robustness, and Interpretability in Low-Capacity Neural Networks
Yash Kumar
Main category: cs.LG
TL;DR: This paper investigates the relationship between model capacity, sparsity, and robustness in low-capacity neural networks using binary classification tasks from MNIST with varying visual difficulty, finding that sparse subnetworks can maintain performance while over-parameterization improves robustness.
Details
Motivation: While modern deep learning uses massive over-parameterized models, understanding the fundamental relationships between capacity, sparsity, and robustness in simpler, low-capacity networks remains crucial for theoretical understanding and practical applications where computational resources are limited.
Method: The authors created a controlled experimental framework using binary classification tasks from the MNIST dataset with increasing visual difficulty (e.g., distinguishing 0 vs 1 compared to 4 vs 9). They systematically studied model capacity requirements, applied extreme magnitude pruning (up to 95% sparsity), and analyzed robustness to input corruption. Interpretability analysis using saliency maps was used to understand the reasoning process of sparse subnetworks.
Result: Three key findings emerged: (1) minimum model capacity scales directly with task complexity, (2) trained networks can withstand extreme pruning up to 95% sparsity while maintaining performance, indicating existence of sparse high-performing subnetworks, and (3) over-parameterization significantly improves robustness against input corruption. Saliency maps confirmed that sparse subnetworks preserve the core reasoning of dense models.
Conclusion: The study provides empirical evidence for fundamental trade-offs in simple neural networks, demonstrating that sparse subnetworks can maintain both performance and interpretability while over-parameterization offers robustness benefits, contributing to theoretical understanding of capacity-sparsity-robustness relationships in neural networks.
Abstract: Although modern deep learning often relies on massive over-parameterized models, the fundamental interplay between capacity, sparsity, and robustness in low-capacity networks remains a vital area of study. We introduce a controlled framework to investigate these properties by creating a suite of binary classification tasks from the MNIST dataset with increasing visual difficulty (e.g., 0 and 1 vs. 4 and 9). Our experiments reveal three core findings. First, the minimum model capacity required for successful generalization scales directly with task complexity. Second, these trained networks are robust to extreme magnitude pruning (up to 95% sparsity), revealing the existence of sparse, high-performing subnetworks. Third, we show that over-parameterization provides a significant advantage in robustness against input corruption. Interpretability analysis via saliency maps further confirms that these identified sparse subnetworks preserve the core reasoning process of the original dense models. This work provides a clear, empirical demonstration of the foundational trade-offs governing simple neural networks.
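The 95% magnitude-pruning setting is straightforward to reproduce with PyTorch's pruning utilities; the tiny model below is a stand-in, not the paper's architecture.

```python
import torch
import torch.nn.utils.prune as prune

model = torch.nn.Sequential(torch.nn.Linear(784, 32), torch.nn.ReLU(),
                            torch.nn.Linear(32, 1))
params = [(m, "weight") for m in model if isinstance(m, torch.nn.Linear)]

# global magnitude pruning to the paper's extreme 95% sparsity setting
prune.global_unstructured(params, pruning_method=prune.L1Unstructured,
                          amount=0.95)

total = sum(m.weight.numel() for m, _ in params)
zeros = sum(int((m.weight == 0).sum()) for m, _ in params)
print(f"sparsity: {zeros / total:.1%}")
```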
[331] Towards Resilient Safety-driven Unlearning for Diffusion Models against Downstream Fine-tuning
Boheng Li, Renjie Gu, Junjie Wang, Leyi Qi, Yiming Li, Run Wang, Zhan Qin, Tianwei Zhang
Main category: cs.LG
TL;DR: The paper proposes ResAlign, a safety-driven unlearning framework that makes text-to-image diffusion models more resistant to recovering harmful behaviors when fine-tuned, addressing the fragility of existing safety methods against downstream fine-tuning.
Details
Motivation: Text-to-image diffusion models often inherit unsafe behaviors from toxic pretraining data, and existing safety-driven unlearning methods fail to maintain their effectiveness when models are fine-tuned on downstream tasks, even with benign datasets.
Method: ResAlign uses a Moreau Envelope-based reformulation to model downstream fine-tuning as an implicit optimization problem, enabling efficient gradient estimation to minimize harmful behavior recovery. It also employs a meta-learning strategy to simulate diverse fine-tuning scenarios for better generalization.
Result: Extensive experiments across various datasets, fine-tuning methods, and configurations show that ResAlign consistently outperforms prior unlearning approaches in maintaining safety after downstream fine-tuning while preserving benign generation capabilities.
Conclusion: ResAlign successfully addresses the fragility problem of safety-driven unlearning methods by providing enhanced resilience against downstream fine-tuning, making text-to-image models safer for personalized applications.
Abstract: Text-to-image (T2I) diffusion models have achieved impressive image generation quality and are increasingly fine-tuned for personalized applications. However, these models often inherit unsafe behaviors from toxic pretraining data, raising growing safety concerns. While recent safety-driven unlearning methods have made promising progress in suppressing model toxicity, they are identified to be fragile to downstream fine-tuning, where we reveal that state-of-the-art methods largely fail to retain their effectiveness even when fine-tuned on entirely benign datasets. To mitigate this problem, in this paper, we propose ResAlign, a safety-driven unlearning framework with enhanced resilience against downstream fine-tuning. By modeling downstream fine-tuning as an implicit optimization problem with a Moreau Envelope-based reformulation, ResAlign enables efficient gradient estimation to minimize the recovery of harmful behaviors. Additionally, a meta-learning strategy is proposed to simulate a diverse distribution of fine-tuning scenarios to improve generalization. Extensive experiments across a wide range of datasets, fine-tuning methods, and configurations demonstrate that ResAlign consistently outperforms prior unlearning approaches in retaining safety after downstream fine-tuning while preserving benign generation capability well.
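For context, the standard Moreau envelope that such a reformulation builds on is given below; pairing f with the downstream fine-tuning loss is our illustrative reading, not the paper's exact objective.

```latex
% Moreau envelope of a loss f with smoothing parameter \lambda > 0:
M_{\lambda f}(\theta) \;=\; \min_{\theta'} \Bigl[\, f(\theta') + \tfrac{1}{2\lambda}\,\lVert \theta' - \theta \rVert^2 \Bigr],
\qquad
\nabla M_{\lambda f}(\theta) \;=\; \tfrac{1}{\lambda}\bigl(\theta - \operatorname{prox}_{\lambda f}(\theta)\bigr).
```

The gradient identity is what makes efficient gradient estimation plausible: evaluating it only requires (approximately) solving the inner fine-tuning problem and taking a scaled parameter difference.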
[332] Perovskite-R1: A Domain-Specialized LLM for Intelligent Discovery of Precursor Additives and Experimental Design
Xin-De Wang, Zhi-Rui Chen, Peng-Jie Guo, Ze-Feng Gao, Cheng Mu, Zhong-Yi Lu
Main category: cs.LG
TL;DR: Researchers developed Perovskite-R1, a specialized large language model trained on 1,232 scientific publications and 33,269 materials to accelerate discovery and design of perovskite solar cell precursor additives, demonstrating effectiveness through experimental validation.
Details
Motivation: Perovskite solar cells show exceptional efficiency but face commercialization challenges including stability, sustainability, and manufacturing scalability. The explosive growth of scientific literature makes it difficult for researchers to efficiently access and utilize domain knowledge for precursor additive engineering solutions.
Method: The authors created a domain-specific instruction-tuning dataset by mining 1,232 high-quality publications and integrating 33,269 candidate materials. They used automated question-answer generation and chain-of-thought reasoning to fine-tune the QwQ-32B model, creating Perovskite-R1 with advanced reasoning capabilities for PSC research.
Result: Perovskite-R1 successfully synthesizes literature insights and generates innovative solutions for defect passivation and precursor additive selection. Experimental validation confirmed the effectiveness of several model-proposed strategies in improving material stability and performance.
Conclusion: The work demonstrates that domain-adapted large language models can effectively accelerate materials discovery in perovskite photovoltaics, providing a closed-loop framework for intelligent, data-driven research advancements in this field.
Abstract: Perovskite solar cells (PSCs) have rapidly emerged as a leading contender in next-generation photovoltaic technologies, owing to their exceptional power conversion efficiencies and advantageous material properties. Despite these advances, challenges such as long-term stability, environmental sustainability, and scalable manufacturing continue to hinder their commercialization. Precursor additive engineering has shown promise in addressing these issues by enhancing both the performance and durability of PSCs. However, the explosive growth of scientific literature and the complex interplay of materials, processes, and device architectures make it increasingly difficult for researchers to efficiently access, organize, and utilize domain knowledge in this rapidly evolving field. To address this gap, we introduce Perovskite-R1, a specialized large language model (LLM) with advanced reasoning capabilities tailored for the discovery and design of PSC precursor additives. By systematically mining and curating 1,232 high-quality scientific publications and integrating a comprehensive library of 33,269 candidate materials, we constructed a domain-specific instruction-tuning dataset using automated question-answer generation and chain-of-thought reasoning. Fine-tuning the QwQ-32B model on this dataset resulted in Perovskite-R1, which can intelligently synthesize literature insights and generate innovative and practical solutions for defect passivation and the selection of precursor additives. Experimental validation of several model-proposed strategies confirms their effectiveness in improving material stability and performance. Our work demonstrates the potential of domain-adapted LLMs in accelerating materials discovery and provides a closed-loop framework for intelligent, data-driven advancements in perovskite photovoltaic research.
[333] The Cost of Compression: Tight Quadratic Black-Box Attacks on Sketches for $\ell_2$ Norm Estimation
Sara Ahmadian, Edith Cohen, Uri Stemmer
Main category: cs.LG
TL;DR: This paper presents a universal black-box adversarial attack against linear sketching methods for dimensionality reduction that can break any sketching matrix and estimator using O(k²) queries, matching known theoretical upper bounds.
Details
Motivation: Linear sketching for dimensionality reduction is widely used but known to be vulnerable to adversarial inputs. The paper aims to understand the fundamental limits of adversarial robustness in the black-box setting where attackers can only query the system without knowing the internal sketching matrix.
Method: The authors develop a universal, non-adaptive black-box attack that uses Õ(k²) queries to either cause failure in norm estimation or construct adversarial inputs that fool the optimal estimator. The attack is agnostic to both the sketching matrix and the estimator type.
Result: The attack successfully breaks any linear sketch and any query responder (including randomized, adaptive, or distribution-tailored ones) using O(k²) queries. This lower bound construction tightly matches the known upper bounds of Ω(k²) for specialized estimators like Johnson-Lindenstrauss transforms and AMS sketches.
Conclusion: The results reveal fundamental vulnerabilities in compressed representations and establish tight bounds on adversarial robustness of linear sketching. The findings also highlight structural parallels to adversarial attacks in image classification, suggesting broader implications for understanding security of dimensionality reduction techniques.
Abstract: Dimensionality reduction via linear sketching is a powerful and widely used technique, but it is known to be vulnerable to adversarial inputs. We study the black-box adversarial setting, where a fixed, hidden sketching matrix $A \in \mathbb{R}^{k \times n}$ maps high-dimensional vectors $v \in \mathbb{R}^n$ to lower-dimensional sketches $Av \in \mathbb{R}^k$, and an adversary can query the system to obtain approximate $\ell_2$-norm estimates that are computed from the sketch. We present a universal, nonadaptive attack that, using $\tilde{O}(k^2)$ queries, either causes a failure in norm estimation or constructs an adversarial input on which the optimal estimator for the query distribution (used by the attack) fails. The attack is completely agnostic to the sketching matrix and to the estimator: It applies to any linear sketch and any query responder, including those that are randomized, adaptive, or tailored to the query distribution. Our lower bound construction tightly matches the known upper bounds of $\tilde{\Omega}(k^2)$, achieved by specialized estimators for Johnson-Lindenstrauss transforms and AMS sketches. Beyond sketching, our results uncover structural parallels to adversarial attacks in image classification, highlighting fundamental vulnerabilities of compressed representations.
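The attacked setup is easy to picture with a toy query oracle (our illustration; the paper's attack itself is not reproduced here):

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 10_000, 64
A = rng.normal(size=(k, n)) / np.sqrt(k)   # hidden JL-style sketching matrix

def norm_estimate(v):
    """Black-box query responder: estimates ||v||_2 from the sketch A v only."""
    return float(np.linalg.norm(A @ v))

v = rng.normal(size=n)
print(np.linalg.norm(v), norm_estimate(v))
# the attack issues roughly k^2 such queries to locate a direction on which any
# sketch-based estimate must fail (e.g., large ||v|| but near-zero ||A v||)
```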
[334] Leveraging Personalized PageRank and Higher-Order Topological Structures for Heterophily Mitigation in Graph Neural Networks
Yumeng Wang, Zengyi Wo, Wenjun Wang, Xingcheng Fu, Minglai Shao
Main category: cs.LG
TL;DR: HPGNN integrates Higher-order Personalized PageRank with Graph Neural Networks to better handle heterophilic graphs by capturing multi-scale node interactions and reducing noise, achieving superior performance on heterophilic datasets while maintaining competitiveness on homophilic ones.
Details
Motivation: Existing GNNs assume homophily (connected nodes have similar labels), which doesn't hold in many real-world heterophilic graphs. Current models for heterophilic graphs rely mainly on pairwise relationships and overlook multi-scale information from higher-order structures, leading to suboptimal performance, especially when dealing with conflicting class information across nodes.
Method: The paper proposes HPGNN, which integrates an efficient high-order approximation of Personalized PageRank (PPR) with Graph Neural Networks. This approach captures long-range and multi-scale node interactions while reducing computational complexity and mitigating noise from surrounding information by embedding higher-order structural information into convolutional networks.
Result: HPGNN outperforms five out of seven state-of-the-art methods on heterophilic graphs in downstream tasks while maintaining competitive performance on homophilic graphs. The model demonstrates effectiveness across benchmark datasets and shows improved ability to balance multi-scale information with robustness to noise.
Conclusion: HPGNN provides a versatile solution for real-world graph learning challenges by effectively modeling key interactions across diverse graph dimensions. The integration of higher-order PPR with GNNs successfully addresses the limitations of existing approaches in handling heterophilic graphs while preserving performance on traditional homophilic scenarios.
Abstract: Graph Neural Networks (GNNs) excel in node classification tasks but often assume homophily, where connected nodes share similar labels. This assumption does not hold in many real-world heterophilic graphs. Existing models for heterophilic graphs primarily rely on pairwise relationships, overlooking multi-scale information from higher-order structures. This leads to suboptimal performance, particularly under noise from conflicting class information across nodes. To address these challenges, we propose HPGNN, a novel model integrating Higher-order Personalized PageRank with Graph Neural Networks. HPGNN introduces an efficient high-order approximation of Personalized PageRank (PPR) to capture long-range and multi-scale node interactions. This approach reduces computational complexity and mitigates noise from surrounding information. By embedding higher-order structural information into convolutional networks, HPGNN effectively models key interactions across diverse graph dimensions. Extensive experiments on benchmark datasets demonstrate HPGNN’s effectiveness. The model achieves better performance than five out of seven state-of-the-art methods on heterophilic graphs in downstream tasks while maintaining competitive performance on homophilic graphs. HPGNN’s ability to balance multi-scale information and robustness to noise makes it a versatile solution for real-world graph learning challenges. Codes are available at https://github.com/streetcorner/HPGNN.
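For background, Personalized PageRank itself is the fixed point of a restart-biased random walk; a dense power-iteration sketch is below (HPGNN's higher-order approximation is more elaborate and is not reproduced here).

```python
import numpy as np

def personalized_pagerank(P, s, alpha=0.15, iters=200):
    """Power iteration for pi = alpha * s + (1 - alpha) * P^T pi, with P
    row-stochastic and s the personalization (restart) vector."""
    pi = s.copy()
    for _ in range(iters):
        pi = alpha * s + (1 - alpha) * P.T @ pi
    return pi

A = np.array([[0, 1, 1, 0], [1, 0, 1, 0], [1, 1, 0, 1], [0, 0, 1, 0]], float)
P = A / A.sum(axis=1, keepdims=True)
print(personalized_pagerank(P, s=np.array([1.0, 0, 0, 0])).round(3))
```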
[335] Optimization and generalization analysis for two-layer physics-informed neural networks without over-parametrization
Zhihan Zeng, Yiqi Gu
Main category: cs.LG
TL;DR: This paper analyzes SGD optimization for physics-informed neural networks (PINNs) in least-squares regression without requiring over-parameterization, showing that network width only needs to exceed a problem-dependent threshold to achieve O(ε) loss convergence.
Details
Motivation: Previous PINN optimization theory relied on over-parameterization regimes that require prohibitively large network widths scaling with training samples, making the theory computationally impractical and distant from real experimental settings.
Method: The authors perform new optimization and generalization analysis for SGD training of two-layer PINNs by making specific assumptions about the target function to avoid the over-parameterization requirement, establishing width thresholds that depend only on the desired accuracy ε and problem characteristics.
Result: The paper proves that when network width exceeds a threshold depending only on ε and the problem (not on the number of training samples), both training loss and expected loss decrease below O(ε), providing more practical convergence guarantees.
Conclusion: The work provides a more practical theoretical framework for PINN optimization that avoids over-parameterization constraints, offering convergence guarantees with reasonable computational requirements that better align with practical implementations.
Abstract: This work focuses on the behavior of stochastic gradient descent (SGD) in solving least-squares regression with physics-informed neural networks (PINNs). Past work on this topic has been based on the over-parameterization regime, whose convergence may require the network width to increase vastly with the number of training samples. So, the theory derived from over-parameterization may incur prohibitive computational costs and is far from practical experiments. We perform new optimization and generalization analysis for SGD in training two-layer PINNs, making certain assumptions about the target function to avoid over-parameterization. Given $\epsilon>0$, we show that if the network width exceeds a threshold that depends only on $\epsilon$ and the problem, then the training loss and expected loss will decrease below $O(\epsilon)$.
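To fix ideas, the object under study is SGD on a least-squares PDE residual with a two-layer (one-hidden-layer) network; the sketch below uses a Tanh activation, a toy Poisson problem, and omits boundary terms, all of which are our illustrative choices rather than the paper's setting.

```python
import torch

width = 64   # the paper's threshold depends only on epsilon and the problem
net = torch.nn.Sequential(torch.nn.Linear(1, width), torch.nn.Tanh(),
                          torch.nn.Linear(width, 1))
opt = torch.optim.SGD(net.parameters(), lr=1e-3)

def residual_loss(x):
    """Least-squares PDE residual for -u''(x) = pi^2 sin(pi x) on (0, 1),
    whose solution is u*(x) = sin(pi x)."""
    x = x.requires_grad_(True)
    u = net(x)
    du = torch.autograd.grad(u.sum(), x, create_graph=True)[0]
    d2u = torch.autograd.grad(du.sum(), x, create_graph=True)[0]
    f = (torch.pi ** 2) * torch.sin(torch.pi * x)
    return ((-d2u - f) ** 2).mean()

for step in range(1000):
    loss = residual_loss(torch.rand(128, 1))
    opt.zero_grad(); loss.backward(); opt.step()
print(float(loss))
```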
[336] Improving Predictions on Highly Unbalanced Data Using Open Source Synthetic Data Upsampling
Ivona Krchova, Michael Platzer, Paul Tiwald
Main category: cs.LG
TL;DR: This paper benchmarks AI-generated synthetic data for addressing class imbalance in tabular datasets, showing that synthetic upsampling outperforms traditional methods like SMOTE-NC for improving minority class prediction accuracy.
Details
Motivation: Unbalanced tabular datasets present significant challenges in real-world applications like fraud detection and medical diagnosis, where minority classes are vastly underrepresented. Traditional machine learning algorithms favor majority classes, leading to biased models that struggle with minority class prediction accuracy.
Method: The study evaluates the MOSTLY AI Synthetic Data SDK for synthetic upsampling of highly unbalanced tabular datasets. They compare predictive models trained on synthetically upsampled data against standard methods including naive oversampling and SMOTE-NC across mixed-type datasets.
Result: Synthetic data upsampling consistently produces top-performing predictive models, particularly for mixed-type datasets with very few minority samples. The synthetic data generates diverse data points that fill gaps in sparse regions of the feature space, improving predictive accuracy for minority groups.
Conclusion: AI-generated synthetic data provides an effective solution for addressing class imbalance in tabular data, outperforming traditional upsampling methods by creating realistic and diverse samples that enhance minority class representation and improve overall model performance.
Abstract: Unbalanced tabular data sets present significant challenges for predictive modeling and data analysis across a wide range of applications. In many real-world scenarios, such as fraud detection, medical diagnosis, and rare event prediction, minority classes are vastly underrepresented, making it difficult for traditional machine learning algorithms to achieve high accuracy. These algorithms tend to favor the majority class, leading to biased models that struggle to accurately represent minority classes. Synthetic data holds promise for addressing the under-representation of minority classes by providing new, diverse, and highly realistic samples. This paper presents a benchmark study on the use of AI-generated synthetic data for upsampling highly unbalanced tabular data sets. We evaluate the effectiveness of an open-source solution, the Synthetic Data SDK by MOSTLY AI, which provides a flexible and user-friendly approach to synthetic upsampling for mixed-type data. We compare predictive models trained on data sets upsampled with synthetic records to those using standard methods, such as naive oversampling and SMOTE-NC. Our results demonstrate that synthetic data can improve predictive accuracy for minority groups by generating diverse data points that fill gaps in sparse regions of the feature space. We show that upsampled synthetic training data consistently results in top-performing predictive models, particularly for mixed-type data sets containing very few minority samples.
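For context, here is the SMOTE-NC baseline the paper compares against, using the imbalanced-learn API on a toy mixed-type dataset; the MOSTLY AI SDK call is omitted since its exact interface is not shown in the summary.

```python
import numpy as np
from imblearn.over_sampling import SMOTENC

rng = np.random.default_rng(0)
X = np.column_stack([rng.normal(size=1000),            # numeric feature
                     rng.integers(0, 3, size=1000)])   # categorical feature
y = np.array([0] * 980 + [1] * 20)                     # 2% minority class

smote = SMOTENC(categorical_features=[1], random_state=0)
X_res, y_res = smote.fit_resample(X, y)
print(np.bincount(y_res))   # classes balanced after upsampling
```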
[337] RIS-aided Latent Space Alignment for Semantic Channel Equalization
Tomás Hüttebräucker, Mario Edoardo Pandolfo, Simone Fiorellino, Emilio Calvanese Strinati, Paolo Di Lorenzo
Main category: cs.LG
TL;DR: This paper proposes a joint physical and semantic channel equalization framework using Reconfigurable Intelligent Surfaces (RIS) to address semantic mismatches in multi-user MIMO semantic communication systems, where AI devices trained independently develop divergent latent representations that impede mutual understanding.
Details
Motivation: In semantic communication systems using Deep Neural Networks, independently trained AI devices in multi-user settings develop divergent latent representations without shared context or joint optimization, leading to semantic mismatches that prevent mutual understanding even without traditional transmission errors.
Method: The authors propose a joint physical and semantic channel equalization framework leveraging RIS in MIMO channels. The semantic equalization involves three stages: pre-equalization at the transmitter, propagation through the RIS-aided channel, and post-equalization at the receiver. They formulate this as a constrained MMSE optimization problem and develop two solutions: a linear semantic equalization chain and a non-linear DNN-based semantic equalizer, both operating under semantic compression and power constraints.
Result: Extensive evaluations demonstrate that the proposed joint equalization strategies consistently outperform conventional disjoint approaches to physical and semantic channel equalization across various scenarios and wireless channel conditions.
Conclusion: The joint physical and semantic channel equalization framework using RIS effectively addresses semantic mismatches in multi-user MIMO semantic communication systems, showing superior performance compared to traditional separated equalization approaches across diverse wireless conditions.
Abstract: Semantic communication systems introduce a new paradigm in wireless communications, focusing on transmitting the intended meaning rather than ensuring strict bit-level accuracy. These systems often rely on Deep Neural Networks (DNNs) to learn and encode meaning directly from data, enabling more efficient communication. However, in multi-user settings where interacting agents are trained independently-without shared context or joint optimization-divergent latent representations across AI-native devices can lead to semantic mismatches, impeding mutual understanding even in the absence of traditional transmission errors. In this work, we address semantic mismatch in Multiple-Input Multiple-Output (MIMO) channels by proposing a joint physical and semantic channel equalization framework that leverages the presence of Reconfigurable Intelligent Surfaces (RIS). The semantic equalization is implemented as a sequence of transformations: (i) a pre-equalization stage at the transmitter; (ii) propagation through the RIS-aided channel; and (iii) a post-equalization stage at the receiver. We formulate the problem as a constrained Minimum Mean Squared Error (MMSE) optimization and propose two solutions: (i) a linear semantic equalization chain, and (ii) a non-linear DNN-based semantic equalizer. Both methods are designed to operate under semantic compression in the latent space and adhere to transmit power constraints. Through extensive evaluations, we show that the proposed joint equalization strategies consistently outperform conventional, disjoint approaches to physical and semantic channel equalization across a broad range of scenarios and wireless channel conditions.
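As background for the MMSE formulation, the purely physical-layer linear MMSE equalizer for y = Hx + n is sketched below; the paper's joint semantic/RIS optimization adds the latent-space and surface-configuration pieces on top of this and is not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)
nt, nr, sigma2 = 4, 8, 0.1
H = (rng.normal(size=(nr, nt)) + 1j * rng.normal(size=(nr, nt))) / np.sqrt(2)

# linear MMSE equalizer for y = H x + n, unit-power symbols, noise power sigma2
W = np.linalg.solve(H.conj().T @ H + sigma2 * np.eye(nt), H.conj().T)

x = (rng.choice([-1.0, 1.0], nt) + 1j * rng.choice([-1.0, 1.0], nt)) / np.sqrt(2)
n = np.sqrt(sigma2 / 2) * (rng.normal(size=nr) + 1j * rng.normal(size=nr))
print(np.abs(W @ (H @ x + n) - x).max())   # small residual error
```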
[338] Canonical Correlation Patterns for Validating Clustering of Multivariate Time Series
Isabella Degen, Zahraa S Abdallah, Kate Robson Brown, Henry W J Reeve
Main category: cs.LG
TL;DR: This paper introduces canonical correlation patterns as validation targets for correlation-based clustering of multivariate time series, addressing the fundamental challenge of validating whether discovered clusters represent distinct relationships rather than arbitrary groupings.
Details
Motivation: Existing clustering validity indices were designed for Euclidean data and their effectiveness for correlation patterns has not been systematically evaluated. Unlike Euclidean clustering with discrete geometric reference targets, correlations exist in continuous space without equivalent reference patterns, making validation challenging in high-stakes applications like health, finance, and industry.
Method: The authors introduce canonical correlation patterns as mathematically defined validation targets that discretize the infinite correlation space into finite, interpretable reference patterns. They use synthetic datasets with perfect ground truth across controlled conditions to evaluate different norms and validity indices, specifically testing L1 norm for mapping and L5 norm for silhouette width criterion and Davies-Bouldin index.
Result: The proposed canonical patterns provide reliable validation targets, with L1 norm for mapping and L5 norm for silhouette width criterion and Davies-Bouldin index demonstrating superior performance. The methods show robustness to distribution shifts and appropriately detect correlation structure degradation.
Conclusion: The work establishes a methodological foundation for rigorous correlation-based clustering validation in high-stakes domains by providing practical implementation guidelines and reliable validation targets for correlation-based clustering of multivariate time series.
Abstract: Clustering of multivariate time series using correlation-based methods reveals regime changes in relationships between variables across health, finance, and industrial applications. However, validating whether discovered clusters represent distinct relationships rather than arbitrary groupings remains a fundamental challenge. Existing clustering validity indices were developed for Euclidean data, and their effectiveness for correlation patterns has not been systematically evaluated. Unlike Euclidean clustering, where geometric shapes provide discrete reference targets, correlations exist in continuous space without equivalent reference patterns. We address this validation gap by introducing canonical correlation patterns as mathematically defined validation targets that discretise the infinite correlation space into finite, interpretable reference patterns. Using synthetic datasets with perfect ground truth across controlled conditions, we demonstrate that canonical patterns provide reliable validation targets, with L1 norm for mapping and L5 norm for silhouette width criterion and Davies-Bouldin index showing superior performance. These methods are robust to distribution shifts and appropriately detect correlation structure degradation, enabling practical implementation guidelines. This work establishes a methodological foundation for rigorous correlation-based clustering validation in high-stakes domains.
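To illustrate the mapping step, here is a minimal sketch of assigning an empirical correlation matrix to its nearest canonical pattern under the L1 norm, which the authors report works best for mapping. The three canonical patterns below are hypothetical stand-ins; the paper defines its own pattern set.

```python
import numpy as np

def upper_tri(mat):
    """Vectorize the strict upper triangle of a square matrix."""
    i, j = np.triu_indices(mat.shape[0], k=1)
    return mat[i, j]

# Hypothetical canonical patterns for 3 variables.
CANONICAL = {
    "independent": np.eye(3),
    "co-moving": np.array([[1, .9, .9], [.9, 1, .9], [.9, .9, 1]]),
    "opposed": np.array([[1, .9, -.9], [.9, 1, -.9], [-.9, -.9, 1]]),
}

def nearest_pattern(corr, patterns=CANONICAL, ord=1):
    """Map an empirical correlation matrix to the closest canonical
    pattern; ord=1 gives the L1 distance the paper favors for mapping."""
    v = upper_tri(corr)
    dists = {name: np.linalg.norm(v - upper_tri(p), ord=ord)
             for name, p in patterns.items()}
    return min(dists, key=dists.get)

rng = np.random.default_rng(0)
x = rng.normal(size=(500, 3))
x[:, 1] = 0.8 * x[:, 0] + 0.2 * x[:, 1]   # induce shared movement
x[:, 2] = 0.8 * x[:, 0] + 0.2 * x[:, 2]
print(nearest_pattern(np.corrcoef(x, rowvar=False)))  # "co-moving"
```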
[339] Analogy making as amortised model construction
David G. Nagy, Tingke Shen, Hanqi Zhou, Charley M. Wu, Peter Dayan
Main category: cs.LG
TL;DR: This paper proposes a framework where analogy enables flexible construction of internal models for navigation in novel situations by reusing structural components from past experiences through partial homomorphisms between Markov decision processes.
Details
Motivation: Humans need to construct internal models that are both faithful enough to the environment for effective planning and tractable to build. The challenge is balancing model accuracy with computational efficiency in novel situations.
Method: The authors formalize analogies as partial homomorphisms between Markov decision processes and propose a framework using abstract modules derived from previous construals as composable building blocks for constructing new internal models.
Result: The framework enables modular reuse of solution-relevant structure from past experiences, allowing agents to amortize computational costs of both model construction and planning while maintaining flexibility across domains.
Conclusion: Analogy plays a central role in enabling flexible adaptation of policies and representations across domains with shared structural essence, providing a computationally efficient approach to constructing useful internal models for novel situations.
Abstract: Humans flexibly construct internal models to navigate novel situations. To be useful, these internal models must be sufficiently faithful to the environment that resource-limited planning leads to adequate outcomes; equally, they must be tractable to construct in the first place. We argue that analogy plays a central role in these processes, enabling agents to reuse solution-relevant structure from past experiences and amortise the computational costs of both model construction (construal) and planning. Formalising analogies as partial homomorphisms between Markov decision processes, we sketch a framework in which abstract modules, derived from previous construals, serve as composable building blocks for new ones. This modular reuse allows for flexible adaptation of policies and representations across domains with shared structural essence.
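For reference, these are the standard (total) MDP homomorphism conditions, which the paper relaxes to partial maps defined only on solution-relevant fragments: a pair h = (f, {g_s}) from M = (S, A, P, R) to M' = (S', A', P', R') must preserve rewards and aggregated transition probabilities.

```latex
R'\bigl(f(s),\,g_s(a)\bigr) = R(s,a), \qquad
P'\bigl(f(s')\,\big|\,f(s),\,g_s(a)\bigr)
  = \sum_{s'' \in f^{-1}(f(s'))} P\bigl(s''\,\big|\,s,\,a\bigr)
```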
[340] confopt: A Library for Implementation and Evaluation of Gradient-based One-Shot NAS Methods
Abhash Kumar Jha, Shakiba Moradian, Arjun Krishnakumar, Martin Rapp, Frank Hutter
Main category: cs.LG
TL;DR: This paper introduces Configurable Optimizer (confopt), an extensible library for gradient-based one-shot neural architecture search (NAS) that addresses fragmentation issues and reveals critical flaws in current evaluation methods through new DARTS-based benchmarks.
Details
Motivation: The field of gradient-based one-shot NAS faces two major challenges: overreliance on the DARTS benchmark leading to saturation with improvements within noise margins, and fragmented implementations across repositories that complicate fair comparisons and reproducible research.
Method: The authors develop Configurable Optimizer (confopt), an extensible library with a minimal API that allows easy integration of new search spaces and supports decomposition of NAS optimizers into core components. They create new DARTS-based benchmarks and implement a novel evaluation protocol.
Result: The framework successfully streamlines development and evaluation of gradient-based one-shot NAS methods. The new benchmarks and evaluation protocol reveal a critical flaw in how current gradient-based one-shot NAS methods are assessed.
Conclusion: Confopt provides a unified framework that addresses implementation fragmentation in gradient-based NAS research and exposes fundamental issues in current evaluation practices, potentially leading to more reliable and fair comparisons in the field.
Abstract: Gradient-based one-shot neural architecture search (NAS) has significantly reduced the cost of exploring architectural spaces with discrete design choices, such as selecting operations within a model. However, the field faces two major challenges. First, evaluations of gradient-based NAS methods heavily rely on the DARTS benchmark, despite the existence of other available benchmarks. This overreliance has led to saturation, with reported improvements often falling within the margin of noise. Second, implementations of gradient-based one-shot NAS methods are fragmented across disparate repositories, complicating fair and reproducible comparisons and further development. In this paper, we introduce Configurable Optimizer (confopt), an extensible library designed to streamline the development and evaluation of gradient-based one-shot NAS methods. Confopt provides a minimal API that makes it easy for users to integrate new search spaces, while also supporting the decomposition of NAS optimizers into their core components. We use this framework to create a suite of new DARTS-based benchmarks, and combine them with a novel evaluation protocol to reveal a critical flaw in how gradient-based one-shot NAS methods are currently assessed. The code can be found at https://github.com/automl/ConfigurableOptimizer.
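For readers new to the family of methods confopt targets, the sketch below shows the core component of DARTS-style one-shot NAS: a softmax-relaxed mixture over candidate operations whose architecture parameters are trained by an alternating (bilevel) update. This is a generic illustration, not confopt's actual API.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixedOp(nn.Module):
    """Continuous relaxation at the core of DARTS-style one-shot NAS:
    the output is a softmax-weighted sum of candidate operations, so the
    architecture parameters (alpha) can be learned by gradient descent."""
    def __init__(self, channels):
        super().__init__()
        self.ops = nn.ModuleList([
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.Conv2d(channels, channels, 5, padding=2),
            nn.Identity(),                         # skip-connection candidate
        ])
        self.alpha = nn.Parameter(torch.zeros(len(self.ops)))

    def forward(self, x):
        w = F.softmax(self.alpha, dim=0)
        return sum(wi * op(x) for wi, op in zip(w, self.ops))

# Alternating bilevel update with a toy loss: network weights on one
# batch, architecture parameters on another -- the kind of optimizer
# decomposition confopt aims to expose as swappable components.
op = MixedOp(8)
w_opt = torch.optim.SGD([p for n, p in op.named_parameters() if n != "alpha"], lr=0.01)
a_opt = torch.optim.Adam([op.alpha], lr=3e-4)
x = torch.randn(2, 8, 16, 16)
loss = op(x).pow(2).mean()
op.zero_grad(); loss.backward(); w_opt.step()      # inner step: weights
loss = op(x).pow(2).mean()
op.zero_grad(); loss.backward(); a_opt.step()      # outer step: architecture
print(F.softmax(op.alpha, dim=0))
```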
[341] Symbolic Graph Intelligence: Hypervector Message Passing for Learning Graph-Level Patterns with Tsetlin Machines
Christian D. Blakely
Main category: cs.LG
TL;DR: A symbolic framework for graph classification using sparse binary hypervectors and Tsetlin Machines that creates interpretable, hierarchical graph representations through structured message passing while maintaining competitive accuracy with neural models.
Details
Motivation: Neural graph models lack interpretability and transparency, making it difficult to understand how they make decisions. There is a need for graph classification methods that can provide both competitive performance and local interpretability to understand the reasoning behind predictions.
Method: A multilayered symbolic framework that encodes graphs using sparse binary hypervectors and Tsetlin Machines. The method uses structured message passing to bind and bundle node, edge, and attribute information into symbolic hypervectors, preserving hierarchical semantics through layered binding from node attributes to edge relations to structural roles, resulting in compact discrete representations.
Result: The method demonstrates competitive accuracy on TUDataset benchmarks compared to neural graph models while providing strong symbolic transparency and local interpretability capabilities.
Conclusion: The proposed symbolic framework successfully combines competitive graph classification performance with interpretability, offering a transparent alternative to black-box neural graph models through hierarchical hypervector representations that preserve graph semantics.
Abstract: We propose a multilayered symbolic framework for general graph classification that leverages sparse binary hypervectors and Tsetlin Machines. Each graph is encoded through structured message passing, where node, edge, and attribute information are bound and bundled into a symbolic hypervector. This process preserves the hierarchical semantics of the graph through layered binding from node attributes to edge relations to structural roles, resulting in a compact, discrete representation. We also formulate a local interpretability framework, which underpins a key advantage of our approach: local interpretability. We validate our method on TUDataset benchmarks, demonstrating competitive accuracy with strong symbolic transparency compared to neural graph models.
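A minimal sketch of the bind-and-bundle primitives such hyperdimensional encodings typically rest on. XOR binding and majority/OR bundling are common choices in the hypervector literature; the paper's exact operators for sparse hypervectors may differ.

```python
import numpy as np

DIM = 10_000
rng = np.random.default_rng(0)

def random_hv(sparsity=0.02):
    """Sparse binary hypervector: a few active bits in a long vector."""
    return (rng.random(DIM) < sparsity).astype(np.uint8)

def bind(a, b):
    """Binding (here XOR) associates two symbols into one hypervector
    that is dissimilar to both inputs."""
    return a ^ b

def bundle(*hvs):
    """Bundling (majority/OR) superimposes symbols so the result stays
    similar to each input -- used to aggregate a node's neighborhood."""
    return (np.sum(hvs, axis=0) >= max(1, len(hvs) // 2)).astype(np.uint8)

# One message-passing step for a node: bind the edge relation to the
# neighbor's vector, then bundle the result with the node's attributes.
node_attr, edge_rel, neighbor = random_hv(), random_hv(), random_hv()
node_repr = bundle(node_attr, bind(edge_rel, neighbor))
print(node_repr.sum(), "active bits")
```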
[342] A Comprehensive Data-centric Overview of Federated Graph Learning
Zhengyu Wu, Xunkai Li, Yinlin Zhu, Zekai Chen, Guochen Yan, Yanyu Yan, Hao Zhang, Yuming Ai, Xinmo Jin, Rong-Hua Li, Guoren Wang
Main category: cs.LG
TL;DR: This survey proposes a novel data-centric taxonomy for Federated Graph Learning (FGL) that categorizes research based on data characteristics and utilization, moving beyond existing methodology-focused taxonomies to better assess how FGL handles data-centric constraints and enhances model performance.
Details
Motivation: Existing FGL surveys focus primarily on integrating Federated Learning and Graph Machine Learning with methodology-centric taxonomies and simulated scenarios. There is a lack of data-centric perspective that systematically examines FGL methods through data properties and usage, which is critical for understanding how FGL tackles data-centric constraints to improve model performance.
Method: The paper proposes a two-level data-centric taxonomy: (1) Data Characteristics - categorizing studies based on structural and distributional properties of datasets used in FGL, and (2) Data Utilization - analyzing training procedures and techniques to overcome data-centric challenges. Each taxonomy level uses three orthogonal criteria representing distinct data-centric configurations.
Result: The survey successfully reorganizes FGL research through a data-centric lens, examines FGL integration with Pretrained Large Models, showcases realistic applications, and provides a comprehensive framework for understanding how different FGL approaches handle various data properties and constraints.
Conclusion: The data-centric taxonomy provides a more systematic and practical framework for organizing FGL research compared to existing methodology-focused approaches. This perspective better captures how FGL methods address real-world data challenges and offers insights for future research directions aligned with emerging trends in Graph Machine Learning.
Abstract: In the era of big data applications, Federated Graph Learning (FGL) has emerged as a prominent solution that reconciles the tradeoff between optimizing the collective intelligence of decentralized dataset holders and preserving sensitive information to the greatest extent possible. Existing FGL surveys have contributed meaningfully but largely focus on integrating Federated Learning (FL) and Graph Machine Learning (GML), resulting in early-stage taxonomies that emphasize methodology and simulated scenarios. Notably, a data-centric perspective, which systematically examines FGL methods through the lens of data properties and usage, has yet to be adopted to reorganize FGL research, even though it is critical to assessing how FGL studies manage to tackle data-centric constraints to enhance model performance. This survey proposes a two-level data-centric taxonomy: Data Characteristics, which categorizes studies based on the structural and distributional properties of datasets used in FGL, and Data Utilization, which analyzes the training procedures and techniques employed to overcome key data-centric challenges. Each taxonomy level is defined by three orthogonal criteria, each representing a distinct data-centric configuration. Beyond taxonomy, this survey examines FGL integration with Pretrained Large Models, showcases realistic applications, and highlights future directions aligned with emerging trends in GML.
[343] Families of Optimal Transport Kernels for Cell Complexes
Rahul Khorana
Main category: cs.LG
TL;DR: This paper develops machine learning methods for CW complexes by deriving Wasserstein distance expressions using Hodge-Laplacian matrices and introducing novel kernels based on optimal transport theory.
Details
Motivation: Cell complexes are recognized as ideal learning representations, but there is a significant lack of available machine learning methods suitable for learning on CW complexes, creating a gap between the theoretical potential and practical applications.
Method: The authors derive an explicit expression for Wasserstein distance between cell complex signal distributions using Hodge-Laplacian matrices, extend the Fused Gromov-Wasserstein distance to CW complexes to incorporate both feature and structure information, and introduce novel kernels over probability measures on CW complexes based on dual formulation of optimal transport.
Result: The paper successfully provides a structurally meaningful measure to compare CW complexes and defines optimal transportation maps, while developing kernels that can handle probability measures on CW complexes using optimal transport principles.
Conclusion: The work bridges the gap between cell complex theory and practical machine learning by providing concrete mathematical tools (Wasserstein distances and optimal transport kernels) that enable learning on CW complexes while preserving both structural and feature information.
Abstract: Recent advances have discussed cell complexes as ideal learning representations. However, there is a lack of available machine learning methods suitable for learning on CW complexes. In this paper, we derive an explicit expression for the Wasserstein distance between cell complex signal distributions in terms of a Hodge-Laplacian matrix. This leads to a structurally meaningful measure to compare CW complexes and define the optimal transportation map. In order to simultaneously include both feature and structure information, we extend the Fused Gromov-Wasserstein distance to CW complexes. Finally, we introduce novel kernels over the space of probability measures on CW complexes based on the dual formulation of optimal transport.
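As context, for centered Gaussian signal distributions the 2-Wasserstein distance has the closed (Bures) form below; reading each covariance as the pseudo-inverse of a complex's Hodge Laplacian, Σ_k = L_k†, gives an expression of the flavor the paper derives. This is our paraphrase of the setup, not the paper's final formula.

```latex
W_2^2\bigl(\mathcal{N}(0,\Sigma_1),\,\mathcal{N}(0,\Sigma_2)\bigr)
= \operatorname{tr}\!\Bigl(\Sigma_1 + \Sigma_2
  - 2\bigl(\Sigma_1^{1/2}\,\Sigma_2\,\Sigma_1^{1/2}\bigr)^{1/2}\Bigr),
\qquad \Sigma_k = L_k^{\dagger}
```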
[344] Scaling Linear Attention with Sparse State Expansion
Yuqi Pan, Yongqi An, Zheng Li, Yuhong Chou, Ruijie Zhu, Xiaohui Wang, Mingxuan Wang, Jinqiao Wang, Guoqi Li
Main category: cs.LG
TL;DR: The paper proposes Sparse State Expansion (SSE), a novel linear attention mechanism that addresses Transformer’s quadratic complexity in long contexts by using row-sparse updates and state partitioning, achieving strong performance in retrieval and mathematical reasoning tasks.
Details
Motivation: Transformer architectures suffer from quadratic computation and linear memory growth in long-context scenarios. Existing linear attention variants that compress context into fixed-size states often degrade performance in tasks like in-context retrieval and reasoning, creating a need for more effective context compression methods.
Method: The paper introduces two key innovations: (1) A row-sparse update formulation for linear attention that conceptualizes state updating as information classification, enabling sparse state updates via softmax-based top-k hard classification; (2) Sparse State Expansion (SSE) that expands contextual state into multiple partitions, decoupling parameter size from state capacity while maintaining sparse classification paradigm.
Result: SSE demonstrates strong retrieval performance and favorable scaling with state size. After reinforcement learning training, the 2B SSE-H model achieves state-of-the-art mathematical reasoning performance among small models, scoring 64.7 on AIME24 and 51.3 on AIME25, significantly outperforming similarly sized open-source Transformers across language modeling, in-context retrieval, and mathematical reasoning benchmarks.
Conclusion: SSE represents a promising and efficient architecture for long-context modeling, successfully addressing the limitations of both standard Transformers and existing linear attention methods by providing effective context compression without significant performance degradation in critical tasks.
Abstract: The Transformer architecture, despite its widespread success, struggles with long-context scenarios due to quadratic computation and linear memory growth. While various linear attention variants mitigate these efficiency constraints by compressing context into fixed-size states, they often degrade performance in tasks such as in-context retrieval and reasoning. To address this limitation and achieve more effective context compression, we propose two key innovations. First, we introduce a row-sparse update formulation for linear attention by conceptualizing state updating as information classification. This enables sparse state updates via softmax-based top-$k$ hard classification, thereby extending receptive fields and reducing inter-class interference. Second, we present Sparse State Expansion (SSE) within the sparse framework, which expands the contextual state into multiple partitions, effectively decoupling parameter size from state capacity while maintaining the sparse classification paradigm. Our design, supported by efficient parallelized implementations, yields effective classification and discriminative state representations. We extensively validate SSE in both pure linear and hybrid (SSE-H) architectures across language modeling, in-context retrieval, and mathematical reasoning benchmarks. SSE demonstrates strong retrieval performance and scales favorably with state size. Moreover, after reinforcement learning (RL) training, our 2B SSE-H model achieves state-of-the-art mathematical reasoning performance among small reasoning models, scoring 64.7 on AIME24 and 51.3 on AIME25, significantly outperforming similarly sized open-source Transformers. These results highlight SSE as a promising and efficient architecture for long-context modeling.
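The following toy sketch illustrates the row-sparse update idea: score the state rows against the incoming key, keep only the top-k rows via hard classification, and write the value into the surviving rows. The fixed random codebook of row keys is our simplification; SSE learns these quantities.

```python
import torch
import torch.nn.functional as F

def sparse_state_update(state, key, value, codebook, k=4):
    """Row-sparse linear-attention update (schematic): classify the
    incoming token into a few state rows and update only those rows,
    leaving the rest of the expanded state untouched."""
    logits = codebook @ key                       # (rows,) row scores
    topk = logits.topk(k).indices                 # hard top-k selection
    gate = torch.zeros(state.shape[0])
    gate[topk] = F.softmax(logits[topk], dim=0)   # normalize survivors
    return state + gate.unsqueeze(1) * value.unsqueeze(0)

rows, d_k, d_v = 16, 8, 32
codebook = torch.randn(rows, d_k)                 # assumed learned row keys
state = torch.zeros(rows, d_v)                    # expanded contextual state
state = sparse_state_update(state, torch.randn(d_k), torch.randn(d_v), codebook)
print((state.abs().sum(dim=1) > 0).sum().item(), "rows touched")  # k rows
```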
[345] Meta-Learning for Cold-Start Personalization in Prompt-Tuned LLMs
Yushang Zhao, Huijie Shen, Dannier Li, Lu Chang, Chengrui Zhou, Yinuo Yang
Main category: cs.LG
TL;DR: This paper proposes a meta-learning framework for LLM-based recommender systems that uses parameter-efficient prompt-tuning to quickly personalize recommendations for cold-start users, achieving real-time performance and outperforming baselines on multiple datasets.
Details
Motivation: Current LLM-based recommender systems struggle with cold-start users who have little to no interaction history, and existing solutions like supervised fine-tuning and collaborative filtering are expensive to maintain and update while being focused on dense user-item interactions.
Method: The framework employs meta-learning with first-order (Reptile) and second-order (MAML) optimization, treating each user as a task. It learns soft prompt embeddings as differentiable control variables representing user behavioral priors, using episodic sampling, inner-loop adaptation, and outer-loop generalization for meta-optimization.
Result: The model outperforms strong baselines on MovieLens-1M, Amazon Reviews, and Recbole datasets in NDCG@10, HR@10, and MRR metrics, while running in real-time (below 300 ms) on consumer GPUs. It supports zero-history personalization with a 275 ms adaptation time.
Conclusion: The framework enables scalable real-time personalization for cold-start scenarios and can be applied to financial risk profiling systems, reducing detection latency and improving payment network stability, thereby strengthening national financial infrastructure resilience.
Abstract: Generative, explainable, and flexible recommender systems derived using Large Language Models (LLMs) are promising but poorly adapted to the cold-start user situation, where there is little to no history of interaction. The current solutions, i.e., supervised fine-tuning and collaborative filtering, are dense user-item focused and would be expensive to maintain and update. This paper introduces a meta-learning framework that can be used to perform parameter-efficient prompt-tuning, to effectively personalize LLM-based recommender systems quickly at cold-start. The model learns soft prompt embeddings with first-order (Reptile) and second-order (MAML) optimization by treating each of the users as a task. As augmentations to the input tokens, these learnable vectors are the differentiable control variables that represent user behavioral priors. The prompts are meta-optimized through episodic sampling, inner-loop adaptation, and outer-loop generalization. On MovieLens-1M, Amazon Reviews, and Recbole, we can see that our adaptive model outperforms strong baselines in NDCG@10, HR@10, and MRR, and it runs in real-time (i.e., below 300 ms) on consumer GPUs. Zero-history personalization is also supported by this scalable solution, and its 275 ms rate of adaptation allows successful real-time risk profiling of financial systems by shortening detection latency and improving payment network stability. Crucially, the 275 ms adaptation capability can enable real-time risk profiling for financial institutions, reducing systemic vulnerability detection latency significantly versus traditional compliance checks. By preventing contagion in payment networks (e.g., Fedwire), the framework strengthens national financial infrastructure resilience.
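A minimal sketch of the first-order (Reptile) variant applied to soft prompts, with each user treated as a task. Here `loss_fn` stands in for running the frozen LLM with the prompt prepended; the quadratic toy loss in the usage example is ours.

```python
import torch

def reptile_prompt_update(prompt, user_batches, loss_fn, inner_lr=1e-2,
                          meta_lr=0.1, inner_steps=3):
    """First-order Reptile over soft prompt embeddings: adapt a copy of
    the prompt on one user's few interactions, then move the meta-prompt
    a fraction of the way toward the adapted copy."""
    adapted = prompt.detach().clone().requires_grad_(True)
    for _ in range(inner_steps):
        for batch in user_batches:
            loss = loss_fn(adapted, batch)
            (grad,) = torch.autograd.grad(loss, adapted)
            adapted = (adapted - inner_lr * grad).detach().requires_grad_(True)
    with torch.no_grad():
        prompt += meta_lr * (adapted - prompt)     # Reptile outer step
    return prompt

# Toy check with a quadratic stand-in for the LLM loss.
prompt = torch.zeros(4, 16)                        # 4 soft tokens, dim 16
target = torch.ones(4, 16)
loss_fn = lambda p, _: ((p - target) ** 2).mean()
prompt = reptile_prompt_update(prompt, user_batches=[None], loss_fn=loss_fn)
print(prompt.mean().item())                        # moved toward the task
```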
[346] GASPnet: Global Agreement to Synchronize Phases
Andrea Alamiaa, Sabine Muzellec, Thomas Serre, Rufin VanRullen
Main category: cs.LG
TL;DR: This paper proposes a novel neural network mechanism inspired by brain synchrony that uses angular phases and Kuramoto dynamics to solve visual binding problems, achieving better accuracy and noise robustness than CNNs on multi-object classification tasks.
Details
Motivation: Transformer attention mechanisms are insufficient for multi-classification tasks requiring feature binding from multiple objects. The visual binding problem in neural networks needs to be addressed by separating features from different objects while binding features from the same object, inspired by the neuroscience theory of binding by synchrony.
Method: The authors incorporate angular phases into all layers of a convolutional network and use Kuramoto dynamics to achieve phase alignment. This creates a mechanism that enhances operations between neurons with similar phases while suppressing those with opposite phases, combining Transformer attention with binding by synchrony theory.
Result: The proposed mechanism shows better accuracy than CNN networks on two datasets: pairs of digits and MNIST items superimposed on CIFAR-10 images. The approach demonstrates improved noise robustness and better generalization abilities compared to baseline methods.
Conclusion: The paper successfully addresses the visual binding problem in neural networks by leveraging synergy between neuroscience (binding by synchrony) and machine learning (attention mechanisms), providing a novel solution that outperforms traditional CNNs in multi-object scenarios.
Abstract: In recent years, Transformer architectures have revolutionized most fields of artificial intelligence, relying on an attentional mechanism based on the agreement between keys and queries to select and route information in the network. In previous work, we introduced a novel, brain-inspired architecture that leverages a similar implementation to achieve a global ‘routing by agreement’ mechanism. Such a system modulates the network’s activity by matching each neuron’s key with a single global query, pooled across the entire network. Acting as a global attentional system, this mechanism improves noise robustness over baseline levels but is insufficient for multi-classification tasks. Here, we improve on this work by proposing a novel mechanism that combines aspects of the Transformer attentional operations with a compelling neuroscience theory, namely, binding by synchrony. This theory proposes that the brain binds together features by synchronizing the temporal activity of neurons encoding those features. This allows the binding of features from the same object while efficiently disentangling those from distinct objects. We drew inspiration from this theory and incorporated angular phases into all layers of a convolutional network. After achieving phase alignment via Kuramoto dynamics, we use this approach to enhance operations between neurons with similar phases and suppress those with opposite phases. We test the benefits of this mechanism on two datasets: one composed of pairs of digits and one composed of a combination of an MNIST item superimposed on a CIFAR-10 image. Our results reveal better accuracy than CNN networks, proving more robust to noise and with better generalization abilities. Overall, we propose a novel mechanism that addresses the visual binding problem in neural networks by leveraging the synergy between neuroscience and machine learning.
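The Kuramoto dynamics at the heart of the mechanism are easy to state: each unit's phase is pulled toward the phases it is coupled with, so coupled features synchronize. A minimal sketch follows; the hand-built coupling matrix is our simplification, whereas GASPnet derives coupling from the network's key/query agreement.

```python
import numpy as np

def kuramoto_align(theta, coupling, steps=50, eps=0.1):
    """Kuramoto update: d(theta_i) ~ sum_j K_ij sin(theta_j - theta_i),
    so strongly coupled units end up phase-synchronized."""
    for _ in range(steps):
        diff = theta[None, :] - theta[:, None]     # theta_j - theta_i
        theta = theta + eps * (coupling * np.sin(diff)).sum(axis=1)
    return theta

rng = np.random.default_rng(0)
theta = rng.uniform(0, 2 * np.pi, size=6)
# Two groups of 3 units, coupled within groups only (one per object).
coupling = np.kron(np.eye(2), np.ones((3, 3)))
theta = kuramoto_align(theta, coupling)
# Gain between units i, j ~ cos(theta_i - theta_j): same-phase pairs are
# enhanced, anti-phase pairs suppressed -- the binding mechanism.
print(np.round(np.cos(theta[None, :] - theta[:, None]), 2))
```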
[347] Custom Algorithm-based Fault Tolerance for Attention Layers in Transformers
Vasileios Titopoulos, Kosmas Alexandridis, Giorgos Dimitrakopoulos
Main category: cs.LG
TL;DR: Flash-ABFT is a novel error detection method for attention mechanism accelerators that computes a single online checksum across the entire three-matrix product (query, key, value) including softmax operation, achieving high fault detection with only 5.3% area and 1.9% energy overhead.
Details
Motivation: Traditional algorithm-based fault tolerance (ABFT) techniques can only verify individual matrix multiplications but cannot handle the complete attention mechanism, especially due to intermediate softmax normalization. There is a need for efficient error detection in specialized hardware accelerators for Transformers and LLMs that can handle the full attention computation.
Method: Flash-ABFT computes an online checksum across the entire three-matrix product of query, key, and value matrices in an attention layer, including the softmax operation, using a single comprehensive check instead of multiple individual matrix multiplication verifications.
Result: Flash-ABFT achieves high fault-detection accuracy while incurring only 5.3% hardware area overhead and less than 1.9% energy overhead. The method significantly reduces computational overhead by eliminating redundant checks compared to traditional ABFT approaches.
Conclusion: Flash-ABFT provides a cost-effective and robust solution for error detection in attention accelerators by enabling comprehensive fault tolerance across the complete attention mechanism with minimal hardware and energy overhead, making it practical for deployment in specialized AI hardware.
Abstract: Transformers and large language models (LLMs), powered by the attention mechanism, have transformed numerous AI applications, driving the need for specialized hardware accelerators. A major challenge in these accelerators is efficiently detecting errors caused by random hardware faults. Traditional algorithm-based fault tolerance (ABFT) techniques verify individual matrix multiplications but fall short in handling the full attention mechanism, particularly due to intermediate softmax normalization. This work proposes Flash-ABFT, a novel method that computes an online checksum across the entire three-matrix product of query, key and value matrices, of an attention layer, including the softmax operation, with a single check. This approach significantly reduces overhead by eliminating redundant checks while maintaining high fault-detection accuracy. Experimental results demonstrate that Flash-ABFT incurs only 5.3% hardware area overhead and less than 1.9% energy overhead, making it a cost-effective and robust solution for error detection in attention accelerators.
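For intuition, here is the classic per-matmul ABFT identity that Flash-ABFT generalizes across the fused attention product: the column sums of C = AB must equal (eᵀA)B, so one extra row of arithmetic checks the whole multiplication. A minimal sketch:

```python
import numpy as np

def abft_matmul(a, b, tol=1e-8):
    """Checksum-verified matrix multiply: compare the column sums of
    C = A @ B against (e^T A) @ B, where e is the all-ones vector.
    Any single random fault in the product perturbs the comparison."""
    c = a @ b
    checksum_row = (np.ones(a.shape[0]) @ a) @ b   # cheap: one vector-matmul
    if not np.allclose(c.sum(axis=0), checksum_row, atol=tol):
        raise RuntimeError("hardware fault detected in matmul")
    return c

rng = np.random.default_rng(0)
c = abft_matmul(rng.normal(size=(4, 5)), rng.normal(size=(5, 3)))
print(c.sum(axis=0))
```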
[348] Latent Space Alignment for AI-Native MIMO Semantic Communications
Mario Edoardo Pandolfo, Simone Fiorellino, Emilio Calvanese Strinati, Paolo Di Lorenzo
Main category: cs.LG
TL;DR: This paper proposes a MIMO-based approach to solve semantic mismatches in communications by jointly performing latent space compression and channel equalization through learned precoder/decoder pairs.
Details
Motivation: Semantic communications can suffer from mismatches when devices use different languages, logic, or internal representations, which hinders mutual understanding and task completion in information exchange.
Method: The paper develops MIMO precoder/decoder pairs that jointly perform latent space compression and semantic channel equalization. Two solutions are explored: (1) a linear model optimized using biconvex optimization via ADMM, and (2) a neural network-based model that learns under power budget and complexity constraints.
Result: Numerical results demonstrate the effectiveness of the proposed approach in goal-oriented semantic communication scenarios, showing clear trade-offs between accuracy, communication burden, and solution complexity.
Conclusion: The proposed MIMO-based semantic communication method successfully addresses latent space misalignment while mitigating both semantic mismatches and physical channel impairments, providing flexible solutions with different complexity-performance trade-offs.
Abstract: Semantic communications focus on prioritizing the understanding of the meaning behind transmitted data and ensuring the successful completion of tasks that motivate the exchange of information. However, when devices rely on different languages, logic, or internal representations, semantic mismatches may occur, potentially hindering mutual understanding. This paper introduces a novel approach to addressing latent space misalignment in semantic communications, exploiting multiple-input multiple-output (MIMO) communications. Specifically, our method learns a MIMO precoder/decoder pair that jointly performs latent space compression and semantic channel equalization, mitigating both semantic mismatches and physical channel impairments. We explore two solutions: (i) a linear model, optimized by solving a biconvex optimization problem via the alternating direction method of multipliers (ADMM); (ii) a neural network-based model, which learns semantic MIMO precoder/decoder under transmission power budget and complexity constraints. Numerical results demonstrate the effectiveness of the proposed approach in a goal-oriented semantic communication scenario, illustrating the main trade-offs between accuracy, communication burden, and complexity of the solutions.
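To isolate the semantic-mismatch part of the problem, here is a ridge-regression baseline that learns a linear map between two mismatched latent spaces from paired anchor samples. The paper goes further, splitting this map into a MIMO precoder/decoder around the physical channel and solving the resulting biconvex problem with ADMM; this sketch ignores the channel entirely.

```python
import numpy as np

def fit_linear_alignment(z_tx, z_rx, lam=1e-3):
    """Ridge regression for latent-space alignment: find W such that
    W z_tx ~ z_rx on paired anchor samples from the two devices."""
    d = z_tx.shape[1]
    gram = z_tx.T @ z_tx + lam * np.eye(d)
    return np.linalg.solve(gram, z_tx.T @ z_rx).T  # shape (d_rx, d_tx)

rng = np.random.default_rng(0)
true_map = rng.normal(size=(16, 32))               # unknown mismatch
z_tx = rng.normal(size=(1000, 32))                 # transmitter latents
z_rx = z_tx @ true_map.T + 0.01 * rng.normal(size=(1000, 16))
W = fit_linear_alignment(z_tx, z_rx)
print(np.abs(W - true_map).max())                  # small: map recovered
```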
[349] Screen2AX: Vision-Based Approach for Automatic macOS Accessibility Generation
Viktor Muryn, Marta Sumyk, Mariya Hirna, Sofiya Garkot, Maksym Shamrai
Main category: cs.LG
TL;DR: Screen2AX is the first framework to automatically generate real-time, tree-structured accessibility metadata from screenshots, achieving 77% F1 score in reconstructing complete accessibility trees and delivering 2.2x performance improvement for autonomous agents on desktop interfaces.
Details
Motivation: Only 33% of macOS applications provide full accessibility support, leaving many users who depend on screen readers without proper access. Existing screen representation methods address specific challenges but fail to capture the full complexity of desktop interfaces' hierarchical structure.
Method: Uses vision-language and object detection models to detect, describe, and organize UI elements hierarchically from a single screenshot, mirroring macOS’s system-level accessibility structure. Created three datasets with 112 macOS applications annotated for UI element detection, grouping, and hierarchical accessibility metadata.
Result: Achieved 77% F1 score in reconstructing complete accessibility trees. Demonstrated 2.2x performance improvement over native accessibility representations and outperformed state-of-the-art OmniParser V2 system on the ScreenSpot benchmark using the new Screen2AX-Task benchmark.
Conclusion: Screen2AX successfully bridges the accessibility gap by automatically generating comprehensive accessibility metadata from screenshots, significantly improving both accessibility for users with disabilities and autonomous agent performance on desktop interfaces.
Abstract: Desktop accessibility metadata enables AI agents to interpret screens and supports users who depend on tools like screen readers. Yet, many applications remain largely inaccessible due to incomplete or missing metadata provided by developers - our investigation shows that only 33% of applications on macOS offer full accessibility support. While recent work on structured screen representation has primarily addressed specific challenges, such as UI element detection or captioning, none has attempted to capture the full complexity of desktop interfaces by replicating their entire hierarchical structure. To bridge this gap, we introduce Screen2AX, the first framework to automatically create real-time, tree-structured accessibility metadata from a single screenshot. Our method uses vision-language and object detection models to detect, describe, and organize UI elements hierarchically, mirroring macOS’s system-level accessibility structure. To tackle the limited availability of data for macOS desktop applications, we compiled and publicly released three datasets encompassing 112 macOS applications, each annotated for UI element detection, grouping, and hierarchical accessibility metadata alongside corresponding screenshots. Screen2AX accurately infers hierarchy trees, achieving a 77% F1 score in reconstructing a complete accessibility tree. Crucially, these hierarchy trees improve the ability of autonomous agents to interpret and interact with complex desktop interfaces. We introduce Screen2AX-Task, a benchmark specifically designed for evaluating autonomous agent task execution in macOS desktop environments. Using this benchmark, we demonstrate that Screen2AX delivers a 2.2x performance improvement over native accessibility representations and surpasses the state-of-the-art OmniParser V2 system on the ScreenSpot benchmark.
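A hypothetical schema (ours, not the paper's released format) for the kind of tree-structured accessibility metadata Screen2AX reconstructs: each detected UI element carries a role, a caption, a screen region, and grouped children.

```python
from dataclasses import dataclass, field

@dataclass
class AXNode:
    """Illustrative node of a reconstructed accessibility tree."""
    role: str                      # e.g. "button", "menu", "group"
    description: str               # caption from the vision-language model
    bbox: tuple                    # (x, y, width, height) in pixels
    children: list = field(default_factory=list)

    def to_dict(self):
        return {"role": self.role, "description": self.description,
                "bbox": self.bbox,
                "children": [c.to_dict() for c in self.children]}

toolbar = AXNode("group", "Main toolbar", (0, 0, 800, 40), [
    AXNode("button", "Back", (8, 4, 32, 32)),
    AXNode("button", "Forward", (48, 4, 32, 32)),
])
print(toolbar.to_dict())
```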
[350] Improving Model Classification by Optimizing the Training Dataset
Morad Tukan, Loay Mualem, Eitan Netzer, Liran Sigalat
Main category: cs.LG
TL;DR: This paper presents a systematic framework for improving coreset generation to enhance classification performance by introducing tunable parameters beyond traditional sensitivity scores, demonstrating superior results compared to vanilla coresets and full dataset training.
Details
Motivation: Conventional sensitivity-based coreset construction focuses on loss approximation rather than optimizing for classification performance metrics like F1 score, leading to suboptimal downstream classification quality in data-centric AI applications.
Method: The authors develop a systematic framework that introduces new tunable parameters including deterministic sampling, class-wise allocation, and refinement via active sampling, going beyond traditional sensitivity scores to optimize coreset generation for classification tasks.
Result: Through extensive experiments on diverse datasets and classifiers, tuned coresets significantly outperformed both vanilla coresets and full dataset training on key classification metrics, demonstrating improved efficiency and effectiveness in model training.
Conclusion: The proposed tunable coreset framework offers an effective path towards better and more efficient model training by systematically optimizing coreset generation for downstream classification quality rather than just loss approximation.
Abstract: In the era of data-centric AI, the ability to curate high-quality training data is as crucial as model design. Coresets offer a principled approach to data reduction, enabling efficient learning on large datasets through importance sampling. However, conventional sensitivity-based coreset construction often falls short in optimizing for classification performance metrics, e.g., $F1$ score, focusing instead on loss approximation. In this work, we present a systematic framework for tuning the coreset generation process to enhance downstream classification quality. Our method introduces new tunable parameters, including deterministic sampling, class-wise allocation, and refinement via active sampling, beyond traditional sensitivity scores. Through extensive experiments on diverse datasets and classifiers, we demonstrate that tuned coresets can significantly outperform both vanilla coresets and full dataset training on key classification metrics, offering an effective path towards better and more efficient model training.
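A minimal sketch of two of the tunable knobs named above, class-wise budget allocation and deterministic (top-sensitivity) selection; the sensitivity scores themselves are assumed to come from a standard coreset construction.

```python
import numpy as np

def classwise_coreset(labels, sensitivities, budget):
    """Split the sampling budget evenly across classes, then within
    each class pick points deterministically by descending sensitivity
    instead of by importance sampling."""
    classes = np.unique(labels)
    per_class = budget // len(classes)
    chosen = []
    for c in classes:
        idx = np.flatnonzero(labels == c)
        order = idx[np.argsort(-sensitivities[idx])]  # deterministic top-s
        chosen.extend(order[:per_class])
    return np.array(chosen)

rng = np.random.default_rng(0)
y = rng.integers(0, 4, size=200)
sens = rng.random(200)                                # assumed given
print(classwise_coreset(y, sens, budget=40).shape)    # (40,)
```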
[351] A Partitioned Sparse Variational Gaussian Process for Fast, Distributed Spatial Modeling
Michael Grosskopf, Kellin Rumsey, Ayan Biswas, Earl Lawrence
Main category: cs.LG
TL;DR: The paper proposes Partitioned SVGP (PSVGP), a scalable machine learning approach for exascale supercomputers that enables in situ training by allowing limited communication between neighboring spatial partitions to create smoother, more accurate predictions while maintaining high scalability and memory efficiency.
Details
Motivation: Next-generation exascale supercomputers will generate far more data than can be stored, making post-hoc analysis impossible. This creates an urgent need for sophisticated machine learning algorithms that can perform in situ training on spatially distributed data partitions while being highly scalable and memory efficient.
Method: The authors extend the independent Sparse Variational Gaussian Process (SVGP) approach by introducing limited communication between neighboring spatial partitions. This Partitioned SVGP (PSVGP) uses a decentralized communication scheme to encourage better alignment of local models at partition boundaries, reducing discontinuities in response surfaces.
Result: The PSVGP approach produces smoother spatial predictions and better overall fit compared to independent SVGP models, while maintaining high scalability and adding minimal computational overhead with no additional memory requirements. The method was demonstrated on the Energy Exascale Earth System Model (E3SM).
Conclusion: The proposed PSVGP method successfully addresses the discontinuity problem of independent spatial partition modeling while preserving the scalability and efficiency required for exascale computing environments. The approach provides a practical solution for in situ machine learning on distributed spatial data.
Abstract: The next generation of Department of Energy supercomputers will be capable of exascale computation. For these machines, far more computation will be possible than that which can be saved to disk. As a result, users will be unable to rely on post-hoc access to data for uncertainty quantification and other statistical analyses and there will be an urgent need for sophisticated machine learning algorithms which can be trained in situ. Algorithms deployed in this setting must be highly scalable, memory efficient and capable of handling data which is distributed across nodes as spatially contiguous partitions. One suitable approach involves fitting a sparse variational Gaussian process (SVGP) model independently and in parallel to each spatial partition. The resulting model is scalable, efficient and generally accurate, but produces the undesirable effect of constructing discontinuous response surfaces due to the disagreement between neighboring models at their shared boundary. In this paper, we extend this idea by allowing for a small amount of communication between neighboring spatial partitions which encourages better alignment of the local models, leading to smoother spatial predictions and a better fit in general. Due to our decentralized communication scheme, the proposed extension remains highly scalable and adds very little overhead in terms of computation (and none, in terms of memory). We demonstrate this Partitioned SVGP (PSVGP) approach for the Energy Exascale Earth System Model (E3SM) and compare the results to the independent SVGP case.
[352] Steering Out-of-Distribution Generalization with Concept Ablation Fine-Tuning
Helena Casademunt, Caden Juang, Adam Karvonen, Samuel Marks, Senthooran Rajamanoharan, Neel Nanda
Main category: cs.LG
TL;DR: The paper introduces Concept Ablation Fine-Tuning (CAFT), a method that uses interpretability tools to control LLM generalization during fine-tuning by ablating undesired concepts in latent space, without requiring training data modifications.
Details
Motivation: Fine-tuning large language models can cause unintended out-of-distribution generalization, and standard approaches requiring training data modification are not always practical. There's a need for methods to control LLM generalization without modifying training data or using target distribution data.
Method: CAFT leverages interpretability tools to identify directions in LLM latent space corresponding to undesired concepts, then uses linear projections to ablate these concepts during fine-tuning, steering the model away from unintended generalizations.
Result: CAFT was successfully applied to three fine-tuning tasks including emergent misalignment. It reduced misaligned responses by 10x without degrading performance on the training distribution, all without requiring changes to fine-tuning data.
Conclusion: CAFT represents a novel approach for steering LLM generalization during fine-tuning without the need to modify training data, offering a practical solution to control unintended out-of-distribution generalization in large language models.
Abstract: Fine-tuning large language models (LLMs) can lead to unintended out-of-distribution generalization. Standard approaches to this problem rely on modifying training data, for example by adding data that better specify the intended generalization. However, this is not always practical. We introduce Concept Ablation Fine-Tuning (CAFT), a technique that leverages interpretability tools to control how LLMs generalize from fine-tuning, without needing to modify the training data or otherwise use data from the target distribution. Given a set of directions in an LLM’s latent space corresponding to undesired concepts, CAFT works by ablating these concepts with linear projections during fine-tuning, steering the model away from unintended generalizations. We successfully apply CAFT to three fine-tuning tasks, including emergent misalignment, a phenomenon where LLMs fine-tuned on a narrow task generalize to give egregiously misaligned responses to general questions. Without any changes to the fine-tuning data, CAFT reduces misaligned responses by 10x without degrading performance on the training distribution. Overall, CAFT represents a novel approach for steering LLM generalization without modifying training data.
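The core operation is a linear projection that removes chosen directions from the model's activations during fine-tuning. A minimal sketch using a PyTorch forward hook; the random orthonormal directions stand in for concept directions that would be found with interpretability tools.

```python
import torch

def ablate_concepts(hidden, concept_dirs):
    """Project hidden states onto the orthogonal complement of the
    undesired concept directions, so the model cannot route information
    through them while it is being fine-tuned.
    concept_dirs: (k, d) with orthonormal rows."""
    coeffs = hidden @ concept_dirs.T               # (batch, k) components
    return hidden - coeffs @ concept_dirs          # remove each component

d, k = 64, 2
dirs, _ = torch.linalg.qr(torch.randn(d, k))       # orthonormalize columns
dirs = dirs.T                                      # (k, d) orthonormal rows
layer = torch.nn.Linear(d, d)
# The hook applies the ablation to every activation flowing through the layer.
layer.register_forward_hook(lambda mod, inp, out: ablate_concepts(out, dirs))
out = layer(torch.randn(8, d))
print((out @ dirs.T).abs().max().item())           # ~0: concepts removed
```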
[353] Beyond Binary Rewards: Training LMs to Reason About Their Uncertainty
Mehul Damani, Isha Puri, Stewart Slocum, Idan Shenfeld, Leshem Choshen, Yoon Kim, Jacob Andreas
Main category: cs.LG
TL;DR: This paper introduces RLCR (Reinforcement Learning with Calibration Rewards), a method that trains language models to generate both accurate predictions and well-calibrated confidence estimates by augmenting binary correctness rewards with Brier scores during reinforcement learning.
Details
Motivation: Current RL training for reasoning uses binary reward functions that improve accuracy but degrade calibration and increase hallucination rates. The authors aim to develop models that are both accurate and well-calibrated in their confidence estimates to create more reliable reasoning systems.
Method: RLCR modifies the RL training process by having language models generate both predictions and numerical confidence estimates, then optimizing a reward function that combines binary correctness scores with Brier scores (a proper scoring rule for confidence calibration). The method uses bounded, proper scoring rules to incentivize calibrated predictions.
Result: RLCR substantially improves calibration without sacrificing accuracy across diverse datasets, both in-domain and out-of-domain. It outperforms ordinary RL training and post-hoc confidence assignment methods. The verbalized confidence can be used at test time to further improve accuracy and calibration through confidence-weighted scaling.
Conclusion: Explicitly optimizing for calibration during RL training produces more reliable reasoning models. RLCR successfully addresses the calibration-accuracy trade-off problem in RL-trained language models, demonstrating that proper scoring rules can yield models that are both accurate and well-calibrated.
Abstract: When language models (LMs) are trained via reinforcement learning (RL) to generate natural language “reasoning chains”, their performance improves on a variety of difficult question answering tasks. Today, almost all successful applications of RL for reasoning use binary reward functions that evaluate the correctness of LM outputs. Because such reward functions do not penalize guessing or low-confidence outputs, they often have the unintended side-effect of degrading calibration and increasing the rate at which LMs generate incorrect responses (or “hallucinate”) in other problem domains. This paper describes RLCR (Reinforcement Learning with Calibration Rewards), an approach to training reasoning models that jointly improves accuracy and calibrated confidence estimation. During RLCR, LMs generate both predictions and numerical confidence estimates after reasoning. They are trained to optimize a reward function that augments a binary correctness score with a Brier score – a scoring rule for confidence estimates that incentivizes calibrated prediction. We first prove that this reward function (or any analogous reward function that uses a bounded, proper scoring rule) yields models whose predictions are both accurate and well-calibrated. We next show that across diverse datasets, RLCR substantially improves calibration with no loss in accuracy, on both in-domain and out-of-domain evaluations – outperforming both ordinary RL training and classifiers trained to assign post-hoc confidence scores. While ordinary RL hurts calibration, RLCR improves it. Finally, we demonstrate that verbalized confidence can be leveraged at test time to improve accuracy and calibration via confidence-weighted scaling methods. Our results show that explicitly optimizing for calibration can produce more generally reliable reasoning models.
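The reward itself is compact. A sketch with equal weighting of the correctness and Brier terms (the exact weighting in the paper may differ):

```python
def rlcr_reward(correct: bool, confidence: float) -> float:
    """Binary correctness augmented with a Brier score, a proper scoring
    rule: the policy maximizes expected reward by reporting a confidence
    equal to its true probability of being right."""
    y = 1.0 if correct else 0.0
    brier_penalty = (confidence - y) ** 2
    return y - brier_penalty

# Confident guessing is penalized; calibrated uncertainty is not.
print(rlcr_reward(correct=False, confidence=0.95))   # -0.9025
print(rlcr_reward(correct=False, confidence=0.10))   # -0.01
print(rlcr_reward(correct=True,  confidence=0.90))   #  0.99
```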
[354] Semi-off-Policy Reinforcement Learning for Vision-Language Slow-thinking Reasoning
Junhao Shen, Haiteng Zhao, Yuzhe Gu, Songyang Gao, Kuikun Liu, Haian Huang, Jianfei Gao, Dahua Lin, Wenwei Zhang, Kai Chen
Main category: cs.LG
TL;DR: SOPHIA introduces a semi-off-policy reinforcement learning approach that enhances large vision-language models with slow-thinking reasoning capabilities by combining on-policy visual understanding with off-policy reasoning from language models, achieving state-of-the-art performance on multimodal reasoning benchmarks.
Details
Motivation: Current large vision-language models (LVLMs) struggle with complex multimodal reasoning tasks because they are primarily trained for vision-language alignment. On-policy RL is limited by the model's initial abilities, while off-policy RL using external model trajectories can cause visual hallucinations due to mismatched visual perception capabilities across different models.
Method: SOPHIA uses a semi-off-policy behavior model that combines on-policy visual understanding from a trainable LVLM with off-policy slow-thinking reasoning from a language model. The method assigns outcome-based rewards to reasoning processes, propagates visual rewards backward, and trains the LVLM to learn slow-thinking reasoning from these trajectories using off-policy RL algorithms.
Result: SOPHIA improves InternVL3.0-38B by 8.50% on average across multimodal reasoning benchmarks, achieving state-of-the-art performance among open-source LVLMs. It outperforms closed-source models like GPT-4.1 on challenging benchmarks, reaching 49.08% pass@1 accuracy on MathVision and 49.95% on OlympiadBench. The method also outperforms supervised fine-tuning and direct on-policy RL approaches.
Conclusion: SOPHIA successfully addresses the limitations of existing RL approaches for LVLMs by providing an effective semi-off-policy framework that enhances slow-thinking reasoning capabilities while avoiding visual hallucinations. The method offers better policy initialization for further on-policy training and demonstrates significant improvements in multimodal reasoning performance.
Abstract: Enhancing large vision-language models (LVLMs) with visual slow-thinking reasoning is crucial for solving complex multimodal tasks. However, since LVLMs are mainly trained with vision-language alignment, it is difficult to adopt on-policy reinforcement learning (RL) to develop the slow thinking ability because the rollout space is restricted by its initial abilities. Off-policy RL offers a way to go beyond the current policy, but directly distilling trajectories from external models may cause visual hallucinations due to mismatched visual perception abilities across models. To address these issues, this paper proposes SOPHIA, a simple and scalable Semi-Off-Policy RL for vision-language slow-tHInking reAsoning. SOPHIA builds a semi-off-policy behavior model by combining on-policy visual understanding from a trainable LVLM with off-policy slow-thinking reasoning from a language model, assigns outcome-based rewards to reasoning, and propagates visual rewards backward. Then LVLM learns slow-thinking reasoning ability from the obtained reasoning trajectories using propagated rewards via off-policy RL algorithms. Extensive experiments with InternVL2.5 and InternVL3.0 with 8B and 38B sizes show the effectiveness of SOPHIA. Notably, SOPHIA improves InternVL3.0-38B by 8.50% in average, reaching state-of-the-art performance among open-source LVLMs on multiple multimodal reasoning benchmarks, and even outperforms some closed-source models (e.g., GPT-4.1) on the challenging MathVision and OlympiadBench, achieving 49.08% and 49.95% pass@1 accuracy, respectively. Analysis shows SOPHIA outperforms supervised fine-tuning and direct on-policy RL methods, offering a better policy initialization for further on-policy training.
[355] Learning from Data Streams: An Overview and Update
Jesse Read, Indrė Žliobaitė
Main category: cs.LG
TL;DR: This paper critiques current machine learning approaches for data streams, arguing that many fundamental assumptions are unrealistic or contradictory, and proposes reformulating supervised data-stream learning definitions while shifting research focus from artificial constraints to practical concerns like robustness, privacy, and interpretability.
Details
Motivation: The current literature on data-stream learning is based on overly strong or contradictory assumptions that don't hold in practice. Algorithms are designed for poorly defined problem settings, tested unrealistically, and isolated from broader literature, which limits their real-world impact and propagates misguided research directions.
Method: The authors reformulate fundamental definitions and settings of supervised data-stream learning considering contemporary views on concept drift and temporal dependence. They conduct an informal survey of industrial players dealing with real-world data streams and provide a fresh perspective on what constitutes supervised data-stream learning tasks and applicable algorithms.
Result: The analysis reveals that learning from data streams doesn’t necessarily require single-pass or online-learning approaches, and memory/time constraints aren’t specific to streaming contexts. Established techniques for handling temporal dependence and concept drift already exist in other literature areas.
Conclusion: The paper recommends shifting research focus from artificial constraints and assumptions about learning modes to more practically relevant issues such as robustness, privacy, and interpretability that are increasingly important in both academic and industrial data stream applications.
Abstract: The literature on machine learning in the context of data streams is vast and growing. However, many of the defining assumptions regarding data-stream learning tasks are too strong to hold in practice, or are even contradictory such that they cannot be met in the contexts of supervised learning. Algorithms are chosen and designed based on criteria which are often not clearly stated, for problem settings not clearly defined, tested in unrealistic settings, and/or in isolation from related approaches in the wider literature. This puts into question the potential for real-world impact of many approaches conceived in such contexts, and risks propagating a misguided research focus. We propose to tackle these issues by reformulating the fundamental definitions and settings of supervised data-stream learning with regard to contemporary considerations of concept drift and temporal dependence; and we take a fresh look at what constitutes a supervised data-stream learning task, and a reconsideration of algorithms that may be applied to tackle such tasks. Through and in reflection of this formulation and overview, helped by an informal survey of industrial players dealing with real-world data streams, we provide recommendations. Our main emphasis is that learning from data streams does not impose a single-pass or online-learning approach, or any particular learning regime; and any constraints on memory and time are not specific to streaming. Meanwhile, there exist established techniques for dealing with temporal dependence and concept drift, in other areas of the literature. For the data streams community, we thus encourage a shift in research focus, from dealing with often-artificial constraints and assumptions on the learning mode, to issues such as robustness, privacy, and interpretability which are increasingly relevant to learning in data streams in academic and industrial settings.
[356] Energy-Efficient and Real-Time Sensing for Federated Continual Learning via Sample-Driven Control
Minh Ngoc Luu, Minh-Duong Nguyen, Ebrahim Bedeer, Van Duc Nguyen, Dinh Thai Hoang, Diep N. Nguyen, Quoc-Viet Pham
Main category: cs.LG
TL;DR: This paper proposes SCFL, a sample-driven control technique for federated continual learning in real-time sensing systems that reduces energy consumption by up to 85% while maintaining performance by optimizing sampling processes using a novel soft actor-critic algorithm.
Details
Motivation: Real-Time Sensing (RTS) systems face significant challenges in federated continual learning due to diverse data characteristics that severely impact computational and communication resources, escalate energy costs, and degrade system performance. The paper addresses how data distribution shifts from ideal to practical RTS scenarios affect AI model performance.
Method: The authors develop SCFL (Sample-driven Control for Federated Continual Learning) which formulates an optimization problem that uses the sampling process to minimize generalization gap and improve accuracy while maintaining energy efficiency. They introduce a soft actor-critic algorithm with explicit and implicit constraints (A2C-EI) to solve the complex time-varying optimization problem.
Result: Empirical experiments show SCFL achieves higher efficiency compared to other deep reinforcement learning baselines. The technique can significantly reduce energy consumption by up to 85% while maintaining federated learning convergence and ensuring timely data transmission.
Conclusion: SCFL successfully addresses the energy efficiency challenges in federated continual learning for real-time sensing systems by leveraging intelligent sampling control, demonstrating substantial energy savings without compromising learning performance or communication timeliness.
Abstract: An intelligent Real-Time Sensing (RTS) system must continuously acquire, update, integrate, and apply knowledge to adapt to real-world dynamics. Managing distributed intelligence in this context requires Federated Continual Learning (FCL). However, effectively capturing the diverse characteristics of RTS data in FCL systems poses significant challenges, including severely impacting computational and communication resources, escalating energy costs, and ultimately degrading overall system performance. To overcome these challenges, we investigate how the data distribution shift from ideal to practical RTS scenarios affects Artificial Intelligence (AI) model performance by leveraging the generalization gap concept. In this way, we can analyze how sampling time in RTS correlates with the decline in AI performance, computation cost, and communication efficiency. Based on this observation, we develop a novel Sample-driven Control for Federated Continual Learning (SCFL) technique, specifically designed for mobile edge networks with RTS capabilities. In particular, SCFL is an optimization problem that harnesses the sampling process to concurrently minimize the generalization gap and improve overall accuracy while upholding the energy efficiency of the FCL framework. To solve the highly complex and time-varying optimization problem, we introduce a new soft actor-critic algorithm with explicit and implicit constraints (A2C-EI). Our empirical experiments reveal that we can achieve higher efficiency compared to other DRL baselines. Notably, SCFL can significantly reduce energy consumption by up to 85% while maintaining FL convergence and timely data transmission.
[357] Practical Insights into Knowledge Distillation for Pre-Trained Models
Norah Alballa, Ahmed M. Abdelmoniem, Marco Canini
Main category: cs.LG
TL;DR: This paper conducts a comprehensive comparison of knowledge distillation (KD) techniques for pre-trained models in federated learning environments, examining different KD methods and hyperparameter settings to optimize performance while reducing communication overhead.
Details
Motivation: Despite the adoption of numerous KD approaches for transferring knowledge among pre-trained models in distributed and federated learning environments, there is a lack of comprehensive understanding of KD's application in these scenarios. The research aims to fill this gap by systematically comparing different KD techniques and identifying optimal contexts for each.
Method: The study performs extensive comparison of multiple KD techniques including standard KD, tuned KD (optimized temperature and weight parameters), deep mutual learning, and data partitioning KD. The evaluation involves detailed hyperparameter tuning through grid search across various data distribution strategies to identify the most effective contexts for each method.
Result: The research identifies optimal hyperparameter settings for distinct data partitioning scenarios and demonstrates how KD can improve federated learning by minimizing communication rounds and expediting the training process. The findings provide insights into when hyperparameter adjustments are crucial for enhancing model performance.
Conclusion: The study provides a practical framework for leveraging knowledge distillation in pre-trained models within collaborative and federated learning frameworks, offering guidance on optimal configurations for different scenarios and highlighting KD’s potential to reduce communication demands while maintaining model performance.
Abstract: This research investigates the enhancement of knowledge distillation (KD) processes in pre-trained models, an emerging field in knowledge transfer with significant implications for distributed training and federated learning environments. These environments benefit from reduced communication demands and accommodate various model architectures. Despite the adoption of numerous KD approaches for transferring knowledge among pre-trained models, a comprehensive understanding of KD’s application in these scenarios is lacking. Our study conducts an extensive comparison of multiple KD techniques, including standard KD, tuned KD (via optimized temperature and weight parameters), deep mutual learning, and data partitioning KD. We assess these methods across various data distribution strategies to identify the most effective contexts for each. Through detailed examination of hyperparameter tuning, informed by extensive grid search evaluations, we pinpoint when adjustments are crucial to enhance model performance. This paper sheds light on optimal hyperparameter settings for distinct data partitioning scenarios and investigates KD’s role in improving federated learning by minimizing communication rounds and expediting the training process. By filling a notable void in current research, our findings serve as a practical framework for leveraging KD in pre-trained models within collaborative and federated learning frameworks.
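As a concrete illustration of the tuned-KD variant discussed above, here is a minimal PyTorch sketch of the temperature-scaled distillation objective whose temperature and weight parameters a grid search would tune. Names and default values are illustrative, not taken from the paper.

    import torch
    import torch.nn.functional as F

    def kd_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
        # Soft-target term: KL divergence between temperature-softened
        # distributions, scaled by T^2 so gradient magnitudes stay
        # comparable across temperatures.
        soft = F.kl_div(
            F.log_softmax(student_logits / T, dim=-1),
            F.softmax(teacher_logits / T, dim=-1),
            reduction="batchmean",
        ) * (T * T)
        # Hard-target term: ordinary cross-entropy on ground-truth labels.
        hard = F.cross_entropy(student_logits, labels)
        # alpha and T are exactly the knobs "tuned KD" would optimize.
        return alpha * soft + (1.0 - alpha) * hard

    # Toy usage with random logits for a 10-class problem.
    s = torch.randn(8, 10, requires_grad=True)
    t = torch.randn(8, 10)
    y = torch.randint(0, 10, (8,))
    kd_loss(s, t, y).backward()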
[358] DP-TLDM: Differentially Private Tabular Latent Diffusion Model
Chaoyi Zhu, Jiayi Tang, Juan F. Pérez, Marten van Dijk, Lydia Y. Chen
Main category: cs.LG
TL;DR: The paper proposes DPTLDM, a differentially private tabular latent diffusion model that generates synthetic data with better privacy-utility tradeoff than existing methods, achieving 35% improvement in data resemblance while maintaining privacy guarantees.
Details
Motivation: Existing synthetic data generation methods face a challenge in balancing high data quality with low privacy risk. Prior work has limited focus on tabular synthesizers and overlooks membership inference attacks and differential privacy defense strategies, creating a need for better privacy-preserving synthetic data generation.
Method: DPTLDM combines an autoencoder network to encode tabular data with a latent diffusion model to synthesize latent tables. The method applies DP-SGD with batch clipping to train the autoencoder under the f-DP framework, using separation value as the privacy metric to better capture privacy gains from differential privacy algorithms.
Result: DPTLDM achieves significant improvements over other DP-protected tabular generative models: 35% improvement in data resemblance, 15% improvement in utility for downstream tasks, and 50% improvement in data discriminability, while maintaining comparable privacy risk levels.
Conclusion: DPTLDM successfully addresses the privacy-utility tradeoff in synthetic tabular data generation by providing meaningful theoretical privacy guarantees while significantly enhancing synthetic data utility compared to existing differentially private methods.
Abstract: Synthetic data from generative models has emerged as a privacy-preserving data-sharing solution. Such a synthetic data set should resemble the original data without revealing identifiable private information. To date, prior work has focused on limited types of tabular synthesizers and a small number of privacy attacks, particularly against Generative Adversarial Networks, and has overlooked membership inference attacks and defense strategies, i.e., differential privacy. Motivated by the conundrum of keeping high data quality and low privacy risk of synthetic data tables, we propose DPTLDM, Differentially Private Tabular Latent Diffusion Model, which is composed of an autoencoder network to encode the tabular data and a latent diffusion model to synthesize the latent tables. Following the emerging f-DP framework, we apply DP-SGD to train the auto-encoder in combination with batch clipping and use the separation value as the privacy metric to better capture the privacy gain from DP algorithms. Our empirical evaluation demonstrates that DPTLDM is capable of achieving a meaningful theoretical privacy guarantee while also significantly enhancing the utility of synthetic data. Specifically, compared to other DP-protected tabular generative models, DPTLDM improves the synthetic quality by an average of 35% in data resemblance, 15% in the utility for downstream tasks, and 50% in data discriminability, all while preserving a comparable level of privacy risk.
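For intuition, here is a simplified sketch of one DP-SGD step with batch clipping (clip the aggregate batch gradient, then add Gaussian noise), the training scheme the paper applies to the autoencoder. The f-DP accounting and noise calibration are omitted, and all names and constants are illustrative assumptions.

    import torch

    def dp_sgd_step(model, loss, lr=0.1, clip_norm=1.0, noise_mult=0.1):
        model.zero_grad()
        loss.backward()
        params = [p for p in model.parameters() if p.grad is not None]
        # Batch clipping: rescale the aggregate gradient to L2 norm <= clip_norm.
        total_norm = torch.sqrt(sum((p.grad ** 2).sum() for p in params))
        scale = torch.clamp(clip_norm / (total_norm + 1e-12), max=1.0)
        with torch.no_grad():
            for p in params:
                # Gaussian noise proportional to the clipping norm.
                noise = noise_mult * clip_norm * torch.randn_like(p)
                p -= lr * (p.grad * scale + noise)

    # Toy autoencoder step on random "tabular" rows.
    ae = torch.nn.Sequential(torch.nn.Linear(16, 4), torch.nn.ReLU(),
                             torch.nn.Linear(4, 16))
    x = torch.randn(32, 16)
    dp_sgd_step(ae, torch.nn.functional.mse_loss(ae(x), x))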
[359] Learning Neural Differential Algebraic Equations via Operator Splitting
James Koch, Madelyn Shapiro, Himanshu Sharma, Draguna Vrabie, Jan Drgona
Main category: cs.LG
TL;DR: This paper presents an Operator Splitting (OS) numerical integration scheme for learning unknown components of differential algebraic equations (DAEs) from time-series data, with applications to tank-manifold dynamics and pump-tank-pipe networks.
Details
Motivation: Many physical systems are governed by differential algebraic equations (DAEs) that contain both differential and algebraic constraints, including implicit relationships like conservation laws. There is a need for data-driven methods that can learn unknown components of these systems from time-series data while respecting the underlying mathematical structure.
Method: The authors develop an Operator Splitting (OS) numerical integration scheme specifically designed for learning unknown components of DAEs from time-series data. The method is designed to handle systems with implicit relationships between components and can perform system-theoretic data-driven modeling tasks.
Result: The proposed method demonstrates robustness to noise and good extrapolation ability. It successfully learns system component behaviors and interaction physics in two key applications: (i) solving inverse problems in tank-manifold dynamics and (ii) discrepancy modeling in networks of pumps, tanks, and pipes. The method can effectively distinguish between data trends and mechanistic relationships in the system.
Conclusion: The OS-based time-stepping scheme is effective for data-driven modeling of DAE systems. It can learn unknown system components while maintaining physical consistency and demonstrates strong performance in real-world applications involving fluid dynamics and network systems.
Abstract: Differential algebraic equations (DAEs) describe the temporal evolution of systems that obey both differential and algebraic constraints. Of particular interest are systems that contain implicit relationships between their components, such as conservation laws. Here, we present an Operator Splitting (OS) numerical integration scheme for learning unknown components of DAEs from time-series data. In this work, we show that the proposed OS-based time-stepping scheme is suitable for relevant system-theoretic data-driven modeling tasks. Presented examples include (i) the inverse problem of tank-manifold dynamics and (ii) discrepancy modeling of a network of pumps, tanks, and pipes. Our experiments demonstrate the proposed method’s robustness to noise and extrapolation ability to (i) learn the behaviors of the system components and their interaction physics and (ii) disambiguate between data trends and mechanistic relationships contained in the system.
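The core idea of an operator-splitting time-stepper can be sketched in a few lines: alternate a step on the differential states with a re-solve of the algebraic constraint. The toy dynamics below (two levels relaxing toward a shared algebraic variable) are a stand-in, not the paper's tank physics; in the paper's setting, a learned network would replace the known right-hand side.

    import numpy as np

    def os_step(x, z, f, g_solve, dt):
        # Semi-explicit DAE  x' = f(x, z),  0 = g(x, z):
        x_new = x + dt * f(x, z)   # differential sub-step (explicit Euler)
        z_new = g_solve(x_new)     # algebraic sub-step (constraint solve)
        return x_new, z_new

    # Toy system: levels x relax toward z, and z is algebraically
    # pinned to the mean level (total volume is conserved).
    f = lambda x, z: -(x - z)
    g_solve = lambda x: x.mean()

    x, z = np.array([1.0, 0.2]), 0.6
    for _ in range(100):
        x, z = os_step(x, z, f, g_solve, dt=0.05)
    print(x, z)   # both levels equilibrate at the conserved mean 0.6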
[360] Exploring How Generative MLLMs Perceive More Than CLIP with the Same Vision Encoder
Siting Li, Pang Wei Koh, Simon Shaolei Du
Main category: cs.LG
TL;DR: This paper investigates why CLIP models underperform on visual reasoning tasks and finds that Generative Multimodal Large Language Models (MLLMs) using the same vision encoder achieve significantly better performance, attributing this to architectural design choices rather than data or encoder limitations.
Details
Motivation: CLIP models struggle with visual reasoning tasks requiring compositional grounding, spatial understanding, and fine-grained detail capture. The authors wanted to determine whether this limitation stems from the vision encoder's inability to capture essential information or from architectural design issues in how CLIP extracts and utilizes visual information.
Method: The researchers conducted controlled experiments comparing CLIP with Generative MLLMs that use the same vision encoder and weights. They systematically analyzed different architectural components including patch tokens, position embeddings, prompt-based weighting, training data enhancement, text encoder strength, and additional text tokens. They also tested whether fine-grained visual reasoning capabilities persist when MLLMs are converted to CLIP-like encoders through contrastive finetuning.
Result: Generative MLLMs significantly outperformed CLIP on visual reasoning tasks using identical vision encoders. The superior performance was attributed to key architectural design choices (patch tokens, position embeddings, prompt-based weighting) rather than enhanced training data or stronger text encoders. Even when converted to CLIP-like encoders via contrastive finetuning, MLLMs maintained better performance than original CLIP models under cosine similarity evaluation.
Conclusion: The study demonstrates that CLIP’s limitations in visual reasoning are primarily due to architectural design choices rather than vision encoder capabilities. The findings highlight the critical importance of VLM architectural decisions and provide guidance for improving CLIP-like contrastive vision-language models through better extraction and utilization of visual information.
Abstract: Recent research has shown that CLIP models struggle with visual reasoning tasks that require grounding compositionality, understanding spatial relationships, or capturing fine-grained details. One natural hypothesis is that the CLIP vision encoder does not embed essential information for these tasks. However, we find that this is not always the case: The encoder gathers query-relevant visual information, while CLIP fails to extract it. In particular, we show that another branch of Vision-Language Models (VLMs), Generative Multimodal Large Language Models (MLLMs), achieve significantly higher accuracy than CLIP in many of these tasks using the same vision encoder and weights, indicating that these Generative MLLMs perceive more – as they extract and utilize visual information more effectively. We conduct a series of controlled experiments and reveal that their success is attributed to multiple key design choices, including patch tokens, position embeddings, and prompt-based weighting. On the other hand, enhancing the training data alone or applying a stronger text encoder does not suffice to solve the task, and additional text tokens offer little benefit. Interestingly, we find that fine-grained visual reasoning is not exclusive to generative models trained by an autoregressive loss: When converted into CLIP-like encoders by contrastive finetuning, these MLLMs still outperform CLIP under the same cosine similarity-based evaluation protocol. Our study highlights the importance of VLM architectural choices and suggests directions for improving the performance of CLIP-like contrastive VLMs.
[361] Unisolver: PDE-Conditional Transformers Are Universal PDE Solvers
Hang Zhou, Yuezhou Ma, Haixu Wu, Haowen Wang, Mingsheng Long
Main category: cs.LG
TL;DR: Unisolver is a universal Transformer-based neural PDE solver that can handle diverse partial differential equations by embedding PDE components as conditions, achieving state-of-the-art performance on multiple benchmarks.
Details
Motivation: Existing neural PDE solvers are limited to specific instances of PDEs with restricted coefficients, preventing them from being practical surrogate models for numerical solvers due to poor generalization across diverse PDEs.
Method: The authors develop Unisolver, a Transformer model that embeds a complete set of PDE components (equation symbols, boundary conditions) as domain-wise and point-wise deep conditions, trained on diverse data and conditioned on diverse PDEs based on theoretical analysis of the PDE-solving process.
Result: Unisolver achieves consistent state-of-the-art performance on three challenging large-scale benchmarks, demonstrating impressive performance and generalizability across different types of PDEs.
Conclusion: By integrating physical insights with Transformer advances and embedding PDE components as conditions rather than purely scaling data and parameters, Unisolver successfully creates a universal neural PDE solver with superior generalization capabilities.
Abstract: Deep models have recently emerged as promising tools to solve partial differential equations (PDEs), known as neural PDE solvers. While neural solvers trained from either simulation data or physics-informed loss can solve PDEs reasonably well, they are mainly restricted to a few instances of PDEs, e.g. a certain equation with a limited set of coefficients. This limits their generalization to diverse PDEs, preventing them from being practical surrogate models of numerical solvers. In this paper, we present Unisolver, a novel Transformer model trained on diverse data and conditioned on diverse PDEs, aiming towards a universal neural PDE solver capable of solving a wide scope of PDEs. Instead of purely scaling up data and parameters, Unisolver stems from the theoretical analysis of the PDE-solving process. Inspired by the mathematical structure of PDEs that a PDE solution is fundamentally governed by a series of PDE components such as equation symbols and boundary conditions, we define a complete set of PDE components and flexibly embed them as domain-wise and point-wise deep conditions for Transformer PDE solvers. Integrating physical insights with recent Transformer advances, Unisolver achieves consistent state-of-the-art on three challenging large-scale benchmarks, showing impressive performance and generalizability. Code is available at https://github.com/thuml/Unisolver.
[362] Graph Neural Networks Gone Hogwild
Olga Solodova, Nick Richardson, Deniz Oktay, Ryan P. Adams
Main category: cs.LG
TL;DR: This paper proposes “energy GNNs”, a novel implicitly-defined graph neural network architecture that remains robust to asynchronous node updates during inference, addressing a critical limitation of traditional GNNs in distributed multi-agent systems.
Details
Motivation: Traditional GNNs fail catastrophically when nodes update asynchronously during inference, which excludes them from many real-world applications like robotic swarms or sensor networks where synchronous updates are difficult or impossible to enforce.
Method: The authors identify “implicitly-defined” GNNs as a class of architectures provably robust to asynchronous “hogwild” inference by adapting convergence guarantees from asynchronous and distributed optimization theory. They then develop a novel “energy GNN” architecture within this class.
Result: The proposed energy GNN architecture outperforms other implicitly-defined GNNs on various synthetic tasks inspired by multi-agent systems, demonstrating superior performance while maintaining robustness to asynchronous updates.
Conclusion: Energy GNNs successfully address the asynchrony problem in graph neural networks, providing a practical solution for distributed multi-agent systems where synchronous node updates cannot be guaranteed, thus expanding the applicability of GNNs to real-world scenarios.
Abstract: Graph neural networks (GNNs) appear to be powerful tools to learn state representations for agents in distributed, decentralized multi-agent systems, but generate catastrophically incorrect predictions when nodes update asynchronously during inference. This failure under asynchrony effectively excludes these architectures from many potential applications where synchrony is difficult or impossible to enforce, e.g., robotic swarms or sensor networks. In this work we identify “implicitly-defined” GNNs as a class of architectures which is provably robust to asynchronous “hogwild” inference, adapting convergence guarantees from work in asynchronous and distributed optimization. We then propose a novel implicitly-defined GNN architecture, which we call an “energy GNN”. We show that this architecture outperforms other GNNs from this class on a variety of synthetic tasks inspired by multi-agent systems.
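A small numerical sketch of why implicitly-defined GNNs tolerate “hogwild” updates: when the node-update map is a contraction, asynchronous single-node updates still converge to the same fixed point. The random graph and weights below are toys for illustration, not the paper's energy GNN.

    import numpy as np

    rng = np.random.default_rng(0)
    n, d = 6, 4
    A = (rng.random((n, n)) < 0.4).astype(float)   # random adjacency
    np.fill_diagonal(A, 0)
    # Small weight scale keeps the update map a contraction, the
    # property that makes asynchronous inference provably safe.
    W = 0.1 * rng.standard_normal((d, d))
    U = 0.1 * rng.standard_normal((d, d))
    b = rng.standard_normal((n, d))

    def node_update(X, v):
        agg = A[v] @ X   # sum of neighbour states
        return np.tanh(W @ X[v] + U @ agg + b[v])

    X = np.zeros((n, d))
    for _ in range(2000):            # "hogwild": one node at a time,
        v = rng.integers(n)          # in arbitrary order
        X[v] = node_update(X, v)
    # X now approximates the unique fixed point of the implicit layer.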
[363] Accurate and Efficient Fine-Tuning of Quantized Large Language Models Through Optimal Balance
Ao Shen, Qiang Wang, Zhiquan Lai, Xionglve Li, Dongsheng Li
Main category: cs.LG
TL;DR: The paper proposes Q-BLoRA and QA-BLoRA methods to address performance degradation when fine-tuning quantized Large Language Models with Low-Rank Adaptation by balancing adapter complexity and trainability.
Details
Motivation: Existing approaches that combine parameter quantization with LoRA for fine-tuning LLMs suffer from performance degradation due to an imbalance between overly complex adapter inputs/outputs and low effective trainability, leading to underfitting during fine-tuning.
Method: The paper introduces Q-BLoRA (Quantized LLMs fine-tuning with Balanced Low-Rank Adaptation) which simplifies adapter inputs/outputs while increasing adapter rank, and QA-BLoRA (Quantization-Aware fine-tuning with Balanced Low-Rank Adaptation) which aligns with block-wise quantization for low-precision deployment.
Result: Q-BLoRA consistently achieves state-of-the-art accuracy compared to baselines, while QA-BLoRA enables direct generation of low-precision inference models with significant performance improvements over other low-precision models across various models and scenarios.
Conclusion: The proposed Q-BLoRA and QA-BLoRA methods effectively solve the underfitting problem in quantized LLM fine-tuning by balancing adapter complexity and trainability, providing superior performance for both standard and low-precision model deployment.
Abstract: Large Language Models (LLMs) have demonstrated impressive performance across various domains. However, the enormous number of model parameters makes fine-tuning challenging, significantly limiting their application and deployment. Existing solutions combine parameter quantization with Low-Rank Adaptation (LoRA), reducing memory usage but causing performance degradation. Additionally, converting fine-tuned models to low-precision representations further degrades performance. In this paper, we identify an imbalance in fine-tuning quantized LLMs with LoRA: overly complex adapter inputs and outputs versus low effective trainability of the adapter, leading to underfitting during fine-tuning. Thus, we propose Quantized LLMs fine-tuning with Balanced Low-Rank Adaptation (Q-BLoRA), which simplifies the adapter’s inputs and outputs while increasing the adapter’s rank to alleviate underfitting during fine-tuning. For low-precision deployment, we propose Quantization-Aware fine-tuning with Balanced Low-Rank Adaptation (QA-BLoRA), which aligns with the block-wise quantization and facilitates quantization-aware fine-tuning of low-rank adaptation based on the parameter merging of Q-BLoRA. Both Q-BLoRA and QA-BLoRA are easily implemented and offer the following optimizations: (i) Q-BLoRA consistently achieves state-of-the-art accuracy compared to baselines and other variants; (ii) QA-BLoRA enables the direct generation of low-precision inference models, which exhibit significant performance improvements over other low-precision models. We validate the effectiveness of Q-BLoRA and QA-BLoRA across various models and scenarios. Code will be made available at https://github.com/xiaocaigou/qbaraqahira
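For reference, here is a generic LoRA scaffold showing the two quantities Q-BLoRA trades off: the width of what the adapter sees versus its rank r. The paper's specific input/output simplification and the quantization machinery are not reproduced here; this is only the standard low-rank-adapter skeleton with illustrative values.

    import torch

    class LoRALinear(torch.nn.Module):
        def __init__(self, d_in, d_out, r=32, alpha=32):
            super().__init__()
            # Frozen base weight; in Q-BLoRA this would be the quantized matrix.
            self.base = torch.nn.Linear(d_in, d_out, bias=False)
            self.base.weight.requires_grad_(False)
            # Trainable low-rank update B @ A; raising r is the paper's
            # remedy for the adapter's low effective trainability.
            self.A = torch.nn.Parameter(0.01 * torch.randn(r, d_in))
            self.B = torch.nn.Parameter(torch.zeros(d_out, r))
            self.scale = alpha / r

        def forward(self, x):
            return self.base(x) + self.scale * (x @ self.A.t() @ self.B.t())

    layer = LoRALinear(64, 64, r=32)
    print(layer(torch.randn(2, 64)).shape)   # torch.Size([2, 64])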
[364] FLAIN: Mitigating Backdoor Attacks in Federated Learning via Flipping Weight Updates of Low-Activation Input Neurons
Binbin Ding, Penghui Yang, Sheng-Jun Huang
Main category: cs.LG
TL;DR: This paper proposes FLAIN, a defense method against backdoor attacks in federated learning that identifies and flips weight updates of low-activation input neurons to neutralize backdoors while preserving clean data performance.
Details
Motivation: Federated learning is vulnerable to backdoor attacks from malicious clients since the central server cannot directly monitor local training processes. Existing research shows backdoor attacks exploit specific neurons activated only by malicious inputs, creating a security vulnerability that needs to be addressed.
Method: The proposed FLAIN method works by: (1) using an auxiliary dataset after global training completion to identify low-activation input neurons, (2) iteratively flipping the weight updates associated with these neurons, and (3) progressively raising the threshold for low-activation neurons until model performance on auxiliary data starts to degrade significantly.
Result: Extensive experiments show that FLAIN effectively reduces backdoor attack success rates across various scenarios including Non-IID data distributions and high malicious client ratios, while maintaining minimal impact on clean data performance.
Conclusion: FLAIN provides an effective defense mechanism against backdoor attacks in federated learning by targeting the specific neural pathways exploited by attackers, successfully neutralizing backdoors while preserving model utility for legitimate tasks.
Abstract: Federated learning (FL) enables multiple clients to collaboratively train machine learning models under the coordination of a central server, while maintaining privacy. However, the server cannot directly monitor the local training processes, leaving room for malicious clients to introduce backdoors into the model. Research has shown that backdoor attacks exploit specific neurons that are activated only by malicious inputs, remaining dormant with clean data. Building on this insight, we propose a novel defense method called Flipping Weight Updates of Low-Activation Input Neurons (FLAIN) to counter backdoor attacks in FL. Specifically, upon the completion of global training, we use an auxiliary dataset to identify low-activation input neurons and iteratively flip their associated weight updates. This flipping process continues while progressively raising the threshold for low-activation neurons, until the model’s performance on the auxiliary data begins to degrade significantly. Extensive experiments demonstrate that FLAIN effectively reduces the success rate of backdoor attacks across a variety of scenarios, including Non-IID data distributions and high malicious client ratios (MCR), while maintaining minimal impact on the performance of clean data.
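A toy sketch of the flipping step on a single input layer: rank input neurons by their mean activation on auxiliary data, then negate the aggregated weight update on the columns they feed, raising the cutoff in stages. The stopping rule (stop once auxiliary accuracy degrades) appears only as a comment, and all tensors here are synthetic stand-ins.

    import torch

    torch.manual_seed(0)
    layer = torch.nn.Linear(20, 8)                  # first layer after FL training
    update = 0.05 * torch.randn_like(layer.weight)  # aggregated FL weight update
    aux = torch.rand(256, 20)                       # auxiliary dataset

    # Mean absolute activation of each input neuron on auxiliary data.
    activation = aux.abs().mean(dim=0)

    flipped = torch.zeros(20, dtype=torch.bool)
    for q in (0.05, 0.10, 0.20, 0.40):              # progressively raised threshold
        low = (activation <= torch.quantile(activation, q)) & ~flipped
        with torch.no_grad():
            # Flip the update on columns fed by newly flagged low-activation
            # inputs: the trained weight already contains +update, so
            # subtracting 2*update leaves it at (original - update).
            layer.weight[:, low] -= 2 * update[:, low]
        flipped |= low
        # FLAIN would re-evaluate on aux data here and stop once clean
        # performance begins to degrade significantly.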
[365] Streamlining Prediction in Bayesian Deep Learning
Rui Li, Marcus Klasson, Arno Solin, Martin Trapp
Main category: cs.LG
TL;DR: This paper proposes a method to streamline Bayesian deep learning predictions through a single forward pass without Monte Carlo sampling, using local linearization and Gaussian approximations to analytically compute posterior predictive distributions.
Details
Motivation: Current Bayesian deep learning methods rely heavily on Monte Carlo integration for predictions, which is computationally inefficient. There is a need for faster inference methods that can provide uncertainty estimates without the computational overhead of sampling-based approaches.
Method: The authors use local linearization on activation functions combined with local Gaussian approximations at linear layers to enable analytical computation of the posterior predictive distribution. This eliminates the need for Monte Carlo sampling during inference, requiring only a single forward pass.
Result: The approach is demonstrated on both MLPs and transformer architectures (ViT and GPT-2), showing effectiveness on regression and classification tasks. An open-source library (SUQ) is provided for implementation.
Conclusion: The proposed method successfully enables efficient Bayesian inference in deep learning by replacing computationally expensive Monte Carlo integration with analytical approximations, making Bayesian deep learning more practical for real-world applications while maintaining uncertainty quantification capabilities.
Abstract: The rising interest in Bayesian deep learning (BDL) has led to a plethora of methods for estimating the posterior distribution. However, efficient computation of inferences, such as predictions, has been largely overlooked, with Monte Carlo integration remaining the standard. In this work we examine streamlining prediction in BDL through a single forward pass without sampling. For this we use local linearisation on activation functions and local Gaussian approximations at linear layers, which allows us to analytically compute an approximation to the posterior predictive distribution. We showcase our approach for both MLPs and transformers, such as ViT and GPT-2, and assess its performance on regression and classification tasks. Open-source library: https://github.com/AaltoML/SUQ
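A compact sketch of a sampling-free predictive pass in this spirit: propagate means and variances through a linear layer with a diagonal Gaussian weight posterior, and through a nonlinearity via local linearization (variance scaled by the squared local slope). The two-layer toy network and posterior variances are assumptions for illustration, not the SUQ library API.

    import torch

    def moments_linear(mu_x, var_x, W, b, w_var):
        # Gaussian moment propagation through y = Wx + b with a diagonal
        # Gaussian posterior over W (variance w_var); exact for linear layers.
        mu_y = mu_x @ W.t() + b
        var_y = var_x @ (W ** 2).t() + (mu_x ** 2 + var_x) @ w_var.t()
        return mu_y, var_y

    def moments_tanh(mu, var):
        # Local linearization of tanh: the mean passes through, the variance
        # is scaled by the squared derivative (1 - tanh(mu)^2)^2.
        m = torch.tanh(mu)
        return m, (1 - m ** 2) ** 2 * var

    # Two-layer MLP predictive distribution in one deterministic pass.
    torch.manual_seed(0)
    W1, b1 = torch.randn(16, 4), torch.zeros(16)
    W2, b2 = torch.randn(1, 16), torch.zeros(1)
    wv1, wv2 = 0.01 * torch.ones_like(W1), 0.01 * torch.ones_like(W2)

    x = torch.randn(8, 4)
    mu, var = moments_linear(x, torch.zeros_like(x), W1, b1, wv1)
    mu, var = moments_tanh(mu, var)
    mu, var = moments_linear(mu, var, W2, b2, wv2)
    print(mu.squeeze(), var.sqrt().squeeze())   # predictive mean and std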
[366] Physical models realizing the transformer architecture of large language models
Zeqian Chen
Main category: cs.LG
TL;DR: This paper proposes a theoretical framework that models transformer-based large language models as open quantum systems in Fock space, aiming to provide a physical understanding of how transformers work on modern nanoscale chips.
Details
Motivation: There is a gap in theoretical understanding of what transformers are and how they work physically, especially considering that modern chips (sub-28nm) should be viewed as open quantum systems rather than conventional statistical systems.
Method: The authors construct physical models that realize large language models based on transformer architecture as open quantum systems in the Fock space over the Hilbert space of tokens.
Result: The paper presents physical models that underlie the transformer architecture for large language models from a quantum mechanical perspective.
Conclusion: The transformer architecture can be theoretically understood and modeled as open quantum systems, providing a physical foundation for understanding how large language models operate on modern nanoscale hardware.
Abstract: The introduction of the transformer architecture in 2017 marked the most striking advancement in natural language processing. The transformer is a model architecture relying entirely on an attention mechanism to draw global dependencies between input and output. However, we believe there is a gap in our theoretical understanding of what the transformer is, and how it works physically. From a physical perspective on modern chips, such as chips below 28 nm, modern intelligent machines should be regarded as open quantum systems beyond conventional statistical systems. Accordingly, in this paper, we construct physical models realizing large language models based on a transformer architecture as open quantum systems in the Fock space over the Hilbert space of tokens. Our physical models underlie the transformer architecture for large language models.
[367] Edge of Stochastic Stability: Revisiting the Edge of Stability for SGD
Arseniy Andreyev, Pierfrancesco Beneventano
Main category: cs.LG
TL;DR: This paper extends Cohen et al.’s findings on gradient descent stability to mini-batch SGD, showing that while full-batch GD stabilizes the largest Hessian eigenvalue at 2/η, mini-batch SGD operates in an “Edge of Stochastic Stability” regime where Batch Sharpness (expected directional curvature) stabilizes at 2/η instead.
Details
Motivation: Cohen et al. (2021) showed that full-batch gradient descent stabilizes the largest Hessian eigenvalue at 2/η, but this result doesn't apply to mini-batch SGD, limiting its broader applicability. The authors aim to understand the stability behavior of mini-batch SGD training.
Method: The authors analyze mini-batch stochastic gradient descent and define a new concept called “Batch Sharpness” - the expected directional curvature of mini-batch Hessians along their corresponding stochastic gradients. They examine how this quantity behaves during SGD training.
Result: SGD operates in a regime called “Edge of Stochastic Stability” (EoSS) where Batch Sharpness stabilizes at 2/η rather than the largest eigenvalue. The largest eigenvalue is generally smaller than Batch Sharpness and gets suppressed, which explains why smaller batches and larger step sizes lead to flatter minima.
Conclusion: The paper provides a theoretical framework explaining SGD’s stability behavior that differs from full-batch gradient descent. This extends understanding of why mini-batch SGD finds flatter minima and has implications for mathematical modeling of SGD trajectories and understanding convergence/generalization properties.
Abstract: Recent findings by Cohen et al. (2021) demonstrate that when training neural networks with full-batch gradient descent with a step size of $\eta$, the largest eigenvalue $\lambda_{\max}$ of the full-batch Hessian consistently stabilizes at $\lambda_{\max} = 2/\eta$. These results have significant implications for convergence and generalization. This, however, is not the case for mini-batch stochastic gradient descent (SGD), limiting the broader applicability of its consequences. We show that SGD trains in a different regime we term Edge of Stochastic Stability (EoSS). In this regime, what stabilizes at $2/\eta$ is Batch Sharpness: the expected directional curvature of mini-batch Hessians along their corresponding stochastic gradients. As a consequence, $\lambda_{\max}$ – which is generally smaller than Batch Sharpness – is suppressed, aligning with the long-standing empirical observation that smaller batches and larger step sizes favor flatter minima. We further discuss implications for mathematical modeling of SGD trajectories.
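Batch Sharpness is measurable with one Hessian-vector product per mini-batch: the directional curvature g^T H g / ||g||^2 of the mini-batch loss along its own gradient. A minimal PyTorch sketch (the model and data are toys; EoSS concerns the expectation of this quantity over mini-batches during training):

    import torch

    def batch_sharpness(model, loss_fn, x, y):
        params = [p for p in model.parameters() if p.requires_grad]
        loss = loss_fn(model(x), y)
        grads = torch.autograd.grad(loss, params, create_graph=True)
        g_flat = torch.cat([g.reshape(-1) for g in grads])
        # Hessian-vector product H g via the standard double-backward trick.
        hv = torch.autograd.grad(g_flat @ g_flat.detach(), params)
        hv_flat = torch.cat([h.reshape(-1) for h in hv])
        g = g_flat.detach()
        return (g @ hv_flat) / (g @ g)

    model = torch.nn.Sequential(torch.nn.Linear(10, 32), torch.nn.Tanh(),
                                torch.nn.Linear(32, 1))
    x, y = torch.randn(16, 10), torch.randn(16, 1)
    s = batch_sharpness(model, torch.nn.functional.mse_loss, x, y)
    # EoSS predicts the average of s over mini-batches stabilizes near 2/eta.
    print(float(s))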
[368] Soft Computing Approaches for Predicting Shade-Seeking Behaviour in Dairy Cattle under Heat Stress: A Comparative Study of Random Forests and Neural Networks
S. Sanjuan, D. A. Méndez, R. Arnau, J. M. Calabuig, X. Díaz de Otálora Aguirre, F. Estellés
Main category: cs.LG
TL;DR: This study uses machine learning (Random Forests and Neural Networks) to predict when dairy cattle seek shade during heat stress, achieving accurate predictions that could help farmers manage livestock welfare in hot Mediterranean climates.
Details
Motivation: Heat stress is a major welfare and productivity issue for dairy cattle in Mediterranean climates, requiring effective prediction tools to help farmers manage livestock during hot conditions and improve precision livestock farming.
Method: The researchers collected high-resolution behavioral and micro-climatic data from a commercial farm, then applied two soft computing algorithms (Random Forests and Neural Networks) to predict daily shade-seeking behavior using features derived from Temperature-Humidity Index (THI) measurements, evaluated through 5-fold cross-validation.
Result: Both machine learning models outperformed a baseline Decision Tree, with the best Neural Network achieving RMSE of 14.78 and Random Forest achieving 14.97. The models’ predictions deviated less than one hour from observed shade-seeking peaks, with median RMSE of 13.84.
Conclusion: Soft computing approaches are suitable for modeling noisy biological phenomena and demonstrate value as low-cost, real-time decision-support tools for precision livestock farming under heat-stress conditions, providing farmers with effective tools to manage cattle welfare during hot weather.
Abstract: Heat stress is one of the main welfare and productivity problems faced by dairy cattle in Mediterranean climates. In this study, we approach the prediction of the daily shade-seeking count as a non-linear multivariate regression problem and evaluate two soft computing algorithms – Random Forests and Neural Networks – trained on high-resolution behavioral and micro-climatic data collected in a commercial farm in Titaguas (Valencia, Spain) during the 2023 summer season. The raw dataset (6907 daytime observations, 5-10 min resolution) includes the number of cows in the shade, ambient temperature and relative humidity. From these we derive three features: current Temperature–Humidity Index (THI), accumulated daytime THI, and mean night-time THI. To evaluate the models’ performance, 5-fold cross-validation is used. Results show that both soft computing models outperform a single Decision Tree baseline. The best Neural Network (3 hidden layers, 16 neurons each, learning rate = 10e-3) reaches an average RMSE of 14.78, while a Random Forest (10 trees, depth = 5) achieves 14.97 and offers the best interpretability. Daily error distributions reveal a median RMSE of 13.84 and confirm that predictions deviate less than one hour from observed shade-seeking peaks. These results demonstrate the suitability of soft computing, data-driven approaches embedded in an applied-mathematical feature framework for modeling noisy biological phenomena, and their value as low-cost, real-time decision-support tools for precision livestock farming under heat-stress conditions.
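For a sense of the pipeline, here is a runnable stand-in using synthetic data in place of the farm recordings: three THI-derived features, the forest configuration reported in the abstract (10 trees, depth 5), and RMSE under 5-fold cross-validation. The synthetic feature-target relationship is invented purely for illustration.

    import numpy as np
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import cross_val_score

    # Synthetic stand-in: (current THI, accumulated daytime THI,
    # mean night-time THI) -> daily shade-seeking count.
    rng = np.random.default_rng(0)
    X = rng.uniform(55, 85, size=(1000, 3))
    y = np.clip(2.5 * (X[:, 0] - 68) + 0.5 * (X[:, 1] - 68)
                + rng.normal(0, 10, 1000), 0, None)

    # Same configuration as the paper's best forest, scored by RMSE
    # under 5-fold cross-validation.
    rf = RandomForestRegressor(n_estimators=10, max_depth=5, random_state=0)
    rmse = -cross_val_score(rf, X, y, cv=5,
                            scoring="neg_root_mean_squared_error")
    print(rmse.mean())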
[369] GUI-G$^2$: Gaussian Reward Modeling for GUI Grounding
Fei Tang, Zhangxuan Gu, Zhengxi Lu, Xuyang Liu, Shuheng Shen, Changhua Meng, Wen Wang, Wenqi Zhang, Yongliang Shen, Weiming Lu, Jun Xiao, Yueting Zhuang
Main category: cs.LG
TL;DR: The paper introduces GUI-G², a novel reward framework that models GUI elements as continuous Gaussian distributions instead of binary targets, significantly improving GUI grounding performance by up to 24.7% over state-of-the-art methods.
Details
Motivation: Current reinforcement learning approaches for GUI grounding use binary rewards that create sparse signals and ignore the continuous nature of spatial interactions. Human clicking behavior naturally forms Gaussian distributions centered on target elements, suggesting a more principled approach is needed.
Method: GUI-G² uses two synergistic mechanisms: (1) Gaussian point rewards that model precise localization through exponentially decaying distributions centered on element centroids, and (2) coverage rewards that assess spatial alignment by measuring overlap between predicted Gaussian distributions and target regions. An adaptive variance mechanism calibrates reward distributions based on element dimensions.
Result: Extensive experiments on ScreenSpot, ScreenSpot-v2, and ScreenSpot-Pro benchmarks show GUI-G² substantially outperforms UI-TARS-72B, with the most significant improvement of 24.7% on ScreenSpot-Pro. The method demonstrates superior robustness to interface variations and enhanced generalization to unseen layouts.
Conclusion: The continuous Gaussian modeling approach transforms GUI grounding from sparse binary classification to dense continuous optimization, providing rich gradient signals that guide models toward optimal interaction positions and establishing a new paradigm for spatial reasoning in GUI interaction tasks.
Abstract: Graphical User Interface (GUI) grounding maps natural language instructions to precise interface locations for autonomous interaction. Current reinforcement learning approaches use binary rewards that treat elements as hit-or-miss targets, creating sparse signals that ignore the continuous nature of spatial interactions. Motivated by human clicking behavior that naturally forms Gaussian distributions centered on target elements, we introduce GUI Gaussian Grounding Rewards (GUI-G$^2$), a principled reward framework that models GUI elements as continuous Gaussian distributions across the interface plane. GUI-G$^2$ incorporates two synergistic mechanisms: Gaussian point rewards model precise localization through exponentially decaying distributions centered on element centroids, while coverage rewards assess spatial alignment by measuring the overlap between predicted Gaussian distributions and target regions. To handle diverse element scales, we develop an adaptive variance mechanism that calibrates reward distributions based on element dimensions. This framework transforms GUI grounding from sparse binary classification to dense continuous optimization, where Gaussian distributions generate rich gradient signals that guide models toward optimal interaction positions. Extensive experiments across ScreenSpot, ScreenSpot-v2, and ScreenSpot-Pro benchmarks demonstrate that GUI-G$^2$ substantially outperforms the state-of-the-art method UI-TARS-72B, with the most significant improvement of 24.7% on ScreenSpot-Pro. Our analysis reveals that continuous modeling provides superior robustness to interface variations and enhanced generalization to unseen layouts, establishing a new paradigm for spatial reasoning in GUI interaction tasks.
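The reward shape is easy to sketch: an exponentially decaying Gaussian point reward centered on the element centroid, with variance adapted to element size, plus a coverage term. The coverage term below is a crude inside-the-box proxy (the paper measures overlap between the predicted Gaussian and the target region), and all coefficients are illustrative.

    import numpy as np

    def gui_gaussian_reward(pred_xy, box, alpha=0.5):
        x1, y1, x2, y2 = box
        cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
        # Adaptive variance: sigma scales with element width and height.
        sx, sy = (x2 - x1) / 4, (y2 - y1) / 4
        d2 = ((pred_xy[0] - cx) / sx) ** 2 + ((pred_xy[1] - cy) / sy) ** 2
        point = np.exp(-0.5 * d2)    # exponentially decaying point reward
        # Crude stand-in for the coverage reward: is the click inside the box?
        inside = float(x1 <= pred_xy[0] <= x2 and y1 <= pred_xy[1] <= y2)
        return alpha * point + (1 - alpha) * inside

    print(gui_gaussian_reward((105, 52), (90, 40, 120, 60)))   # near centre -> ~1
    print(gui_gaussian_reward((300, 300), (90, 40, 120, 60)))  # far away -> ~0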
[370] Learning to Bid in Non-Stationary Repeated First-Price Auctions
Zihao Hu, Xiaoyu Fan, Yuan Yao, Jiheng Zhang, Zhengyuan Zhou
Main category: cs.LG
TL;DR: This paper develops optimal learning algorithms for bidding in first-price auctions under non-stationary environments, achieving minimax-optimal dynamic regret rates using a novel Optimistic Mirror Descent framework with regularity constraints on opponents’ bidding patterns.
Details
Motivation: First-price auctions are increasingly important in digital advertising, but existing learning approaches assume static environments and use static benchmarks that perform poorly under non-stationarity. A dynamic benchmark that adapts to changing conditions is needed, but achieving no-regret learning against such benchmarks requires new theoretical frameworks and algorithms.
Method: The authors introduce two regularity metrics to quantify non-stationarity in opponents’ highest bids, then develop a novel Optimistic Mirror Descent (OMD) framework with a specialized optimism configuration. They provide minimax-optimal characterization of dynamic regret for bid sequences satisfying the regularity constraints.
Result: The paper achieves minimax-optimal dynamic regret rates for online first-price auctions under regularity constraints. Synthetic dataset experiments validate the theoretical guarantees and demonstrate superior performance compared to existing methods in non-stationary environments.
Conclusion: The proposed OMD-based approach successfully addresses the challenge of learning optimal bidding strategies in non-stationary first-price auction environments, providing both theoretical guarantees and practical improvements over existing static benchmark approaches.
Abstract: First-price auctions have recently gained significant traction in digital advertising markets, exemplified by Google’s transition from second-price to first-price auctions. Unlike in second-price auctions, where bidding one’s private valuation is a dominant strategy, determining an optimal bidding strategy in first-price auctions is more complex. From a learning perspective, the learner (a specific bidder) can interact with the environment (other bidders, i.e., opponents) sequentially to infer their behaviors. Existing research often assumes specific environmental conditions and benchmarks performance against the best fixed policy (static benchmark). While this approach ensures strong learning guarantees, the static benchmark can deviate significantly from the optimal strategy in environments with even mild non-stationarity. To address such scenarios, a dynamic benchmark–representing the sum of the highest achievable rewards at each time step–offers a more suitable objective. However, achieving no-regret learning with respect to the dynamic benchmark requires additional constraints. By inspecting reward functions in online first-price auctions, we introduce two metrics to quantify the regularity of the sequence of opponents’ highest bids, which serve as measures of non-stationarity. We provide a minimax-optimal characterization of the dynamic regret for the class of sequences of opponents’ highest bids that satisfy either of these regularity constraints. Our main technical tool is the Optimistic Mirror Descent (OMD) framework with a novel optimism configuration, which is well-suited for achieving minimax-optimal dynamic regret rates in this context. We then use synthetic datasets to validate our theoretical guarantees and demonstrate that our methods outperform existing ones.
[371] BioMaze: Benchmarking and Enhancing Large Language Models for Biological Pathway Reasoning
Haiteng Zhao, Chang Ma, Fangzhi Xu, Lingpeng Kong, Zhi-Hong Deng
Main category: cs.LG
TL;DR: This paper introduces BioMaze, a dataset for evaluating LLMs’ pathway reasoning abilities in biological systems, and proposes PathSeeker, an LLM agent that uses interactive subgraph navigation to improve biological pathway reasoning performance.
Details
Motivation: Current LLMs have been applied to various biological domains, but their reasoning ability in complex biological systems like pathways remains underexplored. This capability is crucial for predicting biological phenomena, formulating hypotheses, and designing experiments, creating a need for better evaluation and methods.
Method: The authors create BioMaze, a dataset with 5.1K complex pathway problems from real research covering natural dynamic changes, disturbances, interventions, and multi-scale targets. They evaluate existing methods like Chain-of-Thought (CoT) and graph-augmented reasoning, then propose PathSeeker, an LLM agent that enhances reasoning through interactive subgraph-based navigation.
Result: Evaluation shows that current LLMs struggle with pathway reasoning, particularly in perturbed biological systems. The proposed PathSeeker agent demonstrates improved performance by enabling more effective navigation through the complexities of biological systems in a scientifically aligned manner.
Conclusion: LLMs have significant limitations in biological pathway reasoning, especially under perturbation conditions. The interactive subgraph-based navigation approach of PathSeeker offers a promising solution for enhancing LLM reasoning capabilities in complex biological systems, providing a more scientifically aligned approach to biological pathway analysis.
Abstract: The applications of large language models (LLMs) in various biological domains have been explored recently, but their reasoning ability in complex biological systems, such as pathways, remains underexplored, which is crucial for predicting biological phenomena, formulating hypotheses, and designing experiments. This work explores the potential of LLMs in pathway reasoning. We introduce BioMaze, a dataset with 5.1K complex pathway problems derived from real research, covering various biological contexts including natural dynamic changes, disturbances, additional intervention conditions, and multi-scale research targets. Our evaluation of methods such as CoT and graph-augmented reasoning, shows that LLMs struggle with pathway reasoning, especially in perturbed systems. To address this, we propose PathSeeker, an LLM agent that enhances reasoning through interactive subgraph-based navigation, enabling a more effective approach to handling the complexities of biological systems in a scientifically aligned manner. The dataset and code are available at https://github.com/zhao-ht/BioMaze.
[372] Global Convergence and Rich Feature Learning in $L$-Layer Infinite-Width Neural Networks under $\mu$P Parametrization
Zixiang Chen, Greg Yang, Qingyue Zhao, Quanquan Gu
Main category: cs.LG
TL;DR: This paper uses the tensor program framework to prove that infinitely wide neural networks under Maximal Update parametrization can learn linearly independent features that deviate from initialization while guaranteeing global convergence, overcoming limitations of neural tangent kernel theory.
Details
Motivation: Existing theoretical approaches like neural tangent kernel (NTK) are limited because features remain close to initialization, leaving gaps in understanding how networks achieve meaningful feature learning while maintaining global convergence guarantees during substantial feature evolution.
Method: The authors investigate training dynamics of infinitely wide L-layer neural networks using the tensor program (TP) framework with stochastic gradient descent under Maximal Update parametrization (μP), analyzing interactions among features across layers and properties of Gaussian random variables.
Result: Under mild conditions on activation functions, SGD enables networks to learn linearly independent features that substantially deviate from initial values, creating a rich feature space that captures relevant data information and ensures convergent points are global minima.
Conclusion: The theoretical analysis provides new insights into deep representation learning by showing how networks can simultaneously achieve meaningful feature learning and global convergence, with experimental validation on real-world datasets supporting the theoretical findings.
Abstract: Despite deep neural networks’ powerful representation learning capabilities, theoretical understanding of how networks can simultaneously achieve meaningful feature learning and global convergence remains elusive. Existing approaches like the neural tangent kernel (NTK) are limited because features stay close to their initialization in this parametrization, leaving open questions about feature properties during substantial evolution. In this paper, we investigate the training dynamics of infinitely wide, $L$-layer neural networks using the tensor program (TP) framework. Specifically, we show that, when trained with stochastic gradient descent (SGD) under the Maximal Update parametrization ($\mu$P) and mild conditions on the activation function, SGD enables these networks to learn linearly independent features that substantially deviate from their initial values. This rich feature space captures relevant data information and ensures that any convergent point of the training process is a global minimum. Our analysis leverages both the interactions among features across layers and the properties of Gaussian random variables, providing new insights into deep representation learning. We further validate our theoretical findings through experiments on real-world datasets.
[373] Antithetic Sampling for Top-k Shapley Identification
Patrick Kolpaczki, Tim Nielen, Eyke Hüllermeier
Main category: cs.LG
TL;DR: This paper proposes CMCS, a sampling method for efficiently identifying the k most important features using Shapley values, which outperforms traditional uniform approximation approaches by leveraging correlated observations and multi-armed bandit techniques.
Details
Motivation: Computing Shapley values for feature importance is computationally expensive, and most existing methods waste computational resources by uniformly approximating all features' Shapley values, including insignificant ones. The authors argue that identifying only the top-k most important features is often sufficient and can be done more efficiently.
Method: The paper introduces Comparable Marginal Contributions Sampling (CMCS), which uses a novel sampling scheme that takes advantage of correlated observations and connects to multi-armed bandit algorithms to solve the top-k feature identification problem more efficiently than uniform approximation methods.
Result: Experimental results demonstrate that CMCS is more effective than competitive baselines for top-k feature identification. The study also reveals that estimation quality for approximate-all problems does not necessarily correlate with performance on top-k identification tasks.
Conclusion: The proposed CMCS method successfully addresses the computational limitations of Shapley value computation by focusing on top-k feature identification rather than uniform approximation, showing that different problems require different algorithmic approaches even within the same domain of feature importance estimation.
Abstract: Additive feature explanations rely primarily on game-theoretic notions such as the Shapley value by viewing features as cooperating players. The Shapley value’s popularity in and outside of explainable AI stems from its axiomatic uniqueness. However, its computational complexity severely limits practicability. Most works investigate the uniform approximation of all features’ Shapley values, needlessly consuming samples for insignificant features. In contrast, identifying the $k$ most important features can already be sufficiently insightful and yields the potential to leverage algorithmic opportunities connected to the field of multi-armed bandits. We propose Comparable Marginal Contributions Sampling (CMCS), a method for the top-$k$ identification problem utilizing a new sampling scheme taking advantage of correlated observations. We conduct experiments to showcase the efficacy of our method compared to competitive baselines. Our empirical findings reveal that estimation quality for the approximate-all problem does not necessarily transfer to top-$k$ identification and vice versa.
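A rough sketch of the correlated-sampling idea: evaluate every feature's marginal contribution on the same random coalition, so estimation noise is shared and between-feature comparisons (all that top-k identification needs) tighten faster. The bandit-style adaptive stopping of CMCS is omitted, so this is only the flavor of the scheme; the toy game is linear, so the true top-k is known in advance.

    import numpy as np

    def topk_shapley(value, n, k, budget=500, rng=np.random.default_rng(0)):
        est = np.zeros(n)
        counts = np.zeros(n)
        for _ in range(budget):
            s = rng.random(n) < rng.random()     # shared random coalition
            base = value(s)
            for i in range(n):
                if not s[i]:
                    s_i = s.copy(); s_i[i] = True
                    # Marginal contribution of i on the *same* coalition,
                    # so estimates across features share their noise.
                    est[i] += value(s_i) - base
                    counts[i] += 1
        est /= np.maximum(counts, 1)
        return np.argsort(est)[-k:][::-1]

    # Toy game: value is a weighted count of present features, whose
    # Shapley values equal the weights, so the top-2 is features 0 and 2.
    w = np.array([5.0, 0.1, 3.0, 0.2, 1.0])
    print(topk_shapley(lambda s: float(w @ s), n=5, k=2))   # -> [0 2]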
[374] Feature Selection and Junta Testing are Statistically Equivalent
Lorenzo Beretta, Nathaniel Harms, Caleb Koch
Main category: cs.LG
TL;DR: This paper proves that junta testing (determining if a Boolean function depends on only k variables) and feature selection (finding which k variables) are statistically equivalent problems, with both optimally solved by a brute-force algorithm requiring Θ(1/ε(√(2^k log(n choose k)) + log(n choose k))) samples.
Details
Motivation: The paper addresses two fundamental problems in Boolean function analysis: determining whether a function is a k-junta (depends on only k out of n variables) and identifying which specific variables matter. Understanding the relationship between these problems and their optimal sample complexity is crucial for computational learning theory and feature selection applications.
Method: The authors analyze a “brute-force” algorithm that exhaustively checks all possible sets of k variables to see which set is consistent with the given sample data. They prove this algorithm is sample-optimal for both junta testing and feature selection by establishing matching upper and lower bounds on the required sample complexity.
Result: The main result shows that both junta testing and feature selection have the same optimal sample complexity of Θ(1/ε(√(2^k log(n choose k)) + log(n choose k))), where ε is the error parameter, k is the number of relevant variables, and n is the total number of variables. The brute-force algorithm achieves this optimal bound for both problems simultaneously.
Conclusion: Junta testing and feature selection are statistically equivalent problems with identical optimal sample complexity. The seemingly naive brute-force approach is actually optimal, requiring a number of samples that scales with the square root of the search space size plus a logarithmic term, providing tight characterization of these fundamental learning problems.
Abstract: For a function $f \colon \{0,1\}^n \to \{0,1\}$, the junta testing problem asks whether $f$ depends on only $k$ variables. If $f$ depends on only $k$ variables, the feature selection problem asks to find those variables. We prove that these two tasks are statistically equivalent. Specifically, we show that the “brute-force” algorithm, which checks for any set of $k$ variables consistent with the sample, is simultaneously sample-optimal for both problems, and the optimal sample size is $\Theta\left(\frac{1}{\varepsilon}\left(\sqrt{2^k \log {n \choose k}} + \log {n \choose k}\right)\right)$.
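The sample-optimal brute-force algorithm is short enough to state directly: scan all k-subsets of variables and return one that is consistent with the sample, i.e. no two examples agreeing on those coordinates carry different labels. The toy function below is an assumption for demonstration.

    import random
    from itertools import combinations

    def find_junta(samples, n, k):
        for S in combinations(range(n), k):
            table, ok = {}, True
            for x, y in samples:
                key = tuple(x[i] for i in S)
                if table.setdefault(key, y) != y:   # same projection, different label
                    ok = False
                    break
            if ok:
                return S      # feature selection: these k variables suffice
        return None           # junta testing: reject, f is not a k-junta

    # Toy target f(x) = x1 XOR x3 depends on 2 of 5 variables.
    random.seed(0)
    samples = [(x, x[1] ^ x[3])
               for x in ([random.randint(0, 1) for _ in range(5)]
                         for _ in range(200))]
    print(find_junta(samples, n=5, k=2))   # -> (1, 3) with high probability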
[375] Manifold Learning with Normalizing Flows: Towards Regularity, Expressivity and Iso-Riemannian Geometry
Willem Diepeveen, Deanna Needell
Main category: cs.LG
TL;DR: This paper addresses distortions and modeling errors in multi-modal data analysis by improving learned Riemannian geometry through isometrization and balanced diffeomorphism parametrization to enhance manifold learning performance.
Details
Motivation: Modern machine learning relies on the manifold hypothesis that high-dimensional data lie near low-dimensional non-linear manifolds. While learned pullback geometry has become scalable, real-world multi-modal data still presents challenges with distortions and modeling errors that need to be addressed for effective non-linear data analysis.
Method: The paper proposes two main approaches: (1) isometrizing the learned Riemannian structure to reduce distortions, and (2) balancing regularity and expressivity of the diffeomorphism parametrization to improve modeling accuracy in multi-modal settings.
Result: The effectiveness of the proposed synergistic approaches is demonstrated through numerical experiments on both synthetic and real data, showing improved performance in handling multi-modal data distortions and modeling errors.
Conclusion: The combination of isometrizing learned Riemannian structures and balanced diffeomorphism parametrization successfully addresses key challenges in multi-modal manifold learning, making principled non-linear data analysis more robust and interpretable for real-world applications.
Abstract: Modern machine learning increasingly leverages the insight that high-dimensional data often lie near low-dimensional, non-linear manifolds, an idea known as the manifold hypothesis. By explicitly modeling the geometric structure of data through learned Riemannian geometry, algorithms can achieve improved performance and interpretability in tasks like clustering, dimensionality reduction, and interpolation. In particular, learned pullback geometry has recently undergone transformative developments that now make it scalable to learn and scalable to evaluate, which further opens the door for principled non-linear data analysis and interpretable machine learning. However, there are still steps to be taken when considering real-world multi-modal data. This work focuses on addressing distortions and modeling errors that can arise in the multi-modal setting and proposes to alleviate both challenges through isometrizing the learned Riemannian structure and balancing regularity and expressivity of the diffeomorphism parametrization. We showcase the effectiveness of the synergy of the proposed approaches in several numerical experiments with both synthetic and real data.
[376] Ownership Verification of DNN Models Using White-Box Adversarial Attacks with Specified Probability Manipulation
Teruki Sano, Minoru Kuribayashi, Masao Sakai, Shuji Ishobe, Eisuke Koizumi
Main category: cs.LG
TL;DR: This paper proposes a framework for verifying ownership of deep neural network models using adversarial attacks, where the rightful owner can prove model identity without revealing the original model by generating specific adversarial examples that produce designated output probabilities.
Details
Motivation: The need to verify ownership of DNN models in scenarios where unauthorized users illegally copy and deploy models in cloud environments, requiring a method that doesn't expose the original model while proving ownership.
Method: A white-box adversarial attack framework that aligns the output probability of specific classes to designated values using an enhanced iterative Fast Gradient Sign Method (FGSM) with control parameters, leveraging the owner’s knowledge of the original model.
Result: Experimental results demonstrate the effectiveness of identifying DNN models through adversarial attacks, confirming that the proposed method can successfully verify model ownership in gray-box scenarios.
Conclusion: The proposed adversarial attack-based framework provides an effective solution for DNN model ownership verification, enabling both rightful owners and third parties to verify model identity without accessing the original model.
Abstract: In this paper, we propose a novel framework for ownership verification of deep neural network (DNN) models for image classification tasks. It allows verification of model identity by both the rightful owner and a third party without presenting the original model. We assume a gray-box scenario where an unauthorized user owns a model that is illegally copied from the original model and provides services in a cloud environment, where users submit images and receive the classification results as a probability distribution over output classes. The framework applies a white-box adversarial attack to align the output probability of a specific class to a designated value. Because the owner knows the original model, they can generate such adversarial examples. We propose a simple but effective adversarial attack method based on the iterative Fast Gradient Sign Method (FGSM) by introducing control parameters. Experimental results confirm the effectiveness of the identification of DNN models using adversarial attacks.
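The summary describes an iterative FGSM variant that drives a chosen class probability toward a designated value. The PyTorch sketch below is a plausible reconstruction under stated assumptions (batch size 1, inputs in [0,1], a squared-gap loss), not the authors' exact method; `alpha`, `eps`, and `steps` are illustrative control parameters.

```python
import torch
import torch.nn.functional as F

def probability_aligning_attack(model, x, target_class, p_target,
                                eps=0.03, alpha=0.005, steps=50):
    """Iteratively perturb x so the softmax probability of `target_class`
    approaches the designated value `p_target` (batch size 1 assumed)."""
    x_adv = x.clone().detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        prob = F.softmax(model(x_adv), dim=1)[0, target_class]
        loss = (prob - p_target) ** 2  # squared gap to the designated probability
        grad, = torch.autograd.grad(loss, x_adv)
        with torch.no_grad():
            x_adv = x_adv - alpha * grad.sign()           # FGSM-style signed step
            x_adv = torch.clamp(x_adv, x - eps, x + eps)  # stay in the L-inf budget
            x_adv = torch.clamp(x_adv, 0.0, 1.0)          # keep a valid image
    return x_adv
```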
[377] Note on Follow-the-Perturbed-Leader in Combinatorial Semi-Bandit Problems
Botao Chen, Junya Honda
Main category: cs.LG
TL;DR: This paper analyzes Follow-the-Perturbed-Leader (FTPL) policy in combinatorial semi-bandit problems, proving optimal regret bounds and developing a computationally efficient variant that reduces complexity from O(d²) to O(md(log(d/m)+1)) while maintaining performance.
Details
Motivation: While FTPL has been shown to achieve Best-of-Both-Worlds optimality in standard multi-armed bandit problems with Fréchet-type distributions, its optimality in combinatorial semi-bandit problems remained unclear and needed theoretical analysis.
Method: The authors analyze FTPL with geometric resampling (GR) in size-invariant semi-bandit settings and extend conditional geometric resampling (CGR) to this setting to improve computational efficiency while preserving regret performance.
Result: FTPL achieves O(√(m²d^(1/α)T) + √(mdT)) regret with Fréchet distributions and optimal O(√(mdT)) regret with Pareto distributions in adversarial settings. The extended CGR reduces computational complexity from O(d²) to O(md(log(d/m)+1)).
Conclusion: FTPL achieves optimal regret bounds in size-invariant combinatorial semi-bandit problems, and the proposed CGR extension significantly improves computational efficiency without sacrificing regret performance, making FTPL more practical for large-scale applications.
Abstract: This paper studies the optimality and complexity of the Follow-the-Perturbed-Leader (FTPL) policy in size-invariant combinatorial semi-bandit problems. Recently, Honda et al. (2023) and Lee et al. (2024) showed that FTPL achieves Best-of-Both-Worlds (BOBW) optimality in standard multi-armed bandit problems with Fréchet-type distributions. However, the optimality of FTPL in combinatorial semi-bandit problems remains unclear. In this paper, we consider the regret bound of FTPL with geometric resampling (GR) in the size-invariant semi-bandit setting, showing that FTPL respectively achieves $O\left(\sqrt{m^2 d^\frac{1}{\alpha}T}+\sqrt{mdT}\right)$ regret with Fréchet distributions, and the best possible regret bound of $O\left(\sqrt{mdT}\right)$ with Pareto distributions in the adversarial setting. Furthermore, we extend conditional geometric resampling (CGR) to the size-invariant semi-bandit setting, which reduces the computational complexity from $O(d^2)$ of the original GR to $O\left(md\left(\log(d/m)+1\right)\right)$ without sacrificing the regret performance of FTPL.
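For intuition, here is a minimal NumPy sketch of one FTPL decision step in the size-invariant setting: perturb cumulative loss estimates with Fréchet noise and play the m lowest-scoring arms. The GR/CGR loss estimation that the paper actually analyzes is omitted; `alpha` is the Fréchet shape parameter.

```python
import numpy as np

def ftpl_select(cum_loss_est, m, alpha=2.0, rng=None):
    """Play the m arms minimizing perturbed cumulative loss estimates.

    Frechet(alpha) noise via inverse-CDF sampling: F^{-1}(u) = (-log u)^(-1/alpha).
    """
    rng = rng or np.random.default_rng()
    u = rng.random(len(cum_loss_est))
    noise = (-np.log(u)) ** (-1.0 / alpha)
    return np.argsort(cum_loss_est - noise)[:m]  # indices of the chosen super-arm
```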
[378] Learning Causally Predictable Outcomes from Psychiatric Longitudinal Data
Eric V. Strobl
Main category: cs.LG
TL;DR: DEBIAS is a novel algorithm that optimizes outcome definitions in longitudinal psychiatric data to maximize causal identifiability by learning interpretable weights for symptom aggregation, addressing the fundamental problem of unconfoundedness assumptions in treatment effect estimation.
Details
Motivation: Classical causal inference methods in psychiatry struggle with symptom heterogeneity and latent confounding, particularly because they assume fixed outcome variables and rely on unconfoundedness assumptions that often don't hold in practice with observed covariate adjustment alone.
Method: The DEBIAS algorithm directly optimizes outcome definitions by learning non-negative, clinically interpretable weights for outcome aggregation. It maximizes durable treatment effects while minimizing observed and latent confounding by leveraging the time-limited direct effects of prior treatments in psychiatric longitudinal data, and includes an empirically verifiable test for outcome unconfoundedness.
Result: DEBIAS consistently outperforms state-of-the-art methods in recovering causal effects for clinically interpretable composite outcomes across comprehensive experiments in both depression and schizophrenia datasets.
Conclusion: The approach of optimizing outcome definitions rather than assuming fixed outcomes provides a promising solution to fundamental confounding challenges in psychiatric causal inference, offering both improved performance and clinical interpretability.
Abstract: Causal inference in longitudinal biomedical data remains a central challenge, especially in psychiatry, where symptom heterogeneity and latent confounding frequently undermine classical estimators. Most existing methods for treatment effect estimation presuppose a fixed outcome variable and address confounding through observed covariate adjustment. However, the assumption of unconfoundedness may not hold for a fixed outcome in practice. To address this foundational limitation, we directly optimize the outcome definition to maximize causal identifiability. Our DEBIAS (Durable Effects with Backdoor-Invariant Aggregated Symptoms) algorithm learns non-negative, clinically interpretable weights for outcome aggregation, maximizing durable treatment effects and empirically minimizing both observed and latent confounding by leveraging the time-limited direct effects of prior treatments in psychiatric longitudinal data. The algorithm also furnishes an empirically verifiable test for outcome unconfoundedness. DEBIAS consistently outperforms state-of-the-art methods in recovering causal effects for clinically interpretable composite outcomes across comprehensive experiments in depression and schizophrenia.
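A minimal PyTorch sketch (an assumed form, not the authors' implementation) of the outcome-aggregation step: symptom scores are combined with learnable weights constrained to be non-negative and normalized, giving a clinically interpretable composite outcome whose definition can then be optimized for identifiability.

```python
import torch
import torch.nn.functional as F

def composite_outcome(symptom_scores: torch.Tensor,
                      raw_weights: torch.Tensor) -> torch.Tensor:
    """Aggregate per-subject symptom scores into a learnable composite outcome.

    symptom_scores: (n_subjects, n_symptoms); raw_weights: (n_symptoms,) learnable.
    Softplus enforces non-negativity; normalization keeps the weights readable
    as relative symptom contributions.
    """
    w = F.softplus(raw_weights)
    w = w / w.sum()
    return symptom_scores @ w  # shape: (n_subjects,)
```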
[379] Towards a deeper GCN: Alleviate over-smoothing with iterative training and fine-tuning
Furong Peng, Jinzhen Gao, Xuan Lu, Kang Liu, Yifan Huo, Sheng Wang
Main category: cs.LG
TL;DR: This paper proposes Layer-wise Gradual Training (LGT), a novel training strategy for deep Graph Convolutional Networks that addresses over-smoothing by progressively building layers while preserving expressiveness through layer-wise training, low-rank adaptation, and identity initialization.
Details
Motivation: Graph Convolutional Networks suffer from severe performance degradation in deep architectures due to over-smoothing. While previous studies focused on graph Laplacian operators, this work identifies that trainable linear transformations significantly exacerbate feature collapse even at moderate depths, creating a trade-off between expressive power and over-smoothing that needs to be addressed.
Method: The paper proposes Layer-wise Gradual Training (LGT) with three components: (1) layer-wise training that progressively builds from shallow to deep layers to stabilize optimization, (2) low-rank adaptation for fine-tuning shallow layers and accelerating training, and (3) identity initialization to ensure smooth integration of new layers and faster convergence.
Result: LGT achieves state-of-the-art performance on vanilla GCN with significant accuracy improvements even in 32-layer settings. The method can be seamlessly combined with existing approaches like PairNorm and ContraNorm to further enhance performance in deeper networks, demonstrating its effectiveness as a general training framework.
Conclusion: LGT provides an architecture-agnostic training framework that successfully enables scalable deep GCNs by addressing the fundamental trade-off between expressiveness and over-smoothing through progressive layer construction, offering a practical solution for training very deep graph neural networks.
Abstract: Graph Convolutional Networks (GCNs) suffer from severe performance degradation in deep architectures due to over-smoothing. While existing studies primarily attribute the over-smoothing to repeated applications of graph Laplacian operators, our empirical analysis reveals a critical yet overlooked factor: trainable linear transformations in GCNs significantly exacerbate feature collapse, even at moderate depths (e.g., 8 layers). In contrast, Simplified Graph Convolution (SGC), which removes these transformations, maintains stable feature diversity up to 32 layers, highlighting linear transformations’ dual role in facilitating expressive power and inducing over-smoothing. However, completely removing linear transformations weakens the model’s expressive capacity. To address this trade-off, we propose Layer-wise Gradual Training (LGT), a novel training strategy that progressively builds deep GCNs while preserving their expressiveness. LGT integrates three complementary components: (1) layer-wise training to stabilize optimization from shallow to deep layers, (2) low-rank adaptation to fine-tune shallow layers and accelerate training, and (3) identity initialization to ensure smooth integration of new layers and accelerate convergence. Extensive experiments on benchmark datasets demonstrate that LGT achieves state-of-the-art performance on vanilla GCN, significantly improving accuracy even in 32-layer settings. Moreover, as a training method, LGT can be seamlessly combined with existing methods such as PairNorm and ContraNorm, further enhancing their performance in deeper networks. LGT offers a general, architecture-agnostic training framework for scalable deep GCNs. The code is available at https://github.com/jfklasdfj/LGT_GCN.
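A minimal PyTorch sketch of LGT's identity-initialized layer growth, assuming for illustration that each added layer contains a square linear transform: the new layer computes the identity map at initialization, so the deepened network starts out reproducing the shallower one's outputs.

```python
import torch.nn as nn

def grow_network(layers: nn.ModuleList, dim: int) -> None:
    """Append one feature-transform layer initialized to the identity, so the
    deeper model initially behaves exactly like the shallower model."""
    new_layer = nn.Linear(dim, dim)
    nn.init.eye_(new_layer.weight)   # identity initialization
    nn.init.zeros_(new_layer.bias)
    layers.append(new_layer)
```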
[380] Neural Approaches for Multi-Objective Routing on Multigraphs
Filip Rydin, Attila Lischka, Jiaming Wu, Morteza Haghir Chehreghani, Balázs Kulcsár
Main category: cs.LG
TL;DR: This paper introduces two graph neural network-based methods for multi-objective routing on multigraphs (graphs with multiple edges between node pairs), addressing a gap in existing routing methods that don’t handle such complex graph structures.
Details
Motivation: Existing learning-based routing methods cannot handle multigraphs, which have multiple edges with different attributes between node pairs, despite multigraphs being highly relevant in real-world routing scenarios. This creates a significant limitation for practical applications.
Method: Two GNN-based approaches: (1) Direct multigraph routing that autoregressively selects edges until completing a tour, and (2) A two-stage approach that first learns to prune the multigraph into a simpler graph, then performs routing on the simplified structure.
Result: Both models demonstrate strong empirical performance across various problem instances and distributions, successfully handling multi-objective routing on multigraphs where existing methods fail.
Conclusion: The proposed GNN-based methods effectively solve multi-objective routing on multigraphs, with both direct and pruning-based approaches showing strong performance, thus extending learning-based routing capabilities to more complex and realistic graph structures.
Abstract: Learning-based methods for routing have gained significant attention in recent years, both in single-objective and multi-objective contexts. Yet, existing methods are unsuitable for routing on multigraphs, which feature multiple edges with distinct attributes between node pairs, despite their strong relevance in real-world scenarios. In this paper, we propose two graph neural network-based methods to address multi-objective routing on multigraphs. Our first approach operates directly on the multigraph by autoregressively selecting edges until a tour is completed. The second model first simplifies the multigraph via a learned pruning strategy and then performs routing on the resulting simple graph. We evaluate both models empirically and demonstrate their strong performance across a range of problems and distributions.
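The learned pruning strategy itself is not detailed in this summary; as a hand-coded stand-in that conveys the idea, the sketch below reduces a multigraph to the parallel edges that are Pareto-optimal across the objective vector, which is a natural target for such a pruning step.

```python
from collections import defaultdict

def prune_multigraph(edges):
    """Reduce a multigraph to Pareto-optimal parallel edges per node pair.

    `edges` is a list of (u, v, costs) with `costs` a list of objective values;
    an edge is dropped if a parallel edge is at least as good in every objective
    and strictly better in at least one.
    """
    by_pair = defaultdict(list)
    for u, v, costs in edges:
        by_pair[(u, v)].append(costs)
    pruned = []
    for (u, v), group in by_pair.items():
        for c in group:
            dominated = any(
                all(o <= ci for o, ci in zip(other, c)) and other != c
                for other in group
            )
            if not dominated:
                pruned.append((u, v, c))
    return pruned
```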
[381] FedWSQ: Efficient Federated Learning with Weight Standardization and Distribution-Aware Non-Uniform Quantization
Seung-Wook Kim, Seongyeol Kim, Jiah Kim, Seowon Ji, Se-Ho Lee
Main category: cs.LG
TL;DR: FedWSQ is a novel federated learning framework that combines weight standardization and distribution-aware non-uniform quantization to address data heterogeneity and communication constraints, achieving better performance with reduced communication overhead.
Details
Motivation: Federated learning suffers from performance degradation due to data heterogeneity across clients and communication constraints that limit the efficiency of distributed training.
Method: The paper proposes the FedWSQ framework, which integrates two key components: (1) weight standardization (WS) to filter out biased components in local updates and improve robustness against data heterogeneity, and (2) distribution-aware non-uniform quantization (DANUQ), which leverages statistical properties of local model updates to minimize quantization errors.
Result: FedWSQ significantly reduces communication overhead while maintaining superior model accuracy. Extensive experiments on FL benchmark datasets show consistent outperformance over existing FL methods across various challenging settings including extreme data heterogeneity and ultra-low-bit communication scenarios.
Conclusion: The proposed FedWSQ framework effectively addresses the dual challenges of data heterogeneity and communication constraints in federated learning, demonstrating superior performance across challenging FL scenarios while achieving significant communication efficiency gains.
Abstract: Federated learning (FL) often suffers from performance degradation due to key challenges such as data heterogeneity and communication constraints. To address these limitations, we present a novel FL framework called FedWSQ, which integrates weight standardization (WS) and the proposed distribution-aware non-uniform quantization (DANUQ). WS enhances FL performance by filtering out biased components in local updates during training, thereby improving the robustness of the model against data heterogeneity and unstable client participation. In addition, DANUQ minimizes quantization errors by leveraging the statistical properties of local model updates. As a result, FedWSQ significantly reduces communication overhead while maintaining superior model accuracy. Extensive experiments on FL benchmark datasets demonstrate that FedWSQ consistently outperforms existing FL methods across various challenging FL settings, including extreme data heterogeneity and ultra-low-bit communication scenarios.
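A minimal PyTorch sketch of the two ingredients as described above. The exact DANUQ construction is not given in this summary, so the codebook below (conditional means of a standard normal over equal-probability bins) is an illustrative assumption matching the stated idea of exploiting the roughly Gaussian statistics of local updates.

```python
import math
import torch

def standardize(update: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    """Weight standardization: remove the mean (biased) component of a local
    update and rescale it, before quantization and server aggregation."""
    return (update - update.mean()) / update.std().clamp_min(eps)

def danuq_codebook(num_bits: int = 2) -> torch.Tensor:
    """Illustrative non-uniform codebook: levels at the conditional means of a
    standard normal over equal-probability bins (an assumption, not the
    paper's exact construction)."""
    n = 2 ** num_bits
    normal = torch.distributions.Normal(0.0, 1.0)
    edges = normal.icdf(torch.linspace(0.0, 1.0, n + 1).clamp(1e-6, 1 - 1e-6))
    phi = lambda z: torch.exp(-0.5 * z * z) / math.sqrt(2 * math.pi)
    # E[Z | a < Z < b] = (phi(a) - phi(b)) / (Phi(b) - Phi(a)) for each bin
    probs = normal.cdf(edges[1:]) - normal.cdf(edges[:-1])
    return (phi(edges[:-1]) - phi(edges[1:])) / probs
```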
[382] Harnessing Near-Infrared Spectroscopy and Machine Learning for Traceable Classification of Hanwoo and Holstein Beef
AMM Nurul Alam, Abdul Samad, AMM Shamsul Alam, Jahan Ara Monti, Ayesha Muazzam
Main category: cs.LG
TL;DR: This study demonstrates that Near-Infrared spectroscopy (NIRS) combined with machine learning can effectively differentiate between Hanwoo beef and Holstein beef for food authenticity detection, with Random Forest achieving the highest accuracy (ROC AUC of 0.8826).
Details
Motivation: To address food authenticity issues, mislabeling, and adulteration in beef products by developing a rapid and non-invasive method to distinguish between Hanwoo beef (HNB) and Holstein beef (HLB).
Method: Used portable NIRS to collect spectral data (700-1100 nm) from 40 beef samples, applied PCA for data analysis, and implemented 9 different machine learning models (LDA, SVM, LR, Random Forest, GB, KNN, DT, NB, NN) with hyperparameter tuning and 5-fold cross-validation.
Result: PCA successfully separated the two beef varieties with 93.72% variance explained. Random Forest achieved the best performance (ROC AUC: 0.8826), followed by SVM (0.8747). Neural Networks showed the highest recall (0.7804). LR and SVM provided the best balance of accuracy, precision, and recall.
Conclusion: The integration of NIRS with machine learning techniques provides a powerful and reliable method for meat authenticity verification, offering significant potential for detecting food fraud in the beef industry.
Abstract: This study evaluates the use of Near-Infrared spectroscopy (NIRS) combined with advanced machine learning (ML) techniques to differentiate Hanwoo beef (HNB) and Holstein beef (HLB) to address food authenticity, mislabeling, and adulteration. Rapid and non-invasive spectral data were acquired with a portable NIRS device, recording absorbance within the wavelength range of 700 to 1100 nm. A total of 40 Longissimus lumborum samples, evenly split between HNB and HLB, were obtained from a local hypermarket. Data analysis using Principal Component Analysis (PCA) demonstrated distinct spectral patterns associated with chemical changes, clearly separating the two beef varieties and accounting for 93.72% of the total variance. ML models, including Linear Discriminant Analysis (LDA), Support Vector Machine (SVM), Logistic Regression (LR), Random Forest, Gradient Boosting (GB), K-Nearest Neighbors, Decision Tree (DT), Naive Bayes (NB), and Neural Networks (NN), were implemented, optimized through hyperparameter tuning, and validated by 5-fold cross-validation to enhance model robustness and prevent overfitting. Random Forest provided the highest predictive accuracy with a Receiver Operating Characteristic (ROC) Area Under the Curve (AUC) of 0.8826, closely followed by the SVM model at 0.8747. Furthermore, the GB and NN algorithms exhibited satisfactory performance, with cross-validation scores of 0.752. Notably, the NN model achieved the highest recall rate of 0.7804, highlighting its suitability in scenarios requiring heightened sensitivity. DT and NB exhibited comparatively lower predictive performance. The LR and SVM models emerged as optimal choices by effectively balancing high accuracy, precision, and recall. This study confirms that integrating NIRS with ML techniques offers a powerful and reliable method for meat authenticity verification, contributing significantly to the detection of food fraud.
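The analysis pipeline maps naturally onto scikit-learn. The sketch below uses synthetic placeholder spectra; the PCA component count and forest size are illustrative assumptions, not the study's settings.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 100))  # placeholder for absorbance spectra (700-1100 nm)
y = np.repeat([0, 1], 20)       # 0 = Hanwoo (HNB), 1 = Holstein (HLB)

pipe = make_pipeline(StandardScaler(),
                     PCA(n_components=5),  # illustrative component count
                     RandomForestClassifier(n_estimators=200, random_state=0))
auc = cross_val_score(pipe, X, y, cv=5, scoring="roc_auc")  # 5-fold CV as in the study
print(auc.mean())
```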
[383] Beyond the ATE: Interpretable Modelling of Treatment Effects over Dose and Time
Julianna Piskorz, Krzysztof Kacprzyk, Harry Amad, Mihaela van der Schaar
Main category: cs.LG
TL;DR: This paper proposes a framework for modeling treatment effects as smooth surfaces over dose and time, adapting SemanticODE to the causal inference setting to extract clinically actionable insights like onset time, peak effect, and duration of benefit from treatment trajectories.
Details
Motivation: Traditional Average Treatment Effect (ATE) metrics provide only static summaries that fail to capture the dynamic nature of treatment effects varying with dose and time, particularly in healthcare applications where understanding treatment trajectories is crucial for clinical decision-making.
Method: The authors adapt SemanticODE, an interpretable trajectory modeling framework, to causal inference settings where treatment effects are unobserved. The method models treatment effect trajectories as smooth surfaces over dose and time, decoupling trajectory shape estimation from clinically relevant property specification, and supports domain-informed priors and post-hoc editing.
Result: The proposed method produces accurate, interpretable, and editable models of treatment dynamics that can extract clinically actionable insights such as onset time, peak effect, and duration of benefit while maintaining interpretability, robustness, and verifiability required in high-stakes domains.
Conclusion: The framework successfully enables both rigorous causal analysis and practical decision-making by providing a flexible, interpretable approach to modeling treatment effect trajectories that can be transparently analyzed and edited according to domain expertise.
Abstract: The Average Treatment Effect (ATE) is a foundational metric in causal inference, widely used to assess intervention efficacy in randomized controlled trials (RCTs). However, in many applications – particularly in healthcare – this static summary fails to capture the nuanced dynamics of treatment effects that vary with both dose and time. We propose a framework for modelling treatment effect trajectories as smooth surfaces over dose and time, enabling the extraction of clinically actionable insights such as onset time, peak effect, and duration of benefit. To ensure interpretability, robustness, and verifiability – key requirements in high-stakes domains – we adapt SemanticODE, a recent framework for interpretable trajectory modelling, to the causal setting where treatment effects are never directly observed. Our approach decouples the estimation of trajectory shape from the specification of clinically relevant properties (e.g., maxima, inflection points), supporting domain-informed priors, post-hoc editing, and transparent analysis. We show that our method yields accurate, interpretable, and editable models of treatment dynamics, facilitating both rigorous causal analysis and practical decision-making.
[384] OPC: One-Point-Contraction Unlearning Toward Deep Feature Forgetting
Jaeheun Jung, Bosung Jung, Suhyun Bae, Donghun Lee
Main category: cs.LG
TL;DR: This paper addresses the problem of “shallow forgetting” in machine unlearning methods, where models only superficially forget data while retaining internal representations. The authors propose One-Point-Contraction (OPC), a theoretically-grounded unlearning algorithm that achieves deep feature-level forgetting and demonstrates superior resistance to recovery attacks.
Details
Motivation: Existing machine unlearning methods suffer from shallow forgetting - they only adjust model responses while internal representations retain sufficient information to restore forgotten data. This vulnerability allows attackers to recover supposedly forgotten information through performance recovery attacks and gradient-inversion-based data reconstruction attacks, undermining privacy and legal compliance requirements.
Method: The paper defines a theoretical criterion for “deep forgetting” based on one-point-contraction of feature representations. They develop an efficient approximation algorithm and construct the One-Point-Contraction (OPC) unlearning method that enforces deep feature-level forgetting by contracting the representations of data to be forgotten to a single point in feature space.
Result: Empirical evaluations on image classification benchmarks demonstrate that OPC achieves effective unlearning performance while showing superior resilience against both performance recovery attacks and gradient-inversion attacks compared to existing methods. The deep feature forgetting enforced by OPC’s theoretical foundation leads to more robust unlearning.
Conclusion: The paper establishes that current unlearning methods are vulnerable due to shallow forgetting and demonstrates that theoretically-grounded deep forgetting through one-point-contraction can significantly improve the robustness of machine unlearning. This work highlights the critical need for improved robustness in machine unlearning methods to meet genuine privacy and security requirements.
Abstract: Machine unlearning seeks to remove the influence of particular data or classes from trained models to meet privacy, legal, or ethical requirements. Existing unlearning methods tend to forget shallowly: the unlearned model pretends to forget by adjusting only its responses, while its internal representations retain enough information to restore the forgotten data or behavior. We empirically confirm that this shallowness is widespread by reverting the forgetting effect of various unlearning methods via a training-free performance recovery attack and a gradient-inversion-based data reconstruction attack. To address this vulnerability fundamentally, we define a theoretical criterion of “deep forgetting” based on one-point-contraction of the feature representations of the data to forget. We also propose an efficient approximation algorithm, and use it to construct a novel general-purpose unlearning algorithm: One-Point-Contraction (OPC). Empirical evaluations on image classification unlearning benchmarks show that OPC achieves not only effective unlearning performance but also superior resilience against both performance recovery and gradient-inversion attacks. The distinctive unlearning performance of OPC arises from the deep feature forgetting enforced by its theoretical foundation, underscoring the need for improved robustness in machine unlearning methods.
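A minimal PyTorch sketch of the contraction objective suggested by the name (an assumed form; the paper's criterion and approximation algorithm are more involved): penalize the spread of forget-set feature representations around a single anchor point, so they collapse onto it.

```python
import torch

def one_point_contraction_loss(forget_features: torch.Tensor,
                               anchor: torch.Tensor) -> torch.Tensor:
    """Drive forget-set features (n, d) to contract onto a single anchor (d,),
    erasing the information those representations carry about the data."""
    return ((forget_features - anchor) ** 2).sum(dim=1).mean()
```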
[385] Pre-Training LLMs on a budget: A comparison of three optimizers
Joel Schlotthauer, Christian Kroos, Chris Hinze, Viktor Hangya, Luzian Hahn, Fabian Küch
Main category: cs.LG
TL;DR: This study compares three optimizers (AdamW, Lion, and Sophia) for LLM pre-training across different architectures and training approaches, finding that while all perform similarly, each has distinct advantages: Sophia achieves lowest loss, Lion trains fastest, and AdamW performs best on downstream tasks.
Details
Motivation: To systematically compare major optimizer variants for LLM pre-training and determine which optimizer provides the best trade-offs between training efficiency and model performance, as optimizers play a crucial role in reducing pre-training times and achieving better-performing models.
Method: The researchers compared three optimizers (AdamW, Lion, and Sophia) using two different base architectures with both single- and multiple-epoch training approaches while keeping token count constant. They used Maximal Update Parametrization and smaller proxy models to tune hyperparameters separately for each optimizer-architecture combination.
Result: All three optimizers performed within approximately the same range. Sophia achieved the lowest training and validation loss, Lion was the fastest in terms of training GPU hours, while AdamW led to the best results on downstream evaluation tasks.
Conclusion: The choice of optimizer involves trade-offs: Sophia for lowest loss, Lion for fastest training, and AdamW for best downstream performance. No single optimizer dominates across all metrics, suggesting that optimizer selection should depend on specific priorities and constraints of the training scenario.
Abstract: Optimizers play a decisive role in reducing pre-training times for LLMs and achieving better-performing models. In this study, we compare three major variants: the de-facto standard AdamW, the simpler Lion, developed through an evolutionary search, and the second-order optimizer Sophia. For better generalization, we train with two different base architectures and use a single- and a multiple-epoch approach while keeping the number of tokens constant. Using the Maximal Update Parametrization and smaller proxy models, we tune relevant hyperparameters separately for each combination of base architecture and optimizer. We found that while the results from all three optimizers were in approximately the same range, Sophia exhibited the lowest training and validation loss, Lion was the fastest in terms of training GPU hours, and AdamW led to the best downstream evaluation results.
[386] Leveraging Distribution Matching to Make Approximate Machine Unlearning Faster
Junaid Iqbal Khan
Main category: cs.LG
TL;DR: This paper proposes two complementary methods to accelerate approximate machine unlearning (AMU): Blend, a distribution-matching dataset condensation technique that reduces retained dataset size, and Accelerated-AMU (A-AMU), a loss-centric method that speeds up convergence through modified objectives.
Details
Motivation: Approximate machine unlearning requires processing retained dataset subsets, which dominates computational runtime, and reducing the number of training epochs remains a challenge. Current AMU methods are computationally expensive and slow to converge.
Method: Two complementary approaches: (1) Blend, a novel distribution-matching dataset condensation that merges visually similar images with shared blend-weights to reduce the retained set size with minimal preprocessing overhead; (2) A-AMU, which augments the unlearning objective with a steepened primary loss for expedited forgetting and a differentiable regularizer that matches the loss distributions of forgotten and unseen data.
Result: The dual approach dramatically reduces end-to-end unlearning latency across single and multi-round scenarios while preserving model utility and privacy. Blend operates orders of magnitude faster than state-of-the-art dataset condensation methods.
Conclusion: This is the first work to systematically tackle unlearning efficiency by jointly designing specialized dataset condensation with dedicated accelerated loss function, achieving significant speedups in machine unlearning while maintaining performance and privacy guarantees.
Abstract: Approximate machine unlearning (AMU) enables models to “forget” specific training data through specialized fine-tuning on a retained dataset subset. However, processing this retained subset still dominates computational runtime, and reducing the number of epochs also remains a challenge. We propose two complementary methods to accelerate classification-oriented AMU. First, Blend, a novel distribution-matching dataset condensation (DC) method, merges visually similar images with shared blend-weights to significantly reduce the retained set size. It operates with minimal pre-processing overhead and is orders of magnitude faster than state-of-the-art DC methods. Second, our loss-centric method, Accelerated-AMU (A-AMU), augments the unlearning objective to quicken convergence. A-AMU achieves this by combining a steepened primary loss to expedite forgetting with a novel, differentiable regularizer that matches the loss distributions of forgotten and in-distribution unseen data. Our extensive experiments demonstrate that this dual approach of data- and loss-centric optimization dramatically reduces end-to-end unlearning latency across both single and multi-round scenarios, all while preserving model utility and privacy. To our knowledge, this is the first work to systematically tackle unlearning efficiency by jointly designing a specialized dataset condensation technique with a dedicated accelerated loss function. Code is available at https://github.com/algebraicdianuj/DC_Unlearning.
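A minimal PyTorch sketch of Blend's core operation as described: a cluster of visually similar images is merged into one synthetic image through shared, learnable blend-weights. The clustering step and the weight training loop are omitted.

```python
import torch

def blend_cluster(images: torch.Tensor, blend_weights: torch.Tensor) -> torch.Tensor:
    """Merge a cluster of visually similar images into one condensed image.

    images: (k, C, H, W) cluster members; blend_weights: (k,) shared, learnable.
    Softmax makes the result a convex combination of the cluster members.
    """
    w = torch.softmax(blend_weights, dim=0)
    return (w.view(-1, 1, 1, 1) * images).sum(dim=0)  # shape: (C, H, W)
```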
[387] Rethinking Inductive Bias in Geographically Neural Network Weighted Regression
Zhenyuan Chen
Main category: cs.LG
TL;DR: This paper generalizes Geographically Neural Network Weighted Regression (GNNWR) by incorporating inductive biases from CNNs, RNNs, and transformers to better capture spatial non-stationarity, showing superior performance over classic methods on synthetic datasets with varying characteristics.
Details
Motivation: Current GNNWR approaches have limitations in modeling spatial non-stationarity due to fixed distance-based schemes and limited inductive bias. The authors aim to improve spatial regression models by incorporating better inductive biases that can learn from limited data and capture complex spatial patterns more effectively.
Method: The authors generalize GNNWR by incorporating concepts from three neural network architectures: (1) local receptive fields from convolutional neural networks, (2) sequential context from recurrent neural networks, and (3) self-attention mechanisms from transformers. This integration aims to enhance the spatial weighting functions beyond traditional distance-based approaches.
Result: Through extensive benchmarking on synthetic spatial datasets with varying heterogeneity, noise, and sample sizes, the generalized GNNWR outperforms classic methods in capturing nonlinear and complex spatial relationships. The results show that model performance strongly depends on data characteristics - local models excel with highly heterogeneous or small-sample scenarios, while global models perform better with larger, more homogeneous datasets.
Conclusion: The study demonstrates the critical importance of inductive bias in spatial modeling and suggests that incorporating neural network concepts into GNNWR significantly improves performance. The findings point toward future research directions including learnable spatial weighting functions, hybrid neural architectures, and improved interpretability for handling non-stationary spatial data.
Abstract: Inductive bias is a key factor in spatial regression models, determining how well a model can learn from limited data and capture spatial patterns. This work revisits the inductive biases in Geographically Neural Network Weighted Regression (GNNWR) and identifies limitations in current approaches for modeling spatial non-stationarity. While GNNWR extends traditional Geographically Weighted Regression by using neural networks to learn spatial weighting functions, existing implementations are often restricted by fixed distance-based schemes and limited inductive bias. We propose to generalize GNNWR by incorporating concepts from convolutional neural networks, recurrent neural networks, and transformers, introducing local receptive fields, sequential context, and self-attention into spatial regression. Through extensive benchmarking on synthetic spatial datasets with varying heterogeneity, noise, and sample sizes, we show that GNNWR outperforms classic methods in capturing nonlinear and complex spatial relationships. Our results also reveal that model performance depends strongly on data characteristics, with local models excelling in highly heterogeneous or small-sample scenarios, and global models performing better with larger, more homogeneous data. These findings highlight the importance of inductive bias in spatial modeling and suggest future directions, including learnable spatial weighting functions, hybrid neural architectures, and improved interpretability for models handling non-stationary spatial data.
[388] T-GRAB: A Synthetic Diagnostic Benchmark for Learning on Temporal Graphs
Alireza Dizaji, Benedict Aaron Tjandra, Mehrab Hamidi, Shenyang Huang, Guillaume Rabusseau
Main category: cs.LG
TL;DR: This paper introduces T-GRAB, a synthetic benchmark designed to systematically evaluate Temporal Graph Neural Networks’ ability to capture core temporal patterns like periodicity, causality, and long-range dependencies, revealing fundamental limitations in current models.
Details
Motivation: Despite extensive benchmarking efforts in dynamic graph learning, it remains unclear whether current Temporal Graph Neural Networks effectively capture essential temporal patterns such as periodicity, cause-and-effect relationships, and long-range dependencies in evolving relational data.
Method: The authors develop T-GRAB (Temporal Graph Reasoning Benchmark), a comprehensive set of controlled, interpretable synthetic tasks that systematically probe TGNNs' temporal reasoning capabilities by isolating key skills: counting/memorizing periodic repetitions, inferring delayed causal effects, and capturing long-range dependencies across spatial and temporal dimensions.
Result: Evaluation of 11 temporal graph learning methods on T-GRAB tasks reveals fundamental shortcomings in their ability to generalize temporal patterns and highlights challenges that are hidden by traditional real-world benchmarks.
Conclusion: Current TGNNs have significant limitations in temporal reasoning capabilities, and the findings provide actionable insights into these limitations while motivating the development of architectures with stronger temporal reasoning abilities.
Abstract: Dynamic graph learning methods have recently emerged as powerful tools for modelling relational data evolving through time. However, despite extensive benchmarking efforts, it remains unclear whether current Temporal Graph Neural Networks (TGNNs) effectively capture core temporal patterns such as periodicity, cause-and-effect, and long-range dependencies. In this work, we introduce the Temporal Graph Reasoning Benchmark (T-GRAB), a comprehensive set of synthetic tasks designed to systematically probe the capabilities of TGNNs to reason across time. T-GRAB provides controlled, interpretable tasks that isolate key temporal skills: counting/memorizing periodic repetitions, inferring delayed causal effects, and capturing long-range dependencies over both spatial and temporal dimensions. We evaluate 11 temporal graph learning methods on these tasks, revealing fundamental shortcomings in their ability to generalize temporal patterns. Our findings offer actionable insights into the limitations of current models, highlight challenges hidden by traditional real-world benchmarks, and motivate the development of architectures with stronger temporal reasoning abilities. The code for T-GRAB can be found at: https://github.com/alirezadizaji/T-GRAB.
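As a toy illustration of the kind of controlled task T-GRAB contains (our construction, not necessarily the benchmark's exact generator), a periodicity task can be produced by cycling a fixed pattern of edge sets; the model's job is to predict the next snapshot from the history.

```python
def periodic_edge_stream(pattern, num_steps):
    """Generate a toy periodicity task: the snapshot at time t is the edge set
    pattern[t % period], so a model must count/memorize the repetition."""
    return [pattern[t % len(pattern)] for t in range(num_steps)]

# Example: a period-2 temporal graph on 3 nodes.
stream = periodic_edge_stream([{(0, 1)}, {(1, 2), (2, 0)}], num_steps=8)
```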
[389] Tri-Learn Graph Fusion Network for Attributed Graph Clustering
Binxiong Li, Xu Xiang, Xue Li, Binyu Zhao, Heyang Gao, Qinyu Zhao
Main category: cs.LG
TL;DR: This paper proposes Tri-GFN, a novel deep clustering framework that combines GCN, Autoencoder, and Graph Transformer to address over-smoothing and over-compression issues in graph clustering, achieving significant performance improvements on multiple datasets.
Details
Motivation: Graph Convolutional Networks face challenges with over-smoothing and over-compression when handling large-scale and complex graph datasets, leading to declined clustering quality. Graph Transformers have limited performance on heterogeneous graph data, creating a need for better approaches to graph clustering.
Method: The paper introduces the Tri-Learn Graph Fusion Network (Tri-GFN), which integrates three components: GCN, Autoencoder, and Graph Transformer. The framework uses a tri-learning mechanism for mutual learning among modules and employs a triple-channel enhancement module for feature fusion, maximizing the use of both node attributes and topological structures.
Result: Tri-GFN achieves substantial accuracy improvements over state-of-the-art methods: approximately 0.87% on ACM dataset, 14.14% on Reuters dataset, and 7.58% on USPS dataset. The model demonstrates robust clustering representation and captures complex relationships effectively.
Conclusion: The proposed Tri-GFN framework successfully addresses the limitations of existing graph clustering methods by combining multiple architectures with innovative fusion strategies. Its outstanding performance, particularly on the Reuters dataset, makes it applicable to automatic news classification, topic retrieval, and related applications.
Abstract: In recent years, models based on Graph Convolutional Networks (GCN) have made significant strides in the field of graph data analysis. However, challenges such as over-smoothing and over-compression remain when handling large-scale and complex graph datasets, leading to a decline in clustering quality. Although the Graph Transformer architecture has mitigated some of these issues, its performance is still limited when processing heterogeneous graph data. To address these challenges, this study proposes a novel deep clustering framework comprising GCN, Autoencoder (AE), and Graph Transformer, termed the Tri-Learn Graph Fusion Network (Tri-GFN). This framework enhances the differentiation and consistency of global and local information through a unique tri-learning mechanism and feature fusion enhancement strategy. The framework integrates GCN, AE, and Graph Transformer modules. These components are meticulously fused by a triple-channel enhancement module, which maximizes the use of both node attributes and topological structures, ensuring robust clustering representation. The tri-learning mechanism allows mutual learning among these modules, while the feature fusion strategy enables the model to capture complex relationships, yielding highly discriminative representations for graph clustering. It surpasses many state-of-the-art methods, achieving accuracy improvements of approximately 0.87% on the ACM dataset, 14.14% on the Reuters dataset, and 7.58% on the USPS dataset. Due to its outstanding performance on the Reuters dataset, Tri-GFN can be applied to automatic news classification, topic retrieval, and related fields.
[390] MolPIF: A Parameter Interpolation Flow Model for Molecule Generation
Yaowei Jin, Junjie Wang, Wenkai Xiang, Duanhua Cao, Dan Teng, Zhehuan Fan, Jiacheng Xiong, Xia Sheng, Chuanlong Zeng, Duo An, Mingyue Zheng, Shuangjia Zheng, Qian Shi
Main category: cs.LG
TL;DR: This paper proposes Parameter Interpolation Flow (PIF), a novel generative model for molecular generation that operates in parameter space, offering a more flexible alternative to Bayesian Flow Networks for drug discovery applications.
Details
Motivation: Bayesian Flow Networks (BFNs) show promise in molecular generation but have limitations in designing flexible distribution transformation pathways due to their Bayesian inference-based strategy, making it difficult to adapt to diverse data distributions and task requirements. The potential for simpler, more efficient parameter-space-based models remains unexplored.
Method: The authors propose Parameter Interpolation Flow (PIF), a novel parameter-space-based generative model with detailed theoretical foundation, training, and inference procedures. They develop MolPIF specifically for structure-based drug design, which operates by modeling in parameter space rather than using Bayesian inference strategies.
Result: MolPIF demonstrates superior performance across diverse metrics compared to baseline methods in structure-based drug design tasks, validating the effectiveness of the parameter-space-based generative modeling approach for molecular generation.
Conclusion: The work validates that parameter-space-based generative modeling is effective for molecular generation and offers new perspectives for model design in drug discovery. PIF provides a more flexible alternative to BFNs while maintaining the benefits of low-variance parameter space modeling.
Abstract: Advances in deep learning for molecular generation show promise in accelerating drug discovery. Bayesian Flow Networks (BFNs) have recently shown impressive performance across diverse chemical tasks, with their success often ascribed to the paradigm of modeling in a low-variance parameter space. However, the Bayesian inference-based strategy imposes limitations on designing more flexible distribution transformation pathways, making it challenging to adapt to diverse data distributions and varied task requirements. Furthermore, the potential for simpler, more efficient parameter-space-based models is unexplored. To address this, we propose a novel Parameter Interpolation Flow model (named PIF) with a detailed theoretical foundation and training and inference procedures. We then develop MolPIF for structure-based drug design, demonstrating its superior performance across diverse metrics compared to baselines. This work validates the effectiveness of the parameter-space-based generative modeling paradigm for molecules and offers new perspectives for model design.
[391] IPPRO: Importance-based Pruning with PRojective Offset for Magnitude-indifferent Structural Pruning
Jaeheun Jung, Jaehyuk Lee, Yeajin Lee, Donghun Lee
Main category: cs.LG
TL;DR: This paper proposes IPPRO, a novel neural network pruning method that uses projective space to make pruning decisions independent of filter magnitudes, challenging the “bigger is more important” assumption in existing pruning methods.
Details
Motivation: Existing importance-based pruning methods are dominated by magnitude criteria, where filters with larger magnitudes are unlikely to be pruned even if they are redundant, limiting the effectiveness of pruning decisions and creating bias in the pruning process.
Method: The authors propose placing filters in projective space to eliminate magnitude bias and introduce PROscore, which measures filter importance by observing gradient descent movement toward the origin. This creates IPPRO, a magnitude-indifferent importance-based structured pruning approach.
Result: The proposed method achieves near-lossless pruning with reduced performance drop compared to existing methods, and shows promising performance after finetuning, demonstrating that magnitude-independent pruning can be more effective.
Conclusion: The work successfully debunks the “size-matters” myth in neural network pruning and expands the theoretical and empirical frontier of importance-based pruning by showing that magnitude-indifferent approaches can achieve better pruning results.
Abstract: With the growing demand for neural network compression, structured pruning methods, including importance-based approaches, are being actively studied. Magnitude importance and the many modern importance criteria correlated with it often limit the scope of pruning decisions, since filters with larger magnitudes are unlikely to be pruned before smaller ones, even when redundant. In this paper, we propose a novel pruning strategy that challenges this dominating effect of magnitude and gives each filter a fair chance to be pruned, by placing filters in projective space. We then observe whether gradient descent moves a filter toward the origin, as a measure of how likely the filter is to be pruned. This measurement is used to construct PROscore, a novel importance score for IPPRO, a magnitude-indifferent importance-based structured pruning method. Our evaluation results show that the proposed importance criterion using projective space achieves near-lossless pruning by reducing the performance drop from pruning, with promising performance after finetuning. Our work debunks the “size-matters” myth in pruning and expands the frontier of importance-based pruning both theoretically and empirically.
[392] Feature Construction Using Network Control Theory and Rank Encoding for Graph Machine Learning
Anwar Said, Yifan Wei, Obaid Ullah Ahmad, Mudassir Shabbir, Waseem Abbas, Xenofon Koutsoukos
Main category: cs.LG
TL;DR: This paper proposes using average controllability and a novel rank encoding method to create better node features for Graph Neural Networks (GNNs) in social network classification, addressing the problem of missing node features due to privacy constraints.
Details
Motivation: Social networks often lack node features due to privacy constraints or absence of inherent attributes, which limits GNN performance since GNNs require expressive node features to function effectively in network-based learning applications.
Method: The authors propose two strategies: (1) NCT-EFA - incorporating average controllability along with other centrality metrics as node-level features that capture network topology, and (2) a rank encoding method that transforms average controllability or other graph-theoretic metrics into a fixed-dimensional feature space for improved representation.
Result: Extensive evaluation on six benchmark GNN models across four social network datasets shows that incorporating average controllability significantly improves GNN performance. The rank encoding method outperforms traditional one-hot degree encoding, improving ROC AUC from 68.7% to 73.9% using GraphSAGE on the GitHub Stargazers dataset.
Conclusion: The proposed average controllability-based features and rank encoding method effectively address the challenge of missing node features in social networks, providing a practical solution that significantly enhances GNN performance in social network classification tasks through more expressive and efficient node representations.
Abstract: In this article, we utilize the concept of average controllability in graphs, along with a novel rank encoding method, to enhance the performance of Graph Neural Networks (GNNs) in social network classification tasks. GNNs have proven highly effective in various network-based learning applications and require some form of node features to function. However, their performance is heavily influenced by the expressiveness of these features. In social networks, node features are often unavailable due to privacy constraints or the absence of inherent attributes, making it challenging for GNNs to achieve optimal performance. To address this limitation, we propose two strategies for constructing expressive node features. First, we introduce average controllability along with other centrality metrics (denoted as NCT-EFA) as node-level metrics that capture critical aspects of network topology. Building on this, we develop a rank encoding method that transforms average controllability or any other graph-theoretic metric into a fixed-dimensional feature space, thereby improving feature representation. We conduct extensive numerical evaluations using six benchmark GNN models across four social network datasets to compare different node feature construction methods. Our results demonstrate that incorporating average controllability into the feature space significantly improves GNN performance. Moreover, the proposed rank encoding method outperforms traditional one-hot degree encoding, improving the ROC AUC from 68.7% to 73.9% using GraphSAGE on the GitHub Stargazers dataset, underscoring its effectiveness in generating expressive and efficient node representations.
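A minimal NumPy sketch of the rank-encoding idea as described: a node-level metric such as average controllability is converted to a fixed-dimensional one-hot feature by binning each node's rank (the bin count `dim` is illustrative).

```python
import numpy as np

def rank_encode(metric_values: np.ndarray, dim: int = 16) -> np.ndarray:
    """Turn a node-level metric (e.g., average controllability) into a
    fixed-dimensional one-hot feature by binning each node's rank."""
    n = len(metric_values)
    ranks = np.argsort(np.argsort(metric_values))  # 0 = smallest metric value
    bins = np.clip(ranks * dim // n, 0, dim - 1)
    feats = np.zeros((n, dim))
    feats[np.arange(n), bins] = 1.0
    return feats
```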
[393] FedMultiEmo: Real-Time Emotion Recognition via Multimodal Federated Learning
Baran Can Gül, Suraksha Nadig, Stefanos Tziampazis, Nasser Jazdi, Michael Weyrich
Main category: cs.LG
TL;DR: FedMultiEmo is a privacy-preserving federated learning framework that combines visual and physiological data for in-vehicle emotion recognition, achieving 87% accuracy while keeping sensitive data local and addressing practical deployment challenges in automotive settings.
Details
Motivation: In-vehicle emotion recognition faces three major challenges: (1) modality fragility, where poor lighting and occlusions degrade vision-based methods; (2) physiological variability, where heart-rate and skin-conductance patterns differ across individuals; and (3) privacy risks from centralized training, which requires transmission of sensitive data. These issues hinder practical deployment of adaptive driver-assistance systems.
Method: FedMultiEmo uses a multimodal federated learning approach that fuses visual features (extracted by a CNN from facial images) and physiological cues (heart rate, electrodermal activity, and skin temperature, classified by a Random Forest) at the decision level using majority-vote fusion. The system implements personalized Federated Averaging that weights client updates by local data volume and operates on an edge-to-cloud prototype with Raspberry Pi clients and a Flower server.
Result: The federated CNN achieved 77% accuracy, Random Forest 74%, and their fusion reached 87% accuracy, matching centralized baseline performance while maintaining data privacy. The system converged in 18 rounds with average round time of 120 seconds and per-client memory footprint below 200 MB, demonstrating practical feasibility for real-time deployment.
Conclusion: FedMultiEmo successfully addresses the key challenges of in-vehicle emotion recognition by providing a practical, privacy-preserving solution that maintains high accuracy while keeping sensitive data local, making it suitable for real-time automotive applications and adaptive driver-assistance systems.
Abstract: In-vehicle emotion recognition underpins adaptive driver-assistance systems and, ultimately, occupant safety. However, practical deployment is hindered by (i) modality fragility - poor lighting and occlusions degrade vision-based methods; (ii) physiological variability - heart-rate and skin-conductance patterns differ across individuals; and (iii) privacy risk - centralized training requires transmission of sensitive data. To address these challenges, we present FedMultiEmo, a privacy-preserving framework that fuses two complementary modalities at the decision level: visual features extracted by a Convolutional Neural Network from facial images, and physiological cues (heart rate, electrodermal activity, and skin temperature) classified by a Random Forest. FedMultiEmo builds on three key elements: (1) a multimodal federated learning pipeline with majority-vote fusion, (2) an end-to-end edge-to-cloud prototype on Raspberry Pi clients and a Flower server, and (3) a personalized Federated Averaging scheme that weights client updates by local data volume. Evaluated on FER2013 and a custom physiological dataset, the federated Convolutional Neural Network attains 77% accuracy, the Random Forest 74%, and their fusion 87%, matching a centralized baseline while keeping all raw data local. The developed system converges in 18 rounds, with an average round time of 120 seconds and a per-client memory footprint below 200 MB. These results indicate that FedMultiEmo offers a practical approach to real-time, privacy-aware emotion recognition in automotive settings.
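A minimal sketch of decision-level fusion for the two modality classifiers. With two voters, a majority vote reduces to agreement, so the confidence-based tie-break below is our assumption, not a detail from the paper.

```python
def fuse_decisions(cnn_label, rf_label, cnn_conf, rf_conf):
    """Fuse the vision (CNN) and physiological (Random Forest) predictions:
    agreement wins outright; on disagreement, defer to the more confident
    classifier (the tie-break rule here is an illustrative assumption)."""
    if cnn_label == rf_label:
        return cnn_label
    return cnn_label if cnn_conf >= rf_conf else rf_label
```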
cs.MA
[394] COMPASS: Cooperative Multi-Agent Persistent Monitoring using Spatio-Temporal Attention Network
Xingjian Zhang, Yizhuo Wang, Guillaume Sartoretti
Main category: cs.MA
TL;DR: COMPASS is a multi-agent reinforcement learning framework that uses decentralized agents with spatio-temporal attention networks and Gaussian Process target modeling to efficiently monitor multiple moving targets in dynamic environments.
Details
Motivation: Real-world applications like disaster response, environmental sensing, and wildlife conservation require persistent monitoring of dynamic targets by mobile agents under uncertainty, necessitating efficient coordination and continuous information gathering.
Method: The framework models environments as graphs with spatial nodes and topological edges, employs decentralized agents using shared spatio-temporal attention networks for action selection, models target dynamics with Gaussian Processes for uncertainty-aware planning (see the sketch after the abstract), and trains using centralized value estimation with decentralized policy execution under adaptive rewards.
Result: COMPASS consistently outperforms strong baselines across dynamic multi-target scenarios in three key metrics: uncertainty reduction, target coverage, and coordination efficiency.
Conclusion: The proposed COMPASS framework successfully addresses persistent multi-target monitoring challenges by combining graph-based spatial reasoning, attention-based coordination, and probabilistic target modeling to achieve superior performance in dynamic environments.
Abstract: Persistent monitoring of dynamic targets is essential in real-world applications such as disaster response, environmental sensing, and wildlife conservation, where mobile agents must continuously gather information under uncertainty. We propose COMPASS, a multi-agent reinforcement learning (MARL) framework that enables decentralized agents to persistently monitor multiple moving targets efficiently. We model the environment as a graph, where nodes represent spatial locations and edges capture topological proximity, allowing agents to reason over structured layouts and revisit informative regions as needed. Each agent independently selects actions based on a shared spatio-temporal attention network that we design to integrate historical observations and spatial context. We model target dynamics using Gaussian Processes (GPs), which support principled belief updates and enable uncertainty-aware planning. We train COMPASS using centralized value estimation and decentralized policy execution under an adaptive reward setting. Our extensive experiments demonstrate that COMPASS consistently outperforms strong baselines in uncertainty reduction, target coverage, and coordination efficiency across dynamic multi-target scenarios.
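The GP-based target belief is easy to illustrate with scikit-learn; this generic 1D sketch predicts a target coordinate with uncertainty (the kernel and data here are assumptions, not the paper's setup).

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

# Sparse, time-stamped observations of one target's x-coordinate.
t_obs = np.array([[0.0], [1.0], [2.5], [4.0]])  # observation times
x_obs = np.array([0.0, 0.8, 2.1, 3.9])          # observed positions

gp = GaussianProcessRegressor(kernel=RBF() + WhiteKernel(), normalize_y=True)
gp.fit(t_obs, x_obs)

# Posterior mean and std over a horizon: high-sigma regions are the
# ones a monitoring agent should revisit first.
t_query = np.linspace(0.0, 6.0, 25).reshape(-1, 1)
mean, sigma = gp.predict(t_query, return_std=True)
print(sigma.round(2))  # uncertainty grows past the last observation
```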
[395] Smooth Games of Configuration in the Linear-Quadratic Setting
Jesse Milzman, Jeffrey Mao, Giuseppe Loianno
Main category: cs.MA
TL;DR: This paper introduces “game of configuration” - a two-stage framework where agents strategically choose configuration parameters in the first stage that affect their dynamics and costs in a subsequent differential game, providing methods to compute equilibrium solutions.
Details
Motivation: Existing dynamic game theory lacks research on strategic parametrization where each agent's configuration choice is influenced by others' decisions. While parametrization of dynamic games exists, the strategic perspective of configuration choice remains largely unexplored.
Method: The authors define a two-stage game within finite-horizon, affine-quadratic (AQ) differential games. They provide a subgame perfect solution concept, develop methods for computing first-stage cost gradients over configuration space, and formulate gradient-based algorithms for finding local solutions to configuration games (a toy sketch follows the abstract).
Result: The framework successfully provides necessary conditions for equilibrium configurations and their downstream trajectories. The gradient-based method effectively searches for local solutions in the configuration game, demonstrated through AQ system examples including both zero-sum and general-sum scenarios.
Conclusion: The game of configuration framework offers an effective approach for strategic fine-tuning of differential games, enabling agents to optimally choose configuration parameters while considering other agents’ strategic decisions, with demonstrated effectiveness in various AQ system applications.
Abstract: Dynamic game theory offers a toolbox for formalizing and solving for both cooperative and non-cooperative strategies in multi-agent scenarios. However, the optimal configuration of such games remains largely unexplored. While there is existing literature on the parametrization of dynamic games, little research examines this parametrization from a strategic perspective where each agent’s configuration choice is influenced by the decisions of others. In this work, we introduce the concept of a game of configuration, providing a framework for the strategic fine-tuning of differential games. We define a game of configuration as a two-stage game within the setting of finite-horizon, affine-quadratic (AQ) differential games. In the first stage, each player chooses their corresponding configuration parameter, which will impact their dynamics and costs in the second stage. We provide the subgame perfect solution concept and a method for computing first-stage cost gradients over the configuration space. This then allows us to formulate a gradient-based method for searching for local solutions to the configuration game, as well as provide necessary conditions for equilibrium configurations over their downstream (second stage) trajectories. We conclude by demonstrating the effectiveness of our approach in example AQ systems, both zero-sum and general-sum.
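As a toy illustration of the first-stage search, the sketch below runs simultaneous gradient play on hypothetical configuration costs; `stage2_cost` is a stand-in for the equilibrium cost of the downstream AQ game, which the paper computes via its gradient method.

```python
import numpy as np

def stage2_cost(theta, i):
    """Toy stand-in for player i's equilibrium cost of the downstream
    differential game, as a function of both configurations theta."""
    return 1.0 + 1.5 * theta[i] ** 2 - 0.3 * theta[0] * theta[1]

def num_grad(f, theta, i, eps=1e-5):
    e = np.zeros_like(theta)
    e[i] = eps
    return (f(theta + e, i) - f(theta - e, i)) / (2 * eps)

theta = np.array([1.0, -1.0])           # initial configurations
for _ in range(200):                    # simultaneous gradient play
    g = np.array([num_grad(stage2_cost, theta, i) for i in range(2)])
    theta = theta - 0.05 * g
print(theta)  # candidate local equilibrium configuration (near the origin)
```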
[396] Heterogeneous Mixed Traffic Control and Coordination
Iftekharul Islam, Weizi Li, Xuan Wang, Shuai Li, Kevin Heaslip
Main category: cs.MA
TL;DR: This study demonstrates that robot vehicles (RVs) can significantly improve traffic flow at urban intersections with mixed vehicle types, reducing waiting times by up to 91% compared to traditional methods while maintaining lower environmental impact than signalized traffic.
Details
Motivation: Urban intersections with diverse vehicle types (from small cars to large semi-trailers) create significant traffic control challenges, especially at unsignalized intersections during power outages where traditional traffic management methods fail to handle heterogeneous traffic effectively.
Method: The researchers used reinforcement learning (RL) algorithms with real-world data to simulate mixed traffic scenarios at complex intersections, testing various robot vehicle penetration rates from 10% to 90% to evaluate performance improvements.
Result: Average waiting times decreased by up to 86% compared to signalized intersections and 91% compared to unsignalized intersections. A “rarity advantage” was observed where less frequent vehicle types benefited most (up to 87% improvement). CO2 emissions and fuel consumption increased with RV penetration but remained below traditional signalized traffic levels. Space headways decreased, indicating more efficient road usage.
Conclusion: Robot vehicles show significant potential to improve traffic efficiency and reduce environmental impact in complex, heterogeneous urban intersection settings, offering substantial performance gains over both signalized and unsignalized traditional traffic control methods.
Abstract: Urban intersections with diverse vehicle types, from small cars to large semi-trailers, pose significant challenges for traffic control. This study explores how robot vehicles (RVs) can enhance heterogeneous traffic flow, particularly at unsignalized intersections where traditional methods fail during power outages. Using reinforcement learning (RL) and real-world data, we simulate mixed traffic at complex intersections with RV penetration rates ranging from 10% to 90%. Results show that average waiting times drop by up to 86% and 91% compared to signalized and unsignalized intersections, respectively. We observe a “rarity advantage,” where less frequent vehicles benefit the most (up to 87%). Although CO2 emissions and fuel consumption increase with RV penetration, they remain well below those of traditional signalized traffic. Decreased space headways also indicate more efficient road usage. These findings highlight RVs’ potential to improve traffic efficiency and reduce environmental impact in complex, heterogeneous settings.
[397] From homeostasis to resource sharing: Biologically and economically aligned multi-objective multi-agent AI safety benchmarks
Roland Pihlakas, Joel Pyykkö
Main category: cs.MA
TL;DR: This paper introduces new AI safety benchmarks inspired by biology and economics, focusing on multi-objective, multi-agent scenarios that test homeostasis, diminishing returns, sustainability, and resource sharing to identify key pitfalls in agentic AI systems.
Details
Motivation: Existing AI safety benchmarks neglect crucial themes from biology and economics - fundamental sciences that describe human needs and preferences. Current benchmarks fail to address important alignment challenges related to homeostatic objectives, resource management, and multi-agent interactions that are essential for developing safe, aligned agentic AI systems.
Method: The authors developed eight benchmark environments based on biologically and economically motivated themes, specifically focusing on: (1) homeostasis for bounded and biological objectives (see the sketch after the abstract), (2) diminishing returns for unbounded instrumental and business objectives, (3) sustainability principles, and (4) resource sharing in multi-objective, multi-agent settings.
Result: The benchmarks successfully illustrate key pitfalls and challenges in agentic AI systems, including: unbounded maximization of homeostatic objectives, over-optimization of single objectives at the expense of others, neglecting safety constraints, and depletion of shared resources.
Conclusion: The paper establishes a new set of biologically and economically grounded benchmarks that reveal critical alignment challenges in agentic AI systems, providing a more comprehensive framework for testing AI safety that incorporates fundamental principles from biology and economics previously overlooked in mainstream AI safety discussions.
Abstract: Developing safe, aligned agentic AI systems requires comprehensive empirical testing, yet many existing benchmarks neglect crucial themes aligned with biology and economics, both time-tested fundamental sciences describing our needs and preferences. To address this gap, the present work introduces biologically and economically motivated themes that have been neglected in current mainstream discussions on AI safety - namely a set of multi-objective, multi-agent alignment benchmarks that emphasize homeostasis for bounded and biological objectives, diminishing returns for unbounded, instrumental, and business objectives, the sustainability principle, and resource sharing. We implemented eight main benchmark environments on the above themes to illustrate key pitfalls and challenges in agentic AIs, such as unboundedly maximizing a homeostatic objective, over-optimizing one objective at the expense of others, neglecting safety constraints, or depleting shared resources.
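The homeostasis theme is easy to state in code: reward peaks at a setpoint and falls off on both sides, so unbounded maximization is penalized. This reward shape is a generic illustration of the benchmark theme, not the environments' actual implementation.

```python
def homeostatic_reward(level, setpoint, tolerance=1.0):
    """Reward is highest at the setpoint and decreases on BOTH sides,
    so 'more' is not better, unlike an unbounded objective that a
    misaligned agent would push without limit."""
    return -((level - setpoint) / tolerance) ** 2

# An agent that keeps consuming past satiation is penalized:
print(homeostatic_reward(5.0, setpoint=5.0))  # 0.0   (at setpoint)
print(homeostatic_reward(9.0, setpoint=5.0))  # -16.0 (over-consumption)
```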
cs.MM
[398] Knowledge-aware Diffusion-Enhanced Multimedia Recommendation
Xian Mo, Fei Liu, Rui Tang, Jintao Gao, Hao Liu
Main category: cs.MM
TL;DR: The paper proposes KDiffE, a knowledge-aware diffusion-enhanced architecture using contrastive learning for multimedia recommendations that combines attention-aware graph neural networks with guided diffusion models to generate task-relevant knowledge graphs for better recommendation performance.
Details
Motivation: Multimedia recommendations need to leverage rich multimedia content to enhance user-item interaction information, revealing both content relatedness among items and finer-grained user preferences, but existing methods may not effectively utilize knowledge graphs with reduced noise for semantic enhancement.
Method: The method consists of two main components: (1) an attention-aware matrix built from user-item graphs using a random walk with restart strategy (see the sketch after the abstract), integrated into graph neural networks for main view construction, and (2) a guided diffusion model that uses user embeddings connected to items to generate strongly task-relevant knowledge graphs with reduced noise for knowledge-aware contrastive view construction.
Result: Comprehensive experiments on three multimedia datasets demonstrate the effectiveness of KDiffE and its components compared to various state-of-the-art methods, showing improved recommendation performance.
Conclusion: The proposed KDiffE architecture successfully combines attention-aware graph neural networks with guided diffusion models to enhance multimedia recommendations by generating cleaner, more task-relevant knowledge graphs that better capture item semantic information and user preferences.
Abstract: Multimedia recommendations aim to use rich multimedia content to enhance historical user-item interaction information, which can not only indicate the content relatedness among items but also reveal finer-grained preferences of users. In this paper, we propose a Knowledge-aware Diffusion-Enhanced architecture using contrastive learning paradigms (KDiffE) for multimedia recommendations. Specifically, we first utilize original user-item graphs to build an attention-aware matrix into graph neural networks, which can learn the importance between users and items for main view construction. The attention-aware matrix is constructed by adopting a random walk with a restart strategy, which can preserve the importance between users and items to generate aggregation of attention-aware node features. Then, we propose a guided diffusion model to generate strongly task-relevant knowledge graphs with less noise for constructing a knowledge-aware contrastive view, which utilizes user embeddings with an edge connected to an item to guide the generation of strongly task-relevant knowledge graphs for enhancing the item’s semantic information. We perform comprehensive experiments on three multimedia datasets that reveal the effectiveness of our KDiffE and its components on various state-of-the-art methods. Our source code is available at https://github.com/1453216158/KDiffE.
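The attention-aware matrix rests on random walk with restart (RWR); below is a generic power-iteration sketch of RWR over a small user-item adjacency matrix (restart probability and graph are illustrative).

```python
import numpy as np

def rwr(adj, restart_node, alpha=0.15, iters=100):
    """Random walk with restart: stationary visiting probabilities
    relative to restart_node, usable as importance scores between
    users and items on the interaction graph."""
    col_sums = adj.sum(axis=0, keepdims=True)
    P = adj / np.where(col_sums == 0, 1.0, col_sums)  # column-stochastic
    r = np.zeros(adj.shape[0])
    r[restart_node] = 1.0
    p = r.copy()
    for _ in range(iters):
        p = (1 - alpha) * P @ p + alpha * r           # walk or restart
    return p

adj = np.array([[0, 1, 1, 0],
                [1, 0, 1, 0],
                [1, 1, 0, 1],
                [0, 0, 1, 0]], dtype=float)
print(rwr(adj, restart_node=0).round(3))  # importance scores w.r.t. node 0
```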
[399] Music-Aligned Holistic 3D Dance Generation via Hierarchical Motion Modeling
Xiaojie Li, Ronghui Li, Shukai Fang, Shuzhao Xie, Xiaoyang Guo, Jiaqing Zhou, Junkun Peng, Zhi Wang
Main category: cs.MM
TL;DR: SoulDance introduces a high-precision motion capture dataset and SoulNet framework to generate music-aligned holistic 3D dance sequences that coordinate body, hands, and face movements for enhanced emotional expressiveness.
Details
Motivation: Generating well-coordinated, music-aligned holistic dance is challenging due to: (1) scarcity of holistic 3D dance datasets, (2) difficulty achieving cross-modal alignment between music and dance, and (3) complexity of modeling interdependent motion across body, hands, and face for emotional expressiveness and audience engagement.
Method: The SoulNet framework has three components: (1) Hierarchical Residual Vector Quantization for modeling complex motion dependencies across body parts (see the sketch after the abstract), (2) Music-Aligned Generative Model for composing hierarchical motion units into coordinated dance, and (3) Music-Motion Retrieval Module as a pre-trained cross-modal alignment prior for temporal and semantic coherence.
Result: SoulNet significantly surpasses existing approaches in generating high-quality, music-coordinated, and well-aligned holistic 3D dance sequences, as demonstrated through extensive experiments on the SoulDance dataset.
Conclusion: The paper successfully addresses holistic dance generation challenges by introducing a comprehensive dataset and novel framework that achieves superior performance in creating music-aligned, kinematically coordinated dance sequences involving body, hands, and face movements.
Abstract: Well-coordinated, music-aligned holistic dance enhances emotional expressiveness and audience engagement. However, generating such dances remains challenging due to the scarcity of holistic 3D dance datasets, the difficulty of achieving cross-modal alignment between music and dance, and the complexity of modeling interdependent motion across the body, hands, and face. To address these challenges, we introduce SoulDance, a high-precision music-dance paired dataset captured via professional motion capture systems, featuring meticulously annotated holistic dance movements. Building on this dataset, we propose SoulNet, a framework designed to generate music-aligned, kinematically coordinated holistic dance sequences. SoulNet consists of three principal components: (1) Hierarchical Residual Vector Quantization, which models complex, fine-grained motion dependencies across the body, hands, and face; (2) Music-Aligned Generative Model, which composes these hierarchical motion units into expressive and coordinated holistic dance; (3) Music-Motion Retrieval Module, a pre-trained cross-modal model that functions as a music-dance alignment prior, ensuring temporal synchronization and semantic coherence between generated dance and input music throughout the generation process. Extensive experiments demonstrate that SoulNet significantly surpasses existing approaches in generating high-quality, music-coordinated, and well-aligned holistic 3D dance sequences.
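Hierarchical residual vector quantization follows a simple pattern: each level quantizes the residual left by the previous one, so coarse body motion is captured first and finer hand/face detail later. A minimal numpy sketch with arbitrary codebook sizes:

```python
import numpy as np

def residual_vq(x, codebooks):
    """Hierarchical residual quantization: each codebook quantizes what
    the previous levels left unexplained (coarse motion first, then
    progressively finer detail)."""
    residual, codes = x.copy(), []
    for cb in codebooks:                  # one codebook per hierarchy level
        idx = int(np.argmin(((residual[None, :] - cb) ** 2).sum(axis=-1)))
        codes.append(idx)
        residual = residual - cb[idx]     # pass the remainder down
    return codes, x - residual            # code indices, reconstruction

rng = np.random.default_rng(0)
codebooks = [rng.normal(size=(8, 4)) for _ in range(3)]  # 3 levels, dim 4
codes, recon = residual_vq(rng.normal(size=4), codebooks)
print(codes)
```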
eess.AS
[400] Distributed Asynchronous Device Speech Enhancement via Windowed Cross-Attention
Gene-Ping Yang, Sebastian Braun
Main category: eess.AS
TL;DR: This paper proposes a windowed cross-attention module to handle time-asynchronous microphone arrays in dynamic meeting environments, addressing the limitations of existing transform-average-concatenate (TAC) methods that assume time-synchronized setups.
Details
Motivation: Existing neural multi-microphone processing approaches assume time-synchronized microphone setups, but real-world meeting scenarios with personal devices have time latency and clock drift variations across devices. The popular TAC module is insufficient for handling these time-asynchronous conditions.
Method: The authors propose a windowed cross-attention module that can dynamically align features between all microphones (see the sketch after the abstract). The module is invariant to microphone permutation and number, and can be integrated into existing models. They also propose an optimal training target for multi-talker environments.
Result: Experimental evaluation in multi-microphone noisy reverberant setups with unknown time latency and clock drift showed that the proposed method outperforms TAC on both iFaSNet and CRUSE models, with faster convergence and improved learning performance.
Conclusion: The windowed cross-attention module effectively handles asynchronous microphone setups in dynamic meeting environments, demonstrating superior performance compared to existing TAC-based approaches and providing a practical solution for real-world ad-hoc microphone array applications.
Abstract: The increasing number of microphone-equipped personal devices offers great flexibility and potential using them as ad-hoc microphone arrays in dynamic meeting environments. However, most existing approaches are designed for time-synchronized microphone setups, a condition that may not hold in real-world meeting scenarios, where time latency and clock drift vary across devices. Under such conditions, we found transform-average-concatenate (TAC), a popular module for neural multi-microphone processing, insufficient in handling time-asynchronous microphones. In response, we propose a windowed cross-attention module capable of dynamically aligning features between all microphones. This module is invariant to both the permutation and the number of microphones and can be easily integrated into existing models. Furthermore, we propose an optimal training target for multi-talker environments. We evaluated our approach in a multi-microphone noisy reverberant setup with unknown time latency and clock drift of each microphone. Experimental results show that our method outperforms TAC on both iFaSNet and CRUSE models, offering faster convergence and improved learning, demonstrating the efficacy of the windowed cross-attention module for asynchronous microphone setups.
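The core idea is that each reference-channel frame attends only to a local window of another microphone's frames, which absorbs unknown latency and drift up to the window size. A single-head PyTorch sketch (shapes and window size are simplifications, not the paper's implementation):

```python
import torch
import torch.nn.functional as F

def windowed_cross_attention(ref, aux, win=16):
    """Align an auxiliary microphone's features to a reference channel.
    Each reference frame attends to aux frames within +/- win frames,
    absorbing unknown latency and clock drift up to the window size.
    ref, aux: (T, D) feature sequences."""
    T, D = ref.shape
    padded = F.pad(aux, (0, 0, win, win))         # pad the time axis
    windows = padded.unfold(0, 2 * win + 1, 1)    # (T, D, 2*win+1)
    windows = windows.permute(0, 2, 1)            # (T, 2*win+1, D)
    scores = torch.einsum('td,twd->tw', ref, windows) / D ** 0.5
    attn = scores.softmax(dim=-1)                 # where does frame t align?
    return torch.einsum('tw,twd->td', attn, windows)

aligned = windowed_cross_attention(torch.randn(100, 64), torch.randn(100, 64))
print(aligned.shape)  # torch.Size([100, 64])
```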
[401] An approach to measuring the performance of Automatic Speech Recognition (ASR) models in the context of Large Language Model (LLM) powered applications
Sujith Pulikodan, Sahapthan K, Prasanta Kumar Ghosh, Visruth Sanka, Nihar Desai
Main category: eess.AS
TL;DR: This paper proposes a new evaluation metric for ASR systems that considers how well Large Language Models can correct ASR errors, moving beyond traditional Word Error Rate (WER) metrics to better assess ASR performance in LLM-powered applications.
Details
Motivation: Traditional ASR evaluation using Word Error Rate (WER) may not adequately reflect the impact of different types of ASR errors in downstream applications that increasingly use Large Language Models. There is a need to understand how LLMs can correct ASR errors and to develop better evaluation metrics for ASR systems in LLM-powered applications.
Method: The authors analyze the capabilities of Large Language Models to correct errors introduced by Automatic Speech Recognition systems and develop a new evaluation measure specifically designed for assessing ASR performance in the context of LLM-powered applications (a sketch of the evaluation idea follows the abstract).
Result: The paper proposes a new metric for evaluating ASR performance that takes into account the error correction capabilities of LLMs, providing a more relevant assessment than traditional WER for modern applications that integrate ASR with large language models.
Conclusion: The study demonstrates that traditional ASR evaluation metrics like WER are insufficient for modern LLM-powered applications, and introduces a new evaluation framework that better reflects the real-world performance of ASR systems when used in conjunction with Large Language Models.
Abstract: Automatic Speech Recognition (ASR) plays a crucial role in human-machine interaction and serves as an interface for a wide range of applications. Traditionally, ASR performance has been evaluated using Word Error Rate (WER), a metric that quantifies the number of insertions, deletions, and substitutions in the generated transcriptions. However, with the increasing adoption of large and powerful Large Language Models (LLMs) as the core processing component in various applications, the significance of different types of ASR errors in downstream tasks warrants further exploration. In this work, we analyze the capabilities of LLMs to correct errors introduced by ASRs and propose a new measure to evaluate ASR performance for LLM-powered applications.
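A minimal sketch of the evaluation idea: score the same hypothesis before and after LLM post-correction. Here jiwer is used as a common WER implementation and `llm_correct` is a placeholder for an actual LLM call; neither is necessarily what the authors used.

```python
from jiwer import wer  # pip install jiwer; a common WER implementation

def llm_correct(hypothesis: str) -> str:
    # Placeholder: in practice, prompt an LLM to repair ASR errors.
    return hypothesis.replace("wreck a nice", "recognize")

reference = "how to recognize speech"
asr_output = "how to wreck a nice speech"

print("raw WER:      ", wer(reference, asr_output))               # 0.75
print("corrected WER:", wer(reference, llm_correct(asr_output)))  # 0.0
```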
[402] ISDrama: Immersive Spatial Drama Generation through Multimodal Prompting
Yu Zhang, Wenxiang Guo, Changhao Pan, Zhiyuan Zhu, Tao Jin, Zhou Zhao
Main category: eess.AS
TL;DR: This paper introduces the first multimodal immersive spatial drama generation system that creates continuous multi-speaker binaural speech with dramatic prosody for AR/VR applications, along with the first dataset MRSDrama and novel model ISDrama.
Details
Motivation: Immersive spatial drama generation for AR/VR applications requires simultaneous modeling of spatial information and dramatic prosody from multimodal inputs, but faces high data collection costs and a lack of existing solutions for this novel task.
Method: The authors propose ISDrama with two key components: 1) a Multimodal Pose Encoder using contrastive learning that considers Doppler effects to extract unified pose information, and 2) an Immersive Drama Transformer, a flow-based mamba-transformer with Drama-MOE for expert selection and enhanced prosody/pose control, plus a context-consistent classifier-free guidance strategy (sketched after the abstract).
Result: ISDrama outperforms baseline models on both objective and subjective metrics for multimodal immersive spatial drama generation, demonstrating superior performance in creating high-quality binaural drama content.
Conclusion: This work successfully addresses the novel challenge of multimodal immersive spatial drama generation by creating the first specialized dataset MRSDrama and developing ISDrama, an effective model that advances the field for AR/VR applications.
Abstract: Multimodal immersive spatial drama generation focuses on creating continuous multi-speaker binaural speech with dramatic prosody based on multimodal prompts, with potential applications in AR, VR, and others. This task requires simultaneous modeling of spatial information and dramatic prosody based on multimodal inputs, with high data collection costs. To the best of our knowledge, our work is the first attempt to address these challenges. We construct MRSDrama, the first multimodal recorded spatial drama dataset, containing binaural drama audios, scripts, videos, geometric poses, and textual prompts. Then, we propose ISDrama, the first immersive spatial drama generation model through multimodal prompting. ISDrama comprises these primary components: 1) Multimodal Pose Encoder, based on contrastive learning, considering the Doppler effect caused by moving speakers to extract unified pose information from multimodal prompts. 2) Immersive Drama Transformer, a flow-based mamba-transformer model that generates high-quality drama, incorporating Drama-MOE to select proper experts for enhanced prosody and pose control. We also design a context-consistent classifier-free guidance strategy to coherently generate complete drama. Experimental results show that ISDrama outperforms baseline models on objective and subjective metrics. The demos are available at https://aaronz345.github.io/ISDramaDemo. We provide the dataset and the evaluation code at https://huggingface.co/datasets/AaronZ345/MRSDrama and https://github.com/AaronZ345/ISDrama.
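For reference, the classifier-free guidance combination that the paper's context-consistent strategy builds on looks like the generic sketch below (the dummy model and guidance scale are illustrative).

```python
import torch

def cfg_step(model, x_t, t, cond, scale=3.0):
    """Standard classifier-free guidance: extrapolate from the
    unconditional prediction toward the conditional one. The paper's
    context-consistent variant adds drama-context conditioning,
    which this sketch abstracts away."""
    eps_uncond = model(x_t, t, cond=None)   # unconditional branch
    eps_cond = model(x_t, t, cond=cond)     # prompt-conditioned branch
    return eps_uncond + scale * (eps_cond - eps_uncond)

# Dummy denoiser to show the arithmetic: conditional output is larger.
dummy = lambda x_t, t, cond: x_t * (0.5 if cond is None else 1.0)
print(cfg_step(dummy, torch.ones(2), t=0, cond="script"))  # tensor([2., 2.])
```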
[403] MDDM: A Multi-view Discriminative Enhanced Diffusion-based Model for Speech Enhancement
Nan Xu, Zhaolong Huang, Xiaonan Zhi
Main category: eess.AS
TL;DR: MDDM is a multi-view discriminative enhanced diffusion model for speech enhancement that combines discriminative learning with diffusion modeling to achieve competitive performance with fewer sampling steps while reducing speech distortions and computational cost.
Details
Motivation: Previous speech enhancement methods using discriminative supervised learning or generative modeling tend to introduce speech distortions or require high computational cost, necessitating a more efficient approach that maintains speech quality.
Method: The paper proposes MDDM, which uses features from three domains (time, frequency, and noise) as inputs to a discriminative prediction network that generates preliminary spectrograms, then converts the discriminative output to clean speech through several inference sampling steps in a diffusion framework.
Result: MDDM achieves competitive performance compared to other diffusion-based methods while requiring fewer sampling steps, validated on both public and real-world datasets using subjective and objective metrics.
Conclusion: The intersection of distributions between discriminative output and clean target enables MDDM to achieve competitive performance with reduced sampling steps, making it an effective solution for speech enhancement that balances quality and computational efficiency.
Abstract: With the development of deep learning, speech enhancement has been greatly optimized in terms of speech quality. Previous methods typically focus on discriminative supervised learning or generative modeling, which tends to introduce speech distortions or high computational cost. In this paper, we propose MDDM, a Multi-view Discriminative enhanced Diffusion-based Model. Specifically, we take the features of three domains (time, frequency, and noise) as inputs of a discriminative prediction network, generating the preliminary spectrogram. Then, the discriminative output can be converted to clean speech by several inference sampling steps. Due to the intersection of the distributions between the discriminative output and the clean target, fewer sampling steps can achieve competitive performance compared to other diffusion-based methods. Experiments conducted on a public dataset and a real-world dataset validate the effectiveness of MDDM on both subjective and objective metrics.
eess.IV
[404] Systole-Conditioned Generative Cardiac Motion
Shahar Zuler, Gal Lifshitz, Hadar Averbuch-Elor, Dan Raviv
Main category: eess.IV
TL;DR: The paper presents a novel approach using conditional Variational Autoencoder (CVAE) to generate synthetic cardiac CT frame pairs with dense 3D flow field annotations, addressing the challenge of obtaining labeled data for cardiac motion estimation.
Details
Motivation: Accurate cardiac motion estimation in CT imaging is critical for cardiac function assessment and surgical planning, but data-driven methods require vast amounts of labeled data with dense ground-truth motion annotations that are often unfeasible to obtain manually.
Method: The approach uses a conditional Variational Autoencoder (CVAE) with a multi-scale feature conditioning mechanism that generates 3D flow fields conditioned on a single CT frame. The generated flow field is then applied to warp the input frame (see the sketch after the abstract), creating realistic frame pairs that simulate myocardium deformations across the cardiac cycle.
Result: The method successfully synthesizes realistically looking pairs of cardiac CT frames enriched with dense 3D flow field annotations, providing fully annotated data samples with optical flow ground-truth annotations for training motion estimation models.
Conclusion: The proposed data generation pipeline enables training and validation of more complex and accurate myocardium motion models while substantially reducing reliance on manual annotations, offering a practical solution to the data scarcity problem in cardiac motion estimation.
Abstract: Accurate motion estimation in cardiac computed tomography (CT) imaging is critical for assessing cardiac function and surgical planning. Data-driven methods have become the standard approach for dense motion estimation, but they rely on vast amounts of labeled data with dense ground-truth (GT) motion annotations, which are often unfeasible to obtain. To address this limitation, we present a novel approach that synthesizes realistically looking pairs of cardiac CT frames enriched with dense 3D flow field annotations. Our method leverages a conditional Variational Autoencoder (CVAE), which incorporates a novel multi-scale feature conditioning mechanism and is trained to generate 3D flow fields conditioned on a single CT frame. By applying the generated flow field to warp the given frame, we create pairs of frames that simulate realistic myocardium deformations across the cardiac cycle. These pairs serve as fully annotated data samples, providing optical flow GT annotations. Our data generation pipeline could enable the training and validation of more complex and accurate myocardium motion models, allowing for substantially reducing reliance on manual annotations. Our code, along with animated generated samples and additional material, is available on our project page: https://shaharzuler.github.io/GenerativeCardiacMotion_Page.
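The pair-construction step amounts to backward-warping the input frame with the generated flow field; a generic scipy sketch on a toy volume follows (interpolation order and coordinate conventions are assumptions, not the paper's code).

```python
import numpy as np
from scipy.ndimage import map_coordinates

def warp_volume(frame, flow):
    """Backward-warp a 3D CT frame with a dense flow field to synthesize
    the second frame of a training pair.
    frame: (D, H, W); flow: (3, D, H, W) voxel displacements."""
    grid = np.indices(frame.shape).astype(float)  # identity sampling grid
    return map_coordinates(frame, grid + flow, order=1, mode='nearest')

frame = np.random.rand(8, 32, 32)
flow = np.zeros((3, 8, 32, 32))
flow[2] = 1.5                                     # sub-voxel shift along W
pair = (frame, warp_volume(frame, flow))          # frame pair with GT flow
print(pair[1].shape)                              # (8, 32, 32)
```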
[405] Quantization-Aware Neuromorphic Architecture for Efficient Skin Disease Classification on Resource-Constrained Devices
Haitian Wang, Xinyu Wang, Yiren Wang, Karen Lee, Zichen Geng, Xian Zhang, Kehkashan Kiran, Yu Zhang, Bo Miao
Main category: eess.IV
TL;DR: QANA is a quantization-aware neuromorphic architecture that enables efficient skin lesion classification on edge devices, achieving 91.6% accuracy on HAM10000 dataset while reducing inference latency by 94.6% and energy consumption by 98.6% compared to GPU-based CNNs when deployed on neuromorphic hardware.
Details
Motivation: Accurate and efficient skin lesion classification on edge devices is critical for accessible dermatological care but remains challenging due to computational, energy, and privacy constraints in resource-limited environments.
Method: QANA integrates ghost modules, efficient channel attention, and squeeze-and-excitation blocks for feature representation, with a quantization-aware head and spike-compatible transformations enabling conversion to spiking neural networks (SNNs) for deployment on neuromorphic platforms.
Result: QANA achieves 91.6% Top-1 accuracy and 82.4% macro F1 on the HAM10000 dataset, and 90.8%/81.7% on the clinical dataset. When deployed on BrainChip Akida hardware, it achieves 1.5 ms inference latency and 1.7 mJ energy per image, reducing latency by 94.6% and energy use by 98.6% compared to GPU-based CNNs.
Conclusion: QANA demonstrates effectiveness for accurate, real-time, and privacy-sensitive medical analysis in edge environments, significantly outperforming state-of-the-art CNN-to-SNN models while maintaining high classification accuracy for dermatological applications.
Abstract: Accurate and efficient skin lesion classification on edge devices is critical for accessible dermatological care but remains challenging due to computational, energy, and privacy constraints. We introduce QANA, a novel quantization-aware neuromorphic architecture for incremental skin lesion classification on resource-limited hardware. QANA effectively integrates ghost modules, efficient channel attention, and squeeze-and-excitation blocks for robust feature representation with low-latency and energy-efficient inference. Its quantization-aware head and spike-compatible transformations enable seamless conversion to spiking neural networks (SNNs) and deployment on neuromorphic platforms. Evaluation on the large-scale HAM10000 benchmark and a real-world clinical dataset shows that QANA achieves 91.6% Top-1 accuracy and 82.4% macro F1 on HAM10000, and 90.8% / 81.7% on the clinical dataset, significantly outperforming state-of-the-art CNN-to-SNN models under fair comparison. Deployed on BrainChip Akida hardware, QANA achieves 1.5 ms inference latency and 1.7 mJ energy per image, reducing inference latency and energy use by 94.6% and 98.6%, respectively, compared to GPU-based CNNs, and surpassing state-of-the-art CNN-to-SNN conversion baselines. These results demonstrate the effectiveness of QANA for accurate, real-time, and privacy-sensitive medical analysis in edge environments.
[406] MLRU++: Multiscale Lightweight Residual UNETR++ with Attention for Efficient 3D Medical Image Segmentation
Nand Kumar Yadav, Rodrigue Rizk, Willium WC Chen, KC
Main category: eess.IV
TL;DR: MLRU++ is a lightweight hybrid CNN-Transformer architecture for 3D medical image segmentation that achieves state-of-the-art performance while significantly reducing computational complexity through novel attention mechanisms and multiscale feature aggregation.
Details
Motivation: Medical image segmentation faces challenges from anatomical variability and high computational demands on volumetric data. While hybrid CNN-Transformer architectures achieve excellent results, they add significant complexity, creating a need for efficient architectures that balance accuracy and computational efficiency.
Method: The paper proposes MLRU++, a Multiscale Lightweight Residual UNETR++ architecture with two key innovations: (1) a Lightweight Channel and Bottleneck Attention Module (LCBAM) for enhanced contextual feature encoding with minimal overhead (a generic channel-attention sketch follows the abstract), and (2) a Multiscale Bottleneck Block (M2B) in the decoder for capturing fine-grained details through multi-resolution feature aggregation.
Result: MLRU++ achieves state-of-the-art performance on four benchmark datasets with average Dice scores of 87.57% (Synapse), 93.00% (ACDC), and 81.12% (Lung). It improves Dice scores by 5.38% and 2.12% on Synapse and ACDC respectively compared to existing leading models, while significantly reducing parameter count and computational cost. Ablation studies confirm the effectiveness of both LCBAM and M2B components.
Conclusion: MLRU++ offers a practical and high-performing solution for 3D medical image segmentation tasks by successfully balancing segmentation accuracy with computational efficiency. The proposed architecture demonstrates that lightweight designs can achieve superior performance while being more computationally efficient than existing complex hybrid models.
Abstract: Accurate and efficient medical image segmentation is crucial but challenging due to anatomical variability and high computational demands on volumetric data. Recent hybrid CNN-Transformer architectures achieve state-of-the-art results but add significant complexity. In this paper, we propose MLRU++, a Multiscale Lightweight Residual UNETR++ architecture designed to balance segmentation accuracy and computational efficiency. It introduces two key innovations: a Lightweight Channel and Bottleneck Attention Module (LCBAM) that enhances contextual feature encoding with minimal overhead, and a Multiscale Bottleneck Block (M2B) in the decoder that captures fine-grained details via multi-resolution feature aggregation. Experiments on four publicly available benchmark datasets (Synapse, BTCV, ACDC, and Decathlon Lung) demonstrate that MLRU++ achieves state-of-the-art performance, with average Dice scores of 87.57% (Synapse), 93.00% (ACDC), and 81.12% (Lung). Compared to existing leading models, MLRU++ improves Dice scores by 5.38% and 2.12% on Synapse and ACDC, respectively, while significantly reducing parameter count and computational cost. Ablation studies evaluating LCBAM and M2B further confirm the effectiveness of the proposed architectural components. Results suggest that MLRU++ offers a practical and high-performing solution for 3D medical image segmentation tasks. Source code is available at: https://github.com/1027865/MLRUPP
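LCBAM itself is specified in the paper; the low-overhead channel-attention pattern such modules build on (global pooling followed by a tiny 1D convolution over channels) can be sketched generically in PyTorch:

```python
import torch
import torch.nn as nn

class EfficientChannelAttention(nn.Module):
    """Generic lightweight channel attention: global average pooling plus
    a 1D conv over channels, then sigmoid gating. Adds only k weights."""
    def __init__(self, k=3):
        super().__init__()
        self.conv = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)

    def forward(self, x):                           # x: (B, C, D, H, W)
        w = x.mean(dim=(2, 3, 4))                   # squeeze -> (B, C)
        w = self.conv(w.unsqueeze(1)).squeeze(1)    # mix nearby channels
        return x * w.sigmoid().view(*w.shape, 1, 1, 1)  # re-weight channels

x = torch.randn(2, 16, 4, 8, 8)                    # a toy 3D feature map
print(EfficientChannelAttention()(x).shape)        # torch.Size([2, 16, 4, 8, 8])
```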
[407] SFNet: A Spatio-Frequency Domain Deep Learning Network for Efficient Alzheimer’s Disease Diagnosis
Xinyue Yang, Meiliang Liu, Yunfang Xu, Xiaoxiao Yang, Zhengye Si, Zijin Li, Zhiwen Zhao
Main category: eess.IV
TL;DR: The paper proposes SFNet, the first end-to-end deep learning framework that simultaneously uses spatial and frequency domain information from 3D MRI scans to improve Alzheimer’s disease diagnosis, achieving 95.1% accuracy on the ADNI dataset.
Details
Motivation: Existing AD diagnostic models typically extract features from only one domain (spatial or frequency), limiting their ability to fully capture complex neuroimaging characteristics. While MRI contains both spatial and frequency information, most studies focus on single-domain analysis or are limited to 2D MRI, leaving the potential of dual-domain analysis in 3D MRI unexplored.
Method: The authors developed SFNet (Spatio-Frequency Network), an end-to-end deep learning framework that integrates: (1) an enhanced dense convolutional network for extracting local spatial features, (2) a global frequency module for capturing frequency-domain representations, and (3) a novel multi-scale attention module to refine spatial feature extraction.
Result: SFNet achieved 95.1% accuracy in classifying cognitively normal (CN) and Alzheimer’s disease (AD) cases on the Alzheimer’s Disease Neuroimaging Initiative (ADNI) dataset. The method outperformed existing baselines while reducing computational overhead.
Conclusion: SFNet successfully demonstrates that simultaneously leveraging both spatial and frequency domain information from 3D MRI can significantly enhance AD diagnosis performance. This dual-domain approach represents a promising direction for improving early detection of Alzheimer’s disease using non-invasive neuroimaging techniques.
Abstract: Alzheimer’s disease (AD) is a progressive neurodegenerative disorder that predominantly affects the elderly population and currently has no cure. Magnetic Resonance Imaging (MRI), as a non-invasive imaging technique, is essential for the early diagnosis of AD. MRI inherently contains both spatial and frequency information, as raw signals are acquired in the frequency domain and reconstructed into spatial images via the Fourier transform. However, most existing AD diagnostic models extract features from a single domain, limiting their capacity to fully capture the complex neuroimaging characteristics of the disease. While some studies have combined spatial and frequency information, they are mostly confined to 2D MRI, leaving the potential of dual-domain analysis in 3D MRI unexplored. To overcome this limitation, we propose Spatio-Frequency Network (SFNet), the first end-to-end deep learning framework that simultaneously leverages spatial and frequency domain information to enhance 3D MRI-based AD diagnosis. SFNet integrates an enhanced dense convolutional network to extract local spatial features and a global frequency module to capture global frequency-domain representations. Additionally, a novel multi-scale attention module is proposed to further refine spatial feature extraction. Experiments on the Alzheimer’s Disease Neuroimaging Initiative (ADNI) dataset demonstrate that SFNet outperforms existing baselines and reduces computational overhead in classifying cognitively normal (CN) and AD, achieving an accuracy of 95.1%.
[408] Physics-Driven Neural Network for Solving Electromagnetic Inverse Scattering Problems
Yutong Du, Zicheng Liu, Bazargul Matkerim, Changyou Li, Yali Zong, Bo Qi, Jingwei Kou
Main category: eess.IV
TL;DR: A physics-driven neural network (PDNN) approach for solving inverse scattering problems that avoids generalization issues by training only on scattered field data rather than requiring large datasets, while achieving high accuracy and stability for composite lossy scatterers.
Details
Motivation: Existing deep learning methods for inverse scattering problems rely heavily on large datasets and suffer from limited generalization capabilities, making them impractical for real-world applications where training data may be scarce or domain-specific.
Method: The paper proposes a physics-driven neural network (PDNN) that iteratively updates solutions by optimizing hyperparameters through a loss function incorporating scattered-field constraints and prior scatterer information (a schematic loss sketch follows the abstract). The method identifies subregions containing scatterers to improve imaging efficiency and trains using only collected scattered fields and computed scattered fields from predicted solutions.
Result: Numerical and experimental validation shows the proposed PDNN scheme achieves high reconstruction accuracy and strong stability, even when dealing with challenging composite lossy scatterers. The method demonstrates superior performance without requiring extensive training datasets.
Conclusion: The physics-driven neural network approach successfully addresses the generalization limitations of data-driven methods for inverse scattering problems by incorporating physical constraints directly into the training process, enabling accurate and stable reconstruction without relying on large datasets.
Abstract: In recent years, deep learning-based methods have been proposed for solving inverse scattering problems (ISPs), but most of them heavily rely on data and suffer from limited generalization capabilities. In this paper, a new solving scheme is proposed where the solution is iteratively updated following the updating of the physics-driven neural network (PDNN), the hyperparameters of which are optimized by minimizing the loss function which incorporates the constraints from the collected scattered fields and the prior information about scatterers. Unlike data-driven neural network solvers, PDNN is trained only requiring the input of collected scattered fields and the computation of scattered fields corresponding to predicted solutions, thus avoids the generalization problem. Moreover, to accelerate the imaging efficiency, the subregion enclosing the scatterers is identified. Numerical and experimental results demonstrate that the proposed scheme has high reconstruction accuracy and strong stability, even when dealing with composite lossy scatterers.
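The training signal can be summarized as a physics-consistency loss: agreement between measured scattered fields and fields computed from the predicted scatterer, plus a prior. The sketch below is schematic; `forward_solver` is a placeholder for an EM forward model and the total-variation prior is a stand-in for the paper's prior term.

```python
import torch

def pdnn_loss(pred_scatterer, forward_solver, e_measured, prior_weight=0.1):
    """Loss = data mismatch of scattered fields + prior on the scatterer.
    No labeled ground truth is needed, only measured fields."""
    e_pred = forward_solver(pred_scatterer)
    data_term = torch.mean(torch.abs(e_pred - e_measured) ** 2)
    # Total-variation-style smoothness prior on the 2D scatterer map.
    tv = torch.mean(torch.abs(torch.diff(pred_scatterer, dim=-1))) \
       + torch.mean(torch.abs(torch.diff(pred_scatterer, dim=-2)))
    return data_term + prior_weight * tv

solver = lambda s: torch.fft.fft2(s)            # toy stand-in forward model
s_pred = torch.rand(32, 32, requires_grad=True)
loss = pdnn_loss(s_pred, solver, torch.fft.fft2(torch.rand(32, 32)))
loss.backward()                                  # drives the network update
print(float(loss))
```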
[409] A High Magnifications Histopathology Image Dataset for Oral Squamous Cell Carcinoma Diagnosis and Prognosis
Jinquan Guan, Junhong Guo, Qi Chen, Jian Chen, Yongkang Cai, Yilin He, Zhiquan Huang, Yan Wang, Yutong Xie
Main category: eess.IV
TL;DR: This paper introduces Multi-OSCC, a comprehensive histopathology image dataset of 1,325 OSCC patients with both diagnostic and prognostic annotations, achieving excellent performance with top AUCs of 94.72% for recurrence prediction and 81.23% for tumor differentiation across six clinical tasks.
Details
Motivation: Existing publicly available OSCC datasets suffer from limited patient cohorts and a restricted focus on either diagnostic or prognostic tasks, limiting the development of comprehensive and generalizable deep learning models for computer-aided diagnosis and prognosis of this prevalent and aggressive malignancy.
Method: The authors created the Multi-OSCC dataset with 1,325 OSCC patients, each represented by six high-resolution histopathology images at different magnifications (x200, x400, x1000) covering core and edge tumor regions. They systematically benchmarked the dataset by evaluating different visual encoders, multi-image fusion techniques, stain normalization, and multi-task learning frameworks across six clinical tasks: recurrence prediction, lymph node metastasis, tumor differentiation, tumor invasion, cancer embolus, and perineural invasion.
Result: Top-performing models achieved excellent results with AUC of 94.72% for recurrence prediction and 81.23% for tumor differentiation, with all tasks surpassing 70% AUC. Key findings include: stain normalization benefits diagnostic tasks but negatively affects recurrence prediction, and multi-task learning shows 3.34% average AUC degradation compared to single-task models.
Conclusion: The Multi-OSCC dataset successfully bridges the gap in comprehensive OSCC datasets by providing both diagnostic and prognostic information for a large patient cohort. The benchmark results demonstrate the dataset’s utility while highlighting challenges in multi-task learning, and the public release of the dataset and baseline models will accelerate future research in OSCC computer-aided diagnosis and prognosis.
Abstract: Oral Squamous Cell Carcinoma (OSCC) is a prevalent and aggressive malignancy where deep learning-based computer-aided diagnosis and prognosis can enhance clinical assessments. However, existing publicly available OSCC datasets often suffer from limited patient cohorts and a restricted focus on either diagnostic or prognostic tasks, limiting the development of comprehensive and generalizable models. To bridge this gap, we introduce Multi-OSCC, a new histopathology image dataset comprising 1,325 OSCC patients, integrating both diagnostic and prognostic information to expand existing public resources. Each patient is represented by six high-resolution histopathology images captured at x200, x400, and x1000 magnifications (two per magnification), covering both the core and edge tumor regions. The Multi-OSCC dataset is richly annotated for six critical clinical tasks: recurrence prediction (REC), lymph node metastasis (LNM), tumor differentiation (TD), tumor invasion (TI), cancer embolus (CE), and perineural invasion (PI). To benchmark this dataset, we systematically evaluate the impact of different visual encoders, multi-image fusion techniques, stain normalization, and multi-task learning frameworks. Our analysis yields several key insights: (1) The top-performing models achieve excellent results, with an Area Under the Curve (AUC) of 94.72% for REC and 81.23% for TD, while all tasks surpass 70% AUC; (2) Stain normalization benefits diagnostic tasks but negatively affects recurrence prediction; (3) Multi-task learning incurs a 3.34% average AUC degradation compared to single-task models in our multi-task benchmark, underscoring the challenge of balancing multiple tasks in our dataset. To accelerate future research, we publicly release the Multi-OSCC dataset and baseline models at https://github.com/guanjinquan/OSCC-PathologyImageDataset.
[410] Semantic Segmentation for Preoperative Planning in Transcatheter Aortic Valve Replacement
Cedric Zöllner, Simon Reiß, Alexander Jaus, Amroalalaa Sholi, Ralf Sodian, Rainer Stiefelhagen
Main category: eess.IV
TL;DR: This paper develops AI-based semantic segmentation models to support preoperative planning for transcatheter aortic valve replacement (TAVR) surgeries by automatically identifying and measuring relevant anatomical structures in CT scans, achieving improved performance through adapted loss functions.
Details
Motivation: Medical doctors need support during preoperative planning for TAVR surgeries when assessing medical images. Current assessment processes could benefit from AI methods that can automatically identify and make anatomical structures measurable in computed tomography scans according to medical guidelines.
Method: The authors derive fine-grained TAVR-relevant pseudo-labels from coarse-grained anatomical information to train semantic segmentation models. They propose an adaptation to the loss function during training to improve model performance in identifying anatomical structures in CT scans.
Result: The proposed loss function adaptation achieved a +1.27% Dice score increase in segmentation performance. The models successfully quantified their ability to find TAVR-relevant anatomical structures in CT scans. Fine-grained pseudo-labels and CT scan datasets were made publicly available.
Conclusion: AI-based semantic segmentation models with adapted loss functions can effectively support preoperative TAVR planning by automatically identifying and measuring relevant anatomical structures in CT scans, with the improved methodology achieving measurable performance gains over baseline approaches.
Abstract: When preoperative planning for surgeries is conducted on the basis of medical images, artificial intelligence methods can support medical doctors during assessment. In this work, we consider medical guidelines for preoperative planning of the transcatheter aortic valve replacement (TAVR) and identify tasks that may be supported via semantic segmentation models by making relevant anatomical structures measurable in computed tomography scans. We first derive fine-grained TAVR-relevant pseudo-labels from coarse-grained anatomical information, in order to train segmentation models and quantify how well they are able to find these structures in the scans. Furthermore, we propose an adaptation to the loss function in training these segmentation models and through this achieve a +1.27% Dice increase in performance. Our fine-grained TAVR-relevant pseudo-labels and the computed tomography scans we build upon are available at https://doi.org/10.5281/zenodo.16274176.
[411] Pyramid Hierarchical Masked Diffusion Model for Imaging Synthesis
Xiaojiao Xiao, Qinmin Vivian Hu, Guanghui Wang
Main category: eess.IV
TL;DR: The paper introduces PHMDiff, a Pyramid Hierarchical Masked Diffusion Model for medical image synthesis that uses multi-scale hierarchical approach with randomly masked training to generate high-quality medical images across different modalities, achieving superior performance in PSNR and SSIM metrics.
Details
Motivation: Medical image synthesis is crucial for addressing missing imaging modalities in clinical workflows due to extended scan times, scan corruption, artifacts, patient motion, and intolerance to contrast agents. There is a need for better control over synthesizing high-quality images across different resolutions and layers.
Method: The paper proposes PHMDiff (Pyramid Hierarchical Masked Diffusion Model), which employs a multi-scale hierarchical approach with random multi-scale high-proportion masks to accelerate diffusion model training. It integrates a Transformer-based diffusion process with cross-granularity regularization to model mutual-information consistency across granularity latent spaces, balancing detail fidelity and overall structure.
Result: Comprehensive experiments on two challenging datasets show that PHMDiff achieves superior performance in both Peak Signal-to-Noise Ratio (PSNR) and Structural Similarity Index Measure (SSIM) compared to other methods. The model demonstrates capability to produce high-quality synthesized images with excellent structural integrity across and within medical imaging modalities.
Conclusion: PHMDiff represents a significant advancement in medical image synthesis, offering a multi-scale framework that outperforms existing methods. Ablation studies confirm the contribution of each component, and the model shows significant advantages for both cross-modality and within-modality medical image synthesis tasks.
Abstract: Medical image synthesis plays a crucial role in clinical workflows, addressing the common issue of missing imaging modalities due to factors such as extended scan times, scan corruption, artifacts, patient motion, and intolerance to contrast agents. The paper presents a novel image synthesis network, the Pyramid Hierarchical Masked Diffusion Model (PHMDiff), which employs a multi-scale hierarchical approach for more detailed control over synthesizing high-quality images across different resolutions and layers. Specifically, this model utilizes randomly multi-scale high-proportion masks to speed up diffusion model training, and balances detail fidelity and overall structure. The integration of a Transformer-based Diffusion model process incorporates cross-granularity regularization, modeling the mutual information consistency across each granularity’s latent spaces, thereby enhancing pixel-level perceptual accuracy. Comprehensive experiments on two challenging datasets demonstrate that PHMDiff achieves superior performance in both the Peak Signal-to-Noise Ratio (PSNR) and Structural Similarity Index Measure (SSIM), highlighting its capability to produce high-quality synthesized images with excellent structural integrity. Ablation studies further confirm the contributions of each component. Furthermore, the PHMDiff model, a multi-scale image synthesis framework across and within medical imaging modalities, shows significant advantages over other methods. The source code is available at https://github.com/xiaojiao929/PHMDiff
[412] A Tutorial on MRI Reconstruction: From Modern Methods to Clinical Implications
Tolga Çukur, Salman U. H. Dar, Valiyeh Ansarian Nezhad, Yohan Jun, Tae Hyung Kim, Shohei Fujita, Berkin Bilgic
Main category: eess.IV
TL;DR: This tutorial paper reviews MRI reconstruction methods, covering classical hand-crafted prior approaches and modern deep learning techniques that combine learned and crafted priors to accelerate MRI acquisition while maintaining diagnostic quality.
Details
Motivation: MRI scans require long acquisition times, which reduce patient throughput, increase motion artifacts, and may compromise image quality or diagnostic scope. There is a need to accelerate MRI acquisitions while preserving diagnostic quality for clinical and research applications.
Method: The paper provides a tutorial overview of MRI reconstruction approaches, starting with classical methods using explicit hand-crafted priors (a minimal example follows the abstract), then progressing to deep learning methods that leverage both learned and crafted priors. The tutorial is accompanied by a Python toolbox for practical demonstration.
Result: The tutorial presents state-of-the-art MRI reconstruction techniques that can accelerate acquisitions while maintaining diagnostic quality. Deep learning methods that combine learned and crafted priors show particular promise in pushing performance boundaries beyond classical approaches.
Conclusion: Advances in MRI reconstruction algorithms, combined with hardware and pulse sequence improvements, have enabled significant acceleration of MRI acquisitions. The integration of deep learning with traditional reconstruction methods represents a promising direction, though challenges remain that require future research attention.
Abstract: MRI is an indispensable clinical tool, offering a rich variety of tissue contrasts to support broad diagnostic and research applications. Clinical exams routinely acquire multiple structural sequences that provide complementary information for differential diagnosis, while research protocols often incorporate advanced functional, diffusion, spectroscopic, and relaxometry sequences to capture multidimensional insights into tissue structure and composition. However, these capabilities come at the cost of prolonged scan times, which reduce patient throughput, increase susceptibility to motion artifacts, and may require trade-offs in image quality or diagnostic scope. Over the last two decades, advances in image reconstruction algorithms–alongside improvements in hardware and pulse sequence design–have made it possible to accelerate acquisitions while preserving diagnostic quality. Central to this progress is the ability to incorporate prior information to regularize the solutions to the reconstruction problem. In this tutorial, we overview the basics of MRI reconstruction and highlight state-of-the-art approaches, beginning with classical methods that rely on explicit hand-crafted priors, and then turning to deep learning methods that leverage a combination of learned and crafted priors to further push the performance envelope. We also explore the translational aspects and eventual clinical implications of these methods. We conclude by discussing future directions to address remaining challenges in MRI reconstruction. The tutorial is accompanied by a Python toolbox (https://github.com/tutorial-MRI-recon/tutorial) to demonstrate select methods discussed in the article.
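As a concrete instance of the classical hand-crafted-prior approach the tutorial starts from, the sketch below reconstructs an undersampled toy image with ISTA; the L1 prior is applied directly in image space for brevity, whereas real pipelines use a sparsifying transform and measured multi-coil data.

```python
import numpy as np

def ista_mri(y, mask, lam=0.05, steps=50):
    """Classical hand-crafted-prior reconstruction: data consistency on
    undersampled k-space y plus an L1 sparsity prior, solved with ISTA."""
    x = np.zeros(mask.shape)
    for _ in range(steps):
        # Gradient step on ||M F x - y||^2; F^-1 M F is a projection
        # (unit norm with numpy's ifft normalization), so step size 1 works.
        x = x - np.fft.ifft2(mask * (mask * np.fft.fft2(x) - y)).real
        x = np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)  # soft-threshold
    return x

img = np.zeros((64, 64)); img[24:40, 24:40] = 1.0  # toy phantom
mask = np.random.rand(64, 64) < 0.4                # keep 40% of k-space
y = mask * np.fft.fft2(img)
print(np.abs(ista_mri(y, mask) - img).mean())      # reconstruction error
```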
[413] Improving U-Net Confidence on TEM Image Data with L2-Regularization, Transfer Learning, and Deep Fine-Tuning
Aiden Ochoa, Xinyuan Xu, Xing Wang
Main category: eess.IV
TL;DR: This paper addresses the challenge of automated nanoscale defect detection in TEM images by applying transfer learning from natural image models, achieving 57% improvement in defect detection while introducing novel evaluation metrics independent of annotation accuracy.
Details
Motivation: Large data volumes require automated TEM image analysis, but nanoscale defects show high variation due to complex contrast mechanisms and structures, leading to limited labeled data and high annotation error rates that hinder machine learning performance.
Method: A transfer learning approach uses pre-trained encoders from natural image models with L2-regularization to focus on simpler, reliable features while ignoring semantically complex ones. Novel evaluation metrics, independent of annotation accuracy, were introduced to properly assess performance.
Result: 57% improvement in defect detection rate for grain boundary detection in UO2 TEM images. Model self-confidence was achieved through transfer learning and fine-tuning of very deep layers. Conventional metrics like F1-score were shown to be inadequate due to annotation errors.
Conclusion: Transfer learning from natural image models effectively improves TEM defect detection by leveraging simpler features, and novel evaluation metrics are necessary to properly assess model performance independent of human annotation errors in this domain.
Abstract: With ever-increasing data volumes, it is essential to develop automated approaches for identifying nanoscale defects in transmission electron microscopy (TEM) images. However, compared to features in conventional photographs, nanoscale defects in TEM images exhibit far greater variation due to the complex contrast mechanisms and intricate defect structures. These challenges often result in much less labeled data and higher rates of annotation errors, posing significant obstacles to improving machine learning model performance for TEM image analysis. To address these limitations, we examined transfer learning by leveraging large, pre-trained models used for natural images. We demonstrated that by using the pre-trained encoder and L2-regularization, semantically complex features are ignored in favor of simpler, more reliable cues, substantially improving the model performance. However, this improvement cannot be captured by conventional evaluation metrics such as F1-score, which can be skewed by human annotation errors treated as ground truth. Instead, we introduced novel evaluation metrics that are independent of the annotation accuracy. Using grain boundary detection in UO2 TEM images as a case study, we found that our approach led to a 57% improvement in defect detection rate, which is a robust and holistic measure of model performance on the TEM dataset used in this work. Finally, we showed that model self-confidence is only achieved through transfer learning and fine-tuning of very deep layers.
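A minimal PyTorch sketch of the recipe described above: start from a natural-image pre-trained encoder, unfreeze only the deepest stage, and apply L2-regularization through weight decay. The ResNet-18 backbone, layer choice, and hyperparameters are stand-ins, not the authors' exact U-Net configuration.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18, ResNet18_Weights

encoder = resnet18(weights=ResNet18_Weights.IMAGENET1K_V1)  # natural-image prior
# "Deep fine-tuning": adapt only the deepest stage to TEM contrast,
# keeping the shallow, generic natural-image filters frozen.
for name, p in encoder.named_parameters():
    p.requires_grad = name.startswith("layer4")

backbone = nn.Sequential(*list(encoder.children())[:-2])  # drop avgpool/fc
head = nn.Conv2d(512, 1, kernel_size=1)                   # toy defect logits

optimizer = torch.optim.Adam(
    [p for p in backbone.parameters() if p.requires_grad]
    + list(head.parameters()),
    lr=1e-4,
    weight_decay=1e-3,  # the L2 penalty that biases toward simpler, reliable cues
)

x = torch.randn(2, 3, 224, 224)   # stand-in TEM patches
logits = head(backbone(x))        # coarse (2, 1, 7, 7) prediction map
print(logits.shape)
```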
[414] MultiTaskDeltaNet: Change Detection-based Image Segmentation for Operando ETEM with Application to Carbon Gasification Kinetics
Yushuo Niu, Tianyu Li, Yuanyuan Zhu, Qian Yang
Main category: eess.IV
TL;DR: The paper introduces MultiTaskDeltaNet (MTDN), a novel deep learning architecture that reconceptualizes semantic segmentation as change detection for analyzing in-situ TEM imaging of solid-state reactions, achieving 10.22% performance improvement over conventional models in detecting small and ambiguous features.
Details
Motivation: Traditional deep learning methods for semantic segmentation of TEM images face limitations due to scarcity of labeled data, visually ambiguous features, and small-object scenarios. There is a need for automated, high-precision semantic segmentation of dynamically evolving features in in-situ TEM imaging for spatially-resolved operando characterization of solid-state reactions.
Method: MTDN uses a Siamese network with U-Net backbone that reconceptualizes segmentation as a change detection problem. It employs paired images to capture feature changes and implements a multi-task learning strategy to leverage correlations between physical features of interest, enabling effective utilization of minimal data.
Result: MTDN demonstrated significant advantages over conventional segmentation models when evaluated on in-situ environmental TEM (ETEM) videos of filamentous carbon gasification. The model achieved a 10.22% performance improvement over conventional segmentation models, particularly excelling in accurately delineating fine structural features and predicting small, visually ambiguous physical features.
Conclusion: This work successfully bridges key gaps between deep learning and practical TEM image analysis, advancing automated characterization of nanomaterials in complex experimental settings. MTDN provides an effective solution for high-precision semantic segmentation in challenging TEM imaging scenarios with limited labeled data.
Abstract: Transforming in-situ transmission electron microscopy (TEM) imaging into a tool for spatially-resolved operando characterization of solid-state reactions requires automated, high-precision semantic segmentation of dynamically evolving features. However, traditional deep learning methods for semantic segmentation often encounter limitations due to the scarcity of labeled data, visually ambiguous features of interest, and small-object scenarios. To tackle these challenges, we introduce MultiTaskDeltaNet (MTDN), a novel deep learning architecture that creatively reconceptualizes the segmentation task as a change detection problem. By implementing a unique Siamese network with a U-Net backbone and using paired images to capture feature changes, MTDN effectively utilizes minimal data to produce high-quality segmentations. Furthermore, MTDN utilizes a multi-task learning strategy to leverage correlations between physical features of interest. In an evaluation using data from in-situ environmental TEM (ETEM) videos of filamentous carbon gasification, MTDN demonstrated a significant advantage over conventional segmentation models, particularly in accurately delineating fine structural features. Notably, MTDN achieved a 10.22% performance improvement over conventional segmentation models in predicting small and visually ambiguous physical features. This work bridges several key gaps between deep learning and practical TEM image analysis, advancing automated characterization of nanomaterials in complex experimental settings.
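The change-detection reformulation is easy to see in a toy model. The sketch below is a hypothetical illustration, not the authors' MTDN: a shared (Siamese) encoder embeds a reference and a current frame, the decoder sees only their feature difference, and two heads mimic the multi-task outputs.

```python
import torch
import torch.nn as nn

class SiameseDeltaSeg(nn.Module):
    """Toy sketch of segmentation-as-change-detection (not the real MTDN,
    which uses a full U-Net backbone)."""
    def __init__(self, ch=16):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, ch, 3, padding=1), nn.ReLU(),
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(),
        )
        self.decoder = nn.Sequential(nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU())
        self.head_a = nn.Conv2d(ch, 1, 1)   # e.g. one physical feature of interest
        self.head_b = nn.Conv2d(ch, 1, 1)   # a second, correlated feature

    def forward(self, ref, cur):
        delta = self.encoder(cur) - self.encoder(ref)   # shared weights
        h = self.decoder(delta)
        return self.head_a(h), self.head_b(h)

model = SiameseDeltaSeg()
ref, cur = torch.randn(2, 1, 64, 64), torch.randn(2, 1, 64, 64)
mask_a, mask_b = model(ref, cur)
print(mask_a.shape, mask_b.shape)   # torch.Size([2, 1, 64, 64]) each
```

Because both frames pass through the same weights, the network only has to explain what changed between them, which is how the paired-image setup extracts more supervision from minimal labeled data.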
[415] Now and Future of Artificial Intelligence-based Signet Ring Cell Diagnosis: A Survey
Zhu Meng, Junhao Dong, Limei Guo, Fei Su, Jiaxuan Liu, Guangxi Wang, Zhicheng Zhao
Main category: eess.IV
TL;DR: This paper presents a comprehensive survey of AI-driven signet ring cell (SRC) analysis from 2008-2025, systematically reviewing unimodal and multi-modal approaches for automated SRC detection and diagnosis to bridge the gap between algorithmic capabilities and clinical applicability.
Details
Motivation: Signet ring cells are associated with high metastasis propensity and poor prognosis, critically influencing surgical decisions, but their detection remains challenging even for experienced pathologists. While AI-based automated SRC diagnosis shows potential for enhanced diagnostic efficiency and accuracy, existing methodologies lack systematic review, creating a gap in assessing disparities between algorithmic capabilities and clinical needs.
Method: The authors conducted a comprehensive survey analyzing AI-driven SRC analysis literature from 2008 through June 2025. They systematically categorized algorithms into unimodal approaches (image, omics, and text data) and multi-modal approaches (integrating two or more data modalities). Image-based algorithms were further subdivided into classification, detection, segmentation, and foundation model tasks.
Result: The survey provides a systematic summary of SRC biological characteristics, challenges in automated identification, and comprehensive categorization of representative algorithms. It evaluates current methodological performance against clinical assistance requirements and identifies existing gaps between computational capabilities and clinical practice needs.
Conclusion: The survey identifies unresolved challenges and future research directions in SRC analysis, aiming to assist researchers (particularly those without medical backgrounds) in understanding the SRC analysis landscape and prospects for intelligent diagnosis, thereby accelerating the translation of computational algorithms into clinical practice.
Abstract: Signet ring cells (SRCs), associated with a high propensity for peripheral metastasis and poor prognosis, critically influence surgical decision-making and outcome prediction. However, their detection remains challenging even for experienced pathologists. While artificial intelligence (AI)-based automated SRC diagnosis has gained increasing attention for its potential to enhance diagnostic efficiency and accuracy, existing methodologies lack systematic review. This gap impedes the assessment of disparities between algorithmic capabilities and clinical applicability. This paper presents a comprehensive survey of AI-driven SRC analysis from 2008 through June 2025. We systematically summarize the biological characteristics of SRCs and challenges in their automated identification. Representative algorithms are analyzed and categorized as unimodal or multi-modal approaches. Unimodal algorithms, encompassing image, omics, and text data, are reviewed; image-based ones are further subdivided into classification, detection, segmentation, and foundation model tasks. Multi-modal algorithms integrate two or more data modalities (images, omics, and text). Finally, by evaluating current methodological performance against clinical assistance requirements, we discuss unresolved challenges and future research directions in SRC analysis. This survey aims to assist researchers, particularly those without medical backgrounds, in understanding the landscape of SRC analysis and the prospects for intelligent diagnosis, thereby accelerating the translation of computational algorithms into clinical practice.
[416] FLLIC: Functionally Lossless Image Compression
Xi Zhang, Xiaolin Wu
Main category: eess.IV
TL;DR: This paper proposes functionally lossless image compression (FLLIC) that jointly performs denoising and compression, achieving better performance than traditional mathematically lossless compression by questioning the necessity of preserving sensor noise.
Details
Motivation: Traditional mathematically lossless image compression (MLLIC) preserves acquisition noise from digital sensors, wasting bits on unnecessary information. Despite recent DNN advances reducing bit rates by ~10%, MLLIC still doesn't meet bandwidth and cost requirements for practical imaging systems.
Method: The authors propose functionally lossless image compression (FLLIC), a new paradigm that performs joint denoising and compression. Instead of preserving noisy input exactly, FLLIC compresses optimally denoised images to achieve the best possible reconstruction of the latent noise-free original image.
Result: FLLIC achieves state-of-the-art performance in joint denoising and compression of noisy images while operating at lower computational cost compared to existing methods. The approach demonstrates superior performance over traditional mathematically lossless compression approaches.
Conclusion: By questioning the necessity of mathematically lossless compression and proposing joint denoising-compression, FLLIC overcomes performance barriers of traditional lossless compression while achieving better reconstruction of the original noise-free image at reduced computational cost.
Abstract: Recently, DNN models for lossless image coding have surpassed their traditional counterparts in compression performance, reducing the previous lossless bit rate by about ten percent for natural color images. But even with these advances, mathematically lossless image compression (MLLIC) ratios for natural images still fall short of the bandwidth and cost-effectiveness requirements of most practical imaging and vision systems at present and beyond. To overcome the performance barrier of MLLIC, we question the very necessity of MLLIC. Considering that all digital imaging sensors suffer from acquisition noises, why should we insist on mathematically lossless coding, i.e., wasting bits to preserve noises? Instead, we propose a new paradigm of joint denoising and compression called functionally lossless image compression (FLLIC), which performs lossless compression of optimally denoised images (the optimality may be task-specific). Although not literally lossless with respect to the noisy input, FLLIC aims to achieve the best possible reconstruction of the latent noise-free original image. Extensive experiments show that FLLIC achieves state-of-the-art performance in joint denoising and compression of noisy images and does so at a lower computational cost.
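The bit-waste argument is easy to demonstrate empirically. Below is a toy sketch using zlib as a placeholder entropy coder and a crude mean filter as a stand-in denoiser (FLLIC itself uses learned, possibly task-specific components): the noisy image costs far more bits to code losslessly than its denoised counterpart.

```python
import zlib
import numpy as np

def box_denoise(img, k=3):
    """Crude mean-filter stand-in for FLLIC's learned denoiser."""
    pad = k // 2
    padded = np.pad(img.astype(np.float32), pad, mode="edge")
    out = np.zeros(img.shape, dtype=np.float32)
    for dy in range(k):
        for dx in range(k):
            out += padded[dy:dy + img.shape[0], dx:dx + img.shape[1]]
    return np.clip(out / (k * k), 0, 255).astype(np.uint8)

def lossless_bits(img):
    """zlib as a placeholder for a real lossless entropy coder."""
    return 8 * len(zlib.compress(img.tobytes(), 9))

rng = np.random.default_rng(0)
clean = np.tile(np.linspace(0, 255, 128, dtype=np.uint8), (128, 1))
noisy = np.clip(clean + rng.normal(0, 10, clean.shape), 0, 255).astype(np.uint8)

# Mathematically lossless coding must spend bits on sensor noise;
# coding the denoised image ("functionally lossless") needs far fewer.
print("noisy   :", lossless_bits(noisy), "bits")
print("denoised:", lossless_bits(box_denoise(noisy)), "bits")
```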
[417] Recurrent Inference Machine for Medical Image Registration
Yi Zhang, Yidong Zhao, Hui Xue, Peter Kellman, Stefan Klein, Qian Tao
Main category: eess.IV
TL;DR: The paper proposes RIIR (Recurrent Inference Image Registration), a meta-learning approach that combines deep learning with iterative optimization to achieve better registration accuracy and data efficiency compared to existing deep learning methods, requiring only 5% of training data while outperforming other approaches.
Details
Motivation: Traditional optimization-based registration methods are training-free but slow, while deep learning methods are fast but may sacrifice accuracy and require large datasets. There's a need for a method that combines the benefits of both approaches - maintaining registration accuracy while being data-efficient and leveraging the speed of deep learning.
Method: RIIR formulates image registration as a meta-learning problem solved iteratively. It learns the update rule of optimization with implicit regularization combined with explicit gradient input. The method uses a recurrent inference framework with hidden states to perform registration in an iterative manner, mimicking optimization-based approaches while leveraging neural network capabilities.
Result: RIIR outperformed various deep learning-based registration methods on brain MRI and quantitative cardiac MRI datasets. The method demonstrated high data efficiency by achieving superior performance using only 5% of the training data compared to other methods. Ablation studies confirmed the importance of hidden states in the recurrent inference framework for effective meta-learning.
Conclusion: RIIR successfully addresses the trade-off between accuracy and data efficiency in deep learning-based medical image registration. The proposed method offers a highly data-efficient framework that combines the advantages of both optimization-based and deep learning approaches, making it valuable for medical image registration applications where training data may be limited.
Abstract: Image registration is essential for medical image applications where alignment of voxels across multiple images is needed for qualitative or quantitative analysis. With recent advancements in deep neural networks and parallel computing, deep learning-based medical image registration methods have become competitive with their flexible modelling and fast inference capabilities. However, compared to traditional optimization-based registration methods, the speed advantage may come at the cost of registration performance at inference time. Moreover, deep neural networks ideally demand large training datasets while optimization-based methods are training-free. To improve registration accuracy and data efficiency, we propose a novel image registration method, termed Recurrent Inference Image Registration (RIIR) network. RIIR is formulated as a meta-learning solver to the registration problem in an iterative manner. RIIR addresses the accuracy and data efficiency issues, by learning the update rule of optimization, with implicit regularization combined with explicit gradient input. We evaluated RIIR extensively on brain MRI and quantitative cardiac MRI datasets, in terms of both registration accuracy and training data efficiency. Our experiments showed that RIIR outperformed a range of deep learning-based methods, even with only 5% of the training data, demonstrating high data efficiency. Key findings from our ablation studies highlighted the important added value of the hidden states introduced in the recurrent inference framework for meta-learning. Our proposed RIIR offers a highly data-efficient framework for deep learning-based medical image registration.
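The "learned update rule" idea can be sketched compactly. The code below is a hypothetical illustration, not the RIIR release: each unrolled step receives the explicit gradient of the similarity loss with respect to the displacement field plus a hidden state, and emits an increment. Network sizes and the MSE similarity are simplifying assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RecurrentRegStep(nn.Module):
    """One learned optimization step: (gradient, hidden) -> (update, hidden)."""
    def __init__(self, ch=16):
        super().__init__()
        self.mix = nn.Conv2d(2 + ch, ch, 3, padding=1)  # gradient + hidden in
        self.out = nn.Conv2d(ch, 2, 3, padding=1)       # displacement update out

    def forward(self, grad, hidden):
        hidden = torch.tanh(self.mix(torch.cat([grad, hidden], dim=1)))
        return self.out(hidden), hidden

def warp(moving, disp):
    # Resample `moving` with a dense displacement field of shape (B, 2, H, W).
    b, _, h, w = moving.shape
    ys, xs = torch.meshgrid(torch.linspace(-1, 1, h),
                            torch.linspace(-1, 1, w), indexing="ij")
    base = torch.stack([xs, ys]).expand(b, -1, -1, -1)
    grid = (base + disp).permute(0, 2, 3, 1)            # (B, H, W, 2)
    return F.grid_sample(moving, grid, align_corners=True)

fixed, moving = torch.randn(1, 1, 32, 32), torch.randn(1, 1, 32, 32)
step_net, hidden = RecurrentRegStep(), torch.zeros(1, 16, 32, 32)
disp = torch.zeros(1, 2, 32, 32, requires_grad=True)
for _ in range(4):                                      # unrolled iterations
    loss = F.mse_loss(warp(moving, disp), fixed)
    grad = torch.autograd.grad(loss, disp, create_graph=True)[0]
    delta, hidden = step_net(grad, hidden)
    disp = disp + delta
print(disp.shape)                                       # final deformation field
```

In training, such unrolled steps would be optimized end-to-end against ground-truth alignments; the hidden state carried across iterations is the component the ablation studies above single out as important.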
[418] Investigation of unsupervised and supervised hyperspectral anomaly detection
Mazharul Hossain, Aaron Robinson, Lan Wang, Chrysanthe Preza
Main category: eess.IV
TL;DR: This paper evaluates hyperspectral anomaly detection methods, comparing supervised and unsupervised approaches including their previously developed hybrid ensemble technique that combines unsupervised algorithms with supervised classifiers.
Details
Motivation: Hyperspectral anomaly detection is crucial for applications in agriculture, environment, and military (RSTA missions). While supervised methods improve detection accuracy, they fail to detect novel patterns that deviate from training data. There's a need to evaluate different approaches to provide new insights into their performance.
Method: The authors evaluate their previously developed hybrid ensemble approach that combines hyperspectral unmixing with three unsupervised HS-AD algorithms using supervised classifier-determined weights, comparing it against other supervised and unsupervised methods using general hyperspectral data.
Result: The paper provides comparative evaluation results of various hyperspectral anomaly detection methods, offering new insights into the performance of supervised vs unsupervised approaches and their hybrid ensemble technique.
Conclusion: The evaluation demonstrates the trade-offs between supervised and unsupervised hyperspectral anomaly detection methods, highlighting that while supervised methods can improve accuracy, they have limitations in detecting novel patterns, making the hybrid ensemble approach a valuable contribution to the field.
Abstract: Hyperspectral sensing is a valuable tool for detecting anomalies and distinguishing between materials in a scene. Hyperspectral anomaly detection (HS-AD) helps characterize the captured scenes and separates them into anomaly and background classes. It is vital in agriculture, environment, and military applications such as RSTA (reconnaissance, surveillance, and target acquisition) missions. We previously designed an equal voting ensemble of hyperspectral unmixing and three unsupervised HS-AD algorithms. We later utilized a supervised classifier to determine the weights of a voting ensemble, creating a hybrid of heterogeneous unsupervised HS-AD algorithms with a supervised classifier in a model stacking, which improved detection accuracy. However, supervised classification methods usually fail to detect novel or unknown patterns that substantially deviate from those seen previously. In this work, we evaluate our technique and other supervised and unsupervised methods using general hyperspectral data to provide new insights.
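The stacking idea, combining unsupervised detector scores through supervised weighting, can be sketched in a few lines. The two stand-in detectors below (a global RX-style Mahalanobis score and a plain spectral-distance score) and the synthetic data are illustrative assumptions; the paper's ensemble also incorporates hyperspectral unmixing.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def rx_score(X):
    """Global RX-style detector: Mahalanobis distance to the background."""
    d = X - X.mean(axis=0)
    cov_inv = np.linalg.pinv(np.cov(X, rowvar=False))
    return np.einsum("ij,jk,ik->i", d, cov_inv, d)

def norm_score(X):
    """Plain spectral distance to the mean spectrum."""
    return np.linalg.norm(X - X.mean(axis=0), axis=1)

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 30))                 # 500 pixels, 30 bands
X[:25] += rng.normal(3, 1, size=(25, 30))      # implanted anomalies
y = np.zeros(500); y[:25] = 1                  # labels for the stacker

# Model stacking: a supervised classifier learns how to weight the detectors.
scores = np.column_stack([rx_score(X), norm_score(X)])
stacker = LogisticRegression().fit(scores, y)
print("ensemble weights:", stacker.coef_.ravel())
```

The learned coefficients play the role of the ensemble's voting weights; the limitation discussed above is that such a supervised stacker can underperform on anomaly types absent from its training data.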
[419] EndoControlMag: Robust Endoscopic Vascular Motion Magnification with Periodic Reference Resetting and Hierarchical Tissue-aware Dual-Mask Control
An Wang, Rulin Zhou, Mengya Xu, Yiru Ye, Longfei Gou, Yiting Chang, Hao Chen, Chwee Ming Lim, Jiankun Wang, Hongliang Ren
Main category: eess.IV
TL;DR: EndoControlMag is a training-free framework that magnifies subtle vascular motions in endoscopic surgery videos using Lagrangian-based motion tracking with mask-conditioned magnification to improve surgical precision and decision-making.
Details
Motivation: Visualizing subtle vascular motions in endoscopic surgery is crucial for surgical precision and decision-making, but remains challenging due to the complex and dynamic nature of surgical scenes with issues like occlusions, instrument disturbance, and tissue deformations.
Method: The method consists of two key modules: (1) Periodic Reference Resetting (PRR) that divides videos into overlapping clips with dynamically updated reference frames to prevent error accumulation, and (2) Hierarchical Tissue-aware Magnification (HTM) with dual-mode mask dilation that tracks vessel cores and applies adaptive softening strategies (motion-based or distance-based) to surrounding tissues.
Result: Evaluation on the EndoVMM24 dataset across four surgery types shows that EndoControlMag significantly outperforms existing methods in magnification accuracy and visual quality while maintaining robustness across challenging surgical conditions including occlusions, instrument disturbance, view changes, and vessel deformations.
Conclusion: EndoControlMag successfully addresses the challenge of vascular motion visualization in endoscopic surgery through its training-free, Lagrangian-based approach with mask-conditioned magnification, demonstrating superior performance in both quantitative metrics and expert surgeon evaluations across diverse surgical scenarios.
Abstract: Visualizing subtle vascular motions in endoscopic surgery is crucial for surgical precision and decision-making, yet remains challenging due to the complex and dynamic nature of surgical scenes. To address this, we introduce EndoControlMag, a training-free, Lagrangian-based framework with mask-conditioned vascular motion magnification tailored to endoscopic environments. Our approach features two key modules: a Periodic Reference Resetting (PRR) scheme that divides videos into short overlapping clips with dynamically updated reference frames to prevent error accumulation while maintaining temporal coherence, and a Hierarchical Tissue-aware Magnification (HTM) framework with dual-mode mask dilation. HTM first tracks vessel cores using a pretrained visual tracking model to maintain accurate localization despite occlusions and view changes. It then applies one of two adaptive softening strategies to surrounding tissues: motion-based softening that modulates magnification strength proportional to observed tissue displacement, or distance-based exponential decay that simulates biomechanical force attenuation. This dual-mode approach accommodates diverse surgical scenarios: motion-based softening excels with complex tissue deformations while distance-based softening provides stability during unreliable optical flow conditions. We evaluate EndoControlMag on our EndoVMM24 dataset spanning four different surgery types and various challenging scenarios, including occlusions, instrument disturbance, view changes, and vessel deformations. Quantitative metrics, visual assessments, and expert surgeon evaluations demonstrate that EndoControlMag significantly outperforms existing methods in both magnification accuracy and visual quality while maintaining robustness across challenging surgical conditions. The code, dataset, and video results are available at https://szupc.github.io/EndoControlMag/.
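A toy NumPy sketch of the two mechanisms named in the title, periodic reference resetting and distance-based softening, is given below. It is a hypothetical illustration, not the released code: simple frame differencing stands in for the Lagrangian tracking, and clip overlap is omitted for brevity.

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def magnify_video(frames, vessel_mask, alpha=5.0, clip_len=10, decay=0.1):
    """Toy magnification with per-clip reference resetting.

    Each clip's first frame becomes the new reference, stopping drift
    errors from accumulating; the amplification weight decays
    exponentially with distance from the vessel core (the
    'distance-based softening' mode).
    """
    dist = distance_transform_edt(~vessel_mask)  # pixel distance to the vessel
    weight = np.exp(-decay * dist)               # 1 on the vessel, ~0 far away

    out = []
    for start in range(0, len(frames), clip_len):
        clip = frames[start:start + clip_len]
        ref = clip[0].astype(np.float32)         # reference reset per clip
        for f in clip:
            delta = f.astype(np.float32) - ref   # stand-in for tracked motion
            out.append(np.clip(ref + (1 + alpha * weight) * delta, 0, 255))
    return np.stack(out).astype(np.uint8)

rng = np.random.default_rng(0)
frames = rng.integers(0, 255, (30, 64, 64), dtype=np.uint8)
mask = np.zeros((64, 64), dtype=bool); mask[30:34, :] = True
print(magnify_video(frames, mask).shape)         # (30, 64, 64)
```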
[420] DeSamba: Decoupled Spectral Adaptive Framework for 3D Multi-Sequence MRI Lesion Classification
Dezhen Wang, Sheng Miao, Rongxin Chai, Jiufa Cui
Main category: eess.IV
TL;DR: DeSamba is a novel framework that combines decoupled representation learning and spectral adaptive fusion to improve 3D lesion classification from multi-sequence MRI data, achieving superior performance on spinal metastasis and spondylitis datasets.
Details
Motivation: Effectively integrating multi-sequence MRI data for robust 3D lesion classification remains challenging, as MRI sequences provide rich spatial and frequency domain information that needs to be properly fused for accurate medical diagnosis.
Method: DeSamba introduces two key components: (1) Decoupled Representation Learning Module (DRLM) that decouples features from different MRI sequences through self-reconstruction and cross-reconstruction, and (2) Spectral Adaptive Modulation Block (SAMB) within SAMNet that enables dynamic fusion of spectral and spatial information based on lesion characteristics.
Result: On spinal metastasis dataset (n=1,448): 62.10% Top-1 accuracy, 63.62% F1-score, 87.71% AUC, and 93.55% Top-3 accuracy on external validation. On spondylitis dataset (n=251): 70.00%/64.52% accuracy and 74.75/73.88 AUC on internal/external validation. Ablation studies show over 10% relative improvement compared to baseline.
Conclusion: DeSamba demonstrates superior performance over state-of-the-art baselines and shows potential as a generalizable and effective solution for 3D lesion classification in multi-sequence medical imaging, with both DRLM and SAMB components contributing significantly to overall performance.
Abstract: Magnetic Resonance Imaging (MRI) sequences provide rich spatial and frequency domain information, which is crucial for accurate lesion classification in medical imaging. However, effectively integrating multi-sequence MRI data for robust 3D lesion classification remains a challenge. In this paper, we propose DeSamba (Decoupled Spectral Adaptive Network and Mamba-Based Model), a novel framework designed to extract decoupled representations and adaptively fuse spatial and spectral features for lesion classification. DeSamba introduces a Decoupled Representation Learning Module (DRLM) that decouples features from different MRI sequences through self-reconstruction and cross-reconstruction, and a Spectral Adaptive Modulation Block (SAMB) within the proposed SAMNet, enabling dynamic fusion of spectral and spatial information based on lesion characteristics. We evaluate DeSamba on two clinically relevant 3D datasets. On a six-class spinal metastasis dataset (n=1,448), DeSamba achieves 62.10% Top-1 accuracy, 63.62% F1-score, 87.71% AUC, and 93.55% Top-3 accuracy on an external validation set (n=372), outperforming all state-of-the-art (SOTA) baselines. On a spondylitis dataset (n=251) involving a challenging binary classification task, DeSamba achieves 70.00%/64.52% accuracy and 74.75/73.88 AUC on internal and external validation sets, respectively. Ablation studies demonstrate that both DRLM and SAMB significantly contribute to overall performance, with over 10% relative improvement compared to the baseline. Our results highlight the potential of DeSamba as a generalizable and effective solution for 3D lesion classification in multi-sequence medical imaging.
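A hypothetical sketch of the decoupling idea in DRLM (not the authors' code): each sequence's features are split into a shared and a sequence-specific part; self-reconstruction rebuilds a sequence from its own parts, while cross-reconstruction swaps the shared parts between sequences, forcing them to be interchangeable. The linear layers, feature sizes, and MSE losses are simplifying assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

enc_shared = nn.Linear(128, 32)                       # sequence-agnostic code
enc_spec = nn.ModuleDict({"t1": nn.Linear(128, 32),   # sequence-specific codes
                          "t2": nn.Linear(128, 32)})
dec = nn.Linear(64, 128)                              # shared decoder

t1, t2 = torch.randn(4, 128), torch.randn(4, 128)     # toy per-lesion features
s1, s2 = enc_shared(t1), enc_shared(t2)
p1, p2 = enc_spec["t1"](t1), enc_spec["t2"](t2)

# Self-reconstruction: each sequence from its own shared + specific codes.
self_rec = (F.mse_loss(dec(torch.cat([s1, p1], -1)), t1)
            + F.mse_loss(dec(torch.cat([s2, p2], -1)), t2))
# Cross-reconstruction: swap the shared codes across sequences.
cross_rec = (F.mse_loss(dec(torch.cat([s2, p1], -1)), t1)
             + F.mse_loss(dec(torch.cat([s1, p2], -1)), t2))
loss = self_rec + cross_rec
print(float(loss))
```

If the cross-reconstruction loss can be driven low, the shared codes carry only sequence-independent information, which is the decoupling the module is after.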